Free language processing service and NLP C# code
Build NLP Class Library

NLP Task

If you have been following the sample code projects, you may have noticed that we keep rewriting the same methods, such as Tokenize Text, Tokenize File, and Get Word Frequency. It would be better to write each method once and reuse it wherever we need it. You may also have noticed that we can already reuse methods like Substring and Replace of the String class. That works because Microsoft ships them in a class library. To make our own methods reusable in the same way, we need to create our own class library. For our NLP purposes, the library will contain two kinds of classes: instance classes and static classes. This page shows how to create the class library and its first instance class, Tokenizer.
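To make the distinction between instance classes and static classes concrete, here is a minimal sketch. The names WordCounter, TextUtil and Demo are made up for illustration and are not part of the library built on this page.

```csharp
using System;

// Instance class: create an object first, then call methods on that object.
// (WordCounter is a hypothetical name used only for this illustration.)
public class WordCounter
{
    private readonly char[] separators = new char[] { ' ', '\t', '\r', '\n' };

    public int Count(string text)
    {
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}

// Static class: no object is created; methods are called on the class name.
// (TextUtil is also hypothetical.)
public static class TextUtil
{
    public static string Normalize(string text)
    {
        return text.Trim().ToLowerInvariant();
    }
}

public static class Demo
{
    public static void Main()
    {
        WordCounter counter = new WordCounter();
        Console.WriteLine(counter.Count("to be or not to be"));   // prints 6
        Console.WriteLine(TextUtil.Normalize("  Hello World "));  // prints hello world
    }
}
```

The Tokenizer on this page follows the first pattern: callers write new Tokenizer() and then call Tokenize on the resulting object.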

Algorithms and Implementation

  1. From the New Project dialog, select the Class Library template, set its name to MyLibrary, and from the dropdown menu in the upper-right corner, select .NET Framework 3.5.
  2. Click OK
  3. From the Solution Window, rename the initial class from Class1.cs to Tokenizer.cs
  4. Click Yes in the popup renaming dialog
  5. Replace the Tokenizer class code with the code shown in this page.
  6. Build -> Build MyLibrary

Code


using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace MyLibrary
{
    public class Tokenizer
    {
        // Delimiters that keep digits inside tokens.
        private char[] delimiters_keep_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', '\r', '\n', '\t' };

        // Delimiters that also split on digits, so digits are discarded.
        private char[] delimiters_no_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', '\r', '\n', '\t',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

        /// <summary>
        ///  Tokenizes text into an array of words, using whitespace and
        ///  the punctuation characters listed below as delimiters.
        /// </summary>
        /// <param name="text"> the text to tokenize</param>
        /// <returns> an array of the resulting tokens</returns>
        public string[] GreedyTokenize( string text )
        {
            // Note: the apostrophe and the double quote are not in this list.
            char[] delimiters = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', '\r', '\n', '\t' };

            return text.Split( delimiters, StringSplitOptions.RemoveEmptyEntries );
        }
        
        /// <summary>
        ///  Tokenizes a text into an array of words, using whitespace and
        ///  all punctuation except the apostrophe "'" as delimiters. Digits
        ///  are handled based on user choice.
        /// </summary>
        /// <param name="text"> the text to tokenize</param>
        /// <param name="keepDigits"> true to keep digits; false to discard digits.</param>
        /// <returns> an array of the resulting tokens</returns>
        public string[] Tokenize( string text, bool keepDigits )
        {
            string[] tokens = null;

            if ( keepDigits )
                tokens = text.Split( delimiters_keep_digits, StringSplitOptions.RemoveEmptyEntries );
            else
                tokens = text.Split( delimiters_no_digits, StringSplitOptions.RemoveEmptyEntries );

            for ( int i = 0; i < tokens.Length; i++ )
            {
                string token = tokens [i];

                // Change the token only when it starts and/or ends with "'" and
                // has at least 2 characters.
                if ( token.Length > 1 )
                {
                    if ( token.StartsWith( "'" ) && token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 1, token.Length - 2 ); // remove the starting and ending "'" 

                    else if ( token.StartsWith( "'" ) )
                        tokens [i] = token.Substring( 1 ); // remove the starting "'" 

                    else if ( token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 0, token.Length - 1 ); // remove the last "'" 
                }
            }

            return tokens;
        }

        /// <summary>
        ///  Reads a text file and tokenizes its content into an array of words,
        ///  using whitespace and all punctuation except the apostrophe "'" as
        ///  delimiters. Digits are handled based on user choice.
        /// </summary>
        /// <param name="filePath"> the path of the file to tokenize</param>
        /// <param name="keepDigits"> true to keep digits; false to discard digits.</param>
        /// <returns> an array of the resulting tokens, or null if the path is empty, the file does not exist, or reading fails</returns>
        public string[] TokenizeFile( string filePath, bool keepDigits )
        {
            if ( string.IsNullOrEmpty( filePath ) )
                return null;

            if( !( new FileInfo ( filePath ) ).Exists )
                return null;

            string[] tokens = null;

            try
            {
                string text = File.ReadAllText( filePath );
                tokens = Tokenize( text, keepDigits );
            }
            catch
            {
                // Reading failed; tokens stays null and is returned to the caller.
            }

            return tokens;
        }
    }
}
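The apostrophe handling in Tokenize is easier to follow on a concrete sentence. The sketch below copies the same trimming rule into a hypothetical helper, TrimApostrophes, and uses a shortened delimiter list so that it runs on its own without referencing MyLibrary. It illustrates the rule and is not a replacement for the library code.

```csharp
using System;

public static class ApostropheDemo
{
    // Shortened delimiter list for illustration only; the library uses a longer one.
    private static readonly char[] Delimiters =
        new char[] { ' ', ',', '.', '!', '?', ';', ':' };

    // Same rule as in Tokenize: strip one leading and/or one trailing
    // apostrophe, but only from tokens longer than one character.
    public static string TrimApostrophes(string token)
    {
        if (token.Length > 1)
        {
            if (token.StartsWith("'") && token.EndsWith("'"))
                return token.Substring(1, token.Length - 2);
            if (token.StartsWith("'"))
                return token.Substring(1);
            if (token.EndsWith("'"))
                return token.Substring(0, token.Length - 1);
        }
        return token;
    }

    public static void Main()
    {
        string text = "She said 'hello', but I don't know the dogs' names.";
        string[] tokens = text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < tokens.Length; i++)
            tokens[i] = TrimApostrophes(tokens[i]);

        // prints She|said|hello|but|I|don't|know|the|dogs|names
        Console.WriteLine(string.Join("|", tokens));
    }
}
```

The apostrophe inside don't survives because only the first and last characters are checked. One edge case to be aware of: a token made of exactly two apostrophes is trimmed to an empty string, so callers that need non-empty tokens may want to filter the result.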

Note

As the code shows, we simply copied the methods from several earlier sample code projects. A later sample code project will show how to use this library.
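One point matters to any caller in the meantime: TokenizeFile returns null rather than throwing when the path is empty, the file is missing, or reading fails, so the result has to be checked before use. The sketch below shows that pattern with a hypothetical stand-in, ReadTokensOrNull, which follows the same contract but splits on whitespace only. Unlike the library method, it catches only IOException, so other exceptions still propagate.

```csharp
using System;
using System.IO;

public static class FileTokenDemo
{
    // Hypothetical stand-in with the same contract as TokenizeFile:
    // returns null when the path is empty, the file is missing, or reading fails.
    public static string[] ReadTokensOrNull(string filePath)
    {
        if (string.IsNullOrEmpty(filePath) || !File.Exists(filePath))
            return null;

        try
        {
            string text = File.ReadAllText(filePath);
            return text.Split(new char[] { ' ', '\r', '\n', '\t' },
                              StringSplitOptions.RemoveEmptyEntries);
        }
        catch (IOException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // Create a temporary file so the example runs anywhere.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "one two\nthree");

        string[] tokens = ReadTokensOrNull(path);
        Console.WriteLine(tokens == null ? "failed" : tokens.Length.ToString());   // prints 3

        File.Delete(path);

        // The file no longer exists, so the call reports failure through null.
        string[] missing = ReadTokensOrNull(path);
        Console.WriteLine(missing == null ? "failed" : missing.Length.ToString()); // prints failed
    }
}
```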

