Free language processing service and NLP C# code
Improve Greedy Tokenizer

NLP Task

We want to improve our greedy tokenizer by removing the apostrophe (') from the separator group; apostrophes will instead be handled case by case. We also want to let the user decide how to deal with digits.

Algorithms and Implementation

- Modify the Greedy Tokenizer user interface by adding a group box containing two radio buttons for the user to decide how to deal with digits.

- Create two separator character arrays. One includes the digit characters and is used when the user chooses to discard numbers; the other omits them and is used when the user chooses to keep numbers. Neither array contains the apostrophe (') character.

- Split the text into an array of tokens, choosing the separator array according to the checked state of the radio buttons.

- Examine each resulting token individually and strip any leading or trailing apostrophe. See the code comments for details.

Code


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace ImprovedTokenizer
{
    public partial class Form1 : Form
    {
        // Use this to keep digits.
        private static char[] delimiters_keep_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', '\r', '\n', '\t' };

        // Use this to discard digits.
        private static char[] delimiters_no_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', '\r', '\n', '\t',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

        public Form1( )
        {
            InitializeComponent( );
        }

        private void btnGo_Click( object sender, EventArgs e )
        {
            textBoxOutput.Clear( );

            string text = textBoxInput.Text;

            // Call the improved tokenizer.
            string[] tokens = Tokenize( text, radioKeepDigits.Checked );

            foreach ( string token in tokens )
            {
                textBoxOutput.AppendText( token + "\r\n" );
            }
            
        }

        /// <summary>
        ///  Tokenizes a text into an array of words, using whitespace and
        ///  the listed punctuation marks (but not the apostrophe "'") as
        ///  delimiters. Digits are handled based on user choice.
        /// </summary>
        /// <param name="text">the text to tokenize</param>
        /// <param name="keepDigits">true to keep digits; false to discard digits.</param>
        /// <returns>an array of the resulting tokens</returns>
        public static string[] Tokenize( string text, bool keepDigits )
        {
            string[] tokens = null;

            if ( keepDigits )
                tokens = text.Split( delimiters_keep_digits, StringSplitOptions.RemoveEmptyEntries );
            else
                tokens = text.Split( delimiters_no_digits, StringSplitOptions.RemoveEmptyEntries );

            for ( int i = 0; i < tokens.Length; i++ )
            {
                string token = tokens [i];

                // Change the token only when it has at least 2 characters
                // and starts and/or ends with "'".
                if ( token.Length > 1 )
                {
                    if ( token.StartsWith( "'" ) && token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 1, token.Length - 2 ); // remove the starting and ending "'" 

                    else if ( token.StartsWith( "'" ) )
                        tokens [i] = token.Substring( 1 ); // remove the starting "'" 

                    else if ( token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 0, token.Length - 1 ); // remove the last "'" 
                }
            }

            return tokens;
        }
    }
}

Figures

Fig. Improved Tokenizer

Note

Similar case-by-case treatments can be applied to other punctuation marks. For example, '.' is not always a sentence-end marker; a ',' between two digits should not be discarded; '$' should be kept if it is immediately followed by a digit; and so on.
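
As a rough illustration of the idea in this note, the sketch below decides character by character whether a punctuation mark belongs to a token. It is not part of the tokenizer above: the class name ContextTokenizer and the helper IsTokenChar are hypothetical, and the rules are assumptions chosen for the example (',' and '.' are kept only between two digits, '$' only when a digit follows). Every other non-alphanumeric character, including the apostrophe, is simply treated as a separator here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class ContextTokenizer
{
    // Returns true when the character at index i should stay inside a token.
    private static bool IsTokenChar( string text, int i )
    {
        char c = text[i];
        if ( char.IsLetterOrDigit( c ) )
            return true;

        bool digitBefore = i > 0 && char.IsDigit( text[i - 1] );
        bool digitAfter = i + 1 < text.Length && char.IsDigit( text[i + 1] );

        // Assumed rule: ',' and '.' survive only between two digits ("1,250.75").
        if ( c == ',' || c == '.' )
            return digitBefore && digitAfter;

        // Assumed rule: '$' survives only when a digit follows immediately ("$5").
        if ( c == '$' )
            return digitAfter;

        // Everything else acts as a separator in this sketch.
        return false;
    }

    public static string[] Tokenize( string text )
    {
        List<string> tokens = new List<string>( );
        StringBuilder current = new StringBuilder( );

        for ( int i = 0; i < text.Length; i++ )
        {
            if ( IsTokenChar( text, i ) )
            {
                current.Append( text[i] );
            }
            else if ( current.Length > 0 )
            {
                // A separator ends the token collected so far.
                tokens.Add( current.ToString( ) );
                current.Length = 0;
            }
        }

        if ( current.Length > 0 )
            tokens.Add( current.ToString( ) );

        return tokens.ToArray( );
    }
}
```

With these rules, ContextTokenizer.Tokenize( "It cost $1,250.75, not $5." ) yields the tokens It, cost, $1,250.75, not and $5. The comma after 75 and the final period are dropped because no digit follows them. Scanning characters with one-character lookahead and lookbehind is only one way to do this; the same rules could be folded into the Split-based tokenizer as a post-processing pass.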

