Free language processing service and NLP C# code
Tokenize Text Ruthlessly

NLP Task

Given a piece of text, we want to split the text at all spaces (including new line characters and carriage returns) and punctuation marks.

Algorithms and Implementation

The string class has a 'Split' method to split a text into an array of tokens. This method has been overloaded several times. We'll use the following version:

string[] Split(char[] separator, StringSplitOptions options)

In the .NET Framework, StringSplitOptions has two values: None and RemoveEmptyEntries. Splitting a string can leave empty substrings, for example where two separators sit next to each other. If you don't want them, choose the RemoveEmptyEntries option.
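To make the difference between the two options concrete, here is a minimal console sketch. The class name SplitOptionsDemo, the helper CountTokens and the sample string are illustrative assumptions and are not part of the original project.

```csharp
using System;

public static class SplitOptionsDemo
{
    // Hypothetical helper: splits on space and comma and returns the token count.
    public static int CountTokens(string text, StringSplitOptions options)
    {
        char[] separators = { ' ', ',' };
        return text.Split(separators, options).Length;
    }

    public static void Main()
    {
        string text = "one, two";

        // ", " is two adjacent separators, so None keeps the empty string between them.
        Console.WriteLine(CountTokens(text, StringSplitOptions.None));               // 3: "one", "", "two"
        Console.WriteLine(CountTokens(text, StringSplitOptions.RemoveEmptyEntries)); // 2: "one", "two"
    }
}
```

Since ordinary text is full of adjacent separators such as a comma followed by a space, RemoveEmptyEntries is usually the option you want for tokenizing.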

You also need to create an array of characters to serve as the separators. In our case, they are all kinds of whitespace and punctuation marks.

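Now let's create a Windows Forms Application project and drag two TextBoxes, two Labels and one Button onto the designer. Refer to the figure below for their texts. The code assumes the text boxes are named textBoxInput and textBoxOutput and the button is named btnGo.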

Code


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace GreedyTokenizer
{
    public partial class Form1 : Form
    {
        public Form1( )
        {
            InitializeComponent( );
        }

        private void btnGo_Click( object sender, EventArgs e )
        {
            string text = textBoxInput.Text;
            string[] tokens = GreedyTokenize( text );

            foreach ( string token in tokens )
            {
                textBoxOutput.AppendText( token + "\r\n" );
            }
            
        }

        /// <summary>
        ///  Tokenizes text into an array of words, using whitespace and
        ///  all punctuation as delimiters.
        /// </summary>
        /// <param name="text"> the text to tokenize</param>
        /// <returns> an array of resulted tokens</returns>
        public static string[] GreedyTokenize( string text )
        {
            // Whitespace and punctuation characters used as token separators.
            char[] delimiters = new char[] {
                  '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
                  '|', '\\', ':', ';', '"', '\'', ',', '.', '/', '?', '~', '!',
                  '@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t'
                  };

            return text.Split( delimiters, StringSplitOptions.RemoveEmptyEntries );
        }
    }
}

Figures

Fig. Greedy Tokenizer

Note

The usefulness of this tokenizer is limited because it simply throws away all spaces and punctuation marks. For example, it splits contractions into non-words, such as shouldn't into shouldn and t, which is unacceptable in many situations. The Improve Greedy Tokenizer code shows how to improve it, but before that you may find the Remove Non-alphanumeric-Characters code useful.
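The contraction problem is easy to reproduce outside the form. The following is a small console sketch, not the original project code: the class name ContractionDemo and the method Tokenize are made up for illustration, and the delimiter list is an abbreviated copy that is just large enough for the sample sentence.

```csharp
using System;

public static class ContractionDemo
{
    // Abbreviated delimiter set (assumption: enough for this example only).
    private static readonly char[] Delimiters =
        { ' ', '\'', ',', '.', '?', '!', '\r', '\n', '\t' };

    // Same approach as GreedyTokenize: split on every delimiter, drop empty entries.
    public static string[] Tokenize(string text)
    {
        return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        foreach (string token in Tokenize("You shouldn't go."))
        {
            Console.WriteLine(token);
        }
        // Prints four tokens, one per line: You, shouldn, t, go
    }
}
```

Because the apostrophe is treated like any other delimiter, the tokenizer cannot tell a contraction from a quoted word, which is the limitation described above.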

