Free language processing service and NLP C# code
Compute Word Frequency

NLP Task

Given a piece of text, we want to find out the frequencies of its distinct words.

Algorithms and Implementation

For this task, we'll use the generic Dictionary<TKey, TValue> type, replacing 'TKey' and 'TValue' with specific types: in our case, 'string' and 'int', respectively. To use this type, add a using System.Collections.Generic directive. The code below uses SortedDictionary<string, int>, which lives in the same namespace and keeps its entries sorted by key.


 Split the text into an array of words.
 
 Create an empty Dictionary<string, int>, in which each key is a word and each value is that word's frequency.

 Look up each word of the array in the dictionary: if it is already there, increase its value by 1;
   otherwise, add it to the dictionary as a new entry with a value of 1.

 Use another generic type KeyValuePair<string, int> to dump out all of the entries of the dictionary. 
   Each KeyValuePair<string, int> represents an entry of our dictionary, the key being a word and the
   value being its frequency.
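
The steps above can be sketched as a small console program, independent of the Windows Forms code that follows. This is only an illustrative sketch: the class name WordCounter, the method name Count, and the space-only split are assumptions made for the example, not part of the original sample. It uses TryGetValue so that each word needs a single lookup instead of a ContainsKey call followed by an indexer access.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original sample.
public static class WordCounter
{
    // Counts how often each distinct word occurs in the given array.
    public static Dictionary<string, int> Count( string[] words )
    {
        Dictionary<string, int> counts = new Dictionary<string, int>( );

        foreach ( string word in words )
        {
            int current;
            // TryGetValue sets 'current' to 0 when the word is not yet present.
            counts.TryGetValue( word, out current );
            counts [word] = current + 1;
        }

        return counts;
    }

    public static void Main( )
    {
        // Simplified tokenizer for this sketch: split on spaces only.
        string[] words = "the cat and the hat".Split(
            new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries );

        // Each KeyValuePair is one dictionary entry: word and frequency.
        foreach ( KeyValuePair<string, int> entry in Count( words ) )
        {
            Console.WriteLine( "{0} [{1}]", entry.Key, entry.Value );
        }
    }
}
```

Because Dictionary<string, int> does not promise any particular enumeration order, the order of the printed lines should not be relied on; only the counts are meaningful.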

Code


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace WordFrequency
{
    public partial class Form1 : Form
    {
         // Digits are listed as delimiters too, so numbers are discarded.
        private static char[] delimiters_no_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<', '-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

        public Form1( )
        {
            InitializeComponent( );
        }

        private void btnGo_Click( object sender, EventArgs e )
        {
            if ( textBoxInput.Text != string.Empty )
            {
                string[] words = Tokenize( textBoxInput.Text );
                if ( words.Length > 0 )
                {
                    SortedDictionary<string, int> dict
                        = new SortedDictionary<string, int>( );

                    foreach ( string word in words )
                    {
                        if ( dict.ContainsKey( word ) )
                        {
                            dict [word]++;
                        }
                        else
                        {
                            dict.Add( word, 1 );
                        }
                    }

                     // Dump out the dict entries to the output box.
                     // For efficiency, build the text in a StringBuilder and set
                     // its initial capacity to the number of entries multiplied
                     // by the average length of each entry plus 4 for [number].
                     // For more details, see the .NET Framework SDK
                     // documentation for StringBuilder.
                    StringBuilder resultSb = new StringBuilder( dict.Count * 9 );
                    foreach ( KeyValuePair<string, int> entry in dict )
                    {
                        resultSb.AppendLine( string.Format( "{0} [{1}]" , entry.Key, entry.Value ) );
                    }

                    textBoxOutput.Text = resultSb.ToString( );
                }
            }
        }

        /// <summary>
        ///  Tokenizes a text into an array of words, using the improved
        ///  tokenizer with the discard-digit option.
        /// </summary>
        /// <param name="text"> the text to tokenize</param>
        public static string[] Tokenize( string text )
        {
            string[] tokens = text.Split( delimiters_no_digits,
                                    StringSplitOptions.RemoveEmptyEntries );

            for ( int i = 0; i < tokens.Length; i++ )
            {
                string token = tokens [i];

                 // Change token only when it starts and/or ends with "'" and  
                 // it has at least 2 characters. 

                if ( token.Length > 1 )
                {
                    if ( token.StartsWith( "'" ) && token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 1, token.Length - 2 ); // remove the starting and ending "'" 

                    else if ( token.StartsWith( "'" ) )
                        tokens [i] = token.Substring( 1 ); // remove the starting "'" 

                    else if ( token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 0, token.Length - 1 ); // remove the last "'" 
                }
            }

            return tokens;
        }
    }
}

Figures

Fig. Word Frequency

Note

- Try this program with very large texts.

- The algorithm can be improved to treat a word and its inflected forms as the same word. For example, work, works, working, and worked would then be counted as four occurrences of work. A stemmer is needed for this improvement.

- Replace SortedDictionary<string, int> with Dictionary<string, int> and see how the order of the output changes: the entries are no longer sorted by word.

- Think about how to list words not by alphabetic order but by their frequency, so that we can get something like:


        the  [2975]
        and  [2073] 
        with [450]
        ...
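
One possible way to get such a listing, offered as a sketch rather than the intended solution: copy the dictionary entries into a List<KeyValuePair<string, int>> and sort that list by descending value. The class name FrequencyOrder and the method name ByFrequency are made up for this example, and the tie-break on ordinal alphabetic order is an extra assumption so that the output is deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original sample.
public static class FrequencyOrder
{
    // Copies the entries into a list and sorts it by descending count;
    // words with equal counts fall back to ordinal alphabetic order.
    public static List<KeyValuePair<string, int>> ByFrequency(
        IDictionary<string, int> dict )
    {
        List<KeyValuePair<string, int>> entries
            = new List<KeyValuePair<string, int>>( dict );

        entries.Sort( delegate( KeyValuePair<string, int> a,
                                KeyValuePair<string, int> b )
        {
            int byCount = b.Value.CompareTo( a.Value ); // higher count first
            return byCount != 0
                ? byCount
                : string.CompareOrdinal( a.Key, b.Key );
        } );

        return entries;
    }

    public static void Main( )
    {
        // Sample counts taken from the listing in the note above.
        Dictionary<string, int> dict = new Dictionary<string, int>( );
        dict.Add( "with", 450 );
        dict.Add( "the", 2975 );
        dict.Add( "and", 2073 );

        foreach ( KeyValuePair<string, int> entry in ByFrequency( dict ) )
        {
            Console.WriteLine( "{0} [{1}]", entry.Key, entry.Value );
        }
    }
}
```

With the three sample entries, this prints the [2975], and [2073], with [450], in that order. Since the method accepts IDictionary<string, int>, it works with both Dictionary and SortedDictionary; the dictionary itself is left unchanged.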
   
