Free language processing service and NLP C# code
Compute Type-Token-Ratio

NLP Task

Let me first define type and token as the terms are used in corpus linguistics. A token is any occurrence of a word in a piece of text, so a 100-word text has 100 tokens. A type is a distinct word. If a 100-word text repeats the same word 100 times, we say that it has only one type; if all of the 100 words are different from each other, we say that it has 100 types. Type-token-ratio is obtained by dividing the type count by the token count, so for any non-empty text it is greater than 0 and always <= 1.
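
As a quick illustration of these definitions, here is a minimal console sketch. It is separate from the Windows project described below; the class name TtrExample and the sample sentence are made up for illustration, and the sentence is split on spaces only rather than with the improved tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TtrExample
{
    // Returns types / tokens, or NaN when there are no tokens.
    public static double Ratio( string[] tokens )
    {
        if ( tokens.Length == 0 )
            return double.NaN;

        // The HashSet<string> constructor keeps one copy of each distinct word.
        HashSet<string> types = new HashSet<string>( tokens );
        return ( double )types.Count / tokens.Length;
    }

    public static void Main( )
    {
        string[] tokens = "the cat sat on the mat".Split( ' ' );
        double ttr = Ratio( tokens );

        // 6 tokens, 5 types ("the" occurs twice), so the ratio is 5/6.
        Console.WriteLine( string.Format( CultureInfo.InvariantCulture,
            "tokens = {0}, ratio = {1:0.###}", tokens.Length, ttr ) );
    }
}
```

Run as a console program, this should print "tokens = 6, ratio = 0.833".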

Our task is to compute the type-token-ratio of any text.

Algorithms and Implementation

To compute the type-token-ratio of a piece of text, we'll use another generic type: HashSet<T>, in our case, HashSet<string>. The algorithm is as follows:


  - Create a Windows Forms project with a button, three labels and three text boxes.

  - Locate OpenFileDialog in the Toolbox and double click it; you will
      find openFileDialog1 in the component tray of the designer.

  - Double click the button to create its Click event handler, which prompts the user to select a text file.

  - Double click openFileDialog1 to create its FileOk event handler, which uses the following algorithm:

    + Use our improved tokenizer to tokenize the text into an array of words, the length of which is the number of tokens;
    + Declare a HashSet<string> variable without assigning any value to it. A local variable declared this way is unassigned (it is not null), and it cannot be read until a value has been assigned;
    + Declare a double variable, again without assigning any value to it (it is unassigned, not 0);

    + Pass the token array and these two variables to the GetTypeTokenRatio method, which takes the two variables as out parameters and assigns them as follows:
       - It creates the HashSet<string> from the tokens parameter;
       - It computes the type-token-ratio by dividing the count of the HashSet<string> by the length of the token array.
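
The out-parameter pattern in the last step can be sketched in isolation. The names below (OutDemo, CountTypes) are hypothetical and not part of the project; the point is only that the caller's locals start out unassigned and the called method must assign every out parameter before it returns.

```csharp
using System;
using System.Collections.Generic;

public static class OutDemo
{
    // The compiler requires both out parameters to be assigned on every path.
    public static void CountTypes( string[] tokens, out HashSet<string> types, out double ratio )
    {
        types = new HashSet<string>( tokens );
        ratio = tokens.Length == 0 ? double.NaN : ( double )types.Count / tokens.Length;
    }

    public static void Main( )
    {
        HashSet<string> types;   // unassigned: reading it here would not compile
        double ratio;            // unassigned as well

        CountTypes( new[] { "a", "b", "a", "c" }, out types, out ratio );

        // Both locals are definitely assigned after the call.
        Console.WriteLine( types.Count + " types, ratio " + ratio );
    }
}
```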

Code


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Windows.Forms;
using System.IO;

namespace TypeTokenRatio
{
    public partial class Form1 : Form
    {
        public Form1( )
        {
            InitializeComponent( );
        }

         // When the go button is clicked, show the Open file dialog. 
        private void btnGo_Click( object sender, EventArgs e )
        {
            MessageBox.Show( "Open a text file." );
            openFileDialog1.ShowDialog( );
        }

        private void openFileDialog1_FileOk( object sender, CancelEventArgs e )
        {
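             // Limit to files with the .txt extension only (case-insensitive, so .TXT is accepted too). 
            string file = openFileDialog1.FileName;
            if ( !string.Equals( Path.GetExtension( file ), ".txt", StringComparison.OrdinalIgnoreCase ) )
            {
                MessageBox.Show( "Select text file only!" );
                return;
            }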

             // The length of this array is the number of tokens of the file. 
            string[] tokens = null;

             // This deals with the potential exception caused by files that have a .txt extension 
             // but are not actually text files, or that cannot be read. 
            try
            {
                string text = File.ReadAllText( file );
                if ( text == string.Empty )
                {
                    MessageBox.Show( "This is an empty file. No type token ratio can be computed." );
                    return;
                }

                tokens = Tokenize( text );
            }
            catch ( Exception ex )
            {
                MessageBox.Show( ex.Message );

                 // Stop here: tokens is still null, so the code below would throw. 
                return;
            }

            textBoxFile.Text = Path.GetFileName( file );

             // HashSet has the feature that only unique items are accepted, which is exactly what we want for types. 
            HashSet<string> types;
            double typeTokenRatio;

             // The out keyword indicates that the parameter will be assigned by the called method. 
             // In our case, types and typeTokenRatio are given their values by the time 
             // the GetTypeTokenRatio method returns. 
            GetTypeTokenRatio( tokens, out types, out typeTokenRatio );

             // Fill in the text boxes 
            textBoxTokens.Text = tokens.Length.ToString( );
            textBoxTypes.Text = types.Count.ToString( );

             // How to read the format "{0:0.###}":  
             // the first 0 is a placeholder for the actual value (in our case,  
             //     the type token ratio number with a lot of digits);  
             // the second 0 tells the system to keep the 0 before the decimal point; 
             // ### means keep at most three digits after the decimal point 
             //     (trailing zeros are dropped). 
            textBoxTTR.Text = string.Format( "{0:0.###}" , typeTokenRatio );
        }

        private static void GetTypeTokenRatio( string[] tokens, out HashSet<string> types, out double typeTokenRatio )
        {
             // Collect the array of words into a HashSet of strings.  
            types = new HashSet<string>( );

             // HashSet ignores duplicate elements, which ensures that repeated words are counted only once. 
            foreach ( string token in tokens )
            {
                types.Add( token );
            }

             // A sanity check: if the types set is empty, set typeTokenRatio = double.NaN, i.e. Not a Number.  
             // The floating-point division below would not throw for 0 / 0 (it also yields NaN), 
             // but the explicit check makes the intent clear. 

            if ( types.Count == 0 )
            {
                typeTokenRatio = double.NaN;
            }
            else
            {
                 // Be very aware that you need to cast either types.Count or tokens.Length to  
                 // double; otherwise integer division is performed and the result is 0 
                 // (or 1 when every token is a different word). 
                typeTokenRatio = ( double )types.Count / tokens.Length;
            }
        }

        /// <summary>
        ///  Tokenizes a text into an array of words, using the improved
        ///  tokenizer with the discard-digit option.
        /// </summary>
        /// <param name="text"> the text to tokenize</param>
        private static string[] Tokenize( string text )
        {
              // This will discard digits 
            char[] delimiters_no_digits = new char[] {
            '{', '}', '(', ')', '[', ']', '>', '<','-', '_', '=', '+',
            '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
            '@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

            string[] tokens = text.Split( delimiters_no_digits,
                                    StringSplitOptions.RemoveEmptyEntries );

            for ( int i = 0; i < tokens.Length; i++ )
            {
                string token = tokens [i];

                 // Change token only when it starts and/or ends with "'" and  
                 // it has at least 2 characters. 

                if ( token.Length > 1 )
                {
                    if ( token.StartsWith( "'" ) && token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 1, token.Length - 2 ); // remove the starting and ending "'" 

                    else if ( token.StartsWith( "'" ) )
                        tokens [i] = token.Substring( 1 ); // remove the starting "'" 

                    else if ( token.EndsWith( "'" ) )
                        tokens [i] = token.Substring( 0, token.Length - 1 ); // remove the last "'" 
                }
            }

            return tokens;
        }
    }
}
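
One thing to be aware of: HashSet<string> compares strings case-sensitively by default, so "The" and "the" count as two different types in the code above. Whether that is desirable depends on what you are studying. If you want to ignore case, one option is to pass a StringComparer to the HashSet constructor. The following is only a sketch and not part of the project; CaseFoldDemo and its ignoreCase parameter are made-up names.

```csharp
using System;
using System.Collections.Generic;

public static class CaseFoldDemo
{
    public static int CountTypes( string[] tokens, bool ignoreCase )
    {
        // StringComparer.OrdinalIgnoreCase treats "The" and "the" as equal.
        IEqualityComparer<string> comparer = ignoreCase
            ? StringComparer.OrdinalIgnoreCase
            : StringComparer.Ordinal;

        return new HashSet<string>( tokens, comparer ).Count;
    }

    public static void Main( )
    {
        string[] tokens = { "The", "dog", "saw", "the", "other", "dog" };
        Console.WriteLine( CountTypes( tokens, false ) ); // 5: "The" and "the" differ
        Console.WriteLine( CountTypes( tokens, true ) );  // 4: case is ignored
    }
}
```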

Figures

Fig. Compute Type Token Ratio

Note

- Some people believe that type-token-ratio is an indicator of a person's vocabulary size.

- Type-token-ratio is also taken as an important indicator of an author's style.

- Let's try this out. Go to http://www.gutenberg.org/browse/scores/top to download several novels and compute their type-token-ratios. It is said that Ernest Hemingway favors a simple writing style while Jack London is inclined to use more descriptive words in his novels. Try several novels by these two novelists and see what the results are.
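
When you compare novels, keep in mind that a longer text tends to repeat more words, so its type-token-ratio is usually lower simply because of its length. One possible workaround, sketched here under the assumption that you already have each novel's tokens as an array, is to compute the ratio over the same number of tokens from each text. TruncatedTtr, maxTokens and the sample arrays are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class TruncatedTtr
{
    // Type-token-ratio of the first maxTokens tokens (or of all tokens if there are fewer).
    public static double Ratio( string[] tokens, int maxTokens )
    {
        int n = Math.Min( tokens.Length, maxTokens );
        if ( n <= 0 )
            return double.NaN;

        HashSet<string> types = new HashSet<string>( );
        for ( int i = 0; i < n; i++ )
            types.Add( tokens [i] );

        return ( double )types.Count / n;
    }

    public static void Main( )
    {
        string[] shortText = { "a", "b", "c", "d" };
        string[] longText = { "a", "b", "c", "d", "a", "b", "a", "b" };

        // Whole texts: 4/4 versus 4/8. First four tokens only: 4/4 for both.
        Console.WriteLine( Ratio( shortText, 4 ) == Ratio( longText, 4 ) ); // True
    }
}
```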

