Free language processing service and NLP C# code
Average Type Token Ratio

NLP Task

As any language's vocabulary is finite but the length of a text can be infinitely long since we can, in theory, at least, append texts to texts an infinite number of times, type token ratio is affected by a text's length. In general, the longer a text, the smaller its type token ratio. We want to overcome this problem by computing the average type token ratio of a text independent of its length.

Algorithms and Implementation

The central algorithm of getting the average type token ratio is based on that of Dr. Michael A. Covington, a senior research scientist and my MS advisor, and Joe D. McFall, my alumnus, of Institute for Artificial Intelligence of The University of Georgia. Here is the abstract of their publication: Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR) Journal of Quantitative Linguistics

To get the average type token ratio, we split the text into segments from the beginning to the end. Each segment, called a window, has the same number of tokens. If windowi and windowj are adjacent and i precedes j and the window size is n, then tokens from token2 to tokenn of windowi are the same as tokens from token1 to tokenn-1 of windowj. For example, for the text He said that that he likes, if we set the window size to 4, then we have the following windows:

          window                     type token ratio

      (1) He said that that              0.75
      (2)    said that that he           0.75
      (3)         that that he likes     0.75

The average type token ratio is the sum of those of (1), (2), and (3) divided by 3, which is 0.75. Programmatically, how do we compute the average type token ratio? The naive algorithm is:

  (1) Segment the text into n windows where n = tokens.Length - (window size) + 1  
  (2) Compute the type token ratio of each window  
  (3) Get the average of the type token ratios of these windows

This is not efficient, as any two adjacent windows can in fact have at most one token different, so it doesn't make sense to pre-cut the text into windows and keep all of them in memory. A better way is to always keep only one window of tokens and only one token-frequency table in memory. As we move along the tokens from the left to the right, we exclude the first token from the window and add the next available toke to the end of the window and re-compute the type token ratio of that slightly changed window of tokens. The type token ratio of any window of tokens is the number of the entries of its frequency table divided by the window size. Programmatically, the following steps are performed:

  (1) Put the first window size of tokens into a HashSet<string>
      and compute its type token ratio ttr
  (2) If window size >= tokens.Length, stop as this ttr is the average type
      token ratio; otherwise:
  (3) Record this ttr in a variable finalTTR, which will be changing  
  (4) Create a string-int token-frequency table for this window of tokens  
  (5) Set window counter = 1  
  (6) While new tokens are available:

        a. Increment window counter by 1        
        b. Remove the first token from the window
        c. Check the frequency of the removed token in the frequency table:        
           - if it is 1, it means that it appears in the moving window only once,
             so remove it from the frequency table             
           - otherwise, it means that it appears more than once, so we cannot delete it
             but we can reduce its frequency by 1
        d. Add the next available toke to the end of the moving window and change the frequency table:        
           - if it is in the moving frequency table, increase its frequency value by 1
           - otherwise, add it as a new entry and set its frequency to 1
        e. Re-compute the type token ratio of the slightly changed window, which is the number of the
           entries of its frequency table divided by window size
        f. Add the new ttr to finalTTR

   (7) Divide finalTTR by window counter to get the average final type token ratio

The methods for tokenizing strings and getting the general type token ratio are the same as those of Compute Type Token Ratio. I listed them here again for your convenience.

To compute the average type toke ratio, I used another generic type LinkedList<string>, which is more efficient than List<string> for our purpose, as we will remove the first token of the window and add a new token to it n times where n = (number of tokens) - (window size) + 1. For Lists, when you remove or add elements, the internal arrays representing the Lists need to be reshuffled as their elements occupy contiguous memory addresses, which is computationally expensive if the elements of the Lists need to be frequently added, deleted, moved around, or changed. LinkedLists are implemented in a different way to get around this problem.

This application used a new UI control called NumericUpDown for user to set the window size. When user clicks on it, its value increases or decreases by a preset value. User can also manually enter a value that is between the preset minimum and maximum values. We set the minimum value to 100 tokens, the maximum values to the number of the tokens of the text, and the increment value to (maximum - minimum ) / 100, where 100 is the number of maximal clicks.

For more information about LinkedList and NumericUpDown, look them up in the .NET Framework SDK documentation.

Code (Download)

using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

namespace AverageTypeTokenRatio
  public partial class Form1 : Form

    #region Form-wide variables

     // Do nothing if there are less than 100 tokens 
    const int MIN_WINDOW_SIZE = 100;

     // Assume we allow maximally 100 clicks on this control in one direction. 
    const int NUMERIC_UP_DOWN_CLICKS = 20;

    string[] tokens;

    double generalTypeTokenRatio;


    public Form1( )
      InitializeComponent( );

    #region Methods

     // This method is called after user selects a file or after user's mouse leaves the input 
     // textbox. It sets the maximum and increment values of the NumericUpDown control based on 
     // the number of the tokens and the pre-defined minimum and clicks values. It also fills 
     // the textbox for the number of the tokens. 
    private void SetNumericUpDownControl( )
        // The text in the input textbox may have been manually pasted or read from a file. In 
        // any case, if there is no text in the input textbox, do nothing. 
       string text = textBoxInput.Text.Trim( );
       if( text != string.Empty )
           // Tokenize text using the improved greedy tokenizer 
          tokens = Tokenize( text );

           // For this application, we are interested in texts with at least MIN_WINDOW_SIZE tokens 
          if ( tokens.Length < MIN_WINDOW_SIZE )
                  string.Format( "Too few tokens. Try text with more than {0} tokens." , MIN_WINDOW_SIZE ) );
              // Fill up the tokens count textbox 
             textBoxTokenCount.Text = tokens.Length.ToString( );

              // Dynamically set the maximum, the minimum, and the increment values 
             numericUpDownWindowSize.Maximum = tokens.Length;
             numericUpDownWindowSize.Increment = ( tokens.Length - MIN_WINDOW_SIZE ) / NUMERIC_UP_DOWN_CLICKS;
             numericUpDownWindowSize.Minimum = MIN_WINDOW_SIZE;

    /// <summary>
    ///  Tokenizes a text into an array of tokens, using the improved tokenizer with the discard-digit option.
    /// </summary>
    /// <param name="text"> the text to tokenize</param>
    private static string[] Tokenize( string text )
       // Discard digits 
      char[] delimiters_no_digits = new char[] {
          '{ ', ' }', '( ', ' )', ' [', ']', '>', '<','-', '_', '= ', '+',
          '|', '\\', ':', ';', ' ', ',', '.', '/', '?', '~', '!',
          '@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t',
          '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

      string[] tokens = text.Split( delimiters_no_digits,
      StringSplitOptions.RemoveEmptyEntries );

      for ( int i = 0; i < tokens.Length; i++ )
         string token = tokens [i];

          // Change token only when it starts and/or ends with "'" and it has at least 2 characters. 
         if ( token.Length > 1 )
            if ( token.StartsWith( "'" ) && token.EndsWith( "'" ) )
            tokens [i] = token.Substring( 1, token.Length - 2 ); // remove the starting and ending "'" 

            else if ( token.StartsWith( "'" ) )
            tokens [i] = token.Substring( 1 ); // remove the starting "'" 

            else if ( token.EndsWith( "'" ) )
            tokens [i] = token.Substring( 0, token.Length - 1 ); // remove the last "'" 

       return tokens;

    /// <summary>
    ///  Computes the average type token ratio of an array of tokens, based on the window size.
    /// </summary>
    /// <param name="tokens"> the array of the tokens to calculate the average type token ratio</param>
    /// <param name="windowSize"> the number of the tokens per window</param>
    private static double GetAverageTypeTokenRatio( string[] tokens, int windowSize )
        LinkedList<string> movingWindow = new LinkedList<string>( );

        int index = 0;
        while ( index < windowSize )
           movingWindow.AddLast( tokens [index] );

         // Build frequency table of this window of tokens 
        Dictionary<string, int> movingFreqTable = BuildFreqTable( movingWindow );

         // This type token ratio keeps changing 
        double finalTTR = ( double )movingFreqTable.Count / movingWindow.Count;

        int windowCount = 1;

         // Now index stops at windowSize position of the tokens. 
        while ( index < tokens.Length )
            // Check the first token of the moving window of tokens and remove it from 
            // the moving window. 
           string firstToken = movingWindow.First.Value;
           movingWindow.RemoveFirst( );

            // Check its frequency in the frequency table. If it is 1, it means that this token 
            // occurs in the moving window only once, so we can safely remove it from the moving 
            // frequency table; otherwise, it appears more than once, so we cannot delete it but 
            // we can reduce its frequency by 1. 
           if ( movingFreqTable [firstToken] == 1 )
               movingFreqTable.Remove( firstToken );
               movingFreqTable [firstToken]--;

            // Find the next available token. If it is in the moving frequency table, increase its 
            // frequency value by 1; otherwise, add it as a new entry and set its frequency to 1. 
           string newToken = tokens [index];
           if ( movingFreqTable.ContainsKey( newToken ) )
               movingFreqTable [newToken]++;
               movingFreqTable.Add( newToken, 1 );

            // Add this token to the moving window so that the window always has the same number of tokens. 
           movingWindow.AddLast( newToken );

            // Re-compute the type token ratio of this changed window. 
           double thisTTR = ( double )movingFreqTable.Count / windowSize;

            // Add this new type token ratio to the final type token ratio. 
           finalTTR += thisTTR;

            // Update index position and window counters 

        // We need to divided the final type token ratio by the number of windows 
       finalTTR = finalTTR / windowCount;

       return finalTTR;

    /// <summary>
    ///  Compute the general type token ratio of an array of tokens.
    /// </summary>
    /// <param name="tokens"> the array of tokens whose type and type token ratio values
    ///  are computed </param>
    /// <param name="types"> the HashSet to be updated</param>
    /// <param name="typeTokenRatio"> the type token ratio to be updated</param>
    private static void GetGeneralTypeTokenRatio( string[] tokens,
                         out HashSet<string> types, out double typeTokenRatio )
        // Dump array of tokens into a HashSet of string. By definition, 
        // HashSet has no duplicates, which is the what type means. 
       types = new HashSet<string>( );

       foreach ( string token in tokens )
       types.Add( token );

        // A sanity check: if types set is empty, set typeTokenRatio = double.NaN, 
        // i.e. Not a Number; otherwise, we'll get a "divided by 0" Exception. 
       if ( types.Count == 0 )
          typeTokenRatio = double.NaN;
           // Be very aware that you need to cast either types.Count or tokens.Length 
           // into double type; otherwise you'll always get 0 as the result. 
          typeTokenRatio = ( double )types.Count / tokens.Length;

        // After the completion of this method, a new value of types and typeTokenRatio 
        // will be sent back to the caller, which is what 'out parameter' means. 

    /// <summary>
    ///  Create a string-integer dictionary out of a linked list of tokens.
    /// </summary>
    /// <param name="tokens">  the tokens to create the frequency table. For this
    ///  method, there is no difference between List and LinkedList types. </param>
    private static Dictionary<string, int> BuildFreqTable( LinkedList<string> tokens )
       Dictionary<string, int> token_freq_table = new Dictionary<string, int>( );
       foreach ( string token in tokens )
           if ( token_freq_table.ContainsKey( token ) )
               token_freq_table [token]++;
               token_freq_table.Add( token, 1 );

       return token_freq_table;


    #region event-handlers

     // This event handler is invoked after the Select File button is clicked. 
    private void btnSelectFile_Click( object sender, EventArgs e )
         // Programmatically create an open file dialog and make it open *.txt files only. 
        OpenFileDialog openDlg = new OpenFileDialog( );
        openDlg.Filter = "Text files|*.txt" ;

         // If OK button of the open file dialog is clicked, put the text of 
         // the selected file in the input box 
        if ( openDlg.ShowDialog( ) == DialogResult.OK )
             // This deals with the exception caused by files that have 
             // .txt extensions but are actually not text files. 
                textBoxInput.Text = File.ReadAllText( openDlg.FileName );
            catch ( Exception ex )
                MessageBox.Show( ex.Message );

             // Set numeric up down control 
            SetNumericUpDownControl( );

     // This event handler is invoked after user's mouse leaves the input textbox. 
    private void textBoxInput_MouseLeave( object sender, EventArgs e )
         // Set numeric up down control 
        SetNumericUpDownControl( );

    private void btnGo_Click( object sender, EventArgs e )
         // Make sure some text has been put in the input textbox. 
        if ( tokens != null && tokens.Length > 0 )
             // Compute general type token ratio 
            GetGeneralTypeTokenRatio( tokens, out types, out generalTypeTokenRatio );

            textBoxTypeCount.Text = types.Count.ToString( );
            textBoxTTR.Text = string.Format( "{0:0.###}" , generalTypeTokenRatio );

             // Compute average type token ratio. 
             // Note that the Value of a NumericUpDown control is of decimal type 
             // so we need to convert it to int type to fit the called method. 
            double averageTypeTokenRatio
                = GetAverageTypeTokenRatio( tokens, decimal.ToInt32( numericUpDownWindowSize.Value ) );

            textBoxAverageTTR.Text = string.Format( "{0:0.###}" , averageTypeTokenRatio );



Average Type Token Ratio
Fig. Average Type Token Ratio

Average Type Token Ratio
Fig. Average Type Token Ratio Of Love Of Life


Figure 2 is the average type token ratios of Jack London's Love of Life when setting the window sizes from 100 to 52325 tokens. As shown from the figure, the bigger the window size, the smaller the type token ratio. The only exception is that when the window size is 49709 the average type token ratio is 0.117 while when the window sizes are 52320 and 52325 it is 0.118, contrary to the general trend.

Skip Navigation LinksHome > Nlp Sample Code > Linguistic Fingerprinting > Average Type Token Ratio