Free language processing service and NLP C# code
Convert Natural Paragraphs to Html Paragraphs

NLP Task

Given a multi-paragraph text that contains one or more <pre>any text</pre> sections, we want to put <p> at the beginning and </p> at the end of each paragraph, but add no markup around or within the <pre>...</pre> sections. That is, we do not want to produce something like <p><pre>any text</pre></p>, which is not valid HTML.

Algorithms and Implementation

Marking up natural paragraphs with <p> and </p> is trivial. The challenge lies in isolating the <pre>...</pre> sections so that they are not marked up with <p> and </p>. A <pre>...</pre> section can be understood as a string that starts with <pre>, continues with any text that does not contain </pre>, and ends with </pre>.


1. Define a form-level Regex field as follows:
     
     Regex preblockRegex = new Regex("<pre>.+?</pre>", RegexOptions.Singleline|RegexOptions.Compiled);

    (1) The dot symbol (.) matches any character except the newline character (\n). The
        RegexOptions.Singleline option removes this restriction, so '.' matches any character.
        Without this option, the pattern cannot match text that spans multiple lines between
        <pre> and </pre>.

    (2) The plus symbol (+) is a quantifier that requires the preceding element to occur at least
        once. An empty <pre></pre> pair is therefore not matched.

    (3) The question mark (?) immediately following the plus symbol makes the quantifier lazy: the
        regex engine consumes as few characters as possible, testing after each added character
        whether the rest of the pattern matches, and stops at the first success.

    (4) The RegexOptions.Compiled option compiles the regular expression to IL, which makes
        repeated matching faster at the cost of a longer construction time.

     So the regular expression matches <pre>, followed by one or more of any character, and
     stops at the first </pre> that follows.
 
 2. If the regular expression finds matches in the input text:
    (1) Extract all occurrences of <pre>...</pre> and store them, in order, in a List.
    (2) Replace each occurrence of <pre>...</pre> with a unique dummy string such as "pre.block.erp", so that
        the blocks are kept out of the paragraph markup and can be put back accurately.

 3. Split the changed text into a list of natural paragraphs, using "\r\n" as the separator.
 4. Mark up each natural paragraph with <p> and </p> and concatenate them into a new text.
     
 5. Restore the extracted <pre>...</pre> sections by substituting them for the corresponding
    <p>dummy string</p> parts:
    (1) Define another regular expression to capture strings of the following pattern:
       - It must start with <p>
       - It must end with </p>
       - Between <p> and </p> is the dummy string we defined.
         
       For example, if our dummy string is "pre.block.erp", the regular expression will capture
       <p>pre.block.erp</p>
         
    (2) Replace each occurrence of <p>pre.block.erp</p> in the changed text with its corresponding
        extracted <pre>...</pre> section, in the order of extraction.
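
The effect of the options described in step 1 can be checked with a short console sketch. The class name PreBlockRegexDemo, the CountMatches helper, and the sample text are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreBlockRegexDemo
{
    // Illustrative sample: two pre-blocks, the first one spanning two lines.
    public const string Sample =
        "intro\r\n<pre>line 1\r\nline 2</pre>\r\nmiddle\r\n<pre>x</pre>\r\nend";

    // Hypothetical helper: counts the matches of a pattern in Sample.
    public static int CountMatches(string pattern, RegexOptions options)
    {
        return Regex.Matches(Sample, pattern, options).Count;
    }

    public static void Main()
    {
        // Lazy + Singleline: each block is matched separately.
        Console.WriteLine(CountMatches("<pre>.+?</pre>", RegexOptions.Singleline)); // 2

        // Greedy + Singleline: one match from the first <pre> to the last </pre>.
        Console.WriteLine(CountMatches("<pre>.+</pre>", RegexOptions.Singleline));  // 1

        // Lazy without Singleline: '.' cannot cross '\n', so only <pre>x</pre> matches.
        Console.WriteLine(CountMatches("<pre>.+?</pre>", RegexOptions.None));       // 1
    }
}
```

The greedy variant swallows the text between the two blocks, which is why the lazy quantifier is needed when a text contains more than one <pre>...</pre> section.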
   

Code


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace ToHtmlParagraphs
{
    public partial class Form1 : Form
    {
         // A unique string for the pre-blocks 
        const string PreDummy = "pre.block.erp" ;
        
         // To capture the pre-blocks 
        Regex preblockRegex = new Regex( "<pre>.+?</pre>" , RegexOptions.Singleline|RegexOptions.Compiled );

         // To capture the changed dummy pre-blocks; Regex.Escape keeps the dots in PreDummy literal 
        Regex preDummyRegex = new Regex( string.Format( "<p>{0}</p>" , Regex.Escape( PreDummy ) ), RegexOptions.Compiled );
        
        public Form1( )
        {
            InitializeComponent( );
        }

        private void btnGo_Click( object sender, EventArgs e )
        {
            string input = textBoxInput.Text;
            if ( input != string.Empty )
            {
                string output = ToParagraphs( input );
                textBoxOuput.Text = output;
            }
        }

         // Mark up natural paragraphs with HTML <p> and </p>, with <pre>...</pre> spared. 
        private string ToParagraphs( string text )
        {
             // To host the extracted <pre>...</pre> sections. 
            List<string> preblocks = null;

             // If text has any <pre>...</pre> section 
            if ( preblockRegex.IsMatch( text ) )
            {
                 // Extract the preblock strings and store them in preblocks 
                preblocks = new List<string>( );

                MatchCollection matches = preblockRegex.Matches( text );
                foreach ( Match match in matches )
                    preblocks.Add( match.Value );

                 // Replace all <pre>...</pre> sections with a unique dummy string 
                text = preblockRegex.Replace( text, PreDummy );
            }

             // Split text into natural paragraphs 
            string[] paragraphs = text.Split( new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries );

             // Mark up natural paragraphs with <p> and </p> and concatenate them into text  
            text = string.Empty;

            foreach ( string p in paragraphs )
            {
                text += string.Format( "<p>{0}</p>" , p.Trim( ) );
            }

             // Restore the <pre>...</pre> sections 
            if ( preblocks != null )
            {
                int next = 0;

                 // A MatchEvaluator restores the blocks in order and inserts each one literally, 
                 // so a '$' inside a <pre>...</pre> section is not read as a substitution token. 
                text = preDummyRegex.Replace( text, delegate( Match m )
                {
                    return next < preblocks.Count
                        ? "\r\n\r\n" + preblocks [next++] + "\r\n\r\n"
                        : m.Value;
                } );
            }
            return text;
        }
    }
}
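
To try the same steps without a Windows Forms project, the logic can be sketched as a small console program. This is a variant written for illustration, not the listing above: the class name ParagraphMarkup and the sample input are assumptions, and extraction and replacement are merged into a single pass.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ParagraphMarkup
{
    const string PreDummy = "pre.block.erp";

    static readonly Regex PreBlock =
        new Regex("<pre>.+?</pre>", RegexOptions.Singleline);
    static readonly Regex DummyParagraph =
        new Regex("<p>" + Regex.Escape(PreDummy) + "</p>");

    public static string ToParagraphs(string text)
    {
        // Steps 1-2: collect the pre-blocks and leave a placeholder behind.
        var blocks = new List<string>();
        text = PreBlock.Replace(text, m => { blocks.Add(m.Value); return PreDummy; });

        // Steps 3-4: wrap each non-empty line in <p>...</p>.
        var sb = new StringBuilder();
        foreach (string p in text.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
            sb.Append("<p>").Append(p.Trim()).Append("</p>");

        // Step 5: put the blocks back in the order they were extracted.
        int next = 0;
        return DummyParagraph.Replace(sb.ToString(),
            m => next < blocks.Count ? "\r\n\r\n" + blocks[next++] + "\r\n\r\n" : m.Value);
    }

    public static void Main()
    {
        string input = "First paragraph.\r\n<pre>a\r\nb</pre>\r\nSecond paragraph.";
        Console.WriteLine(ToParagraphs(input));
    }
}
```

For this input the method returns <p>First paragraph.</p>, then the untouched <pre> block surrounded by blank lines, then <p>Second paragraph.</p>. Note that a <pre>...</pre> section is restored only when it forms a paragraph of its own; if it shares a line with other text, the placeholder ends up inside a longer <p>...</p> and is left in place.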

Figures

Fig. Natural Paragraphs To Html Paragraphs

Note

Summary of the regular expressions used:

(1) The dot symbol is a special character that matches any character except the newline character (\n). The RegexOptions.Singleline option removes this restriction, so '.' matches any character.

(2) The plus symbol is a quantifier requiring the preceding element to appear at least once.

(3) The question mark immediately following a quantifier makes it lazy: the regex engine consumes one character at a time and stops as soon as the entire expression has been satisfied.

(4) The RegexOptions.Compiled option compiles the regular expression and makes subsequent uses more efficient.
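
Because the dot is a special character, it also matters when the dummy string "pre.block.erp" is placed inside a pattern. The sketch below shows the difference that Regex.Escape makes; the class name EscapeDemo and the two helper methods are hypothetical and serve only this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public const string PreDummy = "pre.block.erp";

    // Dots in PreDummy act as wildcards here.
    public static bool MatchesRaw(string s)
    {
        return Regex.IsMatch(s, "<p>" + PreDummy + "</p>");
    }

    // Regex.Escape turns each dot into a literal dot.
    public static bool MatchesEscaped(string s)
    {
        return Regex.IsMatch(s, "<p>" + Regex.Escape(PreDummy) + "</p>");
    }

    public static void Main()
    {
        string lookalike = "<p>preXblockYerp</p>";
        Console.WriteLine(MatchesRaw(lookalike));                  // True
        Console.WriteLine(MatchesEscaped(lookalike));              // False
        Console.WriteLine(MatchesEscaped("<p>pre.block.erp</p>")); // True
    }
}
```

With the unescaped pattern, a paragraph that merely resembles the dummy string would be treated as a placeholder, so escaping the dummy string is the safer choice.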

