Text to HTML: Difference between revisions

From Rosetta Code
m (→‎{{header|Pike}}: capitalize Pike)
(→‎Tcl: Added implementation)
Line 1: Line 1:
{{Draft task|Text processing}}

When developing a website, it is occasionally necessary to handle text that is received without formatting and present it in a pleasing manner. To achieve this, the text needs to be converted to HTML.


Line 151: Line 150:
return root;
}</lang>

=={{header|Tcl}}==
This renderer is deliberately minimal. It avoids the complexity that a fuller converter could take on and instead provides only the minimum that could be useful to someone producing very simple text pages.
<lang tcl>package require Tcl 8.5

proc splitParagraphs {text} {
    split [regsub -all {\n\s*(\n\s*)+} [string trim $text] \u0000] "\u0000"
}
proc determineParagraph {para} {
    set para [regsub -all {\s*\n\s*} $para " "]
    switch -regexp -- $para {
        {^\s*\*+\s} {
            return [list ul [string trimleft $para " \t*"]]
        }
        {^\s*\d+\.\s} {
            set para [string trimleft $para " \t\n0123456789"]
            set para [string range $para 1 end]
            return [list ol [string trimleft $para " \t"]]
        }
        {^#+\s} {
            return [list heading [string trimleft $para " \t#"]]
        }
    }
    return [list normal $para]
}
proc markupParagraphContent {para} {
    set para [string map {& &amp; < &lt; > &gt;} $para]
    regsub -all {_([\w&;]+)_} $para {<i>\1</i>} para
    regsub -all {\*([\w&;]+)\*} $para {<b>\1</b>} para
    regsub -all {`([\w&;]+)`} $para {<tt>\1</tt>} para
    return $para
}

proc markupText {title text} {
    set title [string map {& &amp; < &lt; > &gt;} $title]
    set result "<html>"
    append result "<head><title>" $title "</title>\n</head>"
    append result "<body>" "<h1>$title</h1>\n"
    set state normal
    foreach para [splitParagraphs $text] {
        lassign [determineParagraph $para] type para
        set para [markupParagraphContent $para]
        switch $state,$type {
            normal,normal {append result "<p>" $para "</p>\n"}
            normal,heading {
                append result "<h2>" $para "</h2>\n"
                set type normal
            }
            normal,ol {append result "<ol>" "<li>" $para "</li>\n"}
            normal,ul {append result "<ul>" "<li>" $para "</li>\n"}

            ul,normal {append result "</ul>" "<p>" $para "</p>\n"}
            ul,heading {
                append result "</ul>" "<h2>" $para "</h2>\n"
                set type normal
            }
            ul,ol {append result "</ul>" "<ol>" "<li>" $para "</li>\n"}
            ul,ul {append result "<li>" $para "</li>\n"}

            ol,normal {append result "</ol>" "<p>" $para "</p>\n"}
            ol,heading {
                append result "</ol>" "<h2>" $para "</h2>\n"
                set type normal
            }
            ol,ol {append result "<li>" $para "</li>\n"}
            ol,ul {append result "</ol>" "<ul>" "<li>" $para "</li>\n"}
        }
        set state $type
    }
    if {$state ne "normal"} {
        append result "</$state>"
    }
    return [append result "</body></html>"]
}</lang>
Here's an example of how it would be used.
<lang tcl>set sample "
This is an example of how a pseudo-markdown-ish formatting scheme could
work. It's really much simpler than markdown, but does support a few things.

# Block paragraph types

* This is a bulleted list

* And this is the second item in it

1. Here's a numbered list

2. Second item

3. Third item

# Inline formatting types

The formatter can render text with _italics_, *bold* and in a `typewriter`
font. It also does the right thing with <angle brackets> and &amp;ersands,
but relies on the encoding of the characters to be conveyed separately."

puts [markupText "Sample" $sample]</lang>
{{out}}
<lang html><html><head><title>Sample</title>
</head><body><h1>Sample</h1>
<p>This is an example of how a pseudo-markdown-ish formatting scheme could work. It's really much simpler than markdown, but does support a few things.</p>
<h2>Block paragraph types</h2>
<ul><li>This is a bulleted list</li>
<li>And this is the second item in it</li>
</ul><ol><li>Here's a numbered list</li>
<li>Second item</li>
<li>Third item</li>
</ol><h2>Inline formatting types</h2>
<p>The formatter can render text with <i>italics</i>, <b>bold</b> and in a <tt>typewriter</tt> font. It also does the right thing with &lt;angle brackets&gt; and &amp;amp;ersands, but relies on the encoding of the characters to be conveyed separately.</p>
</body></html></lang>

Revision as of 17:35, 31 March 2012

Text to HTML is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

When developing a website, it is occasionally necessary to handle text that is received without formatting and present it in a pleasing manner. To achieve this, the text needs to be converted to HTML.

Write a converter from plain text to HTML.

The plain text has no formatting information.

It may have centered headlines, numbered sections, paragraphs, lists, and URIs. It could even have tables.

Simple converters restrict themselves to identifying paragraphs, but I believe more can be done if the text is analyzed.

You are not required to copy the algorithm from the existing solutions; use whatever facilities are available in your language to best solve the problem.

The only requirement is to ensure that the result is valid XHTML.

Pike

Algorithm:

  • split by line
  • find average line length to identify centered lines
  • find isolated lines to identify section headings
  • find URIs
  • identify section numbering
  • identify bullet and numbered lists
  • identify paragraphs
  • identify indented lines
  • if possible identify tables

To ensure valid XHTML, create a nested structure:

  • create an xml node
  • add elements to node
  • add lines to element if appropriate

This implementation is still incomplete.
<lang Pike>// function to calculate the average line length (not used yet below)
int linelength(array lines) {
   array sizes = sizeof(lines[*])-({0});
   sizes = sort(sizes);
   // only consider the larger half of lines minus the top 5%
   array larger = sizes[sizeof(sizes)/2..sizeof(sizes)-sizeof(sizes)/20];
   int averagelarger = `+(@larger)/sizeof(larger);
   return averagelarger;
}

array mark_up(array lines) {
   array markup = ({});
   // find special lines
   foreach(lines; int index; string line)
   {
       string strippedline = String.trim_whites(line);
       if (sizeof(strippedline))
       {
           string firstchar = strippedline[0..0];
           int pos = search(line, firstchar);
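           // treat the start and end of the text as blank neighbours so that
           // index-1 and index+1 never step outside the array
           if ((index == 0 || lines[index-1]-" "-"\t" == "")
               && (index == sizeof(lines)-1 || lines[index+1]-" "-"\t" == ""))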
               markup +=({ ({ "heading", strippedline, pos }) });
           else if (firstchar == "*")
               markup += ({ ({ "bullet", strippedline, pos }) });
           else if ( (<"0","1","2","3","4","5","6","7","8","9">)[firstchar] )
               markup += ({ ({ "number", strippedline, pos }) });
           else if (pos > 0)
               markup += ({ ({ "indent", strippedline, pos }) });
           else            
               markup += ({ ({ "regular", strippedline, pos }) });
       }
       else markup += ({ ({ "empty" }) });
   }
   foreach(markup; int index; array line)
   {
       if (index > 0 && index < sizeof(markup)-1 )
       {
           if (line[0] == "regular" && markup[index-1][0] != "regular" && markup[index+1][0] != "regular")
               line[0] = "heading";
       }
   }
   //find paragraphs
   foreach(markup; int index; array line)
   {
       if (index > 0 && index < sizeof(markup)-1 )
       {
           if (line[0] == "empty" && markup[index-1][0] == "regular" && markup[index+1][0] == "regular")
               line[0] = "new paragraph";
           else if (line[0] == "empty" && markup[index-1][0] == "regular" && markup[index+1][0] != "regular")
               line[0] = "end paragraph";
           else if (line[0] == "empty" && markup[index-1][0] != "regular" && markup[index+1][0] == "regular")
               line[0] = "begin paragraph";
       }
   }
   return markup;
}

object make_tree(array markup) {
   object root = Parser.XML.Tree.SimpleRootNode(); 
   object newline = Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_TEXT, "", ([]), "\n");
   array current = ({ Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_ELEMENT, "div", ([]), "") });
   root->add_child(current[-1]);
   foreach (markup; int index; array line)
   {
       switch(line[0])
       {
           case "heading": 
                     current[-1]->add_child(newline);
                     object h = Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_ELEMENT, "h3", ([]), "");
                     h->add_child(Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_TEXT, "", ([]), line[1]));
                     current[-1]->add_child(h);
                     current[-1]->add_child(newline);
                 break;
           case "bullet":
           case "number":
                     if (current[-1]->get_tag_name() == "li")
                         current = Array.pop(current)[1];
                     current[-1]->add_child(newline);
                     object li = Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_ELEMENT, "li", ([]), "");
                     li->add_child(Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_TEXT, "", ([]), line[1]));
                     current[-1]->add_child(li);
                     current = Array.push(current, li);
                 break;
           case "indent":
                     if (markup[index-1][0] != "bullet" && markup[index-1][0] != "number")
                         current = Array.pop(current)[1];
                     current[-1]->add_child(Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_TEXT, "", ([]), line[1]));
                 break;
           case "new paragraph":
                     current = Array.pop(current)[1];
                     current[-1]->add_child(newline);
           case "begin paragraph":
                     object p = Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_ELEMENT, "p", ([]), "");
                     current[-1]->add_child(p); 
                     current = Array.push(current, p);
                break;
           case "end paragraph":
                     current = Array.pop(current)[1];
                     current[-1]->add_child(newline);
                break;
           case "regular":           
                     current[-1]->add_child(Parser.XML.Tree.SimpleNode(Parser.XML.Tree.XML_TEXT, "", ([]), line[1]));
           case "empty": 
                 break;
       } 
   }   
   return root;
}</lang>

Tcl

This renderer is deliberately minimal. It avoids the complexity that a fuller converter could take on and instead provides only the minimum that could be useful to someone producing very simple text pages.
<lang tcl>package require Tcl 8.5

proc splitParagraphs {text} {
    split [regsub -all {\n\s*(\n\s*)+} [string trim $text] \u0000] "\u0000"
}
proc determineParagraph {para} {
    set para [regsub -all {\s*\n\s*} $para " "]
    switch -regexp -- $para {
        {^\s*\*+\s} {
            return [list ul [string trimleft $para " \t*"]]
        }
        {^\s*\d+\.\s} {
            set para [string trimleft $para " \t\n0123456789"]
            set para [string range $para 1 end]
            return [list ol [string trimleft $para " \t"]]
        }
        {^#+\s} {
            return [list heading [string trimleft $para " \t#"]]
        }
    }
    return [list normal $para]
}
proc markupParagraphContent {para} {
    set para [string map {& &amp; < &lt; > &gt;} $para]
    regsub -all {_([\w&;]+)_} $para {<i>\1</i>} para
    regsub -all {\*([\w&;]+)\*} $para {<b>\1</b>} para
    regsub -all {`([\w&;]+)`} $para {<tt>\1</tt>} para
    return $para
}

proc markupText {title text} {
    set title [string map {& &amp; < &lt; > &gt;} $title]
    set result "<html>"
    append result "<head><title>" $title "</title>\n</head>"
    append result "<body>" "<h1>$title</h1>\n"
    set state normal
    foreach para [splitParagraphs $text] {
        lassign [determineParagraph $para] type para
        set para [markupParagraphContent $para]
        switch $state,$type {
            normal,normal {append result "<p>" $para "</p>\n"}
            normal,heading {
                append result "<h2>" $para "</h2>\n"
                set type normal
            }
            normal,ol {append result "<ol>" "<li>" $para "</li>\n"}
            normal,ul {append result "<ul>" "<li>" $para "</li>\n"}

            ul,normal {append result "</ul>" "<p>" $para "</p>\n"}
            ul,heading {
                append result "</ul>" "<h2>" $para "</h2>\n"
                set type normal
            }
            ul,ol {append result "</ul>" "<ol>" "<li>" $para "</li>\n"}
            ul,ul {append result "<li>" $para "</li>\n"}

            ol,normal {append result "</ol>" "<p>" $para "</p>\n"}
            ol,heading {
                append result "</ol>" "<h2>" $para "</h2>\n"
                set type normal
            }
            ol,ol {append result "<li>" $para "</li>\n"}
            ol,ul {append result "</ol>" "<ul>" "<li>" $para "</li>\n"}
        }
        set state $type
    }
    if {$state ne "normal"} {
        append result "</$state>"
    }
    return [append result "</body></html>"]
}</lang>
Here's an example of how it would be used.
<lang tcl>set sample "
This is an example of how a pseudo-markdown-ish formatting scheme could
work. It's really much simpler than markdown, but does support a few things.

# Block paragraph types

* This is a bulleted list

* And this is the second item in it

1. Here's a numbered list

2. Second item

3. Third item

# Inline formatting types

The formatter can render text with _italics_, *bold* and in a `typewriter`
font. It also does the right thing with <angle brackets> and &amp;ersands,
but relies on the encoding of the characters to be conveyed separately."

puts [markupText "Sample" $sample]</lang>

Output:

<lang html><html><head><title>Sample</title>
</head><body><h1>Sample</h1>
<p>This is an example of how a pseudo-markdown-ish formatting scheme could work. It's really much simpler than markdown, but does support a few things.</p>
<h2>Block paragraph types</h2>
<ul><li>This is a bulleted list</li>
<li>And this is the second item in it</li>
</ul><ol><li>Here's a numbered list</li>
<li>Second item</li>
<li>Third item</li>
</ol><h2>Inline formatting types</h2>
<p>The formatter can render text with <i>italics</i>, <b>bold</b> and in a <tt>typewriter</tt> font. It also does the right thing with &lt;angle brackets&gt; and &amp;amp;ersands, but relies on the encoding of the characters to be conveyed separately.</p>
</body></html></lang>
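
The task description also mentions URIs, which the renderer above passes through as plain text. One way this might be added is sketched below. The procedure name <code>markupLinks</code> is hypothetical and is not part of the solution above, and the rule that a URI runs from <code>http://</code> or <code>https://</code> to the next whitespace character is a simplifying assumption rather than a proper URI parser.

```tcl
package require Tcl 8.5

# Hypothetical helper, not part of the solution above: escape the HTML
# special characters, then wrap anything that looks like an http(s) URI
# in an anchor element.
proc markupLinks {para} {
    # Escape first, so that a URI containing & or " still yields a
    # well-formed attribute value.
    set para [string map {& &amp; < &lt; > &gt; \" &quot;} $para]
    # Assumption: a URI extends up to the next whitespace character;
    # trailing punctuation is not treated specially.
    # In the substitution spec, & stands for the whole matched text.
    regsub -all {https?://\S+} $para {<a href="&">&</a>} para
    return $para
}

puts [markupLinks "See http://rosettacode.org/wiki/Text_to_HTML for the task."]
```

This prints <code>See <a href="http://rosettacode.org/wiki/Text_to_HTML">http://rosettacode.org/wiki/Text_to_HTML</a> for the task.</code> If such a step were combined with <code>markupParagraphContent</code>, the URIs would need protecting from the <code>_..._</code> italics rule, because a path such as <code>Text_to_HTML</code> contains a word between underscores and would otherwise be rewritten inside the link.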