Too broad

Right now, I think the task is a bit too broad. In particular, there are too many options for what to do, and we should instead focus on a more restricted set of things that everyone can implement “the same”. Since we're really talking about something like wikitext or markdown, we should use something like that (including allowing people to make use of useful libraries if they wish). Paragraphs are really the minimum; inline bold, italic and fixed-width are also very useful in the minimum set (if only there were a single accepted standard for doing them…) –Donal Fellows 12:03, 5 January 2012 (UTC)

we are specifically not talking about wikitext or markdown, but text without any markup at all, except for indenting, empty lines and numbers and bullets, things you would use in plain text. specifically things like inline bold, italic and fixed-width are not possible without some kind of markup, and thus not what this task is looking for.
Bold and italic can be recognized by things like *foo* and /foo/ which people use in plain text anyway.
what i am looking for is to go beyond just recognizing paragraphs, to explore what else can be analyzed out of plain text. i very much expect that the task description will be in flux for a while until we can work out a reasonable set of requirements.
think of what you would get when using a commandline browser like lynx or w3m on a terminal without colors or bold text. what you see there is potential input for this task. i do not expect all such input to be parsable, but a reasonable set that goes beyond just paragraphs.--eMBee 12:36, 5 January 2012 (UTC)
It is going to be difficult to compare implementations if none of them are doing the same thing. For example, the open-ended concept of "plain text tables" pretty much guarantees that any implementation which does not ignore that part of the task will differ from every other implementation, unless one was copied from the other. A lack of examples will also make comparison difficult. --Rdm 18:11, 5 January 2012 (UTC)
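To make the *foo* and /foo/ convention mentioned above concrete, here is a minimal Python sketch. It is only one possible reading, not an agreed requirement: the mapping of markers to the b, i and u tags, the names MARKERS, PATTERN and inline_markup, and the rule that a marker pair must not touch other word characters (so that paths and URLs are left alone) are all assumptions made for illustration.

```python
import html
import re

# Hypothetical marker table: the marker characters come from the discussion,
# the choice of HTML tags is an assumption.
MARKERS = {"*": "b", "/": "i", "_": "u"}

# A pair of identical markers counts only when it wraps a run of word
# characters (spaces allowed inside) and is not glued to neighbouring word
# or marker characters. This leaves a/b/c and http://... untouched.
PATTERN = re.compile(r"(?<![\w*/_])([*/_])(\w[\w ]*?\w|\w)\1(?![\w*/_])")

def inline_markup(text):
    """Escape HTML specials, then wrap *x*, /x/ and _x_ in tags."""
    escaped = html.escape(text, quote=False)

    def repl(match):
        tag = MARKERS[match.group(1)]
        return "<{0}>{1}</{0}>".format(tag, match.group(2))

    return PATTERN.sub(repl, escaped)

print(inline_markup("plain *bold* or /italic/ or _underline_"))
```

Whether such a rule produces the same result in every implementation depends on pinning down the edge cases (punctuation inside the emphasised run, nested markers), which this sketch deliberately ignores.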

Concrete requirements?

    Recognize
a leading indentation.

Also hanging
    indentations.

   Block
   indentations.

A paragraph
* with
* bullets, some
of which are like this, but
the additional lines should line up with the first word.

Treat     this
as a      table
because   of
alignment.

A little convention for *bold* or /italic/ or _underline_ is not
such a terrible thing.

Horizontal rule:

----------------

+-----------------------+
| Box                   |
+-----------+-----------+
| Structure |           |
+-----------------------+
| How about it?         |
+-----------------------+

Of course, it should go without saying that, HTML characters like
< and & must be properly escaped.

But http://this.is.a/url turned into a link.

192.139.122.42 21:14, 5 January 2012 (UTC)
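Three of the items in the sample above (escaping < and &, turning a URL into a link, and the horizontal rule) are mechanical enough to sketch. The following Python is a rough illustration only: the function name convert_line, the "four or more dashes" threshold for a rule, the restriction to http and https URLs, and the stripping of trailing punctuation from URLs are all assumptions, not part of the proposal.

```python
import html
import re

# Assumption: only http/https URLs are linked, and a URL ends at whitespace.
URL = re.compile(r"https?://[^\s<>\"]+")
# Assumption: a line of four or more dashes, and nothing else, is a rule.
RULE = re.compile(r"^-{4,}$")

def convert_line(line):
    """Convert one line of plain text to an HTML fragment."""
    if RULE.match(line.strip()):
        return "<hr>"
    out = []
    pos = 0
    for match in URL.finditer(line):
        # Escape the plain text that precedes the URL.
        out.append(html.escape(line[pos:match.start()], quote=False))
        # Treat trailing punctuation as part of the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        out.append('<a href="{}">{}</a>'.format(
            html.escape(url, quote=True), html.escape(url, quote=False)))
        pos = match.start() + len(url)
    out.append(html.escape(line[pos:], quote=False))
    return "".join(out)

print(convert_line("But http://this.is.a/url turned into a link."))
print(convert_line("< and & must be properly escaped."))
print(convert_line("----------------"))
```

Escaping is applied to the text between URLs rather than to the whole line up front, so that the characters of the generated tags are not escaped a second time.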

Some of those seem plausible.
I'm dubious about hanging indentation (is that first line a part of the following text or is it independent? How many people actually use hanging indentation? Isn't that likely to cause problems with automatic list formatting? What is it doing in a plain-text converter?)
i use hanging indentation inside a list:
list item:
  content
  content
list item:
  sublist:
    content
but that's just one example, there might be counter examples--eMBee 02:33, 6 January 2012 (UTC)
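One way to read the list example above is that each line owns the more deeply indented lines that follow it. A minimal Python sketch of that reading is below. The name indent_tree and the dictionary layout are hypothetical, and the sketch assumes indentation is made of spaces only. It also treats each "content" line as a separate node, which sidesteps the question of whether consecutive lines belong to one paragraph.

```python
def indent_tree(lines):
    """Nest each line under the nearest preceding line with less indentation."""
    root = {"text": None, "children": []}
    # Stack of (indentation, node); the root sits below any real indentation.
    stack = [(-1, root)]
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no structure in this sketch
        depth = len(line) - len(line.lstrip(" "))
        node = {"text": line.strip(), "children": []}
        # Drop back to the closest ancestor that is indented less.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]

sample = [
    "list item:",
    "  content",
    "  content",
    "list item:",
    "  sublist:",
    "    content",
]
for item in indent_tree(sample):
    print(item["text"], [child["text"] for child in item["children"]])
```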
I'm dubious about line-end treatment, in general. (In part because I have been discovering, recently, that what I thought was plain text treatment sometimes is not, and my lines are being glued together in contexts where that destroys my presentation.)
but in those cases don't you also need monospace? this is the hardest part to solve. because it could be wrong either way. some of the text-to-html converters i found offered the user a choice, if they wanted <br> inserted or not.--eMBee 02:33, 6 January 2012 (UTC)
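The user choice described above (keep line ends, or glue the lines of a paragraph together) could look roughly like this in Python. This is a sketch under stated assumptions: the function name paragraphs_to_html and the keep_line_breaks flag are invented for illustration, and paragraphs are assumed to be separated by one or more blank lines.

```python
import html
import re

def paragraphs_to_html(text, keep_line_breaks=False):
    """Wrap blank-line-separated blocks in <p>, optionally keeping line ends."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # The caller decides: a <br> at each line end, or lines joined by a space.
    joiner = "<br>\n" if keep_line_breaks else " "
    out = []
    for block in blocks:
        lines = [html.escape(line.strip(), quote=False)
                 for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # empty input yields no paragraphs
        out.append("<p>" + joiner.join(lines) + "</p>")
    return "\n".join(out)

print(paragraphs_to_html("one\ntwo\n\nthree"))
print(paragraphs_to_html("one\ntwo\n\nthree", keep_line_breaks=True))
```

Neither setting addresses the monospace question raised above; text that relies on alignment would still need separate handling.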
I'm also dubious about that table concept: If I put two spaces between my sentences and I use line ends in my text, and a space happens to fall under the second space on the following line, is that a table? If the line after it is short, does that make it a table? Where are the row boundaries?
And what about boxes that do not align?
i don't know if tables can be done in any reasonable manner. two lines and two columns is certainly not enough to recognize a table. but what about this?
this is no table.  just 9 sentences.  that align.
a short sentence.  another sentence.  and more.
some longer text.  this too is text.  so is this.
also, sentences in a table are unlikely all ending in .
the 1 above may  but this text
not be a table.  here surely is.
tables were just a thought, wondering if they can be recognized in a reasonable way. if they can't i'd drop the idea.--eMBee 02:33, 6 January 2012 (UTC)
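To test whether the two hints above (more than two lines, and cells that do not all end in a full stop) are enough to separate the examples, here is a rough Python heuristic. Everything in it is an assumption made for the sake of discussion: the names common_gaps, split_cells and looks_like_table, the minimum of three rows, and the rule that a column boundary is a run of two or more spaces shared by every line.

```python
def common_gaps(rows):
    """Return (start, end) column ranges of 2+ spaces shared by every row."""
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    blank = [all(p[c] == " " for p in padded) for c in range(width)]
    gaps = []
    c = 0
    while c < width:
        if not blank[c]:
            c += 1
            continue
        start = c
        while c < width and blank[c]:
            c += 1
        # Skip leading indentation, trailing padding and single spaces.
        if start > 0 and c < width and c - start >= 2:
            gaps.append((start, c))
    return gaps

def split_cells(rows, gaps):
    """Cut every row at the shared gaps and strip each cell."""
    begins = [0] + [end for _, end in gaps]
    ends = [start for start, _ in gaps] + [None]
    return [[row[b:e].strip() for b, e in zip(begins, ends)] for row in rows]

def looks_like_table(lines, min_rows=3):
    """Guess whether a block of lines is a table rather than aligned prose."""
    rows = [line.rstrip() for line in lines if line.strip()]
    if len(rows) < min_rows:
        return False
    gaps = common_gaps(rows)
    if not gaps:
        return False
    cells = [cell for row in split_cells(rows, gaps) for cell in row if cell]
    # Aligned sentences: every cell ends in a full stop, so probably prose.
    return not all(cell.endswith(".") for cell in cells)

prose = [
    "this is no table.  just 9 sentences.  that align.",
    "a short sentence.  another sentence.  and more.",
    "some longer text.  this too is text.  so is this.",
]
print(looks_like_table(prose))
print(looks_like_table(["name   qty", "apple  3", "pear   12"]))
```

With the three-row minimum, the two-line example above is rejected regardless of its content, so this sketch does not settle whether that block should count as a table.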
Anyways, this is a plausible start, but some aspects of it still seem overly open-ended. --Rdm 22:11, 5 January 2012 (UTC)
the requirements are currently open ended exactly because i want to bring out questions like yours. thank you. we can discuss these questions then use them to formulate more concrete requirements.--eMBee 02:33, 6 January 2012 (UTC)