Talk:Compiler/lexical analyzer: Difference between revisions

Revision as of 13:36, 14 August 2016

Clarification

I like this task, thanks for contributing it. But some more clarification needs to be added before moving it out of draft status. Off the top of my head:

  • encoding: Should we expect the input files in a specific encoding? Maybe latin-1 or utf-8?
  • encoding: Should string and char literals support Unicode, or just ASCII?
  • char literals: The stated regex is 'x', but that's not actually a regex. Shouldn't it be '\\?[^']' (a.k.a. "\n or \\ or any character except ', enclosed in single quotes")?
  • char literals: How can a single quote be represented as a char, if there are no other escape sequences besides \n and \\?
  • string literals: The stated regex is ".*", but this would match e.g. "foo bar" < " due to the asterisk performing greedy matching. Shouldn't it be "[^"]*" (a.k.a. "match zero or more characters except the double quote, enclosed in double quotes")?
  • string literals: How can a double quote be represented inside a string literal, if there are no other escape sequences besides \n and \\?
  • whitespace: This needs an actual thorough description, instead of just an example. Am I right to assume that zero or more whitespace characters or comments are allowed between any two tokens, with no exceptions, and that "longest token matching" is used to resolve conflicts (e.g. in order to match <= as a single token rather than the two tokens < and =)?
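To make the regex and longest-match questions above concrete, here is a minimal sketch. It assumes Python's re module semantics, which the task description does not specify, and the names GREEDY_STRING, STRICT_STRING, CHAR_LITERAL, OPERATORS and longest_operator are made up for illustration only. It is not the task's reference lexer.

```python
import re

# String-literal patterns from the discussion: the one stated in the task
# and the stricter alternative suggested above.
GREEDY_STRING = re.compile(r'".*"')
STRICT_STRING = re.compile(r'"[^"]*"')

# Char-literal pattern suggested above: an optional backslash followed by
# exactly one character that is not a single quote, inside single quotes.
CHAR_LITERAL = re.compile(r"'\\?[^']'")

source = '"foo bar" < "'

# The greedy pattern runs to the LAST double quote in the input.
greedy = GREEDY_STRING.match(source).group()
# The stricter pattern stops at the first closing double quote.
strict = STRICT_STRING.match(source).group()


def is_char_literal(text):
    """Return True if the whole text matches the suggested char pattern."""
    return CHAR_LITERAL.fullmatch(text) is not None


# Hypothetical operator subset, used only to illustrate longest matching.
OPERATORS = ["<=", "<", "="]


def longest_operator(text, pos=0):
    """Return the longest operator starting at pos, or None if none fits."""
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    return max(candidates, key=len) if candidates else None


if __name__ == "__main__":
    print(repr(greedy))
    print(repr(strict))
    for literal in ["'a'", r"'\n'", r"'\\'", "'''"]:
        print(literal, is_char_literal(literal))
    print(longest_operator("<= 1"), longest_operator("< ="))
```

Under these assumptions the greedy pattern consumes the whole input, while the stricter one stops after "foo bar". The char pattern accepts 'a', '\n' and '\\' but rejects ''', which is the gap raised in the second char-literal question. It also accepts any other backslash-plus-character pair, so it is looser than the quoted description. The operator helper returns <= for the input "<= 1" and < for "< =".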

Sorry if some of these sound pedantic, but experience on Rosetta Code has shown that tasks of this complexity absolutely need to be precise and unambiguous, so that they don't cause problems for people who try to add solutions... :)
--Smls (talk) 13:32, 14 August 2016 (UTC)