Talk:Compiler/lexical analyzer

Clarification

I like this task, thanks for contributing it. But some more clarification needs to be added before moving it out of draft status. Off the top of my head:

encoding: Should we expect the input files in a specific encoding? Maybe latin-1 or utf-8?
encoding: Should string and char literals support Unicode, or just ASCII?
char literals: The stated regex is 'x', but that's not actually a regex. Shouldn't it be '\\?[^']' (a.k.a. an escape sequence or any character except ', enclosed in single quotes")?
char literals: How can a single quote be represented as a char, if there are no other escape sequences besides \n and \\?
string literals: The stated regex is ".*", but this would match e.g. "foo bar" < " due to the asterisk performing greedy matching. Shouldn't it be "[^"]*" (a.k.a. "match zero or more characters except the double quote, enclosed in double quotes")?
string literals: How can a double quote be represented inside a string literal, if there are no other escape sequences besides \n and \\?
whitespace: This needs an actual thorough description, instead of just an example. Am I right to assume that zero or more whitespace characters or comments are allowed between any two tokens, with no exceptions, and that "longest token matching" is used to resolve conflicts (e.g. in order to match <= as a single token rather than the two tokens < and =)?
operators: How is the lexer supposed to differentiate between Sub and Uminus? And why does the third test-case print "Sub" for both?

Sorry if some of these sound pedantic, but experience on rosettacode has shown that tasks of this complexity absolutely need to be precise and unambiguous in order to not cause problems for people who will try to add solutions... :)
--Smls (talk) 14:22, 14 August 2016 (UTC)

Another small clarification. The table of valid tokens refers to a "char literal" but the error examples reference "char constants". Are these the same token? --Thundergnat (talk) 14:44, 14 August 2016 (UTC)

Response

This first category is part of a larger example. In the works, I have syntax analysis, code generation (for a stack based virtual machine) and virtual machine emulator. The code is already complete in C and Python. But no writeups yet.

There are lots of things missing from this simple compiler, as I attempted to weed out features, in order to keep the implementations down to a manageable size. Things like else, >=, ==, data declarations, functions and so on.

The goal was to be able to run simple programs like the prime number generator in the white-space example.

1) encoding (overall):

latin-1

2) encoding - string and char literals:

ASCII

Thinking about it a bit, for a hand-written scanner, there is really nothing that I am aware of preventing string literals and comments from including utf-8. Of course this does not include character literals, where the code would have to be utf-8 aware.

Not sure how/where to update the page regarding the above two.

3) char literal regex:

The (new) definition I'm using for Flex:

\'([^'\n]|\\n|\\\\)\'

Page has been updated.

4) char literals: embedded single quote?

Not supported It is one of the features I arbitrarily removed.

Page has been updated.

5) string literals: regex:

The (new)) definition I'm using for flex:

\"[^"\n]*\"

(thanks for the new definition!)

Page has been updated.

6) string literals: embedded double quotes?

Not supported It is one of the features I arbitrarily removed.

Page has been updated.

7) Whitespace:

I have updated the description.

8) Operators: Sub vs. Uminus

Uminus cannot be recognized by the scanner. It is recognized by the syntax analyzer, i.e., the parser. The token type is there since it will turn up in the parser and the code generator.

Not sure if the page should be updated regarding this.

10) char literal vs char constants

Yes, char literal and char constants represent the same thing.

Interestingly, when I was researching this I got the following doing a Google search for: ("string literal") OR ("string constant")

https://en.wikipedia.org/wiki/String_literal
A string literal or anonymous string is the representation of a string value within the source code ..... Among other things, it must be possible to encode the character that normally terminates the string constant, plus there must be some way to ...

Not sure if the page should be updated regarding this.

--Ed Davis (talk) 16:10, 15 August 2016 (UTC)

Output Format

Thanks for the clarifications. With that out the way, two questions about apparent inconsistencies in the output format:

Why are character literals fed through the evaluator and their calculated value printed, but string literals are printed "raw", exactly as they appeared in the code?
```
line    18  col    26 Integer         32
```
```
line     4  col     7 String   "Hello, World!\n"
```
I suppose this is arbitrary, and done because of potential newlines in strings – if so, wouldn't it be easier to remove support for \n, and let the lexer call its evaluator on strings as well, so that it can print the actual plain string value in the output? So the second line (without the \n) would look like:
```
line     4  col     7 String   Hello, World!
```
Maybe require a single TAB as separator between the output fields, to avoid whitespace ambiguity.
The output consists of four data fields (line number, column number, token type, token value) - how come only the first two fields have a label included in the output? Shouldn't either none or all fields include the label? i.e.:
```
4	7	String	Hello, World!
```
```
line	4	col	7	token	String	value	Hello, World!
```

--Smls (talk) 16:35, 15 August 2016 (UTC)