Talk:Compiler/lexical analyzer: Difference between revisions

 
--[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 17:32, 15 August 2016 (UTC)
 
Regarding your recent edit, [[User:Ed Davis|Ed Davis]], if we require producing exact output we should reconsider converting escape sequences in string literals. It doesn't make sense to convert them if they must be converted back for the parser to consume. Also, just a thought, your edit might be better placed in the Output Format section, since that is the specification it refers to.
 
--[[User:The-lambda-way|the-lambda-way]] ([[User talk:The-lambda-way|talk]]) 02:44, 22 September 2020 (UTC)
 
I've updated the "Output Format" section, and removed the same material from the "Additional examples" section. Let me know what you think.
--[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 19:36, 22 September 2020 (UTC)
 
:It should also permit reusable components, such that vm hello.c is perfectly valid. In particular, however, entries must not replicate large sections of code from earlier stages. Of course you must grab each stage individually, but that is just as true for pipe-based solutions. I have just now also added the previously-mentioned-but-not-actually-on-rc extra.e code to the Phix entry, so you can use pipes (and cross-test) if you really want to. --[[User:Petelomax|Pete Lomax]] ([[User talk:Petelomax|talk]]) 13:06, 23 September 2020 (UTC)
 
==Token names==
 
: Thanks! :) I tried to be clever and dynamically construct a single regex (with one branch per token) to act as the scanner, since it's safe to assume that the Perl regex engine is more bug-free and better optimized than a <code>substr</code>-based scanner that I could have written by hand. But then I realized that there's no easy way to get the line and column number of a regex match, so I had to scan and accumulate those separately, which introduced overhead again. I wonder if the approach was still worth it, performance-wise. Not that a solution in an interpreted language like Perl could ever compete with the C solution, but it might be interesting to benchmark it against the Python solution for large input files... --[[User:Smls|Smls]] ([[User talk:Smls|talk]]) 17:06, 18 August 2016 (UTC)
 
It's easy to get line and column numbers out of a regex. See the Alternate.
--[[User:Tybalt89|Tybalt89]] ([[User talk:Tybalt89|talk]]) 14:10, 24 May 2018 (UTC)
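
For readers following this thread, here is a minimal sketch of the single-regex idea (one named branch per token) with line and column tracking done in the same pass. It is written in Python rather than Perl, it is not the code of either entry discussed above, and the token table is a small hypothetical subset chosen only for illustration.

```python
import re

# Hypothetical subset of tokens, for illustration only.
TOKEN_SPEC = [
    ("Integer",    r"\d+"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("String",     r'"[^"\n]*"'),
    ("Op_less",    r"<"),
    ("Op_assign",  r"="),
    ("Newline",    r"\n"),
    ("Skip",       r"[ \t]+"),
    ("Mismatch",   r"."),
]

# One alternation with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (line, column, kind, lexeme) tuples; line and column are 1-based."""
    line = 1
    line_start = 0  # index in text where the current line begins
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "Newline":
            line += 1
            line_start = match.end()
        elif kind == "Skip":
            continue
        elif kind == "Mismatch":
            column = match.start() - line_start + 1
            raise SyntaxError(f"unexpected {match.group()!r} at line {line}, column {column}")
        else:
            yield line, match.start() - line_start + 1, kind, match.group()
```

The column is derived from the match offset minus the offset of the most recent newline, so no second scan of the input is needed. Whether this is faster than a hand-written scanner would still have to be measured.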
 
== Simple benchmark ==
 
: Indeed, I would have expected a larger difference between C and the interpreted languages. --[[User:Smls|Smls]] ([[User talk:Smls|talk]]) 13:23, 19 August 2016 (UTC)
 
==Future==
 
My goal is to add the following related tasks:
 
;Syntax Analysis
: this is basically a parser that outputs a textual parse tree
;Code Generation
: code generation for a simple stack-based virtual machine - outputs stack vm assembly code
;Virtual Machine
: virtual machine code interpreter - interprets the vm assembly code
 
I have already implemented all 3 of these in C and Python. I can do something like:
 
Given the following program:
 
 count = 1;
 while (count < 10) {
     print("count is: ", count, "\n");
     count = count + 1;
 }
 
Running:
 
lex count.t | parse
 
Will output a parse tree in textual format:
 
Sequence
Sequence
;
Assign
Identifier count
Integer 1
While
Less
Identifier count
Integer 10
Sequence
Sequence
;
Sequence
Sequence
Sequence
;
Prts
String "count is: "
;
Prti
Identifier count
;
Prts
String "\n"
;
Assign
Identifier count
Add
Identifier count
Integer 1
 
Running:
 
lex count.t | parse | gen
 
Will output the following virtual assembly code:
 
Datasize: 1 Strings: 2
"count is: "
"\n"
0 push 1
5 store [0]
10 fetch [0]
15 push 10
20 lt
21 jz (43) 65
26 push 0
31 prts
32 fetch [0]
37 prti
38 push 1
43 prts
44 fetch [0]
49 push 1
54 add
55 store [0]
60 jmp (-51) 10
65 halt
 
Running:
 
lex count.t | parse | gen | vm
 
Will output the following:
 
count is: 1
count is: 2
count is: 3
count is: 4
count is: 5
count is: 6
count is: 7
count is: 8
count is: 9
 
And, I can mix and match - I can use the C lexer, the Python parser, the C code generator, and the Python vm, if that makes sense.
 
I've already started the write-ups for these. I'll make them draft tasks. '''Question:''' What is the protocol, e.g., is it acceptable to post very-rough draft work, in order to solicit feedback? Or should I wait until I have it specified more clearly?
 
--[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 15:01, 13 September 2016 (UTC)
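
To make the virtual machine stage above more concrete, here is a minimal sketch in Python of an interpreter for the assembly listing shown in this section. It is not the C or Python implementation mentioned above. It handles only the opcodes that appear in that listing, and it assumes the text layout shown there: a header line, the string literals, then one instruction per line with its byte offset first. The names <code>run_vm</code> and <code>LISTING</code> are made up for this sketch.

```python
import re

# The listing from this section, kept raw so that "\n" stays a two-character escape.
LISTING = r"""Datasize: 1 Strings: 2
"count is: "
"\n"
0 push 1
5 store [0]
10 fetch [0]
15 push 10
20 lt
21 jz (43) 65
26 push 0
31 prts
32 fetch [0]
37 prti
38 push 1
43 prts
44 fetch [0]
49 push 1
54 add
55 store [0]
60 jmp (-51) 10
65 halt
"""


def run_vm(listing):
    """Interpret the listing and return everything it prints as one string."""
    lines = [ln.strip() for ln in listing.strip().splitlines()]
    header = re.fullmatch(r"Datasize:\s*(\d+)\s+Strings:\s*(\d+)", lines[0])
    data_size, n_strings = int(header.group(1)), int(header.group(2))

    # Drop the surrounding quotes and decode the \n and \\ escapes.
    def decode(quoted):
        return re.sub(r"\\(.)", lambda m: "\n" if m.group(1) == "n" else m.group(1), quoted[1:-1])

    strings = [decode(ln) for ln in lines[1:1 + n_strings]]

    code = []  # (byte offset, opcode, operand or None)
    for ln in lines[1 + n_strings:]:
        parts = ln.split()
        # Jumps are written "(relative) absolute"; the last field is the absolute target.
        operand = int(parts[-1].strip("[]")) if len(parts) > 2 else None
        code.append((int(parts[0]), parts[1], operand))
    index_of = {offset: i for i, (offset, _, _) in enumerate(code)}

    data, stack, out, pc = [0] * data_size, [], [], 0
    while True:
        _, op, arg = code[pc]
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op == "fetch":
            stack.append(data[arg])
        elif op == "store":
            data[arg] = stack.pop()
        elif op in ("add", "lt"):
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if op == "add" else int(left < right))
        elif op == "jz":
            if stack.pop() == 0:
                pc = index_of[arg]
        elif op == "jmp":
            pc = index_of[arg]
        elif op == "prts":
            out.append(strings[stack.pop()])
        elif op == "prti":
            out.append(str(stack.pop()))
        elif op == "halt":
            return "".join(out)
        else:
            raise ValueError(f"unsupported opcode: {op}")
```

Calling <code>print(run_vm(LISTING), end="")</code> prints the nine "count is:" lines shown above. Because the sketch reads plain text, it could in principle sit at the end of a pipe after any code generator that emits the same format.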
 
==Status==
I vetted this on 2 programming forums/mailing lists, and 2 compiler-specific forums. Got feedback, and got two additional solutions! I believe it is ready to go!
--[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 10:55, 22 October 2016 (UTC)
==Error Test Cases==
I think the set of programs used to test qualifying entries should also include failure cases, such as: end of line in a string, end of file in a string, end of file in a comment, a quoted character with two or more characters specified, an empty quoted character, a number containing a non-digit, a number over the maximum possible, et cetera. Both the C and Java versions have code for testing over the maximum number, and neither worked correctly for me. One way to check: turn the maximum integer into a string; the input integer is not valid if its digit string is longer than that string, or is the same length and compares greater.
--[[User:Jwells1213|Jwells1213]] ([[User talk:Jwells1213|talk]]) 18:11, 21 May 2022 (UTC)
 
: This seems valid.
 
: Perhaps you could propose the examples you used (add them to the task)? --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 18:57, 21 May 2022 (UTC)
 
: re: C version ''testing over the maximum number'' - I could not reproduce this. Can you tell me how you compiled the C version (OS, compiler, options, compiler version, 32 or 64 bit) and what numbers you tried? In my tests, it failed as I expected it to. --[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 10:46, 22 May 2022 (UTC)
 
: I agree, this seems like a good idea. Perhaps under '''Additional examples''' you could add a link to a page on negative examples, or examples that should fail. --[[User:Ed Davis|Ed Davis]] ([[User talk:Ed Davis|talk]]) 10:46, 22 May 2022 (UTC)
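
The string-based range check proposed at the start of this section can be sketched as follows. This is an illustration in Python, not the code of the C or Java entries. The limit <code>MAX_INT</code> is an assumed 32-bit value, and stripping leading zeros before comparing is an added assumption about how such input should be treated.

```python
MAX_INT = 2**31 - 1  # assumed limit, for illustration only


def fits_in_max(digits, max_value=MAX_INT):
    """Return True if the decimal digit string is <= max_value, comparing as text."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a digit string: {digits!r}")
    significant = digits.lstrip("0") or "0"  # leading zeros do not change the value
    limit = str(max_value)
    if len(significant) != len(limit):
        # A shorter digit string is always smaller, a longer one always larger.
        return len(significant) < len(limit)
    # Same length: lexicographic order matches numeric order.
    return significant <= limit
```

For example, <code>fits_in_max("2147483647")</code> is true and <code>fits_in_max("2147483648")</code> is false. Comparing text avoids overflowing a fixed-width integer while the literal is still being scanned, which matters in C or Java more than in Python.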
 
==Output Clarification==
The example outputs use \n to display the newline within strings. However, some scanners place the actual character into the output. It should all be done one way, so that lexer output from the C entry, for example, would be valid input to every other language's syntax analyzer.
[[User:Jwells1213|Jwells1213]] ([[User talk:Jwells1213|talk]]) 18:31, 21 May 2022 (UTC)
 
: This sounds like a nice idea conceptually, but there are other aspects of the task which are not sufficiently standardized for that to work (like the pipe mechanism). Perhaps a more important goal would be to have the implementations be readable enough that making these adjustments would be straightforward to implement. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 19:01, 21 May 2022 (UTC)
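
As an illustration of the "one way" suggested above, here is a small Python sketch that writes a string token with the newline shown as \n and reads it back. It assumes only two escapes, \n and \\, and the function names are invented for this sketch. It is not taken from any entry.

```python
import re


def escape_for_output(value):
    """Render a string token on one line: backslash first, then newline."""
    return '"' + value.replace("\\", "\\\\").replace("\n", "\\n") + '"'


def unescape_from_output(text):
    """Invert escape_for_output for a quoted token read from lexer output."""
    body = text[1:-1]  # drop the surrounding quotes
    return re.sub(r"\\(.)", lambda m: "\n" if m.group(1) == "n" else m.group(1), body)
```

Escaping the backslash before the newline keeps the two cases distinguishable, so a literal backslash followed by n in the source does not come back as a newline.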