Lexical Analyzer
----------------

From Wikipedia: (https://en.wikipedia.org/wiki/Lexical_analysis)

Lexical analysis is the process of converting a sequence of characters (such as in a
computer program or web page) into a sequence of tokens (strings with an identified
"meaning"). A program that performs lexical analysis may be called a lexer, tokenizer,[1]
or scanner (though "scanner" is also used to refer to the first stage of a lexer).

The Task
--------

Create a lexical analyzer for the Tiny programming language.

Specification
-------------

{| class="wikitable"
|-
! Characters !! Regular expression !! Name
|-
| integers || [0-9]+ || Integer
|-
| char literal || 'x' || Integer
|-
| identifiers || [_a-zA-Z][_a-zA-Z0-9]* || Ident
|-
| string literal || ".*" || String
|}
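
The first two rows of the table reduce to plain longest-match loops. Below is a
minimal sketch of one shape they might take in Python, assuming the source is held
in a string s and pos indexes the current character (the function name and the
(token, position) return shape are illustrative, not part of the task):

def scan_number_or_name(s, pos):
    # Integer: [0-9]+  (longest match)
    if s[pos].isdigit():
        start = pos
        while pos < len(s) and s[pos].isdigit():
            pos += 1
        return ("Integer", int(s[start:pos])), pos
    # Ident: [_a-zA-Z][_a-zA-Z0-9]*  (longest match; str.isalpha/isalnum accept
    # more than ASCII, so a stricter scanner would test the ranges explicitly)
    if s[pos] == '_' or s[pos].isalpha():
        start = pos
        while pos < len(s) and (s[pos] == '_' or s[pos].isalnum()):
            pos += 1
        return ("Ident", s[start:pos]), pos
    return None, pos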


Notes: For char literals, '\n' is supported as a newline
character. To represent a backslash, use '\\'. \n may also be
used in strings to print a newline. No other escape sequences
are supported.
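
Continuing the same assumptions, a sketch of how the char-literal and string-literal
rules and their diagnostics might fit together (error() is a placeholder that reports
the problem and stops; end-of-file checks inside the char literal are elided for
brevity, and the exact wording of the messages is not mandated here):

def scan_char_literal(s, pos):
    # pos is at the opening quote; a char literal yields an Integer token.
    pos += 1
    if s[pos] == "'":
        error("Empty character constant")
    if s[pos] == '\\':                      # only \n and \\ are recognized
        if s[pos + 1] == 'n':
            value, pos = ord('\n'), pos + 2
        elif s[pos + 1] == '\\':
            value, pos = ord('\\'), pos + 2
        else:
            error("Unknown escape sequence")
    else:
        value, pos = ord(s[pos]), pos + 1
    if s[pos] != "'":
        error("Multi-character constant")
    return ("Integer", value), pos + 1

def scan_string_literal(s, pos):
    # pos is at the opening quote; the token keeps the raw lexeme, so the two
    # characters \ and n stay as written and only become a real newline when printed.
    start = pos
    pos += 1
    while True:
        if pos >= len(s):
            error("End-of-file while scanning string literal")
        if s[pos] == '\n':
            error("End-of-line while scanning string literal")
        if s[pos] == '"':
            return ("String", s[start:pos + 1]), pos + 1
        pos += 1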

operators:

'*' multiply Mul
'/' divide Div
'+' plus Add
'-' minus and unary minus Sub and Uminus
'<' less than Lss
'<=' less than or equal Leq
'>' greater than Gtr
'!=' not equal Neq
'=' assign Assign
'&&' and And
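
Because '<' and '<=' share a prefix, and '!' and '&' are only legal as the start of
'!=' and '&&', scanning the operators takes one character of lookahead. A sketch of
that dispatch, with follow() as an illustrative helper (telling Sub from Uminus needs
knowledge of the previous token rather than lookahead, so that decision is not shown):

def follow(s, pos, expect, if_yes, if_no):
    # One character of lookahead: prefer the two-character operator.
    if pos + 1 < len(s) and s[pos + 1] == expect:
        return if_yes, pos + 2
    if if_no is None:                       # a lone '!' or '&' is not a token
        error("Unrecognized character")
    return if_no, pos + 1

# '<' -> follow(s, pos, '=', "Leq", "Lss")
# '!' -> follow(s, pos, '=', "Neq", None)
# '&' -> follow(s, pos, '&', "And", None)
# '>' -> "Gtr"    '=' -> "Assign"    '*' -> "Mul"    '+' -> "Add"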

symbols:

'(' left parenthesis Lparen
')' right parenthesis Rparen
'{' left brace Lbrace
'}' right brace Rbrace
';' semicolon Semi
',' comma Comma

keywords:

"if" If
"while" While
"print" Print
"putc" Putc

comments: /* ... */ (multi-line)
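
Comments are skipped rather than returned as tokens. Since '/' on its own is Div, the
scanner peeks at the next character before deciding; the only other subtlety is the
unterminated-comment diagnostic. A sketch, assuming pos sits on the '/' of '/*':

def skip_comment(s, pos):
    pos += 2                                # step over '/*'
    while True:
        if pos + 1 >= len(s):               # no room left for '*/'
            error("End-of-file in comment")
        if s[pos] == '*' and s[pos + 1] == '/':
            return pos + 2                  # step over '*/'
        pos += 1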

Complete list of token types:

EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add,
Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integer, String, Ident

Output of the program should be the line and column where each
token starts, followed by the token name. For Integer, Ident,
and String tokens, the integer value, identifier, or string
should follow.
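
The reported position is that of the token's first character, so the scanner records
the line and column before consuming anything and updates them as characters go by.
A sketch of that bookkeeping and of the print format used in the test cases below
(the helper names are assumptions, not part of the task):

def advance(ch, line, col):
    # Position of the next character after consuming ch.
    return (line + 1, 1) if ch == '\n' else (line, col + 1)

def emit(line, col, name, value=None):
    # e.g. "line 4 col 1 Print" or "line 1 col 18 Integer 142857"
    print("line %d col %d %s" % (line, col, name) +
          ("" if value is None else " %s" % value))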


Test Cases
----------

/*
Hello world
*/
print("Hello, World!\n");

Output
------

line 4 col 1 Print
line 4 col 6 Lparen
line 4 col 7 String "Hello, World!\n"
line 4 col 24 Rparen
line 4 col 25 Semi
line 5 col 1 EOI

/*
Show Ident and Integers
*/
phoenix_number = 142857;
print(phoenix_number, "\n");

Output
------

line 1 col 1 Ident phoenix_number
line 1 col 16 Assign
line 1 col 18 Integer 142857
line 1 col 24 Semi
line 2 col 1 Print
line 2 col 6 Lparen
line 2 col 7 Ident phoenix_number
line 2 col 21 Comma
line 2 col 23 String "\n"
line 2 col 27 Rparen
line 2 col 28 Semi
line 3 col 1 EOI

Diagnostics
-----------
The following error conditions should be caught:

Empty character constant. Example: ''
Unknown escape sequence. Example: '\r'
Multi-character constant. Example: 'xx'
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character. Example: |
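
How these conditions are reported is left to the implementation; one common convention
is to print a message, with the position where scanning stopped when it is known, and
abort, since error recovery is not required. A minimal sketch of that convention,
giving one possible body for the error() placeholder used in the sketches above:

import sys

def error(msg, line=None, col=None):
    where = "" if line is None else " at line %d col %d" % (line, col)
    print("error%s: %s" % (where, msg), file=sys.stderr)
    sys.exit(1)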

Refer additional questions to the C and Python implementations.
