Compiler/Preprocessor
This task modifies the source code prior to lexical analysis, in a manner similar to the C preprocessor.
Create a preprocessor for the simple programming language specified in the lexical analysis task referenced below. The program should read input from a file and/or stdin, and write output to a file and/or stdout.
The program should treat any line starting with a hashtag (#) as a command to process. There are currently two valid commands, include and define. There must be no space between the hashtag and its command. A run of multiple whitespace characters is treated the same as a single one.
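As a rough illustration of the rule above, the following Python sketch recognises a command line. It is not one of the page's solutions; the names `DIRECTIVE` and `split_directive` are made up for this example, and it assumes the command name must be followed by at least one whitespace character.

```python
import re

# Assumption: a command is '#' in column one, immediately followed by the
# command name, then at least one whitespace character.
DIRECTIVE = re.compile(r"#(include|define)\s+(.*)")

def split_directive(line):
    """Return (command, rest) for a command line, or None for ordinary text."""
    m = DIRECTIVE.match(line)
    if m is None:
        return None
    # Runs of whitespace after the command collapse; the rest is trimmed.
    return m.group(1), m.group(2).strip()
```

Lines that merely contain a hashtag later on, such as a macro usage in the middle of a statement, are returned as `None` and left for the substitution step.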
The include command must be followed by whitespace and a string whose contents name the file to read. Included files may themselves include other files, up to a limit of five active header files plus the original source file.
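One way to enforce the nesting limit is to carry a depth counter through a recursive expansion. The Python sketch below is only an illustration under stated assumptions: `read_file` is a hypothetical callable that maps a file name to its text (a dictionary lookup works for testing), and `expand_includes` and `MAX_ACTIVE_HEADERS` are names invented here.

```python
MAX_ACTIVE_HEADERS = 5  # from the task text: five headers plus the source file

def expand_includes(name, read_file, depth=0):
    """Yield the lines of `name`, splicing in the lines of included files.

    `read_file` is a hypothetical callable: file name -> file text.
    `depth` is the number of header files currently open.
    """
    for line in read_file(name).splitlines():
        words = line.split(None, 1)
        if line.startswith("#") and words[0] == "#include":
            target = words[1].strip() if len(words) > 1 else ""
            if len(target) < 3 or target[0] != '"' or target[-1] != '"':
                raise ValueError("#include needs a quoted file name: " + line)
            if depth == MAX_ACTIVE_HEADERS:
                raise RecursionError("more than five active header files")
            # Recurse with one more header active.
            yield from expand_includes(target[1:-1], read_file, depth + 1)
        else:
            yield line
```

Because this is a generator, errors surface only when the result is consumed, for example with `list(expand_includes("Source.t", reader))`.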
The define command must be followed by whitespace and a new macro name. Redefinition is illegal. The same character convention for naming variables in the language is used for macro names. No whitespace is required in the arguments but is allowed between every token. When there are no parameters, the definition and the usage must both have an empty argument list or both omit it. The empty list is required to avoid confusion when the definition needs parentheses to force precedence. If there is a close parenthesis, the whitespace trailing it is optional; otherwise, whitespace after the macro name is required. From that point to the end of the line is the definition, with whitespace removed from both its start and end. Whitespace within the definition between tokens must be maintained. Any names within the definition are treated as macro names first, before they are assumed to be variables in the language during the usage.
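The parsing rules for a define can be sketched in Python as follows. This is an illustration, not a reference implementation: `parse_define`, `IDENT` and `DEFINE` are names invented here, identifiers are assumed to match `[A-Za-z_][A-Za-z_0-9]*`, and an opening parenthesis after the name (with or without whitespace before it) is assumed to always start a parameter list.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z_0-9]*"  # assumed identifier convention
# name, optional "(params)", then the rest of the line
DEFINE = re.compile(r"(%s)(?:\s*\(([^)]*)\))?(.*)" % IDENT)

def parse_define(rest):
    """Split the text after '#define' into (name, params, definition).

    params is None when no argument list was written, [] for '()',
    otherwise a list of parameter names.
    """
    m = DEFINE.fullmatch(rest.strip())
    if m is None:
        raise ValueError("missing or invalid macro name")
    name, arglist, tail = m.groups()
    if arglist is None:
        params = None
        # Without a close parenthesis, whitespace must follow the name.
        if tail and not tail[0].isspace():
            raise ValueError("whitespace required after macro name")
    else:
        params = [p.strip() for p in arglist.split(",")] if arglist.strip() else []
        for p in params:
            if re.fullmatch(IDENT, p) is None:
                raise ValueError("invalid parameter name: " + p)
    # Outer whitespace is trimmed; inner whitespace is kept as written.
    return name, params, tail.strip()
```

Distinguishing `None` from `[]` keeps the "empty list or no list" rule checkable when the macro is later used.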
To make it easier to find, a macro usage is enclosed within hashtags, and it is replaced by its expansion wherever it appears in the files processed. These usages will be processed everywhere they are encountered without regard to the syntax of the sample language. The calling arguments replace the define's parameters as a simple string substitution. You may not assume the usage proceeds in an order to form complex combinations. Tokens detected during definition processing can remain separated during usage processing. If the contents within the double hashtags are not a valid macro usage, the entire text is written to the output as if it was not detected. It is not required to use the ending hashtag as the start of another macro usage, as we are going for as simple an example as possible to show the concept.
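The substitution step might look like the Python sketch below. It is a simplified illustration: `expand_usages` and the `macros` table layout (name mapped to a `(params, definition)` pair, with `params` being `None` when no argument list was defined) are assumptions made for this example, arguments are split on every comma, and names are resolved against parameterless macros in a single, non-recursive pass.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
USAGE = re.compile(r"#([^#\n]*)#")            # text between two hashtags
CALL = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*)\s*(?:\((.*)\))?\s*")

def expand_usages(line, macros):
    """Replace each valid #usage# in `line`; leave anything else untouched."""
    def resolve_names(text):
        # Names are treated as macro names first (parameterless macros only).
        def swap(t):
            entry = macros.get(t.group(0))
            return entry[1] if entry and entry[0] is None else t.group(0)
        return IDENT.sub(swap, text)

    def replace(m):
        call = CALL.fullmatch(m.group(1))
        if call is None or call.group(1) not in macros:
            return m.group(0)                  # not a usage: emit as written
        params, body = macros[call.group(1)]
        raw = call.group(2)
        if raw is None:
            given = None
        else:
            given = [a.strip() for a in raw.split(",")] if raw.strip() else []
        # Definition and usage must agree on having a list and on its length.
        if (params is None) != (given is None):
            return m.group(0)
        if params is not None and len(params) != len(given):
            return m.group(0)
        binding = dict(zip(params or [], given or []))
        body = IDENT.sub(lambda t: binding.get(t.group(0), t.group(0)), body)
        return resolve_names(body)

    return USAGE.sub(replace, line)
```

With the example macros from this task, `expand_usages("area = #area(height, width)#;", macros)` yields `area = 6 * 5;`, while an unknown or malformed usage is passed through unchanged.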
There are three possible command line arguments: debug, input, and output. Debug is an implementer-dependent switch, such as -d or --debug, that lets the user choose between the commands vanishing from the output or appearing as comments in the output. Debug can be specified in any position on the command line after the command name. Input is the file to process; when it is missing, console input is used. The input is always specified before the output. Output is the file to create; when it is missing, console output is used. If only one file is specified, it is the input file. If you wish to use an output file with console input, you must specify both arguments; how console input is denoted is left up to the implementer.
This is an example usage of this concept. Given a header and a source file, the output should be able to feed straight into the lexical analyzer task.
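The argument rules can be sketched in Python as below. The function name `parse_args` is invented for this example, and the use of `-` to stand for the console is purely an assumption, since the task leaves that convention to the implementer.

```python
def parse_args(argv):
    """Return (debug, input_name, output_name); None means the console.

    Assumption: '-' stands for the console, so console input combined with
    an output file is written as '- out.t'.
    """
    debug = False
    files = []
    for arg in argv:
        if arg in ("-d", "--debug"):   # the switch may appear anywhere
            debug = True
        else:
            files.append(arg)
    if len(files) > 2:
        raise ValueError("expected at most an input and an output file")
    files += ["-"] * (2 - len(files))  # missing names default to the console
    source, target = (None if f == "-" else f for f in files)
    return debug, source, target
```

A caller would typically pass `sys.argv[1:]`, so that the command name itself is excluded.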
~~ Header.h ~~
#define area(h, w) h * w

~~ Source.t ~~
#include "Header.h"
#define width 5
#define height 6
area = #area(height, width)#;
If you do not support a runtime debugging flag, your code should support only the second version. Otherwise, it should provide either. Yielding code output of:
area = 6 * 5;
Or:
/* Include Header.h */
/* Define area(h, w) as h * w */
/* End Header.h */
/* Define width as 5 */
/* Define height as 6 */
/* Use area, height, and width */
area = 6 * 5;
- Related Tasks
- Lexical Analyzer task
- Syntax Analyzer task
- Code Generator task
- Virtual Machine Interpreter task
- AST Interpreter task
Phix
<lang phix>--
-- demo\rosetta\Compiler\preprocess.exw
-- ====================================
--
--  Note this uses js_open() and js_gets() directly, to avoid distributing another two files.
--  Also implemented as a standalone demonstration of the general approach, and as such
--  might require a bit more work to integrate this properly into the likes of next_ch(),
--  unless of course you write it out to disk and/or add some kind of js_write() function.
--  Also as noted this won't cope particularly well with #macro("1st,first","2nd")#, etc.
--  In other words splitting up the parameters may need to be made significantly smarter.
--
with javascript_semantics
include core.e -- (see Compiler/lexical_analyzer#Phix - specifically js_io.e's Source.t)

sequence stack, includes, defines, arglst, bodies
integer stack_ptr

procedure begin(string filename)
    -- (to allow with and without comments, sequentially, and
    --  specifically not moaning about things being redefined)
    stack = repeat(0,5)         -- (limited as per task description)
    includes = repeat("?",5)    --              ""
    defines = {}    -- eg "area(h, w) h * w" -> "area"
    arglst = {}     -- -1 if () absent, else eg {"h","w"}
    bodies = {}     -- eg "area(h, w) h * w" -> {1," * ",2}
    stack_ptr = 1
    stack[stack_ptr] = js_open(filename)
end procedure

function get_word(string line, integer k=1)
    string word = ""
    for ch in line[k..$] do
        if not find(charmap[ch],{LETTER,DIGIT}) then exit end if
        word &= ch
    end for
    return word
end function

function preprocess(string fragment, bool comments)
    string word = get_word(fragment)
    integer k = find(word,defines)
    assert(k!=0,"no such macro:%s",{word})
    sequence used = {word},
             body = deep_copy(bodies[k])
    fragment = fragment[length(word)+1..$]
    object args = arglst[k]
    if sequence(args) then
        assert(fragment[1]='(' and fragment[$]=')')
        fragment = fragment[2..$-1]
        // NB: won't cope with eg #macro("1st,first","2nd")#, etc.
        sequence params = apply(split(fragment,','),trim)
        assert(length(params)==length(args))
        for i=1 to length(body) do
            if integer(body[i]) then
                word = params[body[i]]
                k = find(word,defines)
                if k then
                    // (this /might/ want to be recursive...)
                    used = append(used,word)
                    assert(atom(arglst[k])) // placeholder
                    word = join(bodies[k],"")
                end if
                body[i] = word
            end if
        end for
    else
        assert(fragment="")
    end if
    if comments then
        printf(1,"/* Use %s */\n",{join(used,", ",", and ")})
    end if
    string replacement = join(body,"")
    return replacement
end function

for comments in {false,true} do
    printf(1,"with%s comments:\n",{iff(comments?"":"out")})
    begin("Source.t")
    while stack_ptr do
        object oneline = js_gets(stack[stack_ptr])
        if oneline=EOF then
            if comments and stack_ptr>1 then
                printf(1,"/* End %s */\n",{includes[stack_ptr]})
            end if
            stack_ptr -= 1
        else
            integer k = find('#',oneline)
            if k then
                string word = get_word(oneline,k+1)
                if word="include" then
                    stack_ptr += 1
                    assert(k=1)
                    -- 10 is length("#include ")+1
                    oneline = trim(oneline[10..$],` "`)
                    stack[stack_ptr] = js_open(oneline)
                    if comments then
                        printf(1,"/* Include %s */\n",{oneline})
                        includes[stack_ptr] = oneline
                    end if
                elsif word="define" then
                    assert(k=1)
                    -- 9 is length("#define ")+1
                    word = get_word(oneline,9)
                    assert(not find(word,defines))
                    defines = append(defines,word)
                    oneline = trim(oneline[9+length(word)..$])
                    sequence body = {}
                    if oneline[1]='(' then
                        k = find(')',oneline,2)
                        assert(k>0,"closing parenthesis missing")
                        sequence args = apply(split(oneline[2..k-1],','),trim)
                        oneline = trim(oneline[k+1..$])
                        string fixed = ""
                        while length(oneline) do
                            word = get_word(oneline)
                            if length(word)=0 then
                                fixed &= oneline[1]
                                oneline = oneline[2..$]
                            else
                                k = find(word,args)
                                if k then
                                    if length(fixed) then
                                        body = append(body,fixed)
                                        fixed = ""
                                    end if
                                    body = append(body,k)
                                else
                                    fixed &= word
                                end if
                                oneline = oneline[length(word)+1..$]
                            end if
                        end while
                        if length(fixed) then
                            body = append(body,fixed)
                            fixed = ""
                        end if
                        arglst = append(arglst,args)
                    else
                        body = {oneline}
                        arglst = append(arglst,-1)
                    end if
                    bodies = append(bodies,body)
                    if comments then
                        object al = arglst[$]
                        string n = defines[$],
                               a = iff(atom(al)?"":sprintf("(%s)",{join(al,',')}))
                        sequence b = deep_copy(bodies[$])
                        for i=1 to length(b) do
                            if atom(b[i]) then
                                b[i] = al[b[i]]
                            end if
                        end for
                        b = join(b,"")
                        printf(1,"/* Define %s%s as %s */\n",{n,a,b})
                    end if
                else
                    while k do
                        integer l = find('#',oneline,k+1)
                        assert(l!=0,"missing closing #")
                        string fragment = oneline[k+1..l-1],
                               replacement = preprocess(fragment,comments)
                        oneline[k..l] = replacement
                        k = find('#',oneline,k+length(replacement)-1)
                    end while
                    printf(1,"%s\n",{oneline})
                end if
            else
                printf(1,"%s\n",{oneline})
            end if
        end if
    end while
    printf(1,"\n")
end for
--close_files()</lang>
- Output:
without comments:
area = 6 * 5;

with comments:
/* Include Header.h */
/* Define area(h,w) as h * w */
/* End Header.h */
/* Define width as 5 */
/* Define height as 6 */
/* Use area, height, and width */
area = 6 * 5;
Wren
A fairly naive solution compared to the complexities of a modern C pre-processor.
I've made the simplifying assumption that macro parameters in a macro definition will always be separated from other tokens by at least one space.
I've also assumed that the header files will always be actual files, and never entered from the console.
Note that the program errors out if there are any syntax or other errors when defining the macros.
<lang ecmascript>import "os" for Process
import "./ioutil" for FileUtil, File, Input
import "./str" for Char
import "./pattern" for Pattern
import "./seq" for Lst, Stack

var isIdentChar = Fn.new { |c| Char.isAsciiAlphaNum(c) || c == "_" }

var isIdent = Fn.new { |s|
    if (s == "") return false
    if (Char.isDigit(s[0])) return false
    return s.all { |c| isIdentChar.call(c) }
}

var clargs = Process.arguments
if (clargs.count > 3) {
    System.print("There can't be more than 3 command line arguments:\n  -d     // debug mode, comments will be included in output\n  input  // filename: if absent or == console, gets input from console\n  output // filename: if absent or == console, sends output to console")
    return
}
var debug = clargs.contains("-d") || clargs.contains("--debug")
if (debug) {
    clargs.remove("-d")
    clargs.remove("--debug")
}
var inputFileName = "console"
if (clargs.count > 0) inputFileName = clargs[0]
var lines
if (inputFileName != "console") {
    lines = FileUtil.readLines(inputFileName)
} else {
    var n = Input.integer("How many lines are to be entered? : ", 1)
    System.print("\nOK, enter the lines and press enter after each one.\n")
    lines = List.filled(n, null)
    for (i in 0...n) lines[i] = Input.text("")
    System.print()
}

var macros = []
var comments = []
var used = []
var includes = Stack.new()
var i = 0
while (i < lines.count) {
    var line = lines[i].trim()
    if (line == "" || !line.startsWith("#")) {
        i = i + 1
    } else if (line.startsWith("#include")) {
        var fname = line[8..-1].trimStart()
        if (fname.count < 3 || fname[0] != "\"" || fname[-1] != "\"") {
            Fiber.abort("'#include' directive must be followed by a non-empty string.")
        }
        var lines2 = FileUtil.readLines(fname[1..-2])
        if (includes.count == 5) {
            Fiber.abort("Can't have more than 5 active 'include' files.")
        } else {
            includes.push([fname, i + lines2.count - 1])
            if (debug) comments.add("/* Include Header %(fname) */")
        }
        lines = lines[0...i] + lines2 + lines[i+1..-1]
    } else if (line.startsWith("#define")) {
        line = line[7..-1].trimStart()
        if (line == "") Fiber.abort("Missing macro name.")
        var name = ""
        var j = 0
        while (j < line.count) {
            var c = line[j]
            if (isIdentChar.call(c)) name = name + c else break
            j = j + 1
        }
        if (name == "") Fiber.abort("Missing macro name.")
        if (!isIdent.call(name)) Fiber.abort("Macro name is not a valid identifier.")
        if (macros.any { |macro| macro[0] == name }) Fiber.abort("Macro '%(name)' cannot be redefined.")
        if (j == line.count) Fiber.abort("Missing macro definition.")
        var paramStr = ""
        var params = null
        if (line[j] == "(") {
            j = j + 1
            var k = line.indexOf(")", j)
            if (k == -1) Fiber.abort("Missing ')' in macro parameter list.")
            if (k == j) {
                params = []
            } else {
                paramStr = line[j...k]
                params = paramStr.split(",")
                params = params.map { |param| param.trim() }.toList
                if (!params.all { |param| isIdent.call(param) }) {
                    Fiber.abort("Macro parameter is not a valid identifier.")
                }
            }
            j = k + 1
        }
        if (j == line.count) Fiber.abort("Missing macro definition.")
        var defn = line[j..-1].trimStart()
        macros.add([name, params, defn])
        if (debug) {
            if (params == null) {
                comments.add("/* Define %(name) as %(defn) */")
            } else {
                comments.add("/* Define %(name)(%(params.toString[1...-1])) as %(defn) */")
            }
        }
        lines.removeAt(i)
    } else {
        Fiber.abort("Unknown directive.")
    }
    if (debug) {
        while (includes.count > 0 && i >= includes.peek()[1]) {
            comments.add("/* End %(includes.pop()[0]) */")
        }
    }
}
var src = lines.where { |line| line != "" }.join("\n")
for (macro in macros) {
    var name = macro[0]
    var params = macro[1]
    var defn = macro[2]
    var p
    if (params == null) {
        p = Pattern.new("/X[%(name)]~/X")
    } else if (params.count == 0) {
        p = Pattern.new("[/#%(name)()/#]")
    } else if (params.count > 0) {
        p = Pattern.new("[/#%(name)(+1^))/#]")
    }
    var m = null
    while (m = p.find(src)) {
        var span = m.captures[0].span
        if (params == null || params.count == 0) {
            src = src[0...span[0]] + defn + src[span[1]+1..-1]
            used.add(name)
        } else {
            var argStr = m.captures[0].text
            var ix1 = argStr.indexOf("(") + 1
            var ix2 = argStr.indexOf(")") - 1
            argStr = argStr[ix1..ix2]
            var args = argStr.split(",")
            if (args.count == params.count) {
                var temp = " " + defn + " "
                for (i in 0...args.count) {
                    temp = temp.replace(" " + params[i] + " ", " " + args[i].trim() + " ")
                }
                src = src[0...span[0]] + temp.trim() + src[span[1]+1..-1]
                used.add(name)
            }
        }
    }
}
if (debug) {
    while (includes.count > 0) {
        comments.add("/* End %(includes.pop()[0]) */")
    }
}
used = Lst.distinct(used)
if (used.count > 0) {
    var temp = (used.count == 1) ? used[0] : used[0..-2].join(", ") + " and " + used[-1]
    if (debug) comments.add("/* Used %(temp) */")
}
if (debug) comments = comments.join("\n")

var outputFileName = "console"
if (clargs.count > 1) outputFileName = clargs[1]

if (outputFileName == "console") {
    System.print("Output:\n")
    if (debug) System.print(comments)
    System.print(src)
} else {
    File.create(outputFileName) { |file|
        if (debug) {
            file.writeBytes(comments)
            file.writeBytes("\n")
        }
        file.writeBytes(src)
        file.writeBytes("\n")
    }
}</lang>
- Output:
Using the example files:
$ wren-cli compiler_preprocessor.wren -d
How many lines are to be entered? : 4

OK, enter the lines and press enter after each one.

#include "Header.h"
#define width 5
#define height 6
area = #area(height, width)#;

Output:

/* Include Header "Header.h" */
/* Define area(h, w) as h * w */
/* End "Header.h" */
/* Define width as 5 */
/* Define height as 6 */
/* Used area, width and height */
area = 6 * 5;
Or adding another header file to make the example slightly more interesting:
~~ Header.h ~~
#define area(h, w) h * w
#include "Header2.h"

~~ Header2.h ~~
#define depth 7
#define volume(h, w, d) h * w * d

~~ Source.t ~~
#include "Header.h"
#define width 5
#define height 6
area = #area(height, width)#;
volume = #volume(height, width, depth)#;
- Output:
$ wren-cli compiler_preprocessor.wren -d Source.t
Output:

/* Include Header "Header.h" */
/* Define area(h, w) as h * w */
/* Include Header "Header2.h" */
/* Define depth as 7 */
/* Define volume(h, w, d) as h * w * d */
/* End "Header2.h" */
/* End "Header.h" */
/* Define width as 5 */
/* Define height as 6 */
/* Used area, depth, volume, width and height */
area = 6 * 5;
volume = 6 * 5 * 7;