Writing an extremely simple parser

I'm writing a very basic web server that has to support an extremely limited special server side scripting language. Basically all I need to support is "echo", addition/subtraction/multiplication (no division) with only 2 operands, a simple "date()" function that outputs the date and the use of the "&" operator to concatenate strings.

An example could be:

echo "Here is the date: " & date();
echo "9 x 15 = " & 9*15;

I've gone through and created the code necessary to generate tokens, but I'm not sure I'm using the right tokens.

I created tokens for the following:

ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotations
DATE - The date() function
CONCAT - the & operator for concatenation
MATH - Any instance of binary operation (5+4, 9*2, 8-2, etc)
TERM - The terminal character (;)

The MATH one I am particularly unsure about. Typically I see people create a token specifically for integers and then for each operator as well, but since I ONLY want to allow binary operations, I thought it made sense to group it into one token. If I were to do everything separately, I would have to do some extra work to ensure that I never accepted "5+4+1".

So question 1 is am I on the right track with which tokens to use?

My next question is what do I do with these tokens next to ensure correct syntax? The approach I had thought of was to basically say, "Okay, I know I have this token. Here is a list of tokens that are allowed to come next based on the current token. Is the next token in the list?"

Based on that, I made a list of all of my tokens as well as what tokens are valid to appear directly after them (didn't include whitespace for simplicity).

ECHO        ->      STRING|MATH|DATE
STRING      ->      TERM|CONCAT
MATH        ->      TERM|CONCAT
DATE        ->      TERM|CONCAT
CONCAT      ->      STRING|MATH|DATE
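A minimal sketch of how that table could be represented as data and checked pairwise (the names here are placeholders, not settled design):

```python
# The successor table above as a mapping from each token type to the
# set of token types allowed to follow it (whitespace omitted).
FOLLOW = {
    "ECHO":   {"STRING", "MATH", "DATE"},
    "STRING": {"TERM", "CONCAT"},
    "MATH":   {"TERM", "CONCAT"},
    "DATE":   {"TERM", "CONCAT"},
    "CONCAT": {"STRING", "MATH", "DATE"},
}

def valid_sequence(token_types):
    # A statement must start with ECHO and end with TERM, and every
    # adjacent pair of tokens must be allowed by the table.
    if not token_types or token_types[0] != "ECHO" or token_types[-1] != "TERM":
        return False
    return all(b in FOLLOW[a] for a, b in zip(token_types, token_types[1:]))
```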

The problem is I'm not sure how best to implement this. Really I need to keep track of whitespace as well, to make sure there are spaces between the tokens, but that means looking ahead two tokens at a time, which is even more intimidating. I'm also not sure how to manage the "valid next tokens" logic without some disgusting section of if blocks. Should I check for valid syntax before trying to actually execute the script, or should I do it all at once and just throw an error when I reach an unexpected token? In this simple example everything will always parse fine left to right; there are no real precedence rules (except the MATH thing, which is part of why I combined it into one token even though it feels wrong). Even so, I wouldn't mind designing a more scalable and elegant solution.

In my research about writing parsers, I see a lot of references to creating "accept()" and "expect()" functions but I can't find any clear description of what they are supposed to do or how they are supposed to work.

I guess I'm just not sure how to implement this, and then how to actually come up with a resulting string at the end of the day.

Am I heading in the right direction and does anybody know of a resource that might help me understand how to best implement something simple like this? I am required to do it by hand and cannot use a tool like ANTLR.

Thanks in advance for any help.


The first thing you need to do is discard all whitespace (except the whitespace inside strings). That way, when you add tokens to the token list, you can be sure the list contains only meaningful tokens. For example, consider this statement:

echo "Here is the date: " & date();

I will start tokenizing and first separate echo based on the whitespace (yes, the whitespace is needed here as a separator, but it isn't useful after that). The tokenizer then encounters a double quote and keeps reading until the closing double quote is found. Similarly, I create separate tokens for &, date, and ().

My token list now contains the following tokens:

echo
"Here is the date: "
&
date
()
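As a concrete sketch of that tokenizing pass (the token names follow the question's list, and the regular expressions are my assumptions, not a spec):

```python
import re

# Ordered token patterns: whitespace is matched first so it can be
# skipped; strings are read whole so their internal spaces survive.
TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),
    ("ECHO",       r"echo\b"),
    ("STRING",     r"\"[^\"]*\""),
    ("DATE",       r"date\(\)"),
    ("MATH",       r"\d+\s*[+*-]\s*\d+"),
    ("CONCAT",     r"&"),
    ("TERM",       r";"),
]

def tokenize(source):
    """Return a list of (type, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "WHITESPACE":
                    tokens.append((name, match.group(0)))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"Unexpected character at {pos}: {source[pos]!r}")
    return tokens
```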

Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks whether it is valid (based on the rules / functions you have for the language). It advances to the next token and checks whether it is a date, string, or math token. Similarly, it checks the rest of the tokens. If at any point a token is not supposed to be there, you can throw an error indicating a syntax error.

For tokenizing math statements, combine only a sub-expression contained in brackets into a single token, and keep the remaining operands and operators separate. For example, 9*3 + (7-3+1) would produce the tokens 9, *, 3, +, and (7-3+1). Since every token has its own priority (which you define in the token struct), you can evaluate from the highest-priority token down to the lowest. This way you can have prioritized expressions. If you're still confused, let me know and I'll write you some example code.
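Since the question's language only allows two operands, no priority handling is actually needed; a MATH token can be evaluated directly. A sketch (function name and regex are mine, not from the post):

```python
import re

def eval_math(token):
    """Evaluate a two-operand MATH token such as "9*15".

    The language in the question allows only +, - and * with exactly
    two operands, so there is no precedence to worry about.
    """
    match = re.fullmatch(r"(\d+)\s*([+*-])\s*(\d+)", token)
    if match is None:
        raise SyntaxError(f"Malformed math expression: {token!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    return left * right
```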


expect is what your parser does to get the next token, failing if the token isn't a proper following token. To begin with, your parser expects ECHO or WHITESPACE; those are the only valid starting tokens. Having seen ECHO, your parser expects one of WHITESPACE|STRING|MATH|DATE; anything else is an error. And so on.

accept is when your parser has seen a complete "statement" - ECHO, followed by a valid sequence of tokens, followed by TERM. Your parser now has enough information to process your ECHO command.
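One common concrete realization of this pair, using slightly different (but conventional) semantics where accept is a conditional consume and expect is accept-or-fail; the class and method names are illustrative:

```python
class Parser:
    """Holds a list of (type, value) token pairs and a cursor."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Type of the next token, or None at end of input.
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def accept(self, token_type):
        # Consume the next token only if it matches; report success.
        if self.peek() == token_type:
            self.pos += 1
            return True
        return False

    def expect(self, token_type):
        # Like accept, but a mismatch is a syntax error.
        if not self.accept(token_type):
            raise SyntaxError(f"Expected {token_type}, got {self.peek()}")
        return self.tokens[self.pos - 1][1]
```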

Oh, and hand-written parsers (especially simple ones) are very often disgusting collections of if blocks (or moral equivalents like switch statements) :) Further up the line of elegance would be some kind of state machine, and further up from that a grammar generator like yacc or GOLD Parser Generator (which in turn churn out the ugly if blocks, switch statements, and state machines for you).

EDIT to provide more details.

To help sort out responsibilities, create a "lexer" whose job is to read the input and produce tokens. This involves deciding what tokens look like. An easy token is the word "echo". A less easy token is a math operation; that token would consist of one or more digits, an operator, and one or more digits, with no whitespace between. The lexer takes care of skipping whitespace, as well as understanding a quoted string and the characters that form the date() function. The lexer returns two things: the type of the token read and its value (e.g., "MATH" and "9*15").

With a lexer in hand to read your input, the parser consumes the tokens and ensures they're in a proper order. First you have to see the ECHO token. If not, fail with an error message. After that, you have to see STRING, DATE, or MATH. If not, fail with an error message. After that, you loop, watching for either TERM, or else CONCAT followed by another STRING, DATE, or MATH. If you see TERM, break the loop. If you see neither TERM nor CONCAT, fail with an error message.

You can process the ECHO command as you're parsing, since it's a simple grammar. Each time you find a STRING, DATE or MATH, evaluate it and concatenate it to what you already have. When you find TERM, exit the function and return the built-up string.
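Putting the last two paragraphs together, the parse-and-evaluate-in-one-pass loop might look like this sketch, assuming the lexer yields (type, value) pairs as described; all names here are illustrative:

```python
import re
from datetime import date

def run_echo(tokens):
    """Parse one echo statement and return the built-up output string."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(expected):
        # Consume a token of the expected type or fail with a syntax error.
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"Expected {expected}, got {peek()}")
        value = tokens[pos][1]
        pos += 1
        return value

    def value():
        # STRING | DATE | MATH, each evaluated to a string immediately.
        kind = peek()
        if kind == "STRING":
            return take("STRING").strip('"')
        if kind == "DATE":
            take("DATE")
            return date.today().isoformat()
        if kind == "MATH":
            a, op, b = re.fullmatch(r"(\d+)\s*([+*-])\s*(\d+)", take("MATH")).groups()
            results = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}
            return str(results[op])
        raise SyntaxError(f"Expected STRING, DATE or MATH, got {kind}")

    take("ECHO")
    result = value()
    while peek() == "CONCAT":          # CONCAT followed by another value
        take("CONCAT")
        result += value()
    take("TERM")                       # the closing ;
    return result
```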

Questions? Comments? Omelets? :)
