Regular expression for parsing similar assembler instructions
The intro is a bit lengthy, so please bear with me. :)
I am writing a simple regex-based parser for a large source file written in assembler. Most of these instructions are just moving, adding, subtracting and jumping around, but it's a pretty large file which I need to port to two different languages and I am too lazy to do it manually. That's the requirement and I can't do much 'bout it (so please don't answer stuff like "why don't you simply use ANTLR").
So, after I do some preprocessing (I already did this part: replaced defines and macros and stripped redundant whitespace and comments), I now basically have to read the file line by line and parse one or potentially more lines into "intermediate" instructions, which I will then use to generate more or less 1-to-1 equivalent (using actual integer arithmetics and a bunch of GOTOs).
So, presuming that I can have all these different addressing modes:
I can go two different ways:
My question is: If I have a single regex for all instructions, how should I specify my groups and captures to be able to simply differentiate between different modes?
Or do I simply catch everything and then process the source/destination address after the initial match?
Eg a rather simple match-all regex would be:
^MOVs+(?<dest>[^s,]+)[s,]*(?<src>[^s,]+)$
(Split into multiple lines with comments):
^MOV (?#instruction)
s+ (?#some whitespace)
(?<dest>[^s,]+) (?#match everything except whitespace and comma)
s*,s* (?#match comma, allow some whitespace)
(?<src>[^s,]+) (?#match everything except whitespace and comma)$
So, I can certainly do this and then process dest
and src
groups separately. But would it be better to create a nasty complex regex to match all cases from the table below? In that case I am not sure how I would interpret these captures to understand what addressing mode was matched.
I am using C#, if that makes any difference.
You are discovering what happens when you attempt to bring a lexer to a parser's job. Much of your difficulty I think is in trying to do too much with the regexes.
And yes, I'm going to suggest a parser like ANTLR or equivalent.
If you went that route, you'd write a whole lot of little regexps to identify tokens ("MOV", "#", "[", ...) and then you'd write a grammar that defined how these compose into instructions. If nothing else, this makes it a lot easier to simply write the parsing part.
You can see what an assembler code this looks like. (Uses a system other than ANTLR, but the ideas are the same). This was pretty straightforward to write, and there was no agony about trying to write the One Regex to Rule them All. [I did that example in an evening, and used it parse a rather large set of sources].
You were unclear on what "port" meant. Presumably you are going to another assembler syntax, if not another machine architecture. To do that well, you'll need access to various instruction parts (which a single regex for all possible MOV instructions won't give you). Here is the beauty of parsing and producing trees: all those parts are exposed to you, embedded in the structure in which they belong. You can even generate single instructions from multiple assembly language statements, because the tree holds the entire program. (Rather large doesn't mean much in terms of tree size on systems with a gigabyte of RAM).
Here's a regex that does pretty much what you want (you'll have to edit for the actual data forms; ie instead of all the register labels ax, bx, ... I just used 'reg', etc.)
(?<Opt1>MOVs*Rw,sRw)
|(?<Opt2>MOVs*Rw,s#data4)
|(?<Opt3>MOVs*Rw,s#data16)
|(?<Opt4>MOVs*Rw,s[Rw])
|(?<Opt5>MOVs*Rw,s[Rw+])
|(?<Opt6>MOVs*[Rw],sRw)
|(?<Opt7>MOVs*[-Rw],sRw)
|(?<Opt8>MOVs*[Rw],s[Rw])
|(?<Opt9>MOVs*[Rw+],s[Rw])
|(?<OptA>MOVs*[Rw],s[Rw+])
using this data:
MOV Rw, Rw
MOV Rw, #data4
MOV Rw, #data16
MOV Rw, [Rw]
MOV Rw, [Rw+]
MOV [Rw], Rw
MOV [-Rw], Rw
MOV [Rw], [Rw]
MOV [Rw+], [Rw]
MOV [Rw], [Rw+]
RegexBuddy generates this:
Match 1: MOV Rw, Rw 0 10
Group "Opt1": MOV Rw, Rw 0 10
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 2: MOV Rw, #data4 12 14
Group "Opt1" did not participate in the match
Group "Opt2": MOV Rw, #data4 12 14
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 3: MOV Rw, #data16 28 15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3": MOV Rw, #data16 28 15
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 4: MOV Rw, [Rw] 45 12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4": MOV Rw, [Rw] 45 12
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 5: MOV Rw, [Rw+] 59 13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5": MOV Rw, [Rw+] 59 13
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 6: MOV [Rw], Rw 74 12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6": MOV [Rw], Rw 74 12
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 7: MOV [-Rw], Rw 88 13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7": MOV [-Rw], Rw 88 13
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 8: MOV [Rw], [Rw] 103 14
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8": MOV [Rw], [Rw] 103 14
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 9: MOV [Rw+], [Rw] 119 15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9": MOV [Rw+], [Rw] 119 15
Group "OptA" did not participate in the match
Match 10: MOV [Rw], [Rw+] 136 15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA": MOV [Rw], [Rw+] 136 15
链接地址: http://www.djcxy.com/p/72438.html
下一篇: 正则表达式用于解析相似的汇编程序指令