Regular expression for parsing similar assembler instructions

The intro is a bit lengthy, so please bear with me. :)

I am writing a simple regex-based parser for a large source file written in assembler. Most of these instructions are just moving, adding, subtracting and jumping around, but it's a pretty large file which I need to port to two different languages and I am too lazy to do it manually. That's the requirement and I can't do much 'bout it (so please don't answer stuff like "why don't you simply use ANTLR").

So, after some preprocessing (this part is already done: I replaced defines and macros and stripped redundant whitespace and comments), I now basically have to read the file line by line and parse one or potentially more lines into "intermediate" instructions, which I will then use to generate a more or less 1-to-1 equivalent (using actual integer arithmetic and a bunch of GOTOs).

So, presuming that I can have all these different addressing modes:

The addressing mode depends on the format of the instruction

I can go two different ways:

  • Have a single MOV regex which will handle all these cases, or
  • Have multiple MOV regexes, one for each instruction type. The problem with this approach is that I would have to design each regex pretty carefully to avoid any ambiguity. AND it seems there would be lots of duplication, since source and destination operands share many of the addressing modes.
    My question is: If I have a single regex for all instructions, how should I specify my groups and captures to be able to simply differentiate between different modes?

    Or do I simply catch everything and then process the source/destination address after the initial match?

    E.g., a rather simple match-all regex would be:

    ^MOV\s+(?<dest>[^\s,]+)\s*,\s*(?<src>[^\s,]+)$
    

    (Split into multiple lines with comments):

    ^MOV              (?#instruction)
    \s+               (?#some whitespace)
    (?<dest>[^\s,]+)  (?#match everything except whitespace and comma)
    \s*,\s*           (?#match comma, allow some whitespace)
    (?<src>[^\s,]+)   (?#match everything except whitespace and comma)$
    

    So, I can certainly do this and then process the dest and src groups separately. But would it be better to create a nasty complex regex to match all cases from the table above? In that case I am not sure how I would interpret these captures to understand which addressing mode was matched.

    I am using C#, if that makes any difference.
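
    A minimal sketch of the "catch everything, then classify each operand" option, in C#. The register spelling (`R0`, `R1`, ...) and the `#value` immediate form are assumptions made for illustration, as are the names `Mode`, `MovParser`, `Classify` and `TryParse`; none of them come from the actual source file, so the operand patterns would need adjusting to the real syntax.

```csharp
using System;
using System.Text.RegularExpressions;

enum Mode { Register, Immediate, Indirect, IndirectPostInc, IndirectPreDec, Unknown }

static class MovParser
{
    // The match-all pattern: mnemonic, destination, comma, source.
    static readonly Regex Mov = new Regex(
        @"^MOV\s+(?<dest>[^\s,]+)\s*,\s*(?<src>[^\s,]+)$");

    // One small pattern per addressing mode, shared by source and destination.
    // Assumed operand spellings: R<n>, #<value>, [R<n>], [R<n>+], [-R<n>].
    static readonly (Mode Kind, Regex Pattern)[] OperandPatterns =
    {
        (Mode.Register,        new Regex(@"^R\d+$")),
        (Mode.Immediate,       new Regex(@"^#\w+$")),
        (Mode.Indirect,        new Regex(@"^\[R\d+\]$")),
        (Mode.IndirectPostInc, new Regex(@"^\[R\d+\+\]$")),
        (Mode.IndirectPreDec,  new Regex(@"^\[-R\d+\]$")),
    };

    // Returns the first addressing mode whose pattern matches the operand text.
    public static Mode Classify(string operand)
    {
        foreach (var (kind, pattern) in OperandPatterns)
            if (pattern.IsMatch(operand)) return kind;
        return Mode.Unknown;
    }

    // Splits the line with the generic regex, then classifies each side separately.
    public static bool TryParse(string line, out Mode dest, out Mode src)
    {
        dest = src = Mode.Unknown;
        Match m = Mov.Match(line);
        if (!m.Success) return false;
        dest = Classify(m.Groups["dest"].Value);
        src = Classify(m.Groups["src"].Value);
        return true;
    }
}

static class Program
{
    static void Main()
    {
        foreach (string line in new[] { "MOV R1, #42", "MOV [R2+], [R3]", "MOV [-R0], R5" })
            if (MovParser.TryParse(line, out Mode dest, out Mode src))
                Console.WriteLine($"{line}: dest={dest}, src={src}");
    }
}
```

    Because the operand patterns are shared, the source/destination duplication mentioned above disappears, and an illegal combination (say, an immediate destination) can be rejected with an ordinary `if` after classification instead of being encoded in the regex.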


    You are discovering what happens when you attempt to bring a lexer to a parser's job. Much of your difficulty I think is in trying to do too much with the regexes.

    And yes, I'm going to suggest a parser like ANTLR or equivalent.

    If you went that route, you'd write a whole lot of little regexps to identify tokens ("MOV", "#", "[", ...) and then you'd write a grammar that defined how these compose into instructions. If nothing else, this makes it a lot easier to simply write the parsing part.
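
    To give a feel for the "lots of little regexps" half without pulling in ANTLR, here is a rough hand-rolled tokenizer in C#. It is only a sketch: the token kinds (`Ident`, `Number`, `Punct`), the `Lexer` class and the decision to accept decimal numbers only are assumptions for illustration, and a real grammar (generated or hand-written) would sit on top of the token list it produces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Lexer
{
    // One alternative per token kind; the group name doubles as the token type.
    // \G anchors each match at the current scan position, so nothing is skipped silently.
    static readonly Regex Token = new Regex(
        @"\G\s*(?:(?<Ident>[A-Za-z_]\w*)|(?<Number>\d+)|(?<Punct>[#\[\],+\-]))");

    static readonly string[] Kinds = { "Ident", "Number", "Punct" };

    public static List<(string Kind, string Text)> Tokenize(string line)
    {
        var tokens = new List<(string Kind, string Text)>();
        string text = line.TrimEnd();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = Token.Match(text, pos);
            if (!m.Success)
                throw new FormatException($"Unexpected input near column {pos}: {line}");

            // Exactly one of the named groups participates in each match.
            foreach (string kind in Kinds)
                if (m.Groups[kind].Success)
                    tokens.Add((kind, m.Groups[kind].Value));

            pos = m.Index + m.Length;
        }
        return tokens;
    }
}

static class Program
{
    static void Main()
    {
        foreach (var (kind, text) in Lexer.Tokenize("MOV [R2+], #16"))
            Console.WriteLine($"{kind,-6} {text}");
    }
}
```

    With tokens in hand, an operand rule such as "`[` register optional-`+` `]`" becomes a few lines of ordinary code or one grammar production, rather than another branch in a growing regex.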

    You can see what an assembler parser coded this way looks like. (It uses a system other than ANTLR, but the ideas are the same.) This was pretty straightforward to write, and there was no agony about trying to write the One Regex to Rule Them All. [I did that example in an evening, and used it to parse a rather large set of sources.]

    You were unclear on what "port" meant. Presumably you are going to another assembler syntax, if not another machine architecture. To do that well, you'll need access to various instruction parts (which a single regex for all possible MOV instructions won't give you). Here is the beauty of parsing and producing trees: all those parts are exposed to you, embedded in the structure in which they belong. You can even generate single instructions from multiple assembly language statements, because the tree holds the entire program. (Rather large doesn't mean much in terms of tree size on systems with a gigabyte of RAM).


    Here's a regex that does pretty much what you want (you'll have to edit it for the actual data forms; i.e., instead of all the register labels ax, bx, ... I just used 'reg', etc.)

     (?<Opt1>MOV\s*Rw,\sRw)
    |(?<Opt2>MOV\s*Rw,\s\#data4)
    |(?<Opt3>MOV\s*Rw,\s\#data16)
    |(?<Opt4>MOV\s*Rw,\s\[Rw\])
    |(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
    |(?<Opt6>MOV\s*\[Rw\],\sRw)
    |(?<Opt7>MOV\s*\[-Rw\],\sRw)
    |(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
    |(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
    |(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\])
    

    using this data:

    MOV Rw, Rw
    MOV Rw, #data4
    MOV Rw, #data16
    MOV Rw, [Rw]
    MOV Rw, [Rw+]
    MOV [Rw], Rw
    MOV [-Rw], Rw
    MOV [Rw], [Rw]
    MOV [Rw+], [Rw]
    MOV [Rw], [Rw+]
    

    RegexBuddy generates this:

    Match 1:    MOV Rw, Rw       0      10
    Group "Opt1":   MOV Rw, Rw       0      10
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 2:    MOV Rw, #data4      12      14
    Group "Opt1" did not participate in the match
    Group "Opt2":   MOV Rw, #data4      12      14
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 3:    MOV Rw, #data16     28      15
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3":   MOV Rw, #data16     28      15
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 4:    MOV Rw, [Rw]        45      12
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4":   MOV Rw, [Rw]        45      12
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 5:    MOV Rw, [Rw+]       59      13
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5":   MOV Rw, [Rw+]       59      13
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 6:    MOV [Rw], Rw        74      12
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6":   MOV [Rw], Rw        74      12
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 7:    MOV [-Rw], Rw       88      13
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7":   MOV [-Rw], Rw       88      13
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 8:    MOV [Rw], [Rw]     103      14
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8":   MOV [Rw], [Rw]     103      14
    Group "Opt9" did not participate in the match
    Group "OptA" did not participate in the match
    Match 9:    MOV [Rw+], [Rw]    119      15
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9":   MOV [Rw+], [Rw]    119      15
    Group "OptA" did not participate in the match
    Match 10:   MOV [Rw], [Rw+]    136      15
    Group "Opt1" did not participate in the match
    Group "Opt2" did not participate in the match
    Group "Opt3" did not participate in the match
    Group "Opt4" did not participate in the match
    Group "Opt5" did not participate in the match
    Group "Opt6" did not participate in the match
    Group "Opt7" did not participate in the match
    Group "Opt8" did not participate in the match
    Group "Opt9" did not participate in the match
    Group "OptA":   MOV [Rw], [Rw+]    136      15
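
    In C#, reading that result back could look roughly like the sketch below: run the ten-alternative pattern and ask which named group participated. The pattern is written here with explicit escapes and compiled with `RegexOptions.IgnorePatternWhitespace` so it can stay on multiple lines (that option is an assumption about how the pattern is meant to be used); the `OptDemo` class and `WhichOption` method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptDemo
{
    // '#' is escaped because it would start a comment in IgnorePatternWhitespace mode.
    static readonly Regex Mov = new Regex(@"
         (?<Opt1>MOV\s*Rw,\sRw)
        |(?<Opt2>MOV\s*Rw,\s\#data4)
        |(?<Opt3>MOV\s*Rw,\s\#data16)
        |(?<Opt4>MOV\s*Rw,\s\[Rw\])
        |(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
        |(?<Opt6>MOV\s*\[Rw\],\sRw)
        |(?<Opt7>MOV\s*\[-Rw\],\sRw)
        |(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
        |(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
        |(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\])",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the name of the group that matched, or an empty string if none did.
    public static string WhichOption(string line)
    {
        Match m = Mov.Match(line);
        if (!m.Success) return "";

        foreach (string name in Mov.GetGroupNames())
            if (name != "0" && m.Groups[name].Success)   // "0" is the whole match
                return name;
        return "";
    }
}

static class Program
{
    static void Main()
    {
        foreach (string line in new[] { "MOV Rw, Rw", "MOV Rw, #data16", "MOV [Rw], [Rw+]" })
            Console.WriteLine($"{line} -> {OptDemo.WhichOption(line)}");
    }
}
```

    The group name then tells you the addressing-mode combination, but not the operands themselves; each alternative would need its own inner capture groups for that, which is where the pattern starts to grow quickly.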
    