Writing a code formatting tool for a programming language

I'm looking into the feasibility of writing a code formatting tool for the Apex language, a Salesforce.com variation on Java, and perhams VisualForce, its tag based markup language.

I have no idea on where to start this, apart from feeling/knowing that writing a language parser from scratch is probably not the best approach.

I have a fairly thin grasp of what Antlr is and what it does, but conceptually, I'm imagining one could 'train' antlr to understand the syntax of Apex. I could then get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.

Is this the right concept? Is Antlr a tool to do that? Any links to a brief synopsis on this? I'm looking for investing a few days in this task, not months, and I'm not sure if its even vaguely achievable.


Since Apex syntax is similar to Java, I'd look at Eclipse's JDT. Edit down the Java grammar to match Apex. Do the same w/ formatting rules/options. This is more than a few days of work.


Steven Herod wrote:

... I'm imagining one could 'train' antlr to understand the syntax of Apex. ...

What do you mean by "'train' antlr"? "Train" as in artificial intelligence (training a neural-net)? If so, then you are mistaken.

Steven Herod wrote:

... get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.

Is this the right concept? Is Antlr a tool to do that?

Yes, more or less. You write a grammar that precisely defines the language you want to parse. Then you use ANTLR which will generate a lexer (tokenizer) and parser based on the grammar file. You can let the parser create an AST from your input source and then walk the AST and emit (custom) output/code.

Steven Herod wrote:

... I'm looking for investing a few days in this task, not months, and I'm not sure if its even vaguely achievable.

Well, I don't know you of course, but I'd say writing a grammar for a language similar to Java, and then emitting output by walking the AST within just a couple of days is impossible, even more so for someone new to ANTLR. I am fairly familiar with ANTLR, but I couldn't do it in just a few days. Note that I'm only talking about the "parsing-part", after you've done that, you'll need to integrate this in some text editor. This all looks to be more a project of several months, not even weeks, let alone several days.

So, in short, if all you want to do is write a custom code highlighter, ANTLR isn't your best choice.

You could have a look at Xtext which uses ANTLR under the hood. To quote their website:

With Xtext you can easily create your own programming languages and domain-specific languages (DSLs). The framework supports the development of language infrastructures including compilers and interpreters as well as full blown Eclipse-based IDE integration. ...

But I doubt you'll have an Eclipse plugin up and running within just a few days.

Anyway, best of luck!


Our DMS Software Reengineering Toolkit is designed to do this as kind poker-pot ante necessary to do any kind of automated software reengineering project.

DMS allows one to define a grammar, similar to ANTLR's (and other parser generator) styles. Unlike ANTLR (and other parser generators), DMS uses a GLR parser, which means you don't have to bend the language grammar rules to meet the requirements of the parser generator. If you can write an context-free grammar, DMS will convert that into a parser for that language. This means in fact you can get a working, correct grammar up considerably faster than with typical LL or L(AL)R parser generators.

Unlike ANTLR (and other parser generators), there is no additional work to build the AST; it is automatically constructed. This means you spend zero time write tree-building rules and none debugging them.

DMS additionally provides a pretty-printing specification language, specifying text boxes stack vertically, horizontally, or indented, in which you can define the "format" that is used to convert the AST back into completely legal, nicely formatted source text. None of the well known parser generators provide any help here; if you want to prettyprint the tree, you get to do a great deal of custom coding. For more details on this, see my SO answer to Compiling an AST back to source. What this means is you can build a prettyprinter for your grammar in an (intense) afternoon by simply annotating the grammar rules with box layout directives.

DMS's lexer is very careful to capture comments and "lexical formats" (was that number octal? What kind of quotes did that string have? Escaped characters?) so that they can be regenerated correctly. Parse-to-AST and then prettyprint-AST-to-text round trips arbitrarily ugly code into formatted code following the prettyprinting rules. (This round trip is the poker ante: if you want go further, to actually manipulate the AST, you still want to be able to regenerate valid source text).

We recently built parser/prettyprinters for EGL. This took about a week end to end. Granted, we are expert at our tools.

You can download any of a number of different formatters built using DMS from our web site, to see what such formatting can do.

EDIT July 2012: Last week (5 days) using DMS, from scratch we (I personally) built a fully compliant IEC61131-3 "Structured Text" (industrial control language, Pascal-like) parser and prettyprinter. (It handles all the examples from the standards documents).

链接地址: http://www.djcxy.com/p/7592.html

上一篇: SQLite中的Base64?

下一篇: 为编程语言编写代码格式化工具