Translating source code into a foreign language

I run an educational website that teaches programming to kids (12-15 years old).

As they don't all speak English, we use French variable and function names in the source code of the solutions. However, we are planning to translate the content into other languages (German, Spanish, English). To do so, I would like to translate the source code as quickly as possible. We mostly have C/C++ code.

The solution I'm planning to use:

  1. extract all variable/function names from the source code, with their positions in the file (where they are declared, used, called...)
  2. remove all language keywords and library functions
  3. ask the translator to provide translations for the remaining names
  4. replace the names in the file

    Is there already an open-source project that can do this (for points 1, 2 and 4)?

    If there isn't, the most difficult point is the first one: using a C/C++ parser to build a syntax tree and then extracting the variables with their positions seems the way to go. Do you have other ideas?

    Thank you for any advice.

    Edit: As noted in a comment, I will also need to take care of the comments, but there are only a few of them: the complete solution is already explained in plain text, and then we show the source code with self-explanatory variable/function names. The source code is rarely more than 30-40 lines long, and good names should make it understandable without comments if you already know what the code is doing.

    Additional info: for those interested, the website is a training platform for the International Olympiad in Informatics, and C/C++ (at least the minimum needed for programming contests) is not so difficult for a 12-year-old to learn.


    Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.

    In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.

    Be sure to let translators see the name they are translating with full context before they choose a translation.
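
    As a rough sketch of that lexical approach (not a tested tool), the function below scans the source line by line and records, for each identifier that is not in an exclusion set, the line numbers where it occurs. The name collect_identifiers and the exclusion set are my own assumptions; you would fill the set with the C/C++ keywords plus the names found in the included headers. It does not skip string literals, comments or preprocessor lines, so words inside those show up too unless they are excluded.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: maps every identifier that is not in `excluded`
// to the 1-based line numbers on which it appears (one entry per use).
std::map<std::string, std::vector<int>> collect_identifiers(
    std::istream &in, const std::set<std::string> &excluded)
{
    auto is_ident = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::map<std::string, std::vector<int>> found;
    std::string line;
    for (int line_no = 1; std::getline(in, line); ++line_no) {
        std::size_t i = 0;
        while (i < line.size()) {
            unsigned char ch = static_cast<unsigned char>(line[i]);
            if (std::isalpha(ch) || ch == '_') {
                // An identifier starts here: consume letters, digits, '_'.
                std::size_t start = i;
                while (i < line.size() && is_ident(line[i]))
                    ++i;
                std::string word = line.substr(start, i - start);
                if (excluded.count(word) == 0)
                    found[word].push_back(line_no);
            } else if (std::isdigit(ch)) {
                // Skip a numeric literal whole, so that suffixes such as
                // 10u or 0x1F are not mistaken for names.
                while (i < line.size() && (is_ident(line[i]) || line[i] == '.'))
                    ++i;
            } else {
                ++i;  // punctuation or whitespace
            }
        }
    }
    return found;
}
```

    The recorded line numbers can be used to show translators each name together with the lines it occurs on, which gives them the context mentioned above.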


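    I really think you could use clang (libclang) to parse your sources and do what you want (more information here). The good news is that it has Python bindings, which will make your life easier if you want to access a translation service or something similar.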


    You don't really need a C/C++ parser, just a simple lexer that gives you the elements of the code one by one. You then get a lot of tokens such as {, [, 213, ) etc. that you simply copy to the result file unchanged. You translate whatever consists only of letters (except keywords) and write the translation to the output.

    Now that I think about it, it's as simple as this:

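    #include <iostream>
    #include <string>
    using namespace std;

    // Not defined here: returns the translator's replacement for a name.
    string translate(const string &name);

    // Only plain letters count, so a name containing digits or
    // underscores is split into several words.
    bool is_letter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
    bool is_keyword(const string &s)
    {
        return s == "if" || s == "else" || s == "void" /* rest of them */;
    }
    void translateCode(istream &in, ostream &out)
    {
        char c;
        // in.get(c) fails at end of input, so no stray character is
        // written (testing in.eof() before reading would emit one).
        while (in.get(c))
        {
            if (!is_letter(c))
            {
                out << c;  // punctuation, digits and whitespace pass through
                continue;
            }
            string name;
            do
            {
                name += c;
            } while (in.get(c) && is_letter(c));
            if (is_keyword(name))
                out << name;
            else
                out << translate(name);
            if (in)
                out << c;  // the non-letter that ended the word was read
                           // but not yet written, so write it here
        }
    }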
    

    I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix it.

    Edit: Explanation:

    The code simply reads the input character by character and outputs every non-letter character it reads unchanged (including spaces, tabs and newlines). When it sees a letter, it collects that letter and all the following letters into one string (until it reaches a non-letter). If the string is a keyword, it outputs the keyword itself. If not, it translates the string and outputs the translation.

    The output would have the exact same format as the input.
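
    The translate function is left undefined in the code above. One possible way to write it, purely as a sketch, is a lookup in a table of the translator's answers that falls back to the original name when there is no entry. The Glossary type and the extra glossary parameter are my own assumptions, so the call in translateCode would need to pass the table along.

```cpp
#include <map>
#include <string>

// Hypothetical glossary type: original name -> translated name.
using Glossary = std::map<std::string, std::string>;

// Returns the translated name, or the original name unchanged when the
// glossary has no entry for it (library functions such as printf, or
// any name the translator has not handled yet).
std::string translate(const std::string &name, const Glossary &glossary)
{
    auto it = glossary.find(name);
    return it == glossary.end() ? name : it->second;
}
```

    With this fallback, a word that is missing from the table comes out untouched, so keywords and library names are safe as long as nobody adds them to the glossary.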
