Regular expression matching fully qualified class names

What is the best way to match a fully qualified Java class name in text?

Examples: java.lang.Reflect , java.util.ArrayList , org.hibernate.Hibernate .


A fully qualified Java class name is a sequence of dot-separated parts (let's call each part "N"), so it has the structure

N.N.N.N

Each "N" part must be a Java identifier. A Java identifier cannot start with a digit, but after the initial character it may use any combination of letters, digits, underscores, and dollar signs:

([a-zA-Z_$][a-zA-Z\d_$]*\.)*[a-zA-Z_$][a-zA-Z\d_$]*
------------------------    -----------------------
          N                           N
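
Since the question asks about matching names inside a text, here is a minimal sketch of how the pattern above could be used with Matcher.find(). The class name FqcnFinder and the method findAll are made up for illustration, and the group is written as non-capturing because only the whole match is needed. Note that a plain word such as "see" is a valid identifier too, so it is also reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FqcnFinder {

    // One identifier "N": a letter, underscore or dollar sign,
    // followed by letters, digits, underscores or dollar signs.
    private static final String ID = "[a-zA-Z_$][a-zA-Z\\d_$]*";

    // (N\.)*N, the structure described in the answer.
    private static final Pattern FQCN = Pattern.compile("(?:" + ID + "\\.)*" + ID);

    /** Returns every substring of text that looks like a (possibly qualified) name. */
    public static List<String> findAll(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = FQCN.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findAll("see java.util.ArrayList and org.hibernate.Hibernate"));
    }
}
```

If only names with at least one dot are of interest, changing the `*` after the group to `+` would be one way to filter out single words.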

An identifier also cannot be a reserved word (like import, true or null). If you only want to check plausibility, the pattern above is enough. If you also want to check validity, you must check each part against a list of reserved words as well.
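
One possible way to do the reserved-word check without maintaining your own list is javax.lang.model.SourceVersion.isKeyword, which ships with the JDK and also reports the literals true, false and null. The following is only a sketch: the class FqcnValidator and the method isValid are names I picked for illustration, and the result depends on the Java version you run it on, since the set of keywords has changed over time.

```java
import java.util.regex.Pattern;
import javax.lang.model.SourceVersion;

public class FqcnValidator {

    private static final String ID = "[a-zA-Z_$][a-zA-Z\\d_$]*";
    private static final Pattern FQCN = Pattern.compile("(?:" + ID + "\\.)*" + ID);

    /** Plausibility check with the regex, then a reserved-word check on each part. */
    public static boolean isValid(String name) {
        if (!FQCN.matcher(name).matches()) {
            return false;
        }
        // The regex guarantees there are no empty parts between dots.
        for (String part : name.split("\\.")) {
            // isKeyword covers keywords as well as true, false and null.
            if (SourceVersion.isKeyword(part)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("java.util.ArrayList"));
        System.out.println(isValid("java.import.Foo"));
    }
}
```

SourceVersion.isName does a similar combined check in a single call, if you do not need the regex for anything else.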

Java identifiers may contain any Unicode letter, not only Latin ones. If you want to check for this as well, use Unicode character classes:

([\p{Letter}_$][\p{Letter}\p{Number}_$]*\.)*[\p{Letter}_$][\p{Letter}\p{Number}_$]*

or, for short

([\p{L}_$][\p{L}\p{N}_$]*\.)*[\p{L}_$][\p{L}\p{N}_$]*
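
As a rough sketch of the Unicode variant in Java: I use the short category names \p{L} and \p{N} here because those are forms I know java.util.regex accepts. The class UnicodeFqcn and the method looksLikeFqcn are hypothetical names, and the sample name is written with Unicode escapes (it is the German word for "size") so the source file stays ASCII-only.

```java
import java.util.regex.Pattern;

public class UnicodeFqcn {

    // Same structure as the ASCII pattern, with Unicode letter and number categories.
    private static final String ID = "[\\p{L}_$][\\p{L}\\p{N}_$]*";
    private static final Pattern FQCN = Pattern.compile("(?:" + ID + "\\.)*" + ID);

    public static boolean looksLikeFqcn(String name) {
        return FQCN.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Non-Latin letters are accepted by this variant.
        System.out.println(looksLikeFqcn("org.example.Gr\u00f6\u00dfe"));
        // A part starting with a digit is still rejected.
        System.out.println(looksLikeFqcn("org.example.9Lives"));
    }
}
```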

The Java Language Specification (section 3.8) has all the details about valid identifier names.

Also see the answer to this question: Java Unicode variable names


Based on the excellent comment from @alan-moore, here is a complete working class with tests:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.regex.Pattern;

import org.junit.Test;

public class ValidateJavaIdentifier {

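    // One Java identifier, using the character classes backed by
    // Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.
    private static final String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
    // An identifier followed by zero or more ".identifier" segments.
    private static final Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");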

    public static boolean validateJavaIdentifier(String identifier) {
        return FQCN.matcher(identifier).matches();
    }


    @Test
    public void testJavaIdentifier() throws Exception {
        assertTrue(validateJavaIdentifier("C"));
        assertTrue(validateJavaIdentifier("Cc"));
        assertTrue(validateJavaIdentifier("b.C"));
        assertTrue(validateJavaIdentifier("b.Cc"));
        assertTrue(validateJavaIdentifier("aAa.b.Cc"));
        assertTrue(validateJavaIdentifier("a.b.Cc"));

        // after the initial character identifiers may use any combination of
        // letters and digits, underscores or dollar signs
        assertTrue(validateJavaIdentifier("a.b.C_c"));
        assertTrue(validateJavaIdentifier("a.b.C$c"));
        assertTrue(validateJavaIdentifier("a.b.C9"));

        assertFalse("cannot start with a dot", validateJavaIdentifier(".C"));
        assertFalse("cannot have two dots following each other",
                validateJavaIdentifier("b..C"));
        assertFalse("cannot start with a number ",
                validateJavaIdentifier("b.9C"));
    }
}

The pattern provided by Renaud works, but as far as I can tell it will always backtrack at the end.

To optimize it, you can essentially swap the first half with the last. Note that the position of the dot match has to change as well.

The following is my version, which runs about twice as fast as the original:

String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");
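
To make the swap concrete, here is a small sketch that puts the two arrangements side by side. I am assuming that "the original" means the prefix-first form (N\.)*N used earlier in this thread; the class name FqcnArrangements and the method agree are my own. In the prefix-first form the last identifier is scanned once inside the group (where the dot then fails to match) and again by the trailing identifier, whereas the suffix form scans each segment once. The sketch only checks that both forms give the same answer; it does not measure speed.

```java
import java.util.regex.Pattern;

public class FqcnArrangements {

    private static final String ID = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";

    // Assumed "original": zero or more "identifier." prefixes, then an identifier.
    public static final Pattern PREFIX_FIRST = Pattern.compile("(" + ID + "\\.)*" + ID);

    // Version from this answer: an identifier, then zero or more ".identifier" suffixes.
    public static final Pattern SUFFIX_LAST = Pattern.compile(ID + "(\\." + ID + ")*");

    /** True if both arrangements accept or both reject the input. */
    public static boolean agree(String input) {
        return PREFIX_FIRST.matcher(input).matches() == SUFFIX_LAST.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"C", "a.b.Cc", ".C", "b..C", "b.9C"}) {
            System.out.println(s + " -> " + SUFFIX_LAST.matcher(s).matches()
                    + ", same result: " + agree(s));
        }
    }
}
```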

I cannot write comments, so I decided to write an answer instead.
