Do Java regular expressions offer any performance benefit?

In Java, suppose we do pattern matching with a regular expression, e.g. take an input string, use a regular expression to find out whether it is numeric, and throw an exception if it is not. In this case, I understand that using a regex makes the code less verbose than taking each character of the string, checking whether it is a digit, and throwing an exception if it is not.

But I was under the assumption that regex also makes the process more efficient. Is this true? I cannot find any evidence on this point. How does regex do the match behind the scenes? Is it not also iterating over the string and checking each character one by one?


Just for fun, I have run this micro benchmark. The results of the last run (i.e. after JVM warm-up / JIT) are below, in milliseconds per pass over 1,000,000 strings (results are fairly consistent from one run to another anyway):

regex with numbers 123
chars with numbers 33
parseInt with numbers 33
regex with words 123
chars with words 34
parseInt with words 733

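In other words, chars is very efficient. Integer.parseInt (the exception method in the code below, labelled "parseInt" in the results above) is as efficient as chars if the string is a number, but awfully slow if the string is not a number. Regex sits in between: slower than both for numbers, but much faster than parseInt for non-numbers.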

Conclusion

If you parse a string into a number and you expect the string to be a number in general, using Integer.parseInt is the best solution (efficient and readable). The penalty you get when the string is not a number should be low if it is not too frequent.

PS: my regex may not be optimal; feel free to comment.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TestNumber {

    private final static List<String> numbers = new ArrayList<>();
    private final static List<String> words = new ArrayList<>();

    public static void main(String args[]) {
        long start;

        for (int i = 0; i < 1000000; i++) {
            numbers.add(String.valueOf(i));
            words.add(String.valueOf(i) + "x");
        }

        for (int i = 0; i < 5; i++) {
            start = System.nanoTime();
            regex(numbers);
            System.out.println("regex with numbers " + (System.nanoTime() - start) / 1000000);
            start = System.nanoTime();
            chars(numbers);
            System.out.println("chars with numbers " + (System.nanoTime() - start) / 1000000);
            start = System.nanoTime();
            exception(numbers);
            System.out.println("exceptions with numbers " + (System.nanoTime() - start) / 1000000);

            start = System.nanoTime();
            regex(words);
            System.out.println("regex with words " + (System.nanoTime() - start) / 1000000);
            start = System.nanoTime();
            chars(words);
            System.out.println("chars with words " + (System.nanoTime() - start) / 1000000);
            start = System.nanoTime();
            exception(words);
            System.out.println("exceptions with words " + (System.nanoTime() - start) / 1000000);
        }
    }

    private static int regex(List<String> list) {
        int sum = 0;
        Pattern p = Pattern.compile("[0-9]+");
        for (String s : list) {
            sum += (p.matcher(s).matches() ? 1 : 0);
        }
        return sum;
    }

    private static int chars(List<String> list) {
        int sum = 0;

        for (String s : list) {
            boolean isNumber = true;
            for (char c : s.toCharArray()) {
                if (c < '0' || c > '9') {
                    isNumber = false;
                    break;
                }
            }
            if (isNumber) {
                sum++;
            }
        }
        return sum;
    }

    private static int exception(List<String> list) {
        int sum = 0;

        for (String s : list) {
            try {
                Integer.parseInt(s);
                sum++;
            } catch (NumberFormatException e) {
                // not a number: do not count it
            }
        }
        return sum;
    }
}

I don't have a technical answer yet, but I could write some code and see. I don't think that regular expressions would be the way to go for converting a string to a number. In many instances they can be more efficient, but if a regex is written poorly, it will be slow.
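One common way a regex check ends up slow in Java is recompiling the pattern on every call. The sketch below contrasts the two styles; the class and method names (DigitsCheck, slowIsDigits, isDigits) are hypothetical and not part of the benchmark above, which already compiles its pattern once per list.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, shown for illustration only.
final class DigitsCheck {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // String.matches compiles the regex again on every call.
    static boolean slowIsDigits(String s) {
        return s.matches("[0-9]+");
    }

    // Reuses the precompiled pattern; only a Matcher is created per call.
    static boolean isDigits(String s) {
        return DIGITS.matcher(s).matches();
    }
}
```

Both methods return the same answers; the difference is only in how much work is repeated per call, and how much that matters is something to measure rather than assume.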

May I ask, however, why you aren't just using Integer.parseInt("124")? It throws a NumberFormatException if the string is not a valid integer. You should be able to handle that, and it leaves the detection of a number up to core Java.
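A minimal sketch of that approach, wrapped in a boolean check. The class and method names (ParseCheck, isInt) are made up for the example.

```java
// Hypothetical helper, shown only to illustrate the parseInt approach.
final class ParseCheck {
    // Returns true if s parses as a 32-bit signed integer.
    static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for non-numeric input and for values outside the int range.
            return false;
        }
    }
}
```

Note that this is not quite the same test as the regex [0-9]+: parseInt accepts a leading minus sign, so isInt("-5") is true, and it rejects digit strings that overflow an int, so isInt("99999999999") is false.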


About regex behind the scenes...

A finite-state machine (FSM) is equivalent to a regular expression. An FSM is a machine that can recognize a language (in your case, numbers). It has an alphabet, a set of states, an initial state, N final states, and a transition function from one state to another. The input string must consist of symbols from the alphabet (ASCII, for example). The FSM begins in the initial state. When you input a string, it processes it character by character, moving from state to state according to a function (state, char) => state. If the machine is in a final state once the whole string has been consumed, the string is numeric; otherwise it is not.

For more, see the articles on finite-state machines and automata-based programming.
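To make the description concrete, here is a hand-written FSM for the language [0-9]+, with an explicit transition function (state, char) => state. The names (DigitFsm, State, step, accepts) are invented for the sketch, and it is only an illustration of the idea; it makes no claim about how java.util.regex is implemented internally.

```java
// Hypothetical hand-written FSM recognizing one or more ASCII digits.
final class DigitFsm {
    // START is the initial state, DIGITS the only final state,
    // REJECT a dead state that can never be left.
    private enum State { START, DIGITS, REJECT }

    // Transition function: (state, char) -> state.
    private static State step(State state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:
            case DIGITS:
                return digit ? State.DIGITS : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    // Feeds the input one character at a time and accepts only if the
    // machine ends in the final state after the whole string is consumed.
    static boolean accepts(String input) {
        State state = State.START;
        for (int i = 0; i < input.length() && state != State.REJECT; i++) {
            state = step(state, input.charAt(i));
        }
        return state == State.DIGITS;
    }
}
```

So yes, the string is still visited character by character; for a pattern this simple the machine does essentially the same work as the manual loop in the chars method, plus whatever overhead the regex engine adds around it.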
