\d is less efficient than [0-9]

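Yesterday I commented on an answer in which someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or the digit specifier than a character set.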

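I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, of which 5077 actually contain a digit: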

    Regular expression \d           took 00:00:00.2141226 result: 5077/10000
    Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
    Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

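This surprised me for two reasons:

  • I would have thought the range would be implemented more efficiently than the set.
  • I can't understand why \d is worse than [0-9]. Is there more to \d than simply being shorthand for [0-9]?

Here is the test code: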

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    using System.Text.RegularExpressions;
    
    namespace SO_RegexPerformance
    {
        class Program
        {
            static void Main(string[] args)
            {
                var rand = new Random(1234);
                var strings = new List<string>();
                //10K random strings
                for (var i = 0; i < 10000; i++)
                {
                    //Generate random string
                    var sb = new StringBuilder();
                    for (var c = 0; c < 1000; c++)
                    {
                        //Add a-z randomly
                        sb.Append((char)('a' + rand.Next(26)));
                    }
                    //In roughly 50% of them, put a digit
                    if (rand.Next(2) == 0)
                    {
                        //Replace one character with a digit, 0-9
                        sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                    }
                    strings.Add(sb.ToString());
                }
    
                var baseTime = testPerfomance(strings, @"\d");
                Console.WriteLine();
                var testTime = testPerfomance(strings, "[0-9]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
                testTime = testPerfomance(strings, "[0123456789]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            }
    
            private static TimeSpan testPerfomance(List<string> strings, string regex)
            {
                var sw = new Stopwatch();
    
                int successes = 0;
    
                var rex = new Regex(regex);
    
                sw.Start();
                foreach (var str in strings)
                {
                    if (rex.Match(str).Success)
                    {
                        successes++;
                    }
                }
                sw.Stop();
    
                Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);
    
                return sw.Elapsed;
            }
        }
    }
    

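\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that are matched by \d but not by [0-9].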
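The difference can be checked on a single character. The following is a minimal sketch, not part of the original answer; the class name `DigitClassDemo` and the helper `Matches` are hypothetical names introduced only for illustration. It assumes the default .NET behaviour (no `RegexOptions` set), in which `\d` matches any character in the Unicode decimal digit category.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitClassDemo
{
    // Hypothetical helper: true if the pattern matches the entire input.
    public static bool Matches(string input, string pattern)
        => Regex.IsMatch(input, "^" + pattern + "$");

    static void Main()
    {
        // U+06F4 EXTENDED ARABIC-INDIC DIGIT FOUR (used in Persian).
        string persianFour = "\u06F4";

        Console.WriteLine(Matches("4", @"\d"));          // True
        Console.WriteLine(Matches("4", "[0-9]"));        // True
        Console.WriteLine(Matches(persianFour, @"\d"));  // True
        Console.WriteLine(Matches(persianFour, "[0-9]")); // False

        // char.IsDigit uses the same Unicode category, so it agrees with \d here.
        Console.WriteLine(char.IsDigit(persianFour, 0)); // True
    }
}
```

Because `\d` has to test membership in a much larger set than ten consecutive code points, it is plausible that this is where the extra time in the question's benchmark goes, although the sketch itself measures nothing.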

You can generate a list of all such characters with the following code:

    var sb = new StringBuilder();
    for(UInt16 i = 0; i < UInt16.MaxValue; i++)
    {
        string str = Convert.ToChar(i).ToString();
        if (Regex.IsMatch(str, @"\d"))
            sb.Append(str);
    }
    Console.WriteLine(sb.ToString());
    

Which produces:

    012345678901234567890123456789߀߁߂߃߄߅߆߇߈߉012345678901২345678901234567890123456789୦୧୨୩୪୫୬୭୮୯0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789


Credit to ByteBlast for noticing this in the documentation. Simply changing the Regex constructor to:

    var rex = new Regex(regex, RegexOptions.ECMAScript);
    

gives new timings:

    Regex \d           took 00:00:00.1355787 result: 5077/10000
    Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
    Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
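These near-identical timings fit the documented behaviour of `RegexOptions.ECMAScript`, under which `\d` is treated as equivalent to `[0-9]`. A minimal sketch of that behaviour follows; the class name `EcmaScriptDigitDemo` and the helper `DigitMatches` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EcmaScriptDigitDemo
{
    // Hypothetical helper: does \d find a match in the input under the given options?
    public static bool DigitMatches(string input, RegexOptions options)
        => new Regex(@"\d", options).IsMatch(input);

    static void Main()
    {
        // U+0667 ARABIC-INDIC DIGIT SEVEN.
        string arabicIndicSeven = "\u0667";

        Console.WriteLine(DigitMatches(arabicIndicSeven, RegexOptions.None));       // True
        Console.WriteLine(DigitMatches(arabicIndicSeven, RegexOptions.ECMAScript)); // False
        Console.WriteLine(DigitMatches("7", RegexOptions.ECMAScript));              // True
    }
}
```

Note that the option changes more than `\d` (other shorthand classes such as `\w` are narrowed as well), so switching it on for speed alone may alter what an existing pattern matches.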
    

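From "Does \d in regex mean a digit?":

[0-9] isn't equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩.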
