\d is less efficient than [0-9]
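Yesterday I commented on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or the digit specifier than a character set.
I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, 5077 of which actually contain a digit: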
Regular expression \d took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9] took 00:00:00.1357972 result: 5077/10000 63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000 64.87 % of first
This surprised me for two reasons, not least that \d is worse than [0-9]. Is there more to \d than simply being shorthand for [0-9]?
Here is the test code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            //The verbatim string keeps the backslash, so the pattern is \d
            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();
            int successes = 0;
            //Construct the regex outside the timed region
            var rex = new Regex(regex);
            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();
            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);
            return sw.Elapsed;
        }
    }
}
\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that are matched by \d but not by [0-9].
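To make the difference concrete, here is a minimal sketch that runs both patterns against an ASCII digit and a Persian digit. The class name DigitClassDemo and the helper Matches are made up for this illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original test program.
static class DigitClassDemo
{
    // True if the pattern matches anywhere in the input, using default regex options.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        string persianOne = "\u06F1"; // EXTENDED ARABIC-INDIC DIGIT ONE

        Console.WriteLine(Matches("7", @"\d"));          // True
        Console.WriteLine(Matches("7", "[0-9]"));        // True
        Console.WriteLine(Matches(persianOne, @"\d"));   // True
        Console.WriteLine(Matches(persianOne, "[0-9]")); // False
    }
}
```

Only the last call fails to match, because [0-9] covers just the ten ASCII digits while \d, under default options, also accepts other Unicode decimal digits.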
You can generate a list of all such characters using the following code:
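using System;
using System.Text;
using System.Text.RegularExpressions;

var sb = new StringBuilder();
// Walk the 16-bit code units and keep each one that \d matches
for (UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());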
Which generates:
012345678901234567890123456789߀߁߂߃߄߅߆߇߈߉012345678901২345678901234567890123456789୦୧୨୩୪୫୬୭୮୯0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789
Credit to ByteBlast for noticing this in the docs. Just changing the regex constructor:
var rex = new Regex(regex, RegexOptions.ECMAScript);
gives new timings:
Regex \d took 00:00:00.1355787 result: 5077/10000
Regex [0-9] took 00:00:00.1360403 result: 5077/10000 100.34 % of first
Regex [0123456789] took 00:00:00.1362112 result: 5077/10000 100.47 % of first
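The near-identical timings fit with \d being restricted to ASCII digits in this mode. Here is a small sketch to check that behaviour directly; the class name EcmaScriptDigitDemo and the helper MatchesDigit are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original test program.
static class EcmaScriptDigitDemo
{
    // True if \d matches anywhere in the input under the given options.
    public static bool MatchesDigit(string input, RegexOptions options)
    {
        return new Regex(@"\d", options).IsMatch(input);
    }

    static void Main()
    {
        string persianOne = "\u06F1"; // EXTENDED ARABIC-INDIC DIGIT ONE

        Console.WriteLine(MatchesDigit(persianOne, RegexOptions.None));       // True
        Console.WriteLine(MatchesDigit(persianOne, RegexOptions.ECMAScript)); // False
        Console.WriteLine(MatchesDigit("7", RegexOptions.ECMAScript));        // True
    }
}
```

With RegexOptions.ECMAScript the Persian digit is no longer matched, so in that mode \d behaves like [0-9] for this input.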
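From "What does "\d" mean in a regular expression?":
[0-9] is not equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩.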