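How do you match only valid Roman numerals with a regular expression?
Thinking about my other problem, I decided I can't even create a regular expression that will match Roman numerals (let alone a context-free grammar that will generate them).
The problem is matching only valid Roman numerals. For example, 990 is not "XM", it's "CMXC".
My problem in writing the regex for this is that in order to allow or disallow certain characters, I need to look back. Take thousands and hundreds, for example.
I can allow M{0,2}C?M (to allow for 900, 1000, 1900, 2000, 2900 and 3000). However, if the match is on CM, I can't allow the following characters to be C or D (because I'm already at 900).
How can I express this in a regex?
If it simply can't be expressed in a regex, can it be expressed in a context-free grammar?
Try: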
^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
Breaking it down:
M{0,4}
This specifies the thousands section and basically restricts it to between 0 and 4000. It's a relatively simple one:
0: <empty> matched by M{0}
1000: M matched by M{1}
2000: MM matched by M{2}
3000: MMM matched by M{3}
4000: MMMM matched by M{4}
(CM|CD|D?C{0,3})
Slightly more complex, this is for the hundreds section and covers all the possibilities:
0: <empty> matched by D?C{0} (with D not there)
100: C matched by D?C{1} (with D not there)
200: CC matched by D?C{2} (with D not there)
300: CCC matched by D?C{3} (with D not there)
400: CD matched by CD
500: D matched by D?C{0} (with D there)
600: DC matched by D?C{1} (with D there)
700: DCC matched by D?C{2} (with D there)
800: DCCC matched by D?C{3} (with D there)
900: CM matched by CM
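The table above can be checked mechanically. The following is a minimal sketch in Python (the choice of language and the names HUNDREDS and forms are assumptions for illustration, not part of the original answer). It anchors the hundreds group on its own and confirms that it accepts the ten listed forms and rejects sequences such as "CMC", which is exactly the "already at 900" case the question worries about.

```python
import re

# The hundreds group from the pattern above, anchored so it must
# consume the whole string (anchoring it alone is an assumption made
# here purely for testing the group in isolation).
HUNDREDS = re.compile(r"^(CM|CD|D?C{0,3})$")

# Expected forms for 0, 100, ..., 900, copied from the table above.
forms = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]

for value, form in zip(range(0, 1000, 100), forms):
    assert HUNDREDS.match(form), (value, form)

# Sequences that break the rules are rejected, including C or D after CM.
for bad in ["CCCC", "DD", "CMC", "CMD", "CDC", "DCM"]:
    assert HUNDREDS.match(bad) is None, bad
```

No lookbehind is needed: because CM is one alternative of the group, nothing else from the hundreds place can follow it.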
(XC|XL|L?X{0,3})
Same rules as the previous section, but for the tens place:
0: <empty> matched by L?X{0} (with L not there)
10: X matched by L?X{1} (with L not there)
20: XX matched by L?X{2} (with L not there)
30: XXX matched by L?X{3} (with L not there)
40: XL matched by XL
50: L matched by L?X{0} (with L there)
60: LX matched by L?X{1} (with L there)
70: LXX matched by L?X{2} (with L there)
80: LXXX matched by L?X{3} (with L there)
90: XC matched by XC
(IX|IV|V?I{0,3})
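This is the units section, handling 0 through 9, and it is also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):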
0: <empty> matched by V?I{0} (with V not there)
1: I matched by V?I{1} (with V not there)
2: II matched by V?I{2} (with V not there)
3: III matched by V?I{3} (with V not there)
4: IV matched by IV
5: V matched by V?I{0} (with V there)
6: VI matched by V?I{1} (with V there)
7: VII matched by V?I{2} (with V there)
8: VIII matched by V?I{3} (with V there)
9: IX matched by IX
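Since the hundreds, tens and units groups share one shape, the whole pattern can be assembled from a single helper and tested exhaustively. This is only a sketch: digit_group and to_roman are hypothetical helpers introduced here for illustration, and Python is an assumed language choice.

```python
import re

def digit_group(one: str, five: str, ten: str) -> str:
    """Build the group for one decimal place, e.g. ('C', 'D', 'M') for hundreds."""
    return f"({one}{ten}|{one}{five}|{five}?{one}{{0,3}})"

# Reassemble the pattern from the answer out of the three identical shapes.
PATTERN = (
    "^M{0,4}"
    + digit_group("C", "D", "M")
    + digit_group("X", "L", "C")
    + digit_group("I", "V", "X")
    + "$"
)
ROMAN = re.compile(PATTERN)

def to_roman(n: int) -> str:
    """Reference converter for 1..4999, used only to generate test inputs."""
    table = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in table:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

# Every canonical numeral from 1 to 4999 is accepted.
assert all(ROMAN.match(to_roman(n)) for n in range(1, 5000))

# Non-canonical forms are rejected.
for bad in ["XM", "IIII", "IC", "VX", "MMMMM"]:
    assert ROMAN.match(bad) is None, bad
```

Note that every part of the pattern is optional, so the empty string is accepted as well.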
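Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".
The Romans were far less concerned about the "rules" than your third-grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4. And "IIM" was completely cool for 998.
(If you have trouble dealing with that... remember that English spelling was not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough.)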
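One way to make "as long as it added up" concrete is a lenient evaluator rather than a validating regex. The sketch below is an assumption about how such loose numerals could be read, not a historical rule and not something this answer specifies: scanning from the right, a symbol is subtracted if a larger symbol has already been seen, and added otherwise. The name lenient_value is hypothetical.

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def lenient_value(numeral: str) -> int:
    """Evaluate a numeral loosely: any smaller symbol standing somewhere
    to the left of a larger one is subtracted, everything else is added.
    Raises KeyError for characters that are not Roman numeral symbols."""
    total = 0
    largest_seen = 0
    for symbol in reversed(numeral.upper()):
        value = VALUES[symbol]
        if value < largest_seen:
            total -= value          # e.g. the X in "XM"
        else:
            total += value
            largest_seen = value
    return total

print(lenient_value("XM"))    # 990
print(lenient_value("CMXC"))  # 990
print(lenient_value("IIII"))  # 4
print(lenient_value("IIM"))   # 998
```

Under this reading, the strict and the loose spellings of the same number evaluate to the same value, which is the sense in which both "add up".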
To avoid matching the empty string, you need to repeat the pattern four times, replacing each 0 with a 1 in turn, and account for V, L and D:
(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
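As a sanity check, the four-way pattern can be compared against the simpler one-line pattern. The sketch below assumes Python and uses re.fullmatch in place of explicit anchors; the names ORIGINAL and NON_EMPTY are introduced here for illustration. It confirms that the two patterns differ only on the empty string, at least for every string of up to four numeral letters.

```python
import re
from itertools import product

# The simpler pattern, in which every part is optional.
ORIGINAL = r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"

# The four-way pattern: each alternative forces one part to be non-empty.
NON_EMPTY = (
    r"(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))"
)

# The only intended difference: the empty string.
assert re.fullmatch(ORIGINAL, "") is not None
assert re.fullmatch(NON_EMPTY, "") is None

# On every non-empty string of up to four numeral letters the two agree.
for length in range(1, 5):
    for letters in product("IVXLCDM", repeat=length):
        s = "".join(letters)
        assert bool(re.fullmatch(ORIGINAL, s)) == bool(re.fullmatch(NON_EMPTY, s)), s
```

The changes C?D, X?L and I?V are what "account for V, L and D" refers to: once the repeated letter is forced to appear at least once, the bare D, L and V would otherwise be lost.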
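In this case (because the pattern uses ^ and $), you would be better off checking for empty lines first and not bothering to match them. If you are using word boundaries, you don't have the problem, because there is no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)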
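In my own (real-world) case, I needed to match numerals at word endings, and I found no other way around it. I needed to scrub the footnote numbers from my plain-text document, where text such as "the Red Sea and the Great Barrier Reef", with a footnote numeral after "Sea" and after "Reef", had been converted to the Red Seacl and the Great Barrier Reefcli. But I still had problems with valid words like Tahiti and fantastic, which were scrubbed into Tahit and fantasti.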
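The behaviour described here can be reproduced with a short sketch. It assumes Python, case-insensitive matching, and a non-empty numeral followed by a word boundary; the answer does not show the exact expression that was used, so TRAILING_NUMERAL and scrub_footnotes are hypothetical names for one possible reconstruction.

```python
import re

# Non-empty Roman numeral pattern from this answer.
NON_EMPTY = (
    r"(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))"
)

# Assumption: a footnote number is a numeral that runs up to the end of a word.
TRAILING_NUMERAL = re.compile(NON_EMPTY + r"\b", re.IGNORECASE)

def scrub_footnotes(text: str) -> str:
    """Delete any Roman numeral that ends at a word boundary."""
    return TRAILING_NUMERAL.sub("", text)

print(scrub_footnotes("Seacl"))      # Sea
print(scrub_footnotes("Reefcli"))    # Reef
print(scrub_footnotes("Tahiti"))     # Tahit
print(scrub_footnotes("fantastic"))  # fantasti
```

The last two lines show the false positives: the regex has no way to tell a trailing footnote numeral from an ordinary word that happens to end in i, c, d, l, m, v or x, so a word list or manual review would be needed on top of it.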