How do you match only valid Roman numerals with a regular expression?

2018-07-01 04:41:28

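While thinking about my other problem, I decided I can't even create a regular expression that matches Roman numerals (let alone a context-free grammar that generates them).

The problem is matching only valid Roman numerals. For example, 990 is NOT "XM", it's "CMXC".

My problem in writing a regex for this is that, to allow or disallow certain characters, I need to look back at what has already matched. Take the thousands and hundreds as an example.

I can allow M{0,2}C?M (which covers 900, 1000, 1900, 2000, 2900 and 3000). However, if the match ended on CM, I can't allow the following characters to be C or D (because I'm already at 900).

How can I express this in a regex?
If it simply can't be expressed in a regex, can it be expressed in a context-free grammar?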

Try:

^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
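As a quick way to try the pattern out, here is a minimal Python sketch. The helper name is_roman is only illustrative and is not part of the original answer.

```python
import re

# The pattern exactly as given above. fullmatch() already requires the whole
# string to match; the ^ and $ anchors are kept so the text stays identical.
ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def is_roman(s: str) -> bool:
    """Return True if s matches the pattern (illustrative helper)."""
    return ROMAN.fullmatch(s) is not None


for candidate in ["CMXC", "XM", "MCMXCIV", "IIII", ""]:
    print(repr(candidate), is_roman(candidate))
```

With this pattern "CMXC" and "MCMXCIV" are accepted and "XM" and "IIII" are rejected. Note that the empty string is accepted as well, because every part of the pattern is allowed to match nothing.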

Breaking it down:

M{0,4}

This specifies the thousands section and restricts it to between 0 and 4000. It's relatively simple:

   0: <empty>  matched by M{0}
1000: M        matched by M{1}
2000: MM       matched by M{2}
3000: MMM      matched by M{3}
4000: MMMM     matched by M{4}

(CM|CD|D?C{0,3})

Slightly more complex, this is for the hundreds section and covers all the possibilities:

  0: <empty>  matched by D?C{0} (with D not there)
100: C        matched by D?C{1} (with D not there)
200: CC       matched by D?C{2} (with D not there)
300: CCC      matched by D?C{3} (with D not there)
400: CD       matched by CD
500: D        matched by D?C{0} (with D there)
600: DC       matched by D?C{1} (with D there)
700: DCC      matched by D?C{2} (with D there)
800: DCCC     matched by D?C{3} (with D there)
900: CM       matched by CM

(XC|XL|L?X{0,3})

Same rules as the previous section, but for the tens place:

 0: <empty>  matched by L?X{0} (with L not there)
10: X        matched by L?X{1} (with L not there)
20: XX       matched by L?X{2} (with L not there)
30: XXX      matched by L?X{3} (with L not there)
40: XL       matched by XL
50: L        matched by L?X{0} (with L there)
60: LX       matched by L?X{1} (with L there)
70: LXX      matched by L?X{2} (with L there)
80: LXXX     matched by L?X{3} (with L there)
90: XC       matched by XC

(IX|IV|V?I{0,3})

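And finally, this is the units section, handling 0 through 9. It is also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):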

0: <empty>  matched by V?I{0} (with V not there)
1: I        matched by V?I{1} (with V not there)
2: II       matched by V?I{2} (with V not there)
3: III      matched by V?I{3} (with V not there)
4: IV       matched by IV
5: V        matched by V?I{0} (with V there)
6: VI       matched by V?I{1} (with V there)
7: VII      matched by V?I{2} (with V there)
8: VIII     matched by V?I{3} (with V there)
9: IX       matched by IX
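The breakdown above can be checked mechanically. The sketch below rebuilds the pattern from its four parts and tests it against every numeral from 1 to 4999. The converter to_roman is a hypothetical helper written for this check, using the usual greedy subtractive table; it does not come from the original answer.

```python
import re

# The four sections described above, assembled into one anchored pattern.
THOUSANDS = r"M{0,4}"
HUNDREDS = r"(CM|CD|D?C{0,3})"
TENS = r"(XC|XL|L?X{0,3})"
UNITS = r"(IX|IV|V?I{0,3})"
ROMAN = re.compile("^" + THOUSANDS + HUNDREDS + TENS + UNITS + "$")

# Value/symbol pairs in descending order, including the subtractive pairs.
TABLE = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Hypothetical helper: standard numeral for 1 <= n <= 4999."""
    parts = []
    for value, symbol in TABLE:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


all_numerals = [to_roman(n) for n in range(1, 5000)]

# Every generated numeral is accepted by the pattern.
assert all(ROMAN.match(s) for s in all_numerals)
# The numerals are pairwise distinct, so 4999 different strings are accepted.
assert len(set(all_numerals)) == 4999
```

The tables list 5 choices for the thousands and 10 each for the hundreds, tens and units, so the pattern accepts 5 × 10 × 10 × 10 = 5000 strings in total: the 4999 numerals checked here plus the empty string.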

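Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".

The Romans were far less concerned about the "rules" than your third-grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4, and "IIM" was completely cool for 998.

(If you have trouble dealing with that... remember that English spelling was not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough.)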
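To make "as long as it added up" concrete, here is one possible lenient evaluator, sketched in Python. The rule it uses (a symbol is subtracted whenever a larger symbol appears anywhere to its right) is an assumption chosen so that the examples in this answer come out as stated. It is not a documented historical rule, and the function name lenient_value is illustrative.

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def lenient_value(numeral: str) -> int:
    """Evaluate a string of Roman symbols without enforcing the strict form.

    Assumed rule: scanning right to left, a symbol smaller than the largest
    symbol seen so far is subtracted; otherwise it is added.
    """
    total = 0
    largest_to_right = 0
    for ch in reversed(numeral):
        value = VALUES[ch]
        if value < largest_to_right:
            total -= value
        else:
            total += value
            largest_to_right = value
    return total


for s in ["XM", "CMXC", "IIII", "IV", "IIM"]:
    print(s, lenient_value(s))
```

Under this reading "XM" and "CMXC" both evaluate to 990, "IIII" and "IV" both to 4, and "IIM" to 998. A strict validator such as the regex in the other answer accepts only one spelling per value.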

To avoid matching the empty string, you need to repeat the pattern four times, replacing each 0 with a 1 in turn and accounting for V, L and D:

(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
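A small Python sketch to check that the four-way pattern behaves as described. The four alternatives are copied from the pattern above and joined with "|". The lookahead variant SHORT is a different, shorter technique that does not come from this answer; it is included only for comparison.

```python
import re

# The four alternatives from the pattern above, one per line for readability.
ALTERNATIVES = [
    r"M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})",
    r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})",
]
NONEMPTY = re.compile("(" + "|".join(ALTERNATIVES) + ")")

# Alternative technique (not from the answer): a lookahead that demands at
# least one numeral letter before the original pattern is tried.
SHORT = re.compile(
    r"(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)

samples = ["", "I", "V", "L", "D", "M", "IV", "CMXC", "MMMMCMXCIX", "XM", "IIII"]
for s in samples:
    long_ok = NONEMPTY.fullmatch(s) is not None
    short_ok = SHORT.fullmatch(s) is not None
    print(repr(s), long_ok, short_ok)
```

fullmatch() is used here in place of the ^ and $ anchors. On these samples both patterns agree: the empty string, "XM" and "IIII" are rejected and the rest are accepted.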

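In this case (when the pattern is anchored with ^ and $) you are better off checking for empty lines first and not bothering to match them. If you are using word boundaries then you don't have a problem, because there's no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)

In my own (real-world) case I needed to match numerals at word endings, and I found no other way around it. I needed to scrub the footnote numbers from a plain text document, where text such as "the Red Sea" and "the Great Barrier Reef" followed by footnote numerals had been converted to the Red Seacl and the Great Barrier Reefcli. But I still had problems with valid words: Tahiti and fantastic were scrubbed into Tahit and fantasti.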
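The scrubbing problem can be reproduced with a short Python sketch. The author's actual pattern is not shown, so this is a reconstruction under assumptions: lowercase numerals glued directly onto the end of a word, with a lookahead to force at least one numeral letter. The names NUMERAL, TRAILING and scrub are illustrative.

```python
import re

# Lowercase version of the numeral pattern, without anchors.
NUMERAL = r"m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})"

# A numeral that directly follows a letter and runs to the end of the word.
# The lookahead (?=[ivxlcdm]) rules out an empty match.
TRAILING = re.compile(r"(?<=[A-Za-z])(?=[ivxlcdm])" + NUMERAL + r"\b")


def scrub(word: str) -> str:
    """Remove a trailing lowercase Roman numeral from a word (sketch)."""
    return TRAILING.sub("", word)


for word in ["Seacl", "Reefcli", "Tahiti", "fantastic"]:
    print(word, "->", scrub(word))
```

The footnote markers are removed as intended ("Seacl" becomes "Sea", "Reefcli" becomes "Reef"), but so are the final letters of ordinary words: "Tahiti" becomes "Tahit" and "fantastic" becomes "fantasti". Any word that happens to end in a numeral letter is affected in the same way, so a pattern alone cannot tell a footnote marker from a normal word ending.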
