How can I detect a palindrome in Hebrew?

2018-06-06 20:36:16

I am writing a series of tests for a palindrome solver. I came across this interesting palindrome in Hebrew:

טעםלפתתפלמעט

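This is a palindrome, but the letter Mem has both a regular form (מ) and a "final form" (ם), which is how it appears as the last letter of a word. Short of hardcoding "0x5de => 0x5dd" in my program, I could not find a way to rely programmatically on Unicode, Python, or a library that would treat the two as the same letter. Things I tried: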

s = 'טעם לפת תפל מעט'
s.casefold() # Python 3.4
s.lower()
s.upper()
import unicodedata
unicodedata.normalize(...) # In case this functioned like a German Eszett

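All of them return the same string unchanged. The other Hebrew letters that cause this problem (in case someone searches for this later) are Kaf, Nun, Peh, and Tsadeh. No, I am not a native speaker of Hebrew.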
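For reference, the hardcoded fallback mentioned above could look roughly like the sketch below. It is only a baseline: the table name `FINAL_TO_REGULAR` and the function `is_palindrome_hardcoded` are made up for illustration, and the table covers just the five basic final letters from the Hebrew block, nothing from the presentation forms.

```python
# Hypothetical baseline: an explicit table of the five Hebrew final forms.
FINAL_TO_REGULAR = str.maketrans({
    "\u05da": "\u05db",  # FINAL KAF   -> KAF
    "\u05dd": "\u05de",  # FINAL MEM   -> MEM
    "\u05df": "\u05e0",  # FINAL NUN   -> NUN
    "\u05e3": "\u05e4",  # FINAL PE    -> PE
    "\u05e5": "\u05e6",  # FINAL TSADI -> TSADI
})


def is_palindrome_hardcoded(text):
    """Fold final forms to regular forms, then compare with the reverse."""
    folded = text.translate(FINAL_TO_REGULAR)
    return folded == folded[::-1]


# The palindrome from the question, written with escapes to avoid bidi issues.
sample = "\u05d8\u05e2\u05dd\u05dc\u05e4\u05ea\u05ea\u05e4\u05dc\u05de\u05e2\u05d8"
print(is_palindrome_hardcoded(sample))
```

This works, but it is exactly the kind of hand-maintained mapping the question is trying to avoid.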

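You can make a slightly more "rigorous" answer (one that is less likely to give false positives and false negatives) with a bit more work. Note that Patrick Collin's answer could fail by matching lots of unrelated characters, because they share the last word of their Unicode data name.

One thing you can do is take a stricter approach to converting the final letters: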

import unicodedata

# Note the added accents
phrase = 'טעם̀ לפת תפל מ̀עט'

def convert_final_characters(phrase):
    for character in phrase:
        try:
            name = unicodedata.name(character)
        except ValueError:
            yield character
            continue

        if "HEBREW" in name and " FINAL" in name:
            try:
                yield unicodedata.lookup(name.replace(" FINAL", ""))
            except KeyError:
                # Fails for HEBREW LETTER WIDE FINAL MEM "ﬦ",
                # which has no non-final counterpart
                #
                # No failure if you first normalize to
                # HEBREW LETTER FINAL MEM "ם"
                yield character
        else:
            yield character

phrase = "".join(convert_final_characters(phrase))

phrase
#>>> 'טעמ̀ לפת תפל מ̀עט'

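This only looks for Hebrew characters whose name still resolves to a character once "FINAL" is dropped, and replaces them with that character.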
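The code comment about HEBREW LETTER WIDE FINAL MEM can be checked with a short sketch. It assumes compatibility normalization (NFKC) is acceptable for your input; NFKC also folds many other compatibility characters, so treat this as an illustration and not a recommendation. The variable names are made up for the example.

```python
import unicodedata

# U+FB26 HEBREW LETTER WIDE FINAL MEM has no "HEBREW LETTER WIDE MEM" partner,
# so dropping " FINAL" from its name directly raises KeyError on lookup.
wide_final_mem = "\ufb26"

# Compatibility normalization maps it to the ordinary final mem first.
normalized = unicodedata.normalize("NFKC", wide_final_mem)
print(unicodedata.name(normalized))

# After normalizing, the name-based conversion succeeds.
regular = unicodedata.lookup(unicodedata.name(normalized).replace(" FINAL", ""))
print(unicodedata.name(regular))
```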

You can then also split the result into graphemes using the "new" regex module on PyPI.

import regex

# r"\X" matches one grapheme cluster (a base character plus its combining marks)
graphemes = regex.findall(r"\X", phrase)
graphemes
#>>> ['ט', 'ע', 'מ̀', ' ', 'ל', 'פ', 'ת', ' ', 'ת', 'פ', 'ל', ' ', 'מ̀', 'ע', 'ט']

graphemes == graphemes[::-1]
#>>> True

This takes care of accents and other combining characters.
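If installing the third-party regex module is not an option, the two steps can be roughly approximated with the standard library alone. The sketch below is a simplification, not real grapheme segmentation: it only attaches characters with a non-zero `unicodedata.combining()` class to the preceding base character, which covers accents and Hebrew points but not every cluster that `\X` handles. The names `strip_final`, `clusters`, and `is_hebrew_palindrome` are hypothetical.

```python
import unicodedata


def strip_final(ch):
    """Map a Hebrew final letter to its regular form; leave others unchanged."""
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unnamed character, e.g. a control code
        return ch
    if name.startswith("HEBREW LETTER FINAL "):
        try:
            return unicodedata.lookup(name.replace(" FINAL", "", 1))
        except KeyError:  # no regular counterpart under that name
            return ch
    return ch


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def is_hebrew_palindrome(text):
    """Compare clusters (not code points) after folding final letters."""
    units = clusters("".join(strip_final(ch) for ch in text))
    return units == units[::-1]


# The accented phrase from above, written with escapes to avoid bidi issues.
accented = ("\u05d8\u05e2\u05dd\u0300 \u05dc\u05e4\u05ea "
            "\u05ea\u05e4\u05dc \u05de\u0300\u05e2\u05d8")
print(is_hebrew_palindrome(accented))
```

Reversing by cluster matters here: reversing the raw code points would move each accent onto the wrong letter.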

Here is an ugly solution that works for your current problem:

import unicodedata 

def make_map(ss):
    return [unicodedata.name(s).split(' ')[-1] for s in ss]

def is_palindrome(ss):
    return make_map(ss) == make_map(reversed(ss))

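This depends on the format of the Hebrew character names in Python's lookup table, though, so it might not generalize perfectly.

Specifically, you have: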

In [29]: unicodedata.name(s[2])
Out[29]: 'HEBREW LETTER FINAL MEM'
...
In [31]: unicodedata.name(s[-3])
Out[31]: 'HEBREW LETTER MEM'

So stripping out everything except the last word of each name gives:

In [35]: [unicodedata.name(s_).split(" ")[-1] for s_ in s]
Out[35]: ['TET', 'AYIN', 'MEM', 'SPACE', 'LAMED', 'PE', 'TAV', 'SPACE', 'TAV', 'PE', 'LAMED', 'SPACE', 'MEM', 'AYIN', 'TET']

The reversed string gives the same list. Unicode is a big world, though, so I am not sure you couldn't construct an example that would outfox this.
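Such an example is easy to build, because many unrelated characters share the last word of their Unicode name. A small sketch, reusing make_map and is_palindrome as defined above; the test strings are chosen purely for illustration:

```python
import unicodedata


def make_map(ss):
    # Keep only the last word of each character's Unicode name.
    return [unicodedata.name(s).split(" ")[-1] for s in ss]


def is_palindrome(ss):
    return make_map(ss) == make_map(reversed(ss))


# 'LATIN SMALL LETTER A' and 'LATIN CAPITAL LETTER A' both end in 'A'.
print(make_map("aA"), is_palindrome("aA"))

# 'HEBREW LETTER ALEF' and 'ARABIC LETTER ALEF' both end in 'ALEF'.
mixed = "\u05d0\u0627"
print(make_map(mixed), is_palindrome(mixed))
```

Both strings are reported as palindromes even though neither reads the same in both directions, so this approach is only safe when the input is known to be plain Hebrew letters.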
