How can I detect a palindrome in Hebrew?

2018-06-06 20:36:16

I am writing a series of tests for a palindrome solver. I came across the interesting palindrome in Hebrew:

טעם לפת תפל מעט

This is a palindrome, but the letter Mem has both a regular form (מ) and a "final form" (ם), which is how it appears as the last letter of a word. Short of hardcoding "0x5de => 0x5dd" in my program, I could not find a way to rely on Unicode, Python, or a library to treat the two as the same. Things I tried:

s = 'טעם לפת תפל מעט'
s.casefold() # Python 3.4
s.lower()
s.upper()
import unicodedata
unicodedata.normalize(...) # In case this functioned like a German Eszett

All yielded the same string. Other Hebrew letters that would cause this problem (in case someone searches for this later) would be Kaf, Nun, Peh, and Tsadeh. No, I am not a native speaker of Hebrew.

You can make a slightly more "rigorous" answer (one that's less likely to give false positives and false negatives) with a little more work. Note that Patrick Collins's answer could fail by matching lots of unrelated characters, because it compares only the last word of each character's Unicode name.

One thing you can do is take a stricter approach to converting final letters:

import unicodedata

# Note the added accents
phrase = 'טעם̀ לפת תפל מ̀עט'

def convert_final_characters(phrase):
    for character in phrase:
        try:
            name = unicodedata.name(character)
        except ValueError:
            yield character
            continue

        if "HEBREW" in name and " FINAL" in name:
            try:
                yield unicodedata.lookup(name.replace(" FINAL", ""))
            except KeyError:
                # Fails for HEBREW LETTER WIDE FINAL MEM "ﬦ",
                # which has no non-final counterpart
                #
                # No failure if you first normalize to
                # HEBREW LETTER FINAL MEM "ם"
                yield character
        else:
            yield character

phrase = "".join(convert_final_characters(phrase))

phrase
#>>> 'טעמ̀ לפת תפל מ̀עט'

This just looks for Hebrew characters whose Unicode name contains "FINAL" and replaces each one with the character named by dropping that word, if such a character exists.
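For comparison, the hardcoded mapping the question hoped to avoid is short, since the main Hebrew block has only five final forms. A minimal sketch, assuming only those five letters matter; the name FINAL_TO_REGULAR is made up for illustration, and presentation forms such as the wide final Mem are not covered:

```python
# Hypothetical translation table: final form -> regular form.
FINAL_TO_REGULAR = str.maketrans({
    "\u05da": "\u05db",  # FINAL KAF   -> KAF
    "\u05dd": "\u05de",  # FINAL MEM   -> MEM
    "\u05df": "\u05e0",  # FINAL NUN   -> NUN
    "\u05e3": "\u05e4",  # FINAL PE    -> PE
    "\u05e5": "\u05e6",  # FINAL TSADI -> TSADI
})

phrase = 'טעם לפת תפל מעט'
folded = phrase.translate(FINAL_TO_REGULAR)
print(folded == folded[::-1])
#>>> True
```

The table is explicit about what it changes, at the cost of having to be maintained by hand; the name-based approach above discovers the pairs from the Unicode database instead.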

You can then also convert to graphemes using the "new" regex module on PyPI.

import regex

# r"\X" matches graphemes
graphemes = regex.findall(r"\X", phrase)
graphemes
#>>> ['ט', 'ע', 'מ̀', ' ', 'ל', 'פ', 'ת', ' ', 'ת', 'פ', 'ל', ' ', 'מ̀', 'ע', 'ט']

graphemes == graphemes[::-1]
#>>> True

This deals with accents and other combining characters.
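If installing regex is not an option, the same idea can be roughly approximated with the standard library alone. The sketch below is an assumption-laden stand-in, not real grapheme segmentation: it attaches every character with a non-zero canonical combining class to the preceding base character, which handles accents and Hebrew points but not every cluster that \X recognizes. The helper names fold_final and clusters are hypothetical.

```python
import unicodedata


def fold_final(ch):
    """Map a Hebrew final-form letter to its regular form, else return ch."""
    name = unicodedata.name(ch, "")
    if name.startswith("HEBREW LETTER FINAL "):
        try:
            return unicodedata.lookup(name.replace(" FINAL", "", 1))
        except KeyError:
            # No character with the shortened name; leave it alone.
            return ch
    return ch


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: relies on unicodedata.combining(), so marks with
    combining class 0 are treated as separate units.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def is_palindrome(text):
    units = clusters("".join(fold_final(ch) for ch in text))
    return units == units[::-1]


print(is_palindrome('טעם̀ לפת תפל מ̀עט'))
```

Reversing the list of clusters rather than the raw string keeps each accent attached to its letter, which is the point of the grapheme step above.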

Here's an ugly solution that works for your current issue:

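import unicodedata

def make_map(ss):
    # Reduce each character to the last word of its Unicode name,
    # e.g. 'HEBREW LETTER FINAL MEM' -> 'MEM'.
    return [unicodedata.name(s).split(' ')[-1] for s in ss]

def is_palindrome(ss):
    # Compare the reduced names forwards and backwards.
    return make_map(ss) == make_map(reversed(ss))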

This relies on how Hebrew characters happen to be named in the Unicode database that Python exposes through unicodedata, though, so it might not generalize perfectly.

Specifically, you have:

In [29]: unicodedata.name(s[2])
Out[29]: 'HEBREW LETTER FINAL MEM'
...
In [31]: unicodedata.name(s[-3])
Out[31]: 'HEBREW LETTER MEM'

So stripping out all but the last word gives you:

In [35]: [unicodedata.name(s_).split(" ")[-1] for s_ in s]
Out[35]: ['TET', 'AYIN', 'MEM', 'SPACE', 'LAMED', 'PE', 'TAV', 'SPACE', 'TAV', 'PE', 'LAMED', 'SPACE', 'MEM', 'AYIN', 'TET']

with the same in reverse. Unicode is a big world, though, so there may well be examples that defeat this.
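Two such examples are easy to find. The sketch below repeats the two functions so it runs on its own; the test strings are my own choices, not taken from the question. Letters from different scripts can share the last word of their names, and characters with no name at all (a newline, for instance) make unicodedata.name raise ValueError.

```python
import unicodedata


def make_map(ss):
    return [unicodedata.name(s).split(' ')[-1] for s in ss]


def is_palindrome(ss):
    return make_map(ss) == make_map(reversed(ss))


# HEBREW LETTER ALEF followed by ARABIC LETTER ALEF: two different
# letters whose names both end in 'ALEF'.
mixed = "\u05d0\u0627"
print(make_map(mixed))
#>>> ['ALEF', 'ALEF']
print(is_palindrome(mixed))
#>>> True

# A newline has no Unicode name, so the lookup raises ValueError.
try:
    is_palindrome("\n")
except ValueError as exc:
    print("ValueError:", exc)
```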
