Matching Unicode letter characters in PCRE/PHP

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:

// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";

This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.

Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than it actually does?

Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF-8 encoding on the form page.


I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.

Your regex should be:

// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
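
As a minimal sketch of how the corrected pattern behaves, assuming the script file is saved as UTF-8, the input arrives as UTF-8, and PHP 7 or later is available (isValidName is a hypothetical helper name, not something from the question):

```php
<?php
// Hypothetical helper wrapping the corrected pattern.
function isValidName(string $name): bool
{
    // Single-quoted string: \' is a literal apostrophe and \p{L} reaches PCRE intact.
    // The trailing u makes PCRE read both pattern and subject as UTF-8.
    return preg_match('/^[-\' \p{L}]+$/u', $name) === 1;
}

var_dump(isValidName("O'Brien"));   // bool(true)
var_dump(isValidName('Ăna-Maria')); // bool(true)
var_dump(isValidName('张伟'));       // bool(true)
var_dump(isValidName('R2-D2'));     // bool(false): digits are not in the class
```

Comparing against 1 also treats a false return (a PCRE error, such as invalid UTF-8 in the subject) as "not valid", which seems the safer default for a validator.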

First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these patterns: you then need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, find out its Unicode code point and look it up at http://www.fileformat.info/info/unicode/ to see which category it belongs to. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when debugging UTF-8 properties (don't forget to convert to hex before looking anything up: array_map('dechex', utf8ToUnicode($text)) ).
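
To illustrate why combining marks matter, here is a small sketch, assuming PHP 7 or later (for the \u{...} escape); the variable names are made up for the example. A name typed in decomposed form stores the accent as a separate code point, which \p{L} alone does not cover:

```php
<?php
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
$decomposed = "Jose\u{0301}";

$lettersOnly     = '/^[-\' \p{L}]+$/u';
$lettersAndMarks = '/^[-\' \p{L}\p{M}]+$/u';

var_dump(preg_match($lettersOnly, $decomposed));     // int(0): U+0301 is a mark, not a letter
var_dump(preg_match($lettersAndMarks, $decomposed)); // int(1)
```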

For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm, which is in category Lu, so \p{L} should match it, and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm, which is also a letter, and it indeed matches for me. Do you have the Unicode character tables compiled into your PCRE build?
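
If you would rather not pull in an extra library, the same lookup can be sketched with built-in functions. This assumes the mbstring extension and PHP 7.2 or later (for mb_ord); hexCodePoints is a hypothetical helper, not the utf8ToUnicode function mentioned above. The last two lines are a rough probe for whether \p{...} works at all on your build:

```php
<?php
// Hypothetical helper: hex code points of a UTF-8 string (needs mbstring).
function hexCodePoints(string $text): array
{
    // Split into characters with a UTF-8-aware regex, then convert each to hex.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(function (string $ch): string {
        return dechex(mb_ord($ch, 'UTF-8'));
    }, $chars);
}

print_r(hexCodePoints('Ă张')); // the two elements are "102" and "5f20"

// Rough probe: false here suggests PCRE lacks Unicode property support
// (or the u modifier is unavailable); warnings are suppressed on purpose.
$hasUnicodeProperties = @preg_match('/^\p{Lu}$/u', 'Ă') === 1;
var_dump($hasUnicodeProperties);
```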


If you want to replace an old pattern with a new one in Unicode text, you should write:

$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);

So the key here is the u modifier.

Note: your server's PHP version should be at least PHP 4.3.5,

as mentioned on php.net under Pattern Modifiers:

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

Thanks to AgreeOrNot, who gave me that key in the question "preg_replace match whole word in arabic".

I tried it and it worked on localhost, but it didn't work on the remote server. Then I found the php.net note about the u modifier and PHP 4.3.5, upgraded the PHP version, and it worked.

It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:

$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
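
A short sketch of what the whole-word behaviour looks like in practice, assuming a reasonably recent PHP (where, as far as I know, the u modifier also makes \b aware of Unicode letters) and a script file saved as UTF-8. The sample sentences and variable names are made up for illustration:

```php
<?php
$pattern = '/\bمرحبا بك\b/u';

// The phrase stands alone between spaces, so it is replaced.
$replaced = preg_replace($pattern, 'NEW', 'قال مرحبا بك يا صديقي');

// Here the phrase is followed directly by another letter (بكم),
// so the closing \b does not match and the text is left alone.
$unchanged = preg_replace($pattern, 'NEW', 'مرحبا بكم');

var_dump($replaced === 'قال NEW يا صديقي'); // bool(true)
var_dump($unchanged === 'مرحبا بكم');       // bool(true)
```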
