Case-insensitive string comparison in C++

2018-06-03 22:22:03

What is the best way of doing case-insensitive string comparison in C++ without transforming a string to all uppercase or all lowercase?

Please indicate whether the methods are Unicode-friendly and how portable they are.

Boost includes a handy algorithm for this:

#include <string>
#include <boost/algorithm/string.hpp>
// Or, for fewer header dependencies:
//#include <boost/algorithm/string/predicate.hpp>

std::string str1 = "hello, world!";
std::string str2 = "HELLO, WORLD!";

if (boost::iequals(str1, str2))
{
    // Strings are equal, ignoring case
}
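
Where Boost is not available, a similar check can be sketched with the standard library alone. The helper name iequals_bytes below is hypothetical (it is not a Boost or standard function), and the sketch assumes narrow strings compared one byte at a time with std::tolower, so it depends on the current C locale and is not Unicode-aware. Neither string is copied or transformed.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper, not part of Boost or the standard library.
// Compares two narrow strings byte by byte, ignoring case as defined
// by std::tolower in the current C locale.
bool iequals_bytes(const std::string& a, const std::string& b) {
    // Strings of different length can never match.
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      // unsigned char parameters keep std::tolower away from
                      // negative values, which would be undefined behaviour.
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}
```

Calling iequals_bytes(str1, str2) with the two strings above should return true. Because it works per byte, it cannot fold multi-byte UTF-8 characters.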

Take advantage of the standard char_traits. Recall that std::string is in fact a typedef for std::basic_string<char>, or more explicitly, std::basic_string<char, std::char_traits<char> >. The char_traits type describes how characters compare, how they are copied, how they are converted, and so on. All you need to do is typedef a new string type over basic_string and give it your own custom char_traits that compares case-insensitively.

#include <cctype>
#include <cstddef>
#include <string>

struct ci_char_traits : public std::char_traits<char> {
    // Fold via unsigned char: passing a negative char to toupper is undefined.
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }
    static bool eq(char c1, char c2) { return up(c1) == up(c2); }
    static bool ne(char c1, char c2) { return up(c1) != up(c2); }
    static bool lt(char c1, char c2) { return up(c1) <  up(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        while( n-- != 0 ) {
            if( up(*s1) < up(*s2) ) return -1;
            if( up(*s1) > up(*s2) ) return 1;
            ++s1; ++s2;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        // Return a null pointer when there is no match, as char_traits requires.
        while( n-- != 0 ) {
            if( up(*s) == up(a) ) return s;
            ++s;
        }
        return nullptr;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

The details are in Guru of the Week (GotW) #29.
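
As a rough illustration of how such a type behaves in use, the sketch below repeats the traits in condensed form so that it compiles on its own. The function name ci_string_demo and the sample strings are made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

// Condensed copy of the traits above so this block compiles on its own.
struct ci_char_traits : public std::char_traits<char> {
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }
    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) <  up(b); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (; n != 0; --n, ++s1, ++s2) {
            if (up(*s1) < up(*s2)) return -1;
            if (up(*s1) > up(*s2)) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        for (; n != 0; --n, ++s) {
            if (up(*s) == up(a)) return s;
        }
        return nullptr;  // no match
    }
};
typedef std::basic_string<char, ci_char_traits> ci_string;

// Hypothetical demo function, named here only for illustration.
bool ci_string_demo() {
    ci_string a = "Hello, World!";
    ci_string b = "hELLO, wORLD!";
    assert(a == b);                // equality goes through ci_char_traits
    assert(a.find("WORLD") == 7);  // searching ignores case as well
    // std::cout << a would not compile: the stream uses std::char_traits<char>.
    std::cout << a.c_str() << '\n';
    return a < ci_string("hello, world!!");  // equal prefix, shorter string sorts first
}
```

Two trade-offs are worth keeping in mind: ci_string is a different type from std::string, so the two do not convert implicitly, and stream output needs c_str() or a custom operator<<. The comparison is still per char, so it is no more Unicode-aware than toupper itself.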

Are you talking about a dumb case insensitive compare or a full normalized Unicode compare?

A dumb compare will not match strings that are equivalent but not binary equal.

Example:

U+212B (ANGSTROM SIGN)
U+0041 (LATIN CAPITAL LETTER A) + U+030A (COMBINING RING ABOVE)
U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE)

All three are equivalent, yet each has a different binary representation.
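
To make the mismatch concrete, here is a small sketch that spells out the three forms as UTF-8 bytes and compares them with plain std::string equality. The variable and function names are invented for the example, and UTF-8 is an assumption (the same point holds for UTF-16 or UTF-32).

```cpp
#include <cassert>
#include <string>

// UTF-8 encodings written as explicit bytes so the source file encoding
// does not matter. Names are illustrative only.
const std::string angstrom_sign = "\xE2\x84\xAB";  // U+212B
const std::string a_plus_ring   = "A\xCC\x8A";     // U+0041 followed by U+030A
const std::string a_with_ring   = "\xC3\x85";      // U+00C5

// True when no two of the three byte sequences compare equal.
bool all_differ_bytewise() {
    return angstrom_sign != a_plus_ring &&
           a_plus_ring   != a_with_ring &&
           angstrom_sign != a_with_ring;
}
```

A byte-wise or per-char comparison, case-insensitive or not, therefore treats these as three different strings. Matching them requires normalizing both sides first.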

That said, Unicode Normalization should be mandatory reading, especially if you plan on supporting Hangul, Thai and other Asian languages.

Also, IBM pretty much patented most of the optimized Unicode algorithms and made them publicly available. They also maintain an implementation: IBM ICU (International Components for Unicode).
