如何从字符串数组中按任何顺序匹配和突出显示所有术语？

2018-06-14 01:53:12

要求是：

从数组中查找字符串（从这里称为选项），其中包含任何顺序的所有术语

正确突出显示匹配的术语 - 即。在每个匹配词语前后插入一个字符串 - 我在这里使用<u>和</u>

搜索查询和选项都可以是“任何”

为了简单起见，答案可以专注于通过仅包括ASCII字符的列表进行不区分大小写的搜索，并假定术语分隔符是简单的空格 - 即。作为“Foo bar baz”输入的查询意味着搜索条件是['foo', 'bar', 'baz'] 。

澄清：

所有条款必须在匹配选项中彼此分开存在 - 即。较短的术语不应只作为较长期术语的子字符串存在，也不应该有两个术语应该重叠

重复搜索条件必须至少与查询中一样多

最终的应用程序（可能并不奇怪）是一种自动完成的种类。

TL; DR

最近的小提琴并排比较所提出的算法：
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
（如果您添加新算法，请随时更新此链接）

jsPerf以比较现实/具有代表性的方式比较算法 - 几个字符串基本上每次在每个代表中“输入”一个字符：
https://jsperf.com/comparison-of-algorithms-to-search-and-highlight

在这一点上，很明显（感谢trincot卓越的基础比较），原始实现使用的大部分时间都花在了DOM输出上。它的重要性已经尽可能地在小提琴中最小化。

在每次搜索中，算法之间的性能仍然存在明显差异，但是在每次击键时，其中一个算法的算法一直都不是最快的。在重新审视和清理我自己的“分而治之”之后，它在我尝试的任何现实场景中都会一直超越其他人。

Tigregalis介绍了预运行优化的想法，这似乎是合理的，因为选项不可能在击键之间改变。我在这里为所有方法添加了（一个函数）。我从中看到明显的好处的唯一算法是在Skirtle's Permutations中，但我会鼓励每个回答者考虑它是否对他们自己的算法有用。

一些算法比其他算法更容易适应。我仍然认为，这比实际执行中的次要性能差异更重要。

请注意，当前版本的Tigregalis'收缩集有一个错误 - 我已经将它从小提琴和jsperf中排除，直到它被修复。

病毒排列

理论上这可以通过“手动”构建RegExp来解决，该RegExp包含捕获组分隔的搜索项的每个可能的排列，以捕获术语之间的任何内容 - 在(foo)(.*?)(bar)|(bar)(.*?)(foo)搜索“foo bar” (foo)(.*?)(bar)|(bar)(.*?)(foo) 。

然后使用string.replace()一次完成突出显示。如果字符串中有任何更改，我们会匹配。

演示：

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
    var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
    meta_elm = document.getElementById('viral_permutations_meta');
    res_elm = document.getElementById('viral_permutations_result');
    res_elm.innerHTML = meta_elm.innerHTML = '';
    
    t0 = performance.now();
    
    //Split query in terms at delimiter and lowercase them
    terms = document.getElementById('viral_permutations').value.split(/s/).filter(function(n) {
        return n;
    }).map(function(term){
        return term.toLowerCase();
    });
    meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
        
    //Escape terms
    terms_esc = terms.map(function(term) {
        return term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&");
    });
    
    //Wrap terms in in individual capturing groups
    terms_esc = terms.map(function(term) {
        return '(' + term + ')';
    });
    
    //Find all permutations
    permuted = permutate_array(terms_esc);
    
    //Construct a group for each permutation
    match_groups = [];
    for (var i in permuted) {
        match_groups.push(permuted[i].join('(.*?)'));
    }
    
    try {
        //Construct the final regex
        regex_string = match_groups.join('|');
        //Display it
        document.getElementById('viral_permutations_regex').innerHTML = regex_string;
        meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
        
        regex = new RegExp(regex_string, 'i');
    

        //Loop through each option
        for (i = 0; i < options.length; i++) {
            option = options[i];

            //Replace the terms with highlighted terms
            highlighted = option.replace(regex, viral_permutations_replace);

            //If anything was changed (or the query is empty) we have a match
            if (highlighted !== option || terms.length === 0) {
                //Append it to the result list
                li = document.createElement('li');
                li.innerHTML = highlighted;
                res_elm.appendChild(li);
            }
        }

        //Print some meta
        t1 = performance.now();
        meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
    } catch(e) {
        meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
    }
}

//The replacement function
function viral_permutations_replace() {
    var i, m, j, retval, m_cased, unmatched;
    retval = '';
    
    //Make a clone of the terms array (that we can modify without destroying the original)
    unmatched = terms.slice(0);
    
    //Loop arguments between the first (which is the full match) and
    //the last 2 (which are the offset and the full option)
    for (i = 1; i < arguments.length - 1; i++) {
        m = arguments[i];
        
        //Check that we have a string - most of the arguments will be undefined
        if (typeof m !== 'string') continue;
        
        //Lowercase the match
        m_cased = m.toLowerCase();
        
        //Append it to the return value - highlighted if it is among our terms
        j = unmatched.indexOf(m_cased);
        if (j >= 0) {
            //Remove it from the unmatched terms array
            unmatched.splice(j, 1);
            retval += '<u>' + m + '</u>';
        } else {
            retval += m;
        }
    }
    return retval;
}

//Helper function to return all possible permutations of an array
function permutate_array(arr) {
    var perm, perm_intern;
    perm_intern = function(perm, pre, post, n) {
        var elem, i, j, ref, rest;
        if (n > 0) {
            for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
                rest = post.slice(0);
                elem = rest.splice(i, 1);
                perm_intern(perm, pre.concat(elem), rest, n - 1);
            }
        } else {
            perm.push(pre);
        }
    };
    perm = [];
    perm_intern(perm, [], arr, arr.length);
    return perm;
}
viral_permutations();

<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>

我给了它一个去，但我不知道这会有多大的帮助。我的方法类似于你的分而治之技术。

我没有咬掉字符串的某些部分，而是提前搜索每个术语，并存储所有匹配的集合，记录开始和结束位置。如果没有足够的特定搜索词匹配，算法会立即为该选项保留。

一旦将所有可能的匹配聚集在一起，就会递归地尝试找到不重叠的组合。在这种递归中有很多数据结构的复制，我怀疑它可以比我在这里优化得更好。我也只能为一些变量名称道歉，我一直在努力想出一些有意义的名字。

对于某些测试搜索，如anananan ...似乎应付得比原来的分而治之技术要好，但我怀疑这可能只是因为在特定搜索词找到不足匹配时执行的早期纾困。如果没有大量的实际数据，很难知道真正有价值的优化在哪里。

function search() {
    var options = [
        'ababababa',
        'United States',
        'United Kingdom',
        'Afghanistan',
        'Aland Islands',
        'Albania',
        'Algeria',
        'American Samoa',
        'Andorra',
        'Angola',
        'Anguilla',
        'Antarctica',
        'Antigua and Barbuda',
        'Argentina',
        'Armenia',
        'Aruba',
        'Australia',
        'Austria',
        'Azerbaijan',
        'Bahamas',
        'Bahrain',
        'Bangladesh',
        'Barbados',
        'Belarus',
        'Belgium',
        'Belize',
        'Benin',
        'Bermuda',
        'Bhutan',
        'Bolivia, Plurinational State of',
        'Bonaire, Sint Eustatius and Saba',
        'Bosnia and Herzegovina',
        'Botswana',
        'Bouvet Island',
        'Brazil',
        'British Indian Ocean Territory',
        'Brunei Darussalam',
        'Bulgaria',
        'Burkina Faso',
        'Burundi',
        'Cambodia',
        'Cameroon',
        'Canada',
        'Cape Verde',
        'Cayman Islands',
        'Central African Republic',
        'Chad',
        'Chile',
        'China',
        'Christmas Island',
        'Cocos (Keeling) Islands',
        'Colombia',
        'Comoros',
        'Congo',
        'Congo, The Democratic Republic of The',
        'Cook Islands',
        'Costa Rica',
        'Cote D'ivoire',
        'Croatia',
        'Cuba',
        'Curacao',
        'Cyprus',
        'Czech Republic',
        'Denmark',
        'Djibouti',
        'Dominica',
        'Dominican Republic',
        'Ecuador',
        'Egypt',
        'El Salvador',
        'Equatorial Guinea',
        'Eritrea',
        'Estonia',
        'Ethiopia',
        'Falkland Islands (Malvinas)',
        'Faroe Islands',
        'Fiji',
        'Finland',
        'France',
        'French Guiana',
        'French Polynesia',
        'French Southern Territories',
        'Gabon',
        'Gambia',
        'Georgia',
        'Germany',
        'Ghana',
        'Gibraltar',
        'Greece',
        'Greenland',
        'Grenada',
        'Guadeloupe',
        'Guam',
        'Guatemala',
        'Guernsey',
        'Guinea',
        'Guinea-bissau',
        'Guyana',
        'Haiti',
        'Heard Island and Mcdonald Islands',
        'Holy See (Vatican City State)',
        'Honduras',
        'Hong Kong',
        'Hungary',
        'Iceland',
        'India',
        'Indonesia',
        'Iran, Islamic Republic of',
        'Iraq',
        'Ireland',
        'Isle of Man',
        'Israel',
        'Italy',
        'Jamaica',
        'Japan',
        'Jersey',
        'Jordan',
        'Kazakhstan',
        'Kenya',
        'Kiribati',
        'Korea, Democratic People's Republic of',
        'Korea, Republic of',
        'Kuwait',
        'Kyrgyzstan',
        'Lao People's Democratic Republic',
        'Latvia',
        'Lebanon',
        'Lesotho',
        'Liberia',
        'Libya',
        'Liechtenstein',
        'Lithuania',
        'Luxembourg',
        'Macao',
        'Macedonia, The Former Yugoslav Republic of',
        'Madagascar',
        'Malawi',
        'Malaysia',
        'Maldives',
        'Mali',
        'Malta',
        'Marshall Islands',
        'Martinique',
        'Mauritania',
        'Mauritius',
        'Mayotte',
        'Mexico',
        'Micronesia, Federated States of',
        'Moldova, Republic of',
        'Monaco',
        'Mongolia',
        'Montenegro',
        'Montserrat',
        'Morocco',
        'Mozambique',
        'Myanmar',
        'Namibia',
        'Nauru',
        'Nepal',
        'Netherlands',
        'New Caledonia',
        'New Zealand',
        'Nicaragua',
        'Niger',
        'Nigeria',
        'Niue',
        'Norfolk Island',
        'Northern Mariana Islands',
        'Norway',
        'Oman',
        'Pakistan',
        'Palau',
        'Palestinian Territory, Occupied',
        'Panama',
        'Papua New Guinea',
        'Paraguay',
        'Peru',
        'Philippines',
        'Pitcairn',
        'Poland',
        'Portugal',
        'Puerto Rico',
        'Qatar',
        'Reunion',
        'Romania',
        'Russian Federation',
        'Rwanda',
        'Saint Barthelemy',
        'Saint Helena, Ascension and Tristan da Cunha',
        'Saint Kitts and Nevis',
        'Saint Lucia',
        'Saint Martin (French part)',
        'Saint Pierre and Miquelon',
        'Saint Vincent and The Grenadines',
        'Samoa',
        'San Marino',
        'Sao Tome and Principe',
        'Saudi Arabia',
        'Senegal',
        'Serbia',
        'Seychelles',
        'Sierra Leone',
        'Singapore',
        'Sint Maarten (Dutch part)',
        'Slovakia',
        'Slovenia',
        'Solomon Islands',
        'Somalia',
        'South Africa',
        'South Georgia and The South Sandwich Islands',
        'South Sudan',
        'Spain',
        'Sri Lanka',
        'Sudan',
        'Suriname',
        'Svalbard and Jan Mayen',
        'Swaziland',
        'Sweden',
        'Switzerland',
        'Syrian Arab Republic',
        'Taiwan, Province of China',
        'Tajikistan',
        'Tanzania, United Republic of',
        'Thailand',
        'Timor-leste',
        'Togo',
        'Tokelau',
        'Tonga',
        'Trinidad and Tobago',
        'Tunisia',
        'Turkey',
        'Turkmenistan',
        'Turks and Caicos Islands',
        'Tuvalu',
        'Uganda',
        'Ukraine',
        'United Arab Emirates',
        'United Kingdom',
        'United States',
        'United States Minor Outlying Islands',
        'Uruguay',
        'Uzbekistan',
        'Vanuatu',
        'Venezuela, Bolivarian Republic of',
        'Viet Nam',
        'Virgin Islands, British',
        'Virgin Islands, U.S.',
        'Wallis and Futuna',
        'Western Sahara',
        'Yemen',
        'Zambia',
        'Zimbabwe'
    ];

    var terms = document.getElementById('search').value.trim().toLowerCase().split(/s+/);

    if (!terms[0]) {
        terms = [];
    }

    document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);

    var startTime = performance.now();

    // Term counts is a map storing how many times each search term appears in the query
    var termCounts = {};

    terms.forEach(function(term) {
        termCounts[term] = (termCounts[term] || 0) + 1;
    });

    // An array of search terms with the duplicates removed
    var uniqueTerms = Object.keys(termCounts);

    // Loop through each option and map to either a highlight version or null
    options = options.map(function(optionText) {
        var matches = {},
            lastMatchIndex = {},
            option = optionText.toLowerCase();

        uniqueTerms.forEach(function(term) {
            // This array will be populated with start/end position of each match for this term
            matches[term] = [];

            // The index of the last match... which could be deduced from the matches but this is slightly easier
            lastMatchIndex[term] = -1;
        });

        var incompleteMatchTerms = uniqueTerms.slice(),
            nextMatchTerm;

        // This is probably a premature optimization but doing it this
        // way ensures we check that each search term occurs at least
        // once as quickly as possible.
        while (nextMatchTerm = incompleteMatchTerms.shift()) {
            var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);

            if (nextMatchIndex === -1) {
                // Haven't found enough matches for this term, so the option doesn't match
                if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
                    return null;
                }
            }
            else {
                // Found another match, put the term back on the queue
                // for another round
                incompleteMatchTerms.push(nextMatchTerm);
                lastMatchIndex[nextMatchTerm] = nextMatchIndex;

                matches[nextMatchTerm].push({
                    start: nextMatchIndex,
                    end: nextMatchIndex + nextMatchTerm.length
                });
            }
        }

        // Pass in the original array of terms... we attempt to highlight in the order of the original query
        var highlights = performHighlight(terms, matches);

        if (!highlights) {
            return null;
        }

        // We need the highlights sorted so that we do the replacing from the end of the string
        highlights.sort(function(h1, h2) {
            return h2.start - h1.start;
        });

        highlights.forEach(function(highlight) {
            optionText = optionText.slice(0, highlight.start)
                    + '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
                    + optionText.slice(highlight.end);
        });

        return optionText;

        function performHighlight(terms, allMatches) {
            // If there are no terms left to match we've got a hit
            if (terms.length === 0) {
                return [];
            }

            var nextTerms = terms.slice(),
                term = nextTerms.shift(),
                matches = allMatches[term].slice(),
                match;

            while (match = matches.shift()) {
                var nextMatches = {};

                // We need to purge any entries from nextMatches that overlap the current match
                uniqueTerms.forEach(function(nextTerm) {
                    var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];

                    nextMatches[nextTerm] = nextMatch.filter(function(match2) {
                        return match.start >= match2.end || match.end <= match2.start;
                    });
                });

                var highlights = performHighlight(nextTerms, nextMatches);

                if (highlights) {
                    highlights.push(match);

                    return highlights;
                }
            }

            return null;
        }
    });

    document.getElementById('results').innerHTML = options.map(function(option) {
        if (option) {
            return '<li>' + option + '</li>';
        }

        return '';
    }).join('');

    document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}

<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>

我建议在分而治之思想上略有变化：不要将字符串拆分成块（块），而是可以“擦除”已匹配的字符，并对该字符串进行进一步搜索。要消灭的角色将是分隔符，因为它保证不会在任何条款中出现。

这里是：

function trincotWipeSearch(query, options, separator) {
    // Split query in terms at delimiter
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;
    // Sort terms by descending size
    terms.sort( (a,b) => b.length - a.length );

    // Escape terms, and enrich with size of original term
    // and a string of the same length filled with the separator char
    const items = terms.map(term => ({
        size: term.length,
        wipe: separator.repeat(term.length), 
        regex: new RegExp(term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&"), 'gi')
    }));

    function getOffsets(termIndex, text) {
        // All terms found?
        if (termIndex >= terms.length) return [];
        let match;
        const { regex, size, wipe } = items[termIndex];
        regex.lastIndex = 0;
        while (match = regex.exec(text)) {
            let index = match.index;
            // Wipe characters and recurse to find other terms
            let offsets = getOffsets(termIndex+1, 
                text.substr(0, index) + wipe + text.substr(index + size));
            if (offsets !== undefined) { 
                // Solution found, backtrack all the way
                return offsets.concat([index, index + size]);
            }
            regex.lastIndex = match.index + 1;
        }
    }

    // Loop through each option
    return options.map( option => {
        // Get the offsets of the matches
        let offsets = getOffsets(0, option);
        if (offsets) {
            // Apply the offsets to add the markup
            offsets
                .sort( (a,b) => b - a )
                .map((index, i) => {
                    option = option.substr(0, index) 
                        + (i%2 ? "<u>" : "</u>")
                        + option.substr(index);
                });
            return option;
        }
    }).filter(Boolean); // get only the non-empty results
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements 
 */

function processInput() {
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotWipeSearch(query, options, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);

ul { 
    height:300px;
    font-size: smaller;
    overflow: auto;
}

Input terms: <input type="text" id="findTerms"><br>

<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>

分而治之

比单一正则表达式Viral Permutations策略稍微复杂一些 - 这种递归算法从最长期开始依次搜索每个术语。

每个发现匹配时它划分该“咬”成三个（除非在开始或结束），标志着匹配的“咬”所消耗，并试图在任何未消耗“咬”的匹配的下最长术语。

当它无法找到最长的不匹配的术语时，它将回溯并试图在不同的位置匹配上一个术语（即使在不同的“咬合”中）。

如果它回到最长期，并且找不到另一个位置来匹配它，它将返回错误。

这意味着在大多数情况下，它可以很快返回非匹配，因为它们甚至不包含最长的术语。

当然，如果它用完了 - 即。成功匹配最短 - 它会通过将所有“咬”连接在一起返回突出显示的匹配。

演示：

为了改进性能而更新：基本算法完全相同，但是对arr.slice()调用相当昂贵，可以完全避免。

function divide_and_conquer_replace(query, options, separator) {
    var terms, terms_esc;

    //The inner replacement function
    function divide_and_conquer_inner(bites, depth) {
        var this_term, i, bite, match, new_bites, found_all_others;

        depth = depth ? depth : 1;

        //Get the longest remaining term
        this_term = terms_esc[terms_esc.length - depth];

        //Loop all the bites
        for (i = 0; i < bites.length; i++) {
            bite = bites[i];

            //Reset the lastIndex since we're reusing the RegExp objects
            this_term.lastIndex = 0;

            //Check that we have a string (ie. do not attempt to match bites
            //that are already consumed)
            if (typeof bite === 'string') {
                //Find the next matching position (if any)
                while (match = this_term.exec(bite)) {
                    new_bites = (i > 0) ? bites.slice(0, i) : [];
                    if (match.index > 0) {
                        new_bites.push(bite.slice(0, match.index));
                    }
                    new_bites.push(['<u>' + match[0] + '</u>']);
                    if (this_term.lastIndex < bite.length) {
                        new_bites.push(bite.slice(this_term.lastIndex));
                    }
                    if (i < bites.length - 1) {
                        new_bites = new_bites.concat(bites.slice(i + 1));
                    }

                    if (terms_esc.length > depth) {
                        //Attempt to find all other terms
                        found_all_others = divide_and_conquer_inner(new_bites, depth + 1);

                        //If we found all terms we'll pass the modified string all the
                        //way up to the original callee
                        if (found_all_others) {
                            return found_all_others;
                        }
                        //Otherwise try to match current term somewhere else
                        this_term.lastIndex = match.index + 1;
                    } else {
                        //If no terms remain we have a match
                        return new_bites.join('');
                    }
                }
            }
        }
        //If we reach this point at least one term was not found
        return null;
    };

    // Split query in terms at delimiter
    terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;


    //Sort terms according to length - longest term last
    terms.sort(function(a, b) {
        return a.length - b.length;
    });

    //Escape terms
    //And store RegExp's instead of strings
    terms_esc = terms.map(function (term) {
        return term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&");
    }).map(function (term) {
        return new RegExp(term, 'gi');
    });

    //Loop through each option
    return options.map(function(option){
        return divide_and_conquer_inner([option]);
    }).filter(Boolean);
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';

function divide_and_conquer(){
    var query = document.getElementById('divide_and_conquer').value;
    var res_elm = document.getElementById('divide_and_conquer_result');
    var t0 = performance.now();
    var results = divide_and_conquer_replace(query, options, separator);
    var t1 = performance.now();
    document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
    res_elm.innerHTML = '';
    for (var result of results) {
        res_elm.innerHTML += '<li>' + result + '</li>';
    }
};
divide_and_conquer();

<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>

链接地址: http://www.djcxy.com/p/40087.html

上一篇: How to match and highlight all terms in any order from an array of strings?

下一篇: Generalized Suffix Tree Java Implementation