How to match and highlight all terms in any order from an array of strings?
The requirements are:
<u>
and </u>
here For the sake of simplicity answers can concentrate on case-insensitive search through a list including only ASCII characters and assume that the term delimiter is a plain space - ie. a query entered as "Foo bar baz" means the search terms are ['foo', 'bar', 'baz']
.
To clarify:
The final application is (perhaps not surprisingly) an autocomplete of sorts.
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)
jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlight
At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperform the others consistently in any realistic scenario I try though.
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
Viral Permutations
In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms separated by a capturing group to catch anything between terms - a search for "foo bar" results in (foo)(.*?)(bar)|(bar)(.*?)(foo)
.
The highlighting is then done in one pass with string.replace()
. If there is any change in the string we have a match.
Demo:
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';
t0 = performance.now();
//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&");
});
//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});
//Find all permutations
permuted = permutate_array(terms_esc);
//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}
try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
regex = new RegExp(regex_string, 'i');
//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];
//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);
//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}
//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}
//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';
//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);
//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];
//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;
//Lowercase the match
m_cased = m.toLowerCase();
//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}
//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>
I gave it a go but I'm not sure how much help this will be. My approach is similar to your Divide and Conquer technique.
Instead of biting off bits of the string I search for each term ahead of time and store up a collection of all the matches, recording the start and finish positions. If there aren't enough matches for a particular search term the algorithm immediately bails for that 'option'.
Once it has gathered together all of the possible matches it recursively attempts to find a combination that doesn't overlap. There's a lot of copying of data structures going on in that recursion and I suspect it could be optimized a lot better than I have it here. I can also only apologize for some of the variable names, I was struggling to come up with names that made any sense at all.
For certain test searches, such as anananan ...
it seems to cope better than the original Divide and Conquer technique but I suspect that may be just because of the early bail-out that is performed when insufficient matches are found for a particular search term. Without a large quantity of real data it's difficult to know where the really valuable optimizations would be.
function search() {
var options = [
'ababababa',
'United States',
'United Kingdom',
'Afghanistan',
'Aland Islands',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antarctica',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia, Plurinational State of',
'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina',
'Botswana',
'Bouvet Island',
'Brazil',
'British Indian Ocean Territory',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Cayman Islands',
'Central African Republic',
'Chad',
'Chile',
'China',
'Christmas Island',
'Cocos (Keeling) Islands',
'Colombia',
'Comoros',
'Congo',
'Congo, The Democratic Republic of The',
'Cook Islands',
'Costa Rica',
'Cote D'ivoire',
'Croatia',
'Cuba',
'Curacao',
'Cyprus',
'Czech Republic',
'Denmark',
'Djibouti',
'Dominica',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Equatorial Guinea',
'Eritrea',
'Estonia',
'Ethiopia',
'Falkland Islands (Malvinas)',
'Faroe Islands',
'Fiji',
'Finland',
'France',
'French Guiana',
'French Polynesia',
'French Southern Territories',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Gibraltar',
'Greece',
'Greenland',
'Grenada',
'Guadeloupe',
'Guam',
'Guatemala',
'Guernsey',
'Guinea',
'Guinea-bissau',
'Guyana',
'Haiti',
'Heard Island and Mcdonald Islands',
'Holy See (Vatican City State)',
'Honduras',
'Hong Kong',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran, Islamic Republic of',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jersey',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Korea, Democratic People's Republic of',
'Korea, Republic of',
'Kuwait',
'Kyrgyzstan',
'Lao People's Democratic Republic',
'Latvia',
'Lebanon',
'Lesotho',
'Liberia',
'Libya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Macao',
'Macedonia, The Former Yugoslav Republic of',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Martinique',
'Mauritania',
'Mauritius',
'Mayotte',
'Mexico',
'Micronesia, Federated States of',
'Moldova, Republic of',
'Monaco',
'Mongolia',
'Montenegro',
'Montserrat',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Caledonia',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Niue',
'Norfolk Island',
'Northern Mariana Islands',
'Norway',
'Oman',
'Pakistan',
'Palau',
'Palestinian Territory, Occupied',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Pitcairn',
'Poland',
'Portugal',
'Puerto Rico',
'Qatar',
'Reunion',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Barthelemy',
'Saint Helena, Ascension and Tristan da Cunha',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Martin (French part)',
'Saint Pierre and Miquelon',
'Saint Vincent and The Grenadines',
'Samoa',
'San Marino',
'Sao Tome and Principe',
'Saudi Arabia',
'Senegal',
'Serbia',
'Seychelles',
'Sierra Leone',
'Singapore',
'Sint Maarten (Dutch part)',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Georgia and The South Sandwich Islands',
'South Sudan',
'Spain',
'Sri Lanka',
'Sudan',
'Suriname',
'Svalbard and Jan Mayen',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'Taiwan, Province of China',
'Tajikistan',
'Tanzania, United Republic of',
'Thailand',
'Timor-leste',
'Togo',
'Tokelau',
'Tonga',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Turks and Caicos Islands',
'Tuvalu',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United States',
'United States Minor Outlying Islands',
'Uruguay',
'Uzbekistan',
'Vanuatu',
'Venezuela, Bolivarian Republic of',
'Viet Nam',
'Virgin Islands, British',
'Virgin Islands, U.S.',
'Wallis and Futuna',
'Western Sahara',
'Yemen',
'Zambia',
'Zimbabwe'
];
var terms = document.getElementById('search').value.trim().toLowerCase().split(/s+/);
if (!terms[0]) {
terms = [];
}
document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);
var startTime = performance.now();
// Term counts is a map storing how many times each search term appears in the query
var termCounts = {};
terms.forEach(function(term) {
termCounts[term] = (termCounts[term] || 0) + 1;
});
// An array of search terms with the duplicates removed
var uniqueTerms = Object.keys(termCounts);
// Loop through each option and map to either a highlight version or null
options = options.map(function(optionText) {
var matches = {},
lastMatchIndex = {},
option = optionText.toLowerCase();
uniqueTerms.forEach(function(term) {
// This array will be populated with start/end position of each match for this term
matches[term] = [];
// The index of the last match... which could be deduced from the matches but this is slightly easier
lastMatchIndex[term] = -1;
});
var incompleteMatchTerms = uniqueTerms.slice(),
nextMatchTerm;
// This is probably a premature optimization but doing it this
// way ensures we check that each search term occurs at least
// once as quickly as possible.
while (nextMatchTerm = incompleteMatchTerms.shift()) {
var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);
if (nextMatchIndex === -1) {
// Haven't found enough matches for this term, so the option doesn't match
if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
return null;
}
}
else {
// Found another match, put the term back on the queue
// for another round
incompleteMatchTerms.push(nextMatchTerm);
lastMatchIndex[nextMatchTerm] = nextMatchIndex;
matches[nextMatchTerm].push({
start: nextMatchIndex,
end: nextMatchIndex + nextMatchTerm.length
});
}
}
// Pass in the original array of terms... we attempt to highlight in the order of the original query
var highlights = performHighlight(terms, matches);
if (!highlights) {
return null;
}
// We need the highlights sorted so that we do the replacing from the end of the string
highlights.sort(function(h1, h2) {
return h2.start - h1.start;
});
highlights.forEach(function(highlight) {
optionText = optionText.slice(0, highlight.start)
+ '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
+ optionText.slice(highlight.end);
});
return optionText;
function performHighlight(terms, allMatches) {
// If there are no terms left to match we've got a hit
if (terms.length === 0) {
return [];
}
var nextTerms = terms.slice(),
term = nextTerms.shift(),
matches = allMatches[term].slice(),
match;
while (match = matches.shift()) {
var nextMatches = {};
// We need to purge any entries from nextMatches that overlap the current match
uniqueTerms.forEach(function(nextTerm) {
var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];
nextMatches[nextTerm] = nextMatch.filter(function(match2) {
return match.start >= match2.end || match.end <= match2.start;
});
});
var highlights = performHighlight(nextTerms, nextMatches);
if (highlights) {
highlights.push(match);
return highlights;
}
}
return null;
}
});
document.getElementById('results').innerHTML = options.map(function(option) {
if (option) {
return '<li>' + option + '</li>';
}
return '';
}).join('');
document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
I would suggest a slight variant on the divide and conquer idea: instead of splitting the string into chunks (bites), you could "wipe out" the characters that have been matched, and perform further searches on that one string. The character to wipe out with would be the separator, as it is guaranteed to not occur in any of the terms.
Here it is:
function trincotWipeSearch(query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
wipe: separator.repeat(term.length),
regex: new RegExp(term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&"), 'gi')
}));
function getOffsets(termIndex, text) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, wipe } = items[termIndex];
regex.lastIndex = 0;
while (match = regex.exec(text)) {
let index = match.index;
// Wipe characters and recurse to find other terms
let offsets = getOffsets(termIndex+1,
text.substr(0, index) + wipe + text.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
regex.lastIndex = match.index + 1;
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option);
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotWipeSearch(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
Divide and Conquer
Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.
Each time a match is found it divides that "bite" into three (unless at the beginning or end), marking the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".
When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").
If it comes back to the longest term and cannot find another position to match it at, it will return false.
This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.
Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
Demo:
Updated for improved performance : The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice()
that could be avoided completely.
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[]{}()*+?.,^$|#s]/g, "$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People's Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People's Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
链接地址: http://www.djcxy.com/p/40088.html