Finding anagrams for a given word

2018-06-05 18:20:15

Two words are anagrams if one of them has exactly same characters as that of the another word.

Example : Anagram & Nagaram are anagrams (case-insensitive).

Now there are many questions similar to this . A couple of approaches to find whether two strings are anagrams are :

1) Sort the strings and compare them.

2) Create a frequency map for these strings and check if they are the same or not.

But in this case , we are given with a word (for the sake of simplicity let us assume a single word only and it will have single word anagrams only) and we need to find anagrams for that.

Solution which I have in mind is that , we can generate all permutations for the word and check which of these words exist in the dictionary . But clearly , this is highly inefficient. Yes , the dictionary is available too.

So what alternatives do we have here ?

I also read in a similar thread that something can be done using Tries but the person didn't explain as to what the algorithm was and why did we use a Trie in first place , just an implementation was provided that too in Python or Ruby. So that wasn't really helpful which is why I have created this new thread. If someone wants to share their implementation (other than C,C++ or Java) then kindly explain it too.

Example algorithm:

Open dictionary
Create empty hashmap H
For each word in dictionary:
  Create a key that is the word's letters sorted alphabetically (and forced to one case)
  Add the word to the list of words accessed by the hash key in H

To check for all anagrams of a given word:

Create a key that is the letters of the word, sorted (and forced to one case)
Look up that key in H
You now have a list of all anagrams

Relatively fast to build, blazingly fast on look-up.

I came up with a new solution I guess. It uses the Fundamental Theorem of Arithmetic. So the idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list. All anagrams will have the same signature because

Any integer greater than 1 is either a prime number, or can be written as a unique product of prime numbers (ignoring the order).

Here's the code. I convert the word to UPPERCASE and 65 is the position of A which corresponds to my first prime number:

private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
        37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
        107, 109, 113 };

This is the method:

 private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}

We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.

Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.

If each letter position is a dimension, and the value in that dimension is based on the letter (say the ASCII code). Then you can calculate the length of the word vector.

For example, say 'A'=65, 'B'=66, then length("AB") = sqrt(65*65 + 66*66) . Obviously, length("AB") = length("BA") .

Clearly, if two word are anagrams, then their vectors have the same length. The next question is, if two word (of same number of letters) vectors have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length forms a sphere, there are many. Not sure, since we're in the integer space in this case, how many there are actually.

But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance: for(each letter c) { distance += c*c }; distance = sqrt(distance); for(each letter c) { distance += c*c }; distance = sqrt(distance);

Then create a map for all words of length n , and key it with the distance and the value is a list of words of length n that yield that particular distance.

You'll create a map for each distance.

Then your lookup becomes the following algorithm:

Use the correct dictionary map based on the length of the word

Compute the length of your word's vector

Lookup the list of words that match that length

Go through the list and pick the anagrams using a naive algorithm is now the list of candidates is greatly reduced

链接地址: http://www.djcxy.com/p/18142.html

上一篇: 将新密钥添加到字典中？

下一篇: 为给定单词找到anagrams