如何将给定的文本分解成字典中的单词？

2018-06-05 18:17:08

这是一个面试问题。假设你有一个字符串text和一个dictionary （一组字符串）。如何将text分解为子字符串，以便在dictionary找到每个子字符串。

例如，您可以使用/usr/share/dict/words将"thisisatext"分解为["this", "is", "a", "text"] 。

我相信回溯可以解决这个问题（在伪Java中）：

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}

List<String> solution = new List<String>()
solve(text, dict, solution)

是否有意义？你会优化在词典中搜索前缀的步骤吗？你会推荐什么样的数据结构？

该解决方案假设字典中存在Trie数据结构。此外，对于Trie中的每个节点，都承担以下功能：

node.IsWord（）：如果该节点的路径是一个单词，则返回true

node.IsChild（char x）：如果存在具有标签x的子项，则返回true

node.GetChild（char x）：返回带有标签x的子节点

Function annotate( String str, int start, int end, int root[], TrieNode node):
i = start
while i<=end:
    if node.IsChild ( str[i]):
        node = node.GetChild( str[i] )
        if node.IsWord():
            root[i+1] = start
        i+=1
    else:
        break;

end = len(str)-1
root = [-1 for i in range(len(str)+1)]
for start= 0:end:
    if start = 0 or root[start]>=0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

我将留下部分给你，通过反向遍历根来列出构成字符串的单词。

时间复杂度是O（nk），其中n是字符串的长度，k是字典中最长单词的长度。

PS：我正在假设在字典中有下列词语：这是一个文本，吃了。

在这篇博客文章中，对这个问题的解决方案有一个非常详尽的解释。

基本思想就是记住你写的函数，你将有一个O（n ^ 2）时间，O（n）空间算法。

方法1 - Trie在这里看起来非常适合。在英文字典中生成单词的字典。这栋建筑是一次性成本。在树结构建立后，你的string可以很容易地逐字比较。如果在任何时候遇到树中的叶子，你都可以假设你找到了一个词，把它添加到列表中，并随着你的遍历继续。遍历直到你到达string的末尾。该列表是输出。

搜索的时间复杂度 - O（word_length）。

空间复杂性 - O（charsize * word_length * no_words）。字典的大小。

方法2 -我听说过后缀树，从来没有使用它们，但它可能在这里很有用。

方法3 -更迂腐和糟糕的选择。你已经提出了这个建议。

你可以尝试另一种方式。通过dict运行是检查子字符串匹配。在这里，我假设在按键dict是在words的英文字典/usr/share/dict/words 。所以伪代码看起来像这样 -

(list) splitIntoWords(String str, dict d)
{
    words = []
    for (word in d)
    {
        if word in str
            words.append(word);
    }
    return words;
}

复杂性 - O（n）贯穿整个词典+ O（1）进行子串匹配。

空间 - 最坏情况O（n）如果len(words) == len(dict)

正如其他人指出的那样，这确实需要回溯。

链接地址: http://www.djcxy.com/p/18135.html

上一篇: How to break down a given text into words from the dictionary?

下一篇: Discovering "templates" in a given text?