Discovering "templates" in a given text?

2018-06-05 18:16:06

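If I have a large amount of text and am trying to discover the most frequently occurring templates, I am thinking of using the N-gram approach. It has in fact been suggested as a solution to this kind of problem, but my requirement is slightly different. To clarify, I have text like this: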

I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

and I am trying to extract "templates" like this:

I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow

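I am looking for an approach that can scale to millions of lines of text, so I was wondering whether I can adapt the same N-gram approach to solve this problem, or whether there are alternatives?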

Millions of lines of text is not really a large number :)

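What you are looking for is at least similar to collocation discovery. You could try computing pointwise mutual information over n-grams. See Manning & Schütze (1999) for their treatment of this problem and of other approaches.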
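
To make the pointwise mutual information suggestion concrete, here is a minimal sketch for bigrams, using PMI(x, y) = log2(P(x, y) / (P(x) P(y))) with probabilities estimated from raw counts. The function name `bigram_pmi`, the whitespace tokenization, and the `min_count` cutoff are assumptions made for illustration; they are not part of the original answer.

```python
import math
from collections import Counter

def bigram_pmi(lines, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times.

    Hypothetical helper: tokenization is a plain whitespace split, and
    min_count is an assumed cutoff, since PMI tends to overrate rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.split()
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour within the same line.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# The six example sentences from the question.
lines = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]

scores = bigram_pmi(lines)
# Print the ten highest-scoring pairs.
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(pair, round(pmi, 2))
```

This only scores adjacent word pairs. Turning high-scoring pairs into templates with gaps, as in the question, would need further steps that this sketch does not attempt. Counting is a single pass over the lines, so it should cope with millions of lines as long as the count tables fit in memory.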
