Detect (predefined) topics in natural text

2018-06-20 13:05:44

Is there a library or database out there that can detect the topics of natural text?

I'm not talking about generating topics from extracted keywords, but about analysing the vocabulary used in the text and matching it against predefined topics, for example by looking for words used in cooking or in certain sports (such as names of football clubs or technical terms).

Update with clarification:

Example text snippet: A sentence about football, then another sentence talking about catering at the event.

Library could assign categories "sports", "football", "cooking".

I'm looking for something that can assign these categories (or "topics of interest" maybe) without me having to train thousands of models with terabytes of manually classified documents. This could for example work by matching keywords instead of statistical analysis (that's why I mentioned database earlier).

I'm looking for this because I don't have the manpower to build such a big database myself.

The task you describe is classic text (document) classification. I recommend reading through this article and then searching for the keywords it uses.

In short, the most popular approach is supervised machine learning (e.g. an SVM) on tf-idf features computed over words, or sometimes over word n-grams.

The scikit-learn tutorial describes this task; there are also existing libraries such as LibShortText.
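To make the tf-idf plus SVM approach concrete, here is a minimal sketch with scikit-learn. The six training sentences and the two labels ("sports", "cooking") are made up for illustration; a real classifier would need far more text per category, and this single-label setup would have to be extended (for example with one classifier per topic) to assign several topics to the same text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hand-written training set (an assumption for this sketch,
# not a real dataset): three texts per predefined topic.
train_texts = [
    "The striker scored two goals in the football match",
    "The club signed a new goalkeeper before the league final",
    "The referee showed a red card during the football game",
    "Simmer the sauce and season it with salt and pepper",
    "Bake the bread in the oven until the crust is golden",
    "Chop the onions and fry them in butter for the soup",
]
train_labels = ["sports", "sports", "sports", "cooking", "cooking", "cooking"]

# tf-idf over words (English stop words removed) feeding a linear SVM.
classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(random_state=0),
)
classifier.fit(train_texts, train_labels)

# Unseen texts that reuse some of the training vocabulary.
new_texts = [
    "The goalkeeper saved a penalty in the football final",
    "Season the soup with salt and fry the onions in butter",
]
predictions = classifier.predict(new_texts)

for text, topic in zip(new_texts, predictions):
    print(f"{topic}: {text}")
```

To use word n-grams as well, TfidfVectorizer accepts an ngram_range argument, e.g. ngram_range=(1, 2) for single words plus word pairs.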

For datasets (a more common term than 'database'), look at the Reuters-21578 Text Categorization Collection or here. In general, it isn't hard to collect texts for predefined categories. For example, if you want to classify texts by kind of sport, go to news sites, preferably specialized ones such as sports sites, and collect articles from their sections.

See also the related questions on Stack Overflow or Quora.

There are multiple ways to address this problem, and most of them fall within the domain of the Semantic Web.

Use a knowledge base such as DBpedia. DBpedia is essentially Wikipedia data in triple format (subject, predicate, object). Query DBpedia with SPARQL on the predicate rdfs:label: if the token is part of DBpedia, this returns a URI for it, and the predicate dcterms:subject on that URI gives the categories related to the subject. You might need to traverse the triple store to get more abstract relationships. Similar knowledge bases: ConceptNet, Freebase, YAGO.
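As a rough sketch of that lookup, the Python below builds the SPARQL query for one token and sends it to the public DBpedia endpoint. The helper names build_category_query and fetch_categories are hypothetical, and the query assumes the token matches a resource label exactly (labels follow Wikipedia titles, e.g. "Association football"), so real text would need some normalisation first.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_category_query(label, lang="en", limit=20):
    """Build a SPARQL query that finds resources by exact rdfs:label
    and returns their dcterms:subject categories. (Hypothetical helper.)"""
    # Escape backslashes and double quotes for the SPARQL string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?resource ?category WHERE {\n"
        f'  ?resource rdfs:label "{escaped}"@{lang} ;\n'
        "            dcterms:subject ?category .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def fetch_categories(label):
    """Run the query over HTTP and return the category URIs.
    Needs network access; error handling is omitted in this sketch."""
    params = urllib.parse.urlencode({"query": build_category_query(label)})
    request = urllib.request.Request(
        DBPEDIA_ENDPOINT + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return [row["category"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    for category in fetch_categories("Association football"):
        print(category)
```

The returned category URIs are fairly specific Wikipedia categories; getting from them to a broad topic such as "sports" is the traversal step mentioned above and is not covered by this sketch.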

Also check Cyc: http://www.cyc.com/

Let me know if you want me to elaborate more

Best, Ankit
