Using Word2Vec for topic modeling

I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA).

However, I am interested in whether it is a good idea to try topic modeling with Word2Vec, since it clusters words in vector space. Couldn't those clusters therefore be regarded as topics?

Do you think this approach makes sense to pursue for research purposes? In the end, what I am interested in is extracting keywords from a text according to its topics.
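To make the idea concrete, here is a minimal sketch of what I have in mind, using gensim and scikit-learn (the toy corpus, cluster count, and all parameters are just placeholders):

```python
# Sketch: cluster word vectors and treat each cluster as a "topic".
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "climbed", "a", "tree"],
    ["stocks", "fell", "as", "markets", "reacted"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

words = list(model.wv.index_to_key)
vectors = model.wv[words]

n_topics = 2  # arbitrary choice for this toy example
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=1).fit(vectors)

# Each cluster's member words would be that "topic's" keywords.
for topic in range(n_topics):
    keywords = [w for w, label in zip(words, kmeans.labels_) if label == topic]
    print(f"topic {topic}: {keywords}")
```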


You might want to look at the following papers:

Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313. [CODE]

Yang Liu, Zhiyuan Liu, Tat-Seng Chua and Maosong Sun. 2015. Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2418-2424. [CODE]

The first paper integrates word embeddings into the LDA model and into the one-topic-per-document DMM model. It reports significant improvements in topic coherence, document clustering, and document classification, especially on small corpora or short texts (e.g., tweets).

The second paper is also interesting. It uses LDA to assign a topic to each word, and then employs Word2Vec to learn word embeddings based on both the words and their topics.
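To illustrate the second paper's idea, here is a rough sketch (not the authors' implementation: as a simplification it tags each word type with its single most probable LDA topic, whereas the paper assigns a topic to every token occurrence):

```python
# Sketch of "topical word embeddings": tag words with LDA topics,
# then learn embeddings over the tagged pseudo-words.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["dog", "cat", "tree"],
    ["stocks", "markets", "fell"],
    ["dog", "chased", "cat"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

def best_topic(word):
    # Most probable topic for this word type under the LDA model.
    topics = lda.get_term_topics(dictionary.token2id[word], minimum_probability=0)
    return max(topics, key=lambda t: t[1])[0] if topics else 0

tagged_docs = [[f"{w}_topic{best_topic(w)}" for w in d] for d in docs]
twe = Word2Vec(tagged_docs, vector_size=50, min_count=1, seed=1)
print(twe.wv.index_to_key)
```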


Two groups have tried to solve this.

Chris Moody at StitchFix came out with LDA2Vec, and some Ph.D. students at CMU wrote a paper called "Gaussian LDA for Topic Models with Word Embeddings", with code here... though I could not get the Java code there to output sensible results. It's an interesting idea of using word2vec with Gaussian (actually t-distributions, when you work out the math) word-topic distributions. Gaussian LDA should also be able to handle words that were out of vocabulary at training time.
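To get a feel for what Gaussian word-topic distributions look like without the full collapsed Gibbs sampler, a crude stand-in is to fit a Gaussian mixture over pre-trained word vectors; any out-of-vocabulary word you can embed can then still be scored under the fitted topic Gaussians (a toy sketch, not the paper's algorithm):

```python
# Crude stand-in for Gaussian LDA: model topics as Gaussians in
# embedding space via a mixture model. NOT the paper's collapsed
# Gibbs sampler; it just illustrates the geometric idea.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend these are pre-trained word vectors (vocab_size x dim).
word_vectors = rng.normal(size=(1000, 50))

gmm = GaussianMixture(n_components=10, covariance_type="diag", random_state=0)
gmm.fit(word_vectors)

# An "out-of-vocabulary" word: as long as we can embed it somehow,
# we can still ask which topic-Gaussian it most likely belongs to.
oov_vector = rng.normal(size=(1, 50))
print(gmm.predict(oov_vector))        # most likely topic index
print(gmm.predict_proba(oov_vector))  # soft topic assignment
```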

LDA2Vec attempts to train both the LDA model and the word vectors at the same time, and it also allows you to put LDA priors over non-words to get really interesting results.


In Word2Vec, consider three sentences:
“the dog saw a cat”,
“the dog chased the cat”,
“the cat climbed a tree”.
Given the input word 'cat', the model will produce an output word such as 'climbed'.

This prediction is based on the probability of each word given the context word ('cat'); this is the continuous-bag-of-words (CBOW) model. We get words that are similar to the input word based on the contexts they share. Note that Word2Vec works well only on large datasets.
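You can reproduce this with gensim's Word2Vec (CBOW is the default, sg=0); note how noisy the predictions are on a corpus this small, which is exactly why Word2Vec needs large datasets:

```python
# The three toy sentences from above, trained with CBOW (gensim's
# default). With this little data the output is essentially noise.
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "saw", "a", "cat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "climbed", "a", "tree"],
]

model = Word2Vec(sentences, vector_size=20, window=2, min_count=1,
                 sg=0, seed=1, epochs=200)

# Predict likely output words given the context word "cat".
print(model.predict_output_word(["cat"], topn=3))
```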

LDA is used to abstract topics from a corpus. It is not based on word context; rather, it uses Dirichlet distributions to draw word distributions for topics and topic distributions for documents. The problem we face here is randomness: the inference is stochastic, so we can get different topics each time we run it.
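For example, with gensim, two runs with different seeds can produce noticeably different topics; fixing random_state makes a run reproducible (a toy sketch):

```python
# Demonstrates LDA's run-to-run randomness: different seeds can
# yield different topics; fixing random_state makes runs repeatable.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["dog", "cat", "tree", "dog"],
    ["stocks", "markets", "fell", "stocks"],
    ["cat", "climbed", "tree"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for seed in (1, 2):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=seed)
    print(f"seed={seed}:", lda.print_topics())
```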

The technique we choose depends upon our requirements.
