How to find the most common words using spaCy?

I'm using spaCy with Python, and it's working fine for tagging each word, but I was wondering whether it's possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs, and so on?

There's a count_by function included, but I can't seem to get it to run in any meaningful way.


This should look basically the same as counting anything else in Python. spaCy lets you just iterate over the document, and you get back a sequence of Token objects. These can be used to access the annotations.

from __future__ import print_function, unicode_literals
import spacy
from collections import defaultdict, Counter

nlp = spacy.load('en')

# Map each part-of-speech ID to a Counter of token IDs
pos_counts = defaultdict(Counter)
doc = nlp(u'My text here.')

for token in doc:
    # .pos and .orth are integer IDs, so we count integers, not strings
    pos_counts[token.pos][token.orth] += 1

for pos_id, counts in sorted(pos_counts.items()):
    # Look the IDs back up in the vocabulary's string store
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id])

Note that the .orth and .pos attributes are integer IDs. You can get the strings they map to via the .orth_ and .pos_ attributes. The .orth attribute is the unnormalised view of the token; there are also .lower, .lemma, etc. string views. You might want to bind a .norm function to do your own string normalisation. See the docs for details.
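For example, continuing with the nlp object loaded above (the outputs are illustrative and depend on the model):

doc = nlp(u'The cats sat.')
token = doc[1]  # the token "cats"

print(token.orth_)                    # 'cats'  (verbatim string view)
print(token.lower_, token.lemma_)     # lowercased and lemmatised views
print(token.orth)                     # the integer ID behind 'cats'
print(doc.vocab.strings[token.orth])  # maps the ID back to 'cats'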

The integers are useful for your counts because they make your counting program much more memory efficient if you're counting over a large corpus. You could also store the frequency counts in a numpy array, for additional speed and efficiency. If you don't want to bother with this, feel free to count with the .orth_ attribute directly, or use its alias .text.
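This is also where the count_by method from the question fits in: it counts by an attribute ID and returns a dictionary keyed by those same integers. A minimal sketch, again reusing nlp from above:

from spacy.attrs import ORTH

doc = nlp(u'the quick fox jumped over the lazy fox')
counts = doc.count_by(ORTH)  # {orth_id: frequency}
for orth_id, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(count, doc.vocab.strings[orth_id])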

Note that the .pos attribute in the snippet above gives a coarse-grained set of part-of-speech tags. The richer treebank tags are available on the .tag attribute.
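For example (the exact tags depend on the model, so treat this output as illustrative):

doc = nlp(u'The cats sat.')
for token in doc:
    print(token.text, token.pos_, token.tag_)

# roughly:
# The  DET   DT
# cats NOUN  NNS
# sat  VERB  VBD
# .    PUNCT .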


I recently had to count the frequency of all the tokens in a text file. You can filter tokens by part of speech using the pos_ attribute. Here is a simple example:

import spacy
from collections import Counter

nlp = spacy.load('en')
doc = nlp(u'Your text here')

# all tokens that aren't stop words or punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text for token in doc
         if not token.is_stop and not token.is_punct and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
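
The same pattern works for any part of speech; for example, a quick sketch for the most common verbs, continuing from the snippet above:

# five most common verb tokens
verbs = [token.text for token in doc if token.pos_ == "VERB"]
common_verbs = Counter(verbs).most_common(5)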