How to get rid of punctuation using NLTK tokenizer?
I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation, but I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: dots get attached to the last word.
Take a look at the other tokenizing options that nltk provides in the nltk.tokenize module. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
# \w+ matches runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
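Note that r'\w+' splits hyphenated words, as seen above. If you would rather keep them together, a slightly richer pattern does the job; the exact regex below is just one possible choice, not something NLTK prescribes:
from nltk.tokenize import RegexpTokenizer

# runs of word characters, optionally joined by hyphens or apostrophes
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*")
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']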
You do not really need NLTK to remove punctuation; you can do it with plain Python. For Python 2 byte strings:
import string

s = '... some string with punctuation ...'
# Python 2 str.translate(table, deletechars): passing None as the table
# simply deletes every character listed in deletechars
s = s.translate(None, string.punctuation)
Or for unicode strings:
import string

# unicode.translate takes a mapping from code point to replacement;
# mapping every punctuation character to None deletes it
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)
and then use this string in your tokenizer.
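On Python 3, str.translate no longer accepts a deletechars argument; the equivalent uses str.maketrans. A minimal sketch:
import string

s = '... some string with punctuation ...'
# str.maketrans('', '', chars) builds a table that deletes every char in chars
s = s.translate(str.maketrans('', '', string.punctuation))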
P.S. The string module has some other sets of characters that can be removed as well (string.digits, for example).
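For example, to strip digits along with punctuation (Python 3 form shown):
import string

s = 'room 101, 2nd floor!'
s = s.translate(str.maketrans('', '', string.punctuation + string.digits))
# s is now 'room  nd floor'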
As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can then filter out the punctuation with filter(). And if you have unicode strings, make sure each is a unicode object (not a 'str' encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
import string

text = '''It is a blue, small, and extraordinary ball. Like no other'''
# tokenize sentence by sentence, then word by word within each sentence
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# keep only the tokens that are not pure punctuation
print(list(filter(lambda word: word not in string.punctuation, tokens)))
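Running this should print the tokens with the commas and the full stop dropped:
['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']
For long texts, testing membership against set(string.punctuation) instead of the string itself is a bit faster.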