How can I tag and chunk French text using NLTK and Python?
I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals: sentence segmentation, word tokenization, and PoS tagging/chunking.
The steps I've taken so far:
Imported the data into a Python list:
import json

with open('articlefile.json') as json_articles:
    articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk

french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhitespaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck: NLTK's built-in PoS tagger and named-entity chunker are trained for English, so I can't simply apply them to French text.
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be (1) training my own tagger, possibly with nltk-trainer, or (2) using an existing external tagger that supports French.
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.
It should be possible to use this French tagger in NLTK, using Nitin Madnani's interface to the Stanford POS tagger.
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
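For reference, here is a rough, untested sketch of what that might look like (the wrapper class is StanfordPOSTagger in NLTK 3 and POSTagger in older releases, and the model/jar paths are placeholders for wherever you unpack the Stanford tagger):

from nltk.tag.stanford import StanfordPOSTagger

# Point these at your local copy of the Stanford tagger and its French model.
stanford_tagger = StanfordPOSTagger(
    model_filename='stanford-postagger/models/french.tagger',
    path_to_jar='stanford-postagger/stanford-postagger.jar',
    encoding='utf8')

print(stanford_tagger.tag(['Je', 'mange', 'une', 'pomme', '.']))
# (word, tag) pairs using the French Treebank tagset, e.g. ('mange', 'V')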
There is also TreeTagger (which supports French) with a Python wrapper. This is the solution I am currently using, and it works quite well.
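For example, with the third-party treetaggerwrapper package (a sketch assuming TreeTagger itself and its French parameter file are already installed):

import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
tagged = tagger.tag_text(u"Les articles sont stockés dans un fichier JSON.")
for tag in treetaggerwrapper.make_tags(tagged):
    print(tag)   # e.g. Tag(word=u'articles', pos=u'NOM', lemma=u'article')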
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
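For example, on a clause with elided articles (the RegexpTokenizer pattern below is just one possible choice):

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer

s = u"L'article qu'il a écrit"
print(WhitespaceTokenizer().tokenize(s))   # ["L'article", "qu'il", 'a', 'écrit']
print(WordPunctTokenizer().tokenize(s))    # ['L', "'", 'article', 'qu', "'", 'il', 'a', 'écrit']
# Keep the apostrophe attached to the elided word:
print(RegexpTokenizer(r"\w+'|\w+").tokenize(s))   # ["L'", 'article', "qu'", 'il', 'a', 'écrit']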
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
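One quick guard before tokenizing (a sketch; adjust the encodings to whatever your source actually uses):

# Decode byte strings up front so the tokenizers see characters, not raw bytes.
if isinstance(txt, bytes):
    try:
        txt = txt.decode('utf-8')
    except UnicodeDecodeError:
        txt = txt.decode('latin-1')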
NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option, but you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). You can use nltk-trainer for this, or use NLTK's tagger-training classes directly.
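For instance, training a simple backoff tagger directly with NLTK might look like this (a sketch: it assumes tagged_sents is a list of [(word, tag), ...] sentences from a French tagged corpus, and the 'NC' default tag is just a placeholder):

import nltk

def train_french_tagger(tagged_sents):
    # Hold out the last 10% of sentences for evaluation.
    cutoff = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

    # Bigram tagger backing off to a unigram tagger, backing off to a default tag.
    t0 = nltk.DefaultTagger('NC')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)

    print('accuracy: %.3f' % t2.evaluate(test_sents))
    return t2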
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you can start with XMLCorpusReader and subclass it to provide a tagged_sents() method.
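A rough sketch of that subclassing idea (the SENT element and cat attribute below are purely illustrative; check them against the actual French Treebank markup):

from nltk.corpus.reader.xmldocs import XMLCorpusReader

class FrenchTreebankReader(XMLCorpusReader):
    def tagged_sents(self, fileids=None):
        sents = []
        for fileid in (fileids or self.fileids()):
            root = self.xml(fileid)                     # parsed ElementTree element
            for sent in root.findall('.//SENT'):        # hypothetical sentence element
                sents.append([(w.text, w.get('cat'))    # hypothetical word/tag markup
                              for w in sent.findall('.//w')])
        return sents

# e.g. reader = FrenchTreebankReader('path/to/french_treebank', r'.*\.xml')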
If you're not already on the nltk-users mailing list, I think you'll want to get on it.