Best way to classify labeled sentences from a set of documents
I have a classification problem and I need to figure out the best approach to solve it. I have a set of training documents in which some of the sentences and/or paragraphs are labeled with tags. Not all sentences/paragraphs are labeled, and a sentence or paragraph may have more than one tag/label. What I want is a model that, given a new document, suggests labels for each of the sentences/paragraphs within it. Ideally, it would only give me high-probability suggestions.
If I use something like nltk's NaiveBayesClassifier, it gives poor results. I think this is because it does not take into account the "unlabeled" sentences from the training documents, which contain many of the same words and phrases as the labeled sentences. The documents are legal/financial in nature and are filled with legal/financial jargon, most of which should be discounted in the classification model.
Is there some better classification algorithm than Naive Bayes, or is there some way to push the unlabeled data into Naive Bayes, in addition to the labeled data from the training set?
Here's what I'd do to slightly modify your existing approach: train one binary classifier per tag, operating on individual sentences. For each tag, include every sentence that does not carry that tag as a negative example (this implicitly counts the unlabeled sentences). For a new test sentence, run all n classifiers and keep the tags that score above some threshold as the labels for that sentence.
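The per-tag scheme can be sketched in plain Python as follows. This is only an illustration under assumptions that are not in the original post: the toy sentences, the tag names, the whitespace tokenizer, the add-one smoothing, the 0.9 threshold, and the helper names `BinaryNaiveBayes`, `train_per_tag` and `suggest_tags` are all made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, just for the sketch.
    return text.lower().split()

class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one tag: 'has tag' (True) vs 'does not' (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.sentence_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (sentence, has_tag) pairs
        for text, has_tag in examples:
            tokens = tokenize(text)
            self.word_counts[has_tag].update(tokens)
            self.sentence_counts[has_tag] += 1
            self.vocab.update(tokens)

    def prob_has_tag(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total = sum(self.sentence_counts.values())
        log_scores = {}
        for cls in (True, False):
            # Add-one smoothing on both the class prior and the word likelihoods.
            score = math.log((self.sentence_counts[cls] + 1) / (total + 2))
            denom = sum(self.word_counts[cls].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[cls][tok] + 1) / denom)
            log_scores[cls] = score
        # Normalise the two log scores into a probability for the positive class.
        top = max(log_scores.values())
        pos = math.exp(log_scores[True] - top)
        neg = math.exp(log_scores[False] - top)
        return pos / (pos + neg)

def train_per_tag(training):
    # training: list of (sentence, set_of_tags); an empty set means "unlabeled".
    all_tags = set().union(*(tags for _, tags in training))
    models = {}
    for tag in all_tags:
        model = BinaryNaiveBayes()
        # Every sentence without this tag, unlabeled ones included, is a negative.
        model.train([(text, tag in tags) for text, tags in training])
        models[tag] = model
    return models

def suggest_tags(models, sentence, threshold=0.9):
    # Run all per-tag classifiers and keep only the confident suggestions.
    scores = {tag: model.prob_has_tag(sentence) for tag, model in models.items()}
    return {tag: p for tag, p in scores.items() if p >= threshold}

# Hypothetical toy data standing in for the legal/financial documents.
training = [
    ("the borrower shall repay the loan on the maturity date", {"repayment"}),
    ("the borrower shall indemnify the lender against all losses", {"indemnity"}),
    ("the borrower shall repay the loan and indemnify the lender", {"repayment", "indemnity"}),
    ("this agreement is governed by the laws of england", set()),
    ("terms defined in the agreement have the same meaning here", set()),
]
models = train_per_tag(training)
print(suggest_tags(models, "the borrower shall repay the loan early"))
```

Because every classifier sees the unlabeled sentences as negatives, jargon that appears everywhere ("the", "agreement", "borrower") carries little weight for any single tag, and raising `threshold` trades recall for higher-confidence suggestions.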
I'd probably use something other than Naive Bayes. Logistic regression (MaxEnt) is the obvious choice if you want something probabilistic; SVMs are very strong if you don't care about probabilities (and I don't think you do at the moment).
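To make the logistic regression option concrete, here is a minimal pure-Python sketch of a binary bag-of-words classifier for a single tag that outputs a probability. Everything specific in it is an assumption for illustration: the toy sentences, the learning rate, epoch count and L2 strength, and the function names `train_logreg` and `predict_proba`. In practice you would more likely reach for a library implementation than hand-roll the training loop.

```python
import math
from collections import Counter

def featurize(text):
    # Assumption: bag-of-words counts over a lowercase whitespace split.
    return Counter(text.lower().split())

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(examples, epochs=200, lr=0.1, l2=0.01):
    """Fit binary logistic regression by stochastic gradient descent.

    examples: list of (sentence, has_tag) pairs. Returns (weights, bias).
    """
    weights, bias = Counter(), 0.0
    data = [(featurize(text), 1.0 if has_tag else 0.0) for text, has_tag in examples]
    for _ in range(epochs):
        for feats, y in data:
            z = bias + sum(weights[w] * c for w, c in feats.items())
            error = sigmoid(z) - y  # gradient of the log loss with respect to z
            for w, c in feats.items():
                # L2 penalty is applied only to the features present in this example.
                weights[w] -= lr * (error * c + l2 * weights[w])
            bias -= lr * error
    return weights, bias

def predict_proba(model, text):
    weights, bias = model
    z = bias + sum(weights[w] * c for w, c in featurize(text).items())
    return sigmoid(z)

# Hypothetical data for one tag ("repayment"); unlabeled sentences are negatives.
examples = [
    ("the borrower shall repay the loan on the maturity date", True),
    ("the borrower shall indemnify the lender against all losses", False),
    ("the borrower shall repay the loan and indemnify the lender", True),
    ("this agreement is governed by the laws of england", False),
    ("terms defined in the agreement have the same meaning here", False),
]
model = train_logreg(examples)
print(predict_proba(model, "the borrower shall repay the loan on the maturity date"))
```

The returned value can be compared against a threshold directly, which is the practical difference from an SVM, whose raw decision scores are not probabilities.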
This is really a sequence labelling task, and ideally you'd fold in predictions from nearby sentences too... but as far as I know, there's no principled extension to CRFs/StructSVM or other sequence tagging approaches that lets instances have multiple labels.
is there some way to push the unlabelled data into naive bayes
Naive Bayes itself makes no distinction between "labeled" and "unlabeled" data. It builds simple conditional probabilities, which it uses to estimate P(label|attributes) and P(no label|attributes), so whether the unlabeled segments are used depends heavily on the processing pipeline, but I highly doubt that it actually ignores them. If it does for some reason, and you do not want to modify the code, you can assign an artificial "no label" label to all remaining text segments.
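As a rough sketch of the artificial "no label" idea, the training data could be flattened into single-label pairs like this. The helper names `to_single_label_pairs` and `bag_of_words`, the toy segments, and the choice to emit one pair per tag for multi-tagged segments are assumptions made for the example, not something the original post specifies.

```python
NO_LABEL = "no label"

def to_single_label_pairs(segments):
    """Turn (text, set_of_tags) items into (text, label) pairs.

    Untagged segments get the artificial NO_LABEL class; a segment with
    several tags is emitted once per tag (an assumption of this sketch).
    """
    pairs = []
    for text, tags in segments:
        if tags:
            pairs.extend((text, tag) for tag in sorted(tags))
        else:
            pairs.append((text, NO_LABEL))
    return pairs

def bag_of_words(text):
    # Feature dict of the form {word: True}.
    return {word: True for word in text.lower().split()}

# Hypothetical segments from a training document.
segments = [
    ("the borrower shall repay the loan", {"repayment"}),
    ("the borrower shall repay the loan and indemnify the lender", {"repayment", "indemnity"}),
    ("this agreement is governed by the laws of england", set()),
]
pairs = to_single_label_pairs(segments)
featuresets = [(bag_of_words(text), label) for text, label in pairs]
print(pairs)
```

The `featuresets` list is in the (feature dict, label) shape that nltk's `NaiveBayesClassifier.train` takes, so the unlabeled segments now compete with the real tags as a class of their own.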
Is there some better classification algorithm than Naive Bayes
Yes. NB is one of the most basic models, and there are dozens of stronger, more general ones that achieve better results in text tagging, including: