NLP Naive Bayes classifier training
As part of understanding the Stanford NLP API for classification, I am training the Naive Bayes classifier on a very simple training set (3 labels => ['happy', 'sad', 'neutral']). The training data set (one tab-separated example per line: label, then text) is:
happy happy
happy glad
sad gloomy
neutral fine
This is part of the output from training the classifier (before the error):
numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]
I get an ArrayIndexOutOfBoundsException. I have attached the stack trace, but I am unable to find the problem.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)
The weights are computed in NaiveBayesClassifierFactory.trainWeightsJL:
private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
  int[] numValues = numberValues(data, numFeatures);
  double[] priors = new double[numClasses];
  double[][][] weights = new double[numClasses][numFeatures][];
  // init weights array
  for (int cl = 0; cl < numClasses; cl++) {
    for (int fno = 0; fno < numFeatures; fno++) {
      weights[cl][fno] = new double[numValues[fno]];
      // weights[cl][fno] = new double[numFeatures];
    }
  }
  for (int i = 0; i < data.length; i++) {
    priors[labels[i]]++;
    for (int fno = 0; fno < numFeatures; fno++) {
      weights[labels[i]][fno][data[i][fno]]++;
    }
  }
  for (int cl = 0; cl < numClasses; cl++) {
    for (int fno = 0; fno < numFeatures; fno++) {
      for (int val = 0; val < numValues[fno]; val++) {
        weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
      }
    }
    priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
  }
  return new NBWeights(priors, weights);
}
I am unable to understand what
int[] numValues = numberValues(data, numFeatures);
means. The error is from the line
weights[labels[i]][fno][data[i][fno]]++;
I would have thought weights would be a 2D array that tracks feature (fno) occurrences for the different classes (labels). I am not sure why the third dimension is needed.
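My best guess, and this is my own assumption rather than something taken from the Stanford source, is that numberValues returns, for each feature column, the number of distinct values that feature can take (e.g. one more than the largest value seen in data), so the third dimension of weights indexes feature values rather than just feature presence. A minimal sketch of that reading, with a hypothetical reimplementation of numberValues:

// Hypothetical sketch (my assumption, not the actual Stanford implementation):
// size each feature's value range as one more than the largest value observed
// in that column, so weights[cl][fno] can hold one count per possible value.
private static int[] numberValues(int[][] data, int numFeatures) {
  int[] numValues = new int[numFeatures];
  for (int[] row : data) {
    for (int fno = 0; fno < numFeatures && fno < row.length; fno++) {
      numValues[fno] = Math.max(numValues[fno], row[fno] + 1);
    }
  }
  return numValues;
}

Under that reading, weights[cl][fno][val] would count how often feature fno takes value val in examples of class cl, which would explain the third dimension, but it does not tell me why data[i][fno] ends up outside the range that numValues[fno] allows.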
Any help will be greatly appreciated.
I'm not having any issues with these properties:
#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt
And running this command:
java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop
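For reference, the ClassifierDemo in the stack trace does the equivalent programmatically. This is roughly what my main method looks like (the file names are mine, and I am assuming trainClassifier takes the training-file path, which is what the stack trace suggests):

import edu.stanford.nlp.classify.ColumnDataClassifier;

public class ClassifierDemo {
  public static void main(String[] args) throws Exception {
    // Load the same properties file used on the command line above.
    ColumnDataClassifier cdc = new ColumnDataClassifier("example.prop");
    // Train on the four-line training set; this is the call that ends in
    // NaiveBayesClassifierFactory.trainWeightsJL and throws the exception.
    cdc.trainClassifier("simple-classifier-training-set.txt");
  }
}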