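NLP Naive Bayes classifier training
As part of understanding classification with the Stanford NLP API, I am training a Naive Bayes classifier on a very simple training set (3 labels => ['happy','sad','neutral']). The training data set is: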
happy happy
happy glad
sad gloomy
neutral fine
Here is part of the output from training the classifier (before the error):
numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]
I get an array index out of bounds error. The stack trace is below. I cannot find the problem.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)
This is the part of the code that computes the weights:
private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[] priors = new double[numClasses];
    double[][][] weights = new double[numClasses][numFeatures][];
    //init weights array
    for (int cl = 0; cl < numClasses; cl++) {
        for (int fno = 0; fno < numFeatures; fno++) {
            weights[cl][fno] = new double[numValues[fno]];
            // weights[cl][fno] = new double[numFeatures];
        }
    }
    for (int i = 0; i < data.length; i++) {
        priors[labels[i]]++;
        for (int fno = 0; fno < numFeatures; fno++) {
            weights[labels[i]][fno][data[i][fno]]++;
        }
    }
    for (int cl = 0; cl < numClasses; cl++) {
        for (int fno = 0; fno < numFeatures; fno++) {
            for (int val = 0; val < numValues[fno]; val++) {
                weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
            }
        }
        priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
    }
    return new NBWeights(priors, weights);
}
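I do not understand what the line
int[] numValues = numberValues(data, numFeatures);
does. The error comes from the line
weights[labels[i]][fno][data[i][fno]]++;
I would have expected weights to be a two-dimensional array that tracks how often each feature (fno) occurs for each class (label). I don't see why a third dimension is needed.
Any help would be appreciated.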
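On why the third dimension is needed: judging from the quoted method alone, weights[cl][fno][val] counts how often feature fno takes the value val among the examples of class cl, so each feature is treated as categorical rather than as a plain occurrence count. The loop also reads data[i][fno] for every fno below numFeatures, which only works if every row of data has at least numFeatures entries and every entry is smaller than numValues[fno]. The sketch below illustrates this under stated assumptions: the class name NaiveBayesCountsSketch, the countValues helper, the dense 0/1 encoding of the four training lines, and the body of numberValues (assumed here to return one more than the largest value seen per column) are hypothetical and not taken from the Stanford source.

```java
import java.util.Arrays;

final class NaiveBayesCountsSketch {

    // Assumed behaviour, not the library's code: for each column, one more than
    // the largest value seen, so that values 0..max are valid array indices.
    static int[] numberValues(int[][] data, int numFeatures) {
        int[] numValues = new int[numFeatures];
        for (int[] row : data) {
            for (int fno = 0; fno < row.length && fno < numFeatures; fno++) {
                numValues[fno] = Math.max(numValues[fno], row[fno] + 1);
            }
        }
        return numValues;
    }

    // Mirrors the counting loop of the quoted trainWeightsJL, without smoothing or logs.
    // counts[cl][fno][val] = number of class-cl examples whose feature fno has value val.
    static double[][][] countValues(int[][] data, int[] labels, int numFeatures, int numClasses) {
        int[] numValues = numberValues(data, numFeatures);
        double[][][] counts = new double[numClasses][numFeatures][];
        for (int cl = 0; cl < numClasses; cl++) {
            for (int fno = 0; fno < numFeatures; fno++) {
                counts[cl][fno] = new double[numValues[fno]];
            }
        }
        for (int i = 0; i < data.length; i++) {
            for (int fno = 0; fno < numFeatures; fno++) {
                counts[labels[i]][fno][data[i][fno]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed dense encoding of the four training lines.
        // Columns: happy, glad, gloomy, fine (1 = word present, 0 = absent).
        int[][] dense = {
            {1, 0, 0, 0},   // happy   happy
            {0, 1, 0, 0},   // happy   glad
            {0, 0, 1, 0},   // sad     gloomy
            {0, 0, 0, 1}    // neutral fine
        };
        int[] labels = {0, 0, 1, 2};  // 0 = happy, 1 = sad, 2 = neutral

        System.out.println(Arrays.toString(numberValues(dense, 4)));
        // prints [2, 2, 2, 2]: every feature has two possible values, 0 and 1

        double[][][] counts = countValues(dense, labels, 4, 3);
        System.out.println(Arrays.deepToString(counts[0]));
        // prints [[1.0, 1.0], [1.0, 1.0], [2.0, 0.0], [2.0, 0.0]] for class happy

        // Rows shorter than numFeatures (for example one active feature index per row)
        // break the same loop as soon as fno reaches the row length.
        int[][] shortRows = {{0}, {1}, {2}, {3}};
        try {
            countValues(shortRows, labels, 4, 3);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("short rows: ArrayIndexOutOfBoundsException");
        }
    }
}
```

With the dense encoding the counting loop runs cleanly, and the innermost index is what lets the classifier distinguish "word absent" from "word present" for each class. Whether the library actually passes shorter rows for this training file is not something the question confirms, so the last part of main only shows that such rows would produce the same kind of exception.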
I do not run into any problem with these properties:
#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt
and running this command:
java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop