NLP naive Bayes classifier training

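As part of learning how classification works in the Stanford NLP API, I am training a naive Bayes classifier on a very simple training set (3 labels => ['happy','sad','neutral']). The training data set is: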

happy     happy
happy     glad
sad       gloomy
neutral   fine

This is part of the output from training the classifier (before the error):

numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]

I get an array index out of bounds error. The stack trace is attached below. I have not been able to find the problem.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
    at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
    at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)

This is the part of the code that computes the weights:

 private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[] priors = new double[numClasses];
    double[][][] weights = new double[numClasses][numFeatures][];
    //init weights array
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[cl][fno] = new double[numValues[fno]];
//        weights[cl][fno] = new double[numFeatures];
      }
    }
    for (int i = 0; i < data.length; i++) {
      priors[labels[i]]++;
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[labels[i]][fno][data[i][fno]]++;
      }
    }
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        for (int val = 0; val < numValues[fno]; val++) {
          weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
        }
      }
      priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
    }
    return new NBWeights(priors, weights);
  }
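
Read as formulas, the last loop in the method above appears to apply additive (Laplace-style) smoothing and then take a log. The symbols below are shorthand introduced here for illustration, not names from the library: $N_{c,f,v}$ is the raw count held in `weights[cl][fno][val]`, $N_c$ is the raw count held in `priors[cl]` before it is overwritten, $V_f$ is `numValues[fno]`, $N$ is `data.length`, and $C$ is `numClasses`.

```latex
\log P(x_f = v \mid c) = \log \frac{N_{c,f,v} + \alpha_{\mathrm{feature}}}{N_c + \alpha_{\mathrm{feature}} \, V_f}
\qquad
\log P(c) = \log \frac{N_c + \alpha_{\mathrm{class}}}{N + \alpha_{\mathrm{class}} \, C}
```

Under this reading, the third index $v$ is the value a feature takes, so each (class, feature) pair carries a full distribution over that feature's values.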

I can't understand what

int[] numValues = numberValues(data, numFeatures);

means. The error comes from the line

weights[labels[i]][fno][data[i][fno]]++;

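I would have expected weights to be a two-dimensional array that tracks how many times each feature (fno) occurs for each class (label). I don't see why the third dimension is needed.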

Any help would be much appreciated.
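
On the third dimension: in the quoted loop, `data[i][fno]` is used as an index, so it reads as the value that feature `fno` takes in datum `i`, and `weights[cl][fno][val]` counts how often that value occurs for class `cl`. The sketch below is a minimal, self-contained illustration of that reading, not the library's code. The class `WeightCountSketch`, its `numberValues` body, the label encoding 0/1/2, and both data layouts are assumptions made up for this example. It also shows one way an out-of-bounds error can arise: if each row lists only the indices of its active features instead of one value per feature, `data[i][fno]` runs past the end of the row. Whether ColumnDataClassifier actually passes rows in that form is an assumption the stack trace does not confirm.

```java
import java.util.Arrays;

public class WeightCountSketch {

  // Hypothetical stand-in for numberValues: for each feature column, one more
  // than the largest value seen, i.e. the size needed to index values 0..max.
  public static int[] numberValues(int[][] data, int numFeatures) {
    int[] numValues = new int[numFeatures];
    for (int[] row : data) {
      for (int fno = 0; fno < row.length && fno < numFeatures; fno++) {
        numValues[fno] = Math.max(numValues[fno], row[fno] + 1);
      }
    }
    return numValues;
  }

  // Same counting pattern as the loop in the question:
  // counts[class][feature][value] is incremented once per datum and feature.
  public static double[][][] countValues(int[][] data, int[] labels,
                                         int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[][][] counts = new double[numClasses][numFeatures][];
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        counts[cl][fno] = new double[numValues[fno]];
      }
    }
    for (int i = 0; i < data.length; i++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        counts[labels[i]][fno][data[i][fno]]++;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Assumed encoding: 0 = happy, 1 = sad, 2 = neutral.
    int[] labels = {0, 0, 1, 2};

    // Dense layout: one 0/1 value per feature (happy, glad, gloomy, fine).
    int[][] dense = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}};
    double[][][] counts = countValues(dense, labels, 4, 3);
    // Class "happy", feature "happy": absent once (value 0), present once (value 1).
    System.out.println(Arrays.toString(counts[0][0])); // prints [1.0, 1.0]

    // Sparse layout: each row holds only the index of its single active feature.
    int[][] sparse = {{0}, {1}, {2}, {3}};
    try {
      countValues(sparse, labels, 4, 3);
    } catch (ArrayIndexOutOfBoundsException e) {
      // Each row has length 1, so reading data[0][1] fails.
      System.out.println("sparse rows fail: " + e.getMessage());
    }
  }
}
```

With the dense layout every row has exactly `numFeatures` entries and the counting loop completes. With the sparse layout it fails on the first datum as soon as `fno` reaches 1.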


I don't run into any problem with these properties:

#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt

and running this command:

java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop