NLP naive Bayes classifier training

As part of understanding the Stanford NLP API for classification, I am training the naive Bayes classifier on a very simple training set (3 labels => ['happy','sad','neutral']). The training data set is:

happy   happy
happy   glad
sad gloomy
neutral fine
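
The file is tab-separated, with the gold label in column 0 and the text in column 1 (matching the `goldAnswerColumn=0` / `displayedColumn=1` properties further down). For reference, a sketch of how I generate it (the filename matches my properties file; the rest is just plain Java NIO):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class WriteTrainingSet {
  public static void main(String[] args) throws IOException {
    // Each line: gold label, TAB, text (goldAnswerColumn=0, displayedColumn=1).
    List<String> lines = List.of(
        "happy\thappy",
        "happy\tglad",
        "sad\tgloomy",
        "neutral\tfine");
    Path out = Path.of("simple-classifier-training-set.txt");
    Files.write(out, lines);
  }
}
```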

This is part of the output from training the classifier, just before the error:

numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]

I then get an ArrayIndexOutOfBoundsException. I have attached the stack trace below, but I am unable to find the problem.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
    at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
    at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
    at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)

The error occurs as part of obtaining the weights in:

 private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[] priors = new double[numClasses];
    double[][][] weights = new double[numClasses][numFeatures][];
    //init weights array
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[cl][fno] = new double[numValues[fno]];
//        weights[cl][fno] = new double[numFeatures];
      }
    }
    for (int i = 0; i < data.length; i++) {
      priors[labels[i]]++;
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[labels[i]][fno][data[i][fno]]++;
      }
    }
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        for (int val = 0; val < numValues[fno]; val++) {
          weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
        }
      }
      priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
    }
    return new NBWeights(priors, weights);
  }
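
To check my reading of this method, I rebuilt the counting loop stand-alone with plain arrays and made-up binary data (this is my own sketch, not the factory's code; I am guessing that `numberValues` returns, per feature column, one more than the largest value index seen, so that `data[i][fno]` can index `weights[cl][fno]`):

```java
public class NaiveBayesIndexSketch {
  // My guess at what numberValues computes: for each feature column,
  // (largest value index seen in the data) + 1.
  public static int[] numberValues(int[][] data, int numFeatures) {
    int[] numValues = new int[numFeatures];
    for (int[] row : data) {
      for (int fno = 0; fno < numFeatures; fno++) {
        if (row[fno] + 1 > numValues[fno]) numValues[fno] = row[fno] + 1;
      }
    }
    return numValues;
  }

  public static void main(String[] args) {
    // Hypothetical stand-ins for my 4 word features (1-SW-happy, etc.),
    // one row per datum: 1 = feature present, 0 = absent.
    int[][] data = {
        {1, 0, 0, 0},  // "happy"
        {0, 1, 0, 0},  // "glad"
        {0, 0, 1, 0},  // "gloomy"
        {0, 0, 0, 1},  // "fine"
    };
    int[] labels = {0, 0, 1, 2};  // happy, happy, sad, neutral
    int numClasses = 3, numFeatures = 4;
    int[] numValues = numberValues(data, numFeatures);  // all 2 for binary data

    double[][][] weights = new double[numClasses][numFeatures][];
    for (int cl = 0; cl < numClasses; cl++)
      for (int fno = 0; fno < numFeatures; fno++)
        weights[cl][fno] = new double[numValues[fno]];

    // The counting loop from trainWeightsJL: the third index is the VALUE
    // the datum takes for that feature. If data[i][fno] ever exceeds
    // numValues[fno] - 1, this is exactly the line that would throw
    // an ArrayIndexOutOfBoundsException.
    for (int i = 0; i < data.length; i++)
      for (int fno = 0; fno < numFeatures; fno++)
        weights[labels[i]][fno][data[i][fno]]++;

    System.out.println(weights[0][0][1]); // 1.0: class "happy" saw feature 0 present once
  }
}
```

With these assumptions the indexing works, which makes me suspect the real `data` array is not shaped the way this loop expects.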

I am unable to understand what

int[] numValues = numberValues(data, numFeatures);

means. The error is from the line

weights[labels[i]][fno][data[i][fno]]++;

I would have thought weights would be a 2D array keeping track of feature (fno) occurrences for the different classes (labels). I am not sure why the third dimension is needed.
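
Written out, what I expected looks like this (my own sketch with hypothetical binary data, not anything from the library):

```java
public class TwoDExpectation {
  // The 2D table I had in mind: counts[class][feature] = number of times
  // the feature is present in datums of that class.
  public static double[][] countPresent(int[][] data, int[] labels,
                                        int numClasses, int numFeatures) {
    double[][] counts = new double[numClasses][numFeatures];
    for (int i = 0; i < data.length; i++)
      for (int fno = 0; fno < numFeatures; fno++)
        counts[labels[i]][fno] += data[i][fno];  // only "present" is counted
    return counts;
  }

  public static void main(String[] args) {
    // Hypothetical binary encoding of my 4 datums over the 4 word features.
    int[][] data = {{1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1}};
    int[] labels = {0, 0, 1, 2};  // happy, happy, sad, neutral
    double[][] counts = countPresent(data, labels, 3, 4);
    // A 2D table like this loses the count of "feature absent" per class;
    // maybe that is what the per-value third dimension retains?
    System.out.println(counts[0][0]);  // 1.0
  }
}
```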

Any help will be greatly appreciated.


I'm not having any issues with these properties:

#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt

And running this command:

java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop