Basic concepts: Naive Bayes algorithm for classification
I think I understand Naive Bayes more or less, but I have a few questions regarding its implementation for a simple binary text classification task.
Let's say that a document D_i is some subset of the vocabulary x_1, x_2, ..., x_n. There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i).
I have three questions:
1. Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?
2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
3. Suppose an x_j doesn't exist in the training set. Do I give it a probability of 1 so that it doesn't alter the calculations?
For example, let us say that I have a training set with one document per class:
training = [("hello world", "good"),
            ("bye world", "bad")]
so the per-class word counts would be
good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all_counts = {"hello": 1, "world": 2, "bye": 1}
so now if I want to compute the probability of a test string being good
test1 = ["hello", "again"]
p_good = sum(good_class.values()) / sum(all_counts.values())
p_hello_good = good_class["hello"] / all_counts["hello"]
p_again_good = 1  # because "again" doesn't exist in our training set
p_test1_good = p_good * p_hello_good * p_again_good
This question is fairly broad, so I can only answer it in a limited way:-
1st:- Should P(c_i) be #docs in c_i / #total docs or #words in c_i / #total words?
P(c_i) = #docs in c_i / #total docs
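A minimal sketch of that prior estimate, using the two-document toy training set from the question. The names `doc_counts` and `priors` are made up for illustration and are not from any library.

```python
from collections import Counter

# Toy training set from the question: (text, label) pairs.
training = [("hello world", "good"), ("bye world", "bad")]

# The prior counts documents per class, not words per class.
doc_counts = Counter(label for _, label in training)
total_docs = len(training)

priors = {label: count / total_docs for label, count in doc_counts.items()}
print(priors)  # {'good': 0.5, 'bad': 0.5}
```

With one document in each class, both priors come out as 0.5 regardless of how many words each document contains.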
2nd:- Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
No (corrected after @larsmans pointed out a mistake). It is the number of occurrences of the word x_j in the documents of class c_i, divided by the total number of words in that class across the whole dataset.
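As a rough illustration of that ratio, here is a sketch on the question's toy data. The helper `likelihood` and the `word_counts` table are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter, defaultdict

# Toy training set from the question: (text, label) pairs.
training = [("hello world", "good"), ("bye world", "bad")]

# word_counts[label][word] = occurrences of word in documents of that class.
word_counts = defaultdict(Counter)
for text, label in training:
    word_counts[label].update(text.split())

def likelihood(word, label):
    """Unsmoothed estimate of P(word | label)."""
    total_words_in_class = sum(word_counts[label].values())
    return word_counts[label][word] / total_words_in_class

print(likelihood("hello", "good"))  # 0.5
print(likelihood("again", "good"))  # 0.0
```

Note that the unseen word "again" gets probability 0 here, which would zero out the whole product for every class. That is the problem smoothing addresses.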
3rd:- Suppose an x_j doesn't exist in the training set, do I give it a probability of 1 so that it doesn't alter the calculations?
For that we have the Laplace correction, also called additive smoothing. It is applied as
P(x_j|c_i) = (#times x_j appears in c_i + 1) / (#words in c_i + |V|), where |V| is the vocabulary size, which neutralizes
the effect of features that do not occur.
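Putting the three answers together, a sketch of a smoothed classifier for the question's example might look like the following. It assumes add-one smoothing with |V| taken as the training vocabulary size, and it sums logs rather than multiplying probabilities to avoid underflow on longer documents. All function and variable names are illustrative.

```python
import math
from collections import Counter, defaultdict

# Toy training set from the question: (text, label) pairs.
training = [("hello world", "good"), ("bye world", "bad")]

doc_counts = Counter()
word_counts = defaultdict(Counter)
for text, label in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

# Assumption: |V| is the set of distinct words seen in training.
vocabulary = {w for counts in word_counts.values() for w in counts}

def smoothed_likelihood(word, label, alpha=1):
    """Additive-smoothed estimate of P(word | label)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * len(vocabulary))

def log_score(words, label):
    """log P(label) + sum of log P(word | label), up to a shared constant."""
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for w in words:
        score += math.log(smoothed_likelihood(w, label))
    return score

test1 = ["hello", "again"]
scores = {label: log_score(test1, label) for label in doc_counts}
prediction = max(scores, key=scores.get)
print(prediction)  # good
```

Here "hello" gets (1 + 1) / (2 + 3) = 0.4 under "good" and 0.2 under "bad", while the unseen "again" gets 0.2 under both, so it no longer wipes out the score and the document is classified as "good". Some implementations instead skip words that never appeared in training; either choice leaves the ranking unchanged in this example.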