Basic concepts: Naive Bayes algorithm for classification

I think I more or less understand Naive Bayes, but I have a few questions about implementing it for a simple binary text classification task.

Suppose document D_i is some subset of the vocabulary x_1, x_2, ..., x_n.

There are two classes c_i that any document can fall into, and for some input document D I want to compute P(c_i|D), which is proportional to P(D|c_i)P(c_i).

I have three questions:

  • Is P(c_i) = #docs in c_i / #total docs, or #words in c_i / #total words?
  • Is P(x_j|c_i) = #times x_j appears in D / #times x_j appears in c_i?
  • Suppose an x_j doesn't exist in the training set; do I give it a probability of 1 so that it doesn't change the calculation?
  • For example, let's say I have the training set:

    training = [("hello world", "good"),
                ("bye world", "bad")]
    

    So the classes would be:

    good_class = {"hello": 1, "world": 1}
    bad_class = {"bye": 1, "world": 1}
    all = {"hello": 1, "world": 2, "bye": 1}
    

    So now, if I want to compute the probability that a test string is good:

    test1 = ["hello", "again"]
    p_good = sum(good_class.values())/sum(all.values())
    p_hello_good = good_class["hello"]/all["hello"]
    p_again_good = 1 # because "again" doesn't exist in our training set
    
    p_test1_good = p_good * p_hello_good * p_again_good
    

Since this question is quite broad, I can only answer it in a limited way:

1st: Is P(c_i) = #docs in c_i / #total docs, or #words in c_i / #total words?

    P(c_i) = #docs in c_i / #total docs
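
A minimal sketch of computing that prior, reusing the toy training set from the question (the variable names here are mine, for illustration):

    from collections import Counter

    training = [("hello world", "good"),
                ("bye world", "bad")]

    # P(c_i) = #docs in c_i / #total docs
    doc_counts = Counter(label for _, label in training)
    priors = {c: n / len(training) for c, n in doc_counts.items()}
    print(priors)  # {'good': 0.5, 'bad': 0.5}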
    

2nd: Is P(x_j|c_i) = #times x_j appears in D / #times x_j appears in c_i?

As @larsmans noted:

    It is the number of occurrences of the word in that class,
    divided by the total number of words in that class in the whole dataset.
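
A sketch of that corrected likelihood on the toy counts above (again with my own variable names):

    # P("hello"|good): count of "hello" in the good class
    # over the total number of words in the good class.
    good_class = {"hello": 1, "world": 1}
    total_good_words = sum(good_class.values())             # 2
    p_hello_good = good_class["hello"] / total_good_words   # 0.5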
    

3rd: Suppose an x_j doesn't exist in the training set; do I give it a probability of 1 so that it doesn't change the calculation?

For that we have the Laplace correction, or additive smoothing. It is applied as

    p(x_j|c_i) = (#times x_j appears in c_i + 1) / (#words in c_i + |V|)

which neutralizes the effect of features that never occur.
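
Putting the three answers together, here is a minimal end-to-end sketch with Laplace smoothing on the toy data from the question; the helper log_posterior and its structure are my own, not from the original posts:

    import math
    from collections import Counter, defaultdict

    training = [("hello world", "good"),
                ("bye world", "bad")]

    class_docs = Counter()              # documents per class
    class_words = defaultdict(Counter)  # word counts per class
    vocab = set()
    for text, label in training:
        class_docs[label] += 1
        for word in text.split():
            class_words[label][word] += 1
            vocab.add(word)

    def log_posterior(doc, label):
        # log P(c_i) + sum_j log P(x_j|c_i), with add-one smoothing
        total_words = sum(class_words[label].values())
        score = math.log(class_docs[label] / sum(class_docs.values()))
        for word in doc:
            count = class_words[label][word]  # 0 for unseen words like "again"
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    test1 = ["hello", "again"]
    print(max(class_docs, key=lambda c: log_posterior(test1, c)))  # good

Working in log space avoids underflow on long documents, and the +1 / +|V| terms give every word a small nonzero probability in every class, so unseen words like "again" no longer need an ad hoc probability of 1.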
    