How does the spaCy lemmatizer work?

For lemmatization, spaCy has lists of words: adjectives, adverbs, verbs... and also lists of exceptions: adverbs_irreg... For the regular cases there is a set of rules.

Let's take as an example the word "wider".

As it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

As I understand it, the process would be like this:

1) Get the POS tag of the word to know whether it is a noun, a verb, etc.
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided that "er" -> "e" should be used instead of "er" -> "" to get "wide" and not "wid"?

It can be tested here.
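A minimal sketch of the ambiguity (toy code, not spaCy's implementation): naively applying every matching rule from ADJECTIVE_RULES produces two candidates, so something else must choose between them.

```python
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"],
]

def candidate_lemmas(word):
    """Apply every suffix rule whose old ending matches the word."""
    candidates = []
    for old, new in ADJECTIVE_RULES:
        if word.endswith(old):
            candidates.append(word[:len(word) - len(old)] + new)
    return candidates

print(candidate_lemmas("wider"))  # ['wid', 'wide']
```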


TL;DR: spaCy checks whether the lemma it is trying to generate is in the known list of words, or in the exceptions, for that part of speech.

Long answer:

Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
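Driving the function above with a toy index and exception table shows the mechanism (the index and exceptions here are illustrative stand-ins, not spaCy's actual word lists):

```python
def lemmatize(string, index, exceptions, rules):
    # Copied from spaCy's lemmatizer.py (quoted above) so the
    # example is self-contained.
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
toy_index = {"wide", "old", "new"}      # stand-in for the full adjective index
toy_exceptions = {"worse": ["bad"]}     # an irregular form

print(lemmatize("wider", toy_index, toy_exceptions, ADJECTIVE_RULES))
# {'wide'}  -- 'wid' is produced too, but lands in oov_forms and is dropped
print(lemmatize("worse", toy_index, toy_exceptions, ADJECTIVE_RULES))
# {'bad'}
```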

For an English adjective, for example, it takes in the string we are evaluating, the index of known adjectives, the exceptions, and the rules (as you quoted) from this directory (for the English model).

After lowercasing the string, the first thing we do in lemmatize is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".

Then we go through our rules and apply each one that is applicable to the string. For the word wider, we would apply the following rules:

["er", ""],
["est", ""],
["er", "e"],
["est", "e"]

and we would output the following forms: ["wid", "wide"].

Then, we check whether this form is in our index of known adjectives. If it is, we append it to the forms. Otherwise, we add it to oov_forms, which I'm guessing is short for out of vocabulary. wide is in the index, so it gets added; wid gets added to oov_forms instead.

Lastly, we return a set of the lemmas found, or else any lemmas that matched a rule but were not in our index, or else just the word itself.

The word-lemmatize link you posted above works for wider because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (a word I made up) as an adjective, but since it is not in the index, it will just return blandier as the lemma.


让我们从类定义开始:https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

It starts off by initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules

Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/blob/master/spacy/en/lemmatizer/__init__.py, where it loads the files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

Why doesn't spaCy just read a file?

Most probably because declaring the strings in-code is faster than streaming the strings through I/O.


Where do these indices, exceptions and rules come from?

Looking at them closely, they all seem to have come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html

Rules

Looking even more closely, the rules are similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

These rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html

Additionally, spacy includes some punctuation rules that are not from Princeton Morphy:

PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
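These rules fit the same suffix-replacement machinery as the lexical ones: since each token here is a single punctuation character, "replacing the matching ending" rewrites the whole token. A small sketch (toy helper, not spaCy's code):

```python
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"],
]

def normalize_punct(token):
    # Suffix replacement as in lemmatize(): for a one-character
    # punctuation token, the rule replaces the entire token.
    for old, new in PUNCT_RULES:
        if token.endswith(old):
            return token[:len(token) - len(old)] + new
    return token

print(normalize_punct("“"))       # '"'
print(normalize_punct("\u2019"))  # "'"
```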

Exceptions

As for the exceptions, they are stored in spacy's *_irreg.py files, and they look like they also came from the Princeton WordNet.

This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc), and if you download the wordnet package from nltk, we see that it is the same list:

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc
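The .exc files are plain text with one mapping per line: the inflected form followed by its lemma(s). A minimal parser for that format (the sample entries below are illustrative, not the full 1490-line list):

```python
# Illustrative entries in the adj.exc format: "<inflected> <lemma...>"
sample_exc = """\
worse bad
better good
best good
"""

def parse_exc(text):
    """Turn exclusion-file lines into an exceptions dict,
    mapping the inflected form to a list of lemmas."""
    exceptions = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            exceptions[parts[0]] = parts[1:]
    return exceptions

print(parse_exc(sample_exc)["worse"])  # ['bad']
```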

Index

If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/blob/master/spacy/en/lemmatizer/_adjectives.py and the redistributed copy of WordNet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
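Each data line packs a synset into space-separated fields: synset offset, lexicographer file number, synset type, a two-digit hexadecimal word count, then (word, lex_id) pairs. A rough sketch of extracting the lemma words from such a line (a simplified reading of the wndb format, not a full parser):

```python
def synset_words(line):
    # Fields: offset lex_filenum ss_type w_cnt word lex_id [word lex_id ...]
    fields = line.split()
    w_cnt = int(fields[3], 16)  # the word count is hexadecimal
    # words start at index 4 and alternate with their lex_ids
    return [fields[4 + 2 * i].replace("_", " ") for i in range(w_cnt)]

line = ("00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 "
        "! 00002527 a 0101 | facing away from the axis of an organ or organism")
print(synset_words(line))  # ['abaxial', 'dorsal']
```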

Given that the dictionaries, exceptions and rules that the spacy lemmatizer uses largely come from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spacy applies the rules using the index and exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

The main action comes from the function rather than the Lemmatizer class:

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

Why is the lemmatize method outside of the Lemmatizer class?

I'm not exactly sure, but perhaps it is to make sure the lemmatization function can be called outside of a class instance. Given that @staticmethod and @classmethod exist, though, there may be other considerations as to why the function and the class have been decoupled.

Morphy vs spacy

Comparing spacy's lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are:

  • Check the exception lists
  • Apply the rules once to the input to get y1, y2, y3, etc.
  • Return everything that is in the database (and check the original too)
  • If there are no matches, keep applying the rules until a match is found
  • Return an empty list if nothing can be found
As for spacy, it is possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76. But the overall process seems to be:

  • Look up the exceptions; if the word is in them, take the lemma(s) from the exception list.
  • Apply the rules.
  • Save the resulting forms that are in the index lists.
  • If no lemma was found in steps 1-3, then just keep track of the out-of-vocabulary (OOV) words and append the original string to the lemmatized forms.
  • Return the lemmatized forms.

With respect to OOV handling, spacy returns the original string if no lemmatized form is found; in that respect, the nltk implementation of morphy does the same, e.g.

    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> wnl.lemmatize('alvations')
    'alvations'
    

Checking for infinitives before lemmatization

Possibly another point of difference is how morphy and spacy decide which POS to assign to the word. In that respect, spacy puts some linguistics rules into the Lemmatizer() to decide whether a word is in its base form, and skips the lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit if lemmatization is to be done for all words in the corpus, since quite a chunk of them are infinitives (i.e. already the lemma form).
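The idea of the short-circuit can be sketched as follows (a simplified, hypothetical check, not spaCy's actual is_base_form() logic, which inspects more of the token's morphological features):

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Hypothetical simplification of spaCy's is_base_form():
    if the tagger's morphological features already mark the token
    as a base form, the rule-based lemmatizer can be skipped."""
    morphology = {} if morphology is None else morphology
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    return False

print(is_base_form_sketch("verb", {"VerbForm": "inf"}))  # True
print(is_base_form_sketch("verb", {"Tense": "past"}))    # False
```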

But that is possible in spacy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. While for morphy it is possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort out which forms are infinitive.

In general, the 3 primary signals of morphological features need to be teased out of the POS tag:

  • person

  • number

  • gender

I guess now that we know it works with linguistics rules and all, the other question is "are there any non rule-based methods for lemmatization?"

But before even answering that question, "What exactly is a lemma?" might be the better question to ask.


For each word type (adjective, noun, verb, adverb) there is a set of rules and a set of known words. The mapping happens here:

    INDEX = {
        "adj": ADJECTIVES,
        "adv": ADVERBS,
        "noun": NOUNS,
        "verb": VERBS
    }
    
    
    EXC = {
        "adj": ADJECTIVES_IRREG,
        "adv": ADVERBS_IRREG,
        "noun": NOUNS_IRREG,
        "verb": VERBS_IRREG
    }
    
    
    RULES = {
        "adj": ADJECTIVE_RULES,
        "noun": NOUN_RULES,
        "verb": VERB_RULES,
        "punct": PUNCT_RULES
    }
    

Then in this line of lemmatizer.py, the correct index, rules and exc (which, I believe, stands for exceptions, i.e. the irregular examples) get loaded:

    lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                       self.exc.get(univ_pos, {}),
                       self.rules.get(univ_pos, []))
    

All the remaining logic is in the lemmatize function, and it is surprisingly short. We perform the following operations:

  • If there is an exception containing the provided string (i.e. the word is irregular), use it and add it to the lemmatized forms.
  • For each rule, in the order given for the selected word type, check whether it matches the given word. If it does, try to apply it.

2a. If, after applying the rule, the word is in the list of known words (i.e. the index), add it to the lemmatized forms of the word.

2b. Otherwise, add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary").

  • If at least one form was found using the rules above, return the list of forms found; otherwise return the oov_forms list.
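The per-POS dispatch and the rule application can be put together in a few lines (toy tables throughout; the real INDEX / EXC / RULES dicts hold the full WordNet-derived lists):

```python
def lemmatize(string, index, exceptions, rules):
    # Same logic as the lemmatize() function quoted earlier.
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

# Toy per-POS tables mirroring the INDEX / EXC / RULES mappings above.
INDEX = {"adj": {"wide"}, "noun": {"wolf"}}
EXC = {"adj": {"worse": ["bad"]}, "noun": {}}
RULES = {"adj": [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]],
         "noun": [["ves", "f"], ["s", ""]]}

def lemmas_for(string, univ_pos):
    # Mirrors the dispatch line from lemmatizer.py quoted above.
    return lemmatize(string,
                     INDEX.get(univ_pos, {}),
                     EXC.get(univ_pos, {}),
                     RULES.get(univ_pos, []))

print(lemmas_for("wider", "adj"))    # {'wide'}
print(lemmas_for("wolves", "noun"))  # {'wolf'}
```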