How does the spaCy lemmatizer work?
For lemmatization, spaCy has lists of words (adjectives, adverbs, verbs, ...) and also lists of exceptions (adverbs_irreg, ...); for the regular ones there is a set of rules.
Let's take the word "wider" as an example.
As it is an adjective, the rule for lemmatization should be taken from this list:
ADJECTIVE_RULES = [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
As I understand it, the process goes like this:
1) Get the POS tag of the word to know whether it is a noun, a verb, etc.
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.
Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?
Here it can be tested.
TLDR: spaCy checks whether the lemma it's trying to generate is in the known list of words or exceptions for that part of speech.
Long Answer:
Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
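To see the flow concretely, here is the same function exercised with a toy index, exception list, and rule set (miniature made-up data, not spaCy's real English files):

```python
# Toy stand-ins for spaCy's English adjective resources
# (the real index/exceptions come from the data files discussed below).
index = {"wide", "bad", "good"}
exceptions = {"worse": ["bad"]}
rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))      # irregular forms first
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)                # known word: keep it
            else:
                oov_forms.append(form)            # candidate not in index
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)                      # last resort: the word itself
    return set(forms)

print(lemmatize("wider", index, exceptions, rules))   # {'wide'} -- 'wid' filtered out
print(lemmatize("worse", index, exceptions, rules))   # {'bad'} -- exception hit
print(lemmatize("blandish", index, exceptions, rules))  # {'blandish'} -- no rule matches
```

Note how "wid" is generated but discarded because only "wide" survives the index check, which answers the question above.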
For English adjectives, for instance, it takes in the string we're evaluating, the index of known adjectives, the exceptions, and the rules, as you've referenced, from this directory (for the English model).
The first thing we do in lemmatize, after making the string lowercase, is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".
Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules:
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
and we would output the following forms: ["wid", "wide"].
Then, we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise, we add it to oov_forms, which I'm guessing is short for "out of vocabulary". wide is in the index, so it gets added; wid gets added to oov_forms.
Lastly, we return a set of either the lemmas found, or any lemmas that matched rules but weren't in our index, or just the word itself.
The word-lemmatize link you posted above works for wider, because wide is in the word index. Try something like "He is blandier than I." spaCy will mark blandier (a word I made up) as an adjective, but it's not in the index, so it will just return blandier as the lemma.
Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
Class
It starts off by initializing 3 variables:
class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules
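Since the constructor just stores three plain mappings, a Lemmatizer can be built directly from in-memory dicts. A minimal sketch with made-up toy data (the path argument is unused by this simplified load()):

```python
class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules

# Build an instance from toy data instead of the real data files.
lem = Lemmatizer.load("unused/path",
                      index={"adj": {"wide"}},
                      exc={"adj": {"worse": ["bad"]}},
                      rules={"adj": [["er", "e"]]})
print(lem.exc["adj"]["worse"])  # ['bad']
```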
Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/blob/master/spacy/en/lemmatizer/__init__.py, where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer
Why doesn't spaCy just read a file?
Most probably because declaring the strings in-code is faster than streaming strings through I/O.
Where do these index, exceptions and rules come from?
Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html
Rules
Looking at it even closer, the rules in https://github.com/explosion/spaCy/blob/master/spacy/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk (https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749), and these rules originally come from the Morphy software (https://wordnet.princeton.edu/man/morphy.7WN.html).
Additionally, spacy includes some punctuation rules that aren't from Princeton Morphy:
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
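Applied through the same suffix-replacement mechanism, these rules simply normalize curly quotes to their ASCII equivalents. A minimal sketch (the helper function here is my own simplification of the rule-application loop, not spaCy's code):

```python
# Curly-quote normalization rules, as in spaCy's PUNCT_RULES.
PUNCT_RULES = [
    ["\u201c", '"'],   # left double quote  -> "
    ["\u201d", '"'],   # right double quote -> "
    ["\u2018", "'"],   # left single quote  -> '
    ["\u2019", "'"],   # right single quote -> '
]

def lemmatize_punct(string, rules):
    # Apply the first matching suffix rule; punctuation tokens are
    # single characters, so the "suffix" is the whole token.
    for old, new in rules:
        if string.endswith(old):
            return string[:len(string) - len(old)] + new
    return string

print(lemmatize_punct("\u201c", PUNCT_RULES))  # "
print(lemmatize_punct("\u2019", PUNCT_RULES))  # '
```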
Exceptions
As for the exceptions, they are stored in the *_irreg.py files in spacy, and they look like they also come from the Princeton WordNet. This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc): if you download the wordnet package from nltk, we see that it's the same list:
alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc
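Each line of a WordNet .exc file is just an inflected form followed by its base form(s), so turning it into a spaCy-style exceptions dict is trivial. A sketch (the sample lines below are illustrative entries in the adj.exc format, not a dump of the real 1490-line file):

```python
import io

# A few lines in the WordNet adj.exc format: "inflected base [base2 ...]"
sample = io.StringIO("worse bad\nbest good\nfurther far\n")

exc = {}
for line in sample:
    inflected, *bases = line.split()
    exc[inflected] = bases

print(exc["worse"])  # ['bad']
```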
Index
If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/blob/master/spacy/en/lemmatizer/_adjectives.py and the re-distributed copy of WordNet in nltk:
alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
On the basis that the dictionary, exceptions and rules that the spacy lemmatizer uses are largely from Princeton WordNet and their Morphy software, we can move on to see the actual implementation of how spacy applies the rules using the index and exceptions.
We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
The main action comes from the lemmatize function rather than the Lemmatizer class:
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
Why is the lemmatize method outside of the Lemmatizer class?
That I'm not exactly sure, but perhaps it's to ensure that the lemmatization function can be called without a class instance. However, given that @staticmethod and @classmethod exist, perhaps there are other considerations as to why the function and class have been decoupled.
Morphy vs Spacy
Comparing spacy's lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are:
For spacy, possibly, it's still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76
But the general process seems to be:
In terms of OOV handling, spacy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Checking for infinitive before lemmatization
Possibly another point of difference is how morphy and spacy decide what POS to assign to the word. In that respect, spacy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips the lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit if lemmatization is to be done for all words in the corpus and quite a chunk of them are infinitives (already the lemma form).
But that's possible in spacy because it allows the lemmatizer to access the POS that's tied closely to some morphological rules. For morphy, although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.
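A rough sketch of that shortcut (a hypothetical simplification of my own; spaCy's actual is_base_form() inspects the universal POS tag together with morphological features such as VerbForm and Number):

```python
def is_base_form_sketch(univ_pos, morphology):
    """Hypothetical simplification: return True when the token is
    already a base form, so lemmatization can be skipped entirely."""
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True   # bare infinitive, e.g. "run"
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True   # singular noun, e.g. "dog"
    return False

# A bare infinitive can be returned as its own lemma without
# consulting the index, exceptions, or rules at all.
print(is_base_form_sketch("verb", {"VerbForm": "inf"}))  # True
print(is_base_form_sketch("verb", {"Tense": "past"}))    # False
```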
Generally, the 3 primary signals of morphological features need to be teased out of the POS tag:
Epilogue
I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"
But before answering that, "What exactly is a lemma?" might be the better question to ask.
There is a set of rules and a set of known words for each word type (adjective, noun, verb, adverb). The mapping happens here:
INDEX = {
    "adj": ADJECTIVES,
    "adv": ADVERBS,
    "noun": NOUNS,
    "verb": VERBS
}

EXC = {
    "adj": ADJECTIVES_IRREG,
    "adv": ADVERBS_IRREG,
    "noun": NOUNS_IRREG,
    "verb": VERBS_IRREG
}

RULES = {
    "adj": ADJECTIVE_RULES,
    "noun": NOUN_RULES,
    "verb": VERB_RULES,
    "punct": PUNCT_RULES
}
Then on this line in lemmatizer.py the correct index, rules and exc (exc I believe stands for exceptions, e.g. irregular examples) get loaded:
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                   self.exc.get(univ_pos, {}),
                   self.rules.get(univ_pos, []))
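Putting that dispatch together with toy data (the table names mirror the mapping above, but the entries are made up for illustration, and the lemmatize body is condensed):

```python
# Toy versions of the per-POS tables from the mapping above.
INDEX = {"adj": {"wide"}, "verb": {"run"}}
EXC = {"adj": {"worse": ["bad"]}, "verb": {"ran": ["run"]}}
RULES = {"adj": [["er", ""], ["er", "e"]], "verb": [["ing", ""]]}

def lemmatize(string, index, exceptions, rules):
    # Condensed version of the function quoted earlier.
    string = string.lower()
    forms = list(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form and (form in index or not form.isalpha()):
                forms.append(form)
            elif form:
                oov_forms.append(form)
    return set(forms or oov_forms or [string])

def lemmas_for(string, univ_pos):
    # The same .get() dispatch as the line quoted above.
    return lemmatize(string, INDEX.get(univ_pos, {}),
                     EXC.get(univ_pos, {}),
                     RULES.get(univ_pos, []))

print(lemmas_for("wider", "adj"))  # {'wide'}
print(lemmas_for("ran", "verb"))   # {'run'} -- via the exception table
```

An unknown POS simply falls through to empty tables, so the word is returned unchanged.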
All the remaining logic is in the function lemmatize and is surprisingly short. We perform the following operations:
1. Look up the given word in the list of exceptions for its word type and, if found, add the exceptional forms directly.
2. For each rule, in the order they are given for the selected word type, check if it matches the given word. If it does, try to apply it.
2a. If after applying the rule the word is in the list of known words (i.e. the index), add it to the lemmatized forms of the word.
2b. Otherwise add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary").