How can I tag or extract phrases from free text using a custom vocabulary (Python)?
I have a custom vocabulary with approx. 1M rows in a SQL table. Each row has a UID and a corresponding phrase that can be many words in length. This table rarely changes.
I need to tag, extract, chunk or recognize (NER?) entity phrases in a free-text document against the above-mentioned custom vocabulary, so that for each phrase found in the free text I can pull its UID.
It would be nice if partial matches, and phrases whose tokens appear in a different order, could also be tagged / extracted according to some threshold or algorithm setting.
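To make the requirement concrete, here is a minimal sketch of the kind of matching described above. It is only an illustration under stated assumptions, not a recommended implementation: the vocabulary is a tiny in-memory dict standing in for the SQL table, and the names `tokenize`, `build_index`, `tag`, `window` and `threshold` are hypothetical. A phrase counts as found when a large enough fraction of its tokens appears, in any order, inside a sliding window of the text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(vocabulary):
    """Build an inverted index: token -> set of UIDs whose phrase contains it.

    `vocabulary` is assumed to be a dict {uid: phrase}, e.g. loaded once
    from the SQL table since it rarely changes.
    """
    index = defaultdict(set)
    phrase_tokens = {}
    for uid, phrase in vocabulary.items():
        tokens = set(tokenize(phrase))
        phrase_tokens[uid] = tokens
        for token in tokens:
            index[token].add(uid)
    return index, phrase_tokens


def tag(text, index, phrase_tokens, window=6, threshold=0.75):
    """Return {uid: best_score} for phrases matched in `text`.

    The score is the fraction of a phrase's tokens found inside a window of
    `window` consecutive text tokens, regardless of their order.
    """
    tokens = tokenize(text)
    best = {}
    for start in range(len(tokens)):
        seen = set(tokens[start:start + window])
        # Only phrases sharing at least one token with the window are scored.
        candidates = set()
        for token in seen:
            candidates |= index.get(token, set())
        for uid in candidates:
            score = len(phrase_tokens[uid] & seen) / len(phrase_tokens[uid])
            if score >= threshold and score > best.get(uid, 0.0):
                best[uid] = score
    return best


# Hypothetical three-row vocabulary standing in for the 1M-row table.
vocabulary = {
    101: "computer security",
    102: "natural language processing",
    103: "security audit",
}
index, phrase_tokens = build_index(vocabulary)
text = "We study security of computer networks and language processing."
matches = tag(text, index, phrase_tokens, window=6, threshold=0.6)
```

With these settings, UID 101 matches fully even though its tokens are reordered, UID 102 matches partially (two of three tokens), and UID 103 falls below the threshold. Whether an in-memory index like this is practical for the full table is an open question.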
Thanks!
After many hours of checking various APIs, we decided to go with TextRazor.
The quality of the NLP phrase extraction and classification results is superb. TextRazor uses Freebase and DBpedia (among other repositories), which lets it classify, categorize and extract phrases such as "computer security" correctly as a single entity, whereas many other APIs incorrectly split this example into one class for "computer" and another for "security". Programmatic control over which terms TextRazor will and will not use is, again, very simple.
In terms of speed, TextRazor is amazingly fast. If I understand correctly, it runs in parallel on many (hundreds? thousands?) Amazon on-demand machines.
Cost: we compared it to others and did an in-depth analysis against one of their competitors (a very large three-letter company), and TextRazor's pricing is definitely competitive and reasonable.
Integration with their API using Python was (relatively) straightforward, except for a minor issue with HTTPS when working locally on the Web2Py framework. If you hit an obstacle while using TextRazor on Web2Py locally, feel free to ping me and I'll gladly share our solution.
Service / support: very responsive. They usually reply to all inquiries within 12 hours.
Disclosure: I have no interests, shares or any other financial benefits related to TextRazor, and we are actually still on their free plan, so we haven't paid them anything for their API services yet.