Tag and extract phrases from free text using a custom vocabulary (Python)?

I have a custom vocabulary with approx. 1M rows in a SQL table. Each row has a UID and a corresponding phrase that can be many words in length. This table rarely changes.

I need to tag, extract, chunk or recognize (NER?) entity phrases in a free-text document against the above-mentioned custom vocabulary, so that for each phrase found in the free text I can pull its UID.

It would be nice if partial matches, and phrases whose tokens appear in a different order, could also be tagged / extracted according to some threshold / algorithm settings.

  • Which NLP tool, preferably Python-based, can make use of a custom vocabulary in its tagging, extraction, chunking or NER from free text?
  • Given that the goal is to extract phrases from free text, which format is best suited for this custom vocabulary to work with the NLP tool? XML, JSON, trees, IOB chunks, other?
  • Is there any tool to help transform the SQL table (the original custom vocabulary) into the vocabulary format the NLP algorithm requires?
  • Do I need to integrate with other (non-Python) tools such as GATE, KEA, LingPipe, Apache Stanbol or OpenNLP?
  • Is there an API for both tagging / extracting and for creating a custom vocabulary?
  • Any experience with RapidMiner or TextRazor? Can these tools help with the above?

Thanks!


    After many hours of checking various APIs, we've decided to go with TextRazor.

    The quality of the NLP phrase extraction / classification results is superb. TextRazor uses Freebase and DBpedia (among other repositories), which allows it to classify / categorize / extract PHRASES such as "computer security" correctly as one entity (many other APIs incorrectly classify this example as two separate entities, "computer" and "security"). Programmatic control over which terms TextRazor will and will not use is, again, very simple.

    In terms of speed, TextRazor is amazingly fast. If I understand correctly, it uses parallel computing on many (hundreds? thousands?) of Amazon on-demand machines.

    Cost - we compared it to others and did an in-depth analysis against one of their competitors (a very large three-letter company) - and TextRazor is definitely competitive and reasonable.

    Integration with their API using Python was (relatively) straightforward, except for a minor issue with HTTPS when working locally on the Web2Py framework. If you hit an obstacle while using TextRazor on Web2Py locally, feel free to ping me and I'll gladly share our solution.

    Service / support - very responsive - they usually reply to all inquiries within 12 hours.

    Disclosure - I have no interests, shares or any other financial benefits related to TextRazor, and we are actually still on their free plan, so we haven't paid them anything for their API services yet.
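
For comparison, here is a minimal local sketch of the matching described in the question: loading (UID, phrase) rows from SQL, then tagging text with partial and order-independent matches above a threshold. It is not how TextRazor works and not a substitute for it. The table name `vocabulary`, the column names `uid` and `phrase`, the helper names, and the scoring rule (fraction of a phrase's distinct tokens found in one sentence) are all assumptions made for illustration.

```python
import re
import sqlite3
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return TOKEN_RE.findall(text.lower())


def load_vocabulary(conn):
    """Read (uid, phrase) rows and build an inverted index: token -> UIDs.

    Assumes a hypothetical table `vocabulary(uid, phrase)`.
    """
    phrases = {}
    index = defaultdict(set)
    for uid, phrase in conn.execute("SELECT uid, phrase FROM vocabulary"):
        tokens = frozenset(tokenize(phrase))
        if not tokens:
            continue
        phrases[uid] = tokens
        for tok in tokens:
            index[tok].add(uid)
    return phrases, index


def match_phrases(text, phrases, index, threshold=1.0):
    """Return {uid: best score} over all sentences in `text`.

    The score is the fraction of a phrase's distinct tokens that occur in a
    single sentence; token order is ignored. threshold=1.0 requires every
    token of the phrase, lower values allow partial matches.
    """
    best = {}
    for sentence in re.split(r"[.!?\n]+", text):  # crude sentence split
        hits = defaultdict(int)
        for tok in set(tokenize(sentence)):
            for uid in index.get(tok, ()):
                hits[uid] += 1
        for uid, n in hits.items():
            score = n / len(phrases[uid])
            if score >= threshold and score > best.get(uid, 0.0):
                best[uid] = score
    return best


# Tiny in-memory stand-in for the real 1M-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocabulary (uid INTEGER PRIMARY KEY, phrase TEXT)")
conn.executemany(
    "INSERT INTO vocabulary VALUES (?, ?)",
    [
        (1, "computer security"),
        (2, "natural language processing"),
        (3, "security audit"),
    ],
)
phrases, index = load_vocabulary(conn)

text = (
    "Our team studies security of computer networks. "
    "Language processing is a side project."
)
exact = match_phrases(text, phrases, index, threshold=1.0)    # {1: 1.0}
partial = match_phrases(text, phrases, index, threshold=0.5)  # UIDs 1, 2 and 3
```

Because the table rarely changes, the index could be built once and cached. A sketch like this has no stemming, no stop-word handling and no notion of token proximity beyond the sentence boundary, so it should be treated only as a baseline to compare a hosted API or a full NLP toolkit against.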
