Extract terminology from sentences quickly
I work in text mining, focusing on biomedical entities (genes, proteins, drugs and diseases), and I would like to ask a few questions.
My current goal is to find biomedical entities in biomedical text (from Medline) and, using dictionaries of terms, tag each entity found with its unique identifier.
To store the text, dictionaries and results, I am using MongoDB (a NoSQL database). Each abstract is split into sentences, and each sentence is stored as a new record (with its list of tokens, chunks and part-of-speech tags). To find entities, I fetch all the sentences and, for each one, I create a regular expression for each term in the dictionary (in Python):
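    import re

    # One regex search per dictionary term, repeated for every sentence
    for term in dictionary:
        matches = re.finditer(r'(' + term + ')', sentence)
        for m in matches:
            ini = m.start()
            end = m.end()
            # append takes a single argument, so store the hit as a tuple
            result.append((ini, end, dictionary.get_identification[term]))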
But it is really slow: I have several subsets of 150,000 abstracts each (more than 1,000,000 sentences).
I am also very interested in soft matching, so that I can extract entities whose wording is not exactly what appears in my dictionary, but that would increase the running time even more.
I think the problem is that I run a huge number of regular expressions per sentence (my dictionary has 300,000 entries), one for every term I need to look for. How would you solve this without machine learning algorithms? And with them? I am open to changing my programming language, database, and so on.
Thank you very much!!!
Regards,
àlex.
Instead of building one RE per term, build a single, disjunctive one that can catch all of them:
    pattern = re.compile("(%s)" % "|".join(re.escape(term) for term in dictionary))
then use pattern.finditer.
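To make that concrete, here is a minimal sketch of the single-pattern approach that also maps each match back to its identifier. The toy dictionary, its identifiers, and the helper names build_pattern and find_entities are invented for illustration and are not from the question. Sorting the terms longest first is an extra assumption on my part: Python's alternation takes the first alternative that matches, so without it a short term such as "breast" could shadow "breast cancer".

```python
import re

# Hypothetical toy dictionary: term -> identifier (all values invented for illustration).
term_to_id = {
    "BRCA1": "GENE:672",
    "breast cancer": "DISEASE:0001",
    "tamoxifen": "DRUG:0002",
}


def build_pattern(terms):
    """Compile one disjunctive regex covering every term."""
    # Longest terms first, so a longer term is tried before a shorter one
    # that shares its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as '(' or '+' in a term from being
    # interpreted as regex syntax.
    return re.compile("(%s)" % "|".join(re.escape(t) for t in ordered))


def find_entities(sentence, pattern, ids):
    """Return (start, end, identifier) for every dictionary hit in the sentence."""
    return [(m.start(), m.end(), ids[m.group(1)])
            for m in pattern.finditer(sentence)]


# Compile once, then reuse the same pattern for every sentence.
PATTERN = build_pattern(term_to_id)
results = find_entities("BRCA1 mutations raise breast cancer risk.",
                        PATTERN, term_to_id)
print(results)
```

The point is that the pattern is compiled once and each sentence is scanned once, instead of once per term. This sketch matches substrings anywhere, so whether you need word boundaries or case-insensitive matching depends on your dictionary, and it is worth measuring how a 300,000-term alternation behaves on your data before committing to it.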
As for "how to use machine learning", that's far too broad a question, IMHO. Start out by googling for "biomedical named entity recognition" -- there's a vast amount of literature about that problem and assorted tools.