Extract terminology from sentences quickly

I am working in Text Mining and my work is focused on biomedical entities (genes, proteins, drugs and diseases). I would like to share with you some questions.

My current goal is to find biomedical entities in biomedical text (from Medline) and, using dictionaries of terms, to link each entity found to its unique identifier.

To store the text, dictionaries and results, I am using MongoDB (a NoSQL database). Each abstract is split into sentences, and each sentence is stored in its own record (with its list of tokens, chunks and part-of-speech tags). To find entities, I fetch all the sentences and, for each one, I create a regular expression for each term in the dictionary (in Python):

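import re

for term in dictionary:
     # Escape the term so characters such as "(" or "+" are matched literally
     matches = re.finditer(r'(' + re.escape(term) + ')', sentence)
     for m in matches:
          ini = m.start()
          end = m.end()
          # Store one (start, end, identifier) tuple per match
          result.append((ini, end, dictionary.get_identification[term]))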

But it is really slow: I have several subsets of 150,000 abstracts (>1,000,000 sentences).

I am also very interested in soft matching, to extract entities whose terminology is not exactly in my dictionary, but that could increase my running time even further.

I think my problem is that I run a huge number of regular expressions (my dictionary has 300,000 entries) against each sentence, because I have to search for every term in every sentence. Without a machine learning algorithm, how would you solve this problem? And with ML algorithms? I am flexible about changing my programming language, database, etc.

Thank you very much!!!

Regards,

àlex.


Instead of building one RE per term, build a single disjunctive one that can catch all of them:

pattern = re.compile("(%s)" % "|".join(re.escape(term) for term in dictionary))

then use pattern.finditer.
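To make that concrete, here is a minimal sketch of how the single pattern could be used to recover the identifiers. It assumes the dictionary is a plain Python dict mapping each term to its identifier; the terms, the identifiers and the helper names build_pattern and find_entities are made up for illustration and are not from the question. Sorting the terms longest-first is an extra step, added because Python's alternation takes the first alternative that matches, so a short term could otherwise shadow a longer one starting at the same position.

```python
import re

# Hypothetical toy dictionary: surface term -> identifier (invented values).
dictionary = {
    "p53": "GENE:0001",
    "tumor protein p53": "GENE:0001",
    "aspirin": "DRUG:0002",
}


def build_pattern(terms):
    # Longest terms first, so "tumor protein p53" is tried before "p53".
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile("(%s)" % "|".join(re.escape(t) for t in ordered))


def find_entities(sentence, pattern, dictionary):
    # One pass over the sentence; each match is looked up by its exact text.
    return [
        (m.start(), m.end(), dictionary[m.group(1)])
        for m in pattern.finditer(sentence)
    ]


# Compile once, then reuse the same pattern for every sentence.
pattern = build_pattern(dictionary)
sentence = "Mutations in tumor protein p53 are unaffected by aspirin."
results = find_entities(sentence, pattern, dictionary)
```

Note that the match is case-sensitive and exact, so this does nothing for soft matching, and finditer only reports non-overlapping matches. I have not measured how a 300,000-term alternation behaves, so treat it as something to benchmark on one of your subsets first.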

As for "how to use machine learning", that's far too broad a question, IMHO. Start out by googling for "biomedical named entity recognition" -- there's a vast amount of literature about that problem and assorted tools.
