How to extract names from a string using NLTK
I am trying to extract (Indian) names from unstructured strings.
Here is my code:
text = "Balaji Chandrasekaran Bangalore | Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
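import nltk

pronouns = []  # collects every chunk labelled PERSON (in practice, every proper noun)
if text != "":
    # This grammar labels each single proper noun (NNP) as a PERSON chunk
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)
    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            pronouns.append(' '.join([c[0] for c in subtree]))
    print(pronouns)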
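['Balaji', 'Chandrasekaran', 'Bangalore', '|', 'Senior', 'Business', 'Analyst', '/', 'Lead', 'Business', 'Analyst', 'Senior', 'Business', 'Analyst', 'Successful', 'Development', 'Life', 'Cycle', 'SDLC', 'Agile', 'Business', 'Requirements', 'Analysis', 'Project', 'UI', 'UX', 'Wireframe', 'Designing', 'Functional', 'Specification', 'Test', 'Scenario', 'Creation', 'SharePoint', 'Admin', 'Work', 'History', 'Senior', 'Business', 'Analyst', 'Aug', 'Current', 'Technology', 'Chennai', 'Translating', 'CRM', 'Logistics', 'Operations', 'Intermodal', 'Education', 'Bachelor', 'Engineering', 'Electronics', 'Communication', 'Accomplishment', 'Successful', 'Mediterranean', 'Shipping', 'MSC', 'Georgia', 'MSC', 'Cambodia', 'MSC', 'MSC', 'South', 'Successful', 'Stake', 'MSC', 'Geneva', 'Switzerland', 'Pre', 'Study', 'General', 'Design', 'Functional', 'Specification', 'O', 'Business', 'Analyst', 'Forum', 'Business']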
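But what I actually need is just Balaji Chandrasekaran. I even tried the Stanford NER library, which did not pick up Balaji Chandrasekaran either.
Can anyone help me extract names from unstructured strings, or recommend a good tutorial for doing this?
Thanks in advance.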
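As I said in the comments, you would have to create your own corpus of Indian names and test your text against it. The NLTK Book shows how to do this in Chapter 2 (Section 1.9, to be exact).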
import nltk
from nltk.corpus import PlaintextCorpusReader

# You can use a regular expression to find the files, or pass a list of files
files = r".*\.txt"
new_corpus = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(new_corpus.words())
See also: Creating a new corpus with NLTK
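Once such a corpus exists, "testing your text against it" could look roughly like the sketch below. It is only an illustration: `KNOWN_NAMES` and `find_names` are hypothetical names, the tiny name set stands in for something like `set(w.lower() for w in new_corpus.words())`, and a simple regular expression replaces `nltk.word_tokenize` so the snippet runs without any NLTK data.

```python
import re

# Hypothetical name list; in practice this would be built from your own corpus,
# e.g. set(w.lower() for w in new_corpus.words()).
KNOWN_NAMES = {"balaji", "chandrasekaran", "priya", "raman"}

def find_names(text, known_names=KNOWN_NAMES):
    """Return maximal runs of consecutive tokens that all appear in known_names."""
    # Words become one token each; every other non-space character is its own token.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    names, run = [], []
    for tok in tokens:
        if tok.lower() in known_names:
            run.append(tok)
        else:
            if run:
                names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names

sample = "Balaji Chandrasekaran Bangalore | Senior Business Analyst"
print(find_names(sample))
```

A plain lookup like this only finds names that are already in the list, so its usefulness depends entirely on how complete the corpus is.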
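Named entity recognition is not just a matter of looking up known names; a recognizer uses a combination of clues, including the form of the word and the structure of the text. The name you are failing to recognize appears in a heading, not in running text, so nltk's recognizer (which is not that great anyway) cannot find it. See what happens if you use the name in a sentence: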
>>> text = "Balaji Chandrasekaran is a senior business analyst and lives in Bangalore."
>>> words = nltk.word_tokenize(text)
>>> print(nltk.ne_chunk(nltk.pos_tag(words)))
(S
  (PERSON Balaji/NNP)
  Chandrasekaran/NNP
  is/VBZ
  a/DT
  senior/JJ
  business/NN
  analyst/NN
  and/CC
  lives/NNS
  in/IN
  (GPE Bangalore/NNP)
  ./.)
It missed the last name (like I said, the recognizer is not that great), but it was able to tell that there is a name here.
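In other words: your problem is that you are not mining running text, but resumes. The only good solution is to build and train a recognizer of your own, using annotated resumes in the same format you want to process. This is not entirely simple: you need to annotate a training corpus, and figure out which useful features (clues from word form and document structure) your "feature extraction function" will put in a dictionary. Everything you need is described in various parts of chapters 6 and 7 of the nltk book.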
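To make the "feature extraction function" idea concrete, here is a minimal sketch of one that returns a dictionary of clues for a single token. The function name `token_features` and every feature in it are illustrative assumptions, not a tested recipe for resumes; which features actually help is something you would have to find out on your own annotated data.

```python
def token_features(tokens, i):
    """Return a feature dict for tokens[i]; every feature here is an illustrative guess."""
    word = tokens[i]
    return {
        # Clues from the form of the word
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "suffix2": word[-2:].lower(),
        # Clues from the document structure: a name often opens a resume
        "in_first_five": i < 5,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_is_capitalized": i + 1 < len(tokens) and tokens[i + 1][:1].isupper(),
    }

tokens = "Balaji Chandrasekaran Bangalore | Senior Business Analyst".split()
features = token_features(tokens, 0)
print(features)
```

Dictionaries of this shape, paired with a label for each token taken from your annotated resumes, are what the classifiers in chapter 6 of the nltk book are trained on.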