Get the first link in a Wikipedia article not inside parentheses
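So I'm interested in this theory that if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in 95% of cases you will end up at the article about Philosophy.
I wanted to write a script in Python that does the link fetching for me and, at the end, prints a nice list of which articles were visited (linkA -> linkB -> linkC), etc.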
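The overall loop can be sketched separately from the HTML parsing. This is only an illustration in Python 3: `follow_chain`, its `first_link` callback, and the toy link table are hypothetical names and made-up data, not real Wikipedia content. The callback is assumed to return the title of the first valid link on a page, or None if there is none.

```python
def follow_chain(start, first_link, target="Philosophy", max_steps=100):
    """Follow first_link(title) repeatedly, starting from `start`.

    Returns (path, outcome), where outcome is one of
    "reached target", "loop", "dead end" or "gave up".
    """
    path = [start]
    if start == target:
        return path, "reached target"
    visited = {start}
    current = start
    for _ in range(max_steps):
        nxt = first_link(current)      # hypothetical callback: title -> next title
        if nxt is None:
            return path, "dead end"
        path.append(nxt)
        if nxt == target:
            return path, "reached target"
        if nxt in visited:             # seen before, so we would cycle forever
            return path, "loop"
        visited.add(nxt)
        current = nxt
    return path, "gave up"


# Made-up link table standing in for real page fetching.
toy_links = {
    "Human": "Species",
    "Species": "Biology",
    "Biology": "Science",
    "Science": "Philosophy",
}

path, outcome = follow_chain("Human", toy_links.get)
print(" -> ".join(path))
print(outcome)
```

Keeping the fetching and parsing behind a callback means the loop logic can be tried out offline before any request is sent to Wikipedia.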
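I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar that leads to disambiguation pages. So far I have concluded that:
The links I am looking for all have a <p> element as an ancestor (most often the parent or grandparent, if the link is inside a <b> tag or similar). The top bar that leads to disambiguation pages does not seem to contain any <p> elements.
Invalid links contain some special word followed by a colon, e.g. Wikipedia: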
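Those two observations (keep only links that sit inside a <p>, and drop links whose title starts with a special word and a colon) can be sketched with the standard library's `html.parser`. This is a Python 3 illustration under assumptions: `ParagraphLinkCollector`, `paragraph_links`, `is_article_link`, and the prefix list are hypothetical names chosen here, the prefix list is not exhaustive, and the sample HTML is made up.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive list of namespace prefixes to reject.
EXCLUDED_PREFIXES = ("File:", "Wikipedia:", "Portal:", "Special:", "Help:",
                     "Template:", "Template_talk:", "Talk:", "Category:")


def is_article_link(href):
    """True for /wiki/ links whose title has none of the excluded prefixes."""
    if not href.startswith("/wiki/"):
        return False
    title = href[len("/wiki/"):]
    return not title.startswith(EXCLUDED_PREFIXES)


class ParagraphLinkCollector(HTMLParser):
    """Collects article links that appear between <p> and </p>."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a" and self.in_paragraph:
            href = dict(attrs).get("href")
            if href and is_article_link(href):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False


def paragraph_links(html):
    collector = ParagraphLinkCollector()
    collector.feed(html)
    return collector.links


# Made-up fragment: a top bar outside any <p>, then a paragraph.
sample = ('<div>For other uses, see '
          '<a href="/wiki/Human_(disambiguation)">Human (disambiguation)</a>.</div>'
          '<p>See <a href="/wiki/Wikipedia:About">about</a> and '
          '<b><a href="/wiki/Species">species</a></b>.</p>')
print(paragraph_links(sample))
```

The sketch relies on every <p> being explicitly closed; markup with implicit paragraph ends would need a more careful parser.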
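So far, so good. But it's the parentheses that get me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.
I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes, which may not always be the same. Any ideas?
My code can be seen below, but it's something I threw together really quickly and I'm not very proud of it. It is commented, however, so you can see my line of thought (I hope :)).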
"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time

def validWikiArticleLinkString(href):
    """ Takes a string and returns True if it contains the substring
        '/wiki/' in the beginning and does not contain any of the
        "special" wiki pages.
    """
    return (href.find("/wiki/") == 0
            and href.find("(disambiguation)") == -1
            and href.find("File:") == -1
            and href.find("Wikipedia:") == -1
            and href.find("Portal:") == -1
            and href.find("Special:") == -1
            and href.find("Help:") == -1
            and href.find("Template_talk:") == -1
            and href.find("Template:") == -1
            and href.find("Talk:") == -1
            and href.find("Category:") == -1
            and href.find("Bibcode") == -1
            and href.find("Main_Page") == -1)

if __name__ == "__main__":
    visited = []  # a list of visited links. used to avoid getting into loops

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # need headers for the api

    currentPage = "Human"  # the page to start with

    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()  # retrieve the contents of the wiki page we are at

        htmlDOM = parseString(html)  # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")  # find all <a> tags

        newLink = None  # stays None if this page yields no valid link

        for tag in aTags:
            if "href" in tag.attributes.keys():  # see if we have the href attribute in the tag
                href = tag.attributes["href"].value  # get the value of the href attribute
                if validWikiArticleLinkString(href):  # if we have one of the link types we are looking for

                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.

                    # assume the link is valid.
                    invalid = False

                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at a <a> tag somewhere inside a <table>.
                    pn = tag.parentNode
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode

                    if invalid:  # go to next link
                        continue

                    # Next we look at the descriptive texts above the article, if any; e.g
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.

                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True

                    if invalid:
                        continue

                    ###### this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk

                    # for elem in tag.parentNode.childNodes:
                    #     while elem.firstChild is not None:
                    #         elem = elem.firstChild
                    #     print elem.nodeValue

                    print href  # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break

        # if no valid link was found on this page, there is nowhere to go
        if newLink is None:
            print "No valid link found."
            break
        # if we have been to this link before, break the loop
        elif newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)  # mark this currentPage as visited
            currentPage = newLink  # make the page we found the new page to fetch
            time.sleep(5)  # sleep some to see results as debug
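I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text inside parentheses before parsing the links.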
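A minimal sketch of that idea, written for Python 3 with the standard library only. It is not the linked script's code: `strip_parenthesized` and `first_wiki_link` are hypothetical helpers, and the sample HTML is made up. The sketch walks the HTML string once and drops everything that sits inside parentheses in the text, while leaving parentheses inside tags (for example in an href) untouched.

```python
import re


def strip_parenthesized(html):
    """Remove text (and tags) found inside (...) in the text of an HTML string.

    Parentheses that occur inside a tag, such as in an href value, are not
    counted. Assumes attribute values contain no '>' character.
    """
    out = []
    depth = 0        # nesting depth of parentheses seen in text
    in_tag = False   # True while between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:
                out.append(ch)
        elif ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def first_wiki_link(html):
    """Return the first href starting with /wiki/, or None."""
    match = re.search(r'href="(/wiki/[^"]+)"', html)
    return match.group(1) if match else None


# Made-up paragraph: the first link is inside parentheses.
sample = ('<p>Humans (<a href="/wiki/Taxonomy">taxonomically</a> Homo sapiens) '
          'are a <a href="/wiki/Species_(biology)">species</a>.</p>')

print(first_wiki_link(sample))
print(first_wiki_link(strip_parenthesized(sample)))
```

A parenthesis that opens inside one element and closes outside it would leave unbalanced tags behind, so the cleaned string is best passed to a forgiving HTML parser rather than a strict XML one.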