Get the first link in a Wikipedia article not inside parentheses

2018-06-03 02:28:11

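I'm interested in the theory that if you start at a random Wikipedia article and repeatedly click the first link that is not inside parentheses, you end up at the article about Philosophy in 95% of cases.

I want to write a Python script that follows these links for me and, at the end, prints a readable list of the articles it visited (linkA -> linkB -> linkC) and so on.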
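The loop described above can be sketched independently of any HTML parsing. This is a minimal Python 3 sketch, not the code from the question: `follow_chain` and its `next_link` parameter are hypothetical names, and `toy_links` is made-up data standing in for real page lookups.

```python
def follow_chain(start, next_link, target="Philosophy", max_steps=100):
    """Follow first links from `start` until `target`, a loop, or a dead end.

    `next_link` is a hypothetical callable mapping a title to the next
    title, or to None when the page has no usable link.
    Returns (path, outcome).
    """
    path = [start]
    seen = {start}
    current = start
    while current != target and len(path) <= max_steps:
        nxt = next_link(current)
        if nxt is None:
            return path, "dead end"
        path.append(nxt)
        if nxt in seen:          # we have been here before: report the loop
            return path, "loop"
        seen.add(nxt)
        current = nxt
    outcome = "reached target" if current == target else "gave up"
    return path, outcome


# Made-up link table, used only to exercise the loop without network access.
toy_links = {"Human": "Species", "Species": "Biology", "Biology": "Philosophy"}
path, outcome = follow_chain("Human", toy_links.get)
print(" -> ".join(path), "|", outcome)
```

Passing the lookup in as a function keeps the loop testable offline; a real run would plug in a function that fetches a page and extracts its first valid link.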

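I managed to get the HTML DOM of the pages, and to strip out some unnecessary links as well as the description bar at the top that leads to disambiguation pages. My conclusions so far:

The DOM begins with the table that you see on the right of some pages, for example in Human. We want to ignore those links.

The valid link elements all have a <p> element somewhere among their ancestors (most often as the parent, or as the grandparent if the link is inside a <b> tag or similar). The top bar that leads to disambiguation pages does not seem to contain any <p> elements.

Invalid links contain certain special words followed by a colon, e.g. Wikipedia: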
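The three rules above can be expressed as one filter by checking tag names of ancestors instead of matching on the string form of a node. This is a Python 3 sketch using the same `xml.dom.minidom` module as the question's script; the helper names and the `sample` markup are invented for illustration, and treating every title that contains a colon as a special page is a simplification (a few real article titles contain colons).

```python
from xml.dom.minidom import parseString


def is_article_href(href):
    """True for '/wiki/Title' links whose title looks like a plain article."""
    if not href.startswith("/wiki/"):
        return False
    title = href[len("/wiki/"):]
    # Simplifying assumption: any colon marks a namespace such as 'Wikipedia:'.
    return (":" not in title
            and title != "Main_Page"
            and "(disambiguation)" not in title)


def ancestor_names(node):
    """Collect the lowercase tag names of all element ancestors of `node`."""
    names = set()
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE:
            names.add(parent.tagName.lower())
        parent = parent.parentNode
    return names


def candidate_links(dom):
    """Return hrefs of article links that sit inside a <p> but not a <table>."""
    result = []
    for a in dom.getElementsByTagName("a"):
        href = a.getAttribute("href")   # empty string if the attribute is missing
        names = ancestor_names(a)
        if is_article_href(href) and "p" in names and "table" not in names:
            result.append(href)
    return result


# Invented markup: a table link, a top-bar link, a namespace link, a valid link.
sample = (
    "<div>"
    "<table><tr><td><p><a href='/wiki/Taxonomy'>x</a></p></td></tr></table>"
    "<div><a href='/wiki/Human_(disambiguation)'>y</a></div>"
    "<p><a href='/wiki/Wikipedia:About'>z</a> "
    "<b><a href='/wiki/Species'>s</a></b></p>"
    "</div>"
)
print(candidate_links(parseString(sample)))
```

Only the link nested in `<b>` inside the `<p>` survives all three checks here. Note that minidom needs well-formed XML, so real pages that are not valid XHTML would need a more forgiving parser.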

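So far, so good. But it's the parentheses that get me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.

I have no idea how to go about this programmatically, since I would have to look for text in some combination of parent/child nodes that may not always be the same. Any ideas?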
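One way to handle text that is spread across parent and child nodes is to walk a paragraph in document order and keep a running count of open parentheses seen in text nodes. This is a Python 3 sketch under that assumption; `first_link_outside_parens` is a hypothetical helper and the sample paragraph is invented, loosely modelled on the Human example.

```python
from xml.dom.minidom import parseString


def first_link_outside_parens(paragraph):
    """Return the href of the first <a> in `paragraph` at parenthesis depth 0."""
    depth = 0  # number of currently open '(' seen in text so far

    def walk(node):
        nonlocal depth
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                # Update the depth from the text between elements.
                for ch in child.data:
                    if ch == "(":
                        depth += 1
                    elif ch == ")" and depth > 0:
                        depth -= 1
            elif child.nodeType == child.ELEMENT_NODE:
                if child.tagName.lower() == "a" and depth == 0:
                    href = child.getAttribute("href")
                    if href:
                        return href
                # Descend, so links nested in <b>, <i>, etc. are also seen.
                found = walk(child)
                if found:
                    return found
        return None

    return walk(paragraph)


# Invented paragraph: the first link is inside parentheses, the second is not.
sample = (
    "<p><b>Humans</b> (<a href='/wiki/Taxonomy'>taxonomically</a> "
    "<i>Homo sapiens</i>) are a <a href='/wiki/Species'>species</a>.</p>"
)
print(first_link_outside_parens(parseString(sample).documentElement))
```

Because the depth counter lives outside the recursive walk, it does not matter whether the opening parenthesis and the link share a parent. The sketch does not apply the other validity rules (tables, special pages), which would still need to be checked on the returned href.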

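My code is below, but it's something I threw together very quickly and I'm not very proud of it. It is commented, however, so you can follow my line of thought (I hope :)).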

"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time

def validWikiArticleLinkString(href):
    """ Takes a string and returns True if it contains the substring
        '/wiki/' in the beginning and does not contain any of the
        "special" wiki pages. 
    """
    return (href.find("/wiki/") == 0
            and href.find("(disambiguation)") == -1 
            and href.find("File:") == -1 
            and href.find("Wikipedia:") == -1
            and href.find("Portal:") == -1
            and href.find("Special:") == -1
            and href.find("Help:") == -1
            and href.find("Template_talk:") == -1
            and href.find("Template:") == -1
            and href.find("Talk:") == -1
            and href.find("Category:") == -1
            and href.find("Bibcode") == -1
            and href.find("Main_Page") == -1)


if __name__ == "__main__":
    visited = []    # a list of visited links. used to avoid getting into loops

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api

    currentPage = "Human"  # the page to start with

    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()    # retrieve the contents of the wiki page we are at

        htmlDOM = parseString(html) # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")   # find all <a> tags

        for tag in aTags:
            if "href" in tag.attributes.keys():         # see if we have the href attribute in the tag
                href = tag.attributes["href"].value     # get the value of the href attribute
                if validWikiArticleLinkString(href):                             # if we have one of the link types we are looking for

                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.

                    # assume the link is valid.
                    invalid = False            

                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at a <a> tag somewhere inside a <table>.
                    pn = tag.parentNode                     
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode 

                    if invalid:     # go to next link
                        continue               

                    # Next we look at the descriptive texts above the article, if any; e.g
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.

                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True

                    if invalid:
                        continue


                    ######  this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk

#                    for elem in tag.parentNode.childNodes:
#                        while elem.firstChild is not None:
#                            elem = elem.firstChild
#                        print elem.nodeValue

                    print href      # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break

        # if we have been to this link before, break the loop
        if newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)     # mark this currentPage as visited
            currentPage = newLink           # make the page we found the new page to fetch
            time.sleep(5)                   # sleep some to see results as debug

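I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup for the HTML parsing, and to cope with the parentheses issue it simply removes the text inside parentheses before parsing the links.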
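The idea of removing parenthesized text before parsing can be sketched without any parsing library. The Python 3 function below is an illustration of that idea, not the code from the linked script: `strip_parenthesized` is a hypothetical name, the sample markup and the `/wiki/Species_(biology)` href are invented, and the scan assumes that `>` never appears inside an attribute value.

```python
def strip_parenthesized(html):
    """Drop text inside (...) but leave parentheses inside tags untouched."""
    out = []
    depth = 0        # nesting depth of parentheses seen in text
    in_tag = False   # True while between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:       # tags inside parentheses are dropped too
                out.append(ch)
        elif ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth > 0:
                depth -= 1
            else:
                out.append(ch)   # stray ')' in text is kept as-is
        elif depth == 0:
            out.append(ch)
    return "".join(out)


# Invented markup: one link in parentheses, one href that contains parentheses.
html = ('<p>Humans (<a href="/wiki/Taxonomy">taxonomically</a>) are a '
        '<a href="/wiki/Species_(biology)">species</a>.</p>')
print(strip_parenthesized(html))
```

Tracking whether the scanner is inside a tag matters because article URLs can themselves contain parentheses, and a plain regex over the whole string would cut those hrefs apart. A tag that opens inside parentheses and closes outside them would leave unbalanced markup, so the result is best fed to a lenient parser.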
