Get the first link in a Wikipedia article not inside parentheses

So I'm interested in this theory that if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in 95% of the cases you will end up at the article about Philosophy.

I wanted to write a Python script that does the link fetching for me and, at the end, prints a nice list of which articles were visited (linkA -> linkB -> linkC, etc.).

I have managed to get the HTML DOM of the web pages and to strip out some unnecessary links, as well as the top description bar that leads to disambiguation pages. So far I have concluded that:

  • The DOM begins with the table that you see on the right of some pages, for example on Human. We want to ignore these links.
  • The valid link elements all have a <p> element somewhere as an ancestor (most often as a parent, or a grandparent if the link is inside a <b> tag or similar). The top bar that leads to disambiguation pages does not seem to contain any <p> elements.
  • Invalid links contain some special words followed by a colon, e.g. Wikipedia:
So far, so good. But it's the parentheses that get me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.

I'm not sure how to go about this programmatically, since I would have to look for text across some combination of parent/child nodes that may not always be the same. Any ideas?
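
The closest thing I have to an idea is this: for a candidate <a> tag, gather the text of every node that comes before it inside its enclosing <p> (in document order) and check whether an opening parenthesis is left unmatched there. Below is a rough, untested sketch of that check for minidom; the helper name and the traversal details are just my guesses:

    def insideParentheses(aTag, pTag):
        """ Guess at a check for 'is this link inside parentheses?':
            walk pTag's descendants in document order, collecting the text
            that appears before aTag, and see whether a '(' is left open.
        """
        stack = [pTag]
        textBefore = []
        while stack:
            node = stack.pop()
            if node is aTag:        # reached the link, stop collecting text
                break
            if node.nodeType == node.TEXT_NODE:
                textBefore.append(node.data)
            else:
                # push children in reverse so they pop in document order
                stack.extend(reversed(node.childNodes))
        text = u"".join(textBefore)
        return text.count("(") > text.count(")")

The plan would be to call this right after the <p>-ancestor check in the main loop (passing the <a> tag and the <p> node found there) and skip the link when it returns True, but I haven't verified this against real pages.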

My code is below; it's something I made up really quickly and am not very proud of. It is commented, however, so you can hopefully follow my line of thought. :)

    """Wikipedia fun"""
    import urllib2
    from xml.dom.minidom import parseString
    import time
    
    def validWikiArticleLinkString(href):
        """ Takes a string and returns True if it starts with '/wiki/'
            and does not point to any of the "special" wiki pages.
        """
        specialSubstrings = ("(disambiguation)", "File:", "Wikipedia:",
                             "Portal:", "Special:", "Help:", "Template_talk:",
                             "Template:", "Talk:", "Category:", "Bibcode",
                             "Main_Page")
        return (href.startswith("/wiki/")
                and not any(s in href for s in specialSubstrings))
    
    
    if __name__ == "__main__":
        visited = []    # a list of visited links. used to avoid getting into loops
    
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api
    
        currentPage = "Human"  # the page to start with
    
        while True:
            infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
            html = infile.read()    # retrieve the contents of the wiki page we are at
    
            htmlDOM = parseString(html) # get the DOM of the parsed HTML
            aTags = htmlDOM.getElementsByTagName("a")   # find all <a> tags
    
            for tag in aTags:
                if "href" in tag.attributes.keys():         # see if we have the href attribute in the tag
                    href = tag.attributes["href"].value     # get the value of the href attribute
                    if validWikiArticleLinkString(href):                             # if we have one of the link types we are looking for
    
                        # Now come the tricky parts. We want to look for links in the main content area only,
                        # and we want the first link not in parentheses.
    
                        # assume the link is valid.
                        invalid = False            
    
                        # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                        # we are not looking at a <a> tag somewhere inside a <table>.
                        pn = tag.parentNode                     
                        while pn is not None:
                            if str(pn).find("table at") >= 0:
                                invalid = True
                                break
                            else:
                                pn = pn.parentNode 
    
                        if invalid:     # go to next link
                            continue               
    
                        # Next we look at the descriptive texts above the article, if any; e.g
                        # This article is about .... or For other uses, see ... (disambiguation).
                        # These kinds of links will lead into loops so we classify them as invalid.
    
                        # We notice that this text does not appear to be inside a <p> block, so
                        # we dismiss <a> tags which aren't inside any <p>.
                        pnode = tag.parentNode
                        while pnode is not None:
                            if str(pnode).find("p at") >= 0:
                                break
                            pnode = pnode.parentNode
                        # If we have reached the root node, which has parentNode None, we classify the
                        # link as invalid.
                        if pnode is None:
                            invalid = True
    
                        if invalid:
                            continue
    
    
                        ######  this is where I got stuck:
                        # now we need to look if the link is inside parentheses. below is some junk
    
    #                    for elem in tag.parentNode.childNodes:
    #                        while elem.firstChild is not None:
    #                            elem = elem.firstChild
    #                        print elem.nodeValue
    
                        print href      # this will be the next link
                        newLink = href[6:]  # except for the /wiki/ part
                        break
    
            # if we have been to this link before, break the loop
            if newLink in visited:
                print "Stuck in loop."
                break
            # or if we have reached Philosophy
            elif newLink == "Philosophy":
                print "Ended up in Philosophy."
                break
            else:
                visited.append(currentPage)     # mark this currentPage as visited
            currentPage = newLink           # make the currentPage we found the new page to fetch
                time.sleep(5)                   # sleep some to see results as debug
    

I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing the links.
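
For what it's worth, the same effect can be had without deleting any text by keeping a running count of parentheses while walking each paragraph of the article body. Below is a rough sketch of that idea with BeautifulSoup 4; it is my own approximation rather than the script's actual code, the mw-content-text id and the colon check are assumptions of mine, and oddly split or unbalanced parentheses are not handled:

    from bs4 import BeautifulSoup
    from bs4.element import NavigableString, Tag

    def firstLinkOutsideParens(html):
        """ Walk each <p> in the article body, keep a running count of
            '(' and ')', and return the first /wiki/ link seen while no
            parenthesis is open. Returns None if nothing matches.
        """
        soup = BeautifulSoup(html, "html.parser")
        content = soup.find("div", id="mw-content-text") or soup
        for p in content.find_all("p"):
            depth = 0
            for node in p.descendants:
                if isinstance(node, NavigableString):
                    depth += node.count("(") - node.count(")")
                elif isinstance(node, Tag) and node.name == "a" and depth <= 0:
                    href = node.get("href", "")
                    # same spirit as validWikiArticleLinkString above:
                    # plain article links only, no "Wikipedia:"-style pages
                    if href.startswith("/wiki/") and ":" not in href:
                        return href
        return None

One nice side effect of counting instead of stripping is that hrefs which themselves contain parentheses (titles like Human_(disambiguation)) never get mangled by a regex.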
