Get the first link in a Wikipedia article not inside parentheses

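So I'm interested in this theory that if you go to a random Wikipedia article and repeatedly click the first link not inside parentheses, in 95% of cases you will end up on the article about Philosophy.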

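I wanted to write a script in Python that does the link fetching for me and, in the end, prints a nice list of which articles were visited (linkA -> linkB -> linkC) etc.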
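
To make the goal concrete, here is a minimal sketch of the traversal on its own, with the link lookup passed in as a function so it can be tried without any network access. The names `follow_first_links`, `next_link`, `max_steps` and the `fake_links` table are made up for illustration; on real pages `next_link` would have to fetch an article and return its first valid link.

```python
def follow_first_links(start, next_link, target="Philosophy", max_steps=100):
    """Follow next_link(title) from start until the target, a loop, or a dead end.

    Returns (chain, outcome), where outcome is one of
    "target", "loop", "dead end" or "gave up".
    """
    chain = [start]      # articles in the order they were visited
    seen = {start}       # same titles as a set, for fast loop detection
    current = start
    for _ in range(max_steps):
        if current == target:
            return chain, "target"
        following = next_link(current)
        if following is None:            # no usable link on this page
            return chain, "dead end"
        chain.append(following)
        if following in seen:            # we have been here before
            return chain, "loop"
        seen.add(following)
        current = following
    return chain, "gave up"


# Made-up link table standing in for real Wikipedia lookups (an assumption).
fake_links = {"linkA": "linkB", "linkB": "linkC", "linkC": "Philosophy"}
chain, outcome = follow_first_links("linkA", fake_links.get)
print(" -> ".join(chain), "(%s)" % outcome)
```

Keeping the visited titles in an ordered list is what makes the final "linkA -> linkB -> linkC" printout a one-line join.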

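I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that: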

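  • The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
  • The valid link elements all have a <p> element somewhere as their ancestor (most often the parent, or the grandparent if the link is inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
  • Invalid links contain some special word followed by a colon, e.g. Wikipedia:

So far, so good. But it's the parentheses that get me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.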

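I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?

My code can be seen below, but it's something I put together really quickly and I'm not very proud of it. It is commented, however, so you can see my line of thought (I hope :) ).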

    """Wikipedia fun"""
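    try:
        import urllib2                      # Python 2
    except ImportError:
        import urllib.request as urllib2    # Python 3: provides the same build_opener()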
    from xml.dom.minidom import parseString
    import time
    
    # Substrings that mark special (non-article) pages we never want to follow.
    SPECIAL_MARKERS = ("(disambiguation)", "File:", "Wikipedia:", "Portal:",
                       "Special:", "Help:", "Template_talk:", "Template:",
                       "Talk:", "Category:", "Bibcode", "Main_Page")

    def validWikiArticleLinkString(href):
        """ Takes a string and returns True if it starts with '/wiki/'
            and does not contain any of the "special" wiki pages.
        """
        return (href.startswith("/wiki/")
                and not any(marker in href for marker in SPECIAL_MARKERS))
    
    
    if __name__ == "__main__":
        visited = []    # a list of visited links. used to avoid getting into loops
    
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api
    
        currentPage = "Human"  # the page to start with
    
        while True:
            infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
            html = infile.read()    # retrieve the contents of the wiki page we are at
    
            htmlDOM = parseString(html) # get the DOM of the parsed HTML
            aTags = htmlDOM.getElementsByTagName("a")   # find all <a> tags
    
            for tag in aTags:
                if "href" in tag.attributes.keys():         # see if we have the href attribute in the tag
                    href = tag.attributes["href"].value     # get the value of the href attribute
                    if validWikiArticleLinkString(href):                             # if we have one of the link types we are looking for
    
                        # Now come the tricky parts. We want to look for links in the main content area only,
                        # and we want the first link not in parentheses.
    
                        # assume the link is valid.
                        invalid = False            
    
                        # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                        # we are not looking at a <a> tag somewhere inside a <table>.
                        pn = tag.parentNode
                        while pn is not None:
                            if pn.nodeName == "table":   # compare the tag name itself
                                invalid = True
                                break
                            pn = pn.parentNode
    
                        if invalid:     # go to next link
                            continue               
    
                        # Next we look at the descriptive texts above the article, if any; e.g
                        # This article is about .... or For other uses, see ... (disambiguation).
                        # These kinds of links will lead into loops so we classify them as invalid.
    
                        # We notice that this text does not appear to be inside a <p> block, so
                        # we dismiss <a> tags which aren't inside any <p>.
                        pnode = tag.parentNode
                        while pnode is not None:
                            if pnode.nodeName == "p":    # exact match, so <sup> or <map> do not count as <p>
                                break
                            pnode = pnode.parentNode
                        # If we have reached the root node, which has parentNode None, we classify the
                        # link as invalid.
                        if pnode is None:
                            invalid = True
    
                        if invalid:
                            continue
    
    
                        ######  this is where I got stuck:
                        # now we need to look if the link is inside parentheses. below is some junk
    
    #                    for elem in tag.parentNode.childNodes:
    #                        while elem.firstChild is not None:
    #                            elem = elem.firstChild
    #                        print(elem.nodeValue)
    
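                        print(href)         # this will be the next link
                        newLink = href[6:]  # except for the /wiki/ part
                        break
            else:
                # the for loop finished without a break: no usable link on this page
                print("No valid link found on %s." % currentPage)
                break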
    
            # if we have been to this link before, break the loop
            if newLink in visited:
                print("Stuck in loop.")
                break
            # or if we have reached Philosophy
            elif newLink == "Philosophy":
                print("Ended up in Philosophy.")
                break
            else:
                visited.append(currentPage)     # mark the current page as visited
                currentPage = newLink           # the link we found becomes the next page to fetch
                time.sleep(5)                   # pause between requests; also keeps debug output readable
    

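I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing the links.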
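
As a rough illustration of that idea, here is a standard-library-only sketch. It is not the linked script (which uses BeautifulSoup), and the names `strip_parentheses`, `FirstLinkParser`, `first_article_link` and the sample HTML are all made up for this example. It drops everything between parentheses in the visible text, leaves tags themselves alone so that hrefs such as `/wiki/Human_(disambiguation)` survive, and then takes the first `/wiki/` link that sits inside a `<p>` and outside any `<table>`. Rejecting every href that contains a colon is a simplifying assumption standing in for the namespace list in the question.

```python
from html.parser import HTMLParser


def strip_parentheses(html):
    """Remove parenthesised text (and any tags inside it) from an HTML string."""
    out = []
    depth = 0        # nesting level of ( ... ) in the visible text
    in_tag = False   # True while we are between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:           # keep tags only when outside parentheses
                out.append(ch)
        elif ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


class FirstLinkParser(HTMLParser):
    """Remember the first /wiki/ link inside a <p> that is not inside a <table>."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.table_depth = 0
        self.first_link = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.first_link is None:
            href = dict(attrs).get("href") or ""
            # Simplification (assumption): any ':' is treated as a namespace prefix.
            if (self.in_p and self.table_depth == 0
                    and href.startswith("/wiki/") and ":" not in href):
                self.first_link = href

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def first_article_link(html):
    parser = FirstLinkParser()
    parser.feed(strip_parentheses(html))
    return parser.first_link


# Made-up snippet imitating the layout described in the question.
sample = (
    '<table><tr><td><p><a href="/wiki/Mammal">Mammal</a></p></td></tr></table>'
    '<div>See <a href="/wiki/Human_(disambiguation)">Human (disambiguation)</a>.</div>'
    '<p>Humans (<a href="/wiki/Taxonomy">taxonomically</a>) are a '
    '<a href="/wiki/Species">species</a>.</p>'
)
print(first_article_link(sample))  # prints /wiki/Species
```

Two caveats: a `>` inside an attribute value would confuse the tag tracking, and an unbalanced `(` swallows everything after it, so on real pages it is probably safer to apply the stripping to one paragraph at a time.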
