Finding direct child of an element

2018-06-22 09:32:05

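I am writing a solution in Python to test this phenomenon. I have most of the logic done, but many edge cases come up when following links in Wikipedia articles.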

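The problem I'm running into arises on a page like this one, where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case (to extract this link), you have to skip past the parentheses and then take the very next anchor tag/href. In most articles my algorithm skips the parentheses correctly, but because of the way it looks for links in front of the parentheses (or when there are none), it finds the anchor tag in the wrong place. Specifically, here: <a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>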

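The algorithm works by iterating over the elements in the first paragraph tag (in the main body of the article), stringifying each one in turn, and first checking whether it contains '(' or '<a'.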

Is there any straightforward way to avoid embedded anchor tags and take only the first link that is a direct child of the first <p>?

Here is the function with this code, for reference:

from bs4 import BeautifulSoup

def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False
    # First pass: walk the document from the first <p> and stop at whichever
    # comes first, a text node containing '(' or an anchor tag.
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        # Second pass: advance until the outermost parenthesis is closed.
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            # Take the first anchor that follows the closing parenthesis.
            return temp.findNext('a').attrs['href']
        except KeyError:
            print("\nReached article with no main body!\n")
            return None
    try:
        return str(link.attrs['href'])
    except KeyError:
        print("\nReached article with no main body\n")
        return None

I think you are seriously overcomplicating the problem.

There are multiple ways to use the direct parent-child relationship between elements in BeautifulSoup. One of them is the > CSS selector:

In [1]: import requests  

In [2]: from bs4 import BeautifulSoup   

In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"    

In [4]: response = requests.get(url)    

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]: 
['West Africa',
 'Guinea',
 'Liberia',
 ...
 'Allen Iverson',
 'Magic Johnson',
 'Victor Oladipo',
 'Frances Tiafoe']

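Here we find a elements that are located directly under p elements, which are in turn located directly under the element with id="mw-content-text". From what I understand, this is where the main Wikipedia article content is located.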
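To make the difference between "anywhere inside" and "directly under" concrete, here is a minimal sketch on a small inline document. The markup is hypothetical and only loosely modelled on a Wikipedia article body (the real page structure may differ), and it uses html.parser so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first anchor is nested inside a <span>,
# the second one is a direct child of the <p>.
HTML = """
<div id="mw-content-text">
  <p><span><a href="/wiki/Geographic_coordinate_system">Coordinates</a></span>
     <b>Sierra Leone</b> is a country in <a href="/wiki/West_Africa">West Africa</a>.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Descendant combinator (a space): anchors at any depth inside the <p>.
descendant_links = [a["href"] for a in soup.select("#mw-content-text p a")]

# Child combinator (>): only anchors whose parent is the <p> itself.
child_links = [a["href"] for a in soup.select("#mw-content-text > p > a")]

print(descendant_links)
print(child_links)
```

With this markup the nested "Coordinates" anchor shows up only in the descendant query, which is the kind of embedded link the question wants to skip.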

If you need a single element, use select_one() instead of select().
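A short sketch of what that could look like, again on hypothetical markup rather than a live page. Note that select_one() returns None when nothing matches, so it is worth checking the result before reading attributes from it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from a real article.
HTML = """
<div id="mw-content-text">
  <p><span><a href="/wiki/Geographic_coordinate_system">Coordinates</a></span>
     A country in <a href="/wiki/West_Africa">West Africa</a>,
     bordered by <a href="/wiki/Guinea">Guinea</a>.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# First anchor that is a direct child of a <p> directly under the content div.
first_link = soup.select_one("#mw-content-text > p > a")
first_href = first_link["href"] if first_link is not None else None

# A selector with no match yields None instead of raising.
missing = soup.select_one("#mw-content-text > p > table")

print(first_href)
print(missing)
```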

Also, if you want to solve it via the find*() methods, pass the recursive=False argument.
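As a rough illustration of the recursive=False route, the sketch below restricts each lookup to direct children only. The markup and the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from a real article.
HTML = """
<div id="mw-content-text">
  <p><span><a href="/wiki/Geographic_coordinate_system">Coordinates</a></span>
     A country in <a href="/wiki/West_Africa">West Africa</a>.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
content = soup.find(id="mw-content-text")

# Only look at direct children of the content div when picking the paragraph.
first_p = content.find("p", recursive=False)

# recursive=False limits the search to direct children of the <p>.
direct_links = [a["href"] for a in first_p.find_all("a", recursive=False)]

# The default (recursive=True) also descends into the nested <span>.
all_links = [a["href"] for a in first_p.find_all("a")]

print(direct_links)
print(all_links)
```

This gives the same result as the > selector, so the choice between the two is mostly a matter of taste.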
