Web Scraping a Wikipedia page

2018-06-22 09:31:03

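On some Wikipedia pages, the article title (shown in bold) is followed by text in parentheses that explains the pronunciation and phonetics of the words in the title. For example, here, after the bold title diglossia inside the <p>, there is an opening parenthesis. To find the matching closing parenthesis you have to walk through the text nodes one by one, which is simple. What I want to do is find the next href link after it and store it.

The problem is that (AFAIK) there is no way to uniquely identify the text node containing the opening parenthesis and then get the href that follows it. Is there any straightforward (not convoluted) way to get the first link outside the first pair of parentheses?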

Edit

For the link given here, the href to store should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside the parentheses.

Is this what you want?

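import requests
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Fetch the page (verify=False skips TLS certificate checks).
rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
# Print the first link in the first paragraph.
print(parsed_html.body.findAll('p')[0].findAll('a')[0])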

This gives:

<a href="/wiki/Linguistics" title="Linguistics">linguistics</a>

If you want to extract the href, you can use this:

parsed_html.body.findAll('p')[0].findAll('a')[0]['href']
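Note that the href values in the page are site-relative (for example /wiki/Linguistics), while the question asks for a full URL. A minimal sketch of resolving one against the page address with the standard library, assuming Python 3; the variable names here are only illustrative:

```python
from urllib.parse import urljoin

# Address of the page the link was scraped from.
page_url = "https://en.wikipedia.org/wiki/Diglossia"

# A site-relative href as it appears in the page markup.
href = "/wiki/Dialects"

# Resolve the relative href against the page URL.
absolute = urljoin(page_url, href)
print(absolute)  # https://en.wikipedia.org/wiki/Dialects
```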

Update: It seems you want the href after the parentheses, not before them. I wrote a script for that. Try this:

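import requests
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)

# Start at the first paragraph and walk the document node by node.
temp = parsed_html.body.findAll('p')[0]

depth = 0        # number of currently open parentheses
started = False  # True once the first '(' has been seen
found = False    # True once that first parenthetical has closed

while temp.next and not found:
    temp = temp.next
    # Only text nodes carry parentheses; skip tags.
    if not isinstance(temp, basestring):
        continue
    # Count character by character so '(' and ')' in one node both register.
    for ch in temp:
        if ch == '(':
            depth += 1
            started = True
        elif ch == ')' and started:
            depth -= 1
            if depth == 0:
                found = True
                break

# First link after the text node that closes the parenthetical.
print(temp.findNext('a')['href'])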
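For readers on Python 3, here is a rough sketch of the same idea (track parenthesis depth across text nodes, then take the next link) written against the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique, not the code from the answer: the class name FirstLinkAfterParens, the helper first_link_after_parens and the sample HTML are all made up for illustration, and real Wikipedia markup may need extra handling.

```python
from html.parser import HTMLParser


class FirstLinkAfterParens(HTMLParser):
    """Record the href of the first <a> that follows the first balanced
    parenthetical, scanning from the first <p> onward."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True once the first <p> has opened
        self.depth = 0             # currently open parentheses
        self.closed = False        # True once the first parenthetical closed
        self.href = None           # result, None until a link is found

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a" and self.closed and self.href is None:
            # attrs is a list of (name, value) pairs.
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        if not self.in_paragraph or self.closed:
            return
        for ch in data:
            if ch == "(":
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1
                if self.depth == 0:
                    self.closed = True
                    return


def first_link_after_parens(html):
    """Return the href found by FirstLinkAfterParens, or None."""
    parser = FirstLinkAfterParens()
    parser.feed(html)
    return parser.href


# Hypothetical sample markup shaped like the opening of the article.
sample_html = (
    '<p>In <a href="/wiki/Linguistics">linguistics</a>, <b>diglossia</b> '
    '(<a href="/wiki/Help:IPA">IPA</a>) is a situation in which two '
    '<a href="/wiki/Dialects">dialects</a> are used.</p>'
)
print(first_link_after_parens(sample_html))  # /wiki/Dialects
```

Links before the opening parenthesis and links inside the parenthetical are both skipped, because a link is only accepted once the depth has returned to zero.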
