Getting a particular image from Wikipedia with BeautifulSoup
I'm trying to get a particular image from some Wikipedia pages by using BeautifulSoup 4 with lxml as the parser. For example, I'm trying to get the album cover on the right from this Wikipedia page: http://en.wikipedia.org/wiki/Animal_House_(UDO_album)
The function that does the scraping is this:
import requests
from bs4 import BeautifulSoup

def get_cover_from_wikipedia(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'lxml')
        elements = soup.find_all('a', class_='image')
        for element in elements:
            print '%s\n\n' % element.prettify()
    return False
The output of the print is as follows:
<a class="image" href="/wiki/File:Question_book-new.svg">
<img alt="" data-file-height="204" data-file-width="262" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/>
</a>
<a class="image" href="/wiki/File:UDO_animal_house.jpg">
<img alt="" data-file-height="302" data-file-width="300" height="221" src="//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg" srcset="//upload.wikimedia.org/wikipedia/en/4/4e/UDO_animal_house.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/4/4e/UDO_animal_house.jpg 2x" width="220"/>
</a>
The image I want to pull out is the one in the second block that starts with <a class..., not the book image, which is in the first block.
What I want to accomplish here is:
I only want to get the links specified with src, not everything that comes with the class.
I want to be able to distinguish between the book image and the image I want to pull out. The book image is there because, if you check the Wikipedia page, it says the article needs citations and there is a book image next to that notice. Apparently it matches my search for tag a and class image, but it might or might not be there depending on the article in question.
What's the best way to get only the image I'm interested in, which is the image on the right side of the article?
Your search is not specific enough. The book image is nested in a metadata table:
<table class="metadata plainlinks ambox ambox-content ambox-Refimprove" role="presentation">
while the album cover is nested inside another:
<table class="infobox vevent haudio" style="width:22em">
Use that to your advantage.
Using the CSS selector support makes this trivial:
covers = soup.select('table.infobox a.image img[src]')
for cover in covers:
    print cover['src']
The CSS selector asks for <img> tags with a src attribute, provided they are nested in an <a class="image"> element, inside a <table class="infobox"> element. There is only one such image:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('http://en.wikipedia.org/wiki/Animal_House_(U.D.O._album)')
>>> soup = BeautifulSoup(r.content)
>>> covers = soup.select('table.infobox a.image img[src]')
>>> for cover in covers:
...     print cover['src']
...
//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg
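The src printed above is protocol-relative (it starts with //), so it needs a scheme before it can be downloaded. Below is a minimal sketch of wrapping the same selector in a function and resolving the result against the page URL. The function name find_infobox_cover and the trimmed-down SAMPLE_HTML are illustrative assumptions, not part of the original code. The sketch is written for Python 3 and uses html.parser only to avoid depending on lxml:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the real page (assumption, for illustration only).
SAMPLE_HTML = '''
<table class="metadata plainlinks ambox">
  <tr><td><a class="image" href="/wiki/File:Question_book-new.svg">
    <img src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png"/>
  </a></td></tr>
</table>
<table class="infobox vevent haudio">
  <tr><td><a class="image" href="/wiki/File:UDO_animal_house.jpg">
    <img src="//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg"/>
  </a></td></tr>
</table>
'''

PAGE_URL = 'http://en.wikipedia.org/wiki/Animal_House_(U.D.O._album)'


def find_infobox_cover(html, page_url):
    """Return the absolute URL of the first infobox image, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    # Same selector as above: <img src> inside <a class="image"> inside <table class="infobox">.
    img = soup.select_one('table.infobox a.image img[src]')
    if img is None:
        return None
    # urljoin borrows the scheme from page_url for the protocol-relative '//...' src.
    return urljoin(page_url, img['src'])


if __name__ == '__main__':
    print(find_infobox_cover(SAMPLE_HTML, PAGE_URL))
```

Returning None when no infobox image exists lets the caller tell "article has no cover" apart from a successful match, which matters because not every article has an infobox.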
Well, you've already got 99% of what you want, so that's the main thing. My first thought is to tighten your filter a little bit. If this is a one-off case, and you don't need this program to apply in many places, the 'text' argument in BeautifulSoup.find_all() may help you:
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'lxml')
        elements = soup.find_all('a', text='.jpg', class_='image')
        for element in elements:
            print '%s\n\n' % element.prettify()
    return False
As your target image is the only .jpg file on the page, this should help. You've probably already looked, but the find_all documentation may be useful if you get stuck: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
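One caveat on the text argument: it is matched against a tag's string content, whereas in the output shown in the question the file name appears in the href attribute of the <a> tag. A hedged alternative sketch that filters on href with a regular expression instead is given below. The helper name jpg_image_links and the sample markup are assumptions for illustration, and the code is written for Python 3:

```python
import re

from bs4 import BeautifulSoup

# Minimal stand-in for the two <a class="image"> blocks from the question (assumption).
SAMPLE_LINKS = '''
<a class="image" href="/wiki/File:Question_book-new.svg"><img src="//example.invalid/book.png"/></a>
<a class="image" href="/wiki/File:UDO_animal_house.jpg"><img src="//example.invalid/cover.jpg"/></a>
'''


def jpg_image_links(html):
    """Return the href of every <a class="image"> whose href ends in .jpg or .jpeg."""
    soup = BeautifulSoup(html, 'html.parser')
    # Attribute filters accept a compiled regex; it is searched against the href value.
    pattern = re.compile(r'\.jpe?g$', re.IGNORECASE)
    links = soup.find_all('a', class_='image', href=pattern)
    return [a['href'] for a in links]


if __name__ == '__main__':
    for href in jpg_image_links(SAMPLE_LINKS):
        print(href)
```

This still relies on the cover being the only JPEG linked from the page, so it is best treated as a one-off filter rather than a general solution.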