Getting all Wikipedia articles with coordinates inside London
Generally, I want to get the links (and titles) of all Wikipedia articles with coordinates inside London. I tried searching with Google, but unfortunately didn't come up with proper search terms. Any hints?
This is really just a collection of ideas that was too big for a comment.
Your best bet is probably DBpedia. It's a semantic mirror of Wikipedia, with much more sophisticated query possibilities than Wikipedia's API has. As you can see in this paper, it can handle fairly complex spatial queries, but you'll need to get into SPARQL. Here's a figure from that paper:
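As a minimal sketch of the DBpedia route: the public SPARQL endpoint can be queried for resources carrying WGS84 coordinates and filtered to a bounding box. The bounding-box values for Greater London below are rough assumptions, and the simple lat/long range filter is the crudest possible spatial query, not the paper's approach.

```python
import json
import urllib.parse
import urllib.request

# Rough bounding box for Greater London (assumed values, not authoritative).
LAT_MIN, LAT_MAX = 51.28, 51.70
LON_MIN, LON_MAX = -0.51, 0.33

def build_sparql() -> str:
    """SPARQL query for DBpedia resources with WGS84 coordinates in the box."""
    return f"""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    SELECT ?article ?lat ?long WHERE {{
        ?article geo:lat ?lat ; geo:long ?long .
        FILTER (?lat  >= {LAT_MIN} && ?lat  <= {LAT_MAX} &&
                ?long >= {LON_MIN} && ?long <= {LON_MAX})
    }} LIMIT 100
    """

def query_dbpedia() -> dict:
    """Send the query to DBpedia's public SPARQL endpoint, JSON results."""
    params = urllib.parse.urlencode({
        "query": build_sparql(),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen("https://dbpedia.org/sparql?" + params) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for row in query_dbpedia()["results"]["bindings"]:
        print(row["article"]["value"], row["lat"]["value"], row["long"]["value"])
```

A real solution would rather use a spatial predicate or the DBpedia dump, since a plain range filter over all geotagged resources is slow and may hit the endpoint's result limits.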
That said, Wikipedia's API has a relatively new feature for spatial queries: Showing nearby wiki information. I don't think you can search in a polygon, but it's a good start.
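The "nearby" feature is the `geosearch` list of the MediaWiki API; it takes a centre coordinate and a radius (capped at 10 000 m), so covering all of London would need several overlapping circles. A small sketch, with Charing Cross assumed as the centre point:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_geosearch_url(lat: float, lon: float, radius_m: int = 10000) -> str:
    """Build a MediaWiki geosearch query URL; radius is capped at 10000 m."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,
        "gslimit": 50,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def nearby_articles(lat: float, lon: float) -> list:
    """Return the titles of geotagged articles near the given point."""
    with urllib.request.urlopen(build_geosearch_url(lat, lon)) as resp:
        data = json.load(resp)
    return [page["title"] for page in data["query"]["geosearch"]]

if __name__ == "__main__":
    # Charing Cross, a conventional centre point for London (assumed coordinate).
    print(nearby_articles(51.5074, -0.1278))
```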
Here's a previous answer I wrote about using mwclient to get the coordinates from articles, but that user had the advantage of having a list of articles to scrape.
Geonames.org might be able to help narrow down the search to geolocated articles. It wouldn't be too bad to check the 806,000 geolocated articles in English Wikipedia.
For performance reasons, and to avoid causing trouble for Wikipedia's servers, you might consider working from a dump of Wikipedia or DBpedia.
Looks like a task for OpenStreetMap and Overpass API.
For constructing our query we go to overpass turbo (a nice frontend for Overpass API), open the wizard and enter "wikipedia=* in London" because we are interested in the wikipedia tag.
The automagically generated and executed query will be this one.
[out:json][timeout:25];
// fetch area “London” to search in
{{geocodeArea:London}}->.searchArea;
// gather results
(
// query part for: “wikipedia=*”
node["wikipedia"](area.searchArea);
way["wikipedia"](area.searchArea);
relation["wikipedia"](area.searchArea);
);
// print results
out body;
>;
out skel qt;
This will return too many elements, heavily burdening your browser, and it might fail because the timeout is too low.
We modify it slightly: we increase the timeout, and we remove the recursion step (`>;`) because we are only interested in the direct results, not in any related objects. The resulting query will be this one:
[out:json][timeout:90];
// fetch area “London” to search in
{{geocodeArea:London}}->.searchArea;
// gather results
(
// query part for: “wikipedia=*”
node["wikipedia"](area.searchArea);
way["wikipedia"](area.searchArea);
relation["wikipedia"](area.searchArea);
);
// print results
out body;
out skel qt;
You can view the result here.
Now there are various options for exporting it. In overpass turbo you can go to export and either save the results directly to a file or get the raw query that is sent to Overpass API. You can then run this query directly from your Python script.
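Running the exported raw query from Python can be sketched as follows. Note two assumptions here: the public endpoint URL, and the name-based area selection, since the `{{geocodeArea:London}}` shortcut only works inside overpass turbo and must be replaced before sending the query to Overpass API directly.

```python
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public endpoint (assumed)

# Raw Overpass QL. {{geocodeArea:...}} is overpass-turbo-only syntax, so the
# area is selected by name instead; this may match more than one "London".
QUERY = """
[out:json][timeout:90];
area["name"="London"]["boundary"="administrative"]->.searchArea;
(
  node["wikipedia"](area.searchArea);
  way["wikipedia"](area.searchArea);
  relation["wikipedia"](area.searchArea);
);
out body;
out skel qt;
"""

def fetch_overpass(query: str) -> dict:
    """POST the query to the Overpass API and return the parsed JSON result."""
    data = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=data) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = fetch_overpass(QUERY)
    for element in result["elements"]:
        tags = element.get("tags", {})
        if "wikipedia" in tags:
            print(tags["wikipedia"])
```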
Note that there are different output formats available: JSON, XML and CSV. And next to the wikipedia tag you might also be interested in the wikidata tag.
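The wikipedia tag values come back in the form `language:Article title`, so turning them into the article links the question asks for is a one-liner:

```python
import urllib.parse

def wikipedia_tag_to_url(tag: str) -> str:
    """Convert an OSM wikipedia tag like 'en:Tower Bridge' to an article URL."""
    lang, _, title = tag.partition(":")
    return (f"https://{lang}.wikipedia.org/wiki/"
            + urllib.parse.quote(title.replace(" ", "_")))
```

For example, `wikipedia_tag_to_url("en:Tower Bridge")` yields `https://en.wikipedia.org/wiki/Tower_Bridge`.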
Also note that this won't get you all Wikipedia pages with coordinates inside London, just the ones that are contained in the OSM database.