How to scrape JSON data streamed via websockets on a target site
I've been asked to scrape a site which receives data via websockets and then renders that to the page via JavaScript/jQuery. Is it possible to bypass the middleman (the DOM) and consume/scrape the data coming over the socket? Might this be possible with a headless WebKit browser like PhantomJS? The target site is using socket.io.
I need to consume the data and trigger alerts based on keywords in the data. I'm considering the Goutte library and will be building the scraper in PHP.
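Socket.io is not exactly the same as WebSockets: it layers its own handshake and message framing on top of WebSocket (with fallbacks to other transports such as long polling), so a plain WebSocket client will not understand it. Since you know the site uses socket.io, I'm focusing on that. The easiest way to scrape this socket is to use the socket.io client.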
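Put this on your page (download socket.io.js from https://github.com/LearnBoost/socket.io-client/blob/0.9/dist/socket.io.js and serve your own copy, because that GitHub URL returns an HTML page rather than the script itself):
<script src="socket.io.js"></script>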
<script src="scraper.js"></script>
Create file scraper.js:
// No "g" flag here: a global regex keeps state (lastIndex) between
// .test() calls and would miss matches on consecutive messages.
var keywords = /foo|bar/i;
var socket = io.connect('http://host-to-scrape:portnumber/path');
socket.on('<socket.io-eventname>', function (data) {
  // The scraped data is in 'data', do whatever you want with it
  console.log(data);
  // Assuming data.body is a string that may contain keywords:
  if (keywords.test(data.body)) callOtherFunction(data.body);
  // Talk back:
  // socket.emit('eventname', { my: 'data' });
});
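The keyword check that drives the alerts can also live in its own small function, which makes it easy to verify without a live socket. The following is only a sketch under the same assumption as above (the text arrives as a string in data.body); matchedKeywords and the keyword list passed to it are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: returns the distinct keywords (lower-cased)
// that occur in the given text, in order of first appearance.
function matchedKeywords(text, keywords) {
  if (!keywords || keywords.length === 0) return [];
  // Escape regex metacharacters so each keyword is matched literally.
  var escaped = keywords.map(function (k) {
    return k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  });
  // The regex is rebuilt on every call, so no lastIndex state is shared.
  var pattern = new RegExp(escaped.join('|'), 'ig');
  var found = String(text).match(pattern) || [];
  var seen = Object.create(null);
  return found
    .map(function (hit) { return hit.toLowerCase(); })
    .filter(function (hit) {
      if (seen[hit]) return false;
      seen[hit] = true;
      return true;
    });
}
```

Inside the socket handler this could be used as `var hits = matchedKeywords(data.body, ['foo', 'bar']);`, triggering the alert only when `hits.length > 0`.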
UPDATE 6-1-2014
Judging by the StackOverflow question you referenced below, it looks like you're trying to run this in a browser window rather than on the server. So I removed everything about Node.js, as it is not needed.
This would be the best way for you, in my opinion:
Scrape the data directly from a client page of your app using JavaScript, without using PHP as a middle tier. This way your server carries no load at all, which is why I recommend it. As your target site is using socket.io, use a socket.io client to scrape the data. From the official socket.io site:
<script src="/socket.io/socket.io.js"></script>
<script>
  var socket = io.connect('http://target_website.com');
  // look at the next line closely
  socket.on('event_name', function (data) {
    console.log(data);
    // do something with data here
  });
</script>
That raises the question: how will you know *event_name*? You have to find it by studying the target site's JavaScript. There is no workaround, at least none that I know of.
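As a rough aid for that research, the site's script text can be scanned for string literals passed to .on(...) calls. This is only a sketch: findEventNames is a hypothetical helper, it assumes the event names appear as plain quoted strings in the source, and it will also pick up unrelated handlers (jQuery's .on('click', ...), or built-in socket.io events such as 'connect'), so the result still needs a manual look.

```javascript
// Hypothetical helper: lists the distinct string-literal names passed
// as the first argument to .on(...) in a chunk of JavaScript source.
function findEventNames(source) {
  // Matches .on('name' or .on("name"; the backreference \1 makes the
  // closing quote match the opening one.
  var pattern = /\.on\(\s*(['"])([^'"\\]+)\1/g;
  var names = [];
  var match;
  while ((match = pattern.exec(source)) !== null) {
    if (names.indexOf(match[2]) === -1) names.push(match[2]);
  }
  return names;
}
```

Names that are built dynamically or passed through variables will not be found this way, so an empty result does not mean the site emits no events.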