Scraping a web site has never been easier. Node.js runs on any platform, and if you’re using Windows, you can install it with Microsoft’s Web Platform Installer. From there, it’s as simple as launching a command window and typing:

node myJavascriptFile.js

This executes the code from the console, without the need for a browser. Pretty cool, huh? Because it runs from the console, you can request all sorts of web resources at once and log to any kind of database you want.

Let’s say you want to grab all the front page web links from reddit.com. Here’s the step-by-step process we would like our robot to follow:

  1. Request the web page
  2. Inject jQuery into the web page
  3. Grab a bunch of stuff from the site using jQuery’s syntax, since it’s super simple
  4. Log that stuff to a text file

With node and the jsdom package, we can do this in less than 20 lines of code. First, open a command window, navigate to your working directory, and use npm to install jsdom:

npm install jsdom

Then, create a new file named ParseReddit.js and put this in it:

var jsdom = require('jsdom');

jsdom.env({
    html: 'http://reddit.com',                      // the page we want to scrape
    scripts: ['http://code.jquery.com/jquery.js'],  // inject jQuery into it
    done: function (errors, window) {
        var $ = window.$;
        // we're interested in the title, subreddit, URL,
        // score, and the number of comments for each link
        var stories = $.map($('#siteTable .thing'), function (thing) {
            return {
                title: $('a.title', thing).text(),
                subreddit: $('a.subreddit', thing).text(),
                href: $('a.title', thing).attr('href'),
                score: $('.score.unvoted', thing).text(),
                numComments: $('a.comments', thing).text().match(/^[0-9]*/)[0] || 0
            };
        });

        console.log(stories);

    }
});

After that, just switch back to the command line and run:

node ParseReddit.js

After a few seconds you’ll see a bunch of JSON show up; that’s our parsed front page. Pipe it into a file to recall it later:

node ParseReddit.js > stories.json

From there, you can look at it through JS Beautifier or open it in Excel using json-csv.com.
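One caveat: console.log prints Node’s object-inspection format, which looks a lot like JSON but isn’t guaranteed to be strictly valid. If JS Beautifier or json-csv.com complains, swapping the console.log line in ParseReddit.js for something like this should keep them happy:

// emit strict JSON instead of Node's inspection format
console.log(JSON.stringify(stories, null, 2));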

The Code Explained

jsdom.env: This is the main function, and it takes care of most of the legwork. It loads the specified URL, injects any scripts you like, and hands you a working DOM. Querying that DOM is the easiest way to extract information from a web page, and querying the DOM is exactly what jQuery was built for.
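Stripped down to its skeleton (example.com here is just a stand-in target), a jsdom.env call is nothing more than a page to load, a list of scripts to inject, and a callback:

var jsdom = require('jsdom');

jsdom.env({
    html: 'http://example.com',                      // page to load
    scripts: ['http://code.jquery.com/jquery.js'],   // scripts to inject into it
    done: function (errors, window) {
        // window is a full DOM with jQuery already attached
        console.log(window.$('title').text());
    }
});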

The done() function: This fires once the page has loaded and the injected scripts have run. The window object is the same object you would poke at if you opened the JavaScript console on the page (Ctrl+Shift+I in Chrome, Cmd+Opt+I in Chrome for Mac).
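If the page or one of the injected scripts fails to load, jsdom passes the problem along through the errors argument. The script above skips that check for brevity; a slightly more defensive callback might start like this:

done: function (errors, window) {
    if (errors) {                 // null when everything loaded fine
        console.error(errors);
        return;
    }
    var $ = window.$;
    // ...the rest of the parsing code...
}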

$: We’re mapping window.$ to just $ to save us some work in this function.

$.map: This is a super handy function for transforming one array into another. I could have used .each(), but $.map expresses the transformation more directly, with less bookkeeping (there’s a comparison below). In this case, we’re translating each div with a class of “thing” on the site into an array of plain objects we define ourselves. These objects have title, subreddit, href, score and numComments properties, but the property names are entirely up to us. For each of those properties, we’re querying inside the “thing” div for more information (yes, reddit’s HTML really does call each story a “thing”). In Chrome, I used the “Inspect Element” feature to explore the DOM and help me craft these selectors.
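For comparison, here’s roughly what the same transformation looks like with .each() — same result, just more bookkeeping (only two of the properties are shown):

// the .each() version: build the array by hand
var stories = [];
$('#siteTable .thing').each(function (i, thing) {
    stories.push({
        title: $('a.title', thing).text(),
        href: $('a.title', thing).attr('href')
    });
});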

Another hint: I pasted everything after “var stories =” into Chrome’s JavaScript console (on reddit.com itself) to test the selectors quickly. Chrome is great about letting us drill into each of the returned objects to make sure everything is working OK.

What’s with that .match()? Oh no… are you using a regular expression?: Yep. And I’m really sorry. I’m using it here to translate “29 comments” into just “29”, since the word “comments” is redundant. I dropped an “|| 0” on there because reddit just says “comment” if there are no comments yet; in that case the match comes back as an empty string, which is falsy, so 0 is used for the numComments property.
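Concretely, the expression behaves like this (try it in the console):

'29 comments'.match(/^[0-9]*/)[0] || 0   // "29"
'comment'.match(/^[0-9]*/)[0] || 0       // "" is falsy, so 0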

A Word of Caution

If you automate this to pull down more than a couple pages on a site, you might get caught and your IP address could be blacklisted. Web sites implicitly trust all users until they start pulling down more than their fair share of resources, and bandwidth costs money. Don’t be mean.
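If you do decide to fetch more than one page, a simple politeness delay between requests goes a long way. A rough sketch (the URLs are placeholders and the 5-second pause is arbitrary):

var jsdom = require('jsdom');

// fetch a list of pages one at a time, pausing between requests
var urls = ['http://example.com/page1', 'http://example.com/page2'];

function fetchNext() {
    if (urls.length === 0) return;
    jsdom.env({
        html: urls.shift(),
        scripts: ['http://code.jquery.com/jquery.js'],
        done: function (errors, window) {
            // ...parse and log the page here...
            setTimeout(fetchNext, 5000);   // wait 5 seconds before the next request
        }
    });
}

fetchNext();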

Site Owners

It’s important to recognize the reality of scrapers. With the right configuration, it’s not hard to imagine a script similar to this one pulling down the entire content of reddit by following every link. And with a little more effort, a programmer could make a scraper look just like regular traffic, by simply changing the user agent, carefully timing the web requests, and spreading the requests across a rented server farm. I don’t think it would take a B-level programmer longer than a weekend to pull down a site like Amazon.

In the past, I worked with a company that dedicated umpteen programmer hours to preventing this. IP address limits, user agent sniffing and even honeypots (hidden links only a scraper would find, which would immediately blacklist its IP address) were employed on a regular basis, along with CAPTCHAs at every turn. Regular users were incorrectly flagged as scrapers every day, and at one point Google de-listed the site after Googlebot (which is just another type of scraper) tripped those same filters. Preventing scrapers is probably impossible and even dangerous, because it works against the open nature of the web. Instead, try to figure out the other party’s motivations: maybe a paid API or a referral program could crack open a new revenue stream for you.

Posted Thu 05 July 2012