Writing a Web Scraper with Node.js

Node.js is, according to its website, "a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices." It is essentially a JavaScript interpreter for the command line. With Node.js, you can write scripts in JavaScript just as you would with PHP or Python. You can also enter an interactive shell, much like Python's REPL or the console in your browser's developer tools. It is a relatively young project but a very exciting one!

For a useful example, check out Node.js HREF scraper by Extension. It is a very simple, generic script that lets you grab every HREF on a given URL that points to a file extension you specify.

jQuery is my go-to when I need to write JavaScript. Luckily, there are libraries that essentially give us jQuery functionality in Node.js. All three of these can be installed by running npm install jquery nquery cheerio (npm is Node's package manager, similar to Ruby's gem). Also, this is a nifty jQuery cheatsheet. My first experience was with Cheerio, and getting started was as simple as:

var cheerio = require('cheerio');
var $ = cheerio.load('<html><body>.......</body></html>');

After that, you can use $ the same way you would with jQuery.

var page_title = $('#title').text();
console.log(page_title);

The last piece of the puzzle is the request package. You can install it by running npm install request. Combining the request package with cheerio leads us to this:

var request = require('request');
var cheerio = require('cheerio');

request('http://www.google.com/', function(err, resp, body) {
        if (err)
                throw err;
        var $ = cheerio.load(body);
        console.log(body);
});

This snippet will output the HTML contents of www.google.com to your shell. Now that we can grab remote pages and work on them with jQuery-style selectors, the rest is the fun part. I am not going to be too specific here because the code will differ based on what page you are looking at and what information you are after. I will leave you with this example demonstrating how to output a list of all links pointing to PDF files.

// Output a list of all URLs pointing to a .pdf
$('a').each(function() {
  var url = $(this).attr('href');
  // Some <a> tags have no href attribute, so check before using it.
  if (typeof url === 'string') {
    var ext = url.split('.').pop();
    if (ext === 'pdf') {
      console.log(url);
    }
  }
});