Building a Simple Web Scraper with Node.js for Offline Documentation

To create an offline archive of web-based documentation, we can leverage Node.js core modules. This approach relies on the native http module for network requests, the fs module for saving files, and ES6 Promises to manage asynchronous operations.

1. Extracting URLs via the Browser

The first step involves identifying the specific pages to download. Using the browser's Developer Tools (F12), we can inspect the table of contents. Typically, the navigation links are contained within a specific list element. We can extract these URLs directly from the console.

Assuming the links are located within an ordered list, the following snippet can be executed in the browser console to compile the URLs:

const urlList = [];
const anchors = document.querySelectorAll('ol a');

anchors.forEach(link => {
    urlList.push(link.getAttribute('href'));
});

console.log(JSON.stringify(urlList));

Once the array of URLs is logged, it can be copied and used within the Node.js script.
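getAttribute('href') returns the attribute exactly as written in the markup, so the copied list may mix relative paths and absolute URLs. A minimal sketch of normalizing them in Node.js with the built-in URL class follows. The helper name toAbsoluteUrls and the sample page address are assumptions made for illustration.

```javascript
// Hypothetical helper: resolve each href against the page it was copied from.
function toAbsoluteUrls(hrefs, pageUrl) {
    return hrefs
        // Skip entries from anchors that had no href (getAttribute returns null)
        .filter((href) => typeof href === 'string' && href.length > 0)
        .map((href) => new URL(href, pageUrl).href);
}

// Example input: a relative path, a root-relative path, and an absolute URL.
const resolved = toAbsoluteUrls(
    ['intro.md', '/guide/chapter1.md', 'http://example.com/chapter2.md'],
    'http://example.com/docs/'
);
console.log(resolved);
```

Resolving the links up front means the scraper can work from complete URLs instead of relying on string concatenation with a base address.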

2. Setting Up the Node.js Environment

Create a new JavaScript file (e.g., scraper.js) and import the necessary modules. Define the base URL and the list of endpoints extracted in the previous step.

const fs = require('fs');
const http = require('http');

const baseUrl = 'http://example.com/';
// Endpoints derived from browser extraction
const endpoints = [
    'intro.md', 
    'chapter1.md', 
    'chapter2.md'
];
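The fetch function in the next step writes into a ./docs directory, and fs.writeFile does not create missing directories. A small sketch of creating the folder up front is shown here. The helper name ensureOutputDir is an assumption, and the temporary path is used only so the example can run anywhere.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: create the output folder if it does not exist yet.
function ensureOutputDir(dir) {
    // recursive: true creates parent folders and does not throw if dir already exists
    fs.mkdirSync(dir, { recursive: true });
    return dir;
}

// Demonstration against a throwaway location; the scraper itself would pass './docs'.
const demoDir = ensureOutputDir(path.join(os.tmpdir(), 'scraper-demo', 'docs'));
console.log(`Output directory ready: ${demoDir}`);
```

Calling the helper once before the downloads start avoids an ENOENT error on the first write.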

3. Implementing the Fetch Logic

We need a function to handle the HTTP request and file writing. This function returns a Promise, allowing us to control the flow of asynchronous requests. It takes the full URL and an output filename, fetches the content, and writes the response to a local file in the ./docs directory.

function fetchAndSave(targetUrl, outputFilename) {
    return new Promise((resolve, reject) => {
        let fileContent = '';
        
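        const request = http.get(targetUrl, (response) => {
            // Reject on non-2xx responses so error pages are not saved as documentation
            if (response.statusCode < 200 || response.statusCode >= 300) {
                response.resume(); // discard the body to free the socket
                reject(new Error(`Unexpected status ${response.statusCode} for ${targetUrl}`));
                return;
            }

            response.setEncoding('utf8');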
            
            // Accumulate data chunks
            response.on('data', (chunk) => {
                fileContent += chunk;
            });

            // Handle end of response
            response.on('end', () => {
                fs.writeFile(`./docs/${outputFilename}`, fileContent, 'utf8', (err) => {
                    if (err) {
                        console.error(`Error writing file ${outputFilename}: ${err}`);
                        reject(err);
                    } else {
                        console.log(`Successfully saved: ${outputFilename}`);
                        resolve();
                    }
                });
            });
        });

        request.on('error', (err) => {
            console.error(`Request failed for ${targetUrl}: ${err.message}`);
            reject(err);
        });
    });
}
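The function above uses the http module, which only handles http: addresses; passing an https: URL to http.get throws an error. Many documentation sites are served over HTTPS, so one possible adjustment is to pick the client from the URL's protocol. The helper name pickClient is an assumption, not part of the original script.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: choose the core module that matches the URL scheme.
function pickClient(targetUrl) {
    const { protocol } = new URL(targetUrl);
    if (protocol === 'https:') return https;
    if (protocol === 'http:') return http;
    throw new Error(`Unsupported protocol: ${protocol}`);
}

// Inside fetchAndSave, http.get(targetUrl, ...) would become
// pickClient(targetUrl).get(targetUrl, ...); both modules expose the same get() call.
console.log(pickClient('https://example.com/intro.md') === https);
```

Throwing on any other scheme surfaces a mistyped or unsupported URL before a request is attempted.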

4. Executing the Scraper

Because http.get returns immediately, calling fetchAndSave in a plain loop would start every request at once. To process the URLs one at a time, we iterate through the endpoint array and chain each call onto the Promise returned by the previous one.

let sequence = Promise.resolve();

endpoints.forEach((endpoint, index) => {
    const fullUrl = baseUrl + endpoint;
    const fileName = `${index + 1}.md`;
    
    // Chain the promises to ensure sequential execution
    sequence = sequence.then(() => fetchAndSave(fullUrl, fileName));
});

sequence.then(() => {
    console.log('All documents have been downloaded.');
}).catch((error) => {
    console.error('An error occurred during the scraping process:', error);
});

This script downloads the files one at a time: each request starts only after the previous file has been written. This keeps the load on the server low and avoids opening many connections at once. If any download fails, the chain stops and the error is reported by the final catch handler.
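The same one-at-a-time behaviour can also be written with async/await, which some readers find easier to follow than a reassigned Promise chain. The sketch below is an alternative, not the script's own approach. runSequentially and fakeDownload are hypothetical names, and the fake worker stands in for fetchAndSave so the example runs without network access.

```javascript
// Hypothetical helper: run an async worker over each job, one at a time.
async function runSequentially(jobs, worker) {
    const results = [];
    for (const job of jobs) {
        // await pauses the loop until the current job has settled
        results.push(await worker(job));
    }
    return results;
}

// Stand-in for fetchAndSave: resolves after a short delay and records the order of events.
const events = [];
function fakeDownload(job) {
    return new Promise((resolve) => {
        events.push(`start ${job.file}`);
        setTimeout(() => {
            events.push(`done ${job.file}`);
            resolve(job.file);
        }, job.delayMs);
    });
}

// The first job is deliberately slower, so a parallel run would finish out of order.
const jobs = [
    { file: '1.md', delayMs: 30 },
    { file: '2.md', delayMs: 10 },
];

const finished = runSequentially(jobs, fakeDownload);
finished.then((files) => console.log('Saved in order:', files, events));
```

With the real function, the worker would be something like (job) => fetchAndSave(job.url, job.file), and a rejected download would stop the loop just as it stops the Promise chain.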

Tags: Node.js web scraping ES6 HTTP File System

Posted on Fri, 26 Jun 2026 17:09:46 +0000 by RussellReal
