To create an offline archive of web-based documentation, we can leverage Node.js core modules. This approach relies on the native http module for network requests, the fs module for saving files, and ES6 Promises to manage asynchronous operations.
1. Extracting URLs via the Browser
The first step involves identifying the specific pages to download. Using the browser's Developer Tools (F12), we can inspect the table of contents. Typically, the navigation links are contained within a specific list element. We can extract these URLs directly from the console.
Assuming the links are located within an ordered list, the following snippet can be executed in the browser console to compile the URLs:
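// Collect the href of every link inside an ordered list
const urlList = [];
const anchors = document.querySelectorAll('ol a');
anchors.forEach(link => {
  urlList.push(link.getAttribute('href'));
});
// Print the list as JSON so it can be copied into the Node.js script
console.log(JSON.stringify(urlList));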
Once the array of URLs is logged, it can be copied and used within the Node.js script.
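The copied hrefs may be relative, may repeat, or may carry fragment identifiers. The following is a minimal sketch of one way to tidy them in Node.js before downloading. The helper name normalizeLinks and the page URL http://example.com/docs/ are assumptions made for illustration, not part of the original script.

```javascript
// Hypothetical helper: resolve each copied href against the page it came from,
// drop fragment identifiers, and remove duplicates while keeping first-seen order.
function normalizeLinks(hrefs, pageUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // anchors without an href attribute yield null
    const url = new URL(href, pageUrl); // WHATWG URL resolution, built into Node.js
    url.hash = ''; // 'chapter1.md#setup' and 'chapter1.md' are the same document
    seen.add(url.href);
  }
  return [...seen];
}

// Example input, as it might be pasted from the browser console
const copied = ['intro.md', 'chapter1.md', 'chapter1.md#setup', '/about.md'];
console.log(normalizeLinks(copied, 'http://example.com/docs/'));
```

Note that the result contains absolute URLs, so they would be passed to the download function directly rather than appended to a base URL.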
2. Setting Up the Node.js Environment
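Create a new JavaScript file (e.g., scraper.js) and import the necessary modules. Define the base URL and the list of endpoints extracted in the previous step, and make sure the docs output directory exists before anything is written to it.
const fs = require('fs');
const http = require('http');
const baseUrl = 'http://example.com/';
// Create the output directory if it does not exist yet
fs.mkdirSync('./docs', { recursive: true });
// Endpoints derived from browser extraction
const endpoints = [
  'intro.md',
  'chapter1.md',
  'chapter2.md'
];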
3. Implementing the Fetch Logic
We need a function to handle the HTTP request and file writing. This function returns a Promise, allowing us to control the flow of asynchronous requests. It requests the given URL, accumulates the response body, and writes the response to a local file in the docs directory.
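function fetchAndSave(targetUrl, outputFilename) {
  return new Promise((resolve, reject) => {
    let fileContent = '';
    const request = http.get(targetUrl, (response) => {
      // Reject on non-2xx responses so error pages are not saved as documents
      if (response.statusCode < 200 || response.statusCode >= 300) {
        response.resume(); // Discard the body to free the connection
        reject(new Error(`Unexpected status ${response.statusCode} for ${targetUrl}`));
        return;
      }
      response.setEncoding('utf8');
      // Accumulate data chunks
      response.on('data', (chunk) => {
        fileContent += chunk;
      });
      // Write the file once the full body has arrived
      response.on('end', () => {
        fs.writeFile(`./docs/${outputFilename}`, fileContent, 'utf8', (err) => {
          if (err) {
            console.error(`Error writing file ${outputFilename}: ${err}`);
            reject(err);
          } else {
            console.log(`Successfully saved: ${outputFilename}`);
            resolve();
          }
        });
      });
    });
    // Network-level failures (DNS, refused connection, and so on)
    request.on('error', (err) => {
      console.error(`Request failed for ${targetUrl}: ${err.message}`);
      reject(err);
    });
  });
}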
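The http module only accepts http: URLs. Passing an https: URL to http.get throws a protocol error. If the documentation is served over HTTPS, one possible approach is to pick the core module from the URL's protocol. The sketch below assumes a helper named clientFor, which is hypothetical and not part of the original script.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: return the core module that matches the URL's protocol.
// Anything other than https: falls back to the http module.
function clientFor(targetUrl) {
  return new URL(targetUrl).protocol === 'https:' ? https : http;
}

console.log(clientFor('https://example.com/intro.md') === https); // true
console.log(clientFor('http://example.com/intro.md') === http);   // true
```

Inside the fetch function, the request could then be started with clientFor(targetUrl).get(targetUrl, callback), since both modules expose a get function with the same call shape.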
4. Executing the Scraper
Since network requests are asynchronous, iterating through the URLs requires careful handling to ensure requests are processed sequentially. We can iterate through the endpoint array and chain the Promises together.
let sequence = Promise.resolve();
endpoints.forEach((endpoint, index) => {
  const fullUrl = baseUrl + endpoint;
  const fileName = `${index + 1}.md`;
  // Chain the promises to ensure sequential execution
  sequence = sequence.then(() => fetchAndSave(fullUrl, fileName));
});
sequence.then(() => {
  console.log('All documents have been downloaded.');
}).catch((error) => {
  console.error('An error occurred during the scraping process:', error);
});
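This script downloads the documents one at a time: each request starts only after the previous file has been written, which avoids opening many connections to the server at once and keeps the files and log output in order. Because the Promises are chained, the first failed request or write rejects the chain, skips the remaining downloads, and is reported by the final catch handler.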
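The same sequential flow can also be written with async/await, which some readers may find easier to follow than a Promise chain. This is a sketch, not a replacement for the script above: the names buildJobs and runSequentially are hypothetical, and the save parameter stands in for the fetchAndSave function so that the example runs on its own.

```javascript
// Hypothetical helper: pair each endpoint with its full URL and a numbered file name,
// mirroring the naming used in the forEach loop.
function buildJobs(baseUrl, endpoints) {
  return endpoints.map((endpoint, index) => ({
    url: baseUrl + endpoint,
    fileName: `${index + 1}.md`,
  }));
}

// Run the jobs one at a time. A rejected save() stops the loop and
// rejects the Promise returned by runSequentially.
async function runSequentially(jobs, save) {
  for (const job of jobs) {
    await save(job.url, job.fileName);
  }
}

// Usage with a stand-in saver that only logs what it would download
const jobs = buildJobs('http://example.com/', ['intro.md', 'chapter1.md']);
runSequentially(jobs, async (url, fileName) => {
  console.log(`${url} -> ${fileName}`);
})
  .then(() => console.log('All documents have been downloaded.'))
  .catch((error) => console.error('An error occurred:', error));
```

In the real script, fetchAndSave would be passed as the second argument in place of the logging stand-in.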