We often need to download tabular data displayed on a web page as an Excel file. When the site does not provide a suitable download option, a custom script can gather the data and export it, and packaging the solution as a Tampermonkey userscript makes it easy to share with others. Below are two practical approaches: one that simulates manual pagination, and another that intercepts the underlying HTTP requests to collect the data more efficiently.
Approach 1: Simulating Pagination to Collect Table Data
This method works when the table data is rendered inside an <iframe> and the page exposes its data through a jQuery plugin. The idea is to automate page turning, capture each page’s data, and merge them.
Because the table resides in a separate iframe (MFrame), we need to obtain a jQuery instance scoped to that iframe:
var frame = document.getElementById("MFrame");
var frameWin = frame.contentWindow || frame.contentDocument;
var frameJQ = frameWin.jQuery; // requires the iframe to be same-origin
After the page loads, a small "Start" button appears in the bottom-left corner. When clicked, the script:
- Sorts the table by clicking the appropriate column header to ensure consistent ordering.
- Retrieves the total number of records and the total page count from the page elements.
- Iterates through each page, using setInterval to detect when the next page has loaded, then appends the new data.
- Builds an HTML table string and finally triggers a download via a data URI.
Below is a simplified version of the script (adapted with different variable names and streamlined logic):
// ==UserScript==
// @name TableScraper
// @namespace http://tampermonkey.net/
// @version 1.0
// @description Scrape paginated table inside iframe and export to Excel
// @match http://example.com/Index.aspx
// @require https://ajax.aspnetcdn.com/ajax/jQuery/jquery-3.5.1.min.js
// @grant GM_addStyle
// ==/UserScript==
GM_addStyle('.scrape-btn { position:fixed; bottom:5px; left:5px; padding:12px 20px; border-radius:10px; cursor:pointer; background:transparent; color:#ccc; }');
var fullData = '';
var sheetName = '';
var totalRecords = 0;
var totalPages = 0;
var recordCount = 0;
var prevCaseNumber = '';
const COLUMNS = ['courseId', 'courseTime', 'courseContent'];
$(document).ready(function () {
  var checkExist = setInterval(function () {
    if (document.querySelector('.m-top-box')) {
      clearInterval(checkExist);
      addStartButton();
    }
  }, 800);
});
function addStartButton() {
  var btn = document.createElement('button');
  btn.textContent = 'START';
  btn.className = 'scrape-btn';
  btn.addEventListener('click', startScraping, false);
  document.body.appendChild(btn);
}
function startScraping() {
  var frame = document.getElementById('MFrame');
  var frameWin = frame.contentWindow || frame.contentDocument;
  var $f = frameWin.jQuery;
  // Sort by the 6th column to keep a consistent order
  $f('.th_td:nth-child(6)').click();
  totalRecords = $f('.tdlist').data('ordertable').COUNT;
  totalPages = parseInt($f('#AllPageNum').text(), 10);
  var now = new Date();
  var y = now.getFullYear();
  var m = String(now.getMonth() + 1).padStart(2, '0');
  var d = String(now.getDate()).padStart(2, '0');
  var h = String(now.getHours()).padStart(2, '0');
  var min = String(now.getMinutes()).padStart(2, '0');
  sheetName = `${y}-${m}-${d}-${h}-${min}`;
  // Build the header row
  fullData = '<tr>';
  COLUMNS.forEach(function (col) {
    fullData += '<th>' + col + '</th>';
  });
  fullData += '</tr>';
  recordCount = 0;
  prevCaseNumber = '';
  // Fetch the first page after a short delay (simulates clicking next)
  setTimeout(function () {
    processNextPage($f, 1);
  }, 500);
}
function processNextPage($f, pageNum) {
  var timer = setInterval(function () {
    var currentData = $f('.tdlist').data('ordertable').TABLE;
    var currentFirstCase = currentData[0].courseId;
    var currentPage = parseInt($f('#txtPage').val(), 10);
    // The page has finished loading when its first record differs from the
    // previous page's and the page indicator matches the expected number
    if (currentFirstCase !== prevCaseNumber && currentPage === pageNum) {
      clearInterval(timer);
      prevCaseNumber = currentFirstCase;
      recordCount += currentData.length;
      appendRows(currentData);
      // Check whether the next-page button is disabled
      var nextDisabled = $f('#NextPageBtn .btn_disabled').text();
      if (nextDisabled === '0') {
        // Still more pages to go
        $f('.page-next').click();
        setTimeout(function () {
          processNextPage($f, pageNum + 1);
        }, 200);
      } else {
        // No more pages -> download
        setTimeout(function () {
          downloadExcel(fullData, sheetName);
          console.log(`Total records: ${recordCount}/${totalRecords}`);
        }, 500);
      }
    }
  }, 300);
}
function appendRows(tableData) {
  tableData.forEach(function (row) {
    fullData += '<tr>';
    COLUMNS.forEach(function (col) {
      fullData += '<td style="text-align:center;">' + (row[col] || '') + '</td>';
    });
    fullData += '</tr>';
  });
}
This approach is straightforward but tightly coupled to the page’s DOM structure and the behaviour of its pagination.
Approach 2: Intercepting HTTP Requests to Fetch Data
When the page code is minified/obfuscated, mimicking clicks becomes impractical. Instead, we can intercept the AJAX requests made by the page, extract the required authentication token, and then issue our own requests to the backend API. This gives us direct access to the JSON data.
We can patch the XMLHttpRequest.prototype.open method to capture outgoing requests and store the token:
function createRequestInterceptor(logFn) {
  var originalOpen = XMLHttpRequest.prototype.open;
  var log = logFn || console.log.bind(console);
  return {
    enable: function () {
      var self = this;
      XMLHttpRequest.prototype.open = function () {
        var args = Array.prototype.slice.call(arguments);
        if (args[0] === 'POST') {
          var token = extractParam(args[1], 'accessToken');
          if (token) {
            log('Token captured: ' + token);
            self.disable(); // stop intercepting once the token is found
          }
        }
        return originalOpen.apply(this, args);
      };
    },
    disable: function () {
      XMLHttpRequest.prototype.open = originalOpen;
    }
  };
}
function extractParam(url, param) {
  var regex = new RegExp('[?&]' + param + '=([^&]+)');
  var match = regex.exec(url);
  return match ? match[1] : null;
}
Once the token is captured, we can call the API endpoints directly. For example, to download the "to-do" list and the "completed" list (each requiring additional sub-requests), we can chain the calls:
function fetchTodoList(baseUrl, token) {
  var pageSize = 2; // small initial request, just to learn the total count
  var url = `${baseUrl}distribute/disEventInfo/toDoList?accessToken=${token}&boxType=1&pageNumber=1&pageSize=${pageSize}`;
  return $.post(url).then(function (res) {
    var total = res.total;
    // Request everything in one page; pad the size in case new records
    // arrive between the two calls
    var fullUrl = `${baseUrl}distribute/disEventInfo/toDoList?accessToken=${token}&boxType=1&pageNumber=1&pageSize=${total + 88}`;
    return $.post(fullUrl);
  }).then(function (res) {
    var items = res.data;
    var htmlRows = '';
    items.forEach(function (item) {
      htmlRows += '<tr>';
      // Map the columns
      htmlRows += '<td>' + item.caseNumber + '</td>';
      htmlRows += '<td>' + item.eventType + '</td>';
      htmlRows += '<td>' + item.regCompleteTime + '</td>';
      // The phone may be masked with asterisks -> fetch the detail record
      var phone = item.reportPhone;
      if (phone && phone.indexOf('*') !== -1) {
        // Synchronous call to get the real phone number (blocks the page,
        // but acceptable for a one-off export)
        var detailUrl = `${baseUrl}distribute/disEventInfo/${item.id}?accessToken=${token}`;
        $.ajax({ url: detailUrl, async: false }).done(function (detail) {
          phone = detail.data.reportPhone;
        });
      }
      htmlRows += '<td>' + (phone || '') + '</td>';
      htmlRows += '</tr>';
    });
    return htmlRows;
  });
}
Similarly, the completed list requires a sub-request for each item to get the processing details (handling departments, satisfaction, etc.). This can be slow if done sequentially; therefore we introduce parallelism with concurrency control.
Parallel Requests with Concurrency Limit
To speed up the sub-requests for the "completed" list, we can use a promise-based multiRequest function that issues up to maxConcurrency requests in parallel:
function multiRequest(urls, maxConcurrency) {
  var len = urls.length;
  var results = new Array(len);
  var nextIndex = 0; // next URL to dispatch
  var finished = 0;  // requests completed (success or failure)
  return new Promise(function (resolve) {
    if (len === 0) return resolve(results);
    function next() {
      if (nextIndex >= len) return; // nothing left to dispatch
      var current = nextIndex++;    // capture the index before moving on
      fetch(urls[current])
        .then(function (resp) { return resp.json(); })
        .then(function (data) {
          // Process and store in results[current] to preserve order
          results[current] = processDisposeData(data.data);
        })
        .catch(function (err) {
          results[current] = err; // keep the slot so ordering survives errors
        })
        .then(function () {
          finished++;
          if (finished === len) resolve(results);
          else next();
        });
    }
    // Kick off up to maxConcurrency requests in parallel
    for (var i = 0; i < Math.min(maxConcurrency, len); i++) next();
  });
}
function processDisposeData(items) {
  var depts = [];
  var times = [];
  var msgs = [];
  var sats = [];
  items.forEach(function (d) {
    depts.push(d.handleDeptName);
    times.push(d.createTime);
    msgs.push(d.message);
    var review = d.disReviewSituationModels;
    if (review && review.length) {
      var sat = review[review.length - 1].satisfactionType;
      sats.push(sat === '1' ? 'Satisfied' : (sat === '0' ? 'Unsatisfied' : 'No feedback'));
    } else {
      sats.push('No review');
    }
  });
  return {
    handleDeptNames: depts.join('<br/>=====<br/>'),
    last_time: maxDate(times),
    messages: msgs.join('<br/>===========<br/>'),
    satisfactions: sats.join('<br/>=====<br/>')
  };
}
With this approach, we can collect all processing details efficiently and then build the final HTML table.
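Wiring this up means building one detail URL per completed item and handing the list to multiRequest. The path below mirrors the per-item detail endpoint shown earlier; whether the processing-details API uses exactly this path is an assumption, as are the sample values.

```javascript
// Build one detail URL per completed item; multiRequest then fetches them
// with bounded concurrency. baseUrl and token come from the captured
// request; the path mirrors the earlier detail endpoint (an assumption).
function buildDetailUrls(baseUrl, token, items) {
  return items.map(function (item) {
    return baseUrl + 'distribute/disEventInfo/' + item.id + '?accessToken=' + token;
  });
}

// Usage sketch (values invented):
var urls = buildDetailUrls('http://example.com/api/', 'abc123', [{ id: 7 }, { id: 9 }]);
// multiRequest(urls, 5).then(function (details) { /* merge into rows */ });
```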
Exporting to Excel via HTML Table
Regardless of the method, the final HTML table string is converted to an Excel file using a data URI. The trick is to embed the table inside an HTML document that mimics an Excel workbook, with appropriate XML headers to make the .xls file render correctly.
function downloadExcel(tableHtml, filename) {
  var template = `
    <html xmlns:o="urn:schemas-microsoft-com:office:office"
          xmlns:x="urn:schemas-microsoft-com:office:excel">
    <head>
      <title></title>
      <xml>
        <x:ExcelWorkbook>
          <x:ExcelWorksheets>
            <x:ExcelWorksheet>
              <x:Name>${filename}</x:Name>
              <x:WorksheetOptions>
                <x:DisplayGridlines/>
              </x:WorksheetOptions>
            </x:ExcelWorksheet>
          </x:ExcelWorksheets>
        </x:ExcelWorkbook>
      </xml>
      <style>br {mso-data-placement:same-cell;}</style>
    </head>
    <body>
      <table>${tableHtml}</table>
    </body>
    </html>`;
  var encoded = window.btoa(unescape(encodeURIComponent(template)));
  var dataUri = 'data:application/vnd.ms-excel;base64,' + encoded;
  var link = document.createElement('a');
  link.href = dataUri;
  link.download = filename + '.xls';
  link.click();
}
The CSS rule br {mso-data-placement:same-cell;} ensures that line breaks within a cell do not split the cell into multiple rows.
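For example, joining several departments with <br/> separators, as processDisposeData does, yields a cell that Excel renders as one multi-line cell (the department names below are invented samples):

```javascript
// Departments joined with <br/> separators stay inside a single Excel
// cell because of the mso-data-placement:same-cell rule in the template.
var depts = ['Street Office', 'Sanitation Bureau']; // sample values
var cell = '<td>' + depts.join('<br/>=====<br/>') + '</td>';
```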
Utility Functions
Date Helpers
function maxDate(dateArray) {
  var timestamps = dateArray.map(function (d) { return new Date(d.replace(/-/g, '/')).getTime(); });
  var maxTs = Math.max.apply(null, timestamps);
  return dateArray[timestamps.indexOf(maxTs)];
}

function compareDates(d1, d2) {
  return new Date(d1.replace(/-/g, '/')) > new Date(d2.replace(/-/g, '/'));
}
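The replace(/-/g, '/') call is there because some JavaScript engines (older WebKit in particular) fail to parse "YYYY-MM-DD hh:mm" strings, while the slash form parses everywhere. A self-contained usage sketch, with invented dates (the helper is repeated so the snippet runs standalone):

```javascript
function maxDate(dateArray) {
  // Slashes instead of dashes so the date string parses in every engine
  var timestamps = dateArray.map(function (d) { return new Date(d.replace(/-/g, '/')).getTime(); });
  var maxTs = Math.max.apply(null, timestamps);
  return dateArray[timestamps.indexOf(maxTs)];
}

// Sample timestamps (invented); the latest one is returned unchanged.
var latest = maxDate(['2023-01-05 10:00', '2023-02-01 09:30', '2023-01-28 17:45']);
```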
URL Parameter Extraction
function getParam(url, param) {
  var regex = new RegExp('[?&]' + param + '=([^&]+)');
  var match = regex.exec(url);
  return match ? match[1] : null;
}
Console Interaction Enhancements
To make the script more engaging, you can log ASCII art (converted from an image) or fetch a random poem from an API like http://api.tianapi.com/qingshi/index?key=your_key. The poem API returns JSON similar to:
{"code":200,"newslist":[{"content":"...","source":"...","author":"..."}]}
This can be displayed in the console with coloured text:
if (poem) {
  console.log('%c✿ ' + poem.content, 'color:#ba0707;font-size:20px;');
  console.log('%c—— ' + poem.author + '《' + poem.source + '》', 'color:#740505;font-size:14px;');
}
When to Attach the Start Button
Wait for a crucial DOM element to exist, then inject the button:
$(document).ready(function () {
  var interval = setInterval(function () {
    if (document.querySelector('.target-element')) {
      clearInterval(interval);
      createStartButton();
    }
  }, 800);
});
Installing the Script
To use these scripts, install the Tampermonkey extension in your browser, create a new script, and paste the code. Then enable the script on the target website. The script will automatically inject the start button when the page is ready.
These techniques demonstrate how to extract table data from a web page using either DOM manipulation or network request interception, and then export the data as an Excel file—all within a single userscript.