Extracting Web Table Data and Exporting to Excel Using a Tampermonkey Script

Often we need to download tabular data displayed on a web page as an Excel file. When the site does not provide a suitable download option, a custom script can gather the data and export it. Packaging the script as a Tampermonkey userscript makes the solution easy to share with others. Below are two practical approaches: one that simulates manual pagination, and another that intercepts the underlying HTTP requests to collect the data more efficiently.

Approach 1: Simulating Pagination to Collect Table Data

This method works when the table data is rendered inside an <iframe> and the page exposes its data through a jQuery plugin. The idea is to automate page turning, capture each page’s data, and merge them.

Because the table resides in a separate iframe (MFrame), we need to obtain a jQuery instance scoped to that iframe:

var frame = document.getElementById("MFrame");
// contentWindow is the iframe's window object, which holds its jQuery instance
var frameWin = frame.contentWindow || frame.contentDocument;
var frameJQ = frameWin.jQuery;

After the page loads, a small "Start" button appears in the bottom-left corner. When clicked, the script:

  • Sorts the table by clicking the appropriate column header to ensure consistent ordering.
  • Retrieves the total number of records and the total page count from the page elements.
  • Iterates through each page using setInterval to detect when the next page has loaded, then appends the new data.
  • Builds an HTML table string and finally triggers a download via a data URI.

Below is a simplified version of the script (adapted with different variable names and streamlined logic):

// ==UserScript==
// @name         TableScraper
// @namespace    http://tampermonkey.net/
// @version      1.0
// @description  Scrape paginated table inside iframe and export to Excel
// @match        http://example.com/Index.aspx
// @require      https://ajax.aspnetcdn.com/ajax/jQuery/jquery-3.5.1.min.js
// @grant        GM_addStyle
// ==/UserScript==

GM_addStyle('.scrape-btn { position:fixed; bottom:5px; left:5px; padding:12px 20px; border-radius:10px; cursor:pointer; background:transparent; color:#ccc; }');

var fullData = '';
var sheetName = '';
var totalRecords = 0;
var totalPages = 0;
var recordCount = 0;
var prevCaseNumber = '';
const COLUMNS = ['courseId', 'courseTime', 'courseContent'];

$(document).ready(function () {
  var checkExist = setInterval(function () {
    if (document.querySelector('.m-top-box')) {
      clearInterval(checkExist);
      addStartButton();
    }
  }, 800);
});

function addStartButton() {
  var btn = document.createElement('button');
  btn.textContent = 'START';
  btn.className = 'scrape-btn';
  btn.addEventListener('click', startScraping, false);
  document.body.appendChild(btn);
}

function startScraping() {
  var frame = document.getElementById('MFrame');
  var frameWin = frame.contentWindow || frame.contentDocument;
  var $f = frameWin.jQuery;

  // Sort by the 6th column to keep order
  $f('.th_td:nth-child(6)').click();

  totalRecords = $f('.tdlist').data('ordertable').COUNT;
  totalPages = parseInt($f('#AllPageNum').text());

  var now = new Date();
  var y = now.getFullYear();
  var m = String(now.getMonth() + 1).padStart(2, '0');
  var d = String(now.getDate()).padStart(2, '0');
  var h = String(now.getHours()).padStart(2, '0');
  var min = String(now.getMinutes()).padStart(2, '0');
  sheetName = `${y}-${m}-${d}-${h}-${min}`;

  // Build header row
  fullData = '<tr>';
  COLUMNS.forEach(function (col) {
    fullData += '<th>' + col + '</th>';
  });
  fullData += '</tr>';

  recordCount = 0;
  prevCaseNumber = '';

  // Capture the first page's data after a short delay
  setTimeout(function () {
    processNextPage($f, 1);
  }, 500);
}

function processNextPage($f, pageNum) {
  var timer = setInterval(function () {
    var currentData = $f('.tdlist').data('ordertable').TABLE;
    var currentFirstCase = currentData[0].courseId;
    var currentPage = parseInt($f('#txtPage').val());

    if (currentFirstCase !== prevCaseNumber && currentPage === pageNum) {
      clearInterval(timer);
      prevCaseNumber = currentFirstCase;

      recordCount += currentData.length;
      appendRows(currentData);

      // Check if next page button is disabled
      var nextDisabled = $f('#NextPageBtn .btn_disabled').text();
      if (nextDisabled === '0') {
        // Still have next page
        $f('.page-next').click();
        setTimeout(function () {
          processNextPage($f, pageNum + 1);
        }, 200);
      } else {
        // No more pages -> download
        setTimeout(function () {
          downloadExcel(fullData, sheetName);
          console.log(`Total records: ${recordCount}/${totalRecords}`);
        }, 500);
      }
    }
  }, 300);
}

function appendRows(tableData) {
  tableData.forEach(function (row) {
    fullData += '<tr>';
    COLUMNS.forEach(function (col) {
      fullData += '<td style="text-align:center;">' + (row[col] || '') + '</td>';
    });
    fullData += '</tr>';
  });
}

This approach is straightforward but tightly coupled to the page’s DOM structure and the behaviour of its pagination.

Approach 2: Intercepting HTTP Requests to Fetch Data

When the page code is minified/obfuscated, mimicking clicks becomes impractical. Instead, we can intercept the AJAX requests made by the page, extract the required authentication token, and then issue our own requests to the backend API. This gives us direct access to the JSON data.

We can hook (monkey-patch) the XMLHttpRequest.prototype.open method to capture outgoing requests and store the token:

function createRequestInterceptor(logFn) {
  var originalOpen = XMLHttpRequest.prototype.open;
  return {
    enable: function () {
      var self = this;
      XMLHttpRequest.prototype.open = function () {
        var args = Array.prototype.slice.call(arguments);
        if (args[0] === 'POST') {
          var token = extractParam(args[1], 'accessToken');
          if (token) {
            logFn('Token captured: ' + token);
            self.disable();  // stop intercepting after token found
          }
        }
        originalOpen.apply(this, args);
      };
    },
    disable: function () {
      XMLHttpRequest.prototype.open = originalOpen;
    }
  };
}

function extractParam(url, param) {
  var regex = new RegExp('[?&]' + param + '=([^&]+)');
  var match = regex.exec(url);
  return match ? match[1] : null;
}

Once the token is captured, we can call the API endpoints directly. For example, to download the "to-do" list and the "completed" list (each requiring additional sub-requests), we can chain the calls:

function fetchTodoList(baseUrl, token) {
  var pageSize = 2; // initial small request to get total count
  var url = `${baseUrl}distribute/disEventInfo/toDoList?accessToken=${token}&boxType=1&pageNumber=1&pageSize=${pageSize}`;

  return $.post(url).then(function (res) {
    var total = res.total;
    var fullUrl = `${baseUrl}distribute/disEventInfo/toDoList?accessToken=${token}&boxType=1&pageNumber=1&pageSize=${total + 88}`;
    return $.post(fullUrl);
  }).then(function (res) {
    var items = res.data;
    var htmlRows = '';
    items.forEach(function (item) {
      htmlRows += '<tr>';
      // map columns
      htmlRows += '<td>' + item.caseNumber + '</td>';
      htmlRows += '<td>' + item.eventType + '</td>';
      htmlRows += '<td>' + item.regCompleteTime + '</td>';
      // phone may contain asterisks -> need another request
      var phone = item.reportPhone;
      if (phone && phone.indexOf('*') !== -1) {
        // synchronous call to get real phone
        var detailUrl = `${baseUrl}distribute/disEventInfo/${item.id}?accessToken=${token}`;
        $.ajax({ url: detailUrl, async: false }).done(function (detail) {
          phone = detail.data.reportPhone;
        });
      }
      htmlRows += '<td>' + (phone || '') + '</td>';
      htmlRows += '</tr>';
    });
    return htmlRows;
  });
}
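
One caveat with tables built by string concatenation: field values containing <, > or & will corrupt the generated markup. The snippets above insert raw values; a small defensive helper (hypothetical, named escapeCell here, not part of the original script) keeps cells intact:

```javascript
// Hypothetical helper: escape a raw value before placing it in a <td>.
function escapeCell(value) {
  return String(value == null ? '' : value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Usage: htmlRows += '<td>' + escapeCell(item.caseNumber) + '</td>';
console.log(escapeCell('A & B <test>')); // A &amp; B &lt;test&gt;
```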

Similarly, the completed list requires a sub-request for each item to get the processing details (handling departments, satisfaction, etc.). This can be slow if done sequentially; therefore we introduce parallelism with concurrency control.

Parallel Requests with Concurrency Limit

To speed up the sub-requests for the "completed" list, we can use a promise-based multiRequest function that keeps at most maxConcurrency requests in flight at once:

function multiRequest(urls, maxConcurrency) {
  var len = urls.length;
  var results = new Array(len);
  var nextIndex = 0;  // next URL to start
  var finished = 0;   // requests fully settled

  return new Promise(function (resolve) {
    if (len === 0) {
      resolve(results);
      return;
    }

    function next() {
      if (nextIndex >= len) return;  // nothing left to start
      var current = nextIndex++;     // claim an index for this slot
      fetch(urls[current])
        .then(function (resp) { return resp.json(); })
        .then(function (data) {
          // Process and store in results[current]
          results[current] = processDisposeData(data.data);
        })
        .catch(function (err) {
          results[current] = err;  // record the error; do not reject the batch
        })
        .then(function () {
          finished++;
          if (finished === len) {
            resolve(results);
          } else {
            next();  // this slot is free: start the next URL
          }
        });
    }

    // Launch up to maxConcurrency workers
    for (var i = 0; i < Math.min(maxConcurrency, len); i++) next();
  });
}

function processDisposeData(items) {
  var depts = [];
  var times = [];
  var msgs = [];
  var sats = [];
  items.forEach(function (d) {
    depts.push(d.handleDeptName);
    times.push(d.createTime);
    msgs.push(d.message);
    var review = d.disReviewSituationModels;
    if (review && review.length) {
      var sat = review[review.length - 1].satisfactionType;
      sats.push(sat === '1' ? 'Satisfied' : (sat === '0' ? 'Unsatisfied' : 'No feedback'));
    } else {
      sats.push('No review');
    }
  });
  return {
    handleDeptNames: depts.join('<br/>=====<br/>'),
    last_time: maxDate(times),
    messages: msgs.join('<br/>===========<br/>'),
    satisfactions: sats.join('<br/>=====<br/>')
  };
}

With this approach, we can collect all processing details efficiently and then build the final HTML table.
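
As a sketch of that final step, the per-item results resolved by multiRequest can be zipped back onto the completed-list items. The dispose endpoint path in the commented wiring is a placeholder, and the columns mirror the fields produced by processDisposeData above:

```javascript
// Merge completed-list items with their per-item details
// (details[i] corresponds to items[i], in multiRequest order).
function buildCompletedRows(items, details) {
  return items.map(function (item, i) {
    var d = details[i] || {};
    return '<tr>' +
      '<td>' + item.caseNumber + '</td>' +
      '<td>' + (d.handleDeptNames || '') + '</td>' +
      '<td>' + (d.last_time || '') + '</td>' +
      '<td>' + (d.satisfactions || '') + '</td>' +
      '</tr>';
  }).join('');
}

// Hypothetical wiring (endpoint path is a placeholder):
// var urls = items.map(function (it) {
//   return baseUrl + 'distribute/disEventInfo/dispose/' + it.id + '?accessToken=' + token;
// });
// multiRequest(urls, 5).then(function (details) {
//   var completedRows = buildCompletedRows(items, details);
// });
```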

Exporting to Excel via HTML Table

Regardless of the method, the final HTML table string is converted to an Excel file using a data URI. The trick is to embed the table inside an HTML document that mimics an Excel workbook, with appropriate XML headers to make the .xls file render correctly.

function downloadExcel(tableHtml, filename) {
  var template = `
    <html xmlns:o="urn:schemas-microsoft-com:office:office" 
          xmlns:x="urn:schemas-microsoft-com:office:excel">
      <head>
        <title></title>
        <xml>
          <x:ExcelWorkbook>
            <x:ExcelWorksheets>
              <x:ExcelWorksheet>
                <x:Name>${filename}</x:Name>
                <x:WorksheetOptions>
                  <x:DisplayGridlines/>
                </x:WorksheetOptions>
              </x:ExcelWorksheet>
            </x:ExcelWorksheets>
          </x:ExcelWorkbook>
        </xml>
        <style>br {mso-data-placement:same-cell;}</style>
      </head>
      <body>
        <table>${tableHtml}</table>
      </body>
    </html>`;

  var encoded = window.btoa(unescape(encodeURIComponent(template)));
  var dataUri = 'data:application/vnd.ms-excel;base64,' + encoded;
  var link = document.createElement('a');
  link.href = dataUri;
  link.download = filename + '.xls';
  link.click();
}

The CSS rule br {mso-data-placement:same-cell;} ensures that line breaks within a cell do not split the cell into multiple rows.

Utility Functions

Date Helpers

function maxDate(dateArray) {
  var timestamps = dateArray.map(function (d) { return new Date(d.replace(/-/g, '/')).getTime(); });
  var maxTs = Math.max.apply(null, timestamps);
  return dateArray[timestamps.indexOf(maxTs)];
}

function compareDates(d1, d2) {
  return new Date(d1.replace(/-/g, '/')) > new Date(d2.replace(/-/g, '/'));
}

URL Parameter Extraction

function getParam(url, param) {
  var regex = new RegExp('[?&]' + param + '=([^&]+)');
  var match = regex.exec(url);
  return match ? match[1] : null;
}

Console Interaction Enhancements

To make the script more engaging, you can log ASCII art (converted from an image) or fetch a random poem from an API like http://api.tianapi.com/qingshi/index?key=your_key. The poem API returns JSON similar to:

{"code":200,"newslist":[{"content":"...","source":"...","author":"..."}]}
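
A minimal sketch of unwrapping that payload before printing (the shape follows the JSON above; pickPoem is a name introduced here):

```javascript
// Pick the first poem out of the API response, or null on any failure.
function pickPoem(res) {
  return res && res.code === 200 && Array.isArray(res.newslist) && res.newslist.length
    ? res.newslist[0]
    : null;
}

var poem = pickPoem({ code: 200, newslist: [{ content: 'line', source: 'src', author: 'auth' }] });
console.log(poem && poem.author); // auth
```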

This can be displayed in the console with coloured text:

if (poem) {
  console.log('%c✿ ' + poem.content, 'color:#ba0707;font-size:20px;');
  console.log('%c—— ' + poem.author + '《' + poem.source + '》', 'color:#740505;font-size:14px;');
}

When to Attach the Start Button

Wait for a crucial DOM element to exist, then inject the button:

$(document).ready(function () {
  var interval = setInterval(function () {
    if (document.querySelector('.target-element')) {
      clearInterval(interval);
      createStartButton();
    }
  }, 800);
});

Installing the Script

To use these scripts, install the Tampermonkey extension in your browser, create a new script, and paste the code. Then enable the script on the target website. The script will automatically inject the start button when the page is ready.

These techniques demonstrate how to extract table data from a web page using either DOM manipulation or network request interception, and then export the data as an Excel file—all within a single userscript.

Tags: Tampermonkey, JavaScript, Excel, web scraping, HTTP interception

Posted on Sun, 17 May 2026 06:41:49 +0000 by WendyB