Efficient PDF Table Data Extraction to Text and Excel Using Python Libraries

Extracting tabular data from PDF documents is crucial for analytics and automation workflows, but it is difficult because PDF is a fixed-layout format that was not designed for editing. Manual copy-pasting is slow and error-prone, often producing misaligned or missing values. This guide outlines a streamlined approach that uses Python and two dedicated libraries to extract PDF table data and export it to plain text (TXT) or spreadsheet (Excel) files. The method requires no hand-written regular expressions or optical character recognition (OCR), and it adapts readily to batch processing of multiple PDF files.

Environment Setup: Required Python Library Installation

This demonstration relies on two primary Python libraries, each fulfilling a distinct role in the data extraction and export process:

  • Spire.PDF for Python: A lightweight library specifically designed for PDF manipulation. It provides capabilities to directly identify table structures within PDFs and accurately retrieve cell data without requiring external OCR tools.
  • Spire.XLS for Python: A specialized library for Excel file operations. It enables creating new Excel workbooks, writing data to sheets, and applying formatting optimizations, ensuring compatibility across various Excel versions.

To install these dependencies, open your command-line interface and execute the following pip commands. It is recommended to perform this within a virtual environment to prevent potential version conflicts:

pip install Spire.PDF
pip install Spire.XLS
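
After installation, a quick import check can confirm that both packages are visible to the interpreter you intend to use. The following is a minimal sketch: the helper name module_available is hypothetical, and the module names spire.pdf and spire.xls are taken from the import statements used later in this guide.

```python
import importlib


def module_available(module_name: str) -> bool:
    """Return True if the named module can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Module names as imported in the examples in this guide.
    for name in ("spire.pdf", "spire.xls"):
        status = "available" if module_available(name) else "missing"
        print(f"{name}: {status}")
```

If either module is reported as missing, the most likely cause is that pip installed the package into a different environment than the one running the script.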

Scenario 1: Exporting PDF Table Content to a TXT File

TXT files are advantageous for their small size and broad compatibility, making them suitable for quick data review and transfer. The core mechanism for extracting data using Spire.PDF for Python and writing it to a TXT file involves systematically navigating the PDF content by page → table → row → column and formatting the extracted information.

Procedural Steps

  1. Load the PDF Document: Utilize the LoadFromFile() method of the PdfDocument class to open the target PDF. Both relative and absolute paths are supported.
  2. Initialize Extraction Utility: Instantiate a PdfTableExtractor object, which is responsible for identifying and pulling tabular data from PDF pages.
  3. Iterate and Extract Data: Traverse through each page of the PDF, then each table found on that page, and subsequently each row and column within the table. Cell text is retrieved using the GetText(rowIndex, columnIndex) method.
  4. Write to TXT File: Arrange the extracted data, typically by separating columns with spaces and rows with newline characters, then write this structured text to a .txt file. Specify UTF-8 encoding to correctly handle international characters.
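
The formatting part of steps 3 and 4 can be sketched independently of the PDF library, operating on rows that have already been read into lists of strings. This is an illustrative sketch, not the library's own behavior: the function name format_table_lines and the sample data are hypothetical, and the tab delimiter is an assumption offered as an alternative to a single space, since cell text may itself contain spaces.

```python
import re


def format_table_lines(rows, delimiter="\t"):
    """Turn a table (list of rows, each a list of cell strings) into text lines."""
    lines = []
    for row in rows:
        cleaned = []
        for cell in row:
            # Treat a missing cell as empty and collapse internal whitespace
            # (including line breaks) so each table row stays on one line.
            text = "" if cell is None else str(cell)
            cleaned.append(re.sub(r"\s+", " ", text).strip())
        lines.append(delimiter.join(cleaned))
    return lines


# Hypothetical sample data standing in for cells read from a PDF table.
sample_rows = [
    ["Name", "Unit Price", "Qty"],
    ["Desk\nLamp", "19.99", "3"],
    ["Monitor", None, "1"],
]
table_lines = format_table_lines(sample_rows)
```

Joining table_lines with newline characters and writing the result with encoding="utf-8" then yields the TXT output described in step 4.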

Comprehensive Code Example

from spire.pdf.common import *
from spire.pdf import *

# 1. Load the input PDF document. Update 'sample_tables.pdf' with your file path.
input_document = PdfDocument()
input_document.LoadFromFile("sample_tables.pdf")

# 2. Prepare a list to accumulate all extracted table lines.
extracted_lines = []

# 3. Create an object for extracting tables from the PDF.
table_processor = PdfTableExtractor(input_document)

# 4. Iterate through each page of the PDF.
for page_number in range(input_document.Pages.Count):
    # Retrieve all tables present on the current page.
    page_tables = table_processor.ExtractTable(page_number)
    
    # Check if any tables were found on the page to avoid errors.
    if page_tables is not None and len(page_tables) > 0:
        # Process each table individually.
        for current_table in page_tables:
            num_rows = current_table.GetRowCount()  # Get total rows in table
            num_cols = current_table.GetColumnCount()  # Get total columns in table
            
            # Iterate through each row of the current table.
            for row_idx in range(num_rows):
                current_row_cells = []
                # Iterate through each column in the current row.
                for col_idx in range(num_cols):
                    # Fetch the text content of the current cell.
                    cell_content = current_table.GetText(row_idx, col_idx)
                    current_row_cells.append(cell_content)
                # Join cell contents with a space and add to the list of lines.
                extracted_lines.append(" ".join(current_row_cells))
            
            # Add an empty line between distinct tables for better readability in TXT.
            extracted_lines.append("")

# 5. Write all collected data to a TXT file.
with open("output_pdf_tables.txt", "w", encoding="utf-8") as text_file:
    text_file.write("\n".join(extracted_lines))

# 6. Release resources associated with the PDF document.
input_document.Close()

Scenario 2: Exporting PDF Table Content to an Excel File

Excel files are preferred for data analysis due to their robust features like filtering, formula support, and chart generation. This example combines Spire.PDF for Python for data extraction with Spire.XLS for Python for creating and populating the Excel file.

Procedural Steps

The data extraction portion mirrors the TXT example. The additional steps for Excel export include:

  1. Initialize Excel Workbook: Create a Workbook instance and clear any default worksheets to prepare for adding new ones.
  2. Populate Cells: Note that Excel uses 1-based indexing for rows and columns (unlike Python's 0-based indexing). Adjust extracted row and column indices by adding 1 before assigning data to ensure correct placement.
  3. Format and Save: Employ the AutoFitColumns() method to adjust column widths automatically based on content, enhancing readability. Finally, save the workbook to the target file path, specifying the Excel version.
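
The index adjustment in step 2 can be illustrated without either library. The sketch below, which uses a hypothetical helper to_excel_cells and made-up sample rows, relies on enumerate(..., start=1) to produce 1-based positions directly instead of adding 1 by hand.

```python
def to_excel_cells(rows):
    """Yield (excel_row, excel_col, value) triples with 1-based indices."""
    for excel_row, row in enumerate(rows, start=1):
        for excel_col, value in enumerate(row, start=1):
            yield excel_row, excel_col, value


# Hypothetical sample data standing in for an extracted table.
sample_rows = [["Name", "Qty"], ["Desk Lamp", "3"]]
cells = list(to_excel_cells(sample_rows))
```

Each triple corresponds to one cell assignment in the worksheet, so the header cell "Name" lands at row 1, column 1 rather than at the 0-based position it had in the Python list.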

Comprehensive Code Example

from spire.pdf import *
from spire.xls import *

# 1. Load the PDF document containing tables.
input_document = PdfDocument()
input_document.LoadFromFile("sample_tables.pdf")

# 2. Create a new Excel workbook instance and clear any default sheets.
output_spreadsheet = Workbook()
output_spreadsheet.Worksheets.Clear()

# 3. Initialize the table extraction component for the PDF.
table_processor = PdfTableExtractor(input_document)

# 4. Counter for naming new worksheets distinctly.
sheet_counter = 1

# 5. Iterate through each page of the PDF.
for page_number in range(input_document.Pages.Count):
    page_tables = table_processor.ExtractTable(page_number)
    if page_tables is not None and len(page_tables) > 0:
        # Process each table found on the current page.
        for current_table in page_tables:
            # Add a new worksheet to the Excel workbook for each table.
            active_sheet = output_spreadsheet.Worksheets.Add(f"Sheet{sheet_counter}")
            
            num_rows = current_table.GetRowCount()
            num_cols = current_table.GetColumnCount()
            
            # Iterate through table cells to write data to Excel.
            for row_idx in range(num_rows):
                for col_idx in range(num_cols):
                    cell_content = current_table.GetText(row_idx, col_idx)
                    # Assign cell value, adjusting for Excel's 1-based indexing.
                    active_sheet.Range[row_idx + 1, col_idx + 1].Value = cell_content
            
            # Automatically adjust column widths for optimal display.
            active_sheet.AllocatedRange.AutoFitColumns()
            # Increment the sheet counter for the next table.
            sheet_counter += 1

# 6. Save the Excel workbook to a file, specifying path and version.
output_spreadsheet.SaveToFile("exported_pdf_tables.xlsx", ExcelVersion.Version2016)

# 7. Release resources.
input_document.Close()
output_spreadsheet.Dispose()

Important Operational Notes

  1. File Path Verification: If a 'file not found' error occurs during execution, double-check the PDF file path. Using an absolute path (e.g., C:/documents/data/report.pdf) can help prevent such issues.
  2. Character Encoding: When writing to TXT files, always specify encoding="utf-8" in the open() function to correctly handle multi-byte characters and avoid garbled text. For Excel exports, Spire.XLS for Python generally supports various character sets, including Chinese, by default, requiring no additional configuration.
  3. Resource Management: It is critical to invoke Close() on PdfDocument objects and Dispose() on Workbook objects after they are no longer needed. This practice prevents memory leaks, especially in long-running processes or when handling many files consecutively.
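
One way to make sure the release call in note 3 always runs, even when extraction raises an exception partway through, is to wrap the object in a small context manager. The following is a minimal sketch under stated assumptions: the helper released and the FakeResource stand-in class are hypothetical and exist only to demonstrate the pattern without the real libraries.

```python
from contextlib import contextmanager


@contextmanager
def released(resource, release_method="Close"):
    """Yield the resource, then call its release method even if an error occurs."""
    try:
        yield resource
    finally:
        getattr(resource, release_method)()


class FakeResource:
    """Stand-in object (not a Spire class) that records release calls."""

    def __init__(self):
        self.calls = []

    def Close(self):
        self.calls.append("Close")

    def Dispose(self):
        self.calls.append("Dispose")


# The release method still runs although the body raises.
doc = FakeResource()
try:
    with released(doc):
        raise ValueError("simulated failure during extraction")
except ValueError:
    pass

# A different release method can be named explicitly.
book = FakeResource()
with released(book, "Dispose"):
    pass
```

Applied to the examples in this guide, the same wrapper would take a PdfDocument with the default "Close" and a Workbook with "Dispose", replacing the explicit calls at the end of each script.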

Posted on Wed, 13 May 2026 15:36:24 +0000 by Grayda