Working with Word documents in Python requires specialized libraries to handle the .docx file format. The python-docx package provides comprehensive functionality for document manipulation.
Library Installation
Install the package using pip:
pip install python-docx
Basic Document Reading
Extract text content from a Word document with this implementation:
from docx import Document
def extract_document_content(filepath):
doc = Document(filepath)
# Process paragraphs
for paragraph in doc.paragraphs:
print(paragraph.text)
# Handle tables
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
print(cell.text)
# Identify headings
for section in doc.paragraphs:
if section.style.name.startswith('Heading'):
print(f'Heading: {section.text}')
# Example usage
extract_document_content('sample.docx')
Key Considerations
- The library primarily handles text content; complex elements like images require additional processing
- Text extraction may include formatting artifacts that need cleaning
- Document styles and formatting can be accessed through style attributes
- Memory usage increases with document size
Advanced Functionality
Document Modification
Update document content programmatically:
def update_document_text(input_file, output_file, old_text, new_text):
doc = Document(input_file)
for paragraph in doc.paragraphs:
if old_text in paragraph.text:
for run in paragraph.runs:
run.text = run.text.replace(old_text, new_text)
doc.save(output_file)
Content Addition
Insert new elements into documents:
def append_document_content(filename):
doc = Document(filename)
doc.add_paragraph('Additional content')
doc.add_heading('New Section', level=2)
new_table = doc.add_table(rows=2, cols=2)
header = new_table.rows[0].cells
header[0].text = 'Column A'
header[1].text = 'Column B'
doc.save(f'updated_{filename}')
Structural Analysis
The library enables extraction of document hierarchy including:
- Heading levels
- List items
- Section breaks
Style Management
Access and modify formatting attributes:
- Font properties
- Paragraph spacing
- Color schemes
Performance Optimization
For large documents:
- Implement streaming reads
- Process in chunks
- Use efficient memory management techniques