Python offers a variety of mechanisms to persist data, each suited for different formats, use cases, and performance requirements. Below is a comprehensive overview of the most widely used approaches, with practical examples.
CSV Files
CSV is a lightweight, human-readable format ideal for tabular data. The standard library provides csv, while pandas simplifies operations on structured data.
import csv
with open('output.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['product', 'price', 'category'])
writer.writerow(['Laptop', 999, 'Electronics'])
writer.writerow(['Book', 25, 'Education'])
Using pandas for direct DataFrame export:
import pandas as pd
df = pd.DataFrame({
'product': ['Laptop', 'Book', 'Pen'],
'price': [999, 25, 2],
'category': ['Electronics', 'Education', 'Stationery']
})
df.to_csv('products.csv', index=False)
Excel Files
Excel (.xlsx) is preferred for reports and human inspection. Pandas supports writing to multiple sheets within a single workbook.
import pandas as pd
sales = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'revenue': [12000, 15000, 18000]})
inventory = pd.DataFrame({'item': ['A', 'B', 'C'], 'stock': [45, 30, 12]})
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
sales.to_excel(writer, sheet_name='Sales', index=False)
inventory.to_excel(writer, sheet_name='Inventory', index=False)
To append data to an existing file without overwriting:
import pandas as pd
# Append new rows to an existing sheet
new_data = pd.DataFrame({'month': ['Apr'], 'revenue': [21000]})
existing = pd.read_excel('report.xlsx', sheet_name='Sales')
combined = pd.concat([existing, new_data], ignore_index=True)
with pd.ExcelWriter('report.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
combined.to_excel(writer, sheet_name='Sales', index=False)
Image Files
Images can be generated or saved using various libraries depending on the source data.
Using PIL (Pillow) for synthetic images:
from PIL import Image
import numpy as np
# Create a 100x100 red image
img_array = np.full((100, 100, 3), [255, 0, 0], dtype=np.uint8)
img = Image.fromarray(img_array)
img.save('red_image.jpg', 'JPEG')
With OpenCV (BGR format):
import cv2
import numpy as np
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[:] = [0, 0, 255] # BGR: red
cv2.imwrite('red_image_cv2.png', img)
Using matplotlib to save plots:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(50, 50)
plt.imshow(data, cmap='viridis')
plt.axis('off')
plt.savefig('heatmap.png', dpi=150, bbox_inches='tight', pad_inches=0)
Text Files
Plain text is useful for logs or simple key-value storage.
lines = [
"2024-05-01 INFO: System started",
"2024-05-01 WARN: Low memory",
"2024-05-01 INFO: Process completed"
]
with open('log.txt', 'w') as f:
f.write('\n'.join(lines))
Save NumPy arrays as formatted text:
import numpy as np
matrix = np.array([[1.1, 2.2], [3.3, 4.4]])
np.savetxt('matrix.txt', matrix, fmt='%.2f', delimiter='\t')
JSON Files
JSON is ideal for structured, hierarchical data and API communication.
import json
config = {
"database": {
"host": "localhost",
"port": 5432,
"timeout": 30
},
"features": ["auth", "logging", "backup"]
}
with open('config.json', 'w') as f:
json.dump(config, f, indent=2)
Pandas can export DataFrames as JSON arrays:
import pandas as pd
records = pd.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"score": [88.5, 92.0, 76.5]
})
records.to_json('records.json', orient='records', lines=True)
SQL Databases
SQLite is built into Python and suitable for lightwieght applications.
import sqlite3
import pandas as pd
data = pd.DataFrame({
"user_id": [101, 102, 103],
"email": ["a@example.com", "b@example.com", "c@example.com"],
"joined": ["2024-01-15", "2024-02-20", "2024-03-10"]
})
conn = sqlite3.connect('users.db')
data.to_sql('users', conn, if_exists='replace', index=False)
conn.close()
For remote databases like MySQL, use SQLAlchemy with a connector:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')
data.to_sql('users', engine, if_exists='replace', index=False)
Binary Files
For raw data or serialized objects, binary formats offer efficiency and compactness.
# Save raw bytes
raw_bytes = bytes([0x10, 0x20, 0x30, 0x40])
with open('data.bin', 'wb') as f:
f.write(raw_bytes)
NumPy’s native format preserves array structure and dtype:
import numpy as np
arr = np.random.randint(0, 100, size=(1000,))
np.save('array_data.npy', arr)
# Load back
loaded = np.load('array_data.npy')
Pickle Files
Pickle serializes Python objects, including custom classes, making it powerful but not portable across languages.
import pickle
model = {
'algorithm': 'RandomForest',
'parameters': {'n_estimators': 100, 'max_depth': 5},
'accuracy': 0.94
}
with open('model.pkl', 'wb') as f:
pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
restored = pickle.load(f)
Parquet Files
Parquet is a columnar storage format optimized for analytics and large datasets, commonly used in data engineering.
import pandas as pd
large_dataset = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=10000, freq='min'),
'value': np.random.randn(10000),
'category': np.random.choice(['A', 'B', 'C'], 10000)
})
large_dataset.to_parquet('analytics.parquet')