Chapter 15: Generating and Visualizing Data
This chapter explores how to use matplotlib and Pygal to generate data and create practical visualizations. We'll cover the fundamentals of data visualization, which involves exploring data through visual representations, and data mining, which uses code to examine patterns and relationships within datasets.
15.1 Introduction to Visualization Libraries
matplotlib is a mathematical plotting library that enables creation of simple charts such as line charts and scatter plots. It provides extensive formatting options for customizing the appearance of visualizations.
Pygal focuses on generating charts optimized for digital devices. Charts created with Pygal automatically adjust to fit different screen sizes and include interactive features that highlight specific data points when users hover over different chart regions.
15.2 Creating Line Charts and Scatter Plots
To begin creating visualizations, import the pyplot module from matplotlib, commonly aliased as plt:
import matplotlib.pyplot as plt
Create a figure with a single subplot using the subplots() function:
fig, ax = plt.subplots(figsize=(15, 9), dpi=128)
This returns the figure object (fig) representing the entire chart and the axes object (ax) for plotting. The dpi parameter controls resolution, while figsize specifies dimensions in inches as a tuple.
Drawing Line Charts
Use plot() to create line charts:
ax.plot(x_data, y_data, color='green', marker='o', linestyle='-',
label='Data Series', linewidth=2, alpha=0.8)
Parameters include:
- First argument: x-coordinates (iterable, defaults to 0, 1, 2... if omitted)
- Second argument: y-coordinates (required)
- Third argument: style string combining color, marker, and line style
- label: legend text (requires calling legend())
- linewidth: line thickness
- alpha: transparency (0 to 1)
Color options include: g (green), b (blue), r (red), c (cyan), m (magenta). Custom colors can use RGB tuples like (0.02, 0.31, 0.62).
Marker styles: . (point), o (circle), ^ (triangle), v (inverted triangle), * (star), + (plus).
Line styles: - (solid), -- (dashed), -. (dash-dot), : (dotted).
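matplotlib's RGB tuples use floats between 0 and 1, while many color pickers report each channel as an integer from 0 to 255. The helper below is a small sketch of that conversion; the name rgb255_to_unit is hypothetical and not part of matplotlib.

```python
def rgb255_to_unit(red, green, blue):
    """Convert 0-255 integer channels to the 0-1 floats matplotlib expects."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
    return (red / 255, green / 255, blue / 255)


# A dark blue close to the (0.02, 0.31, 0.62) tuple mentioned above.
dark_blue = rgb255_to_unit(5, 79, 158)
print(tuple(round(value, 2) for value in dark_blue))
```

The resulting tuple can be passed wherever a color is accepted, for example as the color argument of plot().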
Creating Scatter Plots
The scatter() function creates scatter plots:
ax.scatter(x_values, y_values, edgecolor='black', c='blue', s=50)
Parameters include x and y coordinates (both required), edgecolor for outline color (use 'none' to remove outlines), c for point color (accepts color names or RGB tuples), and s for point size.
Use the cmap parameter to apply color maps that gradient from start to end colors based on values:
ax.scatter(x_values, y_values, c=y_values, cmap=plt.cm.Blues, s=40)
Chart Formatting
Add titles and axis labels:
plt.title('Chart Title', fontsize=24)
plt.xlabel('X Axis Label', fontsize=14)
plt.ylabel('Y Axis Label', fontsize=14)
Configure tick parameters:
plt.tick_params(axis='both', labelsize=14, which='major')
The axis parameter accepts 'x', 'y', or 'both'. The which parameter specifies 'major', 'minor', or 'both' tick marks.
Rotate x-axis labels to prevent overlapping:
fig.autofmt_xdate()
Display the legend:
ax.legend()
# Or set labels programmatically
line, = ax.plot([1, 2, 3])
line.set_label('My Label')
ax.legend()
Set axis ranges:
plt.axis([xmin, xmax, ymin, ymax])
Hide axes:
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
Adjust figure size and resolution:
plt.figure(dpi=128, figsize=(10, 6))
Save the chart to a file:
plt.savefig('chart.png', bbox_inches='tight')
The bbox_inches='tight' parameter trims excess whitespace.
Complete Example: Line Chart
import matplotlib.pyplot as plt
x_data = [1, 2, 3, 4, 5]
y_data = [1, 4, 9, 16, 25]
plt.plot(x_data, y_data, linewidth=5, label='Squared Values')
plt.title("Square Numbers", fontsize=24)
plt.xlabel("Value", fontsize=14)
plt.ylabel("Square of Value", fontsize=14)
plt.legend()
plt.tick_params(axis='both', labelsize=14)
plt.show()
Complete Example: Scatter Plot
import matplotlib.pyplot as plt
x_values = list(range(1, 1001))
y_values = [x**2 for x in x_values]
plt.scatter(x_values, y_values, edgecolor='none', c=(0, 0, 0.8), s=40, label='square')
plt.title("Square Numbers", fontsize=24)
plt.xlabel("Value", fontsize=14)
plt.ylabel("Square of Value", fontsize=14)
plt.legend()
plt.axis([0, 1100, 0, 1100000])
plt.tick_params(axis='both', labelsize=14)
plt.show()
Using Built-in Styles
matplotlib provides various built-in styles that configure background colors, grid lines, line widths, fonts, and sizes:
import matplotlib.pyplot as plt
print(plt.style.available) # List available styles
plt.style.use('seaborn-v0_8') # Apply a specific style (named 'seaborn' before matplotlib 3.6)
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], linewidth=3)
plt.show()
15.3 Random Walks
A random walk is a path determined by successive random decisions, with no clear direction. This concept is useful for simulating various natural phenomena.
Create a class to generate random walk data:
from random import choice
class RandomWalk:
    """A class to generate random walk data points."""
    def __init__(self, num_points=5000):
        """Initialize random walk attributes."""
        self.num_points = num_points
        # Every walk starts at the origin (0, 0).
        self.x_values = [0]
        self.y_values = [0]
    def fill_walk(self):
        """Calculate all points in the random walk."""
        while len(self.x_values) < self.num_points:
            # Choose a direction and a distance for each axis.
            x_direction = choice([1, -1])
            x_distance = choice([0, 1, 2, 3, 4])
            x_step = x_direction * x_distance
            y_direction = choice([1, -1])
            y_distance = choice([0, 1, 2, 3, 4])
            y_step = y_direction * y_distance
            # Reject moves that go nowhere.
            if x_step == 0 and y_step == 0:
                continue
            next_x = self.x_values[-1] + x_step
            next_y = self.y_values[-1] + y_step
            self.x_values.append(next_x)
            self.y_values.append(next_y)
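The x and y steps are computed the same way, so the step logic can be factored into one helper. The following is one possible refactoring rather than the original design; the method name get_step is introduced here for illustration, and the fixed seed is only there to make the run repeatable.

```python
from random import choice, seed


class RandomWalk:
    """A random walk whose step logic lives in one helper method."""

    def __init__(self, num_points=5000):
        self.num_points = num_points
        self.x_values = [0]
        self.y_values = [0]

    def get_step(self):
        """Return a signed step: a direction (+1 or -1) times a distance (0-4)."""
        direction = choice([1, -1])
        distance = choice([0, 1, 2, 3, 4])
        return direction * distance

    def fill_walk(self):
        """Add points until the walk reaches num_points."""
        while len(self.x_values) < self.num_points:
            x_step = self.get_step()
            y_step = self.get_step()
            # Reject moves that go nowhere.
            if x_step == 0 and y_step == 0:
                continue
            self.x_values.append(self.x_values[-1] + x_step)
            self.y_values.append(self.y_values[-1] + y_step)


seed(42)  # fixed seed so repeated runs produce the same walk
rw = RandomWalk(num_points=100)
rw.fill_walk()
print(len(rw.x_values), len(rw.y_values))
```

Keeping the step calculation in one place also removes the chance of mixing up the x and y variables when the two copies are edited separately.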
Visualize the random walk:
import matplotlib.pyplot as plt
from random_walk import RandomWalk
rw = RandomWalk()
rw.fill_walk()
plt.scatter(rw.x_values, rw.y_values, s=5)
plt.show()
Simulate multiple random walks with color mapping:
import matplotlib.pyplot as plt
from random_walk import RandomWalk
while True:
    rw = RandomWalk()
    rw.fill_walk()
    # Set the figure size before plotting so it applies to this chart.
    plt.figure(figsize=(10, 6))
    point_numbers = list(range(rw.num_points))
    plt.scatter(rw.x_values, rw.y_values, s=5, c=point_numbers,
                cmap=plt.cm.Blues, edgecolors='none')
    # Highlight start and end points
    plt.scatter(rw.x_values[0], rw.y_values[0], c='green', edgecolors='none', s=50)
    plt.scatter(rw.x_values[-1], rw.y_values[-1], c='red', edgecolors='none', s=50)
    # Hide the axes; plt.gca() returns the axes the points were drawn on.
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
    plt.show()
    keep_running = input("Generate another walk? (y/n): ")
    if keep_running == 'n':
        break
15.4 Simulating Dice Rolls with Pygal
Pygal creates interactive, scalable vector graphics ideal for charts that must display across different screen sizes. Charts render as SVG files that can be opened in web browsers with interactive features.
A histogram is a bar chart showing the frequency of different outcomes.
Creating Bar Charts with Pygal
Pygal offers several bar chart types:
import pygal
basic_chart = pygal.Bar() # Standard bar chart
stacked_chart = pygal.StackedBar() # Stacked bar chart
horizontal_chart = pygal.HorizontalBar() # Horizontal bar chart
Customize chart colors using style classes:
import pygal
from pygal.style import LightenStyle as LS, LightColorizedStyle as LCS
# Shades generated from a single base color
my_style = LS('#333366')
chart = pygal.Bar(style=my_style)
# Same base color, built on the light colorized base style
my_style = LS('#333366', base_style=LCS)
chart = pygal.Bar(style=my_style)
Configure chart appearance with parameters or a Config object:
import pygal
# Direct parameters
chart = pygal.Bar(x_label_rotation=45, show_legend=False)
# Using Config for multiple settings
my_config = pygal.Config()
my_config.x_label_rotation = 45
my_config.show_legend = False
my_config.title_font_size = 24
my_config.label_font_size = 14
my_config.major_label_font_size = 18
my_config.truncate_label = 15
my_config.show_y_guides = False
my_config.width = 1000
chart = pygal.Bar(my_config)
Set chart properties and add data:
chart.title = "Chart Title"
chart.x_labels = ['Label1', 'Label2', 'Label3']
chart.x_title = "X Axis Title"
chart.y_title = "Y Axis Title"
chart.add('Data Series', [value1, value2, value3])
chart.render_to_file('chart.svg')
Dice Simulation
Create a Die class:
from random import randint
class Die:
    """Represents a single die."""
    def __init__(self, num_sides=6):
        """Default to 6-sided die."""
        self.num_sides = num_sides
    def roll(self):
        """Return random value between 1 and number of sides."""
        return randint(1, self.num_sides)
Simulate rolling a six-sided die 1000 times:
import pygal
from die import Die
die = Die()
results = []
for roll_num in range(1000):
    result = die.roll()
    results.append(result)
frequencies = []
# A single die can show every value from 1 up to num_sides.
for value in range(1, die.num_sides + 1):
    frequency = results.count(value)
    frequencies.append(frequency)
hist = pygal.Bar()
hist.title = "Rolling 1000 Times with D6"
hist.x_labels = ['1', '2', '3', '4', '5', '6']
hist.x_title = "Result"
hist.y_title = "Frequency"
hist.add('D6', frequencies)
hist.render_to_file('die_visual.svg')
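Calling results.count() once per face rescans the whole list each time. As an alternative sketch, collections.Counter from the standard library tallies every value in a single pass. The Die class is repeated here only so the snippet runs on its own, and the seed is fixed purely for repeatability.

```python
from collections import Counter
from random import randint, seed


class Die:
    """Minimal stand-in for the Die class above."""

    def __init__(self, num_sides=6):
        self.num_sides = num_sides

    def roll(self):
        return randint(1, self.num_sides)


seed(1)  # fixed seed for repeatable results
die = Die()
results = [die.roll() for _ in range(1000)]

# Counter tallies every value in a single pass over the results.
counts = Counter(results)
# A face that never came up is reported as zero rather than raising an error.
frequencies = [counts[value] for value in range(1, die.num_sides + 1)]
print(frequencies)
```

The frequencies list has the same shape as before, so it can be passed to hist.add() unchanged.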
Simulate rolling two different dice:
import pygal
from die import Die
die_1 = Die()
die_2 = Die(10) # 10-sided die
results = []
for roll_num in range(50000):
    result = die_1.roll() + die_2.roll()
    results.append(result)
frequencies = []
for value in range(2, die_1.num_sides + die_2.num_sides + 1):
    frequency = results.count(value)
    frequencies.append(frequency)
hist = pygal.Bar()
hist.title = "Rolling D6 and D10 50000 Times"
hist.x_labels = [str(i) for i in range(2, 17)]
hist.x_title = "Result"
hist.y_title = "Frequency"
hist.add('D6 + D10', frequencies)
hist.render_to_file('dice_combination.svg')
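To judge whether the simulated histogram looks right, it helps to know how many face combinations produce each sum. The sketch below counts them directly; sum_combinations is a hypothetical helper name, not part of the dice program above.

```python
def sum_combinations(sides_1, sides_2):
    """Count how many (die_1, die_2) face pairs produce each possible sum."""
    combos = {total: 0 for total in range(2, sides_1 + sides_2 + 1)}
    for face_1 in range(1, sides_1 + 1):
        for face_2 in range(1, sides_2 + 1):
            combos[face_1 + face_2] += 1
    return combos


combos = sum_combinations(6, 10)
total_pairs = 6 * 10
for total, count in combos.items():
    # Expected share of rolls that land on this sum.
    print(f"{total:>2}: {count} of {total_pairs} ({count / total_pairs:.1%})")
```

Sums 7 through 11 each have six combinations, so the bars for those results should be roughly level in the D6 + D10 chart, with the counts tapering off toward 2 and 16.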
Chapter 16: Downloading Data
Data is commonly stored in CSV (comma-separated values) and JSON formats. This chapter covers processing weather data from CSV files and population data from JSON files.
16.1 Working with CSV Files
CSV files contain data as comma-separated values. Use Python's built-in csv module to process these files.
Read and parse CSV files:
import csv
filename = 'weather_data.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    for index, column_header in enumerate(header_row):
        print(index, column_header)
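The same pattern can be tried without a file on disk. In this sketch, io.StringIO stands in for the opened file, and the two data rows and column names are made up for illustration. It also builds a name-to-index lookup so later code does not need hard-coded column numbers.

```python
import csv
import io

# Made-up rows standing in for a weather file; the column names are illustrative only.
sample = io.StringIO(
    "Date,Max TemperatureF,Mean TemperatureF,Min TemperatureF\n"
    "2014-7-1,64,56,50\n"
    "2014-7-2,71,62,55\n"
)

reader = csv.reader(sample)
header_row = next(reader)  # consumes the first line
column_index = {name: index for index, name in enumerate(header_row)}
rows = list(reader)  # the remaining lines are data rows

print(column_index)
print(rows[0][column_index["Max TemperatureF"]])
```

Note that csv.reader returns every field as a string, which is why the later examples convert temperatures with int().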
Parse dates using datetime:
from datetime import datetime
date_obj = datetime.strptime('2014-7-1', '%Y-%m-%d')
Format codes: %Y (4-digit year), %y (2-digit year), %m (month 01-12), %d (day 01-31), %A (weekday), %B (month name), %H (24-hour), %I (12-hour), %M (minutes), %S (seconds), %p (am/pm).
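A short sketch of these codes in both directions: strptime() turns text into a datetime, and strftime() uses the same codes to turn a datetime back into text. The date strings are arbitrary examples.

```python
from datetime import datetime

date_obj = datetime.strptime("2014-7-1", "%Y-%m-%d")
print(date_obj.year, date_obj.month, date_obj.day)

# strftime() uses the same codes to go from datetime back to text.
print(date_obj.strftime("%Y-%m-%d"))
print(date_obj.strftime("%A, %B %d, %Y"))  # weekday and month names depend on locale

# Text that does not match the format raises ValueError.
try:
    datetime.strptime("July 1, 2014", "%Y-%m-%d")
except ValueError as error:
    print(f"Could not parse: {error}")
```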
Fill area between lines:
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
Complete example parsing weather data:
import csv
from datetime import datetime
import matplotlib.pyplot as plt
filename = 'sitka_weather_2014.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[0], "%Y-%m-%d")
        dates.append(current_date)
        highs.append(int(row[1]))
        lows.append(int(row[3]))
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.plot(dates, highs, c='red')
plt.plot(dates, lows, c='blue')
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
plt.title("Daily high and low temperatures - 2014", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()
Error Handling in Data Processing
Handle missing or invalid data using try-except blocks:
import csv
from datetime import datetime
import matplotlib.pyplot as plt
filename = 'death_valley_2014.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    dates, highs, lows = [], [], []
    for row in reader:
        try:
            current_date = datetime.strptime(row[0], "%Y-%m-%d")
            high = int(row[1])
            low = int(row[3])
        except ValueError:
            # Report the raw date text; current_date is not set if date parsing failed.
            print(row[0], 'missing data')
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.plot(dates, highs, c='red')
plt.plot(dates, lows, c='blue')
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
plt.title("Daily high and low temperatures - 2014\nDeath Valley, CA", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()
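The try-except-else logic can also be isolated in a small function, which makes it easy to check against a few rows by hand. This is a sketch only: parse_row is a hypothetical helper, the rows are made up, and catching IndexError for short rows is an extra assumption beyond the example above.

```python
from datetime import datetime


def parse_row(row):
    """Return (date, high, low) for a valid row, or None if data is missing."""
    try:
        current_date = datetime.strptime(row[0], "%Y-%m-%d")
        high = int(row[1])
        low = int(row[3])
    except (ValueError, IndexError):
        return None
    return current_date, high, low


# Made-up rows: the second has empty temperature fields.
rows = [
    ["2014-2-15", "70", "58", "46"],
    ["2014-2-16", "", "", ""],
    ["2014-2-17", "72", "60", "47"],
]
parsed = [parse_row(row) for row in rows]
valid = [item for item in parsed if item is not None]
print(f"{len(valid)} of {len(rows)} rows usable")
```

An empty string is what triggers the ValueError here, since int('') cannot be converted.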
16.2 Creating World Population Maps with JSON
JSON (JavaScript Object Notation) is a common format for storing structured data. This section covers processing population data and creating world maps using Pygal.
Processing JSON Data
Load JSON data:
import json
filename = 'population_data.json'
with open(filename) as f:
    pop_data = json.load(f)
for pop_dict in pop_data:
    if pop_dict['Year'] == '2010':
        country_name = pop_dict['Country Name']
        population = int(float(pop_dict['Value']))
        print(f"{country_name}: {population}")
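The two-step conversion int(float(...)) is needed because population values are stored as strings and some contain a decimal point. The sketch below shows the difference using json.loads() on an in-memory string; the records and country names are made up and only mirror the keys used above.

```python
import json

# Made-up records in the same shape as the population file described above.
raw = """[
  {"Country Name": "Exampleland", "Year": "2010", "Value": "1127437398.85751"},
  {"Country Name": "Samplestan", "Year": "2010", "Value": "8500000"}
]"""
pop_data = json.loads(raw)

value = pop_data[0]["Value"]
try:
    int(value)  # int() rejects strings containing a decimal point
except ValueError:
    print(f"int({value!r}) fails")

population = int(float(value))  # float() parses the text, int() drops the fraction
print(population)
```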
Get country codes for Pygal maps:
from pygal_maps_world.i18n import COUNTRIES
def get_country_code(country_name):
    """Return the two-letter country code for a given country name."""
    for code, name in COUNTRIES.items():
        if name == country_name:
            return code
    return None
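The loop above scans the whole mapping on every call. When many names are looked up, one alternative is to invert the mapping once and use a dictionary lookup. This sketch uses a three-entry stand-in dictionary (SAMPLE_COUNTRIES, a hypothetical name) instead of the real COUNTRIES mapping so that it runs without Pygal installed.

```python
# A tiny stand-in for the COUNTRIES mapping (code -> name); the real one is far larger.
SAMPLE_COUNTRIES = {"ca": "Canada", "mx": "Mexico", "us": "United States"}

# Invert once so each lookup is a single dictionary access.
NAME_TO_CODE = {name: code for code, name in SAMPLE_COUNTRIES.items()}


def get_country_code(country_name):
    """Return the country code for a name, or None if it is not listed."""
    return NAME_TO_CODE.get(country_name)


print(get_country_code("Mexico"))
print(get_country_code("Atlantis"))
```

Either version returns None for names that do not match exactly, so callers still need to handle countries whose names are spelled differently in the data.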
Create world maps with Pygal:
from pygal_maps_world.maps import World
from pygal.style import RotateStyle, LightColorizedStyle
wm_style = RotateStyle('#336699', base_style=LightColorizedStyle)
wm = World(style=wm_style)
wm.title = "World Population"
wm.add('North America', {'ca': 34126000, 'us': 309349000, 'mx': 113423000})
wm.render_to_file('world_map.svg')
Complete population map example:
import json
from pygal_maps_world.i18n import COUNTRIES
from pygal_maps_world.maps import World
from pygal.style import RotateStyle, LightColorizedStyle
def get_country_code(country_name):
    for code, name in COUNTRIES.items():
        if name == country_name:
            return code
    return None
filename = 'population_data.json'
with open(filename) as f:
    pop_data = json.load(f)
# Map each recognized country code to its 2010 population.
cc_populations = {}
for pop_dict in pop_data:
    if pop_dict['Year'] == '2010':
        country_name = pop_dict['Country Name']
        population = int(float(pop_dict['Value']))
        country_code = get_country_code(country_name)
        if country_code:
            cc_populations[country_code] = population
# Group countries into three population levels.
cc_pops_1, cc_pops_2, cc_pops_3 = {}, {}, {}
for cc, pop in cc_populations.items():
    if pop < 10_000_000:
        cc_pops_1[cc] = pop
    elif pop < 1_000_000_000:
        cc_pops_2[cc] = pop
    else:
        cc_pops_3[cc] = pop
wm_style = RotateStyle('#336699', base_style=LightColorizedStyle)
wm = World(style=wm_style)
wm.title = 'World Population 2010'
wm.add('0-10M', cc_pops_1)
wm.add('10M-1B', cc_pops_2)
wm.add('1B+', cc_pops_3)
wm.render_to_file('world_population.svg')
Chapter 17: Using Web APIs
Web APIs (Application Programming Interfaces) allow programs to request specific information from websites rather than retrieving entire pages. This enables creating applications that always use the most current data.
17.1 Making API Requests
APIs return data in easily processed formats like JSON or CSV. GitHub's API provides access to repository information.
Example API call to GitHub:
https://api.github.com/search/repositories?q=language:python&sort=stars
This returns Python repositories sorted by star count.
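The pieces of that URL can be inspected with the standard library's urllib.parse module. This is only an illustration of how the call is structured; it does not contact GitHub.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://api.github.com/search/repositories?q=language:python&sort=stars"
parts = urlsplit(url)

print(parts.netloc)  # host that receives the request
print(parts.path)    # endpoint: search within repositories
print(parse_qs(parts.query))  # query parameters as a dict of lists
```

The q parameter carries the search query (language:python) and sort names the field used to order the results.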
Make API requests using the requests library:
import requests
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
response = requests.get(url, headers=headers)
print(f"Status code: {response.status_code}")
response_dict = response.json()
print(response_dict.keys())
The response contains total_count, incomplete_results, and items (list of repositories).
Process repository data:
repo_dicts = response_dict['items']
print(f"Total repositories: {response_dict['total_count']}")
print(f"Returned items: {len(repo_dicts)}")
for repo_dict in repo_dicts:
    print(f"\nName: {repo_dict['name']}")
    print(f"Owner: {repo_dict['owner']['login']}")
    print(f"Stars: {repo_dict['stargazers_count']}")
    print(f"URL: {repo_dict['html_url']}")
    print(f"Description: {repo_dict['description']}")
API Rate Limits
Most APIs impose rate limits restricting requests per time period. Check GitHub's rate limits:
https://api.github.com/rate_limit
The response shows limits for different API categories. For unauthenticated requests, the search API typically allows 10 requests per minute.
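As a rough sketch of reading such a response, the dictionary below imitates the general shape of the rate limit data. The structure (a resources entry with limit, remaining, and reset fields per category) is an assumption based on GitHub's documented format, the numbers are made up, and the real response contains more fields.

```python
from datetime import datetime, timezone

# Illustrative sample only; the numbers are made up and the real response has more fields.
rate_limit_response = {
    "resources": {
        "core": {"limit": 60, "remaining": 58, "reset": 1700000600},
        "search": {"limit": 10, "remaining": 8, "reset": 1700000060},
    }
}

search = rate_limit_response["resources"]["search"]
# "reset" is treated as a Unix timestamp (seconds since 1970-01-01 UTC).
reset_time = datetime.fromtimestamp(search["reset"], tz=timezone.utc)
print(f"{search['remaining']} of {search['limit']} search requests left")
print(f"Quota resets at {reset_time.isoformat()}")
```

Checking the remaining count before a batch of requests makes it possible to pause until the reset time instead of receiving error responses.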
17.2 Visualizing API Data with Pygal
Create interactive visualizations from API data:
import requests
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}
response = requests.get(url, headers=headers)
response_dict = response.json()
repo_dicts = response_dict['items']
names, stars = [], []
for repo_dict in repo_dicts:
    names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])
my_style = LS('#333366', base_style=LCS)
chart = pygal.Bar(style=my_style, x_label_rotation=45, show_legend=False)
chart.title = 'Most Popular Python Projects on GitHub'
chart.x_labels = names
chart.add('', stars)
chart.render_to_file('python_repos.svg')
Use Config object for detailed customization:
my_config = pygal.Config()
my_config.x_label_rotation = 45
my_config.show_legend = False
my_config.title_font_size = 24
my_config.label_font_size = 14
my_config.major_label_font_size = 18
my_config.truncate_label = 15
my_config.show_y_guides = False
my_config.width = 1000
chart = pygal.Bar(my_config, style=my_style)
Adding Custom Tooltips
Create custom tooltips showing additional information:
names, plot_dicts = [], []
for repo_dict in repo_dicts:
    names.append(repo_dict['name'])
    plot_dict = {
        'value': repo_dict['stargazers_count'],
        'label': str(repo_dict['description']),
    }
    plot_dicts.append(plot_dict)
chart.add('', plot_dicts)
The label key provides the tooltip text. Converting to string prevents errors with None values.
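One side effect of str() is that a missing description shows up in the tooltip as the literal text None. A possible variation, sketched below with made-up repository records, substitutes placeholder text instead; the placeholder wording is arbitrary.

```python
# Made-up repository records; only the keys used here are included.
repo_dicts = [
    {"name": "alpha", "stargazers_count": 120, "description": "A sample project"},
    {"name": "beta", "stargazers_count": 95, "description": None},
]

names, plot_dicts = [], []
for repo_dict in repo_dicts:
    names.append(repo_dict["name"])
    plot_dicts.append({
        "value": repo_dict["stargazers_count"],
        # `or` swaps None (or an empty string) for placeholder text.
        "label": repo_dict["description"] or "No description provided.",
    })

print(str(None))
print(plot_dicts[1]["label"])
```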
Adding Clickable Links
Make bars clickable by adding xlink:
plot_dict = {
    'value': repo_dict['stargazers_count'],
    'label': str(repo_dict['description']),
    'xlink': repo_dict['html_url'],
}
Clicking any bar opens the project's GitHub page in a new browser tab.
Additional Resources
For more information on matplotlib and Pygal, consult the official documentation and styling guides. These tools provide extensive customization options for creating professional visualizations.