In the previous section, we covered the fundamentals of clustering theory to establish a solid theoretical foundation. Today’s focus is on exploring visualization techniques for data analysis. We've previously encountered visualization methods in regression analysis, and now we'll shift our attention to cluster analysis visualization. We’ll explore how to use scatter plots and concentric circles to gain intuitive insights into clustering outcomes.
Data Visualization — Clustering
Our primary objective today is to read and analyze data from a CSV file of Nigerian songs. This dataset contains extensive music metadata, including fields like track name, genre, artist, popularity, danceability, and release date. During our examination, we will narrow the data down to the three most prominent genres and extract the relevant attributes. Then we will investigate how the features correlate within this subset and examine their distribution patterns.
Note that this chapter does not aim to delve deeply into clustering algorithms themselves; instead, our emphasis lies on employing visualization tools to interpret and understand the data effectively. This approach allows us to more clearly identify trends and patterns within the dataset, laying the groundwork for future analysis.
Data Filtering
First, we need to import key libraries:
!pip install seaborn
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("../data/nigerian-songs.csv")
df.head()
Next, we perform an initial inspection of the dataset to understand its overall structure and content.
The following commands provide comprehensive insights into the data format, size, and key statistics:
df.info()
df.isnull().sum()
df.describe()
df.info(): Quickly grasp the dataset schema and column types.
df.isnull().sum(): Identify columns with missing values and how many each contains.
df.describe(): Get basic summary statistics for the numeric columns, which helps in understanding the data distribution.
Reviewing the output of describe() gives us critical summary statistics and distribution characteristics. Additional details were previously discussed and can be referenced via the accompanying figures.
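To make the inspection step concrete, here is a minimal sketch on a small, hypothetical DataFrame (the rows and the name `toy` are invented for illustration and are not taken from the real file). It shows what the null count and the summary statistics report when a couple of values are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the music dataset (not the real file).
toy = pd.DataFrame({
    "name": ["song a", "song b", "song c", "song d"],
    "artist_top_genre": ["afropop", "afropop", None, "nigerian pop"],
    "popularity": [30, 0, 45, 12],
    "danceability": [0.71, 0.65, np.nan, 0.80],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Summary statistics; by default only numeric columns are included.
summary = toy.describe()

print(missing_per_column)
print(summary)
```

Note that the `count` row of `describe()` excludes missing values, so a count lower than the number of rows is another hint that a column has gaps.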
Data Selection
We now proceed to filter the data, targeting the top three most popular music genres. To achieve this, we plot artist_top_genre on the X-axis to better observe data distribution. Here is the corresponding code:
import seaborn as sns
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
The chart shows the top five genres; the three prominent ones we will work with are afro dancehall, afropop, and nigerian pop.
Since no null entries were detected during inspection, we proceed directly with plotting without removing rows. However, if your dataset contains missing values, it's advisable to drop rows with null entries before plotting to maintain data integrity and ensure accurate visualizations, avoiding potential biases.
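If your own data does contain gaps, dropping the affected rows could look like the sketch below. The toy rows and the variable names `toy` and `cleaned` are assumptions made for illustration; only `dropna` itself is the pandas call being demonstrated:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with gaps; the dataset in this chapter reported no nulls.
toy = pd.DataFrame({
    "artist_top_genre": ["afropop", None, "nigerian pop", "afro dancehall"],
    "popularity": [30.0, 12.0, np.nan, 45.0],
})

# Drop every row that has a null in one of the columns used for plotting.
cleaned = toy.dropna(subset=["artist_top_genre", "popularity"])

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Restricting `subset` to the columns you actually plot avoids discarding rows because of gaps in unrelated fields.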
# Keep only the three genres of interest
df = df[df['artist_top_genre'].isin(['afro dancehall', 'afropop', 'nigerian pop'])]
# Drop tracks whose popularity is zero
df = df[df['popularity'] > 0]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
Our data filtering is complete. We have now isolated the top three genres, as illustrated in the chart.
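The genre names above were typed in by hand after reading the chart. As an alternative, the most frequent genres can be derived from the counts themselves. The following is a sketch under assumptions: the toy rows are invented, and it takes the three most frequent labels, which may not be what you want on real data.

```python
import pandas as pd

# Hypothetical toy rows: four genres with different frequencies.
toy = pd.DataFrame({
    "artist_top_genre": ["afropop"] * 4 + ["afro dancehall"] * 3
                        + ["nigerian pop"] * 2 + ["alternative r&b"],
    "popularity": [10, 0, 25, 31, 5, 0, 18, 40, 22, 9],
})

# value_counts() sorts by frequency, so head(3) yields the three largest genres.
top_three = toy["artist_top_genre"].value_counts().head(3).index

# Keep rows in those genres that also have non-zero popularity.
filtered = toy[toy["artist_top_genre"].isin(top_three) & (toy["popularity"] > 0)]

print(filtered["artist_top_genre"].value_counts())
```

If the most frequent labels include a category you do not want to analyze, an explicit list of genre names remains the safer choice.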
Strong Correlations
Similarly, let’s revisit the heatmap. We’ve already explored this in regression analysis, so we’ll just present the code here:
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
From the heatmap, it’s evident that energy and loudness show strong correlation—this aligns with expectations since loud and energetic tracks often go hand-in-hand.
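The heatmap conveys this visually. To read the same information as numbers, one option is to rank feature pairs by absolute correlation. The sketch below uses synthetic data in which loudness is deliberately constructed to follow energy, so the values illustrate the technique only and say nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the real songs).
rng = np.random.default_rng(0)
n = 200
energy = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "energy": energy,
    # Loudness is built to track energy, plus some noise.
    "loudness": -20 + 15 * energy + rng.normal(0, 1, n),
    "tempo": rng.uniform(60, 180, n),
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(key=abs, ascending=False)

print(pairs.head())
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.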
We will now introduce a new visualization technique to help better understand the distribution of data in clustering.
Data Distribution
Concentric Circles
We now analyze the data based on popularity and danceability metrics using both concentric circle plots and scatter plots. These visualizations offer clearer insights into the data distribution and trends. You can also choose alternative fields for comparison according to your preferences.
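from sklearn.preprocessing import LabelEncoder
# Replace the values in the columns at positions 6 and 7 (expected here to be
# popularity and danceability) with integer codes, one column at a time
df.iloc[:, 6:8] = df.iloc[:, 6:8].apply(LabelEncoder().fit_transform)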
sns.set_theme(style="ticks")
g = sns.jointplot(
data=df,
x="popularity", y="danceability", hue="artist_top_genre",
kind="kde",
)
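Because the two plotted columns have different data types, I first convert their values to integers with LabelEncoder, which replaces each distinct value with an integer code assigned in sorted order. The axes therefore show these codes rather than the raw values. The goal is a joint distribution plot illustrating the relationship between popularity and danceability, color-coded by genre; with kind="kde", the density is drawn as nested contour lines, which is where the concentric circles come from.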
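To see what the integer conversion does to a column, here is a small sketch with hypothetical danceability values (the numbers and variable names are invented). It assumes scikit-learn is installed, as in the code above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical danceability values, including one repeated value.
danceability = np.array([0.80, 0.65, 0.71, 0.65])

encoder = LabelEncoder()

# Each distinct value is mapped to its position among the sorted unique values.
codes = encoder.fit_transform(danceability)

# The fitted encoder can map the codes back to the original values.
restored = encoder.inverse_transform(codes)

print(codes)
print(encoder.classes_)
```

The codes preserve the ordering of the values but not the distances between them, which is worth keeping in mind when reading the axes of the resulting plot.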
Scatter Plot
sns.FacetGrid(df, hue="artist_top_genre").map(plt.scatter, "popularity", "danceability", s=5).add_legend()
A single line of code displays the scatter distribution, as illustrated below:
For clustering tasks, scatter plots are highly effective for observing groupings, which makes them essential for interpreting structure and patterns in the data. In upcoming lessons, we will apply the K-Means clustering algorithm to the filtered dataset to discover groups that overlap in interesting ways.
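As a numerical companion to the scatter plot, the per-genre averages of the two plotted features give a rough idea of where each group sits. This is a sketch on invented rows, not output from the real dataset, and it is not a clustering algorithm:

```python
import pandas as pd

# Hypothetical toy rows: two tracks per genre.
toy = pd.DataFrame({
    "artist_top_genre": ["afropop", "afropop", "nigerian pop",
                         "nigerian pop", "afro dancehall", "afro dancehall"],
    "popularity": [20, 30, 10, 14, 40, 50],
    "danceability": [0.70, 0.80, 0.60, 0.64, 0.85, 0.75],
})

# Mean popularity and danceability per genre: one row per genre.
centers = toy.groupby("artist_top_genre")[["popularity", "danceability"]].mean()

print(centers)
```

When the per-genre averages lie close together, the genres overlap heavily in these two features, which is the kind of pattern the scatter plot makes visible.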
Summary
This chapter delved into applying data visualization in cluster analysis. Through analyzing a music dataset, we successfully identified the top three genres and used scatter plots and concentric circle diagrams to visually represent data distributions and trends. Visualization enhances our understanding of data and lays a strong foundation for further clustering efforts.
Through this process, we can identify hidden patterns and support decision-making. As demonstrated, data visualization is an exploratory journey that uncovers subtle relationships within complex datasets. In the next part, we will implement the K-Means clustering algorithm to uncover deeper stories embedded in this data.