Introduction to Data Visualization Tools
Data visualization is one of the key components of any analytics project. An end-to-end analytics use case involves ideation, requirement gathering, getting the raw data, analyzing the data, building a predictive model, deploying the model, and communicating the end result to the business.
Throughout this process, both the analysis of the data and the communication of results to the business require visualizing the raw data and understanding the relationships among its features. Python is a popular choice for this work: pandas and NumPy handle the data, while Matplotlib, Seaborn, and the plotting methods built into pandas draw the charts.
We have another detailed tutorial covering the data visualization libraries in Python.
Below are some data visualization examples using Python on real data.
Data Visualization Projects in Python
Example 1:
Data visualization dataset: Iris Dataset
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white", color_codes=True)
# Jupyter/IPython magic: render plots inline in the notebook
%matplotlib inline
After all the libraries are imported, we load the data using the read_csv function of pandas and store it in a DataFrame.
df = pd.read_csv("./iris.csv")
The .head() method in pandas shows the first few rows, which helps in understanding the structure of the data.
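As a minimal sketch of this inspection step, the snippet below builds a tiny stand-in frame so that it runs without iris.csv. The rows are illustrative only; the column names follow the ones used in the rest of this example.

```python
import pandas as pd

# Illustrative stand-in rows; with the real file, df comes from pd.read_csv.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

preview = sample.head()                  # first five rows by default
print(preview)
print(sample.shape)                      # (rows, columns)
print(sample["Species"].value_counts())  # number of rows per species
```

Checking the shape and the per-class counts alongside .head() is a cheap way to confirm the file loaded as expected before plotting anything.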
The pandas library has a .plot() method that is handy for quick visual analysis. The scatter plot of sepal length against sepal width is displayed below.
df.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")
Seaborn can be used to generate similar plots. Its jointplot shows the bivariate scatter plot together with the univariate histogram of each feature.
# `height` replaced the older `size` argument in seaborn 0.9
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=df, height=5)
To see which species each point belongs to, FacetGrid in Seaborn is used. It colors the scatter plot by species.
sns.FacetGrid(df, hue="Species", height=5) \
   .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
   .add_legend()
A boxplot in Seaborn summarizes the distribution of an individual feature for each species.
sns.boxplot(x="Species", y="PetalLengthCm", data=df)
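To make the boxplot easier to read, here is a rough sketch of the numbers it draws, computed with NumPy on a handful of hypothetical petal lengths. It assumes the common convention of whiskers reaching 1.5 times the interquartile range, which is Seaborn's default.

```python
import numpy as np

# Hypothetical petal lengths (cm) for one species, with one unusually large value.
values = np.array([1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 3.0])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = values[(values < low_fence) | (values > high_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```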
A layer of individual points is added to this plot using the strip plot in Seaborn. To avoid all points falling on a single vertical line, jitter=True is used.
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=df)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=df, jitter=True, edgecolor="gray")
The violin plot combines the benefits of the previous two plots.
# violinplot draws onto the current axes and has no figure-size argument
sns.violinplot(x="Species", y="PetalLengthCm", data=df)
A kernel density estimate (KDE) plot is used to look at the univariate distribution of a feature by plotting its kernel density estimate.
sns.FacetGrid(df, hue="Species", height=6) \
   .map(sns.kdeplot, "PetalLengthCm") \
   .add_legend()
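For readers curious about what a kernel density estimate actually computes, the sketch below builds one by hand with NumPy: one Gaussian bump per sample, averaged into a smooth curve. The helper name gaussian_kde_1d, the sample values, and the fixed bandwidth of 0.4 are assumptions made for illustration; Seaborn picks a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical petal lengths from two well-separated groups.
petal_lengths = [1.4, 1.5, 1.3, 1.6, 4.7, 4.5, 4.9, 5.1]
grid = np.linspace(-2, 9, 1101)          # evaluation points, step 0.01
density = gaussian_kde_1d(petal_lengths, grid, bandwidth=0.4)

# A density should integrate to roughly 1; check with a simple Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, which is the main knob to keep in mind when reading KDE plots.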
To show the bivariate relation between each pair of features, the pair plot is used in Seaborn. In the plots below, the Iris-setosa species is clearly separated from the other two species.
sns.pairplot(df.drop("Id", axis=1), hue="Species", height=3)
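The diagonal panels of a pair plot show the univariate distribution of each feature. Setting diag_kind="kde" draws them as kernel density estimates rather than histograms.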
sns.pairplot(df.drop("Id", axis=1), hue="Species", height=3, diag_kind="kde")
So far, we have covered some of the visualizations using Seaborn. Now let’s explore some with the pandas library as well. Below is a boxplot using pandas.
df.drop("Id", axis=1).boxplot(by="Species", figsize=(12, 6))
The next plot is of Andrews curves, which use the attributes of each sample as the coefficients of a Fourier series.
from pandas.plotting import andrews_curves
andrews_curves(df.drop("Id", axis=1), "Species")
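To make the Fourier series idea concrete, here is a small sketch that computes the curve for a single sample with NumPy. The helper name andrews_curve and the sample values are illustrative assumptions; the formula is the standard Andrews curve definition.

```python
import numpy as np

def andrews_curve(row, t):
    """f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    row = np.asarray(row, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0), dtype=float)
    for k, coeff in enumerate(row[1:]):
        harmonic = k // 2 + 1                    # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if k % 2 == 0 else np.cos  # alternate sine and cosine terms
        result += coeff * wave(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
flower = [5.1, 3.5, 1.4, 0.2]   # one sample: four measurements
curve = andrews_curve(flower, t)
print(curve[:5])
```

Each sample becomes one curve, so samples with similar measurements trace similar shapes, which is why the species tend to form visible bundles in the plot.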
Parallel coordinates are another multivariate data visualization technique in pandas, where each feature is plotted on a separate vertical axis and a line is drawn connecting the feature values of each data sample.
from pandas.plotting import parallel_coordinates
parallel_coordinates(df.drop("Id", axis=1), "Species")
Radviz is another data visualization technique in pandas used for multivariate plotting. Each feature is placed as a point on a 2D plane, and each sample is then positioned as if it were attached to those points by springs whose stiffness is weighted by the value of the corresponding feature.
from pandas.plotting import radviz
radviz(df.drop("Id", axis=1), "Species")
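The spring analogy can be sketched in a few lines of NumPy. The helper radviz_points below is a hypothetical illustration of the idea, not the pandas implementation: features are scaled to [0, 1], one anchor per feature is placed on the unit circle, and each sample rests at the weighted average of the anchors.

```python
import numpy as np

def radviz_points(features):
    """Project rows of `features` (n_samples x n_features) into the unit circle."""
    X = np.asarray(features, dtype=float)
    # Min-max scale every feature to [0, 1] so the weights are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    W = (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)
    # One anchor per feature, evenly spaced around the unit circle.
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each sample rests at the weighted average of the anchors.
    totals = W.sum(axis=1, keepdims=True)
    return (W @ anchors) / np.where(totals == 0, 1.0, totals)

# Hypothetical samples with four features each.
demo = [[1, 0, 0, 0],   # pulled only by the first anchor
        [0, 1, 0, 0],   # pulled only by the second anchor
        [1, 1, 1, 1]]   # pulled equally by all anchors
points = radviz_points(demo)
print(points.round(3))
```

A sample dominated by one feature sits close to that feature's anchor, while a sample with balanced features sits near the center.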
These were some of the data visualization techniques applied to the Iris dataset.
Data Visualization dataset: San Francisco Salaries
The very first step is to read the data.
salaries = pd.read_csv('./Salaries.csv')
The columns present are checked using the .info() method in pandas.
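As a rough sketch of what this check reveals, the snippet below calls .info() on a small made-up frame. The rows and the choice of columns are hypothetical; the point is that a pay column holding text shows up with a non-numeric dtype.

```python
import io
import pandas as pd

# Hypothetical rows; column names are assumed for illustration.
demo = pd.DataFrame({
    "EmployeeName": ["A. Example", "B. Example", "C. Example"],
    "BasePay": ["155966.02", "Not Provided", "212739.13"],
    "Year": [2011, 2014, 2012],
})

buffer = io.StringIO()
demo.info(buf=buffer)        # info() prints by default; buf captures the text
summary = buffer.getvalue()
print(summary)

# BasePay was read as text, so it is not a numeric column yet.
print(pd.api.types.is_numeric_dtype(demo["BasePay"]))
```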
The pay columns are then converted to numeric. With errors='coerce', any value that cannot be parsed becomes NaN.
for col in ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']:
    salaries[col] = pd.to_numeric(salaries[col], errors='coerce')
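A quick way to see the effect of errors='coerce' is to try it on a few values by hand. The strings below are hypothetical examples of what a raw pay column might contain.

```python
import pandas as pd

# Hypothetical raw values: two parseable numbers, one text marker, one missing entry.
raw = pd.Series(["167411.18", "Not Provided", "0.00", None])
pay = pd.to_numeric(raw, errors="coerce")

print(pay)
print(pay.isna().sum())   # entries that could not be parsed or were missing
```

Counting the NaN values after the conversion is a useful sanity check, since a large count may mean the column needs cleaning rather than coercion.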
The pay columns to be plotted together are selected by position: every column from index 3 up to, but not including, Year.
pay_columns = salaries.columns[3:salaries.columns.get_loc('Year')]
pay_columns
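The slice relies on the column order of the file. Here is a small sketch of how it behaves, using an assumed column order for illustration; it is worth checking salaries.columns on the real data before relying on position 3.

```python
import pandas as pd

# Assumed column order, for illustration only.
columns = pd.Index(["Id", "EmployeeName", "JobTitle", "BasePay", "OvertimePay",
                    "OtherPay", "Benefits", "TotalPay", "TotalPayBenefits",
                    "Year", "Notes"])

year_position = columns.get_loc("Year")   # integer position of the label
pay_columns = columns[3:year_position]    # the end of the slice is exclusive
print(list(pay_columns))
```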
The histograms are laid out in a 2×3 figure. The idiom below is useful for grouping the elements of a list, here into rows of three column names.
pays_arrangement = list(zip(*(iter(pay_columns),) * 3))
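This zip idiom is compact but not obvious, so here it is on a plain list. The column names are assumptions standing in for the real pay columns.

```python
# Assumed names standing in for the pay columns.
names = ["BasePay", "OvertimePay", "OtherPay",
         "Benefits", "TotalPay", "TotalPayBenefits"]

# (iter(names),) * 3 repeats ONE iterator three times, so each tuple that
# zip builds pulls three consecutive items from it.
rows = list(zip(*(iter(names),) * 3))
print(rows)

# Caveat: zip stops at the shortest input, so leftover items are dropped.
short = list(zip(*(iter(range(7)),) * 3))
print(short)   # the seventh item does not appear
```

With six names this yields two rows of three, which matches the 2×3 grid. If the number of columns is not a multiple of three, the trailing ones are silently lost, so a check on the length is a sensible precaution.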
The plt.subplots command gives a figure and a 2×3 array of axes.
fig, axes = plt.subplots(2, 3)
for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        # pass in axes to pandas hist
        salaries[pays_arrangement[i][j]].hist(ax=axes[i, j])
        # axis objects have a lot of methods for customizing the look of a plot
        axes[i, j].set_title(pays_arrangement[i][j])
plt.show()
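As an alternative sketch, the nested index loops can be avoided by walking the axes array with axes.flat. The data here is randomly generated stand-in data with made-up column names, and the Agg backend is selected only so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")            # off-screen backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: six columns of random "pay" values.
data = {f"col{i}": rng.gamma(2.0, 20000.0, size=500) for i in range(6)}

fig, axes = plt.subplots(2, 3, figsize=(12, 5))
# axes is a 2x3 NumPy array; axes.flat walks it row by row.
for ax, (name, values) in zip(axes.flat, data.items()):
    ax.hist(values, bins=20)
    ax.set_title(name)

plt.close(fig)   # in a notebook, call plt.show() instead
```

Pairing axes.flat with the column names keeps the loop flat and removes the need to pre-group the names into rows.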
To make the plot more readable, a combination of figure height, width, and subplot spacing could be used.
fig, axes = plt.subplots(2, 3)
# set the figure height and width
fig.set_figheight(5)
fig.set_figwidth(12)
for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        # pass in axes to pandas hist
        salaries[pays_arrangement[i][j]].hist(ax=axes[i, j])
        axes[i, j].set_title(pays_arrangement[i][j])
# vertical gap between the rows, as a fraction of the average axes height
plt.subplots_adjust(hspace=1)
# horizontal gap between the columns, as a fraction of the average axes width
plt.subplots_adjust(wspace=1)
plt.show()
On top of this, the ticks could be rotated.
# and here is a cleaner version using tick rotation and plot spacing
fig, axes = plt.subplots(2, 3)
# set the figure height and width
fig.set_figheight(5)
fig.set_figwidth(12)
for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        salaries[pays_arrangement[i][j]].hist(ax=axes[i, j])
        axes[i, j].set_title(pays_arrangement[i][j])
        # rotate the x tick labels; this avoids calling set_xticklabels
        # without fixed tick positions, which triggers a warning in Matplotlib
        axes[i, j].tick_params(axis="x", labelrotation=30)
plt.subplots_adjust(hspace=1)
plt.subplots_adjust(wspace=1)
plt.show()
Many more plot types are listed in the official documentation of Matplotlib and Seaborn. Another widely used library is Plotly, which makes interactive, browser-friendly plots.
In this blog, we covered some of the data visualization techniques that can be performed using Python. EduGrad has a rich set of courses on Python, data visualization, data science, and so on, which can help build the skills needed in the industry.