Today’s post is about data visualization with pandas and Matplotlib in Python. This will be useful to anyone who wants to learn how to create better graphs and charts for their data visualization endeavors.
My motivation for writing this is that I’ve been using pandas to generate plots for my other posts, like these ones and crypto prices: here and here. However, I usually forget some details and need to spend time looking online for tips. I’m hoping that this article will help others and myself in the future save time when visualizing data.
This post should be straightforward to follow. Some basic knowledge of pandas and Matplotlib will help, but it is not necessary.
Creating Plots With Pandas
You can create plots directly from the pandas interface without touching Matplotlib. Behind the scenes, pandas still uses Matplotlib by default to create those plots. However, you can swap the plotting backend and use Bokeh or Plotly, for instance.
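As a rough sketch of how the backend swap works: pandas exposes it as the option "plotting.backend". The snippet below only reads the option and sets it to its default value, so it runs without extra packages. The commented-out line assumes the plotly package is installed.

```python
import pandas as pd

# The active plotting backend is a pandas option; "matplotlib" is the default.
current_backend = pd.get_option("plotting.backend")
print(current_backend)

# Setting the option explicitly (a no-op here, since it is already the default).
pd.set_option("plotting.backend", "matplotlib")

# Assumption: the plotly package is installed. If it is, this line would make
# DataFrame.plot() return Plotly figures instead of Matplotlib Axes:
# pd.set_option("plotting.backend", "plotly")
```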
For this tutorial, I will use Matplotlib for the backend. I will write more posts on data visualization with the other backends as well in the future. For the data, I will use a CSV file tracking puppy weights over time (I used the same data to make a fun animated bar chart race with matplotlib here), so this is a time-series dataset.
All the code and Jupyter notebooks that I will be sharing here can be found in my Github repository:
Toy Dataset
The “puppy_weights.csv” dataset looks like this:
It is a simple file of weights collected over a period of more than a year, with the values in grams.
Default Pandas Plot
Once you load your data into a pandas dataframe, you can simply call the plot method on that object to generate a chart. Without specifying any parameters, here is what it will look like:
import pandas as pd
data_df = pd.read_csv("puppy_weights.csv").set_index("date",drop=True)
data_df.index = pd.to_datetime(data_df.index) #convert index column into DateTime column
data_df.plot()
and the output will be:
We can see that the image is small and missing things like a title and a y-axis label. In the next sections, we will cover how to modify the plot to improve its look and effectiveness in explaining data.
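As a side note, the loading step can probably be shortened by letting read_csv set and parse the index in one call. Here is a minimal sketch of that idea. The inline CSV text and the name sample_df are made up for illustration, and only mimic the layout of puppy_weights.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for puppy_weights.csv (made-up values, two puppies only).
csv_text = """date,puppy_1,puppy_2
2019-03-12,410,395
2019-03-13,455,430
2019-03-15,520,498
"""

# index_col + parse_dates replace the separate set_index / to_datetime steps.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

print(sample_df.index.dtype)
ax = sample_df.plot()  # same kind of default line plot, one line per column
```

With the real file, you would pass the file name instead of the StringIO object.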
Figure Size, Labels, Title, etc…
The plot function for pandas dataframes accepts a large number of parameters. In fact, any extra keyword arguments that pandas does not use itself are passed on to the underlying Matplotlib plotting function. If you want to learn more, check the documentation for pandas plot function.
Here are a few of the parameters that you might want to use:
puppy = 'puppy_2'
data_df[puppy].plot(
figsize = (15,7),
title = f"Weight for {puppy} Over Time",
ylabel = "weight (g)",
grid = True,
marker = ".",
markersize= 5,
)
Here is the output from the command above:
First of all, I’m selecting only one column this time, “puppy_2”, to keep the example clear. Next, inside the plot function, I’m using the figsize parameter to indicate the width and height, in inches, for the plot. Now the data is much more clearly visible.
Parameters title, ylabel, and grid are pretty self-explanatory. You can also use xlabel if you want to change the default label that pandas places on the x-axis.
The other parameters I’m using are marker and markersize. The former marks each data point in the series with the symbol we selected. I chose a dot “.” and that is why, in the plot, you can see the line filled with dots, one after the other. The latter parameter controls the size of those dots. These two parameters can help make the data clearer, but I would check case by case.
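To illustrate how these parameters combine, here is a small self-contained sketch. The weights are made-up values standing in for one puppy column, and linestyle is just one example of a keyword that gets handed over to Matplotlib.

```python
import pandas as pd

# Hypothetical weights in grams (made-up values standing in for one puppy column).
dates = pd.date_range("2019-03-12", periods=5, freq="D")
weights = pd.Series([400, 450, 510, 560, 630], index=dates, name="puppy_2")

ax = weights.plot(
    figsize=(15, 7),
    title="Weight for puppy_2 Over Time",
    xlabel="date",        # overrides the default x-axis label
    ylabel="weight (g)",
    grid=True,
    marker="o",           # symbol drawn at each data point
    markersize=4,         # size of that symbol
    linestyle="--",       # extra keyword, handed over to Matplotlib
)
```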
Matplotlib’s Axes Object
Matplotlib plots have a hierarchy to them. First, there is a figure, and then we have axes objects. One Axes object is a single plot, and Figure is the outermost structure. Therefore, we can have several Axes objects in one figure, if we wanted to create subplots. You can check my other post on more advanced plots using Axes objects to build subplots.
This article on Matplotlib can probably explain it much better than I can, so check it out if you want to understand the low-level details of how it works.
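Here is a minimal sketch of that hierarchy using plain Matplotlib, with made-up numbers: one Figure that holds two Axes objects side by side.

```python
import matplotlib.pyplot as plt

# One Figure holding a 1x2 grid of Axes (two side-by-side plots).
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

axes[0].plot([1, 2, 3], [400, 450, 510])
axes[0].set_title("left Axes")
axes[1].plot([1, 2, 3], [395, 430, 498])
axes[1].set_title("right Axes")

# The Figure keeps track of its Axes, and every Axes knows its Figure.
print(len(fig.axes))
print(axes[0].figure is fig)
```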
Pandas And the Axes Object
The plot function from pandas will return an Axes object when called. However, if we are using a different backend, such as Plotly or Bokeh, it will return whatever those backends return.
Plot will also accept an Axes object as a parameter if we already created one before calling the function.
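A short sketch of that second case, assuming the default Matplotlib backend. The dataframe sample_df contains made-up values and is only there to make the snippet runnable.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical weights in grams (made-up values).
dates = pd.date_range("2019-03-12", periods=4, freq="D")
sample_df = pd.DataFrame(
    {"puppy_1": [410, 455, 500, 540], "puppy_2": [395, 430, 470, 515]},
    index=dates,
)

# Create the Figure and Axes first, then hand the Axes to pandas.
fig, ax = plt.subplots(figsize=(10, 5))
returned_ax = sample_df.plot(ax=ax, title="Drawn on an existing Axes")

# pandas draws on the Axes we passed in and returns that same object.
print(returned_ax is ax)
```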
Here is an example of how we can add more things to the plot from the Axes interface:
ax = data_df.plot(
figsize = (15,10),
title = "Puppy Weights Over Time",
ylabel = "weight (g)",
grid = True,
#marker = ".",
#markersize= 5,
)
#adding some horizontal lines
xmin='2019-05-01'
xmax='2020-03-01'
yval=20000
shift=4000
ax.hlines(y=yval,
xmin=xmin,
xmax=xmax,
linestyle="--",
color='black',
label=f"Partial dashed line at {yval} g")
ax.hlines(y=yval+shift,
xmin=data_df.index[0],
xmax=data_df.index[-1],
linestyle="dotted",
color='red',
label=f"Full dotted line at {yval+shift} g",)
#adding a vertical line
ymin=0
ymax=data_df.max().max() #first max will return a series, so second max gets an individual value
xval='2020-03-12' #puppies' first birthday
ax.vlines(x=xval, #reuse the date defined above
ymin=ymin,
ymax=ymax,
linestyle="-.",
label=f"Puppies' Birthday on {xval}",)
#show legend
ax.legend()
which will produce the following chart:
The code looks a lot longer because I broke the lines to make it fit on the screen. The main addition to the previous example would be the manipulation of the Axes object, ax, directly. We can create horizontal lines with ax.hlines, and vertical lines with ax.vlines.
As you can see in the example, these vertical and horizontal lines can be used to add extra information to the graph. In this case, I used a vertical line to mark the birthday date of the puppies.
For the horizontal lines, I’m creating one that spans just a portion of the data. You can create small segments of lines that way.
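If you only need lines that span the whole plot, Matplotlib also has ax.axhline and ax.axvline, which do not need start and end coordinates. This is an alternative I'm adding for comparison, not what the example above uses. The sketch below uses made-up weights.

```python
import pandas as pd

# Hypothetical weights in grams (made-up values).
dates = pd.date_range("2019-03-12", periods=6, freq="D")
weights = pd.Series([400, 450, 510, 560, 630, 700], index=dates, name="puppy_1")

ax = weights.plot(ylabel="weight (g)")

# axhline takes a y value in data units and spans the full width of the Axes.
ax.axhline(y=500, linestyle="--", color="black", label="500 g")

# axvline is the vertical counterpart; here it marks the fourth date.
ax.axvline(x=dates[3], linestyle="-.", color="red", label="day 4")

ax.legend()
```

Unlike hlines and vlines, these lines keep spanning the full Axes even if the axis limits change later.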
Interpolate Data
This extra section is not necessary for the plot, but it will make it look better. Pandas dataframes have a function interpolate which will fill in missing values in our data. Since there are several missing dates in our puppy_weights.csv file, we can use this method.
Keep in mind that this adds artificial data to the dataset, so decide if it is necessary based on the dataset. Here is what it will look like for our example dataset:
# Interpolate data (fill in missing values) with linear fill by default
data_df = data_df.interpolate(axis=0)
In that case, running the code for the last plot again will yield the following plot:
Puppies 6 through 10 look weird because their data stopped pretty early (they were adopted young). As a result, the interpolation method simply repeated their last known value, resulting in those horizontal lines at the bottom. Check the documentation for interpolate to see other methods available.
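One option that may help with those flat lines is the limit_area parameter of interpolate. Here is a minimal sketch on made-up data, where the hypothetical column puppy_6 stops being measured early:

```python
import numpy as np
import pandas as pd

# Hypothetical weights in grams (made-up values); puppy_6 stops being measured early.
dates = pd.date_range("2019-03-12", periods=5, freq="D")
sample_df = pd.DataFrame(
    {
        "puppy_1": [400.0, np.nan, 500.0, np.nan, 600.0],
        "puppy_6": [300.0, np.nan, 340.0, np.nan, np.nan],
    },
    index=dates,
)

# Default: gaps are filled linearly, and trailing NaNs repeat the last known value.
filled_all = sample_df.interpolate(axis=0)

# limit_area="inside" only fills gaps that have a valid value on both sides,
# so the trailing NaNs of puppy_6 stay empty and the line simply ends.
filled_inside = sample_df.interpolate(axis=0, limit_area="inside")

print(filled_all["puppy_6"].tolist())
print(filled_inside["puppy_6"].tolist())
```

Since Matplotlib does not draw NaN values, the second version would end each line at the last real measurement.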
Other Types of Plots
So far, we have been using the default plot type from pandas, “line”. However, with the parameter kind, we can specify other chart types. Some of the options available are:
"line" #default type
"bar" #vertical bar plot
"barh" #horizontal bar plot
"pie" #pie chart
"scatter" #scatter plot
"hist" #histogram
"box" #boxplot
"kde" or "density" #kernel density estimation
You can experiment with the different options by looking at the documentation.
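As a quick sketch of the kind parameter, here is a box plot on made-up weights. Pandas also offers an accessor form, plot.box, which should give the same chart.

```python
import pandas as pd

# Hypothetical weights in grams (made-up values).
sample_df = pd.DataFrame(
    {"puppy_1": [400, 455, 500, 540, 610], "puppy_2": [395, 430, 470, 515, 580]}
)

# Two ways to request the same chart type: the kind parameter...
ax_kind = sample_df.plot(kind="box", title="Weight distribution")

# ...or the matching method on the plot accessor.
ax_accessor = sample_df.plot.box(title="Weight distribution")
```

Each call creates its own figure, so the two Axes objects are separate.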
Bar Plots
In this section, I will show a bar plot. One simple example is to plot the last recorded weights for the puppies. At the same time, I will exclude puppies 6-10 and just keep the first 5:
# create vertical bar plot
_puppy_names = ['puppy_1','puppy_2','puppy_3','puppy_4','puppy_5']
puppies_df = data_df[_puppy_names] #get only a subsection of the puppies
and the bar plot will be produced with:
ax = puppies_df.iloc[-1].plot(
kind = 'bar',
figsize = (15,10),
title = "Puppies' Final Weight",
ylabel = "weight (g)",
#grid = True,
)
ax.grid(True, which='both', axis='y') #adding only horizontal grid lines
The kind = 'bar' argument tells pandas to generate a bar plot. In addition, the ax.grid call after the plot adds only horizontal grid lines.
The plot looks decent, but there are a few things we can add to improve the visualization. For example, we can add labels at the top of each bar so we know the exact values. Also, I prefer the x-axis labels to have a slight rotation. So, here are some modifications to the code to accomplish that:
# bar plot again, but this time with annotated bars, and category labels rotated
import matplotlib.pyplot as plt #needed for plt.xticks below
ax = puppies_df.iloc[-1].plot(
kind = 'bar',
figsize = (15,10),
title = "Puppies' Final Weight",
ylabel = "weight (g)",
)
#-- add horizontal grid lines
ax.grid(True, which='both', axis='y')
#-- rotate labels in x-axis
plt.xticks(rotation=45)
#-- Annotate each bar in the chart (found this piece of code on StackOverflow)
for bar in ax.patches:
    value: str = f"{bar.get_height() / 1000:.2f} kg" #convert grams into kilograms for display
    #-- coordinates where to put the text
    xcoord = bar.get_x() + bar.get_width() / 2 #find middle of the bar
    ycoord = bar.get_height()
    ax.annotate(
        text = value,
        xy = (xcoord, ycoord),
        ha = 'center',
        va = 'center',
        textcoords = 'offset points',
        xytext = (0, 10), #offset from the xy point provided above
        #rotation = 25, #use this to rotate labels by given angle
    )
and the plot that comes out is:
In case you want to do fancier annotations on the plot, the documentation for ax.annotate is here.
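For simple labels like the ones above, newer Matplotlib versions also have a shortcut, ax.bar_label. The sketch below assumes Matplotlib 3.4 or later and uses made-up final weights. It is an alternative to the annotate loop, not the code I used for the chart.

```python
import pandas as pd

# Hypothetical final weights in grams (made-up values).
final_weights = pd.Series(
    [21500, 24300, 19800], index=["puppy_1", "puppy_2", "puppy_3"]
)

ax = final_weights.plot(kind="bar", title="Puppies' Final Weight", ylabel="weight (g)")

# Assumption: Matplotlib 3.4 or newer, where Axes.bar_label is available.
bars = ax.containers[0]  # the BarContainer created by the bar plot
labels = [f"{grams / 1000:.2f} kg" for grams in final_weights]  # grams -> kilograms
texts = ax.bar_label(bars, labels=labels, padding=3)  # padding is in points
```

The annotate loop is still the better choice when you need full control over where each label goes.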
Final Thoughts
Pandas is a powerful tool for working with data, and that includes data visualization. You can take the examples in this post and adapt them to your own datasets. The Github repository is here. Don’t forget that there is a lot that I did not cover here. Check the documentation links I shared and experiment a little.
In the future, I will write posts on using other backends with pandas, like Plotly, Bokeh, and HoloViews. I will also write a follow-up to this post to show how to create more complex graphs using subplots.
If you found this post interesting and informative, you can consider joining the mailing list to stay up to date with my new posts. Also, leave a comment if you have any suggestions or questions.