Today we will explore how to build more complex plots using pandas and matplotlib. Last week, I wrote a post covering common plots that can be created with these libraries. However, I did not get a chance to go deeper into them. Therefore, you can consider this as a follow-up to that post.
First, I’ll be introducing a very convenient Python library to download global economic data from the World Bank.
I’ll also be covering how to format the tick labels for each axis, and the titles and labels for the plot. On top of that, I’m showing how you can use subplots to join data series that share a common axis, for example, stock prices and trading volume.
You can find the link to the code at the end of the article.
Dataset For This Post
For this post, I decided to get some data about different countries’ economies and populations. The World Bank has several databases to choose from. Now, you could download the data manually, but I prefer a more programmatic way of doing it. Luckily, I found a Python package that implements the World Bank API to make downloading the data easier.
Expect a full blog post in the future covering this library in more detail. I think it is very useful for data acquisition.
The World Bank has a lot of data on every country, all the way back to 1960. To keep this post manageable, I will only focus on the current top 10 world economies by total GDP, according to this Investopedia article.
Those countries are:
- United States
- China
- Japan
- Germany
- United Kingdom
- India
- France
- Italy
- Canada
- South Korea
Downloading the Data
The download process is relatively simple. The World Bank uses code names for different types of data. I found the ones I was interested in by searching on Google. Here are some of the codes:
_world_bank_codes = [
'NY.GDP.MKTP.CD', #GDP total
'NY.GDP.MKTP.KD.ZG', #GDP growth, annual
'SP.POP.TOTL', #total population
'EN.ATM.CO2E.KT', #CO2 emissions
'NY.GDP.PCAP.CD', #GDP per capita
'NY.GDP.MKTP.PP.CD' #GDP, PPP
]
info_series = wb.series.info(_world_bank_codes) #print code meanings
The contents of info_series are:
To actually download the data, we can call the wb.data.DataFrame function:
_wanted_series = [
series_mapper['gdp'],
series_mapper['gdp_ppp'],
series_mapper['gdp_growth'],
series_mapper['population'],
]
#-- download the data
full_df = wb.data.DataFrame(
series = _wanted_series,
economy = country_coder.values(),
time = 'all',
skipBlanks = True,
columns = 'series',
#index = 'time',
)
Don’t pay too much attention to the undefined variables and objects in the code above. It’s all in the Jupyter notebook. I just wanted to share how simple it is to fetch the data. The code above will return a multi-index pandas dataframe like this:
For this exercise, I found it easier to work with individual country dataframes, so I had to split them based on their country codes. Additionally, I turned the time column into a DateTime column and set it as the index. Finally, the column names were replaced with more human-readable strings. The resulting dataframe for each country will look like this:
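The splitting step described above can be sketched roughly as follows. This is not the notebook's code: the contents of country_coder and series_mapper, the helper name split_by_country, and the numbers in the stand-in dataframe are all assumptions made for illustration. It also assumes the downloaded frame is indexed by (economy, time) with time labels such as 'YR1960'.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's lookup tables (assumed shape).
country_coder = {"United States": "USA", "China": "CHN"}
series_mapper = {"gdp": "NY.GDP.MKTP.CD", "population": "SP.POP.TOTL"}

# Synthetic stand-in for the downloaded frame; the values are made up.
# Assumption: two index levels, (economy, time), with labels like 'YR1960'.
index = pd.MultiIndex.from_product(
    [list(country_coder.values()), ["YR1960", "YR1961"]],
    names=["economy", "time"],
)
full_df = pd.DataFrame(
    {
        series_mapper["gdp"]: [1.0, 2.0, 3.0, 4.0],
        series_mapper["population"]: [10.0, 11.0, 20.0, 21.0],
    },
    index=index,
)

# Invert the mapper so World Bank codes become readable column names.
readable_names = {code: name for name, code in series_mapper.items()}


def split_by_country(df, column_names):
    """Return {country code: dataframe indexed by date} (hypothetical helper)."""
    dfs = {}
    for code, country_df in df.groupby(level="economy"):
        country_df = country_df.droplevel("economy")
        # Turn 'YR1960' into a real timestamp (January 1st of that year).
        years = country_df.index.str.replace("YR", "", regex=False)
        country_df.index = pd.to_datetime(years, format="%Y")
        country_df.index.name = "time"
        dfs[code] = country_df.rename(columns=column_names)
    return dfs


dfs_dict = split_by_country(full_df, readable_names)
```

After this, dfs_dict['USA'] is a regular dataframe with a DatetimeIndex and the columns gdp and population, which is the shape the plotting code in the rest of the post expects.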
Pandas Subplots
We can tell pandas to create a separate plot for each series (column) in the dataframe by passing the subplots boolean parameter to the plot function:
country = 'USA'
df = dfs_dict[country]
df.plot(
subplots=True,
figsize=(15,8),
title = f'Data for {country}'
)
In this case, pandas created four subplots because we had four data series: GDP, PPP GDP, GDP growth, and population. By default, they all share the x-axis, which is what we want here since they all share the same years. However, we can see that the plots are missing their y-labels. This makes understanding the data more difficult. Not only that, but the tick values in the y-axis are not very helpful.
In my last post, we saw that pandas.DataFrame.plot returns a matplotlib axes object by default. When we use the subplots parameter, pandas instead returns a NumPy array of axes objects, one per subplot.
We can capture those axes objects and add labels and customizations to each, individually.
axes_list = df.plot(subplots=True,
figsize =(15,13),
title = f'Data for {country}',
grid = True,
)
#-- Define helper labels and variables
one_trillion = 1_000_000_000_000
one_million = 1_000_000
ylabels = ['trillions ($)', '% change', 'trillions ($)', 'people (in millions)']
for i, ax in enumerate(axes_list):
    ax.set_ylabel(ylabels[i])
#-- format y tickers manually (it could be done in loop, but I kept getting some problems)
axes_list[0].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x/one_trillion:,.0f}"))
axes_list[1].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x:.2f}%"))
axes_list[2].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x/one_trillion:,.0f}"))
axes_list[3].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x/one_million:,.0f}"))
We could also set a title for each subplot, but I think I will leave it there for now. The interesting part is that we can customize ticks, labels, and even font sizes for each part of the plot. The Jupyter notebook contains all the setup code needed, so I’m not sharing those snippets here.
For me, the main formatting I wanted to accomplish in the plots above was to have numbers that scaled correctly on the y-axis. That’s accomplished by the last few lines in the code, using the yaxis.set_major_formatter method. I defined a few simple lambda functions that return the tick values in the correct format depending on the data. Take a look at those lines if you are having trouble with your formatting. Additionally, here is the matplotlib documentation for several of the options you have in formatting ticks.
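The code comment above mentions that setting the formatters in a loop caused problems. I can't say what went wrong there, but one common pitfall is that lambdas created inside a loop all see the last value of the loop variable (late binding). A small factory function avoids that. The sketch below uses made-up data, and make_scaled_formatter is a hypothetical helper name, not something from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter


def make_scaled_formatter(scale, suffix=""):
    """Build a tick formatter that divides by `scale` and appends `suffix`."""
    def _format(x, pos):
        return f"{x / scale:,.0f}{suffix}"
    return FuncFormatter(_format)


# Made-up numbers, only to have something to plot.
demo_df = pd.DataFrame(
    {
        "gdp": [1.0e13, 2.0e13],
        "gdp_growth": [2.0, 3.0],
        "population": [3.0e8, 3.3e8],
    },
    index=pd.to_datetime(["2000", "2020"], format="%Y"),
)

# With subplots=True, pandas hands back one axes object per column.
demo_axes = demo_df.plot(subplots=True)

# One (scale, suffix, y-label) entry per subplot, in column order.
settings = [
    (1_000_000_000_000, "", "trillions ($)"),
    (1, "%", "% change"),
    (1_000_000, "", "people (in millions)"),
]
for ax, (scale, suffix, ylabel) in zip(demo_axes, settings):
    ax.set_ylabel(ylabel)
    ax.yaxis.set_major_formatter(make_scaled_formatter(scale, suffix))
```

Each call to make_scaled_formatter captures its own scale and suffix, so every subplot keeps the format it was given.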
Plots With Shared Axes
There are many times when we want to plot two or more pieces of data that have an axis in common. For example, financial data, such as stock or cryptocurrency prices, is often plotted together with the trading volume. Here is an example:
In the example above, there are two data series: the stock price, and volume. Stock price is given in dollars, while volume is the number of shares traded. As such, the y-axis of each series is independent of the other. However, they share their x-axis since it is time.
Now we will see a similar example, but using data from the World Bank. We are going to plot the GDP of the top 10 current economies, as well as their GDP growth percentage, from 1960 to 2020.
Here is the code:
#-- Create and adjust new figure
fig = plt.figure(
constrained_layout=True,
figsize=(18,11)
)
set_dpi(120)
set_font_size(15)
#-- use the mosaic layout for custom sizes
axs = fig.subplot_mosaic(
[['top'],['bottom']], #layout and axes handles.
gridspec_kw={'height_ratios':[2, 1]}, #ratio between top plot and bottom plot
#gridspec_kw={'width_ratios':[2, 1]}, #use for plots from left to right
sharex=True,
)
#-- plot the data
_top_series_to_plot = 'gdp'
_bottom_series_to_plot = 'gdp_growth'
for country in country_coder:
    _country_code: str = country_coder[country]
    df = dfs_dict[_country_code]
    #-- top part
    axs['top'].plot(df[_top_series_to_plot], label=_country_code)
    #-- bottom part
    axs['bottom'].plot(df[_bottom_series_to_plot], label=_country_code)
#-- set axis labels
axs['top'].set_ylabel('GDP (in trillions $)')
axs['bottom'].set_ylabel("% change")
#-- set grid lines
axs['top'].grid()
axs['bottom'].grid()
#-- set subplot titles
axs['top'].set_title('GDP (Top 10 Economies)')
axs['bottom'].set_title('GDP Growth (%)')
#-- set subplot legends
axs['top'].legend()
axs['bottom'].legend()
#-- format tick labels
axs['top'].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x/one_trillion:,.0f}"))
axs['bottom'].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x,pos: f"{x:,.0f}%"))
and the output will be:
In this case, I’m creating the plot through the matplotlib API because I find it easier to customize.
There are a few ways to create subplots like the one above. In this case, I’m using a mosaic layout (the fig.subplot_mosaic call) to tell matplotlib the names and layout of each subplot. I also set the height ratio to 2:1. That means that the top graph will be twice as tall as the bottom one.
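If you want to try the mosaic layout on its own, here is a stripped-down sketch with made-up numbers. It keeps only the layout part: two named subplots stacked vertically, a 2:1 height ratio, and a shared x-axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
axs = fig.subplot_mosaic(
    [["top"], ["bottom"]],                  # one column, two named rows
    gridspec_kw={"height_ratios": [2, 1]},  # top is twice as tall as bottom
    sharex=True,
)

# Made-up data, only to show that both subplots use the same x values.
years = [2000, 2010, 2020]
axs["top"].plot(years, [1.0, 2.0, 3.0], label="demo")
axs["bottom"].plot(years, [5.0, -1.0, 2.0], label="demo")
axs["top"].set_ylabel("level")
axs["bottom"].set_ylabel("% change")
```

subplot_mosaic returns a dictionary keyed by the names used in the layout, which is why the subplots are addressed as axs['top'] and axs['bottom'] instead of by position.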
The rest of the code is simply formatting labels and ticks, as in the previous example.
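The snippet also calls two helpers, set_dpi and set_font_size, that are defined in the notebook and not shown here. I don't know how they are implemented there. One plausible guess, sketched below as an assumption, is that they simply set matplotlib's global rcParams.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def set_dpi(dpi):
    """Guess at the notebook helper: default resolution for new figures."""
    plt.rcParams["figure.dpi"] = dpi


def set_font_size(size):
    """Guess at the notebook helper: default font size for plot text."""
    plt.rcParams["font.size"] = size


set_dpi(120)
set_font_size(15)
```

Keep in mind that rcParams only affect figures and text created afterwards, so with this version the helpers would need to be called before plt.figure.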
Suggestions For Plot Customization
For me, the trickiest part of plotting with matplotlib is getting the tick labels to look right. In this post, you can find several examples where the ticks are being modified. Additionally, I would try to remember the basic methods such as set_xlabel, set_ylabel, set_title, etc.
If you don’t need to apply too many customizations, the pandas plot interface will probably give you enough options. I would explore that API in more detail, because it can simplify the process of generating graphs.
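As a rough illustration of how far the pandas interface goes on its own, the sketch below sets a title, axis labels, grid, and figure size in a single plot call. The data is made up, and the xlabel and ylabel arguments assume a reasonably recent pandas version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Made-up numbers, only to have something to plot.
demo_df = pd.DataFrame(
    {"gdp_growth": [2.5, -1.0, 3.0]},
    index=pd.to_datetime(["2018", "2019", "2020"], format="%Y"),
)

ax = demo_df.plot(
    title="GDP growth (made-up data)",
    xlabel="year",
    ylabel="% change",
    grid=True,
    legend=False,
    figsize=(8, 4),
)
```

The call still returns a matplotlib axes object, so you can switch to methods like set_title or yaxis.set_major_formatter whenever the keyword arguments are not enough.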
You can copy the code in this post and adapt it to your needs. Also, take a look at the Jupyter notebook here.
Final Thoughts
With today’s post, you should feel comfortable creating slightly more complex plots with pandas and matplotlib. I tried making this post interesting and useful because I tend to forget matplotlib stuff very quickly. That’s why I’m writing these posts, so I can go back later and re-learn. Once again, you can find the rest of the code here.
If you found it useful and want to hear more when I post something new, consider subscribing to the newsletter. It would mean a lot. Also, if you have something to say or suggest, let me know in the comments.