Bivariate plotting with pandas¶


Scatter Plot	Hex Plot	Stacked Bar Chart	Bivariate Line Chart
df.plot.scatter()	df.plot.hexbin()	df.plot.bar(stacked=True)	df.plot.line()
Good for interval and some nominal categorical data.	Good for interval and some nominal categorical data.	Good for nominal and ordinal categorical data.	Good for ordinal categorical and interval data.

In the previous notebook, we explored using pandas to plot and understand relationships within a single column. In this notebook, we'll expand this view by looking at plots that consider two variables at a time.

Data without relationships between variables is the data science equivalent of a blank canvas. To paint the picture in, we need to understand how variables interact with one another. Does an increase in one variable correlate with an increase in another? Does it relate to a decrease somewhere else? The best way to paint the picture in is by using plots that enable these possibilities.

import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head()

Scatter plot¶

The simplest bivariate plot is the lowly scatter plot. A simple scatter plot simply maps each variable of interest to a point in two-dimensional space. This is the result:

reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49d99b38>

This plot shows us that price and points are weakly correlated: that is, that more expensive wines do generally earn more points when reviewed.

Note that in order to make effective use of this plot, we had to downsample our data, taking just 100 points from the full set. This is because naive scatter plots do not effectively treat points which map to the same place. For example, if two wines, both costing 100 dollars, get a rating of 90, then the second one is overplotted onto the first one, and we add just one point to the plot.

This isn't a problem if it happens just a few times. But with enough points the distribution starts to look like a shapeless blob, and you lose the forest for the trees:

reviews[reviews['price'] < 100].plot.scatter(x='price', y='points')

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49ce7f98>

Because of their weakness to overplotting, scatter plots work best with relatively small datasets, and with variables which have a large number of unique values.

There are a few ways to deal with overplotting. We've already demonstrated one way: sampling the points. Another interesting way to do this that's built right into pandas is to use our next plot type, a hexplot.

Hexplot¶

A hex plot aggregates points in space into hexagons, and then colors those hexagons based on the values within them:

reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49cd1e10>

(note: the x-axis is points, but is missing from the chart due to a bug)

The data in this plot is directly comparable with that in the scatter plot from earlier, but the story it tells us is very different. From this hexplot we can see that the bottles of wine reviewed by Wine Magazine cluster around 87.5 points and around $20.

We did not see this effect by looking at the scatter plot, because too many similarly-priced, similarly-scoring wines were overplotted. By doing away with this problem, this hexplot presents us a much more useful view of the dataset.

Hexplots and scatter plots can by applied to combinations of interval variables and/or ordinal categorical variables.

Stacked plots¶

Scatter plots and hex plots are new. But we can also use the simpler plots we saw in the last notebook.

The easiest way to modify them to support another visual variable is by using stacking. A stacked chart is one which plots the variables one on top of the other.

We'll use a supplemental selection of the five most common wines for this next section.

wine_counts = pd.read_csv("../input/most-common-wine-scores/top-five-wine-score-counts.csv",
                          index_col=0)

wine_counts counts the number of times each of the possible review scores was received by the five most commonly reviewed types of wines:

wine_counts.head()

Many pandas multivariate plots expect input data to be in this format, with one categorical variable in the columns, one categorical variable in the rows, and counts of their intersections in the entries.

Let's now look at some stacked plots. We'll start with the stacked bar chart.

wine_counts.plot.bar(stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49bb33c8>

Stacked bar plots share the strengths and weaknesses of univariate bar charts. They work best for nominal categorical or small ordinal categorical variables.

Another simple example is the area plot, which lends itself very naturally to this form of manipulation:

wine_counts.plot.area()

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49ae97f0>

Like single-variable area charts, multivariate area charts are meant for nominal categorical or interval variables.

Stacked plots are visually very pretty. However, they have two major limitations.

The first limitation is that the second variable in a stacked plot must be a variable with a very limited number of possible values (probably an ordinal categorical, as here). Five different types of wine is a good number because it keeps the result interpretable; eight is sometimes mentioned as a suggested upper bound. Many dataset fields will not fit this critereon naturally, so you have to "make do", as here, by selecting a group of interest.

The second limitation is one of interpretability. As easy as they are to make, and as pretty as they look, stacked plots make it really hard to distinguish concrete values. For example, looking at the plots above, can you tell which wine got a score of 87 more often: Red Blends (in purple), Pinot Noir (in red), or Chardonnay (in green)? It's actually really hard to tell!

Bivariate line chart¶

One plot type we've seen already that remains highly effective when made bivariate is the line chart. Because the line in this chart takes up so little visual space, it's really easy and effective to overplot multiple lines on the same chart.

wine_counts.plot.line()

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49ce7da0>

Using a line chart this way makes inroads against the second limitation of stacked plotting: interpretability. Bivariate line charts are much more interpretable because the lines themselves don't take up much space. Their values remain readable when we place multiple lines side-by-side, as here.

For example, in this chart we can easily answer our question from the previous example: which wine most commonly scores an 87. We can see here that the Chardonnay, in green, narrowly beats out the Pinot Noir, in red.

Exercises¶

In this section of the tutorial we introduced and explored some common bivariate plot types:

Scatter plots
Hex plots
Stacked bar charts and area charts
Bivariate line charts

Let's now put what we've learned to the test!

To start off, try answering the following questions:

A scatter plot or hex plot is good for what two types of data?
What type of data makes sense to show in a stacked bar chart, but not in a bivariate line chart?
What type of data makes sense to show in a bivariate line chart, but not in a stacked bar chart?
Suppose we create a scatter plot but find that due to the large number of points it's hard to interpret. What are two things we can do to fix this issue?

To see the answers, click the "Output" button on the cell below.

from IPython.display import HTML
HTML("""
<ol>
<li>Scatter plots and hex plots work best with a mixture of ordinal categorical and interval data.</li>
<br/>
<li>Nominal categorical data makes sense in a stacked bar chart, but not in a bivariate line chart.</li>
<br/>
<li>Interval data makes sense in a bivariate line chart, but not in a stacked bar chart.</li>
<br/>
<li>One way to fix this issue would be to sample the points. Another way to fix it would be to use a hex plot.</li>
</ol>
""")

Next, let's replicate some plots. Recall the Pokemon dataset from earlier:

pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head()

For the exercises that follow, try forking this notebook and replicating the plots that follow. To see the answers, hit the "Input" button below to un-hide the code.

pokemon.plot.scatter(x='Attack', y='Defense')

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b49908518>

pokemon.plot.hexbin(x='Attack', y='Defense', gridsize=15)

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b498ba160>

For thee next plot, use the following data:

pokemon_stats_legendary = pokemon.groupby(['Legendary', 'Generation']).mean()[['Attack', 'Defense']]

pokemon_stats_legendary.plot.bar(stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b488bab00>

For the next plot, use the following data:

pokemon_stats_by_generation = pokemon.groupby('Generation').mean()[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

pokemon_stats_by_generation.plot.line()

<matplotlib.axes._subplots.AxesSubplot at 0x7f1b487e0588>

Conclusion¶

In this section we introduced and explored some common bivariate plot types:

Scatter plots
Hex plots
Stacked bar charts and area charts
Bivariate line charts

In the next section we will move on to exploring another plotting library, seaborn, which compliments pandas with many more advanced data visualization tools for you to use.

Click here to move on to the next section, "Plotting with seaborn".

	country	description	designation	points	price	province	region_1	region_2	variety	winery
0	US	This tremendous 100% varietal wine hails from ...	Martha's Vineyard	96	235.0	California	Napa Valley	Napa	Cabernet Sauvignon	Heitz
1	Spain	Ripe aromas of fig, blackberry and cassis are ...	Carodorum Selección Especial Reserva	96	110.0	Northern Spain	Toro	NaN	Tinta de Toro	Bodega Carmen Rodríguez
2	US	Mac Watson honors the memory of a wine once ma...	Special Selected Late Harvest	96	90.0	California	Knights Valley	Sonoma	Sauvignon Blanc	Macauley
3	US	This spent 20 months in 30% new French oak, an...	Reserve	96	65.0	Oregon	Willamette Valley	Willamette Valley	Pinot Noir	Ponzi
4	France	This is the top wine from La Bégude, named aft...	La Brûlade	95	66.0	Provence	Bandol	NaN	Provence red blend	Domaine de la Bégude

	Bordeaux-style Red Blend	Cabernet Sauvignon	Chardonnay	Pinot Noir	Red Blend
points
80	5.0	87.0	68.0	36.0	72.0
81	18.0	159.0	150.0	83.0	107.0
82	72.0	435.0	517.0	295.0	223.0
83	95.0	570.0	669.0	346.0	364.0
84	268.0	923.0	1146.0	733.0	602.0

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
#
1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False