| Facet Grid | Pair Plot |
| --- | --- |
| sns.FacetGrid() | sns.pairplot() |
| Good for data with at least two categorical variables. | Good for exploring most kinds of data. |
So far in this tutorial we've been plotting data in one (univariate) or two (bivariate) dimensions, and we've learned how plotting in seaborn works. In this section we'll dive deeper into seaborn by exploring faceting.
Faceting is the act of breaking data variables up across multiple subplots, and combining those subplots into a single figure. So instead of one bar chart, we might have, say, four, arranged together in a grid.
In this notebook we'll put this technique into action, and see why it's so useful.
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv("../input/fifa-18-demo-player-dataset/CompleteDataset.csv", index_col=0)
import re
import numpy as np
footballers = df.copy()
footballers['Unit'] = df['Value'].str[-1]
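# Strip the currency symbol and unit suffix; values with no unit (ending in '0') become 0.
footballers['Value (M)'] = np.where(footballers['Unit'] == '0', 0,
                                    footballers['Value'].str[1:-1].str.replace(r'[a-zA-Z]', '', regex=True))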
footballers['Value (M)'] = footballers['Value (M)'].astype(float)
footballers['Value (M)'] = np.where(footballers['Unit'] == 'M',
                                    footballers['Value (M)'],
                                    footballers['Value (M)']/1000)
footballers = footballers.assign(Value=footballers['Value (M)'],
                                 Position=footballers['Preferred Positions'].str.split().str[0])
(Note: the first code cell above contains some data pre-processing. This is extraneous, and so I've hidden it by default.)
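To make the pre-processing less opaque, here is a minimal sketch of the same idea on three made-up value strings. It assumes the raw format is a currency symbol, a number, and an optional M or K suffix (for example €95.5M), which is what the slicing in the cell above implies. The sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample values in the assumed "<symbol><number><M|K>" format.
raw = pd.Series(["€95.5M", "€500K", "€0"])

unit = raw.str[-1]  # last character: 'M', 'K', or '0'

# Drop the currency symbol and the suffix. "€0" leaves an empty string,
# which is coerced to NaN and then filled with 0.
number = pd.to_numeric(raw.str[1:-1], errors="coerce").fillna(0.0)

# Express everything in millions: anything not marked 'M' is divided by 1000.
value_m = np.where(unit == "M", number, number / 1000)

print(value_m.tolist())
```

Using pd.to_numeric here is a swapped-in simplification rather than the notebook's own approach, but the resulting unit conversion is the same.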
footballers.head()
import seaborn as sns
The core seaborn utility for faceting is the FacetGrid. A FacetGrid is an object which stores some information on how you want to break up your data visualization.
For example, suppose that we're interested in (as in the previous notebook) comparing strikers and goalkeepers in some way. To do this, we can create a FacetGrid with our data, telling it that we want to break the Position variable down by col (column).
Since we're zeroing in on just two positions in particular, this results in a pair of empty subplots, ready for us to "do" something with them:
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col="Position")
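As a rough check on what the FacetGrid holds at this point, the sketch below builds one from a small invented table (standing in for the footballers data) and inspects its axes attribute. Nothing has been drawn yet: the grid only records the layout.

```python
import pandas as pd
import seaborn as sns

# Invented toy data standing in for the footballers table.
toy = pd.DataFrame({
    "Position": ["GK", "GK", "ST", "ST", "ST"],
    "Overall": [70, 80, 60, 75, 90],
})

g = sns.FacetGrid(toy, col="Position")

# g.axes is a 2D array of matplotlib Axes: (rows, columns).
print(g.axes.shape)

# Each subplot exists but contains no drawn lines yet.
print([len(ax.lines) for ax in g.axes.flat])
```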
From there, we use the FacetGrid's map method to plot the data into the laid-out grid.
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col="Position")
g.map(sns.kdeplot, "Overall")
Passing a function into a method like this may take some getting used to if this is your first time seeing it done. But once you get used to it, FacetGrid is very easy to use.
By using an object to gather "design criteria", seaborn does an effective job seamlessly marrying the data representation to the data values, sparing us the need to lay the plot out ourselves.
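For comparison, here is a sketch of roughly what laying the plot out ourselves would involve, using plain matplotlib on invented data. It uses histograms rather than KDE curves to stay dependency-light, so it is an approximation of the idea rather than a reproduction of the figure above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented toy data standing in for the footballers table.
toy = pd.DataFrame({
    "Position": ["GK", "GK", "GK", "ST", "ST"],
    "Overall": [70, 80, 90, 60, 80],
})

positions = list(toy["Position"].unique())

# One row of subplots, one column per position, sharing both axes.
fig, axes = plt.subplots(1, len(positions), sharex=True, sharey=True,
                         figsize=(3 * len(positions), 3), squeeze=False)

for ax, position in zip(axes[0], positions):
    subset = toy.loc[toy["Position"] == position, "Overall"]
    ax.hist(subset, bins=5)
    ax.set_title(f"Position = {position}")
    ax.set_xlabel("Overall")

manual_titles = [ax.get_title() for ax in axes[0]]
print(manual_titles)
```

The subsetting loop, the shared axes, and the titles are the bookkeeping that FacetGrid takes care of.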
We're probably interested in more than just goalkeepers and strikers, however. But if we squeezed all of the possible game positions into one row, the resulting plots would be tiny. FacetGrid comes equipped with a col_wrap parameter for dealing with this case exactly.
df = footballers
g = sns.FacetGrid(df, col="Position", col_wrap=6)
g.map(sns.kdeplot, "Overall")
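The arithmetic behind col_wrap is simple: the panels fill each row up to col_wrap columns, then wrap onto a new row. The sketch below checks this on eight invented category labels. The labels and values are made up, and the row count is computed by hand rather than read from seaborn.

```python
import math

import pandas as pd
import seaborn as sns

# Eight invented category labels, three rows of data each.
labels = [f"P{i}" for i in range(8)]
toy = pd.DataFrame({
    "Position": [label for label in labels for _ in range(3)],
    "Overall": list(range(24)),
})

col_wrap = 3
g = sns.FacetGrid(toy, col="Position", col_wrap=col_wrap)

n_panels = len(g.axes)                    # with col_wrap, g.axes is a flat array
n_rows = math.ceil(n_panels / col_wrap)   # rows needed to hold every panel
print(n_panels, n_rows)
```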
So far we've been laying our facets out along a single dimension, col (column). The "grid" in FacetGrid, however, refers to the ability to lay data out by row and column.
For example, suppose we're interested in comparing the talent distribution across rival clubs Real Madrid, Atlético Madrid, and FC Barcelona (for goalkeepers and strikers specifically, to keep things succinct).
As the plot below demonstrates, we can achieve this by passing row="Position" and col="Club" to FacetGrid.
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
df = df[df['Club'].isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]
g = sns.FacetGrid(df, row="Position", col="Club")
g.map(sns.violinplot, "Overall")
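Each cell of a row-by-column grid shows only the rows of data matching that particular (Position, Club) pair, so some cells may have very little data, or none. A quick cross-tabulation, sketched here on invented data with hypothetical club names, shows how many observations would feed each panel.

```python
import pandas as pd

# Invented toy data; the club names are placeholders.
toy = pd.DataFrame({
    "Position": ["GK", "ST", "ST", "GK", "ST", "GK"],
    "Club": ["Club A", "Club A", "Club A", "Club B", "Club B", "Club C"],
    "Overall": [82, 85, 79, 88, 91, 80],
})

# Rows of the table correspond to grid rows, columns to grid columns.
counts = pd.crosstab(toy["Position"], toy["Club"])
print(counts)

# Facets that would be drawn empty.
empty_panels = [(pos, club)
                for pos in counts.index
                for club in counts.columns
                if counts.loc[pos, club] == 0]
print(empty_panels)
```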
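By default, FacetGrid lays the subplots out in the order in which the values first appear in the data (or in category order, for a pandas categorical), which for our purposes is effectively arbitrary. To specify your own ordering explicitly, pass lists of values to the row_order and col_order parameters.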
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
df = df[df['Club'].isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]
g = sns.FacetGrid(df, row="Position", col="Club",
row_order=['GK', 'ST'],
col_order=['Atlético Madrid', 'FC Barcelona', 'Real Madrid CF'])
g.map(sns.violinplot, "Overall")
FacetGrid comes equipped with various lesser parameters as well, but these are the most important ones.
In a nutshell, faceting is the easiest way to make your data visualization multivariate.
Faceting is multivariate because after laying out one (categorical) variable in the rows and another (categorical) variable in the columns, we are already at two variables accounted for before regular plotting has even begun.
And faceting is easy because going from a single kdeplot to a grid of them, as here, is very simple. It doesn't require learning any new visualization techniques. The limitations are the same ones that hold for the plots you use inside.
Faceting does have some important limitations, however. It can only be used to break data out across one or two categorical variables with very low cardinality: any more than five or so values along either dimension of the grid, and the plots become too small (or involve a lot of scrolling). Additionally, it involves choosing an order to plot in (or letting Python choose one), and with nominal categorical variables that choice is distractingly arbitrary.
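A cheap way to check for this before plotting is to count the distinct values in the variables you intend to facet on. The helper below, count_panels, is a hypothetical convenience written for this sketch and not a seaborn function. The threshold of five is the rule of thumb from the paragraph above, not a hard limit.

```python
import pandas as pd


def count_panels(df, row=None, col=None):
    """Hypothetical helper: grid dimensions a row/col facet would produce."""
    n_rows = df[row].nunique() if row is not None else 1
    n_cols = df[col].nunique() if col is not None else 1
    return n_rows, n_cols


# Invented toy data; the club names are placeholders.
toy = pd.DataFrame({
    "Position": ["GK", "ST", "ST", "GK", "ST", "GK"],
    "Club": ["Club A", "Club A", "Club A", "Club B", "Club B", "Club C"],
})

n_rows, n_cols = count_panels(toy, row="Position", col="Club")
too_crowded = max(n_rows, n_cols) > 5  # rule-of-thumb threshold
print(n_rows, n_cols, too_crowded)
```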
Nevertheless, faceting is an extremely useful and applicable tool to have in your toolbox.
Now that we understand faceting, it's worth taking a quick look at the seaborn pairplot function.
pairplot is a very useful and widely used seaborn function for faceting variables (as opposed to variable values). You pass it a pandas DataFrame containing the columns of interest, and it returns a grid of plots, one for each pair of variables:
sns.pairplot(footballers[['Overall', 'Potential', 'Value']])
By default pairplot draws scatter plots in the off-diagonal cells and histograms on the diagonal. pairplot is often the first thing that a data scientist will throw at their data, and it works fantastically well in that capacity, even if the scatter-and-histogram approach sometimes isn't quite appropriate for the data types involved.
As in previous notebooks, let's now test ourselves by answering some questions about the plots we've used in this section. Once you have your answers, click the "Output" button below to show the correct answers.
1. Suppose we create an n by n FacetGrid. How big can n get?
2. What are the two things about faceting that make it appealing?
3. When is pairplot most useful?
from IPython.display import HTML
HTML("""
<ol>
<li>You should try to keep your grid variables down to five or so. Otherwise the plots get too small.</li>
<li>It's (1) a multivariate technique which (2) is very easy to use.</li>
<li>Pair plots are most useful when just starting out with a dataset, because they help contextualize relationships within it.</li>
</ol>
""")
Next, try forking this kernel, and see if you can replicate the following plots. To see the answers, click the "Input" button to unhide the code. Here's the dataset we'll be working with:
import pandas as pd
import seaborn as sns
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head(3)
g = sns.FacetGrid(pokemon, row="Legendary")
g.map(sns.kdeplot, "Attack")
g = sns.FacetGrid(pokemon, col="Legendary", row="Generation")
g.map(sns.kdeplot, "Attack")
sns.pairplot(pokemon[['HP', 'Attack', 'Defense']])
In this notebook we explored FacetGrid and pairplot, two seaborn facilities for faceting your data, and discussed why faceting is so useful in a broad range of cases.
This technique is our first dip into multivariate plotting, an idea that we will explore in more depth with two other approaches in the next section.
Continue to the next section, "Multivariate plotting".