Multivariate plotting

Multivariate Scatter Plot Grouped Box Plot Heatmap Parallel Coordinates
df.plot.scatter() df.plot.box() sns.heatmap pd.plotting.parallel_coordinates

For most of this tutorial we've been plotting data in one (univariate) or two (bivariate) dimensions. In the previous section we explored faceting: a multivariate plotting method that works by "gridding out" the data.

In this section we'll delve further into multivariate plotting. First we'll explore "truly" multivariate charts. Then we'll examine some plots that use summarization to get at the same thing.

In [1]:
import pandas as pd
pd.set_option('max_columns', None)
df = pd.read_csv("../input/fifa-18-demo-player-dataset/CompleteDataset.csv", index_col=0)

import re
import numpy as np

footballers = df.copy()
footballers['Unit'] = df['Value'].str[-1]
footballers['Value (M)'] = np.where(footballers['Unit'] == '0', 0, 
                                    footballers['Value'].str[1:-1].replace(r'[a-zA-Z]',''))
footballers['Value (M)'] = footballers['Value (M)'].astype(float)
footballers['Value (M)'] = np.where(footballers['Unit'] == 'M', 
                                    footballers['Value (M)'], 
                                    footballers['Value (M)']/1000)
footballers = footballers.assign(Value=footballers['Value (M)'],
                                 Position=footballers['Preferred Positions'].str.split().str[0])
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3018: DtypeWarning: Columns (23,35) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

(Note: the first code cell above contains some data pre-processing. This is extraneous, and so I've hidden it by default.)

In [2]:
footballers.head()
Out[2]:
Name Age Photo Nationality Flag Overall Potential Club Club Logo Value Wage Special Acceleration Aggression Agility Balance Ball control Composure Crossing Curve Dribbling Finishing Free kick accuracy GK diving GK handling GK kicking GK positioning GK reflexes Heading accuracy Interceptions Jumping Long passing Long shots Marking Penalties Positioning Reactions Short passing Shot power Sliding tackle Sprint speed Stamina Standing tackle Strength Vision Volleys CAM CB CDM CF CM ID LAM LB LCB LCM LDM LF LM LS LW LWB Preferred Positions RAM RB RCB RCM RDM RF RM RS RW RWB ST Unit Value (M) Position
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Real Madrid CF https://cdn.sofifa.org/24/18/teams/243.png 95.5 €565K 2228 89 63 89 63 93 95 85 81 91 94 76 7 11 15 14 11 88 29 95 77 92 22 85 95 96 83 94 23 91 92 31 80 85 88 89.0 53.0 62.0 91.0 82.0 20801 89.0 61.0 53.0 82.0 62.0 91.0 89.0 92.0 91.0 66.0 ST LW 89.0 61.0 53.0 82.0 62.0 91.0 89.0 92.0 91.0 66.0 92.0 M 95.5 ST
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina https://cdn.sofifa.org/flags/52.png 93 93 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png 105.0 €565K 2154 92 48 90 95 95 96 77 89 97 95 90 6 11 15 14 8 71 22 68 87 88 13 74 93 95 88 85 26 87 73 28 59 90 85 92.0 45.0 59.0 92.0 84.0 158023 92.0 57.0 45.0 84.0 59.0 92.0 90.0 88.0 91.0 62.0 RW 92.0 57.0 45.0 84.0 59.0 92.0 90.0 88.0 91.0 62.0 88.0 M 105.0 RW
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 94 Paris Saint-Germain https://cdn.sofifa.org/24/18/teams/73.png 123.0 €280K 2100 94 56 96 82 95 92 75 81 96 89 84 9 9 15 15 11 62 36 61 75 77 21 81 90 88 81 80 33 90 78 24 53 80 83 88.0 46.0 59.0 88.0 79.0 190871 88.0 59.0 46.0 79.0 59.0 88.0 87.0 84.0 89.0 64.0 LW 88.0 59.0 46.0 79.0 59.0 88.0 87.0 84.0 89.0 64.0 84.0 M 123.0 LW
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 92 92 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png 97.0 €510K 2291 88 78 86 60 91 83 77 86 86 94 84 27 25 31 33 37 77 41 69 64 86 30 85 92 93 83 87 38 77 89 45 80 84 88 87.0 58.0 65.0 88.0 80.0 176580 87.0 64.0 58.0 80.0 65.0 88.0 85.0 88.0 87.0 68.0 ST 87.0 64.0 58.0 80.0 65.0 88.0 85.0 88.0 87.0 68.0 88.0 M 97.0 ST
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany https://cdn.sofifa.org/flags/21.png 92 92 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png 61.0 €230K 1493 58 29 52 35 48 70 15 14 30 13 11 91 90 95 91 89 25 30 78 59 16 10 47 12 85 55 25 11 61 44 10 83 70 11 NaN NaN NaN NaN NaN 167495 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN GK NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN M 61.0 GK

Adding more visual variables

The most obvious way to plot lots of variables is to augement the visualizations we've been using thus far with even more visual variables. A visual variable is any visual dimension or marker that we can use to perceptually distinguish two data elements from one another. Examples include size, color, shape, and one, two, and even three dimensional position.

"Good" multivariate data displays are ones that make efficient, easily-interpretable use of these parameters.

Multivariate scatter plots

Let's look at some examples. We'll start with the scatter plot. Supose that we are interested in seeing which type of offensive players tends to get paid the most: the striker, the right-winger, or the left-winger.

In [3]:
import seaborn as sns

sns.lmplot(x='Value', y='Overall', hue='Position', 
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])], 
           fit_reg=False)
Out[3]:
<seaborn.axisgrid.FacetGrid at 0x7f6ff35e8198>

This scatterplot uses three visual variables. The horizontal position (x-value) tracks the Value of the player (how well they are paid). The vertical position (y-value) tracks the Overall score of the player across all attributes. And the color (the hue parameter) tracks which of the three categories of interest the player the point represents is in.

The new variable in this chart is color. Color provides an aesthetically pleasing visual, but it's tricky to use. Looking at this scatter plot we see the same overplotting issue we saw in previous sections. But we no longer have an easy solution, like using a hex plot, because color doesn't make sense in that setting.

Another example visual variable is shape. Shape controls, well, the shape of the marker:

In [4]:
sns.lmplot(x='Value', y='Overall', markers=['o', 'x', '*'], hue='Position',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False
          )
Out[4]:
<seaborn.axisgrid.FacetGrid at 0x7f6fe9bb27f0>

seaborn is opinionated about what kinds of visual variables you should use, and doesn't provide a shape option very often. This is because simple shapes, though nifty, are perceptually inferior to colors in terms of their distinguishability.

Grouped box plot

Another demonstrative plot is the grouped box plot. This plot takes advantage of grouping. Suppose we're interested in the following question: do Strikers score higher on "Aggression" than Goalkeepers do?

In [5]:
f = (footballers
         .loc[footballers['Position'].isin(['ST', 'GK'])]
         .loc[:, ['Value', 'Overall', 'Aggression', 'Position']]
    )
f = f[f["Overall"] >= 80]
f = f[f["Overall"] < 85]
f['Aggression'] = f['Aggression'].astype(float)

sns.boxplot(x="Overall", y="Aggression", hue='Position', data=f)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6ff35e3710>

As you can see, this plot demonstrates conclusively that within our datasets goalkeepers (at least, those with an overall score between 80 and 85) have much lower Aggression scores than Strikers do.

In this plot, the horizontal axis encodes the Overall score, the vertical axis encodes the Aggression score, and the grouping encodes the Position.

Grouping is an extremely communicative visual variable: it makes this chart very easy to interpret. However, it has very low cardinality: it's very hard to use groups to fit more than a handful of categorical values. In this plot we've chosen just two player positions and five Overall player scores and the visualization is already rather crowded. Overall, grouping is very similar to faceting in terms of what it can and can't do.

Summarization

It is difficult to squeeze enough dimensions onto a plot without hurting its interpretability. Very busy plots are naturally very hard to interpret. Hence highly multivariate can be difficult to use.

Another way to plot many dataset features while circumnavigating this problem is to use summarization. Summarization is the creation and addition of new variables by mixing and matching the information provided in the old ones.

Summarization is a useful technique in data visualization because it allows us to "boil down" potentially very complicated relationships into simpler ones.

Heatmap

Probably the most heavily used summarization visualization is the correlation plot, in which measures the correlation between every pair of values in a dataset and plots a result in color.

In [6]:
f = (
    footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
).corr()

sns.heatmap(f, annot=True)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6fe9a01128>

Each cell in this plot is the intersection of two variables; its color and label together indicate the amount of correlation between the two variables (how likely both variables are the increase or decrease at the same time). For example, in this dataset Agility and Acceleration are highly correlated, while Aggression and Balanced are very uncorrelated.

A correlation plot is a specific kind of heatmap. A heatmap maps one particular fact (in this case, correlation) about every pair of variables you chose from a dataset.

Parallel Coordinates

A parallel coordinates plot provides another way of visualizing data across many variables.

In [7]:
from pandas.plotting import parallel_coordinates

f = (
    footballers.iloc[:, 12:17]
        .loc[footballers['Position'].isin(['ST', 'GK'])]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
)
f['Position'] = footballers['Position']
f = f.sample(200)

parallel_coordinates(f, 'Position')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6fe9939668>

In the visualization above we've plotted a sample of 200 goalkeepers (in dark green) and strikers (in light green) across our five variables of interest.

Parallel coordinates plots are great for determining how distinguishable different classes are in the data. They standardize the variables from top to bottom... In this case, we see that strikers are almost uniformally higher rated on all of the variables we've chosen, meaning these two classes of players are very easy to distinguish.

Exercises

In [8]:
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head()
Out[8]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

Try answering the following questions. Click the "Output" button on the cell below to see the answers.

  1. What are three techniques for creating multivariate data visualziations?
  2. Name three examples of visual variables.
  3. How does summarization in data visualization work?
In [9]:
from IPython.display import HTML
HTML("""
<ol>
<li>The three techniques we have covered in this tutorial are faceting, using more visual variables, and summarization.</li>
<br/>
<li>Some examples of visual variables are shape, color, size, x-position, y-position, and grouping. However there are many more that are possible!</li>
<br/>
<li>In data visualization, summarization works by compressing complex data into simpler, easier-to-plot indicators.</li>
</ol>
""")
Out[9]:
  1. The three techniques we have covered in this tutorial are faceting, using more visual variables, and summarization.

  2. Some examples of visual variables are shape, color, size, x-position, y-position, and grouping. However there are many more that are possible!

  3. In data visualization, summarization works by compressing complex data into simpler, easier-to-plot indicators.

For the exercises that follow, try forking this notebook and replicating the plots that follow. To see the answers, hit the "Input" button below to un-hide the code.

In [10]:
sns.lmplot(x='Attack', y='Defense', markers=['x', 'o',], hue='Legendary',
           data=pokemon,
           fit_reg=False
          )
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x7f6fe96b0160>
In [11]:
sns.boxplot(x="Generation", y="Total", hue='Legendary', data=pokemon)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6fe96b70b8>
In [12]:
pokemon.head()
Out[12]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
In [13]:
f = pokemon.loc[:, ['HP', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'Speed']].corr()

sns.heatmap(f, annot=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6fe9715550>
In [14]:
from pandas.plotting import parallel_coordinates

p = (pokemon.loc[:, ['Type 1', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def']]
        .loc[pokemon['Type 1'].isin(['Fighting', 'Psychic'])])

parallel_coordinates(p, 'Type 1')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6fe8486a58>

Conclusion

In this tutorial we followed up on faceting, covered in the last section, by diving into two other multivariate data visualization techniques.

The first technique, adding more visual variables, results in more complicated but potentially more detailed plots. The second technique, summarization, compresses variable information to a summary statistic, resulting in a simple output—albeit at the cost of expressiveness.

Faceting, adding visual variables, and summarization are the three multivariate techniques that we will cover in this tutorial.

The rest of the material in this tutorial is optional. In the next section we will learn to use plotly, a very popular interactive visualization library that builds on these libraries.

Click here to go to the next section, "Introduction to plotly".