Multivariate plotting¶


Multivariate Scatter Plot	Grouped Box Plot	Heatmap	Parallel Coordinates
df.plot.scatter()	df.plot.box()	sns.heatmap	pd.plotting.parallel_coordinates

For most of this tutorial we've been plotting data in one (univariate) or two (bivariate) dimensions. In the previous section we explored faceting: a multivariate plotting method that works by "gridding out" the data.

In this section we'll delve further into multivariate plotting. First we'll explore "truly" multivariate charts. Then we'll examine some plots that use summarization to get at the same thing.

import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv("../input/fifa-18-demo-player-dataset/CompleteDataset.csv", index_col=0)

import re
import numpy as np

footballers = df.copy()
footballers['Unit'] = df['Value'].str[-1]
footballers['Value (M)'] = np.where(footballers['Unit'] == '0', 0, 
                                    footballers['Value'].str[1:-1].str.replace(r'[a-zA-Z]', '', regex=True))
footballers['Value (M)'] = footballers['Value (M)'].astype(float)
footballers['Value (M)'] = np.where(footballers['Unit'] == 'M', 
                                    footballers['Value (M)'], 
                                    footballers['Value (M)']/1000)
footballers = footballers.assign(Value=footballers['Value (M)'],
                                 Position=footballers['Preferred Positions'].str.split().str[0])

/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3018: DtypeWarning: Columns (23,35) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

(Note: the first code cell above contains some data pre-processing. This is extraneous, and so I've hidden it by default.)
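
The pre-processing above turns the raw Value strings into numbers expressed in millions. For readers who want to follow the logic on a single value, here is a minimal sketch of the same idea as a standalone function. It assumes the strings look like "€95.5M", "€565K", or "€0" (a currency symbol, a number, and an optional M or K suffix). `value_in_millions` is a hypothetical helper name that is not part of the notebook, and treating a number without a suffix as plain euros is an assumption.

```python
import re

# Optional currency symbol, a number, then an optional M or K suffix.
_VALUE_PATTERN = re.compile(r"^\D?(\d+(?:\.\d+)?)([MK]?)$")


def value_in_millions(raw: str) -> float:
    """Convert a string such as '€95.5M' or '€565K' to a float in millions."""
    match = _VALUE_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognized value format: {raw!r}")
    number, unit = match.groups()
    amount = float(number)
    if unit == "M":
        return amount
    if unit == "K":
        return amount / 1000
    # Assumption: no suffix means the amount is given in plain currency units.
    return amount / 1_000_000


print(value_in_millions("€95.5M"))
print(value_in_millions("€565K"))
print(value_in_millions("€0"))
```

Provided every entry is a string in that format, the function could be applied to the raw column with `df['Value'].map(value_in_millions)`.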

footballers.head()

Adding more visual variables¶

The most obvious way to plot lots of variables is to augment the visualizations we've been using thus far with even more visual variables. A visual variable is any visual dimension or marker that we can use to perceptually distinguish two data elements from one another. Examples include size, color, shape, and one, two, and even three dimensional position.

"Good" multivariate data displays are ones that make efficient, easily-interpretable use of these parameters.

Multivariate scatter plots¶

Let's look at some examples. We'll start with the scatter plot. Suppose that we are interested in seeing which type of offensive player tends to be valued the most: the striker, the right-winger, or the left-winger.

import seaborn as sns

sns.lmplot(x='Value', y='Overall', hue='Position', 
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])], 
           fit_reg=False)

This scatterplot uses three visual variables. The horizontal position (x-value) tracks the Value of the player (their market value, in millions). The vertical position (y-value) tracks the Overall score of the player across all attributes. And the color (the hue parameter) tracks which of the three positions of interest each player belongs to.

The new variable in this chart is color. Color provides an aesthetically pleasing visual, but it's tricky to use. Looking at this scatter plot we see the same overplotting issue we saw in previous sections. But we no longer have an easy solution, like using a hex plot, because color doesn't make sense in that setting.

Another example visual variable is shape. Shape controls, well, the shape of the marker:

sns.lmplot(x='Value', y='Overall', markers=['o', 'x', '*'], hue='Position',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False
          )

seaborn is opinionated about what kinds of visual variables you should use, and doesn't provide a shape option very often. This is because simple shapes, though nifty, are perceptually inferior to colors in terms of their distinguishability.

Grouped box plot¶

Another demonstrative plot is the grouped box plot. This plot takes advantage of grouping. Suppose we're interested in the following question: do Strikers score higher on "Aggression" than Goalkeepers do?

f = (footballers
         .loc[footballers['Position'].isin(['ST', 'GK'])]
         .loc[:, ['Value', 'Overall', 'Aggression', 'Position']]
    )
f = f[f["Overall"] >= 80]
f = f[f["Overall"] < 85]
f['Aggression'] = f['Aggression'].astype(float)

sns.boxplot(x="Overall", y="Aggression", hue='Position', data=f)

As you can see, this plot shows clearly that within our dataset goalkeepers (at least, those with an Overall score from 80 to 84) have much lower Aggression scores than strikers do.

In this plot, the horizontal axis encodes the Overall score, the vertical axis encodes the Aggression score, and the grouping encodes the Position.

Grouping is an extremely communicative visual variable: it makes this chart very easy to interpret. However, it has very low cardinality: it's very hard to use groups to fit more than a handful of categorical values. In this plot we've chosen just two player positions and five Overall player scores and the visualization is already rather crowded. Overall, grouping is very similar to faceting in terms of what it can and can't do.
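
To make the cardinality limit concrete: the number of boxes in a grouped box plot is at most the number of x categories multiplied by the number of hue categories. The sketch below counts them for a small made-up frame (the `toy` data is illustrative and not taken from the FIFA dataset).

```python
import pandas as pd

# Hypothetical toy data: two positions across three Overall scores.
toy = pd.DataFrame({
    "Overall":    [80, 80, 81, 81, 82, 82, 80, 81],
    "Position":   ["ST", "GK", "ST", "GK", "ST", "GK", "ST", "GK"],
    "Aggression": [70.0, 30.0, 65.0, 25.0, 72.0, 35.0, 68.0, 28.0],
})

# Each (x, hue) combination that occurs in the data becomes one box.
boxes = toy.groupby(["Overall", "Position"]).size()
n_boxes = len(boxes)

# Upper bound if every combination were present.
max_boxes = toy["Overall"].nunique() * toy["Position"].nunique()

print(boxes)
print(n_boxes, max_boxes)
```

By the same arithmetic, the plot above has five Overall scores and two positions, so up to ten boxes. Adding a third position would bring that to fifteen.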

Summarization¶

It is difficult to squeeze many dimensions onto a plot without hurting its interpretability. Very busy plots are naturally very hard to interpret. Hence highly multivariate plots can be difficult to use.

Another way to plot many dataset features while circumventing this problem is to use summarization. Summarization is the creation and addition of new variables by mixing and matching the information provided in the old ones.

Summarization is a useful technique in data visualization because it allows us to "boil down" potentially very complicated relationships into simpler ones.
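
As a small illustration of the idea, many rows can be collapsed into one statistic per group before anything is plotted. The sketch below uses made-up numbers (the `toy` frame is hypothetical) and a group-wise mean as the summary. Other statistics would work the same way.

```python
import pandas as pd

# Hypothetical toy data: three strikers and three goalkeepers.
toy = pd.DataFrame({
    "Position":   ["ST", "ST", "ST", "GK", "GK", "GK"],
    "Aggression": [70, 60, 80, 30, 20, 40],
    "Balance":    [65, 75, 70, 40, 50, 45],
})

# Collapse the six player rows into one mean per position and attribute.
summary = toy.groupby("Position").mean()

print(summary)
```

The resulting two-row table is far easier to plot than the individual players, at the cost of hiding the spread within each group.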

Heatmap¶

Probably the most heavily used summarization visualization is the correlation plot, which measures the correlation between every pair of variables in a dataset and plots the result in color.

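f = (
    footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
        # Keep purely numeric entries; anything else becomes NaN.
        # str(v) guards against cells that are not already strings.
        .applymap(lambda v: int(v) if str(v).isdecimal() else np.nan)
        .dropna()
).corr()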

sns.heatmap(f, annot=True)

Each cell in this plot is the intersection of two variables; its color and label together indicate the amount of correlation between the two variables (how strongly the two tend to increase or decrease together). For example, in this dataset Agility and Acceleration are highly correlated, while Aggression and Balance are largely uncorrelated.

A correlation plot is a specific kind of heatmap. A heatmap maps one particular fact (in this case, correlation) about every pair of variables you chose from a dataset.
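
To show what each cell of the correlation plot contains, the sketch below computes one Pearson correlation coefficient by hand and compares it with the matrix that `DataFrame.corr()` returns (Pearson is its default method). The numbers in `toy` are made up for illustration, and `pearson` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for three attributes of four players.
toy = pd.DataFrame({
    "Acceleration": [60, 70, 80, 90],
    "Agility":      [62, 69, 83, 88],
    "Aggression":   [75, 40, 80, 45],
})


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


by_hand = pearson(toy["Acceleration"], toy["Agility"])
matrix = toy.corr()  # one coefficient for every pair of columns

print(by_hand)
print(matrix)
```

The matrix is symmetric with ones on the diagonal, which is why a correlation heatmap mirrors itself across the diagonal.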

Parallel Coordinates¶

A parallel coordinates plot provides another way of visualizing data across many variables.

from pandas.plotting import parallel_coordinates

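f = (
    footballers.iloc[:, 12:17]
        .loc[footballers['Position'].isin(['ST', 'GK'])]
        # Keep purely numeric entries; anything else becomes NaN.
        # str(v) guards against cells that are not already strings.
        .applymap(lambda v: int(v) if str(v).isdecimal() else np.nan)
        .dropna()
)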
f['Position'] = footballers['Position']
f = f.sample(200)

parallel_coordinates(f, 'Position')

In the visualization above we've plotted a sample of 200 goalkeepers (in dark green) and strikers (in light green) across our five variables of interest.

Parallel coordinates plots are great for determining how distinguishable different classes are in the data. They standardize the variables from top to bottom... In this case, we see that strikers are almost uniformly higher rated on all of the variables we've chosen, meaning these two classes of players are very easy to distinguish.
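
One practical note: pandas' `parallel_coordinates` appears to draw all columns against one shared y-axis in their raw units. That works here because the five ratings share a similar scale. When columns use different units, rescaling each one to a common range first can keep a single large-valued column from flattening the rest. The sketch below shows min-max scaling on made-up data; `min_max_scale` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame, class_column: str) -> pd.DataFrame:
    """Rescale every non-class column to [0, 1]; keep the class column unchanged."""
    numeric = frame.drop(columns=[class_column])
    lowest = numeric.min()
    # Replace a zero range with 1 so constant columns do not divide by zero.
    span = (numeric.max() - lowest).replace(0, 1)
    scaled = (numeric - lowest) / span
    scaled[class_column] = frame[class_column]
    return scaled


# Hypothetical toy data with two columns on very different scales.
toy = pd.DataFrame({
    "Position":     ["ST", "ST", "GK", "GK"],
    "Acceleration": [90, 80, 50, 40],
    "Height (cm)":  [180, 185, 190, 195],
})

scaled = min_max_scale(toy, "Position")
print(scaled)
```

The rescaled frame can then be passed to `parallel_coordinates(scaled, 'Position')` in the same way as above.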

Exercises¶

pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head()

Try answering the following questions. Click the "Output" button on the cell below to see the answers.

What are three techniques for creating multivariate data visualizations?
Name three examples of visual variables.
How does summarization in data visualization work?

from IPython.display import HTML
HTML("""
<ol>
<li>The three techniques we have covered in this tutorial are faceting, using more visual variables, and summarization.</li>
<br/>
<li>Some examples of visual variables are shape, color, size, x-position, y-position, and grouping. However there are many more that are possible!</li>
<br/>
<li>In data visualization, summarization works by compressing complex data into simpler, easier-to-plot indicators.</li>
</ol>
""")

For the exercises that follow, try forking this notebook and replicating the plots that follow. To see the answers, hit the "Input" button below to un-hide the code.

import seaborn as sns

sns.lmplot(x='Attack', y='Defense', hue='Legendary', 
           markers=['x', 'o'],
           fit_reg=False, data=pokemon)

sns.boxplot(x="Generation", y="Total", hue='Legendary', data=pokemon)

sns.heatmap(
    pokemon.loc[:, ['HP', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'Speed']].corr(),
    annot=True
)

import pandas as pd
from pandas.plotting import parallel_coordinates

p = (pokemon[(pokemon['Type 1'].isin(["Psychic", "Fighting"]))]
         .loc[:, ['Type 1', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def']]
    )

parallel_coordinates(p, 'Type 1')

Conclusion¶

In this tutorial we followed up on faceting, covered in the last section, by diving into two other multivariate data visualization techniques.

The first technique, adding more visual variables, results in more complicated but potentially more detailed plots. The second technique, summarization, compresses variable information to a summary statistic, resulting in a simple output—albeit at the cost of expressiveness.

Faceting, adding visual variables, and summarization are the three multivariate techniques that we cover in this tutorial.

The rest of the material in this tutorial is optional. In the next section we will learn to use plotly, a very popular interactive visualization library that builds on these ideas.

Click here to go to the next section, "Introduction to plotly".

Output of footballers.head():

	Name	Age	Photo	Nationality	Flag	Overall	Potential	Club	Club Logo	Value	Wage	Special	Acceleration	Aggression	Agility	Balance	Ball control	Composure	Crossing	Curve	Dribbling	Finishing	Free kick accuracy	GK diving	GK handling	GK kicking	GK positioning	GK reflexes	Heading accuracy	Interceptions	Jumping	Long passing	Long shots	Marking	Penalties	Positioning	Reactions	Short passing	Shot power	Sliding tackle	Sprint speed	Stamina	Standing tackle	Strength	Vision	Volleys	CAM	CB	CDM	CF	CM	ID	LAM	LB	LCB	LCM	LDM	LF	LM	LS	LW	LWB	Preferred Positions	RAM	RB	RCB	RCM	RDM	RF	RM	RS	RW	RWB	ST	Unit	Value (M)	Position
0	Cristiano Ronaldo	32	https://cdn.sofifa.org/48/18/players/20801.png	Portugal	https://cdn.sofifa.org/flags/38.png	94	94	Real Madrid CF	https://cdn.sofifa.org/24/18/teams/243.png	95.5	€565K	2228	89	63	89	63	93	95	85	81	91	94	76	7	11	15	14	11	88	29	95	77	92	22	85	95	96	83	94	23	91	92	31	80	85	88	89.0	53.0	62.0	91.0	82.0	20801	89.0	61.0	53.0	82.0	62.0	91.0	89.0	92.0	91.0	66.0	ST LW	89.0	61.0	53.0	82.0	62.0	91.0	89.0	92.0	91.0	66.0	92.0	M	95.5	ST
1	L. Messi	30	https://cdn.sofifa.org/48/18/players/158023.png	Argentina	https://cdn.sofifa.org/flags/52.png	93	93	FC Barcelona	https://cdn.sofifa.org/24/18/teams/241.png	105.0	€565K	2154	92	48	90	95	95	96	77	89	97	95	90	6	11	15	14	8	71	22	68	87	88	13	74	93	95	88	85	26	87	73	28	59	90	85	92.0	45.0	59.0	92.0	84.0	158023	92.0	57.0	45.0	84.0	59.0	92.0	90.0	88.0	91.0	62.0	RW	92.0	57.0	45.0	84.0	59.0	92.0	90.0	88.0	91.0	62.0	88.0	M	105.0	RW
2	Neymar	25	https://cdn.sofifa.org/48/18/players/190871.png	Brazil	https://cdn.sofifa.org/flags/54.png	92	94	Paris Saint-Germain	https://cdn.sofifa.org/24/18/teams/73.png	123.0	€280K	2100	94	56	96	82	95	92	75	81	96	89	84	9	9	15	15	11	62	36	61	75	77	21	81	90	88	81	80	33	90	78	24	53	80	83	88.0	46.0	59.0	88.0	79.0	190871	88.0	59.0	46.0	79.0	59.0	88.0	87.0	84.0	89.0	64.0	LW	88.0	59.0	46.0	79.0	59.0	88.0	87.0	84.0	89.0	64.0	84.0	M	123.0	LW
3	L. Suárez	30	https://cdn.sofifa.org/48/18/players/176580.png	Uruguay	https://cdn.sofifa.org/flags/60.png	92	92	FC Barcelona	https://cdn.sofifa.org/24/18/teams/241.png	97.0	€510K	2291	88	78	86	60	91	83	77	86	86	94	84	27	25	31	33	37	77	41	69	64	86	30	85	92	93	83	87	38	77	89	45	80	84	88	87.0	58.0	65.0	88.0	80.0	176580	87.0	64.0	58.0	80.0	65.0	88.0	85.0	88.0	87.0	68.0	ST	87.0	64.0	58.0	80.0	65.0	88.0	85.0	88.0	87.0	68.0	88.0	M	97.0	ST
4	M. Neuer	31	https://cdn.sofifa.org/48/18/players/167495.png	Germany	https://cdn.sofifa.org/flags/21.png	92	92	FC Bayern Munich	https://cdn.sofifa.org/24/18/teams/21.png	61.0	€230K	1493	58	29	52	35	48	70	15	14	30	13	11	91	90	95	91	89	25	30	78	59	16	10	47	12	85	55	25	11	61	44	10	83	70	11	NaN	NaN	NaN	NaN	NaN	167495	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	GK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	M	61.0	GK

Output of pokemon.head():

#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False