Renaming and combining reference

Introduction

This is the reference part of the "Renaming and combining" section of the Advanced Pandas tutorial. For the workbook section, click here.

Renaming is covered in its own section in the "Essential Basic Functionality" section of the extensive official documentation. Combining is covered by the "Merge, join, concatenate" section there.

In [1]:
import pandas as pd
pd.set_option('max_rows', 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews
Out[1]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Renaming

Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, we may use pandas renaming utility functions to change the names of the offending entries to something better.

The first function we'll introduce here is rename, which lets you rename index names and/or column names. For example, to change the points column in our dataset to score, we would do:

In [2]:
reviews.rename(columns={'points': 'score'})
Out[2]:
country description designation score price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

rename lets you rename index or column values by specifying a index or column keyword parameter, respectively. It supports a variety of input formats, but I usually find a Python dict to be the most convenient one. Here is an example using it to rename some elements on the index.

In [3]:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
Out[3]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

You'll probably rename columns very often, but rename index values very rarely. For that, set_index is usually more convenient.

Both the row index and the column index can have their own name attribute. The complimentary rename_axis method may be used to change these names. For example:

In [4]:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
Out[4]:
fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Combining

When performing operations on a dataset we will sometimes need to combine different DataFrame and/or Series in non-trivial ways. pandas has three core methods for doing this. In order of increasing complexity, these are concat, join, and merge. Most of what merge can do can also be done more simply with join, so I will omit it and focus on the first two functions here.

The simplest combining method is concat. This function works just like the list.concat method in core Python: given a list of elements, it will smush those elements together along an axis.

This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat to smush them together:

In [5]:
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])
Out[5]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé i...
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First... Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first...
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album 'Light of Mine' out now: http://ky...

79797 rows × 16 columns

The middlemost combiner in terms of complexity is pd.DataFrame.join. join lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:

In [6]:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')
Out[6]:
video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN comments_disabled_CAN ratings_disabled_CAN video_error_or_removed_CAN description_CAN video_id_UK channel_title_UK category_id_UK publish_time_UK tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 18.04.01 PNn8sECd7io Markiplier 20 2018-01-03T19:33:53.000Z getting over it|"markiplier"|"funny moments"|"... 835930 47058 1023 8250 https://i.ytimg.com/vi/PNn8sECd7io/default.jpg False False False Getting Over It continues with RAGE BEYOND ALL... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,323 Solo Wins! 18.09.03 DvPW66IFhMI AlexRamiGaming 20 2018-03-09T07:15:52.000Z PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... 212838 5199 542 11 https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg False False False Discord For EVERYONE - https://discord.gg/nhud... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 18.07.05 Wt9Gkpmbt44 TheBigJackpot 24 2018-05-07T06:58:59.000Z Slot Machine|"win"|"Gambling"|"Big Win"|"raja"... 28973 2167 175 10 https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg False False False The Raja takes matters in to his own hands ton... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE 18.04.04 Az72jrKbANA Right Side Broadcasting Network 25 2018-04-03T23:12:37.000Z YouTube shooter|"YouTube active shooter"|"acti... 103513 1722 181 76 https://i.ytimg.com/vi/Az72jrKbANA/default.jpg False False False An active shooter has been reported at the You... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

40900 rows × 28 columns

The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.