Creating, reading, and writing reference

This is the reference component of the "Creating, reading, and writing" section of the tutorial. For the hands-on exercises, see the companion workbook section.

The very first step in any data analytics project will probably be reading the data out of a file somewhere, so it makes sense that that's the first thing we'd need to cover. In this section, we'll look at creating pandas Series and DataFrame objects, both by hand and by reading data from disk.

The IO Tools section of the official pandas docs provides a comprehensive overview on this subject.

In [4]:
import pandas as pd

Creating data

There are two core objects in pandas: the DataFrame and the Series.

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

In [5]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out[5]:
   Yes   No
0   50  131
1   21    2

In this example, the "0, No" entry has the value 131. The "0, Yes" entry has the value 50, and so on.

DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings (Python str objects):

In [6]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out[6]:
             Bob           Sue
0    I liked it.  Pretty good.
1  It was awful.        Bland.

We are using the pd.DataFrame constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are likeliest to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

In [7]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
Out[7]:
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [8]:
pd.Series([1, 2, 3, 4, 5])
Out[8]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

In [9]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out[9]:
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.
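To make this concrete, here's a minimal sketch (not part of the original examples): selecting a single column of a DataFrame gives you back a Series.

reviews = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                        'Sue': ['Pretty good.', 'Bland.']},
                       index=['Product A', 'Product B'])

# Selecting one column yields a Series whose name is the column label
# and whose index is the DataFrame's row index.
reviews['Bob']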

Reading common file formats

Being able to create a DataFrame or Series by hand is handy. But most of the time, we won't actually be creating our own data by hand; we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

Product A,Product B,Product C
30,21,9
35,34,1
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: "comma-separated values", or CSV.
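As a quick aside (a sketch, not in the original notebook): read_csv accepts any file-like object, so we can parse the toy CSV above directly from a string without touching the filesystem.

import io

csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""
# read_csv parses the first line as column names and assigns the
# default 0, 1, 2 row index.
toy = pd.read_csv(io.StringIO(csv_text))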

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the read_csv function to do the reading, like so:

In [10]:
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")

We can use the shape attribute to check how large the resulting DataFrame is:

In [11]:
wine_reviews.shape
Out[11]:
(129971, 14)

So our new DataFrame has nearly 130,000 records split across 14 different columns. That's almost 2 million entries!
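If you want that entry count directly, the size attribute multiplies the two numbers from shape for you:

wine_reviews.size  # 129971 rows * 14 columns = 1819594 entries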

We can examine the contents of the resulting DataFrame using the head() method, which grabs the first five rows:

In [12]:
wine_reviews.head()
Out[12]:
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

The pandas read_csv function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col:

In [13]:
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
Out[13]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
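Here's a quick sketch of two more of those optional parameters in action on the same file: nrows limits how many data rows are parsed, and usecols keeps only the named columns.

wine_sample = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv",
                          nrows=1000,                              # parse only the first 1000 rows
                          usecols=['country', 'points', 'price'])  # keep only these columns
wine_sample.shape  # (1000, 3)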

Let's look at a few more data formats you're likely to encounter.

First up, the venerable Excel spreadsheet. An Excel file (XLS or XLSX) organizes itself as a sequence of named sheets. Each sheet is basically a table. So to load the data into pandas we need one additional parameter: the name of the sheet of interest.

Loading the sheet we want looks like this:

In [14]:
wic = pd.read_excel("../input/xls-files-all/WICAgencies2013ytd.xls", 
                    sheet_name='Total Women')
wic.head()
Out[14]:
WIC PROGRAM -- TOTAL NUMBER OF WOMEN PARTICIPATING Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 FISCAL YEAR 2013 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Data as of October 05, 2018 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 State Agency or Indian Tribal Organization 2012-10-01 00:00:00 2012-11-01 00:00:00 2012-12-01 00:00:00 2013-01-01 00:00:00 2013-02-01 00:00:00 2013-03-01 00:00:00 2013-04-01 00:00:00 2013-05-01 00:00:00 2013-06-01 00:00:00 2013-07-01 00:00:00 2013-08-01 00:00:00 2013-09-01 00:00:00 Average Participation
4 Connecticut 11891 11763 11328 11786 11159 11070 11379 11666 11387 11587 11570 11376 11496.8

As you can see in this example, Excel files are often not formatted as well as CSV files are. Spreadsheets allow (and encourage) creating notes and fields which are human-readable, but not machine-readable.

So before we can use this particular dataset, we will need to clean it up a bit. We will see how to do so in the next section.
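One tip that helps with messy workbooks (a sketch using the same file): pd.ExcelFile will list a workbook's sheet names without loading any of them, so you can check what's available first.

xls = pd.ExcelFile("../input/xls-files-all/WICAgencies2013ytd.xls")
xls.sheet_names  # a list of sheet names; 'Total Women' should be among them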

For now, let's move on to another common data format: SQL files.

SQL databases are where most of the data on the web ultimately gets stored. They can store anything from something as simple as recipes to something as complicated as "almost everything on the Kaggle website".

Connecting to a SQL database requires a lot more thought than reading from an Excel file. For one, you need to create a connector, something that will handle siphoning data from the database.

pandas won't do this for you automatically because there are many, many different types of SQL databases out there, each with its own connector. So for a SQLite database (the only kind supported on Kaggle), you would need to first do the following (using the sqlite3 library that comes with Python):

In [21]:
import sqlite3
conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")
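Once connected, you can ask the database what tables it holds. Here's a sketch using plain sqlite3 (sqlite_master is SQLite's built-in catalog table):

cursor = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()  # a list of one-element tuples of table names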

The other thing you need to do is write a SQL statement. Internally, SQL databases all operate very differently. Externally, however, they all provide the same API, the "Structured Query Language" (or...SQL...for short).

We (very briefly) need to use SQL to load data into pandas.

For the purposes of analysis, however, we can usually just think of a SQL database as a set of tables with names, and SQL as a minor inconvenience in getting that data out of said tables.

So, without further ado, here is all the SQL you have to know to get the data out of SQLite and into pandas:

In [16]:
fires = pd.read_sql_query("SELECT * FROM fires", conn)

Every SQL statement begins with SELECT. The asterisk (*) is a wildcard character, meaning "everything", and FROM fires tells the database we want the data from the fires table specifically.

And, out the other end, data:

In [22]:
fires.head()
Out[22]:
OBJECTID FOD_ID FPA_ID SOURCE_SYSTEM_TYPE SOURCE_SYSTEM NWCG_REPORTING_AGENCY NWCG_REPORTING_UNIT_ID NWCG_REPORTING_UNIT_NAME SOURCE_REPORTING_UNIT SOURCE_REPORTING_UNIT_NAME LOCAL_FIRE_REPORT_ID LOCAL_INCIDENT_ID FIRE_CODE FIRE_NAME ICS_209_INCIDENT_NUMBER ICS_209_NAME MTBS_ID MTBS_FIRE_NAME COMPLEX_NAME FIRE_YEAR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME STAT_CAUSE_CODE STAT_CAUSE_DESCR CONT_DATE CONT_DOY CONT_TIME FIRE_SIZE FIRE_SIZE_CLASS LATITUDE LONGITUDE OWNER_CODE OWNER_DESCR STATE COUNTY FIPS_CODE FIPS_NAME Shape
0 1 1 FS-1418826 FED FS-FIRESTAT FS USCAPNF Plumas National Forest 0511 Plumas National Forest 1 PNF-47 BJ8K FOUNTAIN None None None None None 2005 2453403.5 33 1300 9.0 Miscellaneous 2453403.5 33.0 1730 0.10 A 40.036944 -121.005833 5.0 USFS CA 63 063 Plumas b'\x00\x01\xad\x10\x00\x00\xe8d\xc2\x92_@^\xc0...
1 2 2 FS-1418827 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest 13 13 AAC0 PIGEON None None None None None 2004 2453137.5 133 0845 1.0 Lightning 2453137.5 133.0 1530 0.25 A 38.933056 -120.404444 5.0 USFS CA 61 061 Placer b'\x00\x01\xad\x10\x00\x00T\xb6\xeej\xe2\x19^\...
2 3 3 FS-1418835 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest 27 021 A32W SLACK None None None None None 2004 2453156.5 152 1921 5.0 Debris Burning 2453156.5 152.0 2024 0.10 A 38.984167 -120.735556 13.0 STATE OR PRIVATE CA 17 017 El Dorado b'\x00\x01\xad\x10\x00\x00\xd0\xa5\xa0W\x13/^\...
3 4 4 FS-1418845 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest 43 6 None DEER None None None None None 2004 2453184.5 180 1600 1.0 Lightning 2453189.5 185.0 1400 0.10 A 38.559167 -119.913333 5.0 USFS CA 3 003 Alpine b'\x00\x01\xad\x10\x00\x00\x94\xac\xa3\rt\xfa]...
4 5 5 FS-1418847 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest 44 7 None STEVENOT None None None None None 2004 2453184.5 180 1600 1.0 Lightning 2453189.5 185.0 1200 0.10 A 38.559167 -119.933056 5.0 USFS CA 3 003 Alpine b'\x00\x01\xad\x10\x00\x00@\xe3\xaa.\xb7\xfb]\...
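SELECT * pulled down every column of every row. As a hedged sketch (the column names come from the output above; SQLite identifiers are case-insensitive), you can name specific columns and cap the row count instead:

# Fetch just three columns from the first five rows of the fires table.
small = pd.read_sql_query(
    "SELECT fire_name, fire_year, state FROM fires LIMIT 5", conn)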

Writing common file formats

Writing data to a file is usually easier than reading it out of one, because pandas handles the nuisance of conversions for you.

We'll start with CSV files again. The opposite of read_csv, which reads our data, is to_csv, which writes it. With CSV files it's dead simple:

In [23]:
wine_reviews.head().to_csv("wine_reviews.csv")
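One wrinkle worth knowing: to_csv also writes the index as the first column by default. A quick sketch of suppressing it:

# index=False drops the row labels from the output file.
wine_reviews.head().to_csv("wine_reviews_noindex.csv", index=False)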

To write an Excel file back you need to_excel and the sheet_name again:

In [19]:
wic.to_excel('wic.xlsx', sheet_name='Total Women')
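If you need several sheets in one workbook, a single to_excel call won't do it; here's a sketch using pd.ExcelWriter (the file and sheet names are made up for illustration):

# Each to_excel call targets a different sheet in the same workbook.
with pd.ExcelWriter('wic_report.xlsx') as writer:
    wic.to_excel(writer, sheet_name='Total Women')
    wic.head(10).to_excel(writer, sheet_name='Preview')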

And finally, to output to a SQL database, supply the name of the table in the database we want to throw the data into, and a connector:

In [20]:
conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)
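Two to_sql parameters worth knowing about (a sketch): if_exists controls what happens when the table is already there, and index=False skips writing the DataFrame index as an extra column.

# 'replace' drops and recreates the table instead of raising an error.
fires.head(10).to_sql("fires", conn, if_exists="replace", index=False)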

Painless!