Lesson 04 - Pandas part 2
Welcome to lesson 4! In this lesson we will continue with pandas as our main way of storing data. NumPy will be useful when we have a uniform multidimensional data set, but for now pandas will be our default.
Again, an exhaustive run-through of pandas is too much for this class - see the book Python for Data Analysis or the official pandas website.
If you are familiar with R, SAS, and/or SQL, click on the links to the pandas introduction written for users of each language.
Please download today's notebook here.
Data Import
Importing data is the essential first step of any analysis. Today we will cover read_csv; before the end of the course we will talk about how to connect to your Netezza (and other SQL) databases.
In [1]:
We have a ton of ways of reading data into and writing data out of pandas. See the dataIO page for more details.
In [2]:
date,A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
In [3]:
| | date | A | B | C |
|---|---|---|---|---|
| 0 | 20090101 | a | 1 | 2 |
| 1 | 20090102 | b | 3 | 4 |
| 2 | 20090103 | c | 4 | 5 |
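The code cell that produced this output is not preserved in this export. A minimal sketch that reproduces it, reading the inline CSV above through `io.StringIO` (read_csv accepts any file-like object):

```python
import io

import pandas as pd

# the inline CSV shown above, as a string
csv_data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"

# StringIO wraps the string in a file-like object, so read_csv can parse it
df = pd.read_csv(io.StringIO(csv_data))
print(df)
```

The same call works with a filename or a URL in place of the StringIO object.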
In [38]:
| date | A | B | C |
|---|---|---|---|
| 20090101 | a | 1 | 2 |
| 20090102 | b | 3 | 4 |
| 20090103 | c | 4 | 5 |
In [19]:
B C
date A
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5
B C
A
a 1 2
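The output above shows a hierarchical index on date and A, and then a selection of a single date. A sketch of how this could be done (the exact cell is not preserved; `index_col` with a list builds a MultiIndex):

```python
import io

import pandas as pd

csv_data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"

# a list for index_col builds a hierarchical (Multi)Index from those columns
df = pd.read_csv(io.StringIO(csv_data), index_col=['date', 'A'])
print(df)

# selecting one date level leaves a frame indexed by the remaining level, A
sub = df.loc[20090101]
print(sub)
```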
In [45]:
| date | A | B | C |
|---|---|---|---|
| 20090101 | a | 1 | 2 |
| 20090102 | b | 3 | 4 |
| 20090103 | c | 4 | 5 |
In [51]:
| | foo | bar | baz |
|---|---|---|---|
| 20090101 | a | 1 | 2 |
| 20090102 | b | 3 | 4 |
| 20090103 | c | 4 | 5 |
In [57]:
| | foo | baz |
|---|---|---|
| 0 | a | 2 |
| 1 | b | 4 |
| 2 | c | 5 |
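The two outputs above rename the columns and then keep only a subset of them. One plausible sketch (the original cells are not shown; `names` with `header=0` replaces the header row, and `usecols` can refer to those new names):

```python
import io

import pandas as pd

csv_data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"

# header=0 consumes the original header row; names relabels the columns
renamed = pd.read_csv(io.StringIO(csv_data), header=0,
                      names=['date', 'foo', 'bar', 'baz'], index_col='date')
print(renamed)

# usecols keeps only the listed columns (here referring to the new names)
subset = pd.read_csv(io.StringIO(csv_data), header=0,
                     names=['date', 'foo', 'bar', 'baz'],
                     usecols=['foo', 'baz'])
print(subset)
```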
In [20]:
DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)
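A DatetimeIndex like the one above comes from asking read_csv to parse the date column at read time. A sketch, assuming the same inline CSV:

```python
import io

import pandas as pd

csv_data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"

# parse_dates converts the named column to datetime64 while reading;
# combined with index_col we get a DatetimeIndex
df = pd.read_csv(io.StringIO(csv_data), index_col='date', parse_dates=['date'])
print(df.index)
```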
Quick Review
As mentioned in the previous lecture, we can use all our base Python and NumPy methods on pandas DataFrames. Here is a quick review, taken from the SQL lesson:
In [21]:
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
SQL select:
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
In [6]:
| | total_bill | tip | smoker | time |
|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Dinner |
| 1 | 10.34 | 1.66 | No | Dinner |
| 2 | 21.01 | 3.50 | No | Dinner |
| 3 | 23.68 | 3.31 | No | Dinner |
| 4 | 24.59 | 3.61 | No | Dinner |
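In pandas, the SELECT above is column selection plus `head`. A sketch, using a small inline stand-in for the tips data so it runs without the original file:

```python
import pandas as pd

# a five-row stand-in for the tips data used in the lesson
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'smoker': ['No'] * 5,
    'day': ['Sun'] * 5,
    'time': ['Dinner'] * 5,
    'size': [2, 3, 3, 2, 4],
})

# SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
result = tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
print(result)
```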
SQL where:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00
LIMIT 5;
In [10]:
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 23 | 39.42 | 7.58 | Male | No | Sat | Dinner | 4 |
| 44 | 30.40 | 5.60 | Male | No | Sun | Dinner | 4 |
| 47 | 32.40 | 6.00 | Male | No | Sun | Dinner | 4 |
| 52 | 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
| 59 | 48.27 | 6.73 | Male | No | Sat | Dinner | 4 |
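The WHERE clause corresponds to boolean masking. A sketch on a small stand-in frame (note the parentheses around each condition, and `&` rather than `and`):

```python
import pandas as pd

# a three-row stand-in for the tips data
tips = pd.DataFrame({
    'total_bill': [39.42, 30.40, 10.34],
    'tip': [7.58, 5.60, 1.66],
    'time': ['Dinner', 'Dinner', 'Lunch'],
})

# WHERE time = 'Dinner' AND tip > 5.00 ... LIMIT 5
result = tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)
print(result)
```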
Groupby
Grouping and summarising data allows us to carry out the key data analysis steps of split, apply, combine. The journal article by Hadley Wickham ("The Split-Apply-Combine Strategy for Data Analysis") was one of the first formalisations of the split-apply-combine paradigm, and we can of course do it in Python.
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Let’s continue on with our analysis of the tips data:
In [22]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000015E89FF2C88>
We now have a new data type, the groupby object. We can access its groups attribute: a dict with an entry for each group level, mapping the group label to the indices of the original data frame.
In [23]:
Female
Male
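A sketch of building the groupby object and inspecting its groups, again on a small stand-in frame:

```python
import pandas as pd

# a three-row stand-in for the tips data
tips = pd.DataFrame({
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
})

grouped = tips.groupby('sex')

# .groups maps each group label to the row labels belonging to that group
for name in grouped.groups:
    print(name)
```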
We can do grouping on any axis, or with a custom function (this example is pathological):
In [24]:
{0: ['tip', 'sex', 'day', 'time', 'size'], 1: ['total_bill', 'smoker']}
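The cell above grouped the columns (axis 1) with a custom function. The same idea on the row labels, which is the more common case: when you pass a function to groupby, it is called on each index label to produce the group key. A small sketch (the even/odd split is illustrative, not from the original notebook):

```python
import pandas as pd

df = pd.DataFrame({'x': range(6)})

# the function receives each index label and returns that row's group key
grouped = df.groupby(lambda idx: 'even' if idx % 2 == 0 else 'odd')
print(grouped.groups)
```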
We can use tab completion to see all our methods and attributes:
In [43]:
| sex | total_bill | tip | size |
|---|---|---|---|
| Female | 18.056897 | 2.833448 | 2.459770 |
| Male | 20.744076 | 3.089618 | 2.630573 |
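The table above is the per-group mean of the numeric columns. A sketch on a small stand-in frame (the numbers differ from the full data set):

```python
import pandas as pd

# a three-row stand-in for the tips data
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
})

# mean of each selected column, computed within each group
means = tips.groupby('sex')[['total_bill', 'tip']].mean()
print(means)
```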
We can iterate through groups:
In [36]:
Female
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
Male
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
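Iterating over a groupby yields (name, sub-frame) pairs, which is what produced the output above. A sketch on a stand-in frame:

```python
import pandas as pd

# a three-row stand-in for the tips data
tips = pd.DataFrame({
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
})

# each iteration gives the group label and the matching sub-DataFrame
for name, group in tips.groupby('sex'):
    print(name)
    print(group.head())
```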
To apply, we can use .aggregate:
In [48]:
sex
Female 2.833448
Male 3.089618
Name: tip, dtype: float64
In [55]:
| | total_bill | | | tip | | | size | | |
|---|---|---|---|---|---|---|---|---|---|
| sex | mean | sum | std | mean | sum | std | mean | sum | std |
| Female | 18.056897 | 1570.95 | 8.009209 | 2.833448 | 246.51 | 1.159495 | 2.459770 | 214 | 0.937644 |
| Male | 20.744076 | 3256.82 | 9.246469 | 3.089618 | 485.07 | 1.489102 | 2.630573 | 413 | 0.955997 |
In [58]:
| | size | tip | |
|---|---|---|---|
| sex | &lt;lambda&gt; | mean | sum |
| Female | 6 | 2.833448 | 246.51 |
| Male | 6 | 3.089618 | 485.07 |
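The three outputs above show the main forms of .aggregate (alias .agg): a single function, a list of functions (giving hierarchical columns), and a per-column dict, possibly with a lambda. A sketch of all three on a stand-in frame:

```python
import pandas as pd

# a three-row stand-in for the tips data
tips = pd.DataFrame({
    'tip': [1.01, 1.66, 3.50],
    'size': [2, 3, 3],
    'sex': ['Female', 'Male', 'Male'],
})

grouped = tips.groupby('sex')

# one function applied to one column
per_group_mean = grouped['tip'].agg('mean')

# several functions at once -> hierarchical (MultiIndex) columns
stats = grouped[['tip', 'size']].agg(['mean', 'sum'])

# a different function per column, including a lambda (the range of size)
mixed = grouped.agg({'size': lambda s: s.max() - s.min(), 'tip': 'mean'})

print(per_group_mean, stats, mixed, sep='\n')
```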
We can also filter, transform, plot, count and so on. Take a look in the help for more details!
Joins
We can use a variety of joins in pandas, the most basic being the concat function:
In [24]:
In [66]:
| | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |
In [68]:
| | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 |
| 5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 |
| 6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 |
| 7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 |
In [29]:
| | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | A0 | B0 | C0 | D0 |
| 1 | NaN | NaN | NaN | NaN | A1 | B1 | C1 | D1 |
| 2 | NaN | NaN | NaN | NaN | A2 | B2 | C2 | D2 |
| 3 | NaN | NaN | NaN | NaN | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
| 5 | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
| 6 | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
| 7 | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
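The outputs above are vertical stacking and index-aligned side-by-side concatenation. A smaller sketch of both (two rows per frame rather than four):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4', 'A5'], 'B': ['B4', 'B5']}, index=[4, 5])

# the default (axis=0) stacks the frames vertically
stacked = pd.concat([df1, df2])
print(stacked)

# axis=1 aligns on the index instead; rows missing from one frame get NaN
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```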
For more control, we might want to explicitly use merge!
We have the standard joins: inner, outer (a full join), left and right:
In [30]:
| | key | value_x | value_y |
|---|---|---|---|
| 0 | A | -1.259100 | NaN |
| 1 | B | -2.293441 | -0.569932 |
| 2 | C | -0.415496 | NaN |
| 3 | D | 0.786847 | 1.958187 |
| 4 | D | 0.786847 | -0.275405 |
| 5 | E | NaN | 0.549911 |
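A sketch of the merge that produces a result shaped like the table above (integer values here instead of the random floats in the original; note the duplicated key D yields two rows, and the `value` columns are suffixed `_x`/`_y`):

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': [5, 6, 7, 8]})

# how= selects the join type: 'inner', 'left', 'right' or 'outer' (full join)
merged = pd.merge(left, right, on='key', how='outer')
print(merged)
```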
We have not covered everything: pivot tables and reshaping, window functions (which were completely updated yesterday) and time series. We will cover these as we continue through the course.
Example
From here, we will look at a worked example of data analysis using pandas. In the first lesson, we looked at the example of the tennis fixing scandal, and briefly ran through it. Now we have the skills and knowledge to walk through it, and assess the analysis.
Here’s the link to the original article and the notebook on GitHub.
Exercises
- Update pandas, using conda. Read the documentation for the new window functions.
- Read through the tennis example, and make sure you understand the basic idea of what is being done.
- Rewrite cell 5, so that it does not use the ~ for negation - you might need to google how to do this.
- Write a function or statement to find how many of the report_players are in your list of significant outliers (the output of cell 21).
- Read the following csv in, using io.StringIO. There is no header!
- Read in the data as above, but with -999 being a missing value.
- (advanced) From the tennis data frame, find the player (by hash) who has the worst winning record. Find the player with the best winning record (remember, we have a row for each betting agent on each match!).