Lesson 05 - Pandas part 2

Welcome to lesson 5! In this lesson we will introduce pandas as our main way of storing data. NumPy will still be useful when we have a uniform multidimensional data set, but from now on pandas will be our default.

Again, an exhaustive run-through of pandas is too much for this class - see the book Python for Data Analysis, or the official pandas website.

If you are familiar with R, SAS, and/or SQL, click the links for an introduction to pandas aimed at users of each language.

Please download today's notebook here.

Data Import

Importing data is the essential first step of any analysis. Today we will cover read_csv; before we finish the course we will also talk about how to connect to your Netezza (and other SQL) databases.

In [1]:

from pandas import DataFrame, Series
import pandas as pd
import io
import numpy as np

We have a ton of ways of reading data into and writing data out of pandas. See the dataIO page for more details.
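
As a minimal sketch of the write side (the file name example.csv is just for illustration), we can round-trip a small frame with to_csv and read_csv:

#a minimal IO round trip: DataFrame -> csv -> DataFrame
df = DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.to_csv('example.csv', index=False)  #index=False skips writing the row index
df2 = pd.read_csv('example.csv')       #read it straight back
df.equals(df2)                         #True - the round trip preserved the data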

In [3]:

#using a string as example
#we could refer to file names if we had the file saved
data = '''
date,A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
'''
print(data)
date,A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5

In [3]:

#by default, the index is arange(nrows)
pd.read_csv(io.StringIO(data))
date A B C
0 20090101 a 1 2
1 20090102 b 3 4
2 20090103 c 4 5

In [4]:

#we can specify the index:
pd.read_csv(io.StringIO(data), index_col=0)
#also index_col='date'
A B C
date
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

In [40]:

#we can also use a hierarchical (multi-level) index:
pd.read_csv(io.StringIO(data), index_col=['date','A'])
B C
date A
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

In [5]:

#we can specify data types (this can speed up parsing and avoid unwanted conversions)
pd.read_csv(io.StringIO(data), index_col=['date'], dtype={'A': str, 'B': np.int32, 'C': np.float64})
A B C
date
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

In [6]:

#we can discard the file's column names and supply our own
pd.read_csv(io.StringIO(data), index_col=[0],
            dtype={'A': str, 'B': np.int32, 'C': np.float64},
            names=['foo', 'bar', 'baz'],
            header=0)
foo bar baz
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

In [57]:

#filter out some unneeded columns:
pd.read_csv(io.StringIO(data),
            names=['date', 'foo', 'bar', 'baz'],
            header=0,
            usecols=['foo', 'baz'])
foo baz
0 a 2
1 b 4
2 c 5

In [5]:

#dates! more to come
dat = pd.read_csv(io.StringIO(data),
                  parse_dates=True,
                  index_col=[0])
dat.index
DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)
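
If parse_dates guesses wrong (or you want more speed), a safe fallback is to parse the column yourself with pd.to_datetime and an explicit format string - a minimal sketch using the yyyymmdd dates from our example data:

#parsing dates after reading, with an explicit format string
dat2 = pd.read_csv(io.StringIO(data))
dat2['date'] = pd.to_datetime(dat2['date'].astype(str), format='%Y%m%d')
dat2 = dat2.set_index('date')
dat2.index  #the same DatetimeIndex as above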

DataFrames

DataFrames are similar to a dict of Series - a two-dimensional, tabular data structure whose columns all share a common row index.

Columns are arrays (each column must hold a single data type), while rows behave more like dicts.

However, the row/column mapping is not as strictly enforced as it is in R.
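
To make the "dict of Series" picture concrete, here is a minimal sketch building a DataFrame by hand - the Series are aligned on their indices, and unmatched labels become NaN:

#a DataFrame is close to a dict of Series sharing an index
s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = Series([4.0, 5.0], index=['a', 'c'])  #note: no 'b'
DataFrame({'x': s1, 'y': s2})
#   x    y
#a  1  4.0
#b  2  NaN   <- unmatched label filled with NaN
#c  3  5.0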

In [7]:

dat = pd.read_csv("http://jeremy.kiwi.nz/pythoncourse/assets/tests/r&d/test1data.csv")[1:20]
dat
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
1 30 7 Friday 60538815980 1 SHOES 8931
2 30 7 Friday 7410811099 1 PERSONAL CARE 4504
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
5 26 8 Friday 2006618783 2 PAINT AND ACCESSORIES 1017
6 26 8 Friday 2006613743 1 PAINT AND ACCESSORIES 1017
7 26 8 Friday 7004802737 1 PAINT AND ACCESSORIES 2802
8 26 8 Friday 2238495318 1 PAINT AND ACCESSORIES 4501
9 26 8 Friday 2238400200 -1 PAINT AND ACCESSORIES 3565
10 26 8 Friday 5200010239 1 DSD GROCERY 4606
11 26 8 Friday 88679300501 2 PAINT AND ACCESSORIES 3504
12 26 8 Friday 22006000000 1 MEAT - FRESH & FROZEN 6009
13 26 8 Friday 2236760452 1 PAINT AND ACCESSORIES 7
14 26 8 Friday 88679300501 -1 PAINT AND ACCESSORIES 3504
15 26 8 Friday 2238400200 2 PAINT AND ACCESSORIES 3565
16 26 8 Friday 3019294203 1 PAINT AND ACCESSORIES 2801
17 26 8 Friday 72450408840 1 PAINT AND ACCESSORIES 1028
18 26 8 Friday 25541500000 2 DAIRY 1305
19 26 8 Friday 2310010776 1 PETS AND SUPPLIES 3300

In [12]:

# get the column names
dat.columns
#get the first five rows
dat.head()
#pick out specific columns
DataFrame(dat,columns=['TripType','VisitNumber'])
#same as
dat[['TripType','VisitNumber']]
#get one specific column
dat.TripType
#get one specific column
dat['TripType']
#if we want rows, we need ix, loc or iloc
#loc works on labels in the index:
dat.loc[1]
#iloc works on integer index:
dat.iloc[1]
#ix tries to work like loc, but falls back to iloc if the label is not found
#(note: .ix is deprecated in newer pandas releases - prefer loc/iloc)
dat.ix[1]
#add a new column
dat['foo'] = "spam"
#using other columns:
dat['foo'] = dat['VisitNumber'] + dat['ScanCount']
#add a new column with specific values (rows not in the index get NaN)
dat['foo'] = Series(['spam', 'more spam'], index=[4, 10])
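
The loc/iloc distinction above matters most when the row labels are not simply 0, 1, 2, ...; a minimal sketch with made-up labels:

#loc selects by label, iloc by integer position
tmp = DataFrame({'val': [10, 20, 30]}, index=[2, 0, 1])
tmp.loc[0]   #the row labelled 0 -> val = 20
tmp.iloc[0]  #the row in position 0 -> val = 10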

In [13]:

#delete a column
del dat['foo']
#'http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html'

In [15]:

#recall that indexes are immutable
#so how do we reindex? reindex returns a new object
dat = dat.reindex(np.arange(5))
dat
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
0 NaN NaN NaN NaN NaN NaN NaN
1 30 7 Friday 60538815980 1 SHOES 8931
2 30 7 Friday 7410811099 1 PERSONAL CARE 4504
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017

In [16]:

dat.reindex(np.arange(7), fill_value=0)
dat.reindex(np.arange(10), method='ffill')
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
0 NaN NaN NaN NaN NaN NaN NaN
1 30 7 Friday 60538815980 1 SHOES 8931
2 30 7 Friday 7410811099 1 PERSONAL CARE 4504
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
5 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
6 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
7 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
8 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017
9 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017

In [18]:

dat.drop(1)
#dat.drop('foo', axis = 1)
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
0 NaN NaN NaN NaN NaN NaN NaN
2 30 7 Friday 7410811099 1 PERSONAL CARE 4504
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017

In [19]:

#getting data
dat[['TripType','Upc']]
dat['ScanCount']>1
dat[dat['ScanCount']>1]
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017

Quick Review

In the previous lecture I mentioned that we can use all our base and NumPy methods on pandas DataFrames. Here is a quick review, taken from the SQL lesson:

In [8]:

tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

SQL select: SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;

In [9]:

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner

SQL where: SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00 LIMIT 5;

In [11]:

tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4

Groupby

Grouping and summarising data allows us to carry out the key data analysis steps of split, apply, combine (see the sketch after this list). The journal article by Hadley Wickham was one of the first formalisations of the split-apply-combine paradigm, and we can of course do it in Python:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
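
To make the three steps concrete, here is a minimal sketch doing split-apply-combine by hand on the tips data - groupby does exactly this in one line, and much faster:

#split-apply-combine, spelled out by hand
pieces = {name: group for name, group in tips.groupby('sex')}           #split
means = {name: group['tip'].mean() for name, group in pieces.items()}   #apply
Series(means)                                                           #combine
#the equivalent one-liner:
tips.groupby('sex')['tip'].mean()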

Let’s continue on with our analysis of the tips data:

In [12]:

tipsgroups = tips.groupby('sex')
tipsgroups
<pandas.core.groupby.DataFrameGroupBy object at 0x7f5f60520128>

We now have a new data type, the groupby object. We can access its groups attribute - a dict with each level as its own key, mapped to the indices of the original data frame:

In [13]:

for i,j in tipsgroups.groups.items():
    print(i)
Female
Male

We can group along either axis, or with a custom function (this example is pathological):

In [14]:

def myfun(index):
    if len(index) >= 5:
        return 1
    else:
        return 0

group2 = tips.groupby(myfun, axis = 1)
group2.groups
{0: ['tip', 'sex', 'day', 'time', 'size'], 1: ['total_bill', 'smoker']}

We can use tab completion to see all our methods and attributes:

In [43]:

from matplotlib import pyplot as plt
%matplotlib inline
tipsgroups.mean()
#tipsgroups.boxplot();
total_bill tip size
sex
Female 18.056897 2.833448 2.459770
Male 20.744076 3.089618 2.630573

We can iterate through groups:

In [36]:

for name, group in tipsgroups:
    print(name)
    print(group.head(5))
Female
    total_bill   tip     sex smoker  day    time  size
0        16.99  1.01  Female     No  Sun  Dinner     2
4        24.59  3.61  Female     No  Sun  Dinner     4
11       35.26  5.00  Female     No  Sun  Dinner     4
14       14.83  3.02  Female     No  Sun  Dinner     2
16       10.33  1.67  Female     No  Sun  Dinner     3
Male
   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2

To apply a function to each group, we can use .aggregate:

In [15]:

tipsgroups.aggregate(np.mean)
#selecting columns:
tipsgroups['tip'].aggregate(np.mean)
sex
Female    2.833448
Male      3.089618
Name: tip, dtype: float64

In [16]:

#.agg is short for .aggregate
tipsgroups.agg([np.mean, np.sum, np.std])
total_bill tip size
mean sum std mean sum std mean sum std
sex
Female 18.056897 1570.95 8.009209 2.833448 246.51 1.159495 2.459770 214 0.937644
Male 20.744076 3256.82 9.246469 3.089618 485.07 1.489102 2.630573 413 0.955997

In [17]:

#we can also use a dict, to do different things to different columns:
tipsgroups.agg({'tip': [np.mean, np.sum], 'size':lambda x: max(x)})
tip size
mean sum <lambda>
sex
Female 2.833448 246.51 6
Male 3.089618 485.07 6

We can also filter, transform, plot, count and more - take a look at the help for details!
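
For instance, filter keeps or drops whole groups, while transform returns a result aligned row-for-row with the original frame - a quick sketch (the cutoff of 100 rows is arbitrary):

#filter: keep only the groups satisfying a condition
tipsgroups.filter(lambda g: len(g) > 100)  #only the Male group has > 100 rows
#transform: a result the same shape as the input, e.g. group-wise centering
tipsgroups['tip'].transform(lambda x: x - x.mean())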

Joins

We can use a variety of joins in pandas; the most basic tool is the concat function:

In [18]:

df1 = DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                     index=[0, 1, 2, 3])
df2 = DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                     index=[4, 5, 6, 7])

In [19]:

#joins on index
pd.concat([df1, df2])
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7

In [20]:

#concat along the column axis - still aligned on the index!
pd.concat([df1, df2], axis = 1)
A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4
5 NaN NaN NaN NaN A5 B5 C5 D5
6 NaN NaN NaN NaN A6 B6 C6 D6
7 NaN NaN NaN NaN A7 B7 C7 D7

In [21]:

#we can ignore the index!
#(note: .append is deprecated in newer pandas - use pd.concat([df1, df2], ignore_index=True))
df1.append(df2, ignore_index=True)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7

For more control, we can explicitly use merge!

We have the standard SQL-style joins - inner, left, right, and (full) outer:

In [78]:

df1 = DataFrame({'key': ['A', 'B', 'C', 'D'],
                 'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': np.random.randn(4)})
#SQL:
#SELECT *
#FROM df1
#INNER JOIN df2
#  ON df1.key = df2.key;
pd.merge(df1, df2, on='key')
#SQL:
#SELECT *
#FROM df1
#LEFT OUTER JOIN df2
#  ON df1.key = df2.key;
pd.merge(df1, df2, on='key', how='left')
#SQL:
#SELECT *
#FROM df1
#RIGHT OUTER JOIN df2
#  ON df1.key = df2.key;
pd.merge(df1, df2, on='key', how='right')
#SQL:
#SELECT *
#FROM df1
#FULL OUTER JOIN df2
#  ON df1.key = df2.key;
pd.merge(df1, df2, on='key', how='outer')
key value_x value_y
0 A -2.275516 NaN
1 B 0.050553 2.301355
2 C 0.943035 NaN
3 D -0.237517 0.449717
4 D -0.237517 -1.692712
5 E NaN -1.079463

We have not yet covered several topics: pivot tables and reshaping, window functions (which were completely updated on Sunday), and time series. We will cover these as we continue through the course.
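
As a small taste of what is coming, pivot_table does a grouped aggregation and a reshape in one step - a one-line sketch on the tips data:

#mean tip broken down by day (rows) and time (columns)
tips.pivot_table(values='tip', index='day', columns='time', aggfunc='mean')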

Example

From here, we will look at a worked example of data analysis using pandas. In the first lesson we looked at the tennis match-fixing scandal and briefly ran through it; now we have the skills and knowledge to walk through it and assess the analysis.

Here’s the link to the original article and the notebook on GitHub.

Exercises

  • Update pandas, using conda. Read the documentation for the new window functions.

  • Read through the tennis example, and make sure you understand the basic idea of what is being done

  • Rewrite cell 5 so that it does not use ~ for negation - you might need to Google how to do this

  • Write a function or statement to find how many of the report_players are in your list of significant outliers (the output of cell 21)

  • Read the following CSV in, using io.StringIO. Note that there is no header!

In [1]:

data = """\
10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301
10/08/2012,12:10:11,name2,-999,-999,-999,-999,10.87,1.40,0.0099,9.70,0.0686
"""
  • Read in the data as above, but treating -999 as a missing value

  • (advanced) From the tennis data frame, find the player (by hash) who has the worst winning record. Find the player with the best winning record (remember, we have a row for each betting agent on each match!)