Test 02 - Answers

Test 2: Machine learning

Here are the answers, see the notebook here.

In this test we will use the entire dataset from the walmart kaggle challenge, do some feature engineering and data munging, then fit a random forest model to our data.

Again, the data is a csv file which contains one line for each scan on their system, with a Upc, Weekday, ScanCount, DepartmentDescription and FinelineNumber.

The VisitNumber column groups our data into baskets - Every unique VisitNumber is a unique basket, with a basket possibly containing multiple scans.

The label is the TripType column, which is Walmarts proprietary way of clustering their visits into categories. We wish to match their algorithm, and predict the category of some of our held out data.

This time we will use the full dataset - we have about 650,000 lines, in about 100,000 baskets. Just as a heads up, using 100 classifiers, my answer to the test takes less than 3 minutes to run - no need for hours and hours of computation.

If you do need to run this script multiple times, download the dataset from the website rather than redownloading each time, as it’s around 30 mb.

Please answer the questions in the cells below them - feel free to answer out of order, but leave comments saying where you carried out the answer. I am working more or less step by step through my answer - Feel free to add on extra predictors if you can think of them.

1. Import the modules you will use for the rest of the test:

In [1]:

import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
import operator
from sklearn import pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score

2. Read in the data, and check its head. The data is available on the website at: http://jeremy.kiwi.nz/pythoncourse/assets/tests/test2data.csv

In [2]:

dat = pd.read_csv("c:/users/jeremy/desktop/kaglewalmart/data/train.csv")
dat.head()

	TripType	VisitNumber	Weekday	Upc	ScanCount	DepartmentDescription	FinelineNumber
0	999	5	Friday	6.811315e+10	-1	FINANCIAL SERVICES	1000.0
1	30	7	Friday	6.053882e+10	1	SHOES	8931.0
2	30	7	Friday	7.410811e+09	1	PERSONAL CARE	4504.0
3	26	8	Friday	2.238404e+09	2	PAINT AND ACCESSORIES	3565.0
4	26	8	Friday	2.006614e+09	2	PAINT AND ACCESSORIES	1017.0

3. Fix the Weekday and DepartmentDescription into dummified data. For now they can be seperate dataframes

In [3]:

#now fix the categorical variables
weekdum = pd.get_dummies(dat['Weekday'])
weekdum.head()
departdum = pd.get_dummies(dat['DepartmentDescription'])
departdum.head()

	...	SHOES
0	...	0.0
1	...	1.0
2	...	0.0
3	...	0.0
4	...	0.0

5 rows × 68 columns

4. Drop the unneeded columns from the raw data - I suggest removing - ‘Weekday’, ‘Upc’, ‘DepartmentDescription’ and ‘FinelineNumber’ (we could dummify Upc and FineLine, but this will massively increase our data size.)

In [4]:

#drop the useless columns:
dat = dat.drop(['Weekday', 'Upc', 'DepartmentDescription', 'FinelineNumber'], axis = 1)
dat.head()

	TripType	VisitNumber	ScanCount
0	999	5	-1
1	30	7	1
2	30	7	1
3	26	8	2
4	26	8	2

5. Correct the Dummified data for number bought in each ScanCount. I would recommend something like:

departdummies.multiply(dat['ScanCount'], axis = 0)

In [5]:

#correct for scancount
departdum = departdum.multiply(dat['ScanCount'], axis = 0)
departdum['ScanCount'] = dat['ScanCount']
dat = dat.drop(['ScanCount'], axis = 1)

6. Concatenate back together the dummy variables with the main dataframe

In [6]:

dat = pd.concat([dat, weekdum, departdum], axis = 1)
dat.head()

	TripType	VisitNumber	Friday	1-HR PHOTO	...	SEASONAL	SERVICE DELI	SHEER HOSIERY	SHOES	SLEEPWEAR/FOUNDATIONS	SPORTING GOODS	SWIMWEAR/OUTERWEAR	TOYS	WIRELESS	ScanCount
0	999	5	1.0	-0.0	...	-0.0	-0.0	-0.0	-0.0	-0.0	-0.0	-0.0	-0.0	-0.0	-1
1	30	7	1.0	0.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1
2	30	7	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1
3	26	8	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2
4	26	8	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2

5 rows × 78 columns

7. Summarise the data for each basket (hint, if you groupby columns, an .agg() method will not apply to them)

In [7]:

dat1 = dat.groupby(['TripType', 'VisitNumber']).agg(sum)
dat1.head()

		Friday	Monday	Saturday	Sunday	Thursday	Tuesday	Wednesday	1-HR PHOTO	ACCESSORIES	AUTOMOTIVE	...	SEASONAL	SERVICE DELI	SHEER HOSIERY	SHOES	SLEEPWEAR/FOUNDATIONS	SPORTING GOODS	SWIMWEAR/OUTERWEAR	TOYS	WIRELESS	ScanCount
TripType	VisitNumber
3	106	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2
	121	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2
	153	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2
	162	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2
	164	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2

5 rows × 76 columns

8. Use the reset_index() method to remove your groupings. As we did not cover multiple indices in the lesson, my answer was

dat1 = dat1.reset_index()

In [8]:

dat1 = dat1.reset_index()

9. Split the data into training and testing sets: Use 0.25 of the data in the test set.

In [9]:

classes = dat1.TripType
dat1 = dat1.drop('TripType', axis = 1)
classes.head()

X_train, X_test, y_train, y_test = \
    train_test_split(dat1, classes, test_size = 0.25, random_state = 0)

10. Construct at least two more features for the data - For Example, a 1/0 variable for if any product was returned (ScanCount < 0). You might want to do this step before splitting the data as above

In [10]:

#lots of good answers here!

11. Plot the training data using matplotlib or seaborn. Choose at least 3 meaningful plots to present aspects of the data.

In [11]:

#lots of good answers here

12. Take out the TripType from our dataframe - we don’t want our label as a feature.

Make sure to save it somewhere though, as our model needs to be fit to these labels.

In [12]:

#see part 9

13. Describe and fit a pipeline that carries out a kfold crossvalidation randomforest model on the data. Include any relevant preprocessing steps such as centering and scaling. The kfold might need to be outside the pipeline.

In [13]:

pipe = pipeline.Pipeline(steps=[
                ('rf', RandomForestClassifier())
        #we don't really need preprocessing, the bootstrapped nature of RF takes care of it for use
        #most people chose a sensible option anyway!
        ])

#kfold, using the Kfold package
#most people did this
kf = KFold(len(X_train.index), n_folds=3)
for train_index, test_index in kf:
    pipe.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    print(pipe.score(X_test, y_test))

0.626322170659
0.626991094945
0.622684894853

In [14]:

#I really wanted this, but explained it very poorly in the original question.
#apologies!
scores = cross_val_score(pipe, X_test, y_test, cv=5)
scores

array([ 0.61009174,  0.61923639,  0.6049718 ,  0.62408377,  0.61556208])

14. Modify your pipeline to include a grid search for a variable in the RandomForest model. Try at least 3 values, choose a sensible variable to optimise. (NB this question has changed from the initial version)

In [15]:

estimators = {'rf__n_estimators':list(range(10, 30, 10))}
gs = GridSearchCV(pipe, param_grid=estimators)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'rf__n_estimators': [10, 20]}, pre_dispatch='2*n_jobs',
       refit=True, scoring=None, verbose=0)

In [16]:

gs.best_estimator_

Pipeline(steps=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

15. What is the score of the model on the training data?

In [21]:

#we could just use our fitted CV here...
model = pipeline.Pipeline(steps=[
                ('rf', RandomForestClassifier(n_estimators = 20))])
model.fit(X_train, y_train)
model.score(X_train, y_train)

0.99396557731168556

16. What is the score of the model on the testing data?

In [22]:

model.score(X_test, y_test)

0.64045319620385466

17. What is the most important variable? Can you explain the model?

In [23]:

importances = model.named_steps['rf'].feature_importances_
max_index, max_value = max(enumerate(importances), key=operator.itemgetter(1))
print('Feature {x} was the most important, with an importance value of {y}'.format(x = dat1.columns[max_index], y = max_value))

Feature ScanCount was the most important, with an importance value of 0.16822170595499625

Thanks for taking the Python Course!

Please save your notebook file as ‘your name - test2.ipynb’, and email it to jeremycgray+pythoncourse@gmail.com by the 2nd of May.

	...	SHOES
0	...	0.0
1	...	1.0
2	...	0.0
3	...	0.0
4	...	0.0

	...	SHOES
0	...	0.0
1	...	1.0
2	...	0.0
3	...	0.0
4	...	0.0

	...	SHOES
0	...	0.0
1	...	1.0
2	...	0.0
3	...	0.0
4	...	0.0