Test 1 answers - r&d
Test 1 - R&D stream
The data science department at Walmart often posts problems on the data science challenge website, Kaggle. In this test, we will download some of their data and use our current understanding of Python to do some exploratory data analysis.
The challenge this data comes from was to revise the method they use to classify trip types. They have a number of categories into which shopping trips are clustered, and they challenged entrants to recapitulate their clustering. The challenge is archived here. The training data set contains the TripType, as well as VisitNumber, day of the week, Upc of the product purchased, the ScanCount, department, and fineline number (a categorical description of the item). Each VisitNumber is a unique basket, with a line for each item scanned. I am providing you with the first 50,000 lines of data, from the 650,000 total.
The desired clustering is the categorical variable TripType.
You can complete this course without any knowledge of Pandas and Numpy - I am loading the data in like this, as it is the easiest way (by far). Please leave the code blocks in the Data Import section untouched - run them as needed. Feel free to download the csv from the website and check it out, but use Python for the analysis!
Data Import
Here I’m loading the data from the course website, showing the first 5 lines, and putting it into a dictionary, where each VisitNumber has its own entry.
The dict is called groups.
Please run the cells below (cursor in cell, then ctrl-enter, or click run cell).
In [1]:
In [2]:
In [3]:
| | TripType | VisitNumber | Weekday | Upc | ScanCount | DepartmentDescription | FinelineNumber |
|---|---|---|---|---|---|---|---|
| 0 | 999 | 5 | Friday | 68113152929 | -1 | FINANCIAL SERVICES | 1000 |
| 1 | 30 | 7 | Friday | 60538815980 | 1 | SHOES | 8931 |
| 2 | 30 | 7 | Friday | 7410811099 | 1 | PERSONAL CARE | 4504 |
| 3 | 26 | 8 | Friday | 2238403510 | 2 | PAINT AND ACCESSORIES | 3565 |
| 4 | 26 | 8 | Friday | 2006613744 | 2 | PAINT AND ACCESSORIES | 1017 |
In [4]:
Data Checking
Please run the cell below. You should get:
[[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]]
[[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0], [30, 7, 'Friday',
7410811099.0, 1, 'PERSONAL CARE', 4504.0]]
If not, please redownload the notebook from the website.
In [5]:
[[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]]
[[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0], [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]]
Test
Please answer the questions below in the cells provided, and run them for output.
If you’d like another cell, use alt-enter or the insert menu.
If you’d like to enter text to explain, either use # for comments, or add a new cell, then use the dropdown box above to convert it to markdown (from code).
Some data is missing or non-numeric. Remember to check for these data points and remove or fix them!
Don’t worry about printing the outputs. Assign them to a variable so I can check them.
1. Create a new dict which contains the same keys, but with a list of the unique DepartmentDescription values of the items for each visit, i.e. {7: ['SHOES', 'PERSONAL CARE']}
In [6]:
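One possible approach to question 1, as a minimal sketch rather than the official answer. The small `groups` dict here is a stand-in built from the first five rows shown in the table above, and the names `DEPT`, `unique_departments` and `visit_departments` are my own. It assumes a missing department shows up as a non-string value (NaN or None).

```python
# Tiny stand-in for `groups`, built from the first five rows shown above.
# Row layout: [TripType, VisitNumber, Weekday, Upc, ScanCount,
#              DepartmentDescription, FinelineNumber]
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}

DEPT = 5  # index of DepartmentDescription within a row


def unique_departments(rows):
    """Return the distinct department names in a visit, in first-seen order."""
    seen = []
    for row in rows:
        dept = row[DEPT]
        # Assumption: a missing department is stored as NaN or None,
        # so anything that is not a string is skipped.
        if not isinstance(dept, str):
            continue
        if dept not in seen:
            seen.append(dept)
    return seen


visit_departments = {}
for visit_number, rows in groups.items():
    visit_departments[visit_number] = unique_departments(rows)

print(visit_departments)
```

Keeping a list (rather than a set) preserves the order in which departments were scanned, which matches the example format in the question.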
2. If you used a function to do this, now use a comprehension. If you used a comprehension, now use a function.
In [7]:
In [8]:
True
In [9]:
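A sketch of the comprehension route for question 2, with a plain loop alongside it so the two results can be compared. Again the `groups` dict is a small stand-in built from the table above, and the variable names are my own. The `dict.fromkeys` trick is just one way to drop repeats while keeping order.

```python
# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}

# Comprehension version: dict.fromkeys keeps first-seen order and drops repeats.
# Non-string departments (assumed to be missing values) are filtered out.
visit_departments_comp = {
    visit_number: list(dict.fromkeys(
        row[5] for row in rows if isinstance(row[5], str)
    ))
    for visit_number, rows in groups.items()
}

# Loop version, kept here only so the two results can be compared.
visit_departments_loop = {}
for visit_number, rows in groups.items():
    depts = []
    for row in rows:
        if isinstance(row[5], str) and row[5] not in depts:
            depts.append(row[5])
    visit_departments_loop[visit_number] = depts

print(visit_departments_comp == visit_departments_loop)  # prints True
```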
3. Create a new dict with the total number of each category that each customer bought. It should look like {7: [['SHOES', 1], ['PERSONAL CARE', 1]], ...}
In [10]:
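A sketch for question 3. I am assuming here that "total number" means the sum of ScanCount per department, so returns (negative ScanCount) reduce the total. Counting rows instead would be an equally reasonable reading. The stand-in `groups` dict and the names `department_totals` and `visit_totals` are my own.

```python
# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}


def department_totals(rows):
    """Sum ScanCount per department and return [[department, total], ...]."""
    totals = {}
    for row in rows:
        scan_count, dept = row[4], row[5]
        # Assumption: missing departments are non-strings and are skipped.
        if not isinstance(dept, str):
            continue
        totals[dept] = totals.get(dept, 0) + scan_count
    # Convert to the list-of-lists layout asked for in the question.
    return [[dept, total] for dept, total in totals.items()]


visit_totals = {
    visit_number: department_totals(rows)
    for visit_number, rows in groups.items()
}

print(visit_totals)
```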
4. Create a new dict, which contains each customer as a key, with a list of day shopped, TripType, and summed ScanCount (total items bought).
In [11]:
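A sketch for question 4. It assumes the weekday and TripType are the same on every row of a visit, so they are read from the first row only. The stand-in `groups` dict and the name `visit_summary` are my own.

```python
# Tiny stand-in for `groups`, built from the first five rows shown above.
# Row layout: [TripType, VisitNumber, Weekday, Upc, ScanCount,
#              DepartmentDescription, FinelineNumber]
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}

# [day shopped, TripType, summed ScanCount] for each visit.
# Assumption: Weekday and TripType do not change within a visit.
visit_summary = {
    visit_number: [rows[0][2], rows[0][0], sum(row[4] for row in rows)]
    for visit_number, rows in groups.items()
}

print(visit_summary)
```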
5. Create a Visit class which contains all the data we have about each visit, with the minimum amount of repetition.
In [12]:
6. Add methods into the class which will describe total items scanned, most common DepartmentDescription and most common FinelineNumber
In [13]:
7. Turn the current dict into a dict of {VisitNumber : Visit} using the new class.
In [14]:
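Questions 5 to 7 fit together, so here is one sketch covering all three. The attribute and method names are my own choices, not a required interface. Visit-level fields (TripType, Weekday) are stored once, and only the per-item fields are kept per row. "Most common" is counted by rows here, not weighted by ScanCount, and ties go to whichever value was scanned first.

```python
from collections import Counter

# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}


def is_missing(value):
    """True for None or NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


class Visit:
    """One basket: visit-level fields stored once, item fields per row."""

    def __init__(self, visit_number, rows):
        self.visit_number = visit_number
        # Assumption: TripType and Weekday are constant within a visit.
        self.trip_type = rows[0][0]
        self.weekday = rows[0][2]
        # Keep only the fields that vary from item to item.
        self.items = [
            {'upc': r[3], 'scan_count': r[4],
             'department': r[5], 'fineline': r[6]}
            for r in rows
        ]

    def total_items(self):
        """Sum of ScanCount over the basket (returns count as negative)."""
        return sum(item['scan_count'] for item in self.items)

    def _most_common(self, field):
        counts = Counter(
            item[field] for item in self.items if not is_missing(item[field])
        )
        return counts.most_common(1)[0][0] if counts else None

    def most_common_department(self):
        return self._most_common('department')

    def most_common_fineline(self):
        return self._most_common('fineline')


# Question 7: {VisitNumber: Visit}
visits = {number: Visit(number, rows) for number, rows in groups.items()}

print(visits[8].total_items(), visits[8].most_common_department())
```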
8. Create a method in your class to calculate the similarity of a visit to another visit based on DepartmentDescription (you could use total number of shared categories, or Cosine Similarity)
In [15]:
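A sketch of the cosine similarity option for question 8. Each visit is treated as a vector of row counts per department, and the score is the dot product divided by the product of the vector lengths. The trimmed-down `Visit` class here only holds what the method needs, and its names are my own.

```python
import math
from collections import Counter

# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}


class Visit:
    """Trimmed-down visit holding only per-department row counts."""

    def __init__(self, visit_number, rows):
        self.visit_number = visit_number
        # Assumption: missing departments are non-strings and are skipped.
        self.department_counts = Counter(
            row[5] for row in rows if isinstance(row[5], str)
        )

    def similarity(self, other):
        """Cosine similarity of the two department-count vectors."""
        a, b = self.department_counts, other.department_counts
        # Only departments present in both visits contribute to the dot product.
        dot = sum(a[dept] * b[dept] for dept in a.keys() & b.keys())
        norm_a = math.sqrt(sum(count * count for count in a.values()))
        norm_b = math.sqrt(sum(count * count for count in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0  # a visit with no usable departments matches nothing
        return dot / (norm_a * norm_b)


visits = {number: Visit(number, rows) for number, rows in groups.items()}

print(visits[7].similarity(visits[8]))
```

Cosine similarity ignores basket size, so a small basket and a large one with the same mix of departments score close to 1. Counting shared categories instead would favour large baskets.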
9. Compare every basket to every other basket using your similarity score - Which baskets are the two most similar? How did you score this?
In [19]:
In [23]:
[[5564, 7008], 23.421313765753897]
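A sketch of the all-pairs comparison for question 9, using the number of shared departments as the score. This is not necessarily the scoring behind the output above. The stand-in `groups` dict is built from the table rows, plus one hypothetical visit 0 (invented for illustration, reusing the SHOES row) so that at least one pair overlaps.

```python
from itertools import combinations

# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
    # Hypothetical visit, NOT in the real data: added so one pair overlaps.
    0: [[30, 0, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0]],
}


def shared_departments(rows_a, rows_b):
    """Score two baskets by how many distinct departments they share."""
    depts_a = {row[5] for row in rows_a if isinstance(row[5], str)}
    depts_b = {row[5] for row in rows_b if isinstance(row[5], str)}
    return len(depts_a & depts_b)


best_pair, best_score = None, -1
# combinations() yields each unordered pair once and never pairs a visit
# with itself, so n visits give n * (n - 1) / 2 comparisons.
for (number_a, rows_a), (number_b, rows_b) in combinations(groups.items(), 2):
    score = shared_departments(rows_a, rows_b)
    if score > best_score:
        best_pair, best_score = [number_a, number_b], score

most_similar = [best_pair, best_score]
print(most_similar)
```

On the full data set the number of pairs grows quadratically with the number of visits, so it can be worth computing the department sets once per visit before the loop rather than inside `shared_departments`.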
10. Optional, open-ended question - Can you get an idea of how the TripType categories are determined? Hint: tally TripTypes against categories, fineline categories, day of the week, and number of items.
Include your code and a quick description. They used a machine learning algorithm, so don’t worry about complete accuracy; a qualitative explanation is perfect.
In [None]:
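As a starting point for the hint, here is a sketch that tallies departments against TripType. It only shows the tallying mechanics on a stand-in `groups` dict built from the table rows above. It says nothing about how the real categories were derived, and the name `trip_type_departments` is my own.

```python
from collections import Counter, defaultdict

# Tiny stand-in for `groups`, built from the first five rows shown above.
groups = {
    5: [[999, 5, 'Friday', 68113152929.0, -1, 'FINANCIAL SERVICES', 1000.0]],
    7: [[30, 7, 'Friday', 60538815980.0, 1, 'SHOES', 8931.0],
        [30, 7, 'Friday', 7410811099.0, 1, 'PERSONAL CARE', 4504.0]],
    8: [[26, 8, 'Friday', 2238403510.0, 2, 'PAINT AND ACCESSORIES', 3565.0],
        [26, 8, 'Friday', 2006613744.0, 2, 'PAINT AND ACCESSORIES', 1017.0]],
}

# TripType -> Counter of how many scanned rows fall in each department.
trip_type_departments = defaultdict(Counter)
for rows in groups.values():
    for row in rows:
        trip_type, dept = row[0], row[5]
        # Assumption: missing departments are non-strings and are skipped.
        if isinstance(dept, str):
            trip_type_departments[trip_type][dept] += 1

# Show the most frequent departments for each TripType.
for trip_type, counts in sorted(trip_type_departments.items()):
    print(trip_type, counts.most_common(3))
```

The same pattern works for the other hints: swap `row[5]` for `row[6]` (FinelineNumber) or `row[2]` (Weekday) and compare the resulting tallies across TripTypes.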