Kaggle data challenge post 1.
This is the first post of an occasional series on R, data science, Kaggle and my quest to make it through the book “The elements of statistical learning”.
For a first post, here is a write-up of my current Kaggle problem. This knitr document is an archived version - it won’t hit my best score, and I haven’t done much data processing other than making predictors. At the time of writing it was enough to hit about the 50th percentile, and was about 0.1 lower than my best score.
Here’s the Kaggle challenge “Facebook Recruiting IV: Human or Robot?” - link.
The challenge presents as an auction website, which has been overtaken by bots outbidding real people. The user is to determine which users are bots and which are real. Presumably the owner will then ban the bots.
This code is available as R Markdown on GitHub at www.github.com/jeremycg/kaggle
Let’s read in some libraries for analysis:
And now let’s read in the data. I’m using fread from the data.table package, as it is a large file.
Let’s take a look:
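As a rough sketch of the reading step: the snippet below writes a two-row stand-in file to a temporary location and reads it back with fread, so it runs without the Kaggle download. The column names follow the description in this post; the toy values and the temporary file are made up for illustration, and on the real data you would point fread at wherever you saved the bids file.

```r
library(data.table)

# Write a tiny stand-in for the bids file so the example runs anywhere.
# The values are invented; only the column layout mirrors the post.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "bid_id,bidder_id,auction,merchandise,device,time,country,ip,url",
  "0,u1,a1,jewelry,phone0,100,us,1.1.1.1,ref0",
  "1,u2,a1,jewelry,phone1,105,in,2.2.2.2,ref1"
), tmp)

# fread() guesses the separator and column types on its own
bids <- fread(tmp)
dim(bids)
```

fread returns a data.table, which is also a data.frame, so it can be passed straight to dplyr verbs.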
So, we have 2013 users to train on, with a little over 7.6 million bids.
The user data has obfuscated columns for bidder id, payment account and address, as well as the outcome - was it a bot?
The bid data has a bid id, bidder id (to link to the training data), an auction id, merchandise type (this can change per auction depending on how the user found it), device used to bid, time, country, ip, and referring url.
We don’t really need to merge the datasets yet - we can add the training data on at the end.
Now we want predictors!
First let’s double check that there are no shared addresses or payment details in our training set:
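One way to run this check, sketched on a made-up three-row training table (the column names `payment_account` and `address` are my assumption about how the file labels them):

```r
# Toy stand-in for the training data; values are invented.
train <- data.frame(
  bidder_id       = c("u1", "u2", "u3"),
  payment_account = c("p1", "p2", "p3"),
  address         = c("ad1", "ad2", "ad3"),
  outcome         = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first repeated value.
shared_address <- anyDuplicated(train$address) > 0
shared_payment <- anyDuplicated(train$payment_account) > 0

c(shared_address = shared_address, shared_payment = shared_payment)
```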
Nope! So we can put this to the side and not use it in any predictors.
I’m going to use dplyr on this data, as it is really large - data.table is also nice.
Let’s slowly build up functions to make our table of predictors.
First, how many total bids did each user make? We can assume bots are making a ton.
We group_by bidder_id and then use n() to count the rows in each group. n() is a dplyr-specific function implemented in Rcpp, so it is much faster than length or nrow.
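A minimal sketch of this predictor on a toy bids table. The function name `nbids` and the output column `totalbids` are names I picked for illustration, not necessarily the ones in the original knitr file.

```r
library(dplyr)

# Toy bids: u1 bids three times, u2 once.
bids <- data.frame(
  bidder_id = c("u1", "u1", "u1", "u2"),
  auction   = c("a1", "a1", "a2", "a1"),
  stringsAsFactors = FALSE
)

# Count the rows (bids) in each bidder's group.
nbids <- function(x) {
  x %>%
    group_by(bidder_id) %>%
    summarise(totalbids = n())
}

totals <- nbids(bids)
totals
```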
Next, we can take the number of bids per auction, on average. Again, I would guess bots are very high.
First, group the data by bidder_id and auction, take the number of bids each user had on each auction. Then we group it again by user, and find the average for each. mean() is again very fast in dplyr.
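The two-stage grouping could look something like this. It is a sketch on toy data: `meanbids` and `meanbidsperauction` are illustrative names, and I use `count()` as shorthand for the first group-and-tally step.

```r
library(dplyr)

# Toy bids: u1 has 2 bids on a1 and 1 on a2; u2 has 1 bid on a1.
bids <- data.frame(
  bidder_id = c("u1", "u1", "u1", "u2"),
  auction   = c("a1", "a1", "a2", "a1"),
  stringsAsFactors = FALSE
)

meanbids <- function(x) {
  x %>%
    # Stage 1: one row per user/auction pair, with the bid count in "bids"
    count(bidder_id, auction, name = "bids") %>%
    # Stage 2: average those counts within each user
    group_by(bidder_id) %>%
    summarise(meanbidsperauction = mean(bids))
}

avgbids <- meanbids(bids)
avgbids
```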
Next, the number of distinct auctions bid on. This will depend on how the bot is coded: if a bot is used to push up prices on specific auctions, it may have a low number; otherwise it might be high.
First, let’s group the data into users and auctions, then regroup into users and use n().
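Sketched on toy data, the idea is to collapse the bids to unique user/auction pairs and then count the pairs per user. The name `numauctions` is illustrative.

```r
library(dplyr)

# Toy bids: u1 bids on two different auctions, u2 on one.
bids <- data.frame(
  bidder_id = c("u1", "u1", "u1", "u2"),
  auction   = c("a1", "a1", "a2", "a1"),
  stringsAsFactors = FALSE
)

numauctions <- function(x) {
  x %>%
    distinct(bidder_id, auction) %>%   # one row per user/auction pair
    group_by(bidder_id) %>%
    summarise(numauctions = n())       # pairs per user = distinct auctions
}

auctions <- numauctions(bids)
auctions
```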
Now maybe bots use a user agent switcher so they appear to use different devices? All user agents in this data set are classed as unique phone models, so if a user uses 8000 phones, they are probably a bot.
Again, using dplyr we group and use n().
A bot might only be going after one specific type of merchandise - let’s check.
After running this predictor, we get no users with more than 1 type of merchandise! Weird.
So, a bot’s most common (or only!) merchandise might be of a particular type - i.e. home goods might not be worth coding a bot for. Let’s check - it’s easy as we know each user has only one type.
Now let’s try country number - someone bidding from multiple countries is up to something.
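The "how many different X did this user have" predictors (devices, countries) all share one shape. Here is a compact sketch that swaps in dplyr's `n_distinct()` as a shortcut for the group-then-count approach; the output column names are made up.

```r
library(dplyr)

# Toy bids: u1 uses two devices from two countries, u2 one of each.
bids <- data.frame(
  bidder_id = c("u1", "u1", "u1", "u2"),
  device    = c("phone1", "phone2", "phone1", "phone3"),
  country   = c("us", "us", "in", "in"),
  stringsAsFactors = FALSE
)

# n_distinct() counts unique values within each bidder's group.
counts <- bids %>%
  group_by(bidder_id) %>%
  summarise(
    numdevices   = n_distinct(device),
    numcountries = n_distinct(country)
  )

counts
```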
Now find the most common country - some countries might be more susceptible to bots than others.
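A sketch of one way to pull out the most common value per user. `mostcommon` is a hypothetical helper of mine; note that on ties it simply returns the alphabetically first value, which may or may not be what you want.

```r
library(dplyr)

# Toy bids: u1 bids mostly from "us", u2 only from "in".
bids <- data.frame(
  bidder_id = c("u1", "u1", "u1", "u2"),
  country   = c("us", "us", "in", "in"),
  stringsAsFactors = FALSE
)

# table() tallies each value (sorted by name); which.max() picks the
# first of the largest tallies, and names() recovers the value itself.
mostcommon <- function(v) names(which.max(table(v)))

topcountry <- bids %>%
  group_by(bidder_id) %>%
  summarise(topcountry = mostcommon(country))

topcountry
```

The same helper would work for any other categorical column, such as merchandise.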
Now let’s do some time series stuff. If the bots are poorly coded, they will always bid exactly, say, 10 seconds after the last bid. So, if we can calculate the diffs for each, we will get a picture of anyone doing something weird.
This is a little trickier, and very slow as it is harder to vectorise. I’ll use diff and add a 0 at the start. We will only run this once, and then use the data frame for multiple analyses.
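Here is a rough sketch of the diff step on toy data. I am assuming the differences are taken within each auction, ordered by time, so that each bid gets the time since the previous bid on that auction; the first bid in an auction gets 0.

```r
library(dplyr)

# Toy bids across two auctions; times are invented.
bids <- data.frame(
  bidder_id = c("u1", "u2", "u1", "u2", "u2"),
  auction   = c("a1", "a1", "a1", "a2", "a2"),
  time      = c(100, 110, 120, 50, 55),
  stringsAsFactors = FALSE
)

timediffbids <- bids %>%
  arrange(auction, time) %>%
  group_by(auction) %>%
  # diff() returns one fewer value than its input, so pad with a leading 0
  mutate(timediff = c(0, diff(time))) %>%
  ungroup()

timediffbids
```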
Now let’s use that data to get the mean time since the last bid and the percentage of bids in which the user was the first. I’m getting multiple predictors here as they are quicker when gathered at once. Call this on timediffbids.
Now depending on how the site works, a bid against yourself might be smart (if there is a reserve), or a way of bidding up the price (bot). Either way, let’s make it a feature. This one again is slow. Call this on timediffbids.
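A sketch of gathering both predictors in one pass, using a hand-built timediffbids table. Two assumptions are baked in: a timediff of 0 is treated as "first bid in the auction" (which would also catch genuinely simultaneous bids), and those zeros are included in the mean. The function and column names are illustrative.

```r
library(dplyr)

# Toy timediffbids: timediff is time since the previous bid on the
# same auction, with 0 for the first bid of each auction.
timediffbids <- data.frame(
  bidder_id = c("u1", "u2", "u1", "u2", "u2"),
  auction   = c("a1", "a1", "a1", "a2", "a2"),
  timediff  = c(0, 10, 10, 0, 5),
  stringsAsFactors = FALSE
)

timefeatures <- function(x) {
  x %>%
    group_by(bidder_id) %>%
    summarise(
      meantimediff = mean(timediff),       # average gap, zeros included
      propfirst    = mean(timediff == 0)   # share of bids that opened an auction
    )
}

tf <- timefeatures(timediffbids)
tf
```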
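One possible way to flag self-bids, sketched on toy data: within each auction, compare every bid's bidder with the bidder of the bid just before it. The names `selfbids` and `propselfbid` are mine, and reporting a proportion rather than a count is an assumption.

```r
library(dplyr)

# Toy bids: u2 outbids themselves once on auction a2.
bids <- data.frame(
  bidder_id = c("u1", "u2", "u1", "u2", "u2"),
  auction   = c("a1", "a1", "a1", "a2", "a2"),
  time      = c(100, 110, 120, 50, 55),
  stringsAsFactors = FALSE
)

selfbids <- function(x) {
  x %>%
    arrange(auction, time) %>%
    group_by(auction) %>%
    mutate(
      prev    = dplyr::lag(bidder_id),          # previous bidder on this auction
      selfbid = !is.na(prev) & prev == bidder_id
    ) %>%
    group_by(bidder_id) %>%
    summarise(propselfbid = mean(selfbid))      # share of bids made against yourself
}

sb <- selfbids(bids)
sb
```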
Now let’s add in the number of final bids. We can assume bots either are trying to jack up prices but lose, or win lots of auctions. Either way there should be a signal here. Let’s get proportion final bids, and total number. Again, run this on the timediffbids.
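A sketch of the final-bid predictors on toy data, assuming "final" means the latest bid seen on each auction. If two bids tie for the latest time, both are counted as final here. Function and column names are illustrative.

```r
library(dplyr)

# Toy bids: u1 places the last bid on a1, u2 the last bid on a2.
bids <- data.frame(
  bidder_id = c("u1", "u2", "u1", "u2", "u2"),
  auction   = c("a1", "a1", "a1", "a2", "a2"),
  time      = c(100, 110, 120, 50, 55),
  stringsAsFactors = FALSE
)

finalbids <- function(x) {
  x %>%
    group_by(auction) %>%
    mutate(final = time == max(time)) %>%   # TRUE for the latest bid per auction
    group_by(bidder_id) %>%
    summarise(
      numfinal  = sum(final),               # total number of final bids
      propfinal = mean(final)               # share of the user's bids that were final
    )
}

fb <- finalbids(bids)
fb
```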
Next we have the percentage of mismatched bids. It is possible that users are more likely to find something in a category no one else used to find it, whereas bots know exactly what they want. We know from above that no one switches type midstream, which simplifies the analysis.
Let’s check now for use of multiple IP addresses. This is probably highly correlated with the number of countries:
As the phones are based on models, it is possible the bot software is only using a small subset - let’s get the most common phone type used for each user.
Shared use of the same ip could be an indication of cheating - let’s find a value for number of shared addresses.
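One way to turn this into a number, sketched on toy data: count, for each user, how many of their IP addresses were also used by at least one other user. This definition of "shared" and the names `sharedip` and `sharedips` are my assumptions.

```r
library(dplyr)

# Toy bids: u1 and u2 both bid from ipA; ipB and ipC have a single user.
bids <- data.frame(
  bidder_id = c("u1", "u1", "u2", "u3"),
  ip        = c("ipA", "ipB", "ipA", "ipC"),
  stringsAsFactors = FALSE
)

sharedip <- function(x) {
  x %>%
    distinct(bidder_id, ip) %>%             # one row per user/ip pair
    group_by(ip) %>%
    mutate(users_on_ip = n()) %>%           # how many users touched this ip
    group_by(bidder_id) %>%
    summarise(sharedips = sum(users_on_ip > 1))
}

sh <- sharedip(bids)
sh
```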
We could do some fun graph stuff with shared ip addresses - i.e. see who is connected and call whole networks scams - I’ll not do that here. This might be the key to getting over 0.9?
For now, we have a ton of predictors, and a few more to come that are easier once they are all made. I’m going to call it all at once to make a data frame with them all in.
This will take a while! I recommend saving it as a file so you don’t have to run it again. In the raw knitr file, I’ve set it to not evaluate, so change that if you are playing along.
We now have 15 predictors! Let’s read in the file, and make a couple of composite ones.
Now comes the actual machine learning! We can do fun things like scale and turn categories into dummy variables. For now, I’ll just leave it, and hope the fitting package takes care of it.
I’m going to use gradient boosting - as it is fast. Random Forest will probably be better, but it will take forever (about an hour on my laptop).
First, let’s set our fitting parameters. I’m using 10-fold cross-validation.
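This post doesn't name the fitting package in the text, so treat the following as a sketch that assumes caret. It sets up 10-fold cross-validation and asks for class probabilities so that ROC-based summaries can be computed; the object name `fitControl` is illustrative.

```r
library(caret)

# 10-fold cross-validation; classProbs = TRUE is required for
# twoClassSummary, which reports ROC, sensitivity and specificity.
fitControl <- trainControl(
  method          = "cv",
  number          = 10,
  classProbs      = TRUE,
  summaryFunction = twoClassSummary
)

fitControl$method
fitControl$number
```

The control object would then be handed to `train()` through its `trControl` argument.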
Now we need to merge and partition our data. First we merge to make sure everything is in a sensible order. We then split out predictors and outcomes.
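A small sketch of the merge-and-split step with made-up data. The feature table, its `totalbids` column and the "human"/"bot" labels are all illustrative; turning the outcome into a labelled factor is my assumption about what a classifier would want.

```r
# Toy training labels and a toy predictor table (note the different row order).
train <- data.frame(
  bidder_id = c("u1", "u2", "u3"),
  outcome   = c(0, 1, 0),
  stringsAsFactors = FALSE
)
features <- data.frame(
  bidder_id = c("u2", "u1"),
  totalbids = c(10, 3),
  stringsAsFactors = FALSE
)

# merge() lines rows up by bidder_id and keeps only users present in both.
merged <- merge(train, features, by = "bidder_id")

# Split into a predictor table and an outcome vector.
predictors <- merged[, setdiff(names(merged), c("bidder_id", "outcome")), drop = FALSE]
outcome    <- factor(merged$outcome, levels = c(0, 1), labels = c("human", "bot"))

merged
```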
Now gradient boosting
Random Forest (not run)
Now we can predict!
First read in the test data, and then merge it. There are 70 bidders in the test data that do not appear in the bids sheet. Let’s give them the average value of our prediction.
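The fill-in for bidders without any bids could be sketched like this, using invented predictions. A left merge keeps every test bidder, and the gaps are then replaced with the mean of the predictions we do have.

```r
# Toy test set: u2 never appears in the bids, so has no prediction.
test <- data.frame(
  bidder_id = c("u1", "u2", "u3"),
  stringsAsFactors = FALSE
)
preds <- data.frame(
  bidder_id  = c("u1", "u3"),
  prediction = c(0.2, 0.6),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every test bidder; missing predictions become NA.
out <- merge(test, preds, by = "bidder_id", all.x = TRUE)

# Replace the NAs with the average of the available predictions.
out$prediction[is.na(out$prediction)] <- mean(preds$prediction)

out
```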
Now let’s predict.
This gave me a score of 0.76567 - about 0.1 worse than my best answer! So we need to do some data clean up and more sensible feature choice. It’s a good start for a walkthrough though.