This is the first post of an occasional series on R, data science, Kaggle, and my quest to make it through the book “The Elements of Statistical Learning”.

For a first post, here is a write-up of my current Kaggle problem. This knitr document is an archived version - it won’t hit my best score, and I haven’t done much data processing other than making predictors. At the time of writing it was enough to hit about the 50th percentile, and was about 0.1 lower than my best score.

Here’s the Kaggle challenge, “Facebook Recruiting IV: Human or Robot?” - link.

The challenge presents an auction website that has been overtaken by bots outbidding real people. The task is to determine which users are bots and which are real, presumably so the site owner can ban the bots.

This code is available as R markdown on GitHub at www.github.com/jeremycg/kaggle

Let’s read in some libraries for analysis:

library(data.table)
library(plyr)
library(dplyr)
library(caret)
library(knitr)
library(bit64)

And now let’s read in the data. I’m using fread from the data.table package, as it is a large file.

setwd("C:/Users/jeremy/Desktop/kaggle/kaggle/fbbid/")
train<-fread("train.csv")
bids<-fread("bids.csv")
## Read 7656334 rows and 9 (of 9) columns from 0.862 GB file in 00:00:19

Let’s take a look:

kable(head(train))

|bidder_id                             |payment_account                       |address                               |outcome |
|:-------------------------------------|:-------------------------------------|:-------------------------------------|:-------|
|91a3c57b13234af24875c56fb7e2b2f4rb56a |a3d2de7675556553a5f08e4c88d2c228754av |a3d2de7675556553a5f08e4c88d2c228vt0u4 |0       |
|624f258b49e77713fc34034560f93fb3hu3jo |a3d2de7675556553a5f08e4c88d2c228v1sga |ae87054e5a97a8f840a3991d12611fdcrfbq3 |0       |
|1c5f4fc669099bfbfac515cd26997bd12ruaj |a3d2de7675556553a5f08e4c88d2c2280cybl |92520288b50f03907041887884ba49c0cl0pd |0       |
|4bee9aba2abda51bf43d639013d6efe12iycd |51d80e233f7b6a7dfdee484a3c120f3b2ita8 |4cb9717c8ad7e88a9a284989dd79b98dbevyi |0       |
|4ab12bc61c82ddd9c2d65e60555808acqgos1 |a3d2de7675556553a5f08e4c88d2c22857ddh |2a96c3ce94b3be921e0296097b88b56a7x1ji |0       |
|7eaefc97fbf6af12e930528151f86eb91bafh |a3d2de7675556553a5f08e4c88d2c228yory1 |5a1d8f28bc31aa6d72bef2d8fbf48b967hra3 |0       |
kable(head(bids))

|bid_id |bidder_id                             |auction |merchandise |device |time             |country |ip             |url             |
|:------|:-------------------------------------|:-------|:-----------|:------|:----------------|:-------|:--------------|:---------------|
|0      |8dac2b259fd1c6d1120e519fb1ac14fbqvax8 |ewmzr   |jewelry     |phone0 |9759243157894736 |us      |69.166.231.58  |vasstdc27m7nks3 |
|1      |668d393e858e8126275433046bbd35c6tywop |aeqok   |furniture   |phone1 |9759243157894736 |in      |50.201.125.84  |jmqlhflrzwuay9c |
|2      |aa5f360084278b35d746fa6af3a7a1a5ra3xe |wa00e   |home goods  |phone2 |9759243157894736 |py      |112.54.208.157 |vasstdc27m7nks3 |
|3      |3939ac3ef7d472a59a9c5f893dd3e39fh9ofi |jefix   |jewelry     |phone4 |9759243157894736 |in      |18.99.175.133  |vasstdc27m7nks3 |
|4      |8393c48eaf4b8fa96886edc7cf27b372dsibi |jefix   |jewelry     |phone5 |9759243157894736 |in      |145.138.5.37   |vasstdc27m7nks3 |
|5      |e8291466de91b0eb4e1515143c7f74dexy2yr |3vi4t   |mobile      |phone7 |9759243157894736 |ru      |91.107.221.27  |vasstdc27m7nks3 |
nrow(train)

## [1] 2013

nrow(bids)

## [1] 7656334

So, we have 2013 users to train on, with a little over 7.6 million bids.

The training data has obfuscated values for bidder id, payment account and address, as well as the outcome: was this user a bot?

The bid data has a bid id, a bidder id (linking back to the training data), an auction id, a merchandise type (this can change per auction depending on how the user found it), the device used to bid, a time, a country, an IP address, and a referring url.

We don’t really need to merge the datasets yet - we can join the outcomes on at the end.

Now we want predictors!

First let’s double check that there are no shared addresses or payment details in our training set:

anyDuplicated(train$payment_account)
## [1] 0
anyDuplicated(train$address)
## [1] 0

Nope! So we can put these columns to the side and not use them in any predictors.

I’m going to use dplyr on this data, as it is really large and dplyr is fast - data.table is also nice (an example of its syntax follows the first function below).

Let’s slowly build up functions to make our table of predictors.

First, how many total bids did each user make? We can assume bots are making a ton.

We group_by bidder_id and then use n() to take the length. n() is a dplyr-specific function implemented in Rcpp, so it is much faster than length or nrow.

totalbids<-function(dataframe){
  dataframe%>%group_by(bidder_id)%>%summarise(totalbids=n())
}
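
For comparison, the same count in data.table syntax would be a one-liner (a sketch only - the rest of this post sticks with dplyr):

totalbids_dt<-bids[,.(totalbids=.N),by=bidder_id]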

Next, we can take the average number of bids per auction. Again, I would guess bots are very high.

First, group the data by bidder_id and auction, and take the number of bids each user made on each auction. Then we group it again by user, and find the average for each. mean() is again fast inside summarise().

meanbids<-function(dataframe){
  dataframe%>%group_by(bidder_id,auction)%>%summarise(numberbids=n())%>%
    group_by(bidder_id)%>%summarise(meanbids=mean(numberbids))
}

Next, the number of distinct auctions bid on. This will depend on the bot’s coding - if a bot is used to push up prices on specific auctions, maybe it will have a low number; otherwise it might be high.

First, let’s group the data into users and auctions, then regroup into users and use n().

numauctions<-function(dataframe){
  dataframe%>%group_by(bidder_id,auction)%>%summarise(numberbids=n())%>%
    group_by(bidder_id)%>%summarise(numauctions=n())
}

Now, maybe bots use a user agent switcher so they appear to use different devices? All user agents in this dataset are classed as unique phone models, so if a user has used 8000 phones, they are probably a bot.

Again, we group and use n():

numdevices<-function(dataframe){
  dataframe%>%group_by(bidder_id,device)%>%summarise(timesused=n())%>%
    group_by(bidder_id)%>%summarise(numdevices=n())
}

A bot might only be going after one specific type of merchandise - let’s check:

numtypes<-function(dataframe){
  dataframe%>%group_by(bidder_id,merchandise)%>%summarise(timeseachtype=n())%>%
    group_by(bidder_id)%>%summarise(numtypes=n())
}

After running this predictor, we get no users with more than 1 type of merchandise! Weird.
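
We can verify that directly with the function we just wrote - per the observation above, this should return 0:

numtypes(bids)%>%filter(numtypes>1)%>%nrow()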

So, a bot’s most common (or only!) merchandise type might tend to be of a particular kind - i.e. home goods might not be worth coding a bot for. Let’s grab it - it’s easy, as we know there is no replication:

commonmerch<-function(dataframe){
  dataframe%>%group_by(bidder_id)%>%summarise(commonmerch=merchandise[1])
}

Now let’s try country number - someone bidding from multiple countries is up to something.

numcountry<-function(dataframe){
  dataframe%>%group_by(bidder_id,country)%>%summarise(n())%>%
    group_by(bidder_id)%>%summarise(numcountry=n())
}

Now find the most common country - some countries might be more susceptible to bots than others.

commoncountry<-function(dataframe){
  dataframe%>%group_by(bidder_id,country)%>%summarise(number=n())%>%
    group_by(bidder_id)%>%summarise(commoncountry=country[which.max(number)])
}

Now let’s do some time series stuff. If the bots are poorly coded, they will always bid exactly, say, 10 seconds after the last bid. So, if we calculate the time differences between successive bids within each auction, we will get a picture of anyone doing something weird.

This is a little trickier, and very slow as it is harder to vectorise. I’ll use diff and add a 0 at the start. We will only run this once, and then use the data frame for multiple analyses.

differences<-function(df){
  #order each auction's bids by time, then replace time with the gap since the previous bid
  df<-df[order(df$time), ]
  #time is integer64 (from bit64), so wrap the leading 0 in as.integer64 to stop c() mangling it
  df$time<-c(as.integer64(0),diff(df$time))
  df
}
timediffbids<-bids%>%group_by(auction)%>%do(.,differences(.))

Now let’s use that data to get the mean time since the last bid, and the percentage of bids in which the user was the first bidder in the auction. I’m gathering multiple predictors here as they are quicker to compute at once. Call this on timediffbids.

bidtimes<-function(dataframe){
  dataframe%>%group_by(bidder_id)%>%
    summarise(averagetimetobid=mean(time),percentfirst=sum(time==0)/n())
}

Now, depending on how the site works, a bid against yourself might be smart (if there is a reserve), or a way of bidding up the price (bot). Either way, let’s make it a feature. This one is again slow. Call it on timediffbids.

bidself<-function(df){
  df$self<-0
  #an auction with a single bid can't contain a self-bid
  if(nrow(df)==1){
    return(df)}
  #shift bidder_id down one place to line each bid up with the previous bidder
  df$prevbidder<-c("none",df$bidder_id[-nrow(df)])
  df$self[df$bidder_id==df$prevbidder]<-1
  df$prevbidder<-NULL
  df
}

bidagainstself<-function(dataframe){
  dataframe%>%group_by(auction)%>%do(.,bidself(.))%>%
    group_by(bidder_id)%>%summarise(percentself=sum(self==1)/n())
}

Now let’s add in final bids. We can assume bots are either trying to jack up prices but losing, or winning lots of auctions. Either way, there should be a signal here. Let’s get both the total number of final bids and the proportion. Again, run this on timediffbids.

numfinal<-function(dataframe){
  #the last bid in each (time-ordered) auction is the winning one
  dataframe%>%group_by(auction)%>%mutate(last = c(rep(0,n()-1), 1))%>%
    group_by(bidder_id)%>%summarise(numfinal=sum(last),percentfinal=mean(last))
}

Next, the percentage of mismatched bids - it is possible real users are more likely to find an auction through a category no one else used, whereas bots know exactly what they want. We know from above that no one switches merchandise type midstream, which simplifies the analysis.

#pretty brutal function!

percentmismatch<-function(dataframe){
  #first, find the most common merchandise type in each auction
  mostcommoninauction<-dataframe%>%group_by(auction,merchandise)%>%
    summarise(number=n())%>%group_by(auction)%>%
    summarise(mostcommoninauction=merchandise[which.max(number)])
  #then flag each user's auctions where their (single) type doesn't match it
  merge(mostcommoninauction,dataframe,by="auction")%>%group_by(bidder_id,auction)%>%
    summarise(mismatch=sum(mostcommoninauction[1]!=merchandise[1]))%>%group_by(bidder_id)%>%
    summarise(percentmismatch=mean(mismatch))
}

Now let’s check for the use of multiple IP addresses. This is probably highly correlated with the number of countries:

numipaddress<-function(dataframe){
  dataframe%>%group_by(bidder_id,ip)%>%summarise(n())%>%
    group_by(bidder_id)%>%summarise(numipaddress=n())
}

As the devices are classed as phone models, it is possible the bot software only uses a small subset - let’s get the most common phone type used by each user.

commonphone<-function(dataframe){
  dataframe%>%group_by(bidder_id,device)%>%summarise(number=n())%>%
    group_by(bidder_id)%>%summarise(commonphone=device[which.max(number)])
}

Shared use of the same IP could be an indication of cheating - let’s get, for each user, the proportion of their bids made from a shared address.

percentsharedip<-function(dataframe){
  #flag each ip as shared (1) if more than one bidder has used it, 0 otherwise
  shareips<-dataframe%>%group_by(ip,bidder_id)%>%summarise(usesofaddress=n())%>%
    group_by(ip)%>%summarise(totalbidders=sum(n()>1))
  #then take the proportion of each user's bids made from a shared ip
  merge(shareips,dataframe,by="ip")%>%group_by(bidder_id)%>%
    summarise(percentsharedip=mean(totalbidders))
}

We could do some fun graph stuff with shared IP addresses - i.e. see who is connected, and call whole networks scams. I’ll not do that here, but it might be the key to getting over 0.9.
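
As a sketch of the idea (not run here - it assumes the igraph package, and treats bidders and IPs as nodes of a single graph):

library(igraph)
#one edge per distinct bidder-ip pair
edges<-bids%>%distinct(bidder_id,ip)
ipgraph<-graph_from_data_frame(edges,directed=FALSE)
#bidders sharing an ip (directly or through chains of ips) land in the same component
comps<-components(ipgraph)
membership<-data.frame(node=names(comps$membership),component=comps$membership)

A component containing one known bot would make every other bidder in it suspect.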

For now, we have a ton of predictors, plus a few composite ones that are easier to make once these exist. I’m going to call everything at once to make a single data frame of predictors.

This will take a while! I recommend saving the result as a file so you don’t have to run it again. In the raw knitr file, I’ve set this chunk not to evaluate, so change that if you are playing along.

fulldata<-join_all(list(totalbids(bids),meanbids(bids),numauctions(bids),
                        numdevices(bids),commonmerch(bids),numcountry(bids),
                        commoncountry(bids),bidtimes(timediffbids),
                        bidagainstself(timediffbids),numfinal(timediffbids),
                        percentmismatch(bids),numipaddress(bids),
                        commonphone(bids),percentsharedip(bids)),
                   by='bidder_id',type='full')
write.csv(fulldata,file="fulldata.csv",row.names=FALSE)

We now have a full table of predictors! Let’s read the file back in, and make a couple of composite ones.

fulldata<-read.csv("fulldata.csv")
fulldata$ippercountry<-fulldata$numipaddress/fulldata$numcountry
fulldata$wonauctions<-fulldata$numfinal/fulldata$numauctions

Now comes the actual machine learning! We could do fun things like scaling, and turning categories into dummy variables. For now, I’ll just leave the data as-is, and hope the fitting package takes care of it.
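
If we did want to preprocess, caret has tools for both - a minimal sketch, not part of my pipeline (and assuming the column names from fulldata above):

#centre and scale the numeric columns
numericcols<-sapply(fulldata,is.numeric)
scaler<-preProcess(fulldata[,numericcols],method=c("center","scale"))
scaled<-predict(scaler,fulldata[,numericcols])
#expand the categorical columns into 0/1 indicator columns
dummies<-dummyVars(~commoncountry+commonphone,data=fulldata)
dummycols<-predict(dummies,fulldata)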

I’m going to use gradient boosting, as it is fast. A random forest will probably do better, but it takes forever (about an hour on my laptop).

First, let’s set our fitting parameters. I’m using 10-fold cross-validation, repeated 10 times:

fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePred=T)

Now we need to merge and partition our data. First we merge to make sure everything is in a sensible order. We then split out predictors and outcomes.

mergeddata<-merge(fulldata,train,by="bidder_id")
mergeddata<-as.data.frame(mergeddata)
#drop bidder_id and the last three columns (payment_account, address, outcome)
predictors<-mergeddata[,-c(1,(ncol(mergeddata)-2):(ncol(mergeddata)))]
outcomes<-mergeddata[,ncol(mergeddata)]

Now, gradient boosting. As the outcome is numeric 0/1, caret will fit a regression-style gbm, so the predictions come out as scores rather than classes:

gbmFit1 <- train(x=predictors,y=outcomes,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE)
print(gbmFit1)

Random forest (not run):

#rffit1<-train(x=predictors,y=outcomes,method="rf",
#              trControl=fitControl,verbose=FALSE)
#print(rffit1)

Now we can predict!

First read in the test data, and then merge it. There are 70 bidders in the test data who never appear in the bids data, so they have no predictors. Let’s give them the average value of our predictions.

test<-fread("test.csv")
mergeddata<-merge(fulldata,test,by="bidder_id")
mergeddata<-as.data.frame(mergeddata)
#drop bidder_id and the last two columns (payment_account, address)
predictors<-mergeddata[,-c(1,(ncol(mergeddata)-1):(ncol(mergeddata)))]

Now let’s predict.

x<-data.frame(bidder_id=mergeddata$bidder_id,
              prediction=predict(gbmFit1,predictors),
              stringsAsFactors=FALSE)
#the regression fit can predict below 0 - clamp to a valid probability
x$prediction[x$prediction<0]<-0
#bidders with no bids get the mean prediction
togiveempty<-mean(x$prediction)
empty<-data.frame(bidder_id=test[!(test$bidder_id %in% fulldata$bidder_id)]$bidder_id,
                  prediction=togiveempty,
                  stringsAsFactors=FALSE)
x<-rbind(x,empty)
write.csv(x,"output.csv",row.names=FALSE)

This gave me a score of 0.76567 - about 0.1 worse than my best answer! So we need to do some data cleanup and make more sensible feature choices. It’s a good start for a walkthrough, though.