Recently I needed to install and configure a local version of RHadoop.
I ended up using the Revolution Analytics RHadoop packages, running inside RStudio Server on a CentOS-based Cloudera quickstart VM.
The exact instructions didn't quite work for me (Java problems), so here is what I did.
Set up VirtualBox and the Cloudera image
First download and install VirtualBox to run your VM.
Next, get a copy of the Cloudera quickstart VM. It’s about 4GB and needs unzipping and copying to run, so plan accordingly.
Once you have VirtualBox installed, import the image using File > Import Appliance.
The default settings are OK, but you will want to change the network settings so you can access RStudio Server from your host OS. My network is easy - on the settings panel, simply choose Network and set Attached to: Bridged Adapter.
Note: Mind your network security - on a bridged adapter, your VM may be accessible to anyone on the network.
Now boot the VM and wait for it to start up. It will take a while at the boot screen.
Get and install R and dependencies
Once Cloudera boots, open the console and install R and R-devel (my version of Cloudera came with the EPEL repos enabled; see here for instructions if not).
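With EPEL enabled, the install is a single yum command:

```shell
# R and R-devel come from the EPEL repository on CentOS
sudo yum install -y R R-devel
```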
Now we need to install the various R packages that the Hadoop packages depend on.
Then, inside R, install the packages and quit.
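Open R as root (sudo R, so the packages install system-wide) and install the dependencies. This list is roughly what rmr2 and rhdfs needed at the time I set this up - as noted below, your versions may ask for one or two others:

```r
# Dependencies for rmr2 (and rJava for rhdfs); run as root so they
# install into the system library
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest",
                   "functional", "stringr", "plyr", "reshape2",
                   "caTools", "rJava"))
q()
```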
Get and install rHadoop packages
Now we want to download the RHadoop packages. For now I only want to do some map reducing, so I only need rmr2 and rhdfs.
The packages aren't on CRAN, but are available on GitHub.
Note: You don't want the Windows version if you are using Cloudera.
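Something like the following downloads the source tarballs from the RevolutionAnalytics repositories - the version numbers here are examples from when I did this, so check the rmr2 and rhdfs GitHub pages for the current releases and adjust the filenames:

```shell
# Example versions only - check github.com/RevolutionAnalytics for current releases
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
wget https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.8.tar.gz?raw=true -O rhdfs_1.0.8.tar.gz
```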
Before installing, we must set up some environment variables to allow R to see Hadoop.
Add the lines:
to both ~/.bashrc and /etc/profile (e.g. sudo nano /etc/profile).
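The lines to add are the two Hadoop environment variables. The paths below are from my Cloudera quickstart VM - if yours differ, locate the streaming jar yourself:

```shell
# Paths from the Cloudera quickstart VM; if yours differ, find the jar with:
#   find / -name 'hadoop-streaming*.jar' 2>/dev/null
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar
```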
Now we can install rmr2 and rhdfs in R
Again, open R then install
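From the directory containing the downloaded tarballs, open R with sudo and install from the local files (the filenames match whatever versions you downloaded - mine are examples):

```r
# Install from the local source tarballs rather than CRAN;
# filenames are examples - use the versions you downloaded
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
q()
```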
If you get an error, install the package it asks for - I might have missed one or two above.
Install RStudio Server
You can follow the instructions given on the RStudio site:
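At the time it boiled down to downloading the RPM and installing it with yum - the version number below is an example, so take the current one from the RStudio download page:

```shell
# Example version - get the current RPM URL from the RStudio site
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
sudo yum install -y --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
```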
RStudio Server runs as a service, so it should start immediately after installation. You can get your VM's IP from:
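```shell
sudo service rstudio-server status   # confirm the service is running
ifconfig                             # the inet addr on eth0 is the VM's IP
```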
Then access it from your host OS at http://givenip:8787, logging in with the VM's user credentials.
Initial Hadoop setup
Now we can use Hadoop. In R, type the following to set where your Hadoop command, streaming jar and Java home are, then load the libraries and initialise HDFS:
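The paths here are from my Cloudera quickstart VM - in particular the JAVA_HOME is an example, so substitute whatever JDK your image ships with:

```r
# Paths assume the Cloudera quickstart VM; adjust to match your system
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
Sys.setenv(JAVA_HOME = "/usr/java/jdk1.7.0_67-cloudera")  # example path

library(rhdfs)
library(rmr2)
hdfs.init()
```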
There will be a couple of warnings, but hopefully no errors.
These will need to be set every time you want to use Hadoop, so it may be worth putting them in an .Rprofile.
If you get an error containing "Unsupported major.minor version 51.0" when running hdfs.init(), your Java is set up incorrectly - make sure the JAVA_HOME you defined above points to Java 1.7.0 or greater.
Hadoop hello world
Using the mtcars dataset, we can see what Hadoop can do:
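A minimal sketch of such a job, keying each row of mtcars by its cylinder count and averaging every column within each group (this assumes the environment setup above has been run):

```r
# Push mtcars into HDFS, then compute per-cylinder column means
mt <- to.dfs(mtcars)

out <- mapreduce(
  input  = mt,
  # key each row by its cyl value; the value is the row itself
  map    = function(k, v) keyval(v$cyl, v),
  # for each cyl, average every column of the collected rows
  reduce = function(k, vv) keyval(k, t(colMeans(vv)))
)

from.dfs(out)
```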
Which will give some technical info about your Hadoop run:
And some data:
A mean for each variable, grouped by cylinder.
We can also do the canonical hello world for Hadoop: a word count, this time on a text file of Moby Dick:
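A sketch of the word-count job, assuming the text file has already been copied into HDFS (the path is an example):

```r
# First put the file into HDFS from a shell, e.g.:
#   hadoop fs -put mobydick.txt /user/cloudera/mobydick.txt
wc <- mapreduce(
  input        = "/user/cloudera/mobydick.txt",  # example HDFS path
  input.format = "text",
  # split each line into lowercase words, emitting (word, 1) pairs
  map    = function(k, lines)
             keyval(unlist(strsplit(tolower(lines), "\\s+")), 1),
  # sum the counts for each word
  reduce = function(word, counts) keyval(word, sum(counts))
)

counts <- from.dfs(wc)
```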
You should get something like this (we should strip punctuation if we want better results):
Is this better than something you could run in dplyr in two seconds? For now, no. The beauty of RHadoop is that, using mapreduce, we can keep large data out of memory and spread work across servers, running over extremely large files. You should now have a system up and running where you can try out mapreduce jobs and start thinking about big data analysis.