RHIPE: An Interface to Hadoop and R for Large and Complex Data Analysis




Ron Fredericks writes: Dr. Saptarshi Guha created an open-source interface between R and Hadoop called the R and Hadoop Integrated Processing Environment, or RHIPE for short. LectureMaker was on the scene filming Saptarshi’s RHIPE presentation to the Bay Area useR Group, introduced by Michael E. Driscoll and hosted at Facebook’s Palo Alto office on March 9th, 2010. Special thanks to Jyotsna Paintal for helping me film the event.

Saptarshi received his Ph.D. from Purdue University in 2010, advised by Dr. William S. Cleveland. As of the last update to this blog post, he works at Revolution Analytics in Palo Alto.

Hadoop is an open-source implementation of both the MapReduce programming model and the underlying distributed file system that Google developed to support web-scale data.

The MapReduce programming model was designed by Google to provide a clean abstraction between large-scale data analysis tasks and the underlying systems challenges of ensuring reliable large-scale computation. A job that adheres to the MapReduce model can be parallelized easily, and the programmer does not have to think about system-level details such as synchronization, concurrency, or hardware failure.

Reference: “5 Common Questions About Hadoop,” a Cloudera blog post by Christophe Bisciglia, May 2009.
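To make the division of labor concrete, here is a minimal sketch of the MapReduce pattern in plain R, with no Hadoop involved; the word-count task, function names, and sample lines are illustrative only. The analyst supplies just a map step that emits (key, value) pairs and a reduce step that combines the values for each key; grouping by key, distribution, and fault tolerance are the framework's job.

Code (r)
## Minimal local illustration of the MapReduce pattern (no Hadoop involved).
## map: one input record -> a list of (key, value) pairs
map.fn <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nchar(words) > 0]
  lapply(words, function(w) list(key = w, value = 1))
}

## reduce: one key and all of its values -> a single result
reduce.fn <- function(key, values) sum(unlist(values))

## The "framework" part: map every record, shuffle (group by key), then reduce.
lines   <- c("to be or not to be", "to think is to be")
pairs   <- unlist(lapply(lines, map.fn), recursive = FALSE)
keys    <- sapply(pairs, `[[`, "key")
vals    <- lapply(pairs, `[[`, "value")
grouped <- split(vals, keys)
counts  <- sapply(names(grouped), function(k) reduce.fn(k, grouped[[k]]))
counts["to"]   # 4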

RHIPE allows the R programmer to submit large datasets to Hadoop, where the Map, Combine, Shuffle, and Reduce steps carry out the analysis at high speed. See the figure below for an overview of the video’s key points and use cases.
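Before the case-study code later in this post, it helps to see the general shape of a RHIPE job: the map and reduce steps are ordinary R expressions, rhmr() bundles them with HDFS input and output locations into a job object, and rhex() submits the job to Hadoop. The sketch below follows that pattern; the folder names and the expression bodies are placeholders, not code from the talk.

Code (r)
library(Rhipe)

## Map step: for each input record, emit (key, value) pairs with rhcollect().
map <- expression({
  lapply(seq_along(map.values), function(i) {
    ## placeholder body: derive a key and value from map.keys[[i]] / map.values[[i]]
    rhcollect(map.keys[[i]], map.values[[i]])
  })
})

## Reduce step: combine all the values that arrive for one key.
reduce <- expression(
  pre    = { acc <- list() },
  reduce = { acc <- append(acc, reduce.values) },
  post   = { rhcollect(reduce.key, acc) }
)

## Bundle the expressions and HDFS folders into a job, then submit it.
job <- rhmr(map = map, reduce = reduce,
            ifolder = "/path/to/input", ofolder = "/path/to/output",
            inout = c("text", "map"))
rhex(job)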


The RHIPE Video

Note: On May 30, 2013, LectureMaker’s new video player, version 4.4, was put into use on this page. “This new version works with LectureMaker’s eCommerce, massive streaming storage, and eLearning plugins. In addition, you can now use the full-screen feature and see automated tool-tip displays in the timeline.” — Ron Fredericks, Co-founder, LectureMaker LLC


Video Topics and Navigation Table

Navigation Dot | Elapsed Time | Description of Topics (plus hot links into the video)
22 | 0.00% | Beginning
21 | 1.00% | Introduction to Dr. Saptarshi Guha
20 | 3.00% | Introduction to RHIPE
19 | 4.40% | Analysis of very large data sets
18 | 8.10% | Overview of Hadoop
17 | 13.6% | High performance computing with existing R packages
16 | 15.6% | High performance computing with RHIPE: MapReduce interface to R
15 | 24.4% | Case study: VOIP
14 | 30.3% | Case study, step 1: Convert raw data to R dataset
13 | 35.4% | Case study, step 2: Feed data to a reducer
12 | 41.6% | Case study, step 3: Compute summaries
11 | 45.7% | Case study, step 4: Create new objects
10 | 49.3% | Case study, step 5: Statistical routines across subsets
9 | 52.7% | Case study: VOIP summary
8 | 54.6% | Another example: Dept. of Homeland Security
7 | 62.0% | RHIPE on EC2: Indiana bio-terrorism project
6 | 72.8% | Q: What is the discrepancy between sampled data?
5 | 78.7% | RHIPE on EC2: Simulation timing
4 | 80.6% | RHIPE Todo list
3 | 86.3% | RHIPE Lessons learned
2 | 95.0% | Q: What optimization methods were used?
1 | 99.9% | Credits

 

Code Examples from the Video

Source code highlighter note: R and RHIPE language constructs are color-coded and hot-linked to the appropriate online resources. Click these links to learn more about the programming features. I manage the R/RHIPE source code highlighter project on my engineering site here: R highlighter.

Move Raw Data Into Hadoop File System for Use In R Data Frames

Code (r)
## Case Study – VoIP
## Copy text data to HDFS
rhput('/home/sguha/pres/voip/text/20040312-105951-0.iprtp.out', '/pres/voip/text')
## Use RHIPE to convert text data:
##   1079089238.075950 IP UDP 200 67.17.54.213 6086 67.17.50.213 15074 0
##   …
##   to R data frames
input <- expression({
  ## create components (direction, id.ip, id.port) from Sys.getenv("mapred.input.file")
  v <- lapply(seq_along(map.values), function(r) {
    ## split one raw text record into its whitespace-separated fields
    value0 <- strsplit(map.values[[r]], " +")[[1]]
    ## the key identifies one semi-call: source/destination IP and port plus direction
    key <- paste(value0[id.ip[1]], value0[id.port[1]], value0[id.ip[2]],
                 value0[id.port[2]], direction, sep=".")
    ## emit the fields of interest (columns 1 and 9) under that key
    rhcollect(key, value0[c(1, 9)])
  })
})
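The block above shows only the map expression (input); the reduce expression that the next block passes to rhmr() does not appear in the captured slides. As a rough sketch (the variable names are assumed, not from the talk), it would collect every record emitted for one semi-call key and bind them into a single data frame:

Code (r)
## Hypothetical reduce step (not shown on the slide): build one data frame
## per semi-call from the records the map step emitted under that key.
reduce <- expression(
  pre    = { rows <- list() },
  reduce = { rows <- append(rows, reduce.values) },
  post   = {
    semi.call <- as.data.frame(do.call("rbind", rows), stringsAsFactors = FALSE)
    rhcollect(reduce.key, semi.call)
  })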

 

Submit a MapReduce Job, Then Retrieve Semi-Calls

Code (r)
## Case Study - VoIP
## We can run this from within R:
mr <- rhmr(map=input, reduce=reduce, inout=c('text','map'),
           ifolder='/pres/voip/text', ofolder='/pres/voip/df',
           jobname='create', mapred=list(mapred.reduce.tasks=5))
mr <- rhex(mr)
## This takes 40 minutes for 70 gigabytes across 8 computers (72 cores).
##   Saved as 277K data frames (semi-calls) across 14 gigabytes.

## We can retrieve semi-calls:
rhgetkey(list('67.17.50.213.5002.67.17.50.6.5896.out'), paths='/pres/voip/df/p*')
## a list of lists (key, value pairs)
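Because rhgetkey() returns a list of (key, value) pairs, the retrieved semi-call can be unpacked like any other R list. A small usage example, with an assumed variable name:

Code (r)
## Assumed usage: each element of the result is one (key, value) pair.
calls <- rhgetkey(list('67.17.50.213.5002.67.17.50.6.5896.out'),
                  paths='/pres/voip/df/p*')
calls[[1]][[1]]        # the key, i.e. the semi-call identifier
head(calls[[1]][[2]])  # the value, i.e. the semi-call data frame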

 

Compute Summaries With MapReduce

Code (r)
## Case Study - VoIP
## Map: summarise one semi-call. The slide shows only a fragment; key, tmp,
## start, end, dur and n.pkt are computed earlier in the map from
## map.keys[[i]] and the semi-call data frame.
map <- expression({
  lapply(seq_along(map.values), function(i) {
    ## make key from map.keys[[i]]
    value <- if (tmp[11] == "in")
      c(in.start=start, in.end=end, in.dur=dur, in.pkt=n.pkt)
    else
      c(out.start=start, out.end=end, out.dur=dur, out.pkt=n.pkt)
    rhcollect(key, value)
  })
})
## Reduce: merge the "in" and "out" summaries emitted for the same call
reduce <- expression(
  pre={
    mydata <- list()
    ifnull <- function(r, def=NA) if (!is.null(r)) r else def
  },
  reduce={
    mydata <- append(mydata, reduce.values)
  },
  post={
    mydata <- unlist(mydata)
    in.start  <- ifnull(mydata['in.start'])
    ## ..... (remaining in.* and out.* fields elided on the slide)
    out.end   <- ifnull(mydata['out.end'])
    out.start <- ifnull(mydata['out.start'])
    value <- c(in.start, in.end, in.dur, in.pkt, out.start, out.end, out.dur, out.pkt)
    rhcollect(reduce.key, value)
  })

 

Compute Summaries With MapReduce Across HTTP and SSH Connections

Code (r)
## Example: compute total bytes and total packets across all HTTP and SSH connections.
m <- expression({
  ## keep only connections whose first row uses port 22 (SSH) or 80 (HTTP)
  w <- unlist(lapply(map.values, function(r)
    if (any(r[1, c('sport','dport')] %in% c(22, 80))) TRUE else FALSE))
  lapply(map.values[w], function(v) {
    key <- if (22 %in% v[1, c('dport','sport')]) 22 else 80
    rhcollect(key, c(sum(v[,'datasize']), nrow(v)))
  })
})

r <- expression(
  pre={ sums <- 0 },
  reduce={
    v <- do.call("rbind", reduce.values)
    sums <- sums + apply(v, 2, sum)
  },
  post={
    rhcollect(reduce.key, c(bytes=sums[1], pkts=sums[2]))
  })
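As with the VoIP job earlier, these two expressions would be bundled with rhmr() and submitted with rhex(). The folder names and inout settings below are illustrative assumptions, not values from the talk:

Code (r)
## Assumed submission step; the folders and inout settings are illustrative.
job <- rhmr(map = m, reduce = r,
            ifolder = '/pres/voip/df', ofolder = '/pres/voip/totals',
            inout = c('map', 'map'))
rhex(job)
## The per-port totals can then be read back, e.g. with rhgetkey() as shown earlier.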

 

Load RHIPE on an EC2 Cloud

Code (r)
library(Rhipe)
load("ccsim.Rdata")
## copy the simulation data to HDFS; it is passed to each task via shared= below
rhput("/root/ccsim.Rdata", "/tmp/")
setup <- expression({
  load("ccsim.Rdata")
  suppressMessages(library(survstl))
  suppressMessages(library(stl2))
})
chunk <- floor(length(simlist) / 141)
z <- rhlapply(a, cc_sim, setup=setup, N=chunk, shared="/tmp/ccsim.Rdata",
              aggr=function(x) do.call("rbind", x), doLoc=TRUE)
rhex(z)

 

References:

“I just watched the Saptarshi Guha video. It looks great!! Thank you! The picture is incredibly crisp, and the timeline tab is a nice touch for reviewing the film. Thank you!” -- Matt Bascom

VMware's open-source partnership with Cloudera offers a virtual machine with Hadoop, Pig, and Hive - download

Purdue University hosts the documentation and open-source code base for RHIPE - download


About Ron Fredericks

Ron Fredericks is the Co-founder and new media evangelist at LectureMaker LLC, located in Sunnyvale, CA. Ron offers a portable video studio serving the Bay Area's business needs. He is a new media communicator focused on marketing, sales, engineering, and social outreach for high-tech companies, and a software engineer creating leading-edge video distribution platforms for business eLearning initiatives.
