Lookups in R: The Wrong Way and the Right Way
I recently wrote a script that takes DBF exports of Cube matrices and prepares them for Biogeme. Â The main… well, only reason I did this in R was because I was considering using mlogit for model estimation. Â I ultimately decided to ‘go with what I know’ and changed course to use Biogeme. Mind you, the part of Stairway to Heaven applies: “There are two paths you can go by, but in the long run /Â There’s still time to change the road you’re on.”
The Wrong Way
I’ve changed my code already, so pardon that this is from memory. Â Also, these are snippets – I have a lot more code than this.
HSkimPk<-read.dbf("data/HSKIM_PK3.dbf") for(rn in 1:nrow(TripsAll)){ HSkimPkRow<-subset(HSkimPk,I==TripsAll[rn,"PTAZ"] & J==TripsAll[rn,"ATAZ") TripsAll$DA.IVT<-HSkimPkRow[,"V1"] ... }
This took no less than 17 hours to complete for around 23,000 trip records and for values from 5 different tables * 2 time periods.
The Right Way
I (obviously) wanted something that wouldn't take forever, especially as I was working in Biogeme and seeing things that made me think that I wanted to change ONE LITTLE THING. Â This seems to always happen.
I took a different approach that by my calculations should be much quicker.
HSkimPk<-read.dbf("data/HSKIM_PK3.dbf") HSkimPkD<-acast(HSkimPk,I ~ J,value.var="V2",drop=FALSE,fill=0) HSkimPkT<-acast(HSkimPk,I ~ J,value.var="V1",drop=FALSE,fill=0) for(rn in 1:nrow(TripsAll)){ if(I<=nrow(HSkimPkT) & J<=nrow(HSkimPkT)){ TripsAll[rn,"DA.IVT"]<-HSkimPkT[I,J] } }
Since this is currently running, my only metrics are to look at the time per 50 rows (in my real code, I have a line that outputs a timestamp every 50 rows), and it is taking about 0.27 seconds per record, compared to somewhere around 4.5 seconds per record. Â While not perfect, I'll take an estimated completion of 1.75 hours compared to 17 (update: 2 hours). Â However, I will say that Cube is faster in this regard and that I may not have the fastest R solution.