Because its fun to map stuff…

Fri 30 March 2012

Its been quite a while since my last post, and its Friday and I was feeling creative, so I decided to map something! I’ve been looking for an excuse to produce a nice graphic like the one Anita Graser created to represent Vienna’s green-spaces. She used Quantum GIS to produce a hexagonal grid for representing the density of Viennese trees instead of the standard heat map or kernel density map, and the results are quite nice! I’m a huge fan of QGIS, but I tend to do most of my work in R, so I decided to see if I could produce something similar using R. Turns out you can, and the final results are displayed below (read on to see the full work-flow). Instead of trees, I went ahead and mapped the locations of unique visitors to http://www.carsonfarmer.com/ between 2009 and 2011:

Visitors map

Work-flow

Firstly, I downloaded the logs for www.carsonfarmer.com. I did this directly from the console, though I’m pretty sure I could have done this from R as well. Next, I needed to extract the unique IP addresses from the logs. I found this nice grep one-liner from here, which I modified to grab all unique IP addresses that ‘GET’ something from the site:

grep 'GET' access.log | cut -d' ' -f1 | sort | uniq > ip_addresses.log

To actually map the IP addresses, I obviously needed some way to convert the raw IP addresses to latitude and longitude coordinates. Enter the very nice Data Science Toolkit (DSTK) from Pete Warden and the very handy RDSTK R package from Ryan Elmore! Basically, the DSTK has an API that can be queried for all sorts of information useful for ‘data science’ applications, and the RDSTK makes it possible to query to API directly from within R. I first heard about both these projects from the Revolution Analytics blog, where there is an article summarising Ryan Elmore’s work on RDSTK, and a few other handy links. RDSTK isn’t (yet?) available on CRAN, so I downloaded it directly from github:

wget https://github.com/rtelmore/RDSTK/raw/master/src/RDSTK_1.0.tar.gz

Then I installed it via R CMD INSTALL (note that it requires other R packages: RCurl, rjson, and plyr):

R CMD INSTALL RDSTK_1.0.tar.gz

Once I had all that stuff installed and ready to go, I actually started up an R session and got working:

addresses = read.table('ip_addresses.log', col.names='address')
library(RDSTK)
? ip2coordinates

The ip2coordinates function requires multiple IPs to be contained within a single string with comma-separated IP addresses, but we can only do a few IPs at a time (about 100 I think?) so I had to do this part in a loop (it probably isn’t polite to slam DSTK with 1,000s of requests, so be nice!).

ips = paste(as.character(addresses$address[1:80]), collapse=', ')
out = ip2coordinates(ips)
last = 80
s = c(seq(160, nrow(addresses), 80), nrow(addresses))
for (i in s) {
    ips = paste(as.character(addresses$address[(last+1):i]), collapse=', ')
    out = rbind(out, ip2coordinates(ips))
    last = i
}

Once that is done running, the next step(s) are to a) convert the returned data.frame to a SpatialPointsDataFrame, b) create a SpatialGrid based on the points, c) create a SpatialPolygons object from a hexagonal sample of the grid, and then finally d) create a SpatialPolygonsDataFrame for plotting:

library(sp)
# make the output into a Spatial* object
coordinates(out) = ~longitude+latitude
library(maptools) # need this for the following function
sg = Sobj_SpatialGrid(out, maxDim=200, n=NULL)$SG
hex_pts = spsample(sg, type='hexagonal', cellsize=sg@grid@cellsize)
hex_poly = HexPoints2SpatialPolygons(hex_pts)
pts_poly = over(hex_poly, out, returnList=TRUE)
pts_poly_count = sapply(pts_poly, function(x) nrow(x))
poly = SpatialPolygonsDataFrame(poly, data.frame('count'=pts_poly_count), 
    match.ID=FALSE)

Ok, now for some plotting!

# pick some reasonable break points
breaks = c(1.0, 10.0, 20.0, 50.0, 100.0, 500.0, 2000.0)
# use RColorBrewer to get a nice blue palette
library(RColorBrewer)
# don't use the lightest colour, it looks washed out
cols = brewer.pal(7,"Blues")[-1] 
# plot the grid, which produces something close to our final product
spplot(poly[poly$count>0,], col='white', col.regions=cols, at=breaks, 
       par.settings=list(axis.line=list(col='transparent')))

I then used Inkscape to tweak the final product, adding titles and labels and modifying the colour key to look like something a bit more suited to the map at hand. In the end I had a nice map of my blog readers, and an excellent way to procrastinate on a sunny Friday afternoon!

Helpful Tip

Data Science Internet Mapping R FOSS How-To

twitter

recent visitors