# Parallel bootstrapping with R

Wed 21 April 2010

In a recent post, I mentioned that I was testing the stability of clusters generated from a modified network partitioning algorithm using bootstrap resampling techniques. I also mentioned that I was doing this in R, using the very nice foreach package published by REvolution Computing. To show just how nice this package is, below is a minimal example of bootstrapping a network partitioning algorithm which takes advantage of a multicore processor:

```library(doMC)
library(foreach)
library(igraph)
# Jaccard coeficcient function (taken from package fpc)
clujaccard = function (c1, c2, zerobyzero = NA) {
if (sum(c1) + sum(c2) - sum(c1 & c2) == 0)
out = zerobyzero
else
out = sum(c1 & c2)/(sum(c1) + sum(c2) - sum(c1 & c2))
return(out)
}
registerDoMC() # registers the parallel backend
B = 1000 # number of bootstrap replicates to create
fg = fastgreedy.community(g) # compute original clustering
mm = which.max(fg\$modularity) # find level of max modularity
moc = community.to.membership(g, fg\$merges, mm)\$membership # get membership
noc = length(unique(moc)) # count the number original clusters
bg = g # make a copy of g for bootstrapping
clusters = foreach(i=seq(B), .combine=cbind) %dopar% {
E(bg)\$weight = sample(E(g)\$weight, replace=TRUE) # resample the edge weights
fg = fastgreedy.community(bg) # compute bootstrap clustering
mm = which.max(fg\$modularity) # find level of max modularity
mbc = community.to.membership(bg, fg\$merges, mm)\$membership # get membership
nbc = length(unique(mbc)) # count the number new clusters
bootresult = c()
for (j in seq(0, noc-1)) { # for each of the original clusters...
maxgamma = 0
if (nbc > 0) {
for (k in seq(0, nbc-1)) { # for each of the new clusters...
bv = as.vector(mbc == k)
ov = as.vector(moc == j)
jc = clujaccard(ov, bv, zerobyzero=0)
if (jc > maxgamma) # if these two clusters are most similar...
maxgamma = jc
}
}
bootresult = c(bootresult, maxgamma) # combine results
}
return bootresult # return the results of this iteration (and cbind with the rest)
}
bootmean = apply(clusters, 1, mean) # mean Jaccard coefficient for each cluster
```

The above example might not produce great results, as it simply resamples (with replacement) the weights of all the network edges, and therefore a more sophisticated resampling regime might be warranted. Having said that, it’s quite a useful example, and as you can see, the only ‘extra’ bits required to make this run on multiple cores is the `registerDoMC()` command which simply registers the parallel backend (uses the multicore package) and the `foreach ... %dopar%` which tells `R` to run the loops in parallel. I ran a similar analysis using a different community detection algorithm on a computer with 4 cores, and was (finally) able to take full advantage of my processing power:  