Wed 09 November 2011

Well, it's been a long time since my last post, but I do have a relatively good reason: I was finishing up my PhD thesis. The good news is that I’m now done and graduated! I’m hoping I’ll have a bit more time to blog and to continue working on the side projects that I had to put on hold while finishing up. My plan for the next few months is to finish up here in Maynooth, (unofficially) start some post-doc work, and finish or get going on several papers from my PhD research. I’m also going to try to learn Bayesian statistics, fiddle about with some visualizations I’ve been working on, and get back into QGIS and Python development.

In the meantime, I’ve put together a fun little visualization of my PhD thesis in the form of a word cloud.


This is actually a pretty rough version, and I suspect there are a few issues with hyphenated words and things like that, but it does give a pretty good impression of what my thesis is all about, so I’ll leave it at that for now. For those who might be interested (and for my own reference), the R code to generate this figure is below (requires the wordcloud and tm packages):

# read in all the lines as a character vector
lines <- readLines('modified.txt')
library(tm) # text mining package
library(wordcloud) # provides the wordcloud() function used at the end
# create a corpus object
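# (VectorSource treats each line as its own document; newer tm releases
# expect doc_id and text columns when using DataframeSource)
corpus <- Corpus(VectorSource(lines))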
# now start processing the text and removing punctuation etc
corpus <- tm_map(corpus, removePunctuation)
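# content_transformer() is needed in tm 0.6 and later; on older versions plain tolower works
corpus <- tm_map(corpus, content_transformer(tolower))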
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
# create a term document matrix (don't really know what that is...)
tdm <- TermDocumentMatrix(corpus)
# convert to matrix
m <- as.matrix(tdm)
# count how often each word occurs across all lines, most frequent first
v <- sort(rowSums(m), decreasing=TRUE)
# create dataframe for word cloud
d <- data.frame(word = names(v), freq=v)
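png("wordcloud.png", width=1280, height=800)
# arguments are named so the call does not depend on their positional order
wordcloud(d$word, d$freq, scale=c(8,.3), min.freq=2, max.words=100,
          random.order=TRUE, rot.per=.15, vfont=c("sans serif","plain"))
# close the png device so the file actually gets written
dev.off()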
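As for the term document matrix that the comment above shrugs at: it is just a table of counts with one row per term and one column per document (here, each line of the text file is a "document"). Below is a minimal sketch in base R that builds one by hand, assuming three made-up toy documents; the names docs, tokens and terms are only for illustration, and this is not how tm does it internally.

```r
# Three toy "documents" (made-up text, not taken from the thesis)
docs <- c("spatial data analysis", "spatial models", "data models data")

# Split each document into its words
tokens <- strsplit(docs, " ", fixed = TRUE)

# All distinct terms, sorted so the rows come out in a stable order
terms <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are documents
m <- vapply(tokens,
            function(words) as.integer(table(factor(words, levels = terms))),
            integer(length(terms)))
rownames(m) <- terms
colnames(m) <- paste0("doc", seq_along(docs))

# Row sums give the overall frequency of each term, as in the snippet above
v <- sort(rowSums(m), decreasing = TRUE)

print(m)
print(v)
```

Summing across the rows is what turns the matrix into the word frequencies that end up driving the font sizes in the cloud.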

I actually got this snippet from One R Tip A Day via R-bloggers.


