Data Mining the National Science Foundation’s Scholarship Program

Fall is in the air.  Undergraduates have returned to their roost.  All this reminds me of the annual trials and tribulations of the scholarship cycles.  One of the particularly memorable ones was the National Science Foundation’s (NSF) Graduate Research Fellowship.  I applied, but ended up on a different NSF grant for my work at NYU.  I remember the application process vividly.

All of this was in the back of my mind while I was exploring this week.  I happened to stumble upon lists of NSF Fellows for the previous 13 years.  I added a couple more fields that had not been explicitly included and compiled these lists into a single data set.  I’ve made it publicly available as a Google Fusion Table for anyone who would like to use it for further analysis.

I’ve done a few simple analyses of the data set (below the fold) and included my interpretations for any would-be fellowship applicants.

Wolfram Alpha and Social Networks

Stephen Wolfram posted recently on an update to Wolfram Alpha.  You can now type in ‘facebook report’ at the Wolfram Alpha prompt.  The new command displays a whole host of interesting data mining from the information you’ve placed on Facebook over the years.

My favorite tidbit?  The clustered network graph of my friends.  Nothing shocking here, but it’s a nice visualization of how my social circles have broken down over the years (college, dancing, science, …).

Social Networks of Shakespearean Plays – part 2

As a follow-up to my previous work on social networks in Shakespeare, I wanted to see how the social network changes over the course of a play.  Using the same techniques as last time, we can look at the social network structure by act.

The density of the social network underlies the style of the storytelling.  Movie Galaxies (the site that originally inspired this work) has found some striking differences between different directors by looking at this network feature.
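For anyone who wants to try this at home, here is a minimal sketch of one common way to compute network density per act: the number of distinct links divided by the number of possible links among the characters who appear.  The act numbers, character labels and links below are invented toy data, and counting only characters that have at least one link is an assumption of this sketch, not necessarily what Movie Galaxies does.

```python
def density(edges):
    """Undirected density: 2E / (N * (N - 1)) over characters that have a link."""
    nodes = {name for edge in edges for name in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    # Ignore self-links and count each pair of characters only once.
    unique_links = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return 2 * len(unique_links) / (n * (n - 1))

# Hypothetical links per act (toy data, not taken from any play).
acts = {
    1: [("A", "B"), ("B", "C")],
    2: [("A", "B"), ("A", "C"), ("B", "C")],
}

density_by_act = {act: density(edges) for act, edges in acts.items()}
for act, value in sorted(density_by_act.items()):
    print(f"Act {act}: density {value:.2f}")
```

A density of 1 means every character in the act interacts with every other one; values near 0 point to a sparse, spread-out cast.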

Social Networks in Shakespeare

Privacy settings are very important when getting them wrong results in a duel to the death.

When I think of social networks, I immediately think of Facebook, LinkedIn and the rest of their ilk.  These tend to dominate our thoughts on social networks simply because they’re the biggest.  But social networks actually pervade our entire lives.  A few days ago, I started thinking about social networks as they pertain to our entertainment.

Movie Galaxies, a site that’s looking at social networks within film, started me thinking about this.  Their basic idea is to take the script of a film and treat the interactions between characters as links in a social network.  So far they’ve found that statistics describing the structure of this network vary across genres and narrative styles.
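The script-to-network idea can be sketched in a few lines.  This version assumes that two characters "interact" whenever they both speak in the same scene, which is a simplification I'm choosing for illustration and not necessarily the rule Movie Galaxies uses.  The scene lists are made up and are not extracted from the actual text of any play.

```python
from collections import Counter
from itertools import combinations

def build_network(scenes):
    """Count, for each pair of characters, how many scenes they share."""
    links = Counter()
    for speakers in scenes:
        # sorted(set(...)) removes repeats and gives each pair a stable order.
        for a, b in combinations(sorted(set(speakers)), 2):
            links[(a, b)] += 1
    return links

# Invented toy scenes: each entry lists who speaks in that scene.
scenes = [
    ["Romeo", "Benvolio"],
    ["Romeo", "Juliet", "Nurse"],
    ["Juliet", "Nurse"],
]

links = build_network(scenes)

# Degree: the number of distinct characters each character is linked to.
degree = Counter()
for a, b in links:
    degree[a] += 1
    degree[b] += 1

print(links.most_common())
print(degree.most_common())
```

The link counts double as edge weights, so the same structure can feed statistics such as density or degree distribution.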

I wanted to play with these ideas, so I started fooling around with social networks in Shakespeare.  It turns out you can tell a good bit about the nature of each narrative, or its genre, just by looking at the social networks in the plays.

A/S/L/(Neil Gaiman fan?) – Fusion tables and OKCupid

I’ve been thinking about data visualization tools lately.  In particular, I got some advice to check out Google’s Fusion Tables.  I needed some data to start playing around with, but, luckily, I happened to have 29,035 OKCupid profiles lying around in a database (learn how I got them here).


First question: what does the age distribution on OKCupid look like?

One thing that jumps out at me is the shape of these various curves.  Data will often go “Nudge, nudge, wink, wink, I might be a gamma distribution.”  In response to this nudge and wink we can start to think about what might cause this shape in the data to appear.  There is a wealth of theory on the gamma distribution as it comes up frequently in all sorts of branches of science.

One place it comes up particularly often is in models of wait times.  Waiting for death, waiting for your next car accident, waiting for the devil to pick up the phone (although waiting for the devil is actually a Poisson distribution, I’m not making this up BTW).  What we see here is the distribution of waiting times until a *terminal* relationship (the one which removes you from the dating pool, with or without appropriate ‘Fatal Attraction’ references).

Caveat for the more mathematically inclined: yes, I know that a more complicated model of entering and exiting relationships, one that tracks much more detailed information, would probably reveal that the distribution deviates from the gamma distribution in some way.  But to a first-order approximation, this is a good explanation of the data.
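If you want to check a "nudge, nudge, I might be a gamma distribution" hunch yourself, a quick method-of-moments fit is a reasonable first pass: the shape is mean squared over variance, and the scale is variance over mean.  The sketch below is not the analysis behind the plots above.  It runs on synthetic data drawn from a gamma distribution with a shape of 3 and a scale of 4, both of which are arbitrary choices for illustration, standing in for something like "years past 18".

```python
import random
from statistics import mean, pvariance

def fit_gamma_moments(xs):
    """Method-of-moments estimates for a gamma distribution.

    For shape k and scale theta: mean = k * theta, variance = k * theta**2,
    so k = mean**2 / variance and theta = variance / mean.
    """
    m = mean(xs)
    v = pvariance(xs)
    return m * m / v, v / m

random.seed(0)
# Synthetic stand-in data (NOT the OKCupid ages): shape 3, scale 4.
sample = [random.gammavariate(3.0, 4.0) for _ in range(20_000)]

shape, scale = fit_gamma_moments(sample)
print(f"estimated shape {shape:.2f}, estimated scale {scale:.2f}")
```

On real ages you would first subtract the minimum age, since a gamma distribution starts at zero.  A gamma with an integer shape k is the waiting time until the k-th event of a Poisson process, which is why it keeps turning up in wait-time models.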


Out of the 29,035 profiles, I had 18,429 males and 10,606 females.  That gives an overall male-to-female ratio of about 1.737.  I plotted the geographical distribution of this ratio by state and struggled for a while with the best way to display it.  Then, while fooling around with the table filters, I set the table to show only the states with a ratio less than 1.737.

My interpretation? West of the Mississippi, thar’ be sausages.

Locales with the fewest men per woman on OKC: Washington, DC (0.862), followed by New Hampshire (1.145).
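The arithmetic behind the ratio, and the "show only states below the overall ratio" filter, can be sketched outside of Fusion Tables as well.  The overall counts are the ones quoted above; the per-state counts and state names are invented placeholders, since the real per-state numbers aren't listed here.

```python
males, females = 18_429, 10_606
assert males + females == 29_035  # matches the total number of profiles

overall_ratio = males / females
print(f"overall male-to-female ratio: {overall_ratio:.3f}")

# Hypothetical (male, female) profile counts per state, for illustration only.
state_counts = {
    "State A": (900, 600),
    "State B": (400, 100),
    "State C": (500, 280),
}

# Keep only the states whose ratio falls below the overall ratio.
below_overall = {
    state: m / f
    for state, (m, f) in state_counts.items()
    if m / f < overall_ratio
}
print(below_overall)
```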

Neil Gaiman fan? Bible fan:

My original concept for this post was to look at the geographical distribution of OKCupid profiles that mention Neil Gaiman or one of his books.  But there were simply not enough profiles to give me the statistical power to understand the distribution across states.  Gaiman or one of his books is mentioned in a little more than 1% of the profiles that I’ve downloaded.  The distribution of readers across the country can’t be estimated (at least without enormous error bars) from only about 300 data points.
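To put rough numbers on "enormous error bars", here is a sketch using the Wilson score interval for a proportion.  The choice of the Wilson interval is mine.  The national figures use the approximate count of 300 mentions out of 29,035 profiles, and the single-state example (5 mentions in 500 profiles) is hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Roughly 300 mentions across all 29,035 profiles.
national = wilson_interval(300, 29_035)

# Hypothetical state: 5 mentions among 500 profiles (same 1% rate).
one_state = wilson_interval(5, 500)

print("national:  {:.4f} to {:.4f}".format(*national))
print("one state: {:.4f} to {:.4f}".format(*one_state))
```

The rate is the same in both cases, but the single-state interval is several times wider, which is why a state-by-state map of a 1% signal isn't trustworthy at this sample size.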

I still wanted to look at the geographical distribution of something drawn from the OKCupid essay data, so I turned to a topic that shows up far more often: mentions of the Bible, God or Jesus.
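Counting this kind of mention per state comes down to a keyword match and a tally.  The sketch below assumes each profile is a (state, essay text) pair and uses a simple whole-word, case-insensitive match on three terms.  The state labels and essay snippets are invented, and the real term list may well have been longer.

```python
import re
from collections import defaultdict

# Whole-word match, so "Godzilla" does not count as a mention of "god".
RELIGIOUS = re.compile(r"\b(bible|god|jesus)\b", re.IGNORECASE)

# Invented (state, essay) pairs for illustration only.
profiles = [
    ("state_a", "I love God and my family."),
    ("state_a", "Hiking, cooking, and bad puns."),
    ("state_b", "Huge fan of Godzilla movies."),
    ("state_b", "Coffee first, questions later."),
]

totals = defaultdict(int)
hits = defaultdict(int)
for state, essay in profiles:
    totals[state] += 1
    if RELIGIOUS.search(essay):
        hits[state] += 1

# Fraction of profiles in each state with at least one match.
mention_rate = {state: hits[state] / totals[state] for state in totals}
print(mention_rate)
```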

To the surprise of absolutely no one, the Deep South uses more religiously oriented vocabulary than any other part of the US.

Wrap up:

I’ve had a great deal of fun with Fusion Tables.  They’ll definitely remain in my data visualization arsenal.  They make it quick to generate visuals, especially with geographic data, that would otherwise take a long time to produce.  I’d encourage you to go and play around with one of the various tutorials they have.  I really like how quickly the platform let me extract some visual insights from the OKCupid data.

If you have an idea for a dating map you’d like to see, let me know in the comments.  I’ll continue to play with the data I have and see what pops up.

Statistical caveat:

I presented some observations about gender and geographical differences.  Some of them make intuitive sense, some of them are amusing, and all of them rise only to the statistical level of “hypothesis generating.”  That is to say, due to several vagaries in the collection of the profiles, there is not sufficient statistical power to be definitive.  That said, I stand by my observation “West of the Mississippi, thar’ be sausages.” 😉

How Andrew Hacker and the New York Times got it wrong

The July 29 New York Times Op-Ed page ran a piece by Andrew Hacker entitled “Is Algebra Necessary?”  In the course of the piece, he makes several arguments all pointing towards a resounding “NO!”  The arguments he makes are, in my opinion, all flawed.  I’ll address one of them here, concerning the employability of people in math-heavy fields.

One argument against Hacker’s piece that I haven’t seen made is that he misuses his source material.  Hacker cites several reports on employment in the so-called STEM fields (science, technology, engineering and math), some of which are misquoted and all of which are quoted out of context.  After looking over the reports, I would propose that even Hacker’s own source material supports the position that if a person wants to belong to a field with low unemployment and high quality of employment, then algebra is most definitely necessary.

Data mining OKCupid

I’ve been thinking quite a bit about natural language processing lately.  This started with my series on text message analysis and looking at gender-specific Twitter usage.  Lately I’ve been pointed at the Natural Language Toolkit (NLTK), a Python library, to make this analysis more robust.  I want to apply this toolkit to a larger set of data than my text messages, so I’ve spent some time learning how to use some web scraping tools.

This post boils down to “How do I download thousands of OKCupid personal profiles?”  Follow-up posts will cover how to use natural language processing to understand the database you create with the downloads.

An online/traditional educational hybrid inspired by the making of a university

What is the future of massive open online courses (MOOCs) like Udacity or EdX?  I think we’re seeing the beginning of something new and world changing.  I expressed this a couple of weeks ago here, which led to a series of conversations with my roommate.  The ins and outs and tangents of these chats have led me to consider a plausible change in the nature of higher education, in particular, a change I’d like to see.

Arsenic based life and very fast neutrinos

I was amused to see a report over at The Bunsen Burner which summarized a couple of papers attacking NASA’s arsenic-incorporating-life results.  Let me summarize, just in case you missed the arsenic hubbub in December of 2010.  NASA’s astrobiology division published a paper in Science about an amazing microorganism living in Mono Lake, California, that could tolerate truly remarkable amounts of arsenic and very little phosphorus.

If NASA had stopped with a description of the organism, then the paper would’ve been fascinating, compelling and a new view into the extreme niches that life can occupy.  But they went a bit further.
