Data mining OKCupid

I’ve been thinking quite a bit about natural language processing lately.  This started with my series on text message analysis and a look at gender-specific Twitter usage.  Recently I was pointed at the Natural Language Toolkit (NLTK), a Python library that makes this kind of analysis more robust.  I want to apply the toolkit to a larger dataset than my text messages, so I’ve spent some time learning how to use some web scraping tools.

This post boils down to “How do I download thousands of OKCupid personal profiles?”  Follow-up posts will cover how to use natural language processing to make sense of the database you build from the downloads.

To gather data on OKCupid users we need a simple script that logs in, builds a database of users through the site’s search function, then fetches and stores the text of the profiles.  Basically, we need a page scraper configured for OKCupid together with some database functionality.  The Python library BeautifulSoup is well suited to this, but in the spirit of this being a learning project, I decided to reinvent the wheel and implement my own scraper.

Step one, logging in:

Python has native support for HTTP requests and cookie handling in three standard-library modules: urllib, urllib2, and cookielib.  We combine these to post to the login form and follow the redirect to the homepage, storing the cookie that has us logged in so it gets submitted on subsequent requests to the site.

Here’s how we get this done:

import urllib
import urllib2
import cookielib

# cookie storage, so the login session cookie is reused on later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cj),
    urllib2.HTTPRedirectHandler
)
# user agent, so requests look like they come from a browser
opener.addheaders.append(('User-agent', 'Mozilla/4.0'))

# setup credentials and the login request
url = 'http://www.okcupid.com/login'
login_data = urllib.urlencode({
    'username': 'usernamesinblogsaregreatideas',
    'password': 'passwordsinblogsaregreatideas',
})

# POST the credentials; the opener follows the redirect to the homepage
req = urllib2.Request(url, login_data)
resp = opener.open(req)

This code sets up 'opener' to handle our requests with cookie and redirect handlers.  We run this and we're logged in, in this case as the user data_curious.  We can now request webpages using this opener.  Let's get some usernames.
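As a quick sanity check, we can fetch a logged-in page with the same opener and look for our username in it.  This is a minimal sketch: the /home URL and the test for the username in the page body are my assumptions, not something the login response guarantees.

# hypothetical check that the login worked: the CookieJar attaches
# our session cookie to this request automatically
home = opener.open(urllib2.Request('http://www.okcupid.com/home'))
if 'data_curious' in home.read():
    print 'logged in'
else:
    print 'login failed'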

We need a way to parse the HTML coming back from these requests.  Python supplies one in the HTMLParser library.  As we parse through the search results we want to store the info we fetch in a database.  I like to use the lightweight web.py to provide database abstraction (thanks John for this advice).  Here's the code to handle this:

import web
from HTMLParser import HTMLParser

# silence database logging
web.config.debug = False
db = web.database(dbn='sqlite', db='/Users/Andrew/OKC_users.db')

class SearchHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # parser state: are we inside a username span or an "aso" details paragraph?
        self.record_username = 0
        self.record_aso = 0
        self.username = ""
        self.data_count = 0
        self.user_data = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class' and value == 'username':
                self.record_username = 1
            if name == 'class' and value == 'aso':
                self.record_aso = 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.record_username:
            self.record_username = 0
        if tag == 'p' and self.record_aso == 1:
            self.record_aso = 0

    def handle_data(self, data):
        if self.record_username:
            self.username = data
            return
        if self.record_aso:
            self.user_data.append(data.strip())
            self.data_count += 1

            if self.data_count == 7:
                username = self.username
                age = self.user_data[0]
                sex = self.user_data[2]
                orient = self.user_data[4]
                status = self.user_data[6]
                essay_data = ""
                self.data_count = 0

                # parameterized query instead of string concatenation, so odd
                # characters in a username can't break the SQL
                results = db.query(
                    "select count(*) as total_users from users where username=$username",
                    vars={'username': username})
                num = results[0].total_users
                if num == 0:
                    print 'valid'
                    db.insert('users', username=username, age=age, sex=sex,
                              orient=orient, status=status, essay_data=essay_data)
                else:
                    print 'invalid'

                self.username = ""
                self.user_data = []
            return

The properties of this object keep track of the parser's state.  As the parser is fed chunks of the webpage, we mostly want to ignore them; the exception is when we see something that looks like a username.  Based on the layout of the search results, when the tag type and class name identify a username we switch to recording mode by setting the record_username property to true, and we pick up the age, sex, orientation and status the same way through the 'aso' paragraph.  Each username is checked against the database before inserting, since the same user might come back over multiple queries and we'd like to avoid duplicates.
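To see why the code reads indices 0, 2, 4 and 6 of user_data: handle_data also receives the separator strings between the fields, so a result's details arrive as seven chunks with the separators at the odd indices.  The snippet below is a hypothetical reconstruction based only on the class names the parser keys on, not the actual OKCupid markup:

# feed the parser a made-up snippet shaped like what it expects;
# with db set up, this inserts one row for data_curious
demo = ('<span class="username">data_curious</span>'
        '<p class="aso">26<span> / </span>M<span> / </span>'
        'Straight<span> / </span>Single</p>')
SearchHTMLParser().feed(demo)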

All of this is stored in the database that we build with the aid of web.py's database module.  I chose the SQLite format because I'm familiar with how to handle it using other programs.  Based on how this is set up, the database and its schema must already be initialized with the username, age, sex, orientation, status and essay columns.
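For completeness, here's a minimal one-time setup sketch.  The column names come from the db.insert call above; storing everything as text is my assumption, since the scraper only ever handles strings.

import sqlite3

# one-time setup: create the table the parser writes to
conn = sqlite3.connect('/Users/Andrew/OKC_users.db')
conn.execute("""create table if not exists users (
    username text, age text, sex text,
    orient text, status text, essay_data text)""")
conn.commit()
conn.close()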

Each match search responds with 10 results, so we loop until we have as many as we want.  With the code below, I set up a particular search for people, male and female, within 25 miles of my zip code.  It's ordered by "Special Blend," which I'm not sure amounts to random.  In fact it's probably decidedly not random, since A-List members (the premium service members) get a search option entitled "Totally Random."  So if you derive any statistical results from the database you generate, take them with a rather sizable grain of salt.

import time

url = 'http://www.okcupid.com/match?timekey=1&matchOrderBy=SPECIAL_BLEND&use_prefs=1&discard_prefs=1&globalnav=1'

i = 0
while i < 200:
    try:
        req = urllib2.Request(url)
        resp = opener.open(req)

        # each response carries 10 search results; the parser stores them
        parser = SearchHTMLParser()
        parser.feed(resp.read())
        print "Request %d finished." % i
    except:
        parser = SearchHTMLParser()
        print "ERROR ON %d" % i

    i = i + 1
    time.sleep(3)  # don't pound the OKCupid servers
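The loop above leaves essay_data empty.  Below is a minimal sketch of the profile-fetching step, assuming profiles live at /profile/&lt;username&gt; and that we simply store the raw page HTML for later processing; both of those details are my assumptions rather than anything confirmed above.

# minimal sketch: fill in essay_data for each stored user.  The
# /profile/<username> URL pattern is an assumption about the site.
for row in db.select('users', where="essay_data=''"):
    try:
        req = urllib2.Request('http://www.okcupid.com/profile/' + row.username)
        html = opener.open(req).read()
        db.update('users', where='username=$u',
                  vars={'u': row.username}, essay_data=html)
    except:
        print "ERROR fetching %s" % row.username
    time.sleep(3)  # same throttling as before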

I'd like to improve the way I interact with OKCupid's match search.  The search function returns an AJAX-enabled page that serves more users as you scroll down.  I'd like to find a way for this script to interact with the page and get additional users without requesting a whole new search page, but I haven't been able to figure it out.  If you have ideas for how to accomplish this, I'd love to hear them in the comments.

In a follow-up post, I'll cover how to analyze the language in this data with the Natural Language Toolkit.

If you're interested in this you may also like:
My series on text messages
Text analysis to understand gender differences on Twitter

Disclaimer:  I'm presenting code here that, depending on how it's used, can violate OKCupid's terms of service.  I'm using the data I collect to develop personal insights into the nature of dating and the language we use about it, and I genuinely feel I'm within the terms of service.  As OKCupid changes their site and/or ToS, this code could become unusable for technical or legal reasons.  Use at your own risk and/or please don't sue me.

