So far in our exploration of text messaging, we’ve analyzed the time of day each message arrives and the structure of each message. We’ve been able to pick out typical behaviors, like the timing of sleep cycles, and life events, like a trip to Switzerland. We’ve found a relationship between the structure of messages and the kind of content they contain. In the next part of our analysis we begin to look into the actual content of the messages, and along the way discover some interesting things about the nature of language.
What words am I putting in text messages? At the time of this writing, I’ve sent or received 21531 words in text messages. These are drawn from a list of 3701 unique words. To make the distinction concrete, the message “OMG I love love love Justin Bieber” has 7 words but only 5 unique words (BTW, I would shoot myself before sending that message).
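Here’s a minimal sketch in Python of how totals like these could be tallied. The function name `count_words` and the tokenization rule (split on whitespace, case-sensitive) are assumptions for illustration, not necessarily how the numbers above were produced.

```python
from collections import Counter

def count_words(messages):
    """Return (total_words, unique_words, counts) for an iterable of message strings."""
    counts = Counter()
    for message in messages:
        # Assumption: words are whitespace-separated and compared case-sensitively.
        counts.update(message.split())
    return sum(counts.values()), len(counts), counts

total, unique, counts = count_words(["OMG I love love love Justin Bieber"])
print(total, unique)          # 7 words, 5 of them unique
print(counts.most_common(1))  # the most reused word and its count
```

A different tokenization (lowercasing, stripping punctuation, treating emoticons specially) would shift the exact counts, so the rule is worth choosing deliberately.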
Since I’ve exchanged over 20k words using fewer than 4k unique words, there are obviously many reused words. What are the most common ones?
- ‘to’ – 690 uses
- ‘you’ – 647 uses
- ‘I’ – 620 uses
- ‘the’ – 553 uses
- ‘a’ – 483 uses
- ‘and’ – 481 uses
- ‘in’ – 365 uses
- ‘with’ – 270 uses
- ‘of’ – 234 uses
- ‘for’ – 231 uses
This is to be expected. Every one of these is on the list of most common English words. The common words of any language tend to be the functional ones: articles, pronouns, prepositions, conjunctions and the like.
But there are also a fair number of words used only once or twice. In fact, there are 1685 one-use words and 458 two-use words. Together that is 2143 of the 3701 unique words, so more than half are used only once or twice. This led me to wonder: how are the usage counts of the words in my text messages distributed?
Below I plot, as a function of the number of uses, the number of words with that number of uses. To make this clear using the counts above, the curve is made up of points like (1, 1685), (2, 458), …, (647, 1), (690, 1).
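The points on that curve can be built by counting the counts. The sketch below assumes the per-word tallies are held in a `Counter`; the name `uses_histogram` and the toy data are made up for illustration.

```python
from collections import Counter

def uses_histogram(word_counts):
    """Map each usage count to the number of distinct words used that many times.

    Returns a list of (number_of_uses, number_of_words) pairs sorted by uses.
    """
    return sorted(Counter(word_counts.values()).items())

# Toy data: 'a' is used 3 times, 'b' twice, 'c' and 'd' once each.
toy_counts = Counter("a a a b b c d".split())
points = uses_histogram(toy_counts)
print(points)  # [(1, 2), (2, 1), (3, 1)]
```

Run on the real data, the first pair would be (1, 1685) and the last (690, 1), matching the points listed above.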
The axes on this graph may not be what you’re used to. This is a kind of plot called a log-log plot. It’s based on the common occurrence in science that the differences we care about are not linear in nature. Rather, science often treats the steps from 1 to 10, 10 to 100, and 100 to 1000 as all the same size. This plot scales both axes to reflect that kind of thinking.
What can we learn from this? We can see that the data (the blue circles) tracks in almost a straight line from upper left to bottom right, then flattens out. The flattening out happens because we’re counting words, which come in whole numbers: if the line of data continued without change, eventually we’d be measuring fractional words, which doesn’t make sense.
But why is the graph straight until that point? Here’s some of the magic of the log-log plot: functions generated by power laws appear as straight lines on it. If y is proportional to x raised to some power, then log y is a linear function of log x, and the slope of the line is that power. We bumped into our first power law when analyzing text message length. Now one has popped up again in analyzing word frequency. This should begin to drive home the ubiquity of this concept in nature.
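To make the straight-line claim concrete, here is a rough sketch that fits a line to points after taking the logarithm of both coordinates. The function name `fit_power_law` and the synthetic data are assumptions for illustration, and this is not how the plot above was produced. A plain least-squares fit in log space is only a quick estimate of the exponent, and on real data it should be restricted to the straight portion before the curve flattens.

```python
import math

def fit_power_law(points):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    For data following y = C * x**k, slope estimates k and intercept estimates log10(C).
    """
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic points that follow y = 1000 * x**-2 exactly (not real message data).
synthetic = [(x, 1000 * x ** -2) for x in (1, 2, 5, 10, 20)]
slope, intercept = fit_power_law(synthetic)
print(round(slope, 6), round(intercept, 6))  # close to -2 and 3
```

Because the synthetic data is an exact power law, the fit recovers the exponent and the constant; real counts will scatter around the line.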
It’s a well-known fact that word usage and several other elements of language follow power laws. These are certainly interesting observations, but I’d like to use them to develop insights into the specific nature of text message communication as opposed to typical English prose.
Let’s take the hundred most common words in my text messages and remove those that are also among the most common words in other forms of English communication. This leaves us with a list of 43 words. Here’s a selection of the more content-rich members of that list:
We’re left with those words that are, in some sense, uniquely common to text messaging and the people I text with: the words that are defined by the nature of the communication. What makes the list? Emoticons, affirmatives (as answers to questions), and words about time and cellular communication. These are all probably common to texting as a form of communication (I’d love to study that more). More specific to me and the friends I text with is the use of ‘dance’ and ‘dancing’, reflecting my dedication to the hobby.
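The filtering step described above can be sketched as a simple set difference. The function name `distinctive_words`, the toy counts, and the tiny stand-in list of common English words are all hypothetical; the actual reference list I compared against isn’t reproduced here.

```python
from collections import Counter

def distinctive_words(word_counts, common_english, top_n=100):
    """Return the top_n most used words that are not in the common-English list."""
    top_words = [word for word, _ in word_counts.most_common(top_n)]
    # Assumption: the comparison ignores case, so 'I' and 'i' are treated alike.
    common = {word.lower() for word in common_english}
    return [word for word in top_words if word.lower() not in common]

# Toy data standing in for real message counts and a real common-word list.
toy_counts = Counter({"to": 5, "you": 4, "dance": 3, ":)": 2, "the": 1})
toy_common = ["to", "you", "the"]
leftover = distinctive_words(toy_counts, toy_common, top_n=4)
print(leftover)  # ['dance', ':)']
```

The result depends heavily on which reference list is used and how long it is, so the 43 words above should be read with that choice in mind.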
I’m really excited to see the connection between the form of the communication and the content of the message. I wonder what else we’ll discover as we dig a little deeper.