Text message analysis, chapter 2

Let’s continue with our analysis of text message behavior by shifting over into content.  One of the first things to understand is the form of each message.  What’s the length of each message? As a function of length, how does the content of the message change?  To begin with…

… what’s the distribution of message lengths in texts?

It's a power law!!!

While noisy, this curve has a well-known shape: it’s a power law distribution.  Power laws crop up all the time in systems generated by humans.  I’ll post on them in depth later (because they’re really cool), but for now we can summarize them in the context of text messages.  A power law says short messages are relatively common, while texts containing the complete works of Dostoyevsky are quite rare (or at least, if I send them, I send them as many short texts).  Furthermore, the transition in frequency from common to rare follows a particular pattern: the frequency falls off roughly as a fixed power of the length, which shows up as a straight line on a log-log plot.
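For anyone who wants to try this on their own texts, here is a minimal sketch of how the length distribution could be tallied and checked against a power law. It assumes the messages are already loaded as a plain Python list of strings; the function names `length_distribution` and `loglog_slope` are made up for this example, and a least-squares line through the log-log points is only a rough check, not a rigorous power law fit. Note also that Python's `len` counts Unicode code points, so some emoji may be counted differently than you would count them by eye.

```python
from collections import Counter
import math


def length_distribution(messages):
    """Map each message length to the number of messages with that length."""
    return Counter(len(m) for m in messages)


def loglog_slope(dist):
    """Least-squares slope of log(count) against log(length).

    For data following count ~ length**(-alpha), the slope is roughly -alpha.
    Needs at least two distinct nonzero lengths, otherwise the fit is undefined.
    """
    pts = [(math.log(n), math.log(c)) for n, c in dist.items() if n > 0 and c > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Toy data, invented for illustration: counts halve each time the length doubles.
toy = ["a"] * 8 + ["ab"] * 4 + ["abcd"] * 2
dist = length_distribution(toy)
slope = loglog_slope(dist)
```

On the toy data the counts are exactly proportional to 1/length, so the fitted slope comes out at -1. Real text message data will be much noisier, especially in the long-message tail where each length shows up only once or twice.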

The transition suggests our next interesting insight.  While I could certainly generate 130 two- and three-letter texts that are all different, the nature of language suggests that these texts are not randomly generated.  Rather, they have content, and short content-bearing messages should be very similar to one another.  What are the unique short messages that I’m texting?

  • 1 character:  !    ?   K   N
  • 2 characters:  :)  😀   😛  <3  =)  =D  K’  No  Np Ok
  • 3 characters:   :)  (!)  :)!  :-)  😛 😉  Aww  Ha!  Hi!  Yep  Yes

The short messages I’m sending are emoticons, affirmatives and negatives.  This is reasonable to expect from texting.  We’d like to convey information in a short burst.  Friend asks a question, you respond with one word. “Meet for dinner?” “Yep” is a perfect usage of the texting format.

But this gets me wondering.  Since I have very few unique short text messages but many usages of them, what is the probability of a unique text message of a given length?

We can plot (number of unique messages of length N)/(number of messages of length N) to get the probability that a message of length N is unique.  We see that, as expected, two and three character texts are the least likely to be unique (because 1 character texts are not as frequent).  But as messages get longer, they rapidly become almost surely unique.  This reflects a very natural property of language: as the length of a message increases, the number of ideas that we can potentially convey grows rapidly.
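The ratio described above can be sketched in a few lines. This assumes, again, that the messages are a plain list of strings and that "unique" means distinct after exact string comparison (so “Ok” and “ok” would count as different); `unique_fraction_by_length` is a hypothetical name for this example.

```python
from collections import defaultdict


def unique_fraction_by_length(messages):
    """For each length N: (distinct messages of length N) / (messages of length N)."""
    by_length = defaultdict(list)
    for m in messages:
        by_length[len(m)].append(m)
    return {n: len(set(ms)) / len(ms) for n, ms in sorted(by_length.items())}


# Toy data, invented for illustration.
toy = ["K", "K", "No", "Ok", "Ok", "Ok", "Yep"]
fractions = unique_fraction_by_length(toy)
```

Here the two “K” texts give a ratio of 1/2 at length 1, the four 2-character texts contain two distinct messages (also 1/2), and the single “Yep” gives 1. A ratio near 1 means almost every message of that length was sent only once.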

What’s the longest non-unique text message I’ve sent?  17 characters, “How’re you today?”

“I’m good” (which, at 8 characters, I’ve only sent once)
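Finding the longest repeated message is a small variation on the same counting idea. A minimal sketch, assuming exact string matches count as repeats and using the made-up name `longest_repeated`:

```python
from collections import Counter


def longest_repeated(messages):
    """Return the longest message sent more than once, or None if nothing repeats."""
    counts = Counter(messages)
    repeated = [m for m, c in counts.items() if c > 1]
    return max(repeated, key=len, default=None)


# Toy data, invented for illustration (straight apostrophes for simplicity).
toy = ["How're you today?", "How're you today?", "I'm good", "Yep", "Yep"]
result = longest_repeated(toy)
```

If several repeated messages tie for the longest length, `max` returns whichever one it encounters first.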

