Social Media Data Dredging

Lately I've been seeing a rash of conclusions, drawn from large social media datasets, offering some received wisdom about the best times to email, how often to tweet, and other alleged correlations between your behavior as a marketer, and your audience's behavior.

Some of these offer legitimate correlations. Keep in mind, though, that a correlation is a number, and it generally describes the relationship between one numeric variable and another: for example, for every hour I write, I produce 5 pages. The time I spend writing can be said to correlate positively with the number of pages I write. There can also be negative correlations, where an increase in one number is correlated with a decrease in another.
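To make "a correlation is a number" concrete, here is a minimal sketch in Python that computes the standard Pearson correlation coefficient for the hours/pages example. The numbers are made up for illustration, and the "pages left" series (counting down from a hypothetical 30-page draft) is my own addition to show the negative case.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative, invented data: 5 pages for every hour of writing.
hours = [1, 2, 3, 4, 5]
pages = [5, 10, 15, 20, 25]
pages_left = [25, 20, 15, 10, 5]  # pages remaining in a hypothetical 30-page draft

r_pos = pearson(hours, pages)       # perfectly positive relationship
r_neg = pearson(hours, pages_left)  # perfectly negative relationship
print(r_pos, r_neg)
```

Both inputs have to be numbers for this to work at all, which is exactly where the trouble starts with some of the variables marketers care about.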

Where correlations get tricky is with discrete variables, which my good friend Matt Ridings reminded me of recently. In the case of a satisfaction score, like "very satisfied," "somewhat satisfied," etc., although the values are not numbers, they are derived from a continuous scale and ordered from low to high, so you can use them as proxies for numbers of a sort.

Some discrete variables, however, do not represent continuous ordered numbers. These include things like the day of the week. Monday is not "more" or "less" than Saturday. Monday is just not Saturday. So when we derive associations between things like emails opened and days of the week, these are hard to express as correlations - think about my hours/pages analogy. If there is a correlation between day of the week and tweets retweeted, how would you express it? For every... what, exactly, is an extra retweet observed? (Really doing the work here would mean running a robust multiple regression analysis with a lot of variables - including the times and days of the week you send your messages, but many others as well - to determine just how important days or times are to open rates or retweets. They may contribute something to the variance of open rates, but since I never see anyone actually do this work, it's all conjecture.)
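A rough sketch of what you can honestly say about an unordered variable like day of the week: not a "for every X, Y" slope, just a comparison of group averages. The send log below is invented for illustration, and this is only a descriptive summary, not the regression analysis mentioned above.

```python
from collections import defaultdict

# Hypothetical send log: (day the email went out, open rate for that send).
sends = [
    ("Mon", 0.20), ("Mon", 0.24),
    ("Wed", 0.30), ("Wed", 0.26),
    ("Sat", 0.15), ("Sat", 0.19),
]

def mean_open_rate_by_day(rows):
    """Average open rate per day; days are labels here, not points on a scale."""
    rates = defaultdict(list)
    for day, rate in rows:
        rates[day].append(rate)
    return {day: sum(values) / len(values) for day, values in rates.items()}

by_day = mean_open_rate_by_day(sends)
for day, rate in by_day.items():
    print(f"{day}: {rate:.2f}")
```

Notice there is nothing to plug into a correlation formula: "Wed" has no numeric position relative to "Sat," so all you get is a table of averages that may or may not differ by more than chance.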

The Dredges

This, however, is not the biggest problem with all of these "best time to send email" or "number of times to tweet" data points. The real issue is that these loose associations are prime examples of data dredging. I don't want this blog to turn into statistics school for anyone - that would be a snooze for me, too - but this is an important concept to understand before you blindly accept the conclusions of social media "studies" derived from unstructured data.

Data dredging is simply this - sifting through a big ol' pile of numbers without a hypothesis, eyeballing things that appear to be related but might just be random chance. If you just take a pile of data that shares one common element (the thing you are interested in), you will likely find some other common elements purely by chance. Until you prove otherwise, these are associations, not necessarily correlations.
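If you want to watch chance do its work, here is a small simulation sketch. Everything in it is random noise: a made-up "open rate" for 30 sends and 200 made-up variables that, by construction, have nothing to do with it. The sizes (30 and 200) and the seed are arbitrary choices of mine. Dredging simply means picking the variable that happens to line up best.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(42)   # fixed seed so the run is repeatable
n_sends = 30      # arbitrary: number of observations
n_features = 200  # arbitrary: number of unrelated variables to sift through

# Pure noise: the outcome and every candidate variable are independent.
open_rates = [random.random() for _ in range(n_sends)]
features = [[random.random() for _ in range(n_sends)] for _ in range(n_features)]

# "Dredge": correlate everything with the outcome and keep the best-looking one.
correlations = [pearson(feature, open_rates) for feature in features]
strongest = max(correlations, key=abs)
print(f"strongest correlation found in pure noise: {strongest:+.2f}")
```

With this many candidates and this few observations, the "winner" will typically look respectable even though nothing real is going on, and the more variables you sift through, the better the best one looks.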

The classic example for this is the cancer cluster. If you sift through cancer rates geographically, you will likely find areas of the country that report higher than normal rates for a given form of cancer. If you just "dredge" that data, you might find other things that those populations have in common - maybe they live near a Superfund site. Maybe they also have higher than average incomes. Maybe they live near the highest concentration of fast food restaurants, but equally - maybe they live in an area with the highest concentration of shoe stores. You see where I am going with this?

Sifting through the cancer cluster data and reporting that living near fast food concentrations correlates with high cancer rates might sound reasonable to some, but the exact same process could also lead you to say that shoe stores cause cancer. You need to test these hypotheses on different data sets than the ones used to dredge the correlations in the first place to determine if they are real correlations. So if you do dredge out that cancer rates and shoe stores seem to go together in one data set, you need to apply that theory to an entirely different data set to see if they are, in fact, related.

The fact is, most of these "best times to send an email" stats might just be pure coincidence, and they certainly don't necessarily apply to you and your business. The right answer for you is to look at your data. Chris Penn has detailed a pretty effective way to do this yourself, if your email service provider offers metered send functionality, and I highly recommend you give his method a try.

You could also just randomly separate your data into two subsets, then go ahead and dredge the heck out of one of them. You'll need to account for whether or not you tend to email on the same day of the week, as this probably has the single highest correlation to when your emails are opened, no? But let's say you normalize things to show data from emails sent across every day of the week. In data subset 1, you might observe that emails sent on Wednesdays are opened more frequently. Now you have a hypothesis, not a real correlation. All you need to do is check that hypothesis against the other data subset - if you have a real correlation, and not a coincidence, you'll see the same correlation, to roughly the same degree.
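The split-and-check idea can be sketched in a few lines of Python. The function name, the 50/50 split, and the rule for "confirmed" (the same day also comes out on top in the second half) are my own simplifying assumptions, and the demo data is invented with a real Wednesday effect built in so you can see what a pattern that holds up looks like.

```python
import random
from collections import defaultdict

def mean_by_day(rows):
    """Average open rate per send day."""
    rates = defaultdict(list)
    for day, rate in rows:
        rates[day].append(rate)
    return {day: sum(values) / len(values) for day, values in rates.items()}

def split_and_check(rows, seed=0):
    """Dredge one random half for the best day, then test it on the other half."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    explore, holdout = shuffled[:half], shuffled[half:]

    explore_means = mean_by_day(explore)
    holdout_means = mean_by_day(holdout)

    # The hypothesis comes only from the first half...
    candidate = max(explore_means, key=explore_means.get)
    # ...and is judged only on the second half (assumed rule: it wins there too).
    confirmed = candidate == max(holdout_means, key=holdout_means.get)
    return candidate, explore_means[candidate], holdout_means.get(candidate), confirmed

# Invented data: 20 sends per day, with Wednesday genuinely better than the rest.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
rows = [(day, 0.40 if day == "Wed" else 0.20) for day in days for _ in range(20)]

candidate, explore_rate, holdout_rate, confirmed = split_and_check(rows)
print(candidate, explore_rate, holdout_rate, confirmed)
```

Comparing the two rates side by side is the "to roughly the same degree" part: a day that wins big in the first half and looks ordinary in the second was probably a coincidence.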

If you don't, then you don't have a correlation. And my real point here is that you should do this work yourself, with your data - it's the only right answer for you. As my friend DJ Waldow notes about "email best practices," your email best practices (or Twitter, or Facebook, etc.) are the best practices for you. Your variables will be different from my variables - perhaps dramatically so - and listening to me tell you that Fridays are the best day to send B2B emails when your audience is primarily in the Middle East (where, in many countries, Friday is the first day of the weekend) is just plain silly.

Finally, the most sinister thing I observe with all this dredger-y: often, the dubious conclusions I see thrown around regarding these issues are accompanied by some form of weak caveat: "of course, correlations don't prove causation, but..." These statements are designed to make you believe that the study is reasonable, that the appropriate data safeguards have been employed, and that the author of the study is cradling you in the loving arms of reason. They lower your guard. Of course correlations don't prove causation, but the author is presenting them for a reason - after all, as Edward Tufte famously quipped, "correlation isn't causation, but it sure is a hint." The correlation caveat is designed to make the hint seem reasonable, cautious and considered.

Here's the thing, though - correlations don't necessarily imply causation (that's the easy thing to say), but associations aren't necessarily correlations, either. So the next time you hear the results of data dredging about timing and frequency of marketing messages, don't treat these results as facts - treat them as hypotheses. Could be right, could be coincidence, but don't accept them at face value. Test them on your own data.

Thanks for reading.