BrandSavant

Gaining Insight From Social Media Data

Social Media Data Dredging

by Tom Webster on March 22, 2011


Lately I’ve been seeing a rash of conclusions, drawn from large social media datasets, offering some received wisdom about the best times to email, how often to tweet, and other alleged correlations between your behavior as a marketer, and your audience’s behavior.

Some of these offer legitimate correlations. Keep in mind, though, that a correlation is a number, and it generally shows the relationship between one numeric variable and another: for example, for every hour I write, I produce 5 pages. The time I spend writing can be said to correlate positively with the number of pages I write. There can also be negative correlations, where an increase in one number is correlated with a decrease in another.
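To make "a correlation is a number" concrete, here is a minimal sketch of the standard Pearson correlation coefficient. The figures are invented for illustration (they loosely follow the hours/pages example), and the names `hours_written`, `pages_produced` and `free_hours_left` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented numbers: roughly 5 pages per hour of writing.
hours_written = [1, 2, 3, 4, 5]
pages_produced = [5, 9, 16, 19, 26]
# Invented numbers: free time shrinks as writing time grows.
free_hours_left = [7, 6, 5, 4, 3]

r_pages = pearson_r(hours_written, pages_produced)   # close to +1
r_free = pearson_r(hours_written, free_hours_left)   # negative

print(f"hours vs pages: r = {r_pages:+.2f}")
print(f"hours vs free time: r = {r_free:+.2f}")
```

Both inputs have to be numbers for this calculation to mean anything, which is exactly where things get tricky in the next paragraphs.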

Where correlations get tricky is with discrete variables, which my good friend Matt Ridings reminded me of recently. In the case of a satisfaction score, like “very satisfied,” “somewhat satisfied,” etc., although the variables are not numbers, they are derived from a continuous scale, and ordered from low to high, so you can use them as proxies for numbers of a sort.

Some discrete variables, however, do not represent continuous ordered numbers. These include things like the day of the week. Monday is not “more” or “less” than Saturday. Monday is just not Saturday. So when we derive associations between things like emails opened and days of the week, these are hard to express as correlations – think about my hours/pages analogy. If there is a correlation between day of the week and tweets retweeted, how would you express it? For every…what?…an extra retweet is observed? (To really do the work here would be to do a robust multiple regression analysis with a lot of variables – including times and days of the week you send your messages, but many others, as well – to determine just how important days or times are to open rates or tweets. They may contribute something to the variance of open rates, but since I never see anyone actually do this work, it’s all conjecture.)
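As a rough sketch of what "day of the week" looks like in practice: since the days have no numeric order, you end up comparing a rate per day, and a regression would need one yes/no column per day rather than a single number. The send log below is entirely made up, and `open_rate_by_day` and `day_indicators` are hypothetical helper names, not part of any tool mentioned here.

```python
from collections import defaultdict

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up send log: (day sent, emails sent, emails opened).
sends = [
    ("Mon", 1000, 210),
    ("Wed", 1000, 260),
    ("Wed", 500, 120),
    ("Sat", 800, 150),
]

def open_rate_by_day(rows):
    """Pool all sends for each day and return opened / sent per day."""
    sent = defaultdict(int)
    opened = defaultdict(int)
    for day, n_sent, n_opened in rows:
        sent[day] += n_sent
        opened[day] += n_opened
    return {day: opened[day] / sent[day] for day in sent}

def day_indicators(day, baseline="Mon"):
    """One 0/1 column per day except the baseline day.

    This is the usual way an unordered category is fed to a regression:
    "Wed" is not a bigger number than "Mon", it is just a different column.
    """
    return [1 if day == d else 0 for d in DAYS if d != baseline]

rates = open_rate_by_day(sends)
for day, rate in rates.items():
    print(f"{day}: {rate:.1%} opened, indicators = {day_indicators(day)}")
```

Note that the output is a set of group rates, not a single "for every X, Y more opens" figure. That is why these findings are better described as associations until a proper analysis says otherwise.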

The Dredges

This, however, is not the biggest problem with all of these “best time to send email” or “number of times to tweet” data points. The real issue is that these loose associations are prime examples of data dredging. I don’t want this blog to turn into statistics school for anyone – that would be a snooze for me, too – but this is an important concept to understand before you blindly accept the conclusions of social media “studies” derived from unstructured data.

Data dredging is simply this – sifting through a big ole’ pile of numbers without a hypothesis, eyeballing things that appear to be related but might just be random chance. If you just take a pile of data that shares one common element (the thing you are interested in), you will likely find some other common elements purely by chance. Until you prove otherwise, these are associations, not necessarily correlations.

The classic example for this is the cancer cluster. If you sift through cancer rates geographically, you will likely find areas of the country that report higher than normal rates for a given form of cancer. If you just “dredge” that data, you might find other things that those populations have in common – maybe they live near a Superfund site. Maybe they also have higher than average incomes. Maybe they live near the highest concentration of fast food restaurants, but equally – maybe they live in an area with the highest concentration of shoe stores. You see where I am going with this?

Sifting through the cancer cluster data and reporting that living near fast food concentrations correlates with high cancer rates might sound reasonable to some, but the exact same process could also lead you to say that shoe stores cause cancer. You need to test these hypotheses on different data sets than the ones used to dredge the correlations in the first place to determine if they are real correlations. So if you do dredge out that cancer rates and shoe stores seem to go together in one data set, you need to apply that theory to an entirely different data set to see if they are, in fact, related.
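Here is a small simulation of that process, offered as a sketch rather than a real epidemiological model. Everything in it is pure random noise, so no attribute is truly related to the "cancer rate." The region and attribute counts, the seed, and all names are assumptions chosen for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(2011)   # fixed seed so the run is repeatable

N_REGIONS = 30      # assumed number of regions
N_FEATURES = 200    # assumed unrelated attributes (shoe stores, and so on)

def random_dataset():
    """A cancer rate per region plus N_FEATURES attributes, all independent noise."""
    rates = [rng.gauss(0, 1) for _ in range(N_REGIONS)]
    features = [[rng.gauss(0, 1) for _ in range(N_REGIONS)]
                for _ in range(N_FEATURES)]
    return rates, features

# Dredge: scan every attribute and keep whichever looks most related.
rates_a, features_a = random_dataset()
correlations_a = [pearson_r(f, rates_a) for f in features_a]
best = max(range(N_FEATURES), key=lambda i: abs(correlations_a[i]))
dredged_r = correlations_a[best]

# Test: measure that same attribute in an entirely different dataset.
rates_b, features_b = random_dataset()
fresh_r = pearson_r(features_b[best], rates_b)

print(f"attribute #{best}: dredged r = {dredged_r:+.2f}, fresh r = {fresh_r:+.2f}")
```

With enough attributes to sift through, the "winner" in the first dataset will usually look impressive even though nothing real is going on. The only honest check is the second measurement, taken on data that played no part in picking the winner.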

The fact is, most of these “best times to send an email” stats might just be pure coincidence, and they certainly don’t necessarily apply to you and your business. The right answer for you is to look at your data. Chris Penn has detailed a pretty effective way to do this yourself, if your email service provider offers metered send functionality, and I highly recommend you give his method a try.

You could also just randomly separate your data into two subsets, then go ahead and dredge the heck out of one of them. You’ll need to account for whether or not you tend to email on the same day of the week, as this probably has the single highest correlation to when your emails are opened, no? But let’s say you normalize things to show data from emails sent across every day of the week. In data subset 1, you might observe that emails sent on Wednesdays are opened more frequently. Now you have a hypothesis, not a real correlation. All you need to do is check that hypothesis against the other data subset – if you have a real correlation, and not a coincidence, you’ll see the same correlation, to roughly the same degree.
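A minimal sketch of that split-and-check routine follows. The email log is simulated, with every day given the same true 20% open rate, so any "best day" it turns up is a coincidence by construction. With your own data you would load real records in place of the simulated `emails` list; the helper names `split_in_two` and `open_rates` are hypothetical.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
rng = random.Random(42)   # fixed seed so the run is repeatable

# Simulated log of individual emails: (day sent, was it opened?).
# Assumption: every day has the same true open rate of 20%.
emails = [(rng.choice(DAYS), rng.random() < 0.20) for _ in range(7000)]

def split_in_two(records):
    """Shuffle a copy of the records and cut it into two random halves."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def open_rates(records):
    """Open rate per day of the week for the given records."""
    sent = {day: 0 for day in DAYS}
    opened = {day: 0 for day in DAYS}
    for day, was_opened in records:
        sent[day] += 1
        opened[day] += was_opened   # True counts as 1
    return {day: opened[day] / sent[day] for day in DAYS if sent[day]}

subset_1, subset_2 = split_in_two(emails)
rates_1 = open_rates(subset_1)
rates_2 = open_rates(subset_2)

# Dredge subset 1 for a hypothesis, then check it against subset 2.
best_day = max(rates_1, key=rates_1.get)
overall_2 = sum(opened for _, opened in subset_2) / len(subset_2)

print(f"Subset 1 suggests {best_day}: {rates_1[best_day]:.1%} opened")
print(f"Subset 2 on {best_day}: {rates_2[best_day]:.1%} (all days: {overall_2:.1%})")
```

If the day that won in subset 1 does not stand out to roughly the same degree in subset 2, treat it as a coincidence and move on.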

If you don’t, then you don’t have a correlation. And my real point here is that you should do this work yourself, with your data – it’s the only right answer for you. As my friend DJ Waldow notes about “email best practices,” your email best practices (or Twitter, or Facebook, etc.) are the best practices for you. Your variables will be different from my variables – perhaps dramatically so – and listening to me tell you that Fridays are the best day to send B2B emails when your audience is primarily in the Middle East (where Friday is the first day of the Islamic weekend) is just plain silly.

Finally, the most sinister thing I observe with all this dredger-y: often, the dubious conclusions I see thrown around regarding these issues are accompanied by some form of weak caveat: “of course, correlations don’t prove causation, but…” These statements are designed to make you believe that the study is reasonable, that the appropriate data safeguards have been employed, and the author of the study is cradling you in the loving arms of reason. They let your guard down. Of course correlations don’t prove causality, but the author is presenting them for a reason – after all, as Edward Tufte famously quipped, “correlation isn’t causation, but it sure is a hint.” The correlation caveat is designed to make the hint seem reasonable, cautious and considered.

Here’s the thing, though – correlations don’t necessarily imply causation (that’s the easy thing to say), but associations aren’t necessarily correlations, either. So the next time you hear the results of data dredging about timing and frequency of marketing messages, don’t treat these results as facts – treat them as a hypothesis. Could be right, could be coincidence, but don’t accept it at face value. Test it on your own data.

Thanks for reading.

  • http://www.b2bbloggers.com Jeremy Victor

    This is so helpful and something we should all post in our offices:

    “next time you hear the results of data dredging about timing and frequency of marketing messages, don’t treat these results as facts – treat them as a hypothesis. ”

    Run your own tests and determine what works best for your community, audience, list, whatever. The response of those you are working to influence is all that matters.

  • http://twitter.com/webby2001 Tom Webster

    I like the idea of BrandSavant being posted in offices. You are hereby encouraged to do so :)

  • http://twitter.com/pricingright Rags Srinivasan

    Tom,
    Very well written article that neatly illustrates the flaws in these studies. Your recommendation of multiple regression is spot on, but we should still be aware that regression is still correlation.

    Studies that add, “correlation does not mean causation” go on to hint (not so subtly) at exactly that.

    I wonder if most of the data-dredging articles are written for an unsuspecting audience that is willing to accept ideas without challenging them.

    Regards
    -rags

  • http://www.techguerilla.com/ Matt Ridings – Techguerilla

    I could kiss you for writing this one Webster. I won’t, which I know is disappointing, but it’s the thought that counts.

    Speaking of counting.

    Many times we also lose sight of the real objective in an attempt to provide some astounding ‘insight’ using data. There’s a vast difference in quantitative and qualitative, I just wish more people focused on the latter. We tend to stop short of measuring our *actual* objectives in favor of easy numbers. Repeat Customer vs. Avg. Order Size vs. Conversions vs. Clickthroughs vs. Opens. Or Comments vs. Time on Page vs. Clicks vs. Tweets.

    Even when done correctly the results can only provide you a very focused *question*. The next step is finding the answer….*why* are the results what they are? It is only through answering that question that one can manipulate the variables involved to fully maximize your individual success. You don’t have to have a deep understanding of applied behavioral analysis, some common sense and a willingness to test hypotheses will take you a long ways.

    Cheers,

    -Matt

  • http://brasstackthinking.com Amber Naslund

    Oh, Webster. Seriously, if Miriam wouldn’t kick my ass, I might just marry you for this. After Ridings kisses you. Which could be awkward. But anyway.

    I’m so glad you brought up the issue of having a hypothesis. Scientific rigor for ages has used this as the basis for experimentation and testing, and it’s astounding that we don’t apply the same simple concept to everything from business plans to what seems to pass for “social media research” these days.

    And Ridings brilliantly points out above that the results – including the data in our beloved reports and dashboard – only really represent questions and indicators, not conclusions. And we often want definitive answers, but lack the discipline to test multiple sets of variables or let go of the preconceived conclusions we WANT to find in the data in favor of teasing them out.

    So it’s either we accept that we’re looking at associations, assumptions, and some educated guesses (or even possible coincidences), or we put our money where our mouth is and invest time, money, and resources in giving the deep data its due.

    You’re a gem. Thanks for writing this.

    A

  • http://twitter.com/loriaustex Lori Witzel

    How about temporary tattoos? :-)

  • http://twitter.com/loriaustex Lori Witzel

    This post? Better than smartnin’ pills, and…even better than coffee. Sanity. Salience. Simplicity. And other things that begin with “s” that I can’t think of at the moment. What Amber and Matt wrote, + my thanks.

  • http://twitter.com/webby2001 Tom Webster

    Got some smartnin’ pills?

  • http://twitter.com/webby2001 Tom Webster

    It’s kinda the scientific method :) Really, there aren’t many processes that wouldn’t be improved by just *one* more question, one more layer of thought. Not two or three, or finding excuses for inaction, but just one little test of our assumptions often pays big dividends.

    Would be a good fight, too – let me tell you.

  • http://twitter.com/webby2001 Tom Webster

    Disappointing, as you say.

    You are dead on about the “easy numbers”. Unstructured data gives us so many more ways to count things, that we can’t help ourselves (me, too!) But I think 2011 will really be the year of rejecting the easy answers to the wrong questions with unstructured data. And I’m all for a renaissance in qualitative research (it’s where I spend most of my client work). It’s the ticket to “why.”

  • http://twitter.com/webby2001 Tom Webster

    Thank you, Rags! Appreciate your comment. True, a regression will still show a correlation – but (properly done) it will at least tell you the amount of variance that “day of the week” and “time of day” actually contribute to the correlation. Maybe none of it matters. Maybe it does.

    And I’d like to believe that audiences are becoming more “suspecting,” a little at a time. We all do our part.

  • Mark W Schaefer

    You are my hero.

  • http://twitter.com/webby2001 Tom Webster

    That’s funny, I thought you were MY hero. Your standards….

  • http://www.communicationammo.com Sean Williams

    G-d bless you, Mr. Webster. I often feel like the lonely voice crying in the proverbial wilderness when I ask (demand!) that someone open up the black box and let researchers test the methods. This is why we need independent testing of any of these magic bullets to measure influence, or impact, or whatever. I feel the same way about news stories that report on the results of a survey undertaken by a brand. It’s not real research if it’s not independently verifiable… The “data dredge” isn’t easy (but a few of us like it), and that sort of deductive research — agenda-less and open — can reveal some interesting stuff indeed. But we all want the net-net — just tell us what we do to solve the puzzle, eh?

    Thanks for writing this — and thanks to @markwschaefer for pointing me to it.
    Sean

    @Commammo

  • http://twitter.com/christopholies Chris Hill

    I totally agree. There’s nothing like using your own data to understand your clients. I can tout what Mashable says about Twitter or Facebook all day long and tell my clients what they should do based on those stats, but if I don’t look at THEIR data, I’ll never be creating practical solutions for them.

    Again, great post!

  • http://socialbutterflyguy.com/ DJ Waldow

    Please add me to the list of “I want to hug/kiss Tom Webster for writing this.” Is it weird that I read this post out loud to my wife this morning?

    I think part of the appeal to “when is the best day/time” studies is that we all want a quick fix. As you say often, you have to do the work. We just did a test with our Factory Direct (now weekly) email newsletter and found that it performed the best – for US – on Friday afternoon. We tested it. We did a metered send that went out over a several day period. Friday showed the best results. Again, the best results for US.

    One more thing…

    This article/post “came across my desk” today: http://www.famecount.com/news/new-study-finds-link-between-social-media-popularity-and-stock-prices-242652 – curious what you think about it. They use phrases like, “Statistically significant correlation” and “Data suggests…” and “ANOVA and linear regression analysis.” Seems legit to me, but…then I read the comment by Matthew Grabowski and I think maybe it is BS.

    Learn me Webster!

  • http://www.convinceandconvert.com jaybaer

    Excellent post TW.

    Before I got all social and such, my former digital agency specialized in marketing testing and optimization. We actually DID the kind of work you suggest, including a ton of full-blown multi-variate testing of emails, landing pages, banner ads, and the like.

    I’ve overseen at least 30 programs of this type and the answer is universally the same – there is no answer. Your customers are not my customers, even if we’re in the same market space.

    Is it nifty to think that Tuesday is the best day to email? Sure, but it’s not true. Or at least, it might not be true. Or maybe it is. The only way you can know is to test it yourself, for your customers.

    And what do you do once you complete a full suite of tests and figure out the optimal time, date, form, fashion to do anything online? You start the tests all over again from the beginning, because norms and consumer behavior and expectations change rapidly (not to mention the fact that your customer group isn’t static – hopefully).

    The question becomes “how much $$ do you want to spend to know that you’re right?” Which is where tying marketing optimization to incremental revenue or LTV becomes very, very important.

    I’ll hug you for this one, but will withhold kisses until this cold goes away.

  • http://twitter.com/KellyeCrane Kellye Crane

    I don’t know you, so I won’t propose we hug, kiss, or marry. But I will say that the core of this post is advice that’s much needed and in short supply: think critically. Don’t take “studies” presented by people who’ve inserted scientist or sociologist into their title at face value (my spouse happens to be both of those things, and it means a lot more than shuffling some numbers around so they prove a given point).

    It’s long been a peeve of mine to see how quickly positive “proof points,” based on junk science, circulate through the social space. This post is an excellent explanation of some of the common gotchas. Let’s hope folks are open to hearing the message!

  • http://lifeanalytics.blogspot.com Themos Kalafatis

    Dear Tom,

    Thank you for this thought-provoking post. As someone who has discussed Social Media Data quite often, I believe that I should say a few words.

    First of all, it is true that Social Media Analytics has a lot of challenges: Sampling Bias is one of them. If we are talking about Sentiment Analysis then we add “Algorithm Bias” to the list of challenges as well. Having discussed the challenges of Social Media Data on both AnalyticBridge.com and “Research Methods” on LinkedIn, I can say there are lots of them, and so the knowledge extracted should be treated as hypotheses to work with.

    Of course “correlation does not imply causation”. When presenting results from analyzing social media data, phrases such as “were found” or “appear to”, as opposed to “x implies y”, should be used because of inherent problems with Social Media Data. Whatever “was found” during an analysis is just that and does not imply causation.

    Here is the text from a post on my blog, LifeAnalytics which can be found under the title “Bio Information and Popularity” :

    “It is time to identify the words contained in the biographies of popular Twitter users and to be more specific the biographies of users being in the top 30% (in terms of no. of followers) of a random sample of 10000 users. As i always have stated in these series of posts: Treat results as possible clues only. Please also notice how i used (in this and older posts) the words “appears” or “were found” when discussing correlation”

    Thank you again, i will be posting about this subject in my blog shortly.

  • Kellye

    That first line sounds cranky, but was meant to be a joke, FYI!

  • http://www.edisonresearch.com Tom Webster

    If my posts do not elicit genuine marriage proposals each and every time, then I am a failure as a blogger. Or normal. Forget which. Thanks for commenting! My spouse is also a real, genuine scientist, so I can’t get away with any of that stuff.

  • http://twitter.com/Nectarineimp Peter Mancini

    Correlation does not imply causation. It is a cornerstone of science. The concept is thousands of years old, literally. And yet, as important as it has been for the progress of the human race, we still find smart people don’t always understand this. Two variables may indeed correlate to a high degree (positive or negative), but changes in one won’t produce changes in the other. Also, one may drive the other but not vice versa. That they correlate does not mean that one is a response variable and the other an explanatory variable.

    Example. Ice Cream Sales, outdoor temperature and crime rate all correlate. However, crime rate does not drive ice cream sales or vice versa. Also, you can have the absolute most amazing sale on ice cream and put sales through the roof, but that won’t make it hotter outside.

    We are humans and as such we have evolved great pattern recognition abilities and correlation is something we observe naturally. However, it is a failure of cognition to attribute cause to correlation without having carefully eliminated other potential causes. This is why scientific experiments often have elaborate controls even when the experiment itself is simple.

    Analysts, if they are to be effective and good at what they do, need to learn the scientific method and develop defense against common cognitive failures. Having a strong understanding of how science comes to strong conclusions will help when someone is analyzing big data so that they don’t make weak or unfounded conclusions.

  • http://www.edisonresearch.com Tom Webster

    To be fair, Peter – sometimes correlations do lead to uncovering causality, so it’s always worth riding them down. My big issue is in saying that, but then rather boldly presenting an *association* as a correlation. You are 100% right that there isn’t enough of the scientific method in play here, but social media is a young field, and as it matures, so will standards. Thanks for commenting!

  • http://raulcolon.net Raul Colon

    I see many of my possible clients with the same issue: trying to make a science out of everything, but taking shortcuts and not following a well-defined process.

    I agree there is no better advice than treating it as a hypothesis and not as fact.

  • http://www.thefourthrevolution.org Jeremie Averous

    Hi Tom
    thanks for this provocative post. Actually I find it very inspiring in terms of what not to do!
    Obviously we are now faced with the difficulty of an overflow of accessible data and making sense out of it is a challenge. And as you mention rightly, associations or clusters might not be correlated unless statistically proven to be significant.
    We know from sciences like epidemiology that below a certain effect size, the cause-effect relationship will never be statistically observable, because it will be diluted in the noise from the sampling.
    So I suggest – don’t start data dredging unless you have an objective about what you want to find and some idea that it’ll be statistically significant.
    Thanks for these insights and the inspiration – I’ll post on this soon on my blog
    Cheers
