
Social Media Data Analysis 101: Sampling and Reporting

By Tom Webster.

One of the things I hope to do in this space is to help practitioners and enthusiasts of social media become better consumers of the plethora of data being thrown around the interwebs. We see new "data" about Twitter, Facebook and other social network usage on a daily basis, but deciphering that data is often a challenge. What I've observed is that, for the most part, the actual studies themselves are fine for what they are. The reporting of these studies, however, is just awful, because there is such a poor understanding of the limits and uses of survey research data.

There are five basic questions any consumer--and any reporter--of survey data should always know the answers to before attempting to assess the significance of a given survey research project:

  1. Who Paid For The Survey?
  2. How Were The Questions Asked?
  3. Who Was Asked?
  4. Who Wasn't Asked?
  5. What Was The Exact Question?

Survey research projects that adhere to AAPOR standards must be clear about all of the above (with #4 inferred from #3, of course). Today's case in point is a recent survey of Twitter users in the Crowd Science network. I would encourage anyone interested in social media data to read the actual report (which is fine) and not the "reporting" of the report, which varies wildly in quality. Exhibit A: MediaPost's Research Brief on the study. Here's an excerpt:

Additional survey results include:

41% of Twitter users prefer to contact friends via social media rather than telephone, compared with 25% of non-Twitter social media users

11%, vs. only 6% of those not using Twitter, actually prefer social media over face-to-face contacts

14% of Twitter users said they have revealed things about themselves in social media that they wouldn't under any other circumstances

8% admitted to "frequently stretching" the truth about themselves online

Twitter users tend to be older than non-Twitter social media users (54% over 30 years old, vs. 42%)

They are twice as likely to be self-employed or entrepreneurs (18% vs. 9%)

24% vs. 15% "buy gadgets/devices when they first come out"

48% vs. 30% have created a website

37% currently maintain a blog, twice as many as non-Twitter social media users

The Crowd Science study was conducted across more than 600,000 visitors to multiple websites between August 5-13, 2009, targeting social media users age 12 and up.

The issue I have with most social media data like this is not with the study itself, but with how the study is reported. For instance--if you read the study, you will know that the actual sample for these statistics was not 600,000, but 718. The total Crowd Science sample base for their social media data may be 600,000, but the number of Twitter users who responded to this particular survey was 718. My point here is not that the survey had a small sample; rather, it's that there are some gaps in how the survey was reported. We'll just call that a sin of omission--it's not really what I want to dwell on here.
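As a back-of-the-envelope illustration of why 718 respondents isn't, by itself, the problem: if (and only if) this had been a simple random sample, the 95% margin of error on any reported percentage would be under four points. The numbers below are my own rough sketch, not anything from the Crowd Science report, and the formula only strictly applies to probability samples--which, as we'll see, is the real issue.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case 95% margin of error for a proportion,
    assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# 718 actual Twitter respondents vs. the oft-quoted 600,000 figure
for n in (718, 600_000):
    print(f"n = {n:>7,}: +/- {margin_of_error(n) * 100:.1f} points")
```

Run it and you'll see that going from 718 to 600,000 respondents buys you precision, not representativeness.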

The most egregious violation is far more subtle--the term "Twitter users." "Twitter users" is sprinkled liberally throughout the MediaPost story, and saying that Twitter users do this or do that makes for compelling--and tweetable--sound bites. However, it's simply incorrect to say "Twitter users" unless you are sure that you have a methodologically sound, representative sample of Twitter users--which, in this particular study, we do not have. If Crowd Science's report claimed this, the fault would lie with them--but it doesn't! If you actually take the time to read the whole report, you'll note that Crowd Science was methodical and careful in describing the sample as either "respondents to this study" or "Twitter users in this survey/we surveyed," etc. When a survey project uses a convenience sample, as this one does, the reporting is on solid ground as long as it refers to the sample as "respondents to this study (who use Twitter)." Referring to them as "Twitter users," however, implies that the results are projectable across the universe of Twitter users, which cannot be claimed in this case--and, to be fair to Crowd Science, was not claimed in the report.

The methodology of the study was fine--Crowd Science isolated the 718 Twitter users from a larger pool of visitors to websites in the Crowd Science network. That's who was asked (see #3, above). Who wasn't asked? Twitter users who were not visitors to Crowd Science network sites, for one, and we cannot easily characterize the non-response bias there without significantly more data about the Crowd Science network. Again, please do NOT take this as a knock on Crowd Science or their methodology--these concerns are common to many self-selected online surveys. The issue here is the journalistic shorthand that converts "respondents in this study who use Twitter" into "Twitter users." It's the latter term that gets retweeted ad infinitum into everyone's 140-character summations of this survey, and that is the real crime of this sort of reporting.
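To see why "who wasn't asked" matters, here is a toy simulation--with numbers I invented, nothing to do with Crowd Science's actual panel--showing how a self-selected sample can have a perfectly respectable sample size and still miss the true population figure, simply because the people the survey can reach differ from the people it can't.

```python
import random

random.seed(0)

# Toy numbers, invented for illustration only:
TRUE_BLOG_RATE = 0.20    # pretend 20% of ALL Twitter users maintain a blog
REACH_MULTIPLIER = 4.0   # pretend bloggers are 4x as likely to visit the
                         # sites where the survey is offered (self-selection)

def draw_respondent():
    """Draw random Twitter users until one actually 'sees' the survey."""
    while True:
        is_blogger = random.random() < TRUE_BLOG_RATE
        reach = REACH_MULTIPLIER if is_blogger else 1.0
        # Accept with probability proportional to how reachable they are
        if random.random() < reach / REACH_MULTIPLIER:
            return is_blogger

sample = [draw_respondent() for _ in range(718)]

print(f"True share of bloggers in the population: {TRUE_BLOG_RATE:.0%}")
print(f"Share of bloggers among 718 respondents:  {sum(sample) / len(sample):.0%}")
```

Note that the estimate isn't "wrong" about the people who answered; it just doesn't project to the people who never saw the survey--which is exactly the distinction between "respondents to this study" and "Twitter users."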

What this sort of sloppy reporting does is create a universe where one report of "Twitter users" from survey A contradicts another report of "Twitter users" from survey B, which gives marketers fits and reduces confidence in survey research, even though surveys A and B were both probably perfectly fine.

Got questions or observations about social media data? I'd love to hear from you.