The Hidden Bias Of Social Media Sentiment Analysis

Many of the leading social media monitoring suites come with some form of sentiment analysis technology, and that technology is used for a number of applications: to track buzz, measure crisis management, or gauge the efficacy of a campaign. I've hinted here in the past that I remain unsure about what to do with sentiment analysis, which has of course prompted a number of folks in the space to drop by and comment, for which I am grateful. I do, however, want to elaborate on my "discomfort" with automated sentiment analysis, because it is something to which I've given a fair amount of thought, and it deserves an equally fair shake here.

One issue, of course, is the open question of what you actually do with sentiment data. I am sure there is some relationship between social media sentiment and other key business metrics, but the onus is on the sentiment analysis folks to show that. When social media sentiment goes up or down, what, if anything, does that translate to? For example, when the "Motrin Moms" controversy raged across the Twittersphere, social media sentiment for Motrin likely plummeted. But did sales go down? A survey found that the actual opinion of average moms was markedly less negative (in fact, most knew nothing about it), so what would you do with this information if you were Motrin? The right answer is not "nothing," I'll grant you, but without some way to square the disconnect between social media analysis and other business measures, what do you then do with sentiment analysis? I'm not saying the sentiment analysis got this wrong--I'm merely saying that it got this different. Without knowing the cause of the delta or the extent of the correlation, I honestly don't know what to do with the data!

Let me state this up front - sentiment analysis is getting better. As processing power grows and algorithms become more sophisticated, this will naturally continue. Still, every recent comparison I've seen in print between man and machine for determining sentiment has the machine losing by a big enough gap that I don't feel I could ever look at the results of an automated sentiment analysis and not feel like I have to go back and check it again--which defeats the purpose, I suppose.

Here's what it's good at: let's say I post this on Twitter:

"I love my Toyota."

I suspect that's a no-brainer for any of the leading sentiment analysis tools. Where they have been getting better lately is with natural language processing and learning how consumers actually talk about specific categories and brands. So if I also post:

"Toyota FTW!"

...many of the leading tools will get this one right, as well. The next frontier for sentiment analysis is doing better with complex phrases that are comparative or conditional, like these:

"If my Toyota would stop when I pressed the brake, I'd LOVE it!"

or

"I love Toyota, but it's pretty tough to beat my Yugo on looks alone!"

At this point, I expect some of the sentiment analysis folks to chime in and comment that yes, their tool can handle these as well. I'm not a computer scientist, so I won't dispute any of that. But consider this: I actually do own a Toyota hybrid, and I am assuming that the acceleration problem and the braking problem will cancel out somehow and make my "average" ride safe. I don't personally believe any of the examples I've just listed; in fact, I believe the opposite. If you've made it this far, you now know my sentiment about the brand in question, but how would a computer handle this particular blog post in an automated sentiment analysis? Am I expressing a sentiment about the car in question or not--and if so, what?

The answer for many systems would be to take the logical step of avoiding sins of commission (labeling this as positive or negative) and instead risking the sin of omission--not categorizing it at all. In fact, that's exactly what Ignite Social Media's Brian Friedlander found when he examined a Radian6 sentiment analysis: 77% of the brand mentions he looked at were tagged as "neutral." In other words, the algorithm didn't make the wrong choice (labeling a positive as negative); rather, in close or complex cases, it defaulted to neutral. From a computer science perspective, that's probably the right choice. Again, I'm no computer scientist, and I am absolutely not picking on Radian6 here.
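
To make that "default to neutral" behavior concrete, here is a minimal sketch of the general idea--a classifier that only commits to a positive or negative label when its confidence clears a threshold. The class probabilities and the 0.7 threshold are invented for illustration; this is not how Radian6 or any particular vendor's tool actually works.

```python
# Sketch of a confidence-thresholded sentiment labeler: commit to positive or
# negative only when the model is confident enough, otherwise fall back to
# neutral (a sin of omission rather than commission). All numbers are made up.

def label_sentiment(scores, threshold=0.7):
    """scores: hypothetical class probabilities, e.g.
    {"positive": 0.45, "negative": 0.40, "neutral": 0.15}."""
    best_label = max(scores, key=scores.get)
    # Not confident enough to commit? Default to neutral rather than risk a wrong call.
    if best_label != "neutral" and scores[best_label] < threshold:
        return "neutral"
    return best_label

print(label_sentiment({"positive": 0.95, "negative": 0.03, "neutral": 0.02}))  # -> "positive"
print(label_sentiment({"positive": 0.45, "negative": 0.40, "neutral": 0.15}))  # -> "neutral"
```

A clean "I love my Toyota" sails past the threshold; a long, hedged, sarcastic blog post like this one piles up into the "neutral" bucket.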

What I am, however, is a survey research guy, and it is when I put on my sampling methodology hat that I see the hidden bias inherent in this approach--non-response bias. What Brian's analysis also uncovered makes a lot of sense: while only 28% of the brand mentions tested came from microblogging (like Twitter), 61% of the posts marked with a positive or negative sentiment came from microblogs. This makes total sense--it is much easier for a computer to make the right call on 140 isolated characters of "Toyota FTW!" than it is on this blog post, which will absolutely show up on Toyota's social media monitoring radar by dint of the number of times I've mentioned the brand. A computer scientist would rather have the machine make no choice than make the wrong choice--and that's fair. But consider then what your sentiment profile sample looks like.
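
To see how that kind of skew arises, here is a small worked example with invented mention counts and per-source tag rates, chosen only to roughly echo the 28%-of-mentions versus 61%-of-identified-sentiment pattern above; none of these figures come from Brian's analysis or any real tool.

```python
# Hypothetical mention counts and per-source "identification rates" (the share
# of mentions the tool tags as positive or negative rather than neutral).
mentions = {"microblogs": 280, "blogs": 400, "forums": 200, "facebook": 120}
tag_rate = {"microblogs": 0.55, "blogs": 0.15, "forums": 0.12, "facebook": 0.10}

tagged = {src: mentions[src] * tag_rate[src] for src in mentions}
total_mentions = sum(mentions.values())
total_tagged = sum(tagged.values())

for src in mentions:
    share_of_mentions = mentions[src] / total_mentions
    share_of_sentiment = tagged[src] / total_tagged
    print(f"{src:10s}  {share_of_mentions:5.0%} of mentions  ->  "
          f"{share_of_sentiment:5.0%} of identified sentiment")
```

In this toy example, microblogs supply barely over a quarter of the mentions but over 60% of the identified sentiment--exactly the shape of the imbalance described above, produced simply by the machine being better at tagging short, isolated posts.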

If 60% of your identified sentiment comes from 28% of brand mentions, and those mentions skew toward Twitter and other microblogging platforms because it's easier for the machine to be accurate there, then the majority of your sentiment profile is being determined by a tiny universe of unrepresentative consumers (the small percentage of online users with Twitter profiles) and not by the significantly larger sample of consumers on Facebook, leaving comments on blogs, or posting to message boards. Now, you can weight the responses derived from Twitter down in the mix, which would mitigate the impact of microblogging on your overall sentiment profile, but determining those weights is tricky, and even then you are left with the non-response bias of all the untagged/"neutral" mentions on other platforms, making comparisons difficult. I believe I have seen Conversition talk about weighting their data by source, so clearly I am not the only one thinking along these lines, but know that this expertise comes from a human, not a machine. And weighting by source to account for differential response rates is not the same thing as weighting to account for differential sentiment identification rates.
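
For what it's worth, here is a rough sketch of what weighting by source might look like under the same invented numbers: each source's net sentiment counts in proportion to its share of all mentions rather than its share of tagged mentions. The per-source net sentiment figures are made up, and the sketch says nothing about how Conversition or anyone else actually does this.

```python
# Continuing the hypothetical numbers from the previous sketch; all figures
# are invented for illustration only.
mentions = {"microblogs": 280, "blogs": 400, "forums": 200, "facebook": 120}
tagged   = {"microblogs": 154, "blogs": 60,  "forums": 24,  "facebook": 12}
# Net sentiment per source: (positive minus negative) share of *tagged* mentions.
net_sentiment = {"microblogs": +0.30, "blogs": -0.10, "forums": -0.25, "facebook": +0.05}

total_mentions, total_tagged = sum(mentions.values()), sum(tagged.values())

# Unweighted: each source counts in proportion to how much sentiment got tagged there.
unweighted = sum(tagged[s] * net_sentiment[s] for s in tagged) / total_tagged
# Weighted: each source counts in proportion to its share of all mentions.
weighted = sum(mentions[s] * net_sentiment[s] for s in mentions) / total_mentions

print(f"weighted by identification rate: {unweighted:+.2f}")
print(f"weighted by share of mentions:   {weighted:+.2f}")
# Even the reweighted figure rests only on the minority of mentions that were
# tagged at all; the untagged "neutral" bulk is still missing from the sample.
```

In this toy case the microblog-dominated figure looks comfortably positive while the mention-share-weighted figure is flat--and neither one tells you anything about the mentions the machine declined to call.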

All of this is one man's roundabout way of saying that I don't have the computer science expertise to challenge the accuracy of sentiment analysis, so I won't. But when you look at your sentiment analysis data, also consider the sources of that analysis. The iceberg analogy probably works best here--some small percentage of your sentiment is "visible" to a machine (i.e., easily categorizable), while the rest lies submerged under the ocean. Allowing your sentiment profile to be disproportionately weighted by microbloggers, who are the few, and not adequately represented by other social media users, who are the many, may lead you to draw conclusions about the iceberg from its tip that aren't, in fact, accurate. Sample is everything.

Your take? Believe me when I say this--I want to be proven wrong. But I want to be proven wrong by something that doesn't involve a proprietary, black box solution, because that will only be the exception that proves the rule. What say you?