
Comparing Sentiment Analysis in Yelp and Twitter

Ever since I attended a talk by Julia Silge regarding text mining in R, I’ve been enthralled with the process and its applications. I wanted to explore sentiment analysis from a business intelligence standpoint, so I played with the example of airline customer satisfaction reviews and ratings. I’ll walk you through it below.

While there are many avenues I could use to gauge consumer opinion, I decided to focus on Yelp and Twitter for comparison. The differences between these media are subtle but important. With Yelp, I can find a direct link between the customer’s opinion and their star rating of the experience, while on Twitter I’m not likely to find an objective rating of opinion.

Ideally, I could find a lexicon of word sentiments that can be applied to consumer opinions regardless of where they come from (email, phone calls, Yelp, or another channel). There are plenty of established lexicons publicly available, but for this exercise I’m exploring just one, while also attempting to build a customized lexicon based on Yelp reviews. The data I’m using is publicly available from Yelp and Twitter. The Yelp data is filtered to 1,188 reviews of 15 different airlines between June 24, 2006 and July 16, 2014. The median review length is 115 words, the average review contains 856.09 characters, and the average star rating is 2.22.

A Simple Analysis

I’m using R and the tidytext package to help with this exercise. I’ll begin with simple counts of words by Yelp star rating. The following chart suggests the words match what I think a high-rating review should look like: on the right, we see words like “enjoyable”, “excellent”, and “seamless”, while on the left, the lower-star reviews contain words like “difficult”, “expensive”, and “wtf”.
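The counting step can be sketched with tidytext in a few lines. The `reviews` tibble below is a hypothetical stand-in for the actual Yelp data (the real data frame and column names may differ):

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the Yelp review data (columns assumed)
reviews <- tibble::tibble(
  stars = c(5, 5, 1),
  text  = c("Excellent crew and a seamless, enjoyable flight",
            "Seamless boarding, excellent service",
            "Difficult rebooking, expensive fees, wtf")
)

word_counts <- reviews %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop "and", "a", etc.
  count(stars, word, sort = TRUE)         # word frequency by star rating
```

From here, the counts feed directly into a bar chart faceted by star rating.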

Yelp Review Word Frequency


Because we are working with word counts, the fact that there are many more 1-star reviews than any other likely influences how some words appear more negative than they might actually be. For example, a neutral word like “connection” might be skewed toward a negative sentiment simply because it is more likely to show up in a 1-star review.
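One way to guard against that imbalance is to compare each word’s share of all words within its star rating, rather than the raw counts. A minimal sketch, with toy numbers standing in for the real counts:

```r
library(dplyr)

# Toy counts: "connection" looks negative only because
# 1-star reviews dominate the corpus
word_counts <- tibble::tibble(
  stars = c(1, 1, 5),
  word  = c("connection", "rude", "connection"),
  n     = c(30, 20, 5)
)

word_props <- word_counts %>%
  group_by(stars) %>%
  mutate(prop = n / sum(n)) %>%  # share of all words within that rating
  ungroup()
```

On this scale, a word is only “negative-leaning” if its share among 1-star words exceeds its share among 5-star words, regardless of how many reviews each rating has.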

Yelp Ratings

Now, because the Yelp data covers a limited range, and because Yelp was in its infancy for a fair amount of that time, it’ll be good to get a feel for how ratings have changed over time.

The chart below shows that the average review score decreased steadily between early 2013 and mid-2014. The number of monthly reviews increased in this same period, as represented by the size of the point.

Sentiment Analysis

Let’s dive into the heart of the sentiment analysis. The first step is to identify a sentiment lexicon that fits our use case. There are plenty of lexicons available, but it can be arduous to find one that uniquely matches our intuition of consumer opinion ratings.

I chose the sentiment lexicon developed by Hu & Liu, which is available in tidytext as the “bing” lexicon. This lexicon classifies nearly 7,000 words with either a “positive” or “negative” label. After applying these labels to the words in each review, the data starts to reveal the sentiment of each review.
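Applying the lexicon amounts to a join in tidytext, where `get_sentiments("bing")` returns the Hu & Liu word list. The `review_words` tibble here is a hypothetical, already-tokenized stand-in for the Yelp reviews:

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized reviews: one row per (review_id, word)
review_words <- tibble::tibble(
  review_id = c(1, 1, 2, 2),
  word      = c("excellent", "seamless", "rude", "awful")
)

review_sentiment <- review_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # adds a sentiment column
  count(review_id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n,
                     values_fill = 0) %>%
  mutate(net = positive - negative)  # net sentiment per review
```

Note that the inner join silently drops any word not in the lexicon, which matters later when we ask how well the labels cover real reviews.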

But how do we know these sentiment labels accurately capture consumer sentiment? In the sentiment plot shown above, we don’t see the connection between negative word sentiment and the decrease in average ratings over time that we saw earlier.

In the following chart, we gauge the relation between the Hu & Liu sentiment labels and the Yelp consumer rating. The negative label captures the lower Yelp scores, but the positive label does not capture the higher Yelp scores very well. This means that any randomly chosen positively labeled word is about as likely to come from a 1-star review as from a 5-star review.

Customized Sentiment Lexicon

One solution to this problem is to create my own sentiment scores. I have all the necessary information for a machine-learning prediction model. The Yelp data has the raw text and an objective numeric rating indicating the consumer’s opinion.

There is a lot of work going on in the background here, but basically I’m building a generalized linear model using the frequency of each word as input and the Yelp star rating as the output. This allows me to start each review with a baseline rating and add or subtract points based on the words that show up in the review. With this newly created machine learning model, we can compare the predicted versus actual scores for each review.
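In spirit, the model looks something like the sketch below: one indicator column per word, one row per review, and a linear model whose intercept is the baseline rating and whose coefficients are the per-word adjustments. The toy data and the plain `lm` call are illustrative assumptions, not the actual model fit on the full Yelp vocabulary:

```r
library(dplyr)
library(tidytext)

# Toy reviews standing in for the full Yelp corpus
reviews <- tibble::tibble(
  review_id = 1:6,
  stars     = c(5, 4, 1, 2, 3, 3),
  text      = c("great service", "great", "rude", "rude agent",
                "average", "average seats")
)

word_feats <- reviews %>%
  unnest_tokens(word, text) %>%
  add_count(word, name = "word_total") %>%
  filter(word_total >= 2) %>%            # keep words seen in 2+ reviews
  distinct(review_id, stars, word) %>%
  mutate(present = 1) %>%
  tidyr::pivot_wider(names_from = word, values_from = present,
                     values_fill = 0)    # one indicator column per word

# Intercept = baseline rating; each word's coefficient adds
# or subtracts points when that word appears in a review
fit <- lm(stars ~ . - review_id, data = word_feats)
```

Here `coef(fit)` assigns “great” a positive score and “rude” a negative one. At the scale of the real vocabulary, a regularized GLM (e.g. via glmnet) would be the practical choice.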

The density chart above indicates the prediction model is not too far off base. I do notice, however, that the predictions tend to hover around the middle, with actual scores between 2- and 4-stars predicted to be quite close to each other. This means my model might not perform very well with mid-range scores. The customized scores do a better job of distinguishing between positive and negative consumer opinions than the established sentiment lexicon. Any randomly chosen positive word is now much more likely to come from a higher-star review.

Influential Words

Now I want to look at the 20 most influential words in the Yelp data, and the analysis lines up with what I’d expect negative and positive reviews to look like. Interestingly, “nickel” appears as part of the phrase “nickel-and-dime”, and “averaged” shows up when consumers justify their rating as the aggregate of all their experiences with the airline.

For a larger view of the impact of certain words, I’ve created the chatterplot below. Words toward the top of the chart appear more often in reviews, while those towards the right have a more positive connotation. Words like “never”, “again”, and “rude” show up frequently with negative connotations, and words like “great”, “friendly”, and “best” with positive connotations.

Twitter Opinion

Now that I’ve built the prediction model, I’ll see how it applies to consumer opinions shared on Twitter. My suspicion is that tweets might be too different from Yelp reviews for the predictions to be applicable.

First, the data contains 14,485 tweets from February 17-24, 2015. Tweets have a median length of 19 words and an average of 103.8 characters, much shorter than the Yelp reviews.

When calculating the score of each tweet, I’m not surprised to find that there are many 0-score tweets: tweets composed entirely of words that our prediction model did not account for.
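Scoring works by summing the learned word scores over each tweet’s words, with any word the model never saw contributing zero. A minimal sketch, with made-up word scores and tweets:

```r
library(dplyr)
library(tidytext)

# Hypothetical word scores learned from the Yelp model
word_scores <- tibble::tibble(
  word  = c("great", "rude", "worst"),
  score = c(1.5, -1.5, -2.0)
)

tweets <- tibble::tibble(
  tweet_id = 1:3,
  text = c("@united worst service, rude crew",
           "@VirginAmerica great flight tonight",
           "@USAirways is my bag on flight 504?")
)

tweet_scores <- tweets %>%
  unnest_tokens(word, text) %>%
  left_join(word_scores, by = "word") %>%
  mutate(score = coalesce(score, 0)) %>%   # unseen words contribute 0
  group_by(tweet_id) %>%
  summarise(tweet_score = sum(score))
```

The third tweet scores exactly 0 because none of its words appear in the model, which is precisely how the conversational tweets below end up with a 0 score.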

For example, looking at some 0 score tweets, many are either questions or are conversational in tone, as users are more likely to engage in direct dialogue with the airline itself.

## # A tibble: 3 x 4
##   tweet_id tweet_score text                             date              
##   <chr>          <dbl> <chr>                            <dttm>            
## 1 3845               0 @united There are Exit-Row wind~ 2015-02-18 08:34:23
## 2 4247               0 @united stay warm – I will be p~ 2015-02-17 10:35:45
## 3 10569              0 @USAirways if one with @America~ 2015-02-21 08:40:33

I removed the 0-score tweets and continued exploring. After converting these tweet scores to the “star rating” units used in the original Yelp prediction model, the chart below shows that a majority of the tweets are neutral, hovering around 2.4 stars.

Let’s take a look at a few of the tweets with the highest and lowest scores. You can see that they match our intuition of what a consumer opinion would look like.

## # A tibble: 6 x 2
##   stars_predicted text                                                     
##             <dbl> <chr>                                                   
## 1           0.784 @united is the worst. Worst reservation policies. Worst ~
## 2           1.51  @united worst customer service experience ever. 40 minut~
## 3           1.52  @USAirways flight 850. RUDE RUDE RUDE service! Awful.   
## 4           4.11  @AmericanAir awesome flight this morning on AA3230! Awes~
## 5           4.14  @VirginAmerica completely awesome experience last month ~
## 6           4.38  @usairways great crew for flight 504 PHX to YVR tonight!~

As you can see, a machine learning approach to gauging consumer opinions isn’t a one-size-fits-all proposition. Based on these examples, an existing sentiment lexicon may not always match our use case. Given the right data, it is possible to build our own lexicon, but there are limitations to this approach when the tone of the text being analyzed is not consistent between the sources of text. Here, Yelp provides a more thoughtful and long-form review of a business or experience, while Twitter is inherently limited in the amount of text it can display – and the level of sentiment expressed.

In this example, a further analysis could customize a different consumer opinion lexicon using our own labels for each tweet, which would lead into building a new prediction model tailored to Twitter’s short-form and conversational tweets.

