OPIM 609 Final Assignment

October 10th, 2022


Daniel Behrns, Adriana Cailao, Matthew Nicoletta, Adrian Ott, Jeannie Sprangle

  1. Preparation
    1. Packages
    2. Functions
    3. Load Data
  2. Explore
    1. Hotels
    2. Reviews
  3. Clean
    1. Split dataframes
  4. Wordclouds
    1. Good reviews
    2. Bad reviews
  5. Topic Modeling
    1. Good reviews
    2. Bad reviews
  6. n-grams
    1. Good reviews
    2. Bad reviews




Load data




Let's look at the hotel table. It has 13 columns and more than 283K unique venues. Judging by the categories, however, most of these venues are not actually hotels. Let's see how many are left if we limit the venues to those that are categorized as hotels.


The following code uses a SQL-style merge (join) on the hotelID field to add hotel name and category information to the reviews dataframe. We can then use the categories field to filter out non-hotel reviews.
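As a rough illustration of this join-then-filter step, here is a minimal pandas sketch on toy data. Only the hotelID field is named in the text; the other column names (`reviewID`, `name`, `categories`, `rating`), the sample rows, and the choice to match the literal substring "Hotels" are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two source tables.
# Column names other than hotelID are assumptions for illustration.
hotels = pd.DataFrame({
    "hotelID": ["h1", "h2", "h3"],
    "name": ["Grand Plaza", "Corner Diner", "Lakeside Inn"],
    "categories": ["Hotels, Event Planning & Services", "Restaurants, Diners", "Hotels"],
})
reviews = pd.DataFrame({
    "reviewID": ["r1", "r2", "r3", "r4"],
    "hotelID": ["h1", "h2", "h3", "h1"],
    "rating": [5, 2, 4, 1],
})

# Left join: keep every review and attach the matching venue's name and categories.
merged = reviews.merge(
    hotels[["hotelID", "name", "categories"]], on="hotelID", how="left"
)

# Keep only reviews whose venue lists "Hotels" among its categories.
# fillna("") guards against reviews with no matching venue row.
is_hotel = merged["categories"].fillna("").str.contains("Hotels", regex=False)
hotel_reviews = merged[is_hotel].reset_index(drop=True)
```

A left join is used here so that no review is silently dropped by the merge itself; the category filter is then the only step that removes rows.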

Now that we have filtered the review dataset down to approximately 32K reviews of venues categorized as hotels, let's get a sense of how many of these reviews have been flagged as fraudulent and/or violating the Yelp terms of service.

We will remove the roughly 28% of reviews flagged as fraudulent (Y/YR) and verify that only N/NR reviews remain.
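The flag check and removal might look something like the sketch below. The column name `flagged` and the toy rows are assumptions; the flag values Y, YR, N and NR come from the text.

```python
import pandas as pd

# Toy review table; the column name "flagged" is an assumption.
reviews = pd.DataFrame({
    "reviewID": ["r1", "r2", "r3", "r4", "r5"],
    "flagged": ["N", "Y", "NR", "YR", "N"],
})

# Share of reviews carrying each flag value.
flag_share = reviews["flagged"].value_counts(normalize=True)

# Keep only the reviews that were not flagged (N / NR).
clean = reviews[reviews["flagged"].isin(["N", "NR"])].copy()

# Verify that no Y / YR rows survived the filter.
remaining = set(clean["flagged"].unique())
```

Filtering with an explicit allow-list (`isin(["N", "NR"])`) rather than excluding Y/YR also drops any unexpected or missing flag values, which seems the safer default for this kind of cleanup.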



This data set will be the basis for all further evaluation. To recap, we filtered out any reviews that were for venues not categorized explicitly as a hotel and removed any review flagged as fraudulent. The remaining dataset has all the fields from the data source review table and joined to each review is data from the hotel table.

Next, we will clean the text from reviews using a series of regular expressions to remove non-letter characters and convert to lowercase. This will make it easier for us to perform text analysis. We will also subset our reviews data frame into two smaller data frames for good reviews (ratings of 4/5) and bad reviews (ratings 1/2). For the purposes of our analysis, we consider reviews with a rating of 3 to be neutral.
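A minimal sketch of this cleaning and labeling step, using only the standard library, is shown here. The function names `clean_text` and `label_rating` are hypothetical, and the exact regular expressions are an assumption; the notebook's own patterns may differ.

```python
import re

def clean_text(text):
    """Lowercase the text, replace non-letter characters with spaces,
    and collapse runs of whitespace into a single space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def label_rating(rating):
    """Map a star rating to good (4/5), bad (1/2), or neutral (3)."""
    if rating >= 4:
        return "good"
    if rating <= 2:
        return "bad"
    return "neutral"

cleaned = clean_text("Great stay!! Room #204 was CLEAN.")
labels = [label_rating(r) for r in [5, 3, 1]]
```

One side effect worth knowing about: replacing punctuation with spaces splits contractions, so "didn't" becomes the two tokens "didn" and "t". Whether that matters depends on the stopword list used later.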

Segment by good/bad hotels

We experimented with different methods of categorizing the reviews as good or bad in order to perform textual analysis. At first, we simply divided the reviews into two pools--any review that was 4/5 stars was good and any review that was 1/2 stars was bad. While this method was somewhat effective, we found that it failed to account for the fact that good (bad) reviews at very good (bad) hotels have little in common with good (bad) reviews at bad (good) hotels.

So we took a two-tier approach to dividing our reviews into good and bad pools for further analysis. First, we calculated the average rating for each of the hotels in our dataset. This is different from the average rating in the original data source hotels table because those averages included reviews that we removed (those flagged as fraudulent, for example). Then we divided the hotels into two groups based on our calculated average, using 3 as the dividing line.

The second step took the best reviews of the above-average hotels (those with 4/5 ratings) and the worst reviews of the below-average hotels (those with 2/3 ratings). We created two data frames, as you can see below.
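The two-tier split can be sketched in pandas roughly as follows. The column names, the toy data, and the helper column `hotel_avg` are assumptions. The rating lists mirror the values stated in the text and are kept as variables so they are easy to change; how a hotel averaging exactly 3 should be treated is not stated, so this sketch leaves such hotels out of both pools.

```python
import pandas as pd

# Toy filtered review table; column names are assumptions.
reviews = pd.DataFrame({
    "hotelID": ["h1", "h1", "h1", "h2", "h2", "h2"],
    "rating":  [5, 4, 2, 1, 2, 5],
})

# Step 1: average rating per hotel, computed from the filtered reviews only,
# and broadcast back onto every review row.
reviews["hotel_avg"] = reviews.groupby("hotelID")["rating"].transform("mean")

# Step 2: best reviews of above-average hotels, worst reviews of below-average ones.
good_ratings = [4, 5]
bad_ratings = [2, 3]  # values as stated in the text; adjust if a different cutoff is intended

best = reviews[(reviews["hotel_avg"] > 3) & reviews["rating"].isin(good_ratings)]
worst = reviews[(reviews["hotel_avg"] < 3) & reviews["rating"].isin(bad_ratings)]
```

Using `transform("mean")` keeps the result aligned with the original rows, which avoids a second merge just to attach the per-hotel average.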

Word Clouds

It can be helpful to visualize the corpus in a word cloud, where the size of each word is determined by how frequently it occurs. Here are word clouds for both the best reviews and the worst reviews.
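Since word size is driven by frequency, the quantity behind a word cloud is just a word count over the cleaned reviews. The sketch below computes it with the standard library; the function name `word_frequencies`, the tiny stopword set, and the sample sentences are all hypothetical.

```python
from collections import Counter

# Deliberately tiny stopword set for illustration; a real analysis would use a fuller list.
STOPWORDS = {"the", "a", "and", "was", "were", "is", "to", "of", "in", "it"}

def word_frequencies(docs, stopwords=STOPWORDS):
    """Count how often each non-stopword occurs across already-cleaned documents."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.split() if w not in stopwords)
    return counts

docs = [
    "the room was clean and the staff was friendly",
    "clean room great location",
]
freqs = word_frequencies(docs)
```

A frequency mapping like `freqs` is the kind of input a word cloud renderer consumes; for example, the `wordcloud` package's `WordCloud.generate_from_frequencies` accepts such a dictionary, though the text does not say which library was used here.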

Good Reviews