OPIM 609 Final Assignment

October 10th, 2022

SAXA X

Daniel Behrns, Adriana Cailao, Matthew Nicoletta, Adrian Ott, Jeannie Sprangle

Contents
  1. Preparation
    1. Packages
    2. Functions
    3. Load Data
  2. Explore
    1. Hotels
    2. Reviews
  3. Clean
    1. Split dataframes
  4. Word Clouds
    1. Good reviews
    2. Bad reviews
  5. Topic Modeling
    1. Good reviews
    2. Bad reviews
  6. N-grams
    1. Good reviews
    2. Bad reviews

Preparation

Packages

Functions

Load Data
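For illustration, the two source tables can be read into pandas data frames along these lines (the file names below are placeholders; substitute the actual paths to the source data):

```python
import pandas as pd

# Placeholder file names -- substitute the actual paths to the source tables
hotels = pd.read_csv("hotels.csv")    # one row per venue: hotelID, name, categories, ...
reviews = pd.read_csv("reviews.csv")  # one row per review: hotelID, rating, review text, flag, ...

print(hotels.shape)
print(reviews.shape)
```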



Explore

Hotels

Let's look at the hotel table. We can see that there are 13 columns and over 283K unique venues. Looking at the categories, however, most of these locations are not actually categorized as hotels. Let's see how many are left if we limit the venues to those that actually are hotels.
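For illustration, that check can be sketched in pandas, assuming the table is loaded as `hotels` with a free-text `categories` column (the exact category label used here is an assumption):

```python
# Count unique venues in the hotels table
print(hotels["hotelID"].nunique())

# Count venues whose category string actually mentions "Hotels"
is_hotel = hotels["categories"].fillna("").str.contains("Hotels", case=False)
print(is_hotel.sum())
```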

Reviews

The following code uses a SQL-style merge/join to add hotel name and category information to the reviews dataframe, joining on the hotelID field. We can then use the categories field to filter out non-hotel reviews.
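Sketched in pandas, the join and filter might look like the following (the `hotelID` and `categories` fields come from the description above; other column names are assumptions):

```python
# SQL-style left join: attach hotel name and categories to each review via hotelID
reviews_merged = reviews.merge(
    hotels[["hotelID", "name", "categories"]],
    on="hotelID",
    how="left",
)

# Keep only reviews whose venue is explicitly categorized as a hotel
hotel_reviews = reviews_merged[
    reviews_merged["categories"].fillna("").str.contains("Hotels", case=False)
]
```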

Now that we have filtered the review dataset down to approximately 32K reviews of venues categorized as hotels, let's get a sense of how many of these reviews have been flagged as fraudulent and/or as violating the Yelp terms of service.

We will remove the ~28% of reviews flagged as fraudulent (Y/YR) and verify that only N/NR reviews remain.
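Assuming the flag is stored in a column named `flagged` (an assumed name) with the Y/YR and N/NR values noted above, the filter and verification can be sketched as:

```python
# Share of reviews carrying each flag value (Y/YR = fraudulent, N/NR = not)
print(hotel_reviews["flagged"].value_counts(normalize=True))

# Drop flagged reviews and verify that only N/NR remain
clean_reviews = hotel_reviews[hotel_reviews["flagged"].isin(["N", "NR"])].copy()
assert set(clean_reviews["flagged"].unique()) <= {"N", "NR"}
```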



Clean

This dataset will be the basis for all further evaluation. To recap, we filtered out reviews of venues not explicitly categorized as a hotel and removed any review flagged as fraudulent. The remaining dataset retains all the fields from the source review table, and each review is joined to the corresponding data from the hotel table.

Next, we will clean the text from reviews using a series of regular expressions to remove non-letter characters and convert to lowercase. This will make it easier for us to perform text analysis. We will also subset our reviews data frame into two smaller data frames for good reviews (ratings of 4/5) and bad reviews (ratings 1/2). For the purposes of our analysis, we consider reviews with a rating of 3 to be neutral.
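For illustration, the cleaning and the initial split can be sketched as follows, assuming the review text and star rating live in columns named `reviewContent` and `rating` (assumed names):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip non-letter characters, and collapse extra whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()

clean_reviews["clean_text"] = clean_reviews["reviewContent"].apply(clean_text)

# Initial split: 4/5 stars = good, 1/2 stars = bad; 3-star reviews are treated as neutral
good_reviews = clean_reviews[clean_reviews["rating"] >= 4]
bad_reviews = clean_reviews[clean_reviews["rating"] <= 2]
```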

Segment by good/bad hotels

We experimented with different methods of categorizing the reviews as good or bad in order to perform textual analysis. At first, we simply divided the reviews into two pools: any review of 4/5 stars was good and any review of 1/2 stars was bad. While this method was somewhat effective, we found that it failed to account for the fact that a good review at a highly rated hotel has little in common with a good review at a poorly rated hotel, and likewise for bad reviews.

So we took a two-tier approach to dividing our reviews into good and bad pools for further analysis. First, we calculated the average rating for all of the hotels in our dataset. This differs from the average rating in the original hotels table because those averages included reviews that we removed (those flagged as fraudulent, for example). Then we divided the hotels into two groups based on our calculated average, using 3 as the dividing line.

The second step took the best reviews of the above-average hotels (those with 4/5 ratings) and the worst reviews of the below-average hotels (those with 2/3 ratings). We created two data frames, as sketched below.
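A sketch of this two-tier segmentation, reusing the cleaned review set and the assumed `rating` column:

```python
# Recalculate each hotel's average rating from the retained (non-flagged) reviews
hotel_avg = clean_reviews.groupby("hotelID")["rating"].mean()
above_avg_hotels = hotel_avg[hotel_avg > 3].index
below_avg_hotels = hotel_avg[hotel_avg <= 3].index

# Best reviews (4/5 stars) of above-average hotels
best_reviews = clean_reviews[
    clean_reviews["hotelID"].isin(above_avg_hotels) & clean_reviews["rating"].isin([4, 5])
]

# Worst reviews (2/3 stars, as described above) of below-average hotels
worst_reviews = clean_reviews[
    clean_reviews["hotelID"].isin(below_avg_hotels) & clean_reviews["rating"].isin([2, 3])
]
```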

Word Clouds

It can be helpful to visualize the corpus in a word cloud, where the size of each word is determined by the frequency with which it occurs. Here are word clouds for both the best reviews and the worst reviews.
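One way to produce such a cloud, shown here for the good-review pool using the Python wordcloud package (the same call applies to `worst_reviews`):

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Concatenate all cleaned review text from the good-review pool into one corpus string
good_corpus = " ".join(best_reviews["clean_text"])

wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=STOPWORDS, max_words=100).generate(good_corpus)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```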

Good Reviews

Bad Reviews

As we can see from the word clouds, there is quite a bit of overlap in the most frequently used words, regardless of whether the review was good or bad. For example, hotel and room appear to be the most frequently occurring terms. Place, stay, nice, and other terms also occur frequently. While this is a great starting point, there is little here to help us understand the context or sentiment associated with these terms.


Topic Modeling

We used the LDA algorithm to perform topic modeling on our good/bad review data frames. Although we experimented with the number of topics to model for each, we found that 4 topics seemed appropriate in both cases. We want to see little overlap in the topics, which we have here. Although higher numbers of topics can also produce plots with little overlap, the additional topics become difficult to interpret. We offer a brief interpretation of the topics for good/bad reviews below.
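As a sketch of the modeling step, the 4-topic LDA fit for the good-review pool can be expressed with scikit-learn along these lines (preprocessing choices here are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Build a document-term matrix over the good-review pool
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
dtm = vectorizer.fit_transform(best_reviews["clean_text"])

# Fit an LDA model with 4 topics, matching the topic count chosen above
lda = LatentDirichletAllocation(n_components=4, random_state=42)
lda.fit(dtm)

# Show the ten highest-weighted words for each topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:10]]
    print(f"Topic {i + 1}: {', '.join(top)}")
```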

Good Reviews

Bad Reviews

N-grams

With our topic modeling providing a general overview of the most important hotel features discussed in these reviews, we took a deeper dive by looking at the most frequent unigrams, bigrams, and trigrams in each review pool. The tables below show the top 20 of each, with their frequencies.
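A sketch of the n-gram counts for one pool (the same helper applies to the bad-review pool), using a scikit-learn CountVectorizer over the cleaned text:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n, k=20):
    """Return the k most frequent n-grams as (term, count) pairs across a collection of texts."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1
    terms = vec.get_feature_names_out()
    return sorted(zip(terms, counts), key=lambda t: t[1], reverse=True)[:k]

# Top unigrams, bigrams, and trigrams in the good-review pool
for n, label in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
    print(label, top_ngrams(best_reviews["clean_text"], n)[:5])
```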

Good Reviews

Bad Reviews

We can see that some of the features most associated with good rooms are cleanliness and the presence of a flat screen tv, free wifi, and a king size bed. In terms of location, walking distance (presumably to nearby attractions) seemed to be a key feature. Helpful, friendly staff were frequently mentioned as well. For food, customers seemed to appreciate room service and a free breakfast.

In the bad reviews, we see terms indicating that rooms were not clean, had unpleasant odors, or were run-down and dated. Free continental breakfast and room service are again mentioned, emphasizing that these ancillary food-related service experiences are important to customers. Even in bad reviews, interactions with friendly staff are important, but in contrast with the good reviews, friendliness only gets you so far: helpful staff seem to significantly improve the guest experience.
