
Sentiment Analysis for Hotel Reviews
Personal Project
Project Type: Python, NLP, Hugging Face, Machine Learning
Date: December 2024
GitHub Repo:
Project Background
The hospitality industry depends heavily on customer feedback to improve services and boost bookings. However, with hundreds of thousands of reviews spread across platforms, reading them manually to gauge guest sentiment is slow and inefficient. This project applies Natural Language Processing (NLP) to automate sentiment classification for a dataset of over 500,000 reviews of hotels across Europe.
Executive Summary
Using both NLTK’s VADER and Hugging Face’s DistilBERT, I conducted a comparative sentiment analysis on hotel reviews to evaluate which method better captured guest sentiment and provided more actionable business insights.
- DistilBERT outperformed VADER, reaching about 89% accuracy against roughly 79% for VADER.
- Top and bottom performing hotels were identified based on review sentiment.
- Word frequency and sentiment distribution visualizations were generated to surface recurring praise and concerns.
Deep Dive
Preprocessing:
- Removed placeholder reviews from the dataset (e.g., “No Positive”).
- Sampled ~10% of the dataset (~51K rows) to keep runtime manageable.
- Cleaned the sampled reviews and prepared labels for sentiment classification.
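A minimal sketch of how these preprocessing steps could look in plain Python. The column names (Hotel_Name, Positive_Review, Negative_Review), the "No Negative" placeholder, the labeling rule (positive column = 1, negative column = 0), and the helper names are all assumptions for illustration, not necessarily what the project code does.

```python
import random

# "No Positive" comes from the write-up; "No Negative" is assumed by analogy.
PLACEHOLDERS = {"no positive", "no negative"}


def to_labeled_examples(rows):
    """Turn each review row into up to two labeled examples.

    Assumed labeling rule: text from Positive_Review gets label 1,
    text from Negative_Review gets label 0. Placeholder and empty
    texts are dropped.
    """
    examples = []
    for row in rows:
        for column, label in (("Positive_Review", 1), ("Negative_Review", 0)):
            text = (row.get(column) or "").strip()
            if text and text.lower() not in PLACEHOLDERS:
                examples.append(
                    {"hotel": row.get("Hotel_Name"), "text": text, "label": label}
                )
    return examples


def sample_fraction(items, frac=0.10, seed=42):
    """Return a reproducible random sample of roughly `frac` of the items."""
    if not items:
        return []
    k = max(1, round(len(items) * frac))
    return random.Random(seed).sample(items, k)
```

A fixed seed keeps the sample reproducible, so both models can be evaluated on exactly the same reviews.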


NLTK (VADER):
- Used rule-based sentiment scoring.
- Fast and easy to apply, but limited in understanding sarcasm or complex phrasing.
- Lower accuracy (~79%), especially on ambiguous phrases.
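For illustration, rule-based scoring with VADER might look like the sketch below. The ±0.05 cutoff on the compound score is the commonly used convention and is an assumption here, as are the function names; the write-up does not state which thresholds were applied.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The +/-0.05 cutoff is a common convention, assumed for this sketch.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_labels(texts):
    """Score texts with NLTK's VADER and return one label per text.

    NLTK is imported lazily so the helper above works without it.
    """
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()
    return [
        label_from_compound(analyzer.polarity_scores(text)["compound"])
        for text in texts
    ]
```

Because VADER scores words from a fixed lexicon, a sarcastic review with positive wording can still receive a high compound score, which is consistent with the weaker results on ambiguous phrases.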


DistilBERT:
- Used the pretrained distilbert-base-uncased-finetuned-sst-2-english checkpoint from Hugging Face.
- Captured contextual sentiment with higher accuracy.
- Enabled ranking of hotels based on review sentiment scores.
- Reached ~88% accuracy.
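A rough sketch of how the checkpoint could be applied through the Hugging Face pipeline API, and how an accuracy figure for either model could be computed. The helper names, the batch size, and the choice to convert each prediction into a probability of the positive class are assumptions for illustration.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def positive_probability(result):
    """Convert a pipeline result such as {"label": "NEGATIVE", "score": 0.98}
    into the probability of the positive class."""
    score = float(result["score"])
    return score if result["label"].upper() == "POSITIVE" else 1.0 - score


def accuracy(predicted, actual):
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def classify_reviews(texts, batch_size=32):
    """Run the sentiment pipeline and return P(positive) for each text.

    transformers is imported lazily; the model is downloaded on first use.
    """
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=MODEL_NAME)
    # truncation=True cuts reviews longer than the model's input limit
    results = classifier(list(texts), batch_size=batch_size, truncation=True)
    return [positive_probability(result) for result in results]
```

Thresholding the returned probabilities at 0.5 gives binary predictions that can be passed to accuracy() together with the reference labels.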


Insights Generated
5 Best and Worst Hotels Based on Sentiment Scores
To better understand how sentiment analysis translates to real-world insights, I identified the 5 best-reviewed and 5 worst-reviewed hotels based on the average predicted sentiment score from the DistilBERT model.
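The ranking step can be sketched as a per-hotel average over (hotel, score) pairs. The function name and the min_reviews filter are hypothetical; a minimum review count is one way to keep a hotel with only one or two sampled reviews from dominating either list, and the write-up does not say whether such a filter was used.

```python
from collections import defaultdict


def rank_hotels(scored_reviews, n=5, min_reviews=1):
    """Return (best, worst) lists of (hotel, average_score) pairs.

    scored_reviews: iterable of (hotel_name, sentiment_score) pairs.
    Hotels with fewer than min_reviews reviews are skipped.
    If fewer than 2 * n hotels qualify, the two lists overlap.
    """
    totals = defaultdict(lambda: [0.0, 0])  # hotel -> [score sum, review count]
    for hotel, score in scored_reviews:
        totals[hotel][0] += score
        totals[hotel][1] += 1

    averages = [
        (hotel, total / count)
        for hotel, (total, count) in totals.items()
        if count >= min_reviews
    ]
    averages.sort(key=lambda pair: pair[1], reverse=True)

    best = averages[:n]
    worst = averages[-n:][::-1]  # lowest average first
    return best, worst
```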

Positive and Negative Review Word Clouds
The word clouds show the most common terms used across guest reviews, with words like "room," "staff," and "breakfast" appearing most frequently. These terms highlight what guests value most and can guide hotel managers in prioritizing service quality and cleanliness to boost satisfaction.
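The term frequencies behind a word cloud can be approximated with a simple counter. This is a sketch only: the stopword list, the tokenization rule, and the function name are assumptions, and a real analysis would likely use a fuller stopword list.

```python
import re
from collections import Counter

# Small illustrative stopword list, assumed for this sketch.
STOPWORDS = {
    "the", "a", "an", "and", "was", "is", "were", "of", "to", "in", "it",
    "very", "for", "with", "we", "i", "but", "on", "at", "our", "not",
    "this", "that", "too",
}


def top_terms(reviews, n=10, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword terms as (term, count) pairs."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords and len(t) > 2)
    return counts.most_common(n)
```

Running this separately on positive and negative reviews gives one frequency table per cloud, which makes it easier to see whether a term like "room" shows up mostly as praise or as a complaint.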

Citation
Liu, J. (n.d.). 515K hotel reviews data in Europe [Dataset]. Kaggle. https://www.kaggle.com/datasets/jiashenliu/515k-hotel-reviews-data-in-europe
Thank you! Please message any suggestions.