import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
Unsupervised Learning
Brief Introduction
In this section we will employ dimensionality reduction techniques and clustering methods. These will include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), K-Means, DBSCAN, and hierarchical clustering. The objective is to explore the different methods, investigate the dataset further, and understand any consistent patterns that may help answer our guiding Data Science questions.
Overview of Methods
- Principal Component Analysis (PCA) is used to reduce the dimensionality of a dataset. It uses statistical techniques to find the combinations of features that capture the most variance in the dataset. PCA yields the projection of the variables onto the first two principal components (their span), which represent the directions of maximum variance. This can then be represented visually to understand the linear relationships within the data.
  - n_components: number of components to keep
- t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used for visualizing high-dimensional data. It is often used to visualize the results of clustering algorithms. The plot of the t-SNE results will show points with similar profiles or attributes close together, and these points are then grouped into clusters. It is used to see whether clusters exist and how defined and separated they are within the data.
  - n_components: number of components to keep
  - perplexity: manages the balance between the global and local structure
- K-Means Clustering minimizes the within-cluster variance in order to partition the data into k clusters. This is done by finding the centroid closest to each point, assigning the point to that centroid, calculating the new centroids, and repeating until the centroids stop shifting. It helps to find patterns within unlabeled data.
  - n_clusters: number of clusters to form
- Density-based spatial clustering (DBSCAN) is a density-based clustering method that groups together points that have many nearby neighboring points. It is useful for finding clusters within data that are non-linear in shape.
  - eps: maximum distance allowed between two points for them to be considered neighbors
  - min_samples: number of samples required to form a cluster
- Hierarchical clustering is a clustering technique that creates nested groups of data (i.e., a hierarchical structure). It does this iteratively by breaking the dataset into large groups and continuously breaking those groups into smaller ones, creating a branching structure.
  - inputs: dataset and linkage method
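For reference, the sketch below shows how each of these methods is typically instantiated in scikit-learn and SciPy. The hyperparameter values here are placeholders for illustration only; the values actually used in our analysis are chosen later in this section.

# illustrative instantiation of each method (placeholder hyperparameter values)
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, fcluster

pca = PCA(n_components=2)                   # keep the first two principal components
tsne = TSNE(n_components=2, perplexity=30)  # 2-D embedding; perplexity balances local vs. global structure
kmeans = KMeans(n_clusters=5)               # partition the data into k = 5 clusters
dbscan = DBSCAN(eps=0.5, min_samples=5)     # neighborhood radius and minimum neighborhood size
# hierarchical clustering: build a linkage matrix, then cut it into a chosen number of clusters
# linkage_matrix = linkage(X, method='ward')                  # X is a numeric feature matrix
# labels = fcluster(linkage_matrix, 4, criterion='maxclust')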
Code
# reading in the data
yelp_zip = pd.read_csv("../data-cleaning/clean_yelp_zip.csv")
cols_to_drop = "Unnamed: 0"
yelp_zip = yelp_zip.drop(columns=cols_to_drop) # dropping the unnecessary columns
print(yelp_zip.head(5)) # checking the variables
Median_Income zip_code name cuisine \
0 133211 20001.0 Compass Coffee Coffee & Tea
1 133211 20001.0 Sankofa Video Books & Cafe Bookstores
2 133211 20001.0 La Colombe Coffee Coffee & Tea
3 133211 20001.0 Sylvan Cafe & Bakery Bakeries
4 133211 20001.0 Urban Roast Coffee & Tea
price_range rating review_count latitude longitude price cuisine_cat
0 2 4.1 92.0 38.916256 -77.022773 20.5 cafes
1 1 4.5 167.0 38.925561 -77.023150 5.0 cafes
2 2 4.0 303.0 38.901051 -77.020103 20.5 cafes
3 1 4.0 134.0 38.915393 -77.012592 5.0 cafes
4 2 3.4 921.0 38.898005 -77.024916 20.5 cafes
Dimensionality Reduction
PCA
# Principal Component Analysis (PCA)
# pre-processing to use data in PCA
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
pca = PCA(n_components=8) # initiating a pca
yelp_pca = pca.fit_transform(yelp_scaled) # reducing dimensionality
# explained variance to determine optimal number of features:
def plot_variance_explained(pca):
    """helper function to plot explained variance from lab 4.2"""
    # visualize variance explained by PCA components
    print(pca.explained_variance_ratio_[0:10])
    print(np.cumsum(pca.explained_variance_ratio_[0:10]))
    # plot explained variance ratio
    fig, ax = plt.subplots()
    ax.plot(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_, marker='o')
    ax.set(xlabel="number of components", ylabel="explained variance ratio")
    plt.show()
    # plot cumulative sum of explained variance
    fig, ax = plt.subplots()
    ax.plot(range(1, len(pca.explained_variance_ratio_)+1), np.cumsum(pca.explained_variance_ratio_), marker='o')
    ax.set(xlabel="number of components", ylabel="cumulative explained variance")
    plt.show()

plot_variance_explained(pca)
[0.26763243 0.18944621 0.15147403 0.13711514 0.11630842 0.08958895
0.04648898 0.00194583]
[0.26763243 0.45707864 0.60855267 0.74566781 0.86197623 0.95156518
0.99805417 1. ]
Here we have plotted the explained variance versus the number of components and the cumulative explained variance versus the number of components. Together, these plots help us choose the number of components to use when initializing a PCA object. Generally, we look for an “elbow point” in the first plot, where the explained variance begins to taper off; however, this plot lacks a true elbow point, so we will rely mostly on the cumulative explained variance plot for our decision. Based on the first plot, we are looking somewhere in the 4-7 component range, and we then must decide how to balance the total variance retained against model complexity. We choose 6 components to retain about 95% of the variance and proceed with the principal component analysis using this number.
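This cut-off can also be checked programmatically. Below is a minimal sketch (reusing the 8-component pca object fitted above) that finds the smallest number of components whose cumulative explained variance reaches 95%; it should agree with the printed cumulative values above, which reach about 95% at 6 components.

# smallest number of components retaining at least 95% of the variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative_variance >= 0.95) + 1)
print(f"components needed for 95% variance: {n_components_95}")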
def plot_2D(X, color_vector):
    """Utility plotting function taken from lab 4.2"""
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=color_vector, alpha=0.5)
    ax.set(xlabel='PC-1', ylabel='PC-2',
           title='Principal Component Analysis results')
    ax.grid()
    plt.show()

pca = PCA(n_components=6) # initiating a pca
yelp_pca = pca.fit_transform(yelp_scaled) # reducing dimensionality
cuisine_colors = yelp_zip['cuisine_cat'].astype('category').cat.codes # converting cuisine to categorical codes
income_colors = yelp_zip['Median_Income'].astype('category').cat.codes
rating_colors = yelp_zip['rating'].astype('category').cat.codes
review_colors = yelp_zip['review_count'].astype('category').cat.codes
zip_colors = yelp_zip['zip_code'].astype('category').cat.codes
price_colors = yelp_zip['price_range'].astype('category').cat.codes
# plotting the PCA color coded by the Income values
plot_2D(yelp_pca, income_colors)
In the above plot, we show the PCA results color-coded by median income level. We see potential clusters, like the light green at the bottom of the plot and the large cluster at (0, 1). However, there is a large amount of overlap between the clusters and no clear distribution of the points, indicating that the influence of median income on the patterns captured by the PCA is likely not meaningful. There is no clear, strong correlation between the features of the restaurant (median income of the area, zip code, price range, yelp rating, review count, latitude, and longitude) and the income of the surrounding area.
# plotting the PCA color coded by the Yelp ratings
plot_2D(yelp_pca, rating_colors)
We then visualized the same PCA with the yelp ratings as the color-coding. Again, there is a large amount of overlap and almost no clusters that are separate from one another, indicating that the influence of yelp rating on the patterns captured by the PCA is likely not meaningful. Again, there is no clear correlation between the features of the restaurant and the yelp rating that the restaurant receives.
# plotting the PCA color coded by the number of reviews
plot_2D(yelp_pca, review_colors)
Next, we visualize the PCA with the number of reviews as the color-coding. There is somewhat of a striping pattern here, where the colors seem to aggregate vertically. However, these stripes appear at each of the x-coordinates of the plot, making the pattern less clear. There is still a decent amount of overlap, so we cannot conclude either way whether there is a correlation between the features of the restaurant and the number of reviews it receives.
# plotting the PCA color coded by the cusine categories
plot_2D(yelp_pca, cuisine_colors)
Finally, we color-code the PCA results with the cuisine categories. There is almost complete overlap between clusters and no clear distribution of the points, indicating that the influence of cuisine type on the patterns captured by the PCA is likely not meaningful. We can conclude from the plot that there is no strong correlation between the features of the restaurant and the type of cuisine it serves.
t-SNE
# T-SNE
# with cuisine
def plot_2D(X, color_vector):
    """Utility plotting function taken from lab 4.2, reusing for t-SNE"""
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=color_vector, alpha=0.5)
    ax.set(xlabel='t-SNE Dimension 1', ylabel='t-SNE Dimension 2',
           title='t-SNE Analysis results')
    ax.grid()
    plt.show()

tsne = TSNE(n_components=2) # initiating tsne
yelp_tsne = tsne.fit_transform(yelp_scaled)
plot_2D(yelp_tsne, cuisine_colors)
The above plot shows the results of the t-SNE process, which, as described in the introduction, attempts to visualize high-dimensional data in a low-dimensional space. The plot shows that t-SNE produced no distinct clusters. Here the colors are coded by the cuisine types, and the points are dispersed throughout the plot in no clear pattern. There is a lot of overlap between points, which indicates that the cuisine types are not clearly distinct from each other in terms of the restaurant features that make them up.
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'rating']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables

tsne = TSNE(n_components=2) # initiating tsne
yelp_tsne = tsne.fit_transform(yelp_scaled)
plot_2D(yelp_tsne, rating_colors)
This t-SNE plot is utilizing the Yelp ratings for color coding. The points are distributed throughout the plot in no clear patterns. Again, there is a large amount of overlap between points, which indicates that the Yelp ratings are distributed across the restaurant features pretty randomly.
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'review_count']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables

tsne = TSNE(n_components=2) # initiating tsne
yelp_tsne = tsne.fit_transform(yelp_scaled)
plot_2D(yelp_tsne, review_colors)
We attempted another t-SNE plot with the review counts as the color coding. Again, we see no clear clustering pattern among the points and there is a lot of overlap between the colors. The review count also does not appear to be strongly related to the features of the restaurant.
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'price_range', 'price']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables

tsne = TSNE(n_components=2) # initiating tsne
yelp_tsne = tsne.fit_transform(yelp_scaled)
plot_2D(yelp_tsne, price_colors)
Next, we repeated the t-SNE plot with the price ranges as the color coding. Again, we see no clear clustering pattern among the points and there is a lot of overlap between the colors. The price range of a restaurant also does not appear to be strongly related to the other features.
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'Median_Income']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables

tsne = TSNE(n_components=2) # initiating tsne
yelp_tsne = tsne.fit_transform(yelp_scaled)
plot_2D(yelp_tsne, income_colors)
In our last t-SNE attempt, we plotted the t-SNE results colored by the median income of the zip code where the restaurant is located. This led to somewhat more interesting results, as there is less overlap and some potential loose clustering occurring. There is a teal cluster around (-10, 20) that may be a distinct cluster, though it has some outliers. There is also a light green cluster situated to its right that again shows a pattern, but with outliers. The rest of the clusters are mostly overlapping and dispersed throughout the plot. This leads us to believe that there may be some relation between the median income and the features of the restaurant, but only in specific cases and not across the board for all restaurants and income brackets.
We then experimented with the perplexity values for the t-SNE results, using perplexities of 5, 35, and 50. The perplexity manages the balance between the global and local structure within the data: lower values of perplexity focus more on the local structure, while higher values focus more on the global structure. These are visualized with color coding for Yelp rating and median income. These two were chosen because Yelp rating is most aligned with our research question of understanding what makes up a Yelp rating, and because median income showed the most preliminary clustering in the plots above.
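Since each perplexity experiment below repeats the same drop, scale, and embed steps, they could equally be wrapped in a small helper. The sketch below is a hypothetical refactoring of that pattern (tsne_by_perplexity is our own illustrative name, not code that was run for the plots that follow); it reuses the plot_2D helper and color vectors defined above.

# hypothetical helper to rerun t-SNE over several perplexity values
def tsne_by_perplexity(data, drop_cols, color_vector, perplexities=(5, 35, 50)):
    numeric = data.drop(columns=drop_cols)            # keep only numeric variables
    scaled = StandardScaler().fit_transform(numeric)  # standardize before embedding
    for perp in perplexities:
        embedding = TSNE(n_components=2, perplexity=perp).fit_transform(scaled)
        plot_2D(embedding, color_vector)              # reuse the plotting helper defined above

# example call mirroring the median-income plots below:
# tsne_by_perplexity(yelp_zip, ['name', 'cuisine', 'cuisine_cat', 'Median_Income'], income_colors)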
# experimenting with different perplexity values
# perplexity = 5
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'Median_Income']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_5 = TSNE(n_components=2, perplexity=5)
yelp_tsne5 = tsne_5.fit_transform(yelp_scaled)
plot_2D(yelp_tsne5, income_colors)

numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'rating']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_5 = TSNE(n_components=2, perplexity=5)
yelp_tsne5 = tsne_5.fit_transform(yelp_scaled)
plot_2D(yelp_tsne5, rating_colors)
In the two plots above, we have plotted the t-SNE results utilizing a perplexity value of 5 (as opposed to the default value of 30). In the first plot, we color code with the median income values and the second plot is color coded with the yelp ratings. In both of the plots, we see that the points are distributed across the plots in no clear pattern. There is a large amount of overlap and no emerging clusters, thus we can say that there is no relation between the features of the restaurant and the median income and yelp rating when looking at the local structure of the dataset.
# perplexity = 35
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'Median_Income']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_35 = TSNE(n_components=2, perplexity=35)
yelp_tsne35 = tsne_35.fit_transform(yelp_scaled)
plot_2D(yelp_tsne35, income_colors)

numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'rating']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_35 = TSNE(n_components=2, perplexity=35)
yelp_tsne35 = tsne_35.fit_transform(yelp_scaled)
plot_2D(yelp_tsne35, rating_colors)
We then plotted the t-SNE results using a perplexity value of 35 (still relatively close to the default value of 30). Again, the plots are color coded by median income values and Yelp ratings, respectively. In the second plot, we again see that the points are distributed across the plot in no clear pattern and there are no emerging clusters, so we can say that there is no relation between the features of the restaurant and the Yelp rating when looking more at the global structure of the data. However, looking at the median income plot, we see the light green and teal clusters near (-10, 15), indicating there may be some emerging relation between the features of the restaurant and the income of the neighborhood.
# experimenting with different perplexity values
# perplexity = 50
numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'Median_Income']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_50 = TSNE(n_components=2, perplexity=50)
yelp_tsne50 = tsne_50.fit_transform(yelp_scaled)
plot_2D(yelp_tsne50, income_colors)

numeric_yelp = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat', 'rating']) # keeping only numeric variables
scaler = StandardScaler() # initializing a scaler
yelp_scaled = scaler.fit_transform(numeric_yelp) # standardizing the variables
tsne_50 = TSNE(n_components=2, perplexity=50)
yelp_tsne50 = tsne_50.fit_transform(yelp_scaled)
plot_2D(yelp_tsne50, rating_colors)
We finally plot the t-SNE results using a perplexity value of 50 (higher than 30 and 35, and therefore emphasizing the global structure of the data). Again, the plots are color coded by median income values and Yelp ratings, respectively. In the second plot, we again see that the points are distributed across the plot in no clear pattern and there are no emerging clusters, so we can say that there is no relation between the features of the restaurant and the Yelp rating even when looking at the global structure of the data. However, looking at the median income plot, we see the light green and teal clusters near (-10, 15) and also an emerging loose cluster of green near (0, 10). This could again indicate some emerging relation between the features of the restaurant and the income of the neighborhood, especially when we account for the global structure of the data and look at broader relationships between points.
Comparison
When we performed the principal component analysis and t-SNE, we overwhelmingly found little to no significant results. The PCA suggested that there was no correlation between the features of the restaurant and the median income level, the Yelp rating, the number of Yelp reviews, or the cuisine type. The t-SNE results backed up this finding by making clear that there was no relationship between the restaurant features and the cuisine type, the Yelp rating, the number of reviews, or the price range. t-SNE did, however, begin to show slight clustering in the relationship between the features of the restaurant and the median income level of the surrounding area. This clustering was extremely slight, with a lot of overlap and outliers. It became more apparent as the perplexity value increased; however, the pattern may be explained by the relationship between median income and zip code, and it was too slight to draw a true conclusion about the relationship. The small difference in the results is potentially due to bias from the zip code or to the difference in technique between PCA and t-SNE: t-SNE is a non-linear method, so it may have been able to find the small relationship between income and the features because it was more apparent in the global structure, which emerged as the perplexity value increased. Overall, we come to no true conclusions about the relationships between the features of the restaurants.
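As a quick sanity check on the zip-code explanation, the head() output above suggests that median income is assigned at the zip-code level, which we could verify directly. A minimal sketch on the raw, pre-scaled columns:

# if every zip code maps to a single Median_Income value, income is effectively a zip-level attribute
incomes_per_zip = yelp_zip.groupby('zip_code')['Median_Income'].nunique()
print(incomes_per_zip.value_counts())  # distribution of distinct income values per zip code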
Clustering Methods
# pre-processing steps
# scaling the numeric features
scaler = StandardScaler() # initializing a scaler
scaled_nums = scaler.fit_transform(yelp_zip[['Median_Income', 'zip_code', 'price_range', 'rating', 'review_count', 'latitude', 'longitude', 'price']])
yelp_zip[['Median_Income', 'zip_code', 'price_range', 'rating', 'review_count', 'latitude', 'longitude', 'price']] = scaled_nums

print(yelp_zip.head(5))

yelp_zip = yelp_zip.drop(columns=['name', 'cuisine', 'cuisine_cat'])
Median_Income zip_code name cuisine \
0 0.488229 -0.731905 Compass Coffee Coffee & Tea
1 0.488229 -0.731905 Sankofa Video Books & Cafe Bookstores
2 0.488229 -0.731905 La Colombe Coffee Coffee & Tea
3 0.488229 -0.731905 Sylvan Cafe & Bakery Bakeries
4 0.488229 -0.731905 Urban Roast Coffee & Tea
price_range rating review_count latitude longitude price \
0 -0.120364 0.023268 -0.653447 0.673732 0.156349 -0.208398
1 -2.054779 1.125458 -0.530862 1.299282 0.137330 -1.642612
2 -0.120364 -0.252279 -0.308575 -0.348487 0.291018 -0.208398
3 -2.054779 -0.252279 -0.584799 0.615718 0.669956 -1.642612
4 -0.120364 -1.905564 0.701521 -0.553228 0.048241 -0.208398
cuisine_cat
0 cafes
1 cafes
2 cafes
3 cafes
4 cafes
# checking for highly correlated features
sns.pairplot(yelp_zip)
There are no clear correlated features that need to be removed for the further analysis.
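As a complementary numeric check on the pairplot (a minimal sketch, not part of the original workflow), the pairwise correlations can also be summarized in a heatmap:

# numeric check of pairwise correlations between the scaled features
corr = yelp_zip.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation matrix")
plt.show()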
# checking the values are standardized
print(yelp_zip.head(10))
Median_Income zip_code price_range rating review_count latitude \
0 0.488229 -0.731905 -0.120364 0.023268 -0.653447 0.673732
1 0.488229 -0.731905 -2.054779 1.125458 -0.530862 1.299282
2 0.488229 -0.731905 -0.120364 -0.252279 -0.308575 -0.348487
3 0.488229 -0.731905 -2.054779 -0.252279 -0.584799 0.615718
4 0.488229 -0.731905 -0.120364 -1.905564 0.701521 -0.553228
5 0.488229 -0.731905 -0.120364 -2.181111 0.415491 0.433059
6 0.488229 -0.731905 -0.120364 0.574363 0.101674 0.015760
7 0.488229 -0.731905 -0.120364 0.849911 3.354251 -0.253399
8 0.488229 -0.731905 -0.120364 0.298816 -0.256273 -0.375423
9 0.488229 -0.731905 -0.120364 0.298816 -0.674695 0.283428
longitude price
0 0.156349 -0.208398
1 0.137330 -1.642612
2 0.291018 -0.208398
3 0.669956 -1.642612
4 0.048241 -0.208398
5 0.687199 -0.208398
6 0.047625 -0.208398
7 0.433704 -0.208398
8 -0.043970 -0.208398
9 -0.043504 -0.208398
K-Means
# K-Means
def kmean_fit(k): # kmean function
    """KMean function taken from lab 4.1"""
    global yelp_zip, model
    model = KMeans(n_clusters=k, max_iter=5000, tol=1e-6)
    model.fit(yelp_zip)
    return model.inertia_

ks = []
inertias = []
for k in range(1, 20):
    inertia = kmean_fit(k)
    ks.append(k)
    inertias.append(inertia)

plt.plot(ks, inertias, "-o")
plt.title('Inertia versus number of clusters (elbow method)')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
Silhouette Score
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] # different possible numbers of clusters
sil_scores = [] # initializing an empty list for silhouette scores

for n_clusters in range_n_clusters: # looping through the numbers of clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=10) # using KMeans with each cluster value
    cluster_labels = clusterer.fit_predict(yelp_zip) # getting the labels
    silhouette_avg = silhouette_score(yelp_zip, cluster_labels) # getting the silhouette score
    sil_scores.append(silhouette_avg)

plt.plot(range_n_clusters, sil_scores, "-o", color='orange') # plotting number of clusters against silhouette score
plt.title('Silhouette Scores versus number of clusters')
plt.xlabel('k')
plt.ylabel('silhouette score')
plt.show()
Above we have plotted the inertia versus the number of clusters and the silhouette score versus the number of clusters. Together, these plots help us choose the number of clusters to use when initializing K-Means. Generally, we look for an “elbow point” in the first plot, where the inertia begins to taper off; however, the plot lacks a true elbow point, so we will rely mostly on the silhouette score for the decision. Based on the first plot, we are looking somewhere in the 5-12 cluster range. We choose 7 clusters to obtain a relatively high silhouette score.
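The same choice can be read straight from the sil_scores list computed above; a small sketch that picks the best-scoring k (the maximizing k is only a guide, and we still weigh it against how many clusters we are willing to interpret before settling on 7):

# k with the highest silhouette score, as a starting point for choosing k
best_k = range_n_clusters[int(np.argmax(sil_scores))]
print(f"k with highest silhouette score: {best_k} (score {max(sil_scores):.3f})")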
# Initializing K Means with optimal k value
k = 7 # based on the silhouette scores
model = KMeans(n_clusters=k, max_iter=5000, tol=1e-6)
model.fit(yelp_zip)
labels = model.predict(yelp_zip)

# Scatter plot of the clusters using ratings and median income
plt.scatter(yelp_zip['rating'], yelp_zip['Median_Income'], c=labels, cmap='viridis')
plt.title('KMeans Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Median Income')
plt.show()
In the above plot we see the yelp ratings plotted against the median income with K-Means clustering. The plot shows no clear clusters, lots of overlap, and points that are dispersed across both axes. This indicates that there is no strong correlation between the yelp rating and the median income of the area.
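Because a single 2-D scatter shows only two of the eight features, it can also help to inspect the fitted cluster centers directly; a minimal sketch using the model fitted above (each row is one cluster's mean position in the standardized feature space):

# inspect the K-Means centroids in the standardized feature space
centers = pd.DataFrame(model.cluster_centers_, columns=yelp_zip.columns)
print(centers.round(2))  # one row per cluster, one column per (standardized) feature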
# Scatter plot of the clusters using ratings and price range
plt.scatter(yelp_zip['rating'], yelp_zip['price_range'], c=labels, cmap='viridis')
plt.title('KMeans Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Price Range')
plt.show()
In the above plot we see the yelp ratings plotted against the price ranges with K-Means clustering. The plot shows no clear clusters, lots of overlap, and points that are dispersed across both axes. This indicates that there is no strong correlation between the yelp rating and the price range of the restaurant.
# Scatter plot of the clusters using ratings and review count
plt.scatter(yelp_zip['rating'], yelp_zip['review_count'], c=labels, cmap='viridis')
plt.title('KMeans Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Number of Reviews')
plt.show()
In the above plot we see the Yelp rating plotted against the number of reviews with K-Means clustering. The plot shows only one potential cluster, the green points towards the top of the graph. This would point to restaurants with more reviews having higher ratings, but it is not a strong cluster. Other than that there are no clear clusters and the points are dispersed across both axes. This indicates there is small potential for some correlation between the Yelp rating and the number of reviews, but only in a small subset and not across the board.
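To probe this hedged impression numerically, we could compare the average (standardized) rating and review count across the K-Means clusters; a quick sketch using the labels from the fit above:

# average standardized rating and review count per K-Means cluster
cluster_summary = (yelp_zip.assign(kmeans_cluster=labels)
                   .groupby('kmeans_cluster')[['rating', 'review_count']]
                   .mean())
print(cluster_summary.round(2))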
# Scatter plot of the clusters using review count and price range
plt.scatter(yelp_zip['review_count'], yelp_zip['price_range'], c=labels, cmap='viridis')
plt.title('KMeans Clustering')
plt.xlabel('Number of Yelp Reviews')
plt.ylabel('Price Range')
plt.show()
In the above plot we see the price range plotted against the number of Yelp reviews with K-Means clustering. The plot shows no clear clusters and the points are very dispersed across both axes. This indicates that there is no strong correlation between the price range of the restaurant and the number of yelp reviews.
# Scatter plot of the clusters using price range and median income
plt.scatter(yelp_zip['price_range'], yelp_zip['Median_Income'], c=labels, cmap='viridis')
plt.title('KMeans Clustering')
plt.xlabel('Price Range')
plt.ylabel('Median Income')
plt.show()
Finally, we plotted the median income against the price range utilizing K-Means clustering. The plot again shows no clear clusters and the points are very dispersed across both axes. This indicates that there is no strong correlation between the price range of the restaurant and the median income.
# used CHATGPT for help with 3D plot
yelp_clustered = pd.DataFrame(yelp_zip, columns=['rating', 'price_range', 'review_count'])
yelp_clustered['cluster'] = labels

fig = px.scatter_3d(
    yelp_clustered,
    x='rating',           # X-axis
    y='price_range',      # Y-axis
    z='review_count',     # Z-axis
    color='cluster',      # Color points by cluster
    symbol='cluster',     # Optional: different symbols for each cluster
    hover_data=['rating', 'price_range', 'review_count'],  # Show details on hover
    title='Interactive 3D Clustering of Yelp Data'
)
fig.update_traces(marker=dict(size=5))  # Adjust marker size
fig.update_layout(scene=dict(
    xaxis_title='Rating',
    yaxis_title='Price Range',
    zaxis_title='Review Count'
))
fig.show()
(Interactive Plotly 3D figure; not rendered in this static output.)
In the above plot we use a 3D plot to visualize the review count, price range, and rating with K-Means clustering. The plot shows a potential cluster of purple x's towards the top of the plot. This again points to a potential relationship between the number of reviews and the Yelp rating of the restaurant for the two middle price ranges. Again, this lends itself to a loose conclusion that having more reviews has a slight positive effect on the Yelp rating. The other clusters are relatively weaker and overlap more.
DBSCAN
# DBSCAN
# modified from lab 4.1
range_eps = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # possible values of epsilon (hyperparameter)
range_samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # possible numbers of samples (hyperparameter)
sil_scores = [] # silhouette scores
max_score = -1 # initializing a max value
eps_and_sample = []

for epsilon in range_eps: # iterating through both of the hyperparameters
    for sample in range_samples:
        clustering = DBSCAN(eps=epsilon, min_samples=sample).fit(yelp_zip) # using DBSCAN
        cluster_labels = clustering.labels_ # getting the cluster values
        if len(set(cluster_labels)) > 1 and -1 in cluster_labels: # only score runs that produced clusters and noise
            silhouette_avg = silhouette_score(yelp_zip, cluster_labels) # getting the silhouette score
            sil_scores.append(silhouette_avg)
            if silhouette_avg > max_score: # if the silhouette score is greater than the current max value
                max_score = silhouette_avg # make this the new max
                eps_and_sample.append((epsilon, sample)) # save the hyperparameter values

print(max_score)
print(eps_and_sample[-1]) # get the hyperparameters for the max value
eps = 0.9 # from the results above
min_samples = 10
clustering = DBSCAN(eps=eps, min_samples=min_samples) # initialize DBSCAN with chosen hyperparameters
cluster_labels = clustering.fit_predict(yelp_zip)

plt.scatter(yelp_zip['rating'], yelp_zip['review_count'], c=cluster_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Number of Reviews')
plt.show()
0.20967415255007046
(1.0, 2)
In the above, we visualized the number of reviews against the Yelp ratings using DBSCAN clustering. The plot shows no clear clusters, as the points of each color are dispersed across the rating axis and most points are classified as noise. This indicates that there is no strong correlation between the number of reviews and the Yelp rating.
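The "mostly noise" reading can be quantified by counting the DBSCAN labels, since DBSCAN marks noise points with the label -1; a quick check using cluster_labels from the fit above:

# count the points assigned to each DBSCAN cluster (-1 = noise)
unique_labels, counts = np.unique(cluster_labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")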
# DBSCAN rating and income
plt.scatter(yelp_zip['rating'], yelp_zip['Median_Income'], c=cluster_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Median Income')
plt.show()
We then used the same DBSCAN technique to visualize the Yelp rating against the median income. We find that the plot shows no clear clusters, dispersed points, and mostly points that would be classified as noise. This indicates that there is no strong correlation between the Yelp rating and the surrounding area's median income.
# DBSCAN rating and price
plt.scatter(yelp_zip['rating'], yelp_zip['price_range'], c=cluster_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Yelp Rating')
plt.ylabel('Price Range')
plt.show()
Finally, we tried to visualize the Yelp rating and the price range of the restaurant. We find that the plot shows no clear clusters as the points of each color are dispersed across the yelp axis. This indicates that there is no strong correlation between the Yelp rating and the price range of the restaurant.
Hierarchical Clustering
# Hierarchical clustering
linkage_matrix = linkage(yelp_zip, method='ward') # creating linkage matrix

# plotting the hierarchical clustering (utilized CHATGPT for help here)
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix, labels=yelp_zip.index, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

num_clusters = 4 # from dendrogram inspection
clusters = fcluster(linkage_matrix, num_clusters, criterion='maxclust')
yelp_zip['Cluster'] = clusters
In the above plot, we visualized a dendrogram from hierarchical clustering on the Yelp data. The dendrogram splits the data into three color-coded clusters. The red and green clusters are the most similar because the link joining them has the smallest height; however, this link is still relatively high, at about 28 on the distance axis, indicating low similarity. The distance between the orange cluster and the combined green and red clusters is even higher, at about 35. This seems to indicate that there are no truly similar clusters and likely no natural clusters within the dataset.
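As a rough numeric check on this reading of the dendrogram, the four-cluster cut can be scored and its cluster sizes inspected; a sketch using the clusters assigned above (dropping the new Cluster column so the score is computed on the features only):

# size and silhouette score of the 4-cluster cut from the dendrogram
feature_cols = yelp_zip.drop(columns=['Cluster'])        # score on the original features only
print(pd.Series(clusters).value_counts().sort_index())   # points per hierarchical cluster
print("silhouette score:", round(silhouette_score(feature_cols, clusters), 3))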
Summary and Interpretation
In this section we employed dimensionality reduction through principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), as well as clustering through K-Means, DBSCAN, and hierarchical clustering. The objective was to explore the different methods, compare their effectiveness, and understand any consistent patterns within the Yelp dataset. By and large we found few significant results or patterns within the data.
Beginning with the PCA, we found no correlation between the different features of the restaurant and the median income level, the Yelp rating, the number of Yelp reviews, or the cuisine type of the restaurant. For t-SNE we found no relationship between the restaurant features and the cuisine type, the Yelp rating, the number of reviews, or the price range. There was a small relationship between the features of the restaurant and the median income level, which appeared more clearly with higher perplexity values; however, the clustering was too weak to draw a true conclusion about that relationship. Overall, both of these dimensionality reduction techniques yielded few significant findings.
We then moved to clustering methods, particularly K-Means. With this method, we found no strong correlation between Yelp rating and income, Yelp rating and price range, price range and number of reviews, or price range and median income level; there was no clear clustering. There was some correlation present between the number of reviews and the Yelp rating, which was more prevalent in the 3D model and became apparent for the mid-range pricing restaurants. When we used DBSCAN there was no correlation or clustering in any case: between Yelp rating and number of reviews, between Yelp rating and median income, and between Yelp rating and price range. In addition, the hierarchical clustering showed that there were no natural clusterings in the data.
Overall, the results were too sparse and inconclusive for us to find any significant and reappearing patterns within the dataset. This may lead us to find that restaurants are quite unique entities and yelp ratings are hard to group and predict.