Data Collection

Introduction

The first step in any project is to collect data. The idea for this project stemmed from restaurants: where to eat, what food is good, and which neighborhood has the best food are questions people ask daily. Many people turn to Yelp to decide where to eat. Here, we collect data from Yelp's API, the same information you would see in the app on your phone. These features will hopefully help us dig deeper into our story.

The first step is to create and load our API key from Yelp. For security reasons, this key is stored in a separate file and is not included here. To create your own API key, go to Yelp’s developer portal to get started.

# Import the json package to read the key file
import json
# Open the key file and load its contents
with open('/Users/rachnarawalpally/project-rachnarawalpally/technical-details/data-collection/api-key.json') as f:
    keys = json.load(f)
# Store the Yelp key under a name the rest of the code can use
API_KEY = keys['yelp']
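
The key file above is read from an absolute path on one machine. As a rough sketch of a more portable option, the helper below first looks for the key in an environment variable and only then falls back to a JSON file. The function name `load_yelp_key`, the variable name `YELP_API_KEY`, and the default file name are assumptions made for illustration; they are not part of the original project.

```python
import json
import os
from pathlib import Path


def load_yelp_key(key_file="api-key.json", env_var="YELP_API_KEY"):
    """Return the Yelp API key from an environment variable or a JSON file.

    Both default names are hypothetical; adjust them to your own setup.
    """
    # Prefer the environment variable so no path has to be hard-coded.
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to a JSON file shaped like {"yelp": "<your key>"}.
    path = Path(key_file)
    if not path.exists():
        raise FileNotFoundError(f"Set {env_var} or create {key_file}")
    with path.open() as f:
        return json.load(f)["yelp"]
```

With this in place, `API_KEY = load_yelp_key()` would work on any machine where the variable is set, and the key still never appears in the notebook itself.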

Method: Obtaining Data

The following code blocks outline the process of fetching data from the Yelp API and cleaning it into a readable CSV file.

The first block sets up the request to the Yelp API and retrieves the data. Some of this code is sourced directly from Yelp’s developer website; the getting-started guide for the API is at https://docs.developer.yelp.com/docs/fusion-intro.

The second block processes the received data by extracting only the relevant information needed for the CSV file. This includes gathering key details that are readily available on Yelp, such as restaurant name, rating, and location. The extracted data is then stored in a Pandas DataFrame.
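
To make the extraction step concrete, here is a small sketch that pulls the same fields out of a single business record. The helper name `extract_fields` is hypothetical, and the sample record is hand-built to mirror the shape of one entry in the API's `businesses` list, with values copied from the first row of this document's printed output. Unlike the project's function, it uses `.get` everywhere, so a record with missing fields does not raise an error.

```python
def extract_fields(business):
    """Pull the columns used in this project out of one business record."""
    categories = business.get("categories") or []
    coordinates = business.get("coordinates") or {}
    location = business.get("location") or {}
    return {
        "name": business.get("name", "N/A"),
        # Use the first listed category as the cuisine, if there is one.
        "cuisine": categories[0]["title"] if categories else "Unknown",
        "price_range": business.get("price", "N/A"),
        "rating": business.get("rating", "N/A"),
        "review_count": business.get("review_count", "N/A"),
        "neighborhoods": business.get("neighborhoods", "N/A"),
        "latitude": coordinates.get("latitude"),
        "longitude": coordinates.get("longitude"),
        "zip_code": location.get("zip_code", "N/A"),
    }


# Hand-built sample record (not a live API response).
sample = {
    "name": "Unconventional Diner",
    "categories": [{"title": "New American"}],
    "price": "$$",
    "rating": 4.4,
    "review_count": 2946,
    "coordinates": {"latitude": 38.906139, "longitude": -77.0238},
    "location": {"zip_code": "20001"},
}

row = extract_fields(sample)
print(row["cuisine"], row["zip_code"])  # New American 20001
```

A list of such rows can be passed straight to `pd.DataFrame(...)`, which is what the second block does.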

The final block converts the Pandas DataFrame into a CSV file. Since each API response contains at most 50 restaurants, each CSV file holds data for up to 50 Yelp-rated restaurants.
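
Because each response holds at most 50 results, getting more than 50 means repeating the request with a larger offset. The sketch below shows one way that loop might look. The helper `collect_pages` and the stand-in `fake_fetch` are hypothetical names introduced for illustration; `fake_fetch` only imitates the shape of a response so the loop can run without an API key.

```python
def collect_pages(fetch_page, max_results, limit=50):
    """Call fetch_page(offset=..., limit=...) repeatedly and gather businesses."""
    businesses = []
    offset = 0
    while offset < max_results:
        page = fetch_page(offset=offset, limit=limit).get("businesses", [])
        if not page:
            break  # No more results to fetch.
        businesses.extend(page)
        offset += limit  # Move to the next page of results.
    return businesses[:max_results]


def fake_fetch(offset, limit):
    """Stand-in for a real API call: pretends exactly 120 businesses exist."""
    ids = range(offset, min(offset + limit, 120))
    return {"businesses": [{"name": f"Business {i}"} for i in ids]}


results = collect_pages(fake_fetch, max_results=200)
print(len(results))  # 120
```

With the real fetch function from this project, the first argument could be something like `lambda offset, limit: fetch_restaurant_data('Washington D.C.', offset=offset, limit=limit)`. Yelp also limits how far a search can be paged, so check the API documentation before requesting very large totals.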

# Import the requests package
import requests
# Set up the authorization header for the API request
headers = {'Authorization': f'Bearer {API_KEY}'}
# This function fetches business data from Yelp's API
# location = the area to search in
# term = the type of business: restaurants, coffee shops, or bars (used to gather as much information about Yelp reviews in DC as possible)
# limit = number of results per request; here we ask for 50 at a time
# offset = number of results to skip; start at 0 and increase it to move to the next page of results
def fetch_restaurant_data(location, term="restaurants", limit=50, offset=0):
    url = 'https://api.yelp.com/v3/businesses/search'
    params = {
        'term': term,
        'location': location,
        'limit': limit,
        'offset': offset,  # Include offset in the params
    }
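    response = requests.get(url, headers=headers, params=params)
    # Stop with an error if the request failed (for example, an invalid key)
    response.raise_for_status()
    data = response.json()
    return data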

offset = 0  # Start at 0; use 50, 100, ... to fetch the next sets of results
limit = 50  # Fetch 50 results at a time

# Fetch data for restaurants in Washington D.C.
restaurants = fetch_restaurant_data('Washington D.C.', offset=offset, limit=limit)
# Import pandas
import pandas as pd
# This function cleans up the JSON response from above and returns it as a pandas DataFrame
def process_data(restaurants_data):
    # Create an empty list to store the data in
    restaurant_list = []
    for business in restaurants_data['businesses']:
        # Add the selected fields for this business as one row
        restaurant_list.append({
            'name': business['name'],
            'cuisine': business['categories'][0]['title'] if business['categories'] else 'Unknown',
            'price_range': business.get('price', 'N/A'),
            'rating': business.get('rating', 'N/A'),
            'review_count': business.get('review_count','N/A'),
            'neighborhoods': business.get('neighborhoods', 'N/A'),  # not present in these responses, so always 'N/A' here
            'latitude': business['coordinates']['latitude'],
            'longitude': business['coordinates']['longitude'],
            'zip_code': business['location']['zip_code'],
        })
    # Convert the list of rows to a DataFrame
    df = pd.DataFrame(restaurant_list)
    return df
# Use the function above to create a DataFrame from the JSON response
df_restaurants = process_data(restaurants)
# Print the first few rows to ensure everything looks good
print(df_restaurants.head())
                   name             cuisine price_range  rating  review_count  \
0  Unconventional Diner        New American          $$     4.4          2946   
1             L'Ardente             Italian         $$$     4.5          1242   
2          Grazie Nonna             Italian          $$     4.1           536   
3      Old Ebbitt Grill                Bars          $$     4.2         11086   
4      Gypsy Kitchen DC  Tapas/Small Plates          $$     4.3           919   

  neighborhoods   latitude  longitude zip_code  
0           N/A  38.906139 -77.023800    20001  
1           N/A  38.898919 -77.014074    20001  
2           N/A  38.904010 -77.035000    20005  
3           N/A  38.897967 -77.033342    20005  
4           N/A  38.914880 -77.031550    20009  
# Print the full results to ensure everything is correct
#print(df_restaurants)
# Save the DataFrame to a CSV file in the raw-data folder
df_restaurants.to_csv('../../data/raw-data/df_coffee5.csv')
#/Users/rachnarawalpally/project-rachnarawalpally/data/raw-data
                                   name                cuisine price_range  \
0               Pitango Gelato & Coffee           Coffee & Tea          $$   
1                      Capital One Café           Coffee & Tea         N/A   
2                    Mah-Ze-Dahr Bakery               Bakeries          $$   
3        Junction Bistro Bar and Bakery               Bakeries          $$   
4                          Coffee Alley           Coffee & Tea         N/A   
5                       Gregorys Coffee               Bakeries          $$   
6                           Atrium Cafe                  Cafes           $   
7                         Mitsitam Cafe               American          $$   
8                        Cafe Levantine               Lebanese         N/A   
9                      Capital One Café  Banks & Credit Unions         N/A   
10                               Zeleno             Sandwiches          $$   
11                        Le Caprice DC                  Cafes           $   
12                        Union Kitchen           Coffee & Tea          $$   
13                       Bluestone Lane           Coffee & Tea         N/A   
14              Corella Café and Lounge           Coffee & Tea          $$   
15          Casey's Coffee & Sandwiches           Coffee & Tea          $$   
16                Union Kitchen Grocery           Coffee & Tea          $$   
17     L.A. Burdick Handmade Chocolates   Chocolatiers & Shops         N/A   
18            Point Chaud Cafe & Crepes              Creperies          $$   
19                     Vigilante Coffee           Coffee & Tea          $$   
20                       Three Whistles   Shared Office Spaces           $   
21           Adulis Coffee and Roastery           Coffee & Tea         N/A   
22                          Timgad Café                  Cafes         N/A   
23                         Café du Parc                 French          $$   
24                             Morsel's           Coffee & Tea         $$$   
25                           Sheba Café                  Cafes         N/A   
26                           Licht Cafe                  Cafes         N/A   
27                         Blank Street           Coffee & Tea         N/A   
28                          Colada Shop           Coffee & Tea          $$   
29                     Commonwealth Joe                  Cafes           $   
30                          Uptown Cafe           Coffee & Tea           $   
31                       Morning My Day               Bakeries           $   
32                        Caseys Coffee           Coffee & Tea         N/A   
33                    Merriweather Cafe                  Cafes         N/A   
34                Soricha Tea & Theater           Coffee & Tea          $$   
35                        Peet's Coffee           Coffee & Tea          $$   
36                    La Bohemia Bakery               Bakeries           $   
37                    Milk + Honey Café           Coffee & Tea         N/A   
38                     Baker’s Daughter     Breakfast & Brunch         N/A   
39                  Turkish Coffee Lady           Coffee & Tea          $$   
40                         Cortado Cafe                  Cafes          $$   
41  Tiger Sugar Boba Bubble Tea shop DC             Bubble Tea          $$   
42   Call Your Mother Deli - Georgetown                  Delis          $$   
43                        Cafe Integral           Coffee & Tea          $$   
44                         Black Coffee           Coffee & Tea          $$   
45                        Coffee Nature           Coffee & Tea           $   
46                    Bread & Chocolate     Breakfast & Brunch          $$   
47                          Le Bon Cafe           Coffee & Tea          $$   
48                         Mo Mo Bakery               Bakeries           $   
49                        The Hill Cafe           Coffee & Tea         N/A   

    rating  review_count neighborhoods   latitude  longitude zip_code  
0      4.2          1040           N/A  38.895058 -77.021854    20004  
1      4.3            24           N/A  38.867232 -76.988468    20020  
2      4.3           137           N/A  38.858644 -77.049471    22202  
3      4.2            95           N/A  38.894935 -77.002259    20002  
4      4.0             1           N/A  38.898571 -77.021774    20001  
5      3.8            87           N/A  38.876988 -77.004496    20003  
6      3.8           106           N/A  38.884277 -77.018194    20024  
7      3.5           538           N/A  38.888184 -77.016863    20560  
8      4.7            29           N/A  38.935452 -77.179605    22101  
9      4.0            63           N/A  38.904992 -77.062633    20007  
10     4.3            95           N/A  38.911483 -77.044128    20009  
11     3.5           349           N/A  38.932815 -77.032744    20010  
12     4.1            95           N/A  38.906762 -77.023699    20001  
13     3.3            25           N/A  38.894308 -77.029739    20004  
14     3.9            39           N/A  38.983660 -77.092950    20814  
15     2.9            25           N/A  38.883440 -77.016027    20024  
16     4.1            16           N/A  38.912090 -77.003690    20002  
17     4.3            89           N/A  38.907180 -77.063050    20007  
18     3.7            57           N/A  38.920154 -77.071873    20007  
19     4.3           199           N/A  38.992035 -76.933845    20740  
20     4.4           110           N/A  38.889560 -77.091200    22201  
21     4.6             7           N/A  38.985367 -77.027355    20910  
22     0.0             0           N/A  38.897080 -77.010790    20001  
23     3.4           496           N/A  38.896491 -77.032656    20004  
24     3.0             3           N/A  38.922855 -77.053824    20008  
25     4.6             8           N/A  38.934610 -77.033200    20010  
26     4.9            11           N/A  38.916837 -77.035461    20009  
27     3.4            18           N/A  38.906160 -77.063010    20007  
28     4.0            68           N/A  38.907064 -77.043662    20036  
29     4.6           471           N/A  38.862669 -77.054934    22202  
30     3.6            83           N/A  38.905458 -77.005096    20002  
31     4.8            27           N/A  38.998113 -77.031311    20910  
32     3.5            10           N/A  38.899840 -77.007540    20002  
33     3.8            12           N/A  38.943640 -77.052620    20008  
34     4.6           469           N/A  38.833030 -77.191437    22003  
35     3.8           108           N/A  38.899230 -77.039980    20006  
36     4.3           258           N/A  39.057995 -77.112355    20852  
37     3.2             9           N/A  38.884751 -77.017456    20472  
38     3.6            23           N/A  38.904533 -77.062452    20007  
39     4.6           187           N/A  38.805617 -77.050411    22314  
40     4.6           116           N/A  38.813170 -77.111088    22304  
41     3.9            23           N/A  38.922077 -76.996569    20002  
42     4.4           382           N/A  38.907617 -77.068837    20007  
43     3.7            23           N/A  38.916040 -77.046870    20009  
44     4.1           106           N/A  38.917900 -77.096820    20007  
45     4.2           159           N/A  38.954480 -77.083120    20016  
46     3.3           399           N/A  38.905660 -77.050450    20037  
47     3.7           235           N/A  38.887260 -77.003357    20003  
48     3.9            16           N/A  39.052131 -77.051026    20902  
49     4.3            56           N/A  38.891074 -76.983391    20002  

Final Thoughts

This code demonstrates how to retrieve data from Yelp’s API to gather information on businesses such as restaurants, coffee shops, and bars (as used in this project). The script can be easily modified to query any other type of business available on Yelp. It provides a simple way to fetch Yelp data based on specific parameters, convert the raw JSON response into a pandas DataFrame, and save the DataFrame as a CSV file for later analysis. By altering the search term (e.g., “restaurants”, “coffee shops”, or “bars”) and adjusting other parameters like limit and offset, you can customize the data retrieval to suit your needs. This keeps the process of gathering Yelp data for analysis simple and repeatable.