If you have ever been out for drinks in the financial district of San Francisco, you may have stumbled across Mikkeller Bar. This Danish beer hall is a great place to go with friends after work, with an atmosphere that is just lively enough for great conversation. With an expansive draft list and a deep selection of sours, it is the type of place I always want to drink at.
Now, if I looked up Mikkeller on Yelp, I would get not only the bar itself but also bars with a similar profile in the area, which is great if I want to try a place with a similar vibe.
However, this type of search does not translate when you search in a different location. For instance, searching for Mikkeller Bar in Chicago does not surface comparable bars.
While you could possibly get similar results by searching for specific keywords like "beer hall" or "gastropub," I felt that a more specific search method was needed. Why doesn't Yelp have a feature where users can provide bars that they enjoy as input, and then expand that search to a location of their choosing? Such a search method would also be able to capture less tangible aspects like ambience. Therefore, I decided that for my fourth Metis project, I would build this feature using unsupervised learning and NLP techniques.
Data
Data Sources:
For this project, I got my data from the Yelp challenge dataset, a publicly available dataset provided by Yelp. This was an invaluable resource, as Yelp's terms of service are decidedly anti-webscraping. The dataset itself was huge: over 6 million reviews for 800,000+ businesses, spread across 10 different metropolitan areas in the US and Canada. That geographic diversity was a bonus for my project, as I wanted it to be location agnostic. The data from Yelp was stored in JSON format, so I queried and filtered it using a local MongoDB instance. In retrospect, since the JSON files were massive (the review text JSON alone was 4.72 GB), analyzing the data would have been more efficient if I had moved everything to a large-memory AWS instance. The 8 GB of RAM on my MacBook Air was not adequate to run the analysis without micromanaging computer resources.
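For reference, here is a minimal sketch of how one of the newline-delimited Yelp JSON files can be streamed into a local MongoDB instance with pymongo; the file name and batch size are illustrative:

import json
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.yelp

# Stream the file line by line so the full multi-GB file never sits in memory
with open('yelp_academic_dataset_review.json') as f:
    batch = []
    for line in f:
        batch.append(json.loads(line))
        if len(batch) == 10000:
            db.reviews.insert_many(batch)
            batch = []
    if batch:
        db.reviews.insert_many(batch)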
Data Filtering:
After loading my data into MongoDB, my first step was to filter the list of businesses down to bars. My first approach (using pymongo) was to perform a regex match on any business that had "Bars" in its category string. A bar also had to have at least 50 reviews so that there would be enough text data to draw inferences from:
# Create a subset of businesses whose categories contain "Bars"
db.business.aggregate([
    {"$match": {"categories": {"$regex": "Bars"}}},
    {"$match": {"review_count": {"$gte": 50}}},
    {"$out": "bars"}
])
This approach did end up reducing the number of businesses to search through from 180,000+ to 5,000+. However, one quirk I quickly noticed was that the regex approach captured not only bar types like wine bars or sports bars, but also sushi bars. I ended up being more restrictive and wrote two functions to keep only businesses whose category list contained a specific category (in this case "bars"):
def check_for_cat(categories, cat_type):
    """
    Return True if the business's category string contains cat_type.
    """
    # split the comma-separated category string from the JSON file
    categories_list = categories.split(',')
    # check each category against cat_type
    for category in categories_list:
        # lowercase the category and strip any leading/trailing spaces
        mod_cat = category.strip().lower()
        if mod_cat == cat_type:
            return True
    return False
def filter_yelplist(locations, cat_type):
    """
    Return the list of Yelp locations filtered for the presence of a business category.
    """
    return [loc for loc in locations if check_for_cat(loc['categories'], cat_type)]
This additional filtering left me with 4,000+ bars to work with.
Next, I had to join the list of bars with the reviews for each bar. In a SQL database this would be accomplished with a simple JOIN, but MongoDB is a NoSQL document store, so that was not an option (I did experiment with the $lookup operator). I ended up writing the following query to get the relevant data:
# Create a subset of reviews matching the filtered bar ids
locations = list(db.bars.find({}, {'_id': 0, 'business_id': 1, 'categories': 1}))
filtered_bars = filter_yelplist(locations, 'bars')
allbar_ids = [i['business_id'] for i in filtered_bars]

db.reviews.aggregate([
    {'$match': {'business_id': {'$in': allbar_ids}}},
    {"$out": "bar_reviews"}
])
After all of this data processing, I ended up with a subset of 4,000+ bars, each with at least 50 reviews. In total, there were 800,000+ reviews that I could analyze.
Text Processing
After filtering the data, I still had to process the review text itself before passing it to a tokenizer. I did three things:
Remove numbers and punctuation by using regex
Convert all case types to lowercase
Lemmatize all words
Lemmatization was important because I didn't want the model to treat different forms of the same word as distinct tokens. Lemmatizing accomplishes this by returning each word's lemma, or "dictionary form," so plurals and participial forms of a given word are replaced with the base form. For example, the lemma of "walking" is "walk" and the lemma of "better" is "good".
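As a rough sketch of these three steps, assuming NLTK's WordNetLemmatizer (the exact details of my pipeline may have differed):

import re
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def clean_review(text):
    # 1. remove numbers and punctuation, keeping only letters and spaces
    text = re.sub('[^a-zA-Z ]', ' ', text)
    # 2. lowercase everything
    words = text.lower().split()
    # 3. lemmatize: a noun pass then a verb pass folds plurals and verb forms
    #    into the base form; mapping adjectives like "better" -> "good" would
    #    additionally need a part-of-speech tag (pos='a')
    return ' '.join(lemmatizer.lemmatize(lemmatizer.lemmatize(w), pos='v')
                    for w in words)

clean_review('Walking to 3 different bars!')  # -> 'walk to different bar'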
Feature Engineering
I feel that the ambience of a bar is very important, so I used NLP on the user reviews to capture a proxy for it. After experimenting with different token vectorizers and dimensionality reduction techniques, I settled on a combination of count vectorization and Latent Dirichlet Allocation (LDA), which gave me the most interpretability.
Vectorization
To perform vectorization, I needed to generate a relevant list of tokens from the review text. While NLTK provides a base set of English stop words to exclude, additional tokens had to be excluded during topic generation. The tokens on my exclusion list fell into two categories: location-related and food-related. Location-related tokens were excluded to keep the model location agnostic, while food-related tokens were excluded so that the topics stayed focused on the drinks and the ambience. In terms of vectorizer type, CountVectorizer ended up performing better than TF-IDF. I believe this is because words that are important for describing a bar (e.g. beer, music, drinks) show up repeatedly, and CountVectorizer rewards that repetition while TF-IDF penalizes it.
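A minimal sketch of the vectorization step; the extra stop words shown are a small illustrative subset of my full exclusion list, and the document-frequency cutoffs are assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# illustrative location-related and food-related exclusions
extra_stop_words = ['vegas', 'toronto', 'phoenix',
                    'pizza', 'burger', 'sushi', 'taco']
all_stop_words = stopwords.words('english') + extra_stop_words

vectorizer = CountVectorizer(stop_words=all_stop_words,
                             max_df=0.9,  # drop tokens in >90% of documents
                             min_df=10)   # drop very rare tokens

# bar_reviews: an iterable of the cleaned, lemmatized review strings
doc_word = vectorizer.fit_transform(bar_reviews)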
Topic Modeling
Using LDA, I performed topic modeling and generated 9 topics that I felt were indicative of the different types of bars in my dataset.
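A minimal sketch of this step with scikit-learn, building on the doc_word matrix and vectorizer from above; everything other than the topic count is an assumption:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=9, random_state=42)
doc_topic = lda.fit_transform(doc_word)  # (n_documents, 9) topic weights

# print the top words per topic to help label archetypes like "Outdoor"
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]
    print(f'Topic {idx}: {", ".join(top)}')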
Two of the bar archetypes (Outdoor and Brunch) ended up being the most descriptive, as most of the bars in the dataset were described by those two topics. This was apparent in how those two topics mapped onto the t-SNE plots.
The remaining seven topics were more specific, each generally describing only a few bars in the dataset. For example, the arcade and video game bars mapped to their own distinct region of the t-SNE plot.
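For reference, a sketch of how such a t-SNE plot can be produced, assuming bar_topic is an (n_bars, 9) NumPy array holding each bar's review-averaged topic weights:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# project the 9-dimensional topic weights down to 2-D
embedding = TSNE(n_components=2, random_state=42).fit_transform(bar_topic)
dominant_topic = bar_topic.argmax(axis=1)  # color each bar by its top topic

plt.scatter(embedding[:, 0], embedding[:, 1], c=dominant_topic, cmap='tab10', s=10)
plt.colorbar(label='dominant topic')
plt.show()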
Categorical Features
In addition to the NLP-generated features, some valuable categorical features were added to the model. The Yelp dataset actually contained many categorical features, e.g. whether or not a place was good for dancing or had a television. However, I decided not to use any of those, since I wanted to derive proxies for those characteristics through topic modeling. The only categorical feature I took directly from the dataset was the price range of the business.
While the topic modeling gave me insight into the ambience of a bar, I also wanted my model to focus on a bar's primary alcohol type. To accomplish this, I decided to amplify the presence of certain keywords in the reviews. For example, for each bar I counted the number of times the token "whiskey" was mentioned. Dividing that count by the bar's total number of reviews gave a proportion signaling how much of a whiskey bar a given bar was. I created these presence-type features for the keywords whiskey, vodka, beer, wine, tequila, rum, shot, gin, brandy, soju, cider, and sake.
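A minimal sketch of how these keyword-proportion features can be computed per bar (the helper name is mine):

DRINK_KEYWORDS = ['whiskey', 'vodka', 'beer', 'wine', 'tequila', 'rum',
                  'shot', 'gin', 'brandy', 'soju', 'cider', 'sake']

def drink_features(reviews):
    """reviews: list of cleaned review strings for a single bar."""
    n_reviews = len(reviews)
    features = {}
    for keyword in DRINK_KEYWORDS:
        # total keyword mentions across all of the bar's reviews
        count = sum(review.split().count(keyword) for review in reviews)
        features[keyword] = count / n_reviews
    return features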
Recommendation Engine
Giving Recommendations
With the features established, I created a pipeline to generate features for each of the bars in my dataset. A recommendation was made by comparing the cosine similarity of an input bar against all the other bars in the dataset; if multiple bars were given as input, I used the cosine similarity of the averaged vector of the input bars. This similarity comparison formed the basis of my recommendation engine.
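A minimal sketch of this similarity lookup, assuming feature_matrix is an (n_bars, n_features) NumPy array and bar_ids is the parallel list of business ids:

from sklearn.metrics.pairwise import cosine_similarity

def recommend(input_ids, feature_matrix, bar_ids, n=5):
    # average the feature vectors of the input bars into one query vector
    idx = [bar_ids.index(b) for b in input_ids]
    query = feature_matrix[idx].mean(axis=0, keepdims=True)
    # score every bar against the query and rank, skipping the inputs
    scores = cosine_similarity(query, feature_matrix)[0]
    ranked = [bar_ids[i] for i in scores.argsort()[::-1]
              if bar_ids[i] not in input_ids]
    return ranked[:n]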
This recommendation engine was then deployed on Heroku through a Flask app with an interactive Tableau dashboard.
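As a rough illustration, the Flask side of such a deployment can be as small as a single route wrapping the recommend function above; the route and parameter names here are hypothetical:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/recommend')
def recommend_route():
    # e.g. /recommend?bars=abc123,def456&n=5
    input_ids = request.args.get('bars', '').split(',')
    n = int(request.args.get('n', 5))
    # feature_matrix and bar_ids are assumed loaded at startup
    return jsonify(recommend(input_ids, feature_matrix, bar_ids, n=n))

if __name__ == '__main__':
    app.run()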
If you would like to try out some recommendations yourself you can access the app here.
Future Work
If given more time to extend this project, there are some improvements I would have liked to work on:
More Relevant Data: The current dataset did not include bars from major metro areas like San Francisco or New York, which I believe would have helped the engine make more relevant recommendations.
Linking Interactive Graphics: If this feature were actually deployed through Yelp or a similar service, I feel that interactive graphics would be a big selling point. I would have liked to link the Tableau dashboard and the Flask interface so that the end user had a greater degree of control.
If you would like to learn more about this project you can find the code and data at my Github repo.