<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Projects | Data Dillon</title><link>https://datadillon.netlify.app/project/</link><atom:link href="https://datadillon.netlify.app/project/index.xml" rel="self" type="application/rss+xml"/><description>Projects</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><image><url>https://datadillon.netlify.app/media/icon_hu084d2819d5787271ab93a3b21afd0b3d_30671_512x512_fill_lanczos_center_3.png</url><title>Projects</title><link>https://datadillon.netlify.app/project/</link></image><item><title>Fare Forecaster | Save Time, Make Money</title><link>https://datadillon.netlify.app/project/taxi/</link><pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/taxi/</guid><description>&lt;h1 id="fare-forecaster-save-time-make-money">Fare Forecaster: Save Time, Make Money&lt;/h1>
&lt;p>By Dillon Diatlo&lt;/p>
&lt;h2 id="the-app">The App&lt;/h2>
&lt;p>&lt;a href="https://fareforecaster.streamlit.app/" target="_blank" rel="noopener">https://fareforecaster.streamlit.app/&lt;/a>&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>In the world of NYC for-hire vehicle drivers, high-fare trips are few and far between. But maybe they don&amp;rsquo;t have to be. The objective of this experiment is to sample 212 million rows of data from 2022 NYC cab trips to predict––by borough and time––which taxi zones will have, on average, the highest-paying trips. By exploring features such as request time, weather, trip duration, date, and more, the hope is to expose patterns contributing to high-fare rides and enable NYC drivers to optimize their time and make the most money possible.&lt;/p>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Dataset&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>req_index&lt;/strong>&lt;/td>
&lt;td>&lt;em>datetime&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The datetime index representing the day each trip was requested&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>trip_miles&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The distance of every for-hire-vehicle trip&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>tips&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The money each passenger tipped the driver&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>congestion_surcharge&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>An additional surcharge charged to passengers for trips taken during busy times of day&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>temp&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The temperature in NYC at the time of the trip&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>preciptype&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>What the weather precipitation was like, represented as follows:&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&amp;lsquo;missing&amp;rsquo;: 0, &amp;lsquo;rain&amp;rsquo;: 1, &amp;lsquo;snow&amp;rsquo;: 2, &amp;lsquo;rain,snow&amp;rsquo;: 3, &amp;lsquo;rain,freezingrain,snow&amp;rsquo;: 4, &amp;lsquo;rain,freezingrain,snow,ice&amp;rsquo;: 5, &amp;lsquo;rain,freezingrain&amp;rsquo;: 6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>zone&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>Which NYC taxi zone the trip began in (there are 265 unique zones)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>borough_name&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>Which NYC borough the trip began in&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>trip_duration&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The length of time a trip took from pickup to dropoff&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>month&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The month a trip took place in&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>day_of_month&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>A number between 1-31 representing which day of the month the trip took place&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>driver_made&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The total amount the driver made after the employer&amp;rsquo;s charge, tip included&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>day_of_week&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The day of the week the trip took place&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>hour&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The hour in which the trip was requested and began&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>minute&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The minute of the hour at which the trip was requested&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="original-data-source">Original Data Source&lt;/h2>
&lt;p>&lt;a href="https://data.cityofnewyork.us/Transportation/2022-High-Volume-FHV-Trip-Records/g6pj-fsah/about_data" target="_blank" rel="noopener">https://data.cityofnewyork.us/Transportation/2022-High-Volume-FHV-Trip-Records/g6pj-fsah/about_data&lt;/a>&lt;/p>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project objective&lt;/h3>
&lt;p>&lt;strong>Problem Statement:&lt;/strong>
The objective of this experiment is to use 212 million rows of data from 2022 NYC cab trips to predict––by borough and time––which taxi zones will have, on average, the highest-paying trips.&lt;/p>
&lt;p>&lt;strong>Goals:&lt;/strong>
The goal of this project is to build a working tool that for-hire vehicle drivers in NYC can use to optimize their income and time.&lt;/p>
&lt;h3 id="methodology">Methodology&lt;/h3>
&lt;p>The methodology involves several steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;em>Data Downsizing&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Downloaded 212,000,000 rows of data in half-month sizes&lt;/li>
&lt;li>Used the Pandas &amp;ldquo;chunksize&amp;rdquo; parameter to downsize each half-month dataframe during import&lt;/li>
&lt;li>Concatenated the downsized half-month dataframes into 12 downsized full-month dataframes&lt;/li>
&lt;/ul>
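&lt;p>The downsizing steps above can be sketched in pandas as follows; the filename, sampling fraction, and chunk size here are illustrative assumptions, not the exact values used:&lt;/p>

```python
import pandas as pd

def downsize_csv(path, frac=0.02, chunksize=1_000_000, seed=42):
    """Read a large half-month trip file in chunks, keeping a random
    sample of each chunk so the full file never sits in memory at once."""
    sampled_chunks = [
        chunk.sample(frac=frac, random_state=seed)
        for chunk in pd.read_csv(path, chunksize=chunksize)
    ]
    # Concatenate the downsized chunks into one half-month dataframe
    return pd.concat(sampled_chunks, ignore_index=True)
```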
&lt;/li>
&lt;li>
&lt;p>&lt;em>Data Cleaning&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Combined all 12 monthly dataframes into a single dataframe&lt;/li>
&lt;li>Replaced and dropped NaNs&lt;/li>
&lt;li>Transformed object-type data into datetime&lt;/li>
&lt;li>Set the index as dates for time series analysis&lt;/li>
&lt;li>Dropped unnecessary columns&lt;/li>
&lt;/ul>
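&lt;p>A minimal sketch of the cleaning steps above, assuming a TLC-style request timestamp column named request_datetime (the column name is a hypothetical stand-in):&lt;/p>

```python
import pandas as pd

def clean_trips(monthly_frames):
    """Combine monthly dataframes, drop rows missing a request time,
    parse the request time, and index by it for time series analysis."""
    df = pd.concat(monthly_frames, ignore_index=True)
    df = df.dropna(subset=["request_datetime"])
    df["request_datetime"] = pd.to_datetime(df["request_datetime"])
    return df.set_index("request_datetime").sort_index()
```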
&lt;/li>
&lt;li>
&lt;p>&lt;em>Exploratory Data Analysis (EDA):&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Reset the df index for time series analysis&lt;/li>
&lt;li>Created heatmaps to uncover any unexpected correlations&lt;/li>
&lt;li>Explored data by day, week, and month to spot trends&lt;/li>
&lt;li>Analyzed data by NYC Borough through bar graphs and box plots&lt;/li>
&lt;li>Investigated distribution of driver pay and more via histograms&lt;/li>
&lt;li>Looked for correlations via scatter plots&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Feature Engineering&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Created a &amp;ldquo;driver_made&amp;rdquo; column by combining &amp;lsquo;tips&amp;rsquo; with &amp;lsquo;driver_pay&amp;rsquo;&lt;/li>
&lt;li>Extracted hour, minute, and date columns to explore distributions and trends&lt;/li>
&lt;li>Used ordinal mapping on borough names and precipitation&lt;/li>
&lt;li>Built a &amp;lsquo;trip_duration&amp;rsquo; column by subtracting pickup_time from dropoff_time&lt;/li>
&lt;li>OneHotEncoded zones in a preprocessing pipeline&lt;/li>
&lt;li>Dropped missing values and encoded others&lt;/li>
&lt;li>Scaled features to more accurately find patterns when modeling&lt;/li>
&lt;/ul>
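&lt;p>The engineered columns above can be sketched like so; the driver_pay, pickup_datetime, and dropoff_datetime column names are assumed stand-ins, while the precipitation codes mirror the data dictionary:&lt;/p>

```python
import pandas as pd

# Ordinal mapping for preciptype, taken from the data dictionary
PRECIP_MAP = {
    "missing": 0, "rain": 1, "snow": 2, "rain,snow": 3,
    "rain,freezingrain,snow": 4, "rain,freezingrain,snow,ice": 5,
    "rain,freezingrain": 6,
}

def engineer_features(df):
    """Add driver_made, trip_duration (minutes), and ordinal preciptype."""
    df = df.copy()
    df["driver_made"] = df["driver_pay"] + df["tips"]
    df["trip_duration"] = (
        df["dropoff_datetime"] - df["pickup_datetime"]
    ).dt.total_seconds() / 60
    df["preciptype"] = df["preciptype"].fillna("missing").map(PRECIP_MAP)
    return df
```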
&lt;/li>
&lt;li>
&lt;p>&lt;em>Sampling&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Further sampled data with .sample() method to speed model exploration process&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Train-Test-Split&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Divided data into training and testing sets to assess model performance and accuracy&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Model Selection&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Built and tested six different model types&lt;/li>
&lt;li>Tuned the parameters of those models across many pipelines to improve accuracy scores&lt;/li>
&lt;li>Models tested:
&lt;ul>
&lt;li>GradientBoost (winner)&lt;/li>
&lt;li>RandomForest&lt;/li>
&lt;li>LassoCV&lt;/li>
&lt;li>RidgeRegression&lt;/li>
&lt;li>LinearRegression&lt;/li>
&lt;li>XGBoost&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
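&lt;p>A sketch of the winning setup described above: a pipeline that OneHotEncodes zones, scales numeric features, and feeds a gradient boosting estimator (scikit-learn&amp;rsquo;s GradientBoostingRegressor; the hyperparameters here are assumptions):&lt;/p>

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_fare_pipeline(categorical_cols, numeric_cols):
    """One-hot encode categorical columns, scale numeric ones, then boost."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", GradientBoostingRegressor(random_state=42)),
    ])
```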
&lt;/li>
&lt;li>
&lt;p>&lt;em>Model Evaluation&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Fit models with training data and ran models with testing data&lt;/li>
&lt;li>Analyzed model performance via metrics including R^2 score, MSE, and RMSE&lt;/li>
&lt;li>Pickled the model with the best evaluation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Streamlit Tools&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Constructed a Streamlit web application for real-time use of the model&lt;/li>
&lt;li>Illustrated and themed the application&lt;/li>
&lt;li>Deployed the application&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="key-findings--plots">Key Findings &amp;amp; Plots&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The following features, via a GradientBoostRegressor, supported the most accurate predictive model, with an R^2 score of 0.87:&lt;/p>
&lt;ul>
&lt;li>trip_miles, congestion_surcharge, temp, preciptype, zone, borough_name, trip_duration, month, day_of_month, day_of_week, hour, minute&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Driver pay per trip tends to increase during the first 5 months of the year, before plateauing at an average of about $21
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip" srcset="
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1135bd31adf2578e09fef6ce49f78d94.webp 400w,
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_0909b1f6f7279cfa5b0c521795bd8e14.webp 760w,
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1135bd31adf2578e09fef6ce49f78d94.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The borough with the highest driver-pay per trip is Queens (by a slight margin), then Manhattan, Brooklyn, Staten Island, and the Bronx
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip" srcset="
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_fcdc06f98915440a61349c8eaa92241c.webp 400w,
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_8f19a42fa65f7fa91a942e6f5b9aed4b.webp 760w,
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_fcdc06f98915440a61349c8eaa92241c.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>This is most likely because JFK International Airport is in Queens and on average has the highest driver-pay per trip by month&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Average driver pay by request hour is bimodal, with the highest average pay around 7am and 4pm and a slight spike around 9pm
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip by Hour" srcset="
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_2b0bce1e047c6c9058e3dc022b284712.webp 400w,
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_baeec4e591ff864cb7819a75d0d7c481.webp 760w,
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_2b0bce1e047c6c9058e3dc022b284712.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Average tip size peaks every 5 minutes, beginning at the bottom of the hour, with trips requested at minute 10 yielding the largest average tip per trip
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip by Minute" srcset="
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_7a9c7edd616d3398d75a812b830f942a.webp 400w,
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_9b5661c9c33936ea92f4c35b1273a17a.webp 760w,
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_7a9c7edd616d3398d75a812b830f942a.webp"
width="760"
height="633"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most trips are under 25 minutes long, with the greatest number of trips falling in the 8-10 minute range
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Trip Length" srcset="
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_fd734596090bc2b76241f3d2fe480965.webp 400w,
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_3e4044813520105c288881e202bdb400.webp 760w,
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_fd734596090bc2b76241f3d2fe480965.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ride requests hit a low around the 30 minute mark of the hour and peak around the 45 minute mark
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Ride Requests by Minute" srcset="
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_a9b964c68e45b153cbd67bd605c36c8a.webp 400w,
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_db0a23711f29ddbd431a265447ef54b6.webp 760w,
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_a9b964c68e45b153cbd67bd605c36c8a.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Drivers make between $5-$15 for the majority of their trips
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Pay by Trip" srcset="
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_9124020d9b8ae17de014652d97dfe232.webp 400w,
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_d54bf0180c2a3a8fb2446d7edf28f793.webp 760w,
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_9124020d9b8ae17de014652d97dfe232.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion-and-recommendations">Conclusion and Recommendations&lt;/h2>
&lt;h4 id="conclusion">Conclusion&lt;/h4>
&lt;p>The analysis of this 212M-row dataset revealed a number of key insights regarding for-hire vehicle metrics in NYC.&lt;/p>
&lt;p>First and foremost, I hit an immediate bump in the road, finding it necessary to sample the data down from 212M rows to about 4M, which, while sufficient, is only about 2% of the original data. Left unsampled, the dataset proved difficult and time-consuming to work with––underscoring the importance of tools like Spark, TensorFlow, Databricks, and BigQuery.&lt;/p>
&lt;p>By manipulating the data with Python, Pandas, Matplotlib, and more, I was able to determine, by borough, which taxi zone would produce the highest average revenue per trip. The model that performed best was a pipeline of StandardScaler and OneHotEncoder transformers paired with a GradientBoostRegressor estimator. This model achieved an R^2 score of 0.87 on the testing set when predicting the revenue a driver might make per trip in a given taxi zone––down to a given minute.&lt;/p>
&lt;p>Pickling this model and pairing it with Streamlit tools proved an efficient way to build an audience-facing application that, given the borough, weather, and more, can return the taxi zone in that borough where a driver can make the most money.&lt;/p>
&lt;p>Overall, the findings reinforce the importance of location and time in determining a driver&amp;rsquo;s average revenue per trip, with less heavily weighted, though still important, features like congestion and temperature.&lt;/p>
&lt;h4 id="recommendations--next-steps">Recommendations &amp;amp; Next Steps&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>&lt;ins>TOOLS&lt;/ins> - Because of the sheer size of the data available, future experiments should utilize tools built for large-scale datasets, like Spark or BigQuery, to ensure analysis of all data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>TIMESPAN&lt;/ins> - The for-hire vehicle data was limited to a single year (2022). Improved metrics, as well as deeper and more confident insights, can be found through multiple years of data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>GEO-GRANULARITY&lt;/ins> - While currently specified by one of 265 NYC taxi zones, more granular geolocation data could yield more specific recommendations within taxi zones, possibly down to requests by block.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>BIASES&lt;/ins> - Though initially unbiased, the data was left a bit skewed towards certain days and months after downsizing and sampling. In the future, one could try different approaches to parameter tuning, sampling, and downsizing to get the least biased results.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>DEMAND&lt;/ins> - Currently, this model does not take demand into consideration. To optimize in real time, one could incorporate demand data by taxi zone and time.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;p>Additional EDA Plots
&lt;div class="gallery-grid">
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_fare_week.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_fare_week_hu64af1b25011d21fb047aefdde3d7a796_79582_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_fare_week.png" width="750" height="536">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_temp.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_temp_hue45c29d09a09155699d169d4275fff04_27693_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_pay_per_temp.png" width="750" height="600">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_trip_day_of_month.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_trip_day_of_month_huce24d853ccad2b6789b0cb00ad48f641_34082_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_pay_per_trip_day_of_month.png" width="750" height="600">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/box_plot_avg_pay_borough_no_outliers.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/box_plot_avg_pay_borough_no_outliers_hu84330c4a16bcd6d61247c2f877c366f5_26428_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="box_plot_avg_pay_borough_no_outliers.png" width="640" height="480">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/busiest_borough_when_cold.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/busiest_borough_when_cold_huf54de82af5ff44c98bdde430b91a564a_39755_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="busiest_borough_when_cold.png" width="750" height="500">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/corr_tips_tripduration.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/corr_tips_tripduration_huc60caada8309bf8c871ae69a9826334c_442552_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="corr_tips_tripduration.png" width="750" height="500">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/sheet-1.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/sheet-1_hua3637c06c3e49e0ac3c73aab90788191_510237_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="sheet-1.png" width="750" height="372">
&lt;/a>
&lt;/div>
&lt;/div>
&lt;/p>
&lt;hr></description></item><item><title>Predicting Successful Songs</title><link>https://datadillon.netlify.app/project/spotify/</link><pubDate>Mon, 08 Apr 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/spotify/</guid><description>&lt;h1 id="popularity-prediction-using-spotify-features">Popularity Prediction using Spotify Features&lt;/h1>
&lt;p>Authors: Dillon Diatlo, Elaine Chen, Harnish Shah, Rajashree Choudhary&lt;/p>
&lt;h2 id="role">Role&lt;/h2>
&lt;ul>
&lt;li>Data Cleaning&lt;/li>
&lt;li>Data Visualization&lt;/li>
&lt;li>Data Modeling (build, train, validate)&lt;/li>
&lt;li>Presentation&lt;/li>
&lt;/ul>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>In the expansive domain of &lt;a href="https://www.spotify.com/" target="_blank" rel="noopener">Spotify&lt;/a>, a platform that serves as a hub for global music communities and boasts a diverse repertoire of genres, the objective is to predict song popularity using songs released prior to 2019. This entails analyzing song features within Spotify&amp;rsquo;s structure to anticipate their acclaim. Utilizing Spotify&amp;rsquo;s vast database and insights into each song&amp;rsquo;s characteristics, the aim is to discern the elements contributing to a song&amp;rsquo;s success. By exploring factors such as genre, danceability, valence, and tempo, this predictive analysis seeks to unveil prevalent patterns and trends in the music landscape up to 2018. Ultimately, this exploration offers a means to comprehend and forecast the resonance of songs within Spotify&amp;rsquo;s diverse and dynamic ecosystem.&lt;/p>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Range&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>acousticness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>danceability&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>duration_min&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The duration of the track in minutes.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>energy&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>instrumentalness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Predicts whether a track contains no vocals. &amp;ldquo;Ooh&amp;rdquo; and &amp;ldquo;aah&amp;rdquo; sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly &amp;ldquo;vocal&amp;rdquo;. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>key&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>The key the track is in. Integers map to pitches using standard Pitch Class notation. If no key was detected, the value is -1.&lt;/td>
&lt;td>-1 - 11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>liveness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>loudness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.&lt;/td>
&lt;td>-60 - 0 dB&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>mode&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>Mode indicates the modality (major or minor) of a track, represented by 1 for major and 0 for minor.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>speechiness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Speechiness detects the presence of spoken words in a track.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>tempo&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The overall estimated tempo of a track in beats per minute (BPM).&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>time_signature&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>An estimated time signature, ranging from 3 to 7.&lt;/td>
&lt;td>3 - 7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>track_href&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>A link to the Web API endpoint providing full details of the track.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>type&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>The object type, which must be &amp;ldquo;audio_features&amp;rdquo;.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>valence&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>population&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>A 0-to-100 score that ranks how popular an artist is relative to other artists on Spotify.&lt;/td>
&lt;td>0 - 100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>genre&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>The genre of the track&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project objective&lt;/h3>
&lt;p>We will leverage Spotify&amp;rsquo;s features database to categorize songs into popularity groups based on the following ranges:&lt;/p>
&lt;ul>
&lt;li>Popularity 0-20: Least popular&lt;/li>
&lt;li>Popularity 21-40: Less popular&lt;/li>
&lt;li>Popularity 41-60: Medium popular&lt;/li>
&lt;li>Popularity 61-80: More popular&lt;/li>
&lt;li>Popularity 81-100: Most popular&lt;/li>
&lt;/ul>
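&lt;p>The popularity ranges above can be expressed as a small helper (a sketch; the project may have binned scores differently, e.g. with pd.cut):&lt;/p>

```python
def popularity_group(score):
    """Map a 0-100 Spotify popularity score onto the five project bins."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Upper bound of each bin, checked in ascending order
    for upper, label in [(20, "Least popular"), (40, "Less popular"),
                         (60, "Medium popular"), (80, "More popular"),
                         (100, "Most popular")]:
        if score <= upper:
            return label
```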
&lt;h3 id="methodology">Methodology&lt;/h3>
&lt;p>The methodology involves several steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Data Cleaning:&lt;/p>
&lt;ul>
&lt;li>Impute missing track names&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Exploratory Data Analysis (EDA):&lt;/p>
&lt;ul>
&lt;li>Correlations among Spotify track features&lt;/li>
&lt;li>Popularity distribution&lt;/li>
&lt;li>Statistical summary of numerical features&lt;/li>
&lt;li>Distribution of categorical features&lt;/li>
&lt;li>Exploration of top artists and top songs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature Engineering&lt;/p>
&lt;ul>
&lt;li>Pivoted the genre feature to remove duplicate records where the same song was tagged with different genres&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Train-Test Split:&lt;/p>
&lt;ul>
&lt;li>The dataset is divided into training and testing sets to assess model performance accurately.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Model Selection:&lt;/p>
&lt;ul>
&lt;li>Several classification algorithms are utilized for model training, including Clustering, Decision Tree Classification, Random Forest Classification, ExtraTrees Classification, and Neural Networks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Model Evaluation:&lt;/p>
&lt;ul>
&lt;li>Each classification model is trained using the training data and evaluated using testing data.&lt;/li>
&lt;li>Model performance metrics such as training score, testing score, and accuracy are calculated to assess the effectiveness of each model.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
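&lt;p>The train-test split, model selection, and evaluation steps above can be sketched together; default hyperparameters are used here as an assumption:&lt;/p>

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, seed=42):
    """Fit several classifiers and report train/test accuracy for each."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
        "extra_trees": ExtraTreesClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = {"train": model.score(X_tr, y_tr),
                        "test": model.score(X_te, y_te)}
    return scores
```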
&lt;h3 id="plots">Plots&lt;/h3>
&lt;h4 id="features">Features&lt;/h4>
&lt;p>Heatmap: Numeric Feature Correlations&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations to narrow down which variables to use when predicting popularity&lt;/li>
&lt;li>Feature correlation can also help determine collinearity and bias
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heatmap: Numeric Feature Correlations" srcset="
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_be6d0b8ddb8f5c3e4803e5d23feccd47.webp 400w,
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_b7f0efd44d9ff3c3396043e2b90ef6a8.webp 760w,
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_be6d0b8ddb8f5c3e4803e5d23feccd47.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Energy and Acousticness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: As acousticness goes up, energy decreases
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Energy and Acousticness" srcset="
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_bdd39ce958b77cb65d412ca980679906.webp 400w,
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_df1caee8482610bd0c9141ee6b0e718f.webp 760w,
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_bdd39ce958b77cb65d412ca980679906.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Energy and Loudness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: Energy and loudness have a clear positive correlation
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Energy and Loudness" srcset="
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_5152ebca41054d1274fc2093ac797ac9.webp 400w,
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_69573e474978b432306f9f9625af8afa.webp 760w,
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_5152ebca41054d1274fc2093ac797ac9.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Loudness and Acousticness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: As acousticness goes up, loudness decreases, confirming the previous two visualizations
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Loudness and Acousticness" srcset="
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_f1298fa5906a9151983dd0ccdedef635.webp 400w,
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_ce931f6b4780487053d47eaf8bfd262d.webp 760w,
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_f1298fa5906a9151983dd0ccdedef635.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h4 id="genre">Genre&lt;/h4>
&lt;p>Distribution of Songs by Genre
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Songs by Genre" srcset="
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_e3bfa5707df2e68bde1d68000b383e70.webp 400w,
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_fbd4df37160fe9f4f796aaf671a11230.webp 760w,
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_e3bfa5707df2e68bde1d68000b383e70.webp"
width="760"
height="405"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Song Duration by Genre
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Song Duration by Genre" srcset="
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_f4e57f6c14ac78fbeb199b8c5e2b204c.webp 400w,
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_c841a28c9b1a7a81092a2c2fbee3639d.webp 760w,
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_f4e57f6c14ac78fbeb199b8c5e2b204c.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Violin Plot: Energy by Genre&lt;/p>
&lt;ul>
&lt;li>Here you can see that a genre can often be identified by where a song&amp;rsquo;s energy level falls
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Violin Plot: Energy by Genre" srcset="
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_b9efb6749a8a336a1651d5db9b9685e3.webp 400w,
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_ec3b12b4fd9a376b0309eb275b9c4fe3.webp 760w,
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_b9efb6749a8a336a1651d5db9b9685e3.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h4 id="popularity">Popularity&lt;/h4>
&lt;p>Distribution of Songs by Popularity Ranking&lt;/p>
&lt;ul>
&lt;li>Most songs have a mid-range popularity ranking, making it difficult to create a model that can accurately predict popularity&lt;/li>
&lt;li>We see a bimodal trend here because just under 8k songs have no ranking at all, most likely because so many songs on Spotify are never really listened to
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Songs by Popularity Ranking" srcset="
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_941f6a162eea73f3d063ea440e89639a.webp 400w,
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_bb8fe4834f5d9192112ea12ed6765b0c.webp 760w,
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_941f6a162eea73f3d063ea440e89639a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Heatmap: Popularity Feature Correlations&lt;/p>
&lt;ul>
&lt;li>Loudness is the feature that correlates most strongly with popularity
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heatmap: Popularity Feature Correlations" srcset="
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_f82becc72c812ec1da49175e8e3acb7a.webp 400w,
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_9c6f9ad3bf4bf18f55be9cea9b2bb26e.webp 760w,
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_f82becc72c812ec1da49175e8e3acb7a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Popularity by Loudness&lt;/p>
&lt;ul>
&lt;li>The most popular songs seem to land between -10 and -5 decibels
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Popularity by Loudness" srcset="
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_3d4031950a2971e6c83534da857f5e88.webp 400w,
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_50db2eb99a62929c94f60730c45fe342.webp 760w,
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_3d4031950a2971e6c83534da857f5e88.webp"
width="760"
height="570"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Distribution of Key by Popularity&lt;/p>
&lt;ul>
&lt;li>Songs in the key of C tend to be the most popular, followed by G and D (though I suppose any punk music fan could have told you that!)
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Key by Popularity" srcset="
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_638de71ea63bb20c4b6b3bb8bd7b588a.webp 400w,
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_a50c18538865176e64b033f153515ccf.webp 760w,
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_638de71ea63bb20c4b6b3bb8bd7b588a.webp"
width="760"
height="570"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Most Popular Spotify Songs&lt;/p>
&lt;ul>
&lt;li>Here we see a bias in the data: because the same songs fall under multiple genres, the top 10 most popular entries are really only 5 distinct songs
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Most Popular Spotify Songs" srcset="
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_c3fb769c6fe79f7c79c3c73914a8986c.webp 400w,
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_921f206faf7f0c48f170147da9540f82.webp 760w,
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_c3fb769c6fe79f7c79c3c73914a8986c.webp"
width="760"
height="391"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Least Popular Spotify Songs&lt;/p>
&lt;ul>
&lt;li>In the wise words of Rodney Dangerfield, Ludwig van Beethoven gets &amp;ldquo;No respect!&amp;rdquo;
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Least Popular Spotify Songs" srcset="
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_90ba6c81e33f05fee3a6b46d6ffb9609.webp 400w,
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_9a7f152313c8f1d56e1351b4b4c5dfc0.webp 760w,
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_90ba6c81e33f05fee3a6b46d6ffb9609.webp"
width="760"
height="321"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 20 Most Popular Spotify Artists
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 20 Most Popular Spotify Artists" srcset="
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_ff1e57efc3500412bd9b813acc1de37d.webp 400w,
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_f64596c5ee9ec000f00465c976c93bd6.webp 760w,
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_ff1e57efc3500412bd9b813acc1de37d.webp"
width="296"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="conclusion-and-recommendations">Conclusion and Recommendations&lt;/h3>
&lt;h4 id="conclusion">Conclusion&lt;/h4>
&lt;p>Our analysis revealed several key insights regarding the classification of song popularity using machine learning algorithms. Initially, through k-means clustering, we determined the optimal number of clusters, which informed our grouping of popularity into three classes. Subsequent experimentation with classification algorithms highlighted Random Forest as the top performer, achieving impressive training and testing scores. However, we identified a challenge with imbalanced popularity classes, particularly within the range of 34 to 66. To address this, we refined our approach by dividing popularity into five classes and employing a Dense Neural Network (DNN), resulting in improved accuracy rates. Overall, these findings underscore the importance of careful feature selection and addressing data imbalances to enhance the effectiveness of predictive models in classifying song popularity.&lt;/p>
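&lt;p>The popularity-binning step described above could be sketched with pandas. The cut points below are illustrative, not the project&amp;rsquo;s exact thresholds:&lt;/p>

```python
# Group a 0-100 popularity score into three, then five, classes.
import numpy as np
import pandas as pd

popularity = pd.Series(np.random.default_rng(1).integers(0, 101, size=1000))

# Three-class grouping (rough low/mid/high split)
three = pd.cut(popularity, bins=[-1, 33, 66, 100], labels=["low", "mid", "high"])

# Five-class refinement, as used with the Dense Neural Network
five = pd.cut(popularity, bins=[-1, 20, 40, 60, 80, 100],
              labels=["0-20", "21-40", "41-60", "61-80", "81-100"])

print(three.value_counts().sort_index())
print(five.value_counts().sort_index())
```

&lt;p>Inspecting the per-class counts this way also exposes the class imbalance noted in the recommendations below.&lt;/p>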
&lt;h4 id="recommendations">Recommendations&lt;/h4>
&lt;ul>
&lt;li>The dataset displays an imbalance, notably in the most popular group (Popularity: 81 to 100). Future studies should prioritize acquiring more up-to-date and balanced data to ensure robust analysis and accurate predictions.&lt;/li>
&lt;li>Given the limitations in data extraction due to Spotify&amp;rsquo;s updated user safety policy, we had restricted access to the latest dataset. For improved insights, future studies should collaborate directly with Spotify to access real-time data sources, enabling access to up-to-date information for analysis and decision-making.&lt;/li>
&lt;/ul></description></item><item><title>NLP &amp; Sentiment Analysis</title><link>https://datadillon.netlify.app/project/reddit/</link><pubDate>Fri, 15 Mar 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/reddit/</guid><description>&lt;h2 id="nlp--sentiment-analysis">NLP &amp;amp; Sentiment Analysis&lt;/h2>
&lt;p>by Dillon Diatlo&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>The objective of this project is to utilize Vectorization and Sentiment Analysis techniques to build a classification model that can accurately categorize which subreddit––r/running or r/Swimming––a user&amp;rsquo;s post belongs to. In doing so, I will:&lt;/p>
&lt;ul>
&lt;li>Find out which subreddit is less happy&lt;/li>
&lt;li>Use Natural Language Processing to build classification models that can accurately categorize Reddit posts&lt;/li>
&lt;/ul>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Dataset&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>datetime&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The date and time the reddit user posted in the format, Y-MM-DD, H:M:S&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>all_text&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The combined text of every post&amp;rsquo;s title and user&amp;rsquo;s post description&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>subreddit&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>Binary categorization of subreddits r/running [0] and r/Swimming [1]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>sentiment&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>A number ranked between -1 and 1 suggesting the overall sentiment of the all_text copy&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>post_word_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of words in each individual all_text copy&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>title_word_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of words in each individual post title&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>upper_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of upper case letters found in individual all_text posts&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>lower_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of lower case letters found in individual all_text posts&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project Objective&lt;/h3>
&lt;p>&lt;strong>Problem Statement&lt;/strong>:
The objective of this project is to build a model that accurately categorizes Reddit posts into two separate subreddits, r/running and r/Swimming, based on word frequency and classification techniques.&lt;/p>
&lt;p>&lt;strong>Goals&lt;/strong>:
The goals of this project are to analyze the subreddits r/running and r/Swimming in order to learn, based on sentiment analysis, which subreddit users are less happy, and then to build a classification model that can use post words to accurately predict which subreddit a post belongs to.&lt;/p>
&lt;h3 id="project-methodology">Project Methodology:&lt;/h3>
&lt;p>&lt;strong>Data Collection&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>PRAW Reddit API&lt;/li>
&lt;li>r/running&lt;/li>
&lt;li>r/Swimming&lt;/li>
&lt;li>full.csv&lt;/li>
&lt;/ul>
&lt;p>full.csv is an accumulation of the following CSVs:&lt;/p>
&lt;ul>
&lt;li>run feb 26.csv&lt;/li>
&lt;li>run feb 27.csv&lt;/li>
&lt;li>swim feb 26.csv&lt;/li>
&lt;li>swim feb 27.csv&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Data Cleaning&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Joined all csvs into a single csv&lt;/li>
&lt;li>Replaced NaN&amp;rsquo;s with empty strings and then combined title column and selftext column into a single column: all_text&lt;/li>
&lt;li>Turned created_utc column into processable dates and times&lt;/li>
&lt;li>Dropped weekly automatic r/Swimming rows&lt;/li>
&lt;li>Dropped extra columns and created columns to count uppercase, lowercase, and post/title words&lt;/li>
&lt;li>CountVectorized the all_text column for manipulation&lt;/li>
&lt;/ul>
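&lt;p>The cleaning steps listed above might be sketched as follows. The column names follow PRAW&amp;rsquo;s post fields, but the sample rows and timestamps are invented:&lt;/p>

```python
# Join per-day CSV frames, fill NaNs, build all_text, and parse dates.
import pandas as pd

run = pd.DataFrame({"title": ["First 5k!"], "selftext": [None],
                    "created_utc": [1708905600], "subreddit": [0]})
swim = pd.DataFrame({"title": ["Tech suit advice?"], "selftext": ["Any tips?"],
                     "created_utc": [1708992000], "subreddit": [1]})

full = pd.concat([run, swim], ignore_index=True)           # join the per-day CSVs
full[["title", "selftext"]] = full[["title", "selftext"]].fillna("")
full["all_text"] = full["title"] + " " + full["selftext"]  # combined text column
full["datetime"] = pd.to_datetime(full["created_utc"], unit="s")

print(full[["datetime", "all_text", "subreddit"]])
```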
&lt;p>&lt;strong>Exploratory Data Analysis (EDA)&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Created a lemmatizer function and sentiment analysis function&lt;/li>
&lt;li>Found the top 10 and 25 most frequently used words, bigrams, and trigrams for both subreddits&lt;/li>
&lt;li>Explored the correlation between datetime and subreddit sentiment&lt;/li>
&lt;li>Explored the correlation between post/title word count and subreddit sentiment&lt;/li>
&lt;li>Explored the distribution of sentiment for posts within both subreddits&lt;/li>
&lt;li>Found the mean positivity of both subreddits&lt;/li>
&lt;/ul>
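&lt;p>The sentiment function maps each post to a score between -1 and 1. The following is a hand-rolled, purely illustrative stand-in for a lexicon-based scorer such as VADER; the tiny word lists are invented:&lt;/p>

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists only).
POSITIVE = {"great", "happy", "love", "win", "goal", "finished"}
NEGATIVE = {"injury", "pain", "sad", "quit", "worst"}

def sentiment(text: str) -> float:
    """Return a score in [-1, 1]: (pos - neg) over matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

print(sentiment("finished my first marathon so happy"))  # → 1.0
print(sentiment("knee pain made me quit"))               # → -1.0
```

&lt;p>A production scorer weighs intensity, negation, and punctuation as well, but the interface is the same: text in, a score in [-1, 1] out.&lt;/p>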
&lt;p>&lt;strong>Modeling Techniques&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Found my baseline accuracy of .45&lt;/li>
&lt;li>A CountVectorizer paired with a Multinomial Naive Bayes model seemed to perform poorly&lt;/li>
&lt;li>RandomForest and Logistic Regression performed the best overall&lt;/li>
&lt;li>When paired with a TfidfVectorizer and a GridSearchCV, both Logistic Regression and RandomForest performed noticeably better&lt;/li>
&lt;li>After adding in a lemmatizer and seeing only a slight change, I decided to settle on my nearly 97% accurate model&lt;/li>
&lt;/ul>
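&lt;p>The best-performing setup described above (a TfidfVectorizer feeding Logistic Regression, tuned with GridSearchCV) can be sketched as a scikit-learn pipeline. The toy corpus and the small parameter grid are placeholders, not the project&amp;rsquo;s real data or search space:&lt;/p>

```python
# TfidfVectorizer -> LogisticRegression inside GridSearchCV.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

posts = ["long run marathon training today", "race day new shoes pace",
         "lap pool freestyle technique", "goggles tech suit swim meet"] * 5
labels = [0, 0, 1, 1] * 5  # 0 = r/running, 1 = r/Swimming

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("logreg", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"tfidf__ngram_range": [(1, 1), (1, 2)],
                           "logreg__C": [0.1, 1.0]}, cv=2)
grid.fit(posts, labels)

print(grid.best_params_)
print(grid.predict(["marathon training today"]))  # → [0]
```

&lt;p>The pipeline keeps the vectorizer inside the cross-validation loop, so TF-IDF weights are refit on each training fold rather than leaking test-fold vocabulary.&lt;/p>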
&lt;h3 id="key-findings">Key Findings:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Based on the analysis conducted, putting lemmatized data through a TfidfVectorizer with my LogisticRegression model and fitting that with a GridSearchCV built the most accurate classification model at 96.8%. Looking at the confusion matrix, this performed only marginally better than non-lemmatized data run through a TfidfVectorizer and RandomForest, fit with a GridSearchCV.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>r/running users seem to be generally happier, with a mean positive sentiment score of .55 compared to r/Swimming&amp;rsquo;s .31. This could mean swimmers are less happy, but it could also mean swimmers are less concerned with accomplishments like &amp;ldquo;finish lines&amp;rdquo; and &amp;ldquo;goals&amp;rdquo; and more concerned with technique.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>There is a correlation between year and posting sentiment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>All models outperformed my baseline of 45% accuracy.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="plots">Plots&lt;/h3>
&lt;p>Top 10 Words Being Used in r/running
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Words Being Used in r/running" srcset="
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_293adbed78f8ca55b20710907ccb5121.webp 400w,
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_821aa7bbffd6022fc14d50fb40e1a604.webp 760w,
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_293adbed78f8ca55b20710907ccb5121.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Top 10 Bigrams Being Used in r/running&lt;/p>
&lt;ul>
&lt;li>Bigrams can give us more insight into context and what type of things users in r/running are talking about&lt;/li>
&lt;li>You can see here &amp;ldquo;half marathon&amp;rdquo;, &amp;ldquo;finish line&amp;rdquo;, and &amp;ldquo;feel like&amp;rdquo; are all emotionally driven bigrams, which could mean runners are more focused on accomplishment
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/running" srcset="
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_b5b8ea5c90aea4301906bdd52f2a4da7.webp 400w,
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_c9f7f80b93c11dc457a4ad6e28664410.webp 760w,
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_b5b8ea5c90aea4301906bdd52f2a4da7.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Trigrams Being Used in r/running&lt;/p>
&lt;ul>
&lt;li>Trigrams can give us &lt;strong>even more&lt;/strong> insight into context and what type of things users in r/running are talking about&lt;/li>
&lt;li>You can see here r/runner users are very focused on goals and accomplishments
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/running" srcset="
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_9afbd7e77901e0f47733b447af285f56.webp 400w,
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_06cb1a7279252e59e94a6288309ee1e9.webp 760w,
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_9afbd7e77901e0f47733b447af285f56.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Words Being Used in r/Swimming
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/Swimming" srcset="
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_0533a2dbb894b8b5e3843e27cd3efe6a.webp 400w,
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_366699189eb3c94268436ee5ae9041b6.webp 760w,
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_0533a2dbb894b8b5e3843e27cd3efe6a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Top 10 Bigrams Being Used in r/Swimming&lt;/p>
&lt;ul>
&lt;li>You can see here bigrams like &amp;ldquo;first time&amp;rdquo;, &amp;ldquo;tech suit&amp;rdquo;, and &amp;ldquo;any advice&amp;rdquo;, which suggest r/Swimming users may be focused more on technique than on accomplishment, unlike r/running users
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/Swimming " srcset="
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_04ed2aede5713c2c8366d951c867dd33.webp 400w,
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_44a616d578dfafcb5aa7aa8e460f3dab.webp 760w,
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_04ed2aede5713c2c8366d951c867dd33.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Trigrams Being Used in r/Swimming&lt;/p>
&lt;ul>
&lt;li>You can see here trigrams like &amp;ldquo;does anyone know&amp;rdquo;, &amp;ldquo;would greatly appreciated&amp;rdquo;, and &amp;ldquo;wa wondering anyone&amp;rdquo;, which suggest curiosity-driven, advice-seeking users in r/Swimming&lt;/li>
&lt;li>You may also notice &amp;ldquo;drive usp sharing&amp;rdquo;, &amp;ldquo;format pjpg webp&amp;rdquo;, and other odd, not-really-English trigrams. These appear because the post content contained many links, which can be further pre-processed and filtered out given more time
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Trigrams Being Used in r/Swimming " srcset="
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_aac47f1d93b696d42f7924fdbe3f51c0.webp 400w,
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_6f702748dd2020902a0558efde501afe.webp 760w,
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_aac47f1d93b696d42f7924fdbe3f51c0.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Average Positive Sentiment&lt;/p>
&lt;ul>
&lt;li>Sentiment analysis of user posts in r/running and r/Swimming suggested that, on average, r/running user posts are .24 more positive than r/Swimming user posts&lt;/li>
&lt;li>This does not necessarily mean r/running users are happier; as we previously saw, it could simply be an indicator of what the two groups are posting about
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Average Positive Sentiment" srcset="
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_60e0e1739d2379d27b3daea4eff40b3e.webp 400w,
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_894c3ead41d9d78bd4d345ff8c48b6ea.webp 760w,
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_60e0e1739d2379d27b3daea4eff40b3e.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/Swimming: Distribution of Sentiment&lt;/p>
&lt;ul>
&lt;li>Distribution here is not normal; rather, it&amp;rsquo;s left-skewed with a bimodal curve&lt;/li>
&lt;li>The sentiment of most r/Swimming posts falls between ~ -.2 and 1, with high accumulations of post sentiment in those same points, suggesting user posts generally fall between neutral or positive
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/Swimming: Distribution of Sentiment" srcset="
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_7d02f92cc791cc27772753d37433da6f.webp 400w,
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_a32faef6a9637aaeaa5a42a29da25e36.webp 760w,
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_7d02f92cc791cc27772753d37433da6f.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/running: Distribution of Sentiment&lt;/p>
&lt;ul>
&lt;li>Distribution here is not normal; rather, it&amp;rsquo;s left-skewed&lt;/li>
&lt;li>Analysis of the distribution suggests most r/running user posts are positive, with the majority falling between ~.6 and 1.00
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/running: Distribution of Sentiment" srcset="
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_c503388aa979cd25834f1d744c8146a2.webp 400w,
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_743168d2e7cacbaa1de071ca4f26c1ef.webp 760w,
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_c503388aa979cd25834f1d744c8146a2.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/running: Year vs. Sentiment&lt;/p>
&lt;ul>
&lt;li>Here you can see a high accumulation of positive posts forming between 2020 and 2022; this correlates with the Covid-19 pandemic, a time when many people began running outside since gyms and facilities were closed
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/running: Year vs. Sentiment" srcset="
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_3643ad6e5096f71de084f7e1532d1704.webp 400w,
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_804b70f40022cf4e41134a0c054d848a.webp 760w,
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_3643ad6e5096f71de084f7e1532d1704.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/Swimming: Year vs. Sentiment&lt;/p>
&lt;ul>
&lt;li>Here you can see the same bimodal split with a heavy accumulation of neutral-sentiment posts and positive posts
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/Swimming: Year vs. Sentiment" srcset="
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_0de4af5c87e2648f490765b8903dd782.webp 400w,
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_dddbb0bf3c5afa983c7ce1a47c54391e.webp 760w,
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_0de4af5c87e2648f490765b8903dd782.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h3 id="implications-and-conclusion">Implications and Conclusion:&lt;/h3>
&lt;p>A TfidfVectorizer paired with LogisticRegression and GridSearchCV seems to work better than anything with a CountVectorizer, since TfidfVectorizers weigh both term frequency and rarity across the corpus; paired with the best hyperparameters, this results in a solid prediction. A CountVectorizer, by contrast, captures raw frequency only and may therefore give weight to many less important words.&lt;/p>
&lt;p>In terms of sentiment analysis, runners skew more positive. Based on the &amp;lsquo;Top 10 r/running Trigrams&amp;rsquo;, they appear more accomplishment-focused, concentrating on wins, goals, and races, all emotionally driven terms. At the same time, running has a lower barrier to entry than swimming: it&amp;rsquo;s something most of us can do intuitively, so it&amp;rsquo;s easier to start, practice, improve, and try races. Runners may not be happier, but they are more emotionally focused.&lt;/p>
&lt;p>Comparatively, posts in r/Swimming are more neutral, as seen in &amp;lsquo;Year vs Sentiment r/Swimming&amp;rsquo; and in the multiple spikes in &amp;lsquo;Sentiment Distribution r/Swimming&amp;rsquo;. Additionally, r/Swimming users seem less accomplishment-focused and more improvement-focused, as shown in &amp;lsquo;Top 10 r/Swimming Bigrams&amp;rsquo; with phrases like &amp;ldquo;any advice&amp;rdquo;, &amp;ldquo;first time&amp;rdquo;, &amp;ldquo;tech suit&amp;rdquo;, and &amp;ldquo;started swimming&amp;rdquo;.&lt;/p>
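&lt;p>The write-up doesn&amp;rsquo;t name the sentiment tool it used, but the negative/neutral/positive binning discussed above can be illustrated with a toy lexicon-based scorer. The word lists and posts below are invented for illustration; a real analysis would use a full lexicon or a trained model.&lt;/p>

```python
# Toy lexicon-based sentiment scorer: counts positive vs. negative lexicon hits
# and bins each post. Purely illustrative; word lists are invented.
POSITIVE = {"great", "goal", "win", "love", "happy", "accomplished"}
NEGATIVE = {"injury", "pain", "slow", "tired", "struggling"}

def sentiment(post: str) -> str:
    """Score a post by lexicon hits; ties and zero hits are neutral."""
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

posts = [
    "hit my goal pace today and felt great",
    "any advice on a new tech suit",
    "struggling with shoulder pain lately",
]
print([sentiment(p) for p in posts])  # ['positive', 'neutral', 'negative']
```

&lt;p>A corpus where most posts score near zero under a scheme like this produces exactly the heavy neutral accumulation seen in the r/Swimming distribution.&lt;/p>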
&lt;h3 id="next-steps">Next Steps&lt;/h3>
&lt;p>Next steps are to narrow down our variables and build a model with lower error and an even better fit:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;ins>DATETIME&lt;/ins> - Models were based on post word frequency; I wonder whether also incorporating post datetime would yield a better model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>STEMMING&lt;/ins> - Data was run through a lemmatizer, though not a stemmer. We should go back and see if stemming affects our models at all.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>FILTER&lt;/ins> - Swimming&amp;rsquo;s trigrams include many long, uninterpretable strings. Go back and add a stop-word filter that removes words longer than a certain length.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>PINPOINT USERS&lt;/ins> - If we are advertising happiness to sad users, we&amp;rsquo;ll need more than just the subreddit they browse. Let&amp;rsquo;s collect their usernames and target them directly. We can also scrape their data to see where else they post and begin to build classes within our swimmer target audience.&lt;/p>
&lt;/li>
&lt;/ol></description></item></channel></rss>