<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Projects | Data Dillon</title><link>https://datadillon.netlify.app/project/</link><atom:link href="https://datadillon.netlify.app/project/index.xml" rel="self" type="application/rss+xml"/><description>Projects</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><image><url>https://datadillon.netlify.app/media/icon_hu084d2819d5787271ab93a3b21afd0b3d_30671_512x512_fill_lanczos_center_3.png</url><title>Projects</title><link>https://datadillon.netlify.app/project/</link></image><item><title>Fare Forecaster | Save Time, Make Money</title><link>https://datadillon.netlify.app/project/taxi/</link><pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/taxi/</guid><description>&lt;h1 id="fare-forecaster-save-time-make-money">Fare Forecaster: Save Time, Make Money&lt;/h1>
&lt;p>By Dillon Diatlo&lt;/p>
&lt;h2 id="the-app">The App&lt;/h2>
&lt;p>&lt;a href="https://fareforecaster.streamlit.app/" target="_blank" rel="noopener">https://fareforecaster.streamlit.app/&lt;/a>&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>In the world of NYC for-hire vehicle drivers, high-fare trips are few and far between. But maybe they don&amp;rsquo;t have to be. The objective of this experiment is to sample 212 million rows of data from 2022 NYC cab trips to predict––by borough and time––which taxi zones will have, on average, the highest-paying trips. By exploring features such as request time, weather, trip duration, date, and more, the hope is to expose patterns contributing to high-fare rides and enable NYC drivers to optimize their time and make the most money possible.&lt;/p>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Dataset&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>req_index&lt;/strong>&lt;/td>
&lt;td>&lt;em>datetime&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The datetime index representing the day each trip was requested&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>trip_miles&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The distance of every for-hire-vehicle trip&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>tips&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The money each passenger tipped the driver&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>congestion_surcharge&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>An additional surcharge charged to passengers for trips taken during busy times of day&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>temp&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The temperature in NYC at the time of the trip&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>preciptype&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>What the weather precipitation was like, represented as follows:&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&amp;lsquo;missing&amp;rsquo;: 0, &amp;lsquo;rain&amp;rsquo;: 1, &amp;lsquo;snow&amp;rsquo;: 2, &amp;lsquo;rain,snow&amp;rsquo;: 3, &amp;lsquo;rain,freezingrain,snow&amp;rsquo;: 4, &amp;lsquo;rain,freezingrain,snow,ice&amp;rsquo;: 5, &amp;lsquo;rain,freezingrain&amp;rsquo;: 6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>zone&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>Which NYC taxi zone the trip began in (there are 265 unique zones)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>borough_name&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>Which NYC borough the trip began in&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>trip_duration&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The length of time a trip took from pickup to dropoff&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>month&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The month a trip took place in&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>day_of_month&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>A number between 1-31 representing which day of the month the trip took place&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>driver_made&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The total amount the driver made after the employer&amp;rsquo;s charge, tip included&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>day_of_week&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The day of the week the trip took place&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>hour&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The hour in which the trip was requested and began&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>minute&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>041324_taxi_recs&lt;/td>
&lt;td>The minute of the hour at which the trip was requested&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="original-data-source">Original Data Source&lt;/h2>
&lt;p>&lt;a href="https://data.cityofnewyork.us/Transportation/2022-High-Volume-FHV-Trip-Records/g6pj-fsah/about_data" target="_blank" rel="noopener">https://data.cityofnewyork.us/Transportation/2022-High-Volume-FHV-Trip-Records/g6pj-fsah/about_data&lt;/a>&lt;/p>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project objective&lt;/h3>
&lt;p>&lt;strong>Problem Statement:&lt;/strong>
The objective of this experiment is to use 212 million rows of data from 2022 NYC cab trips to predict––by borough and time––which taxi zones will have, on average, the highest-paying trips.&lt;/p>
&lt;p>&lt;strong>Goals:&lt;/strong>
The goal of this project is to build a working tool that for-hire vehicle drivers in NYC can use to optimize their income and time.&lt;/p>
&lt;h3 id="methodology">Methodology&lt;/h3>
&lt;p>The methodology involves several steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;em>Data Downsizing&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Downloaded 212,000,000 rows of data in half-month sizes&lt;/li>
&lt;li>Used the Pandas &amp;ldquo;chunksize&amp;rdquo; parameter to downsize each half-month dataframe during import&lt;/li>
&lt;li>Concatenated the downsized half-month dataframes into 12 downsized full-month dataframes&lt;/li>
&lt;/ul>
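&lt;p>The downsizing steps above can be sketched in pandas as follows; the filename, sampling fraction, and chunk size here are illustrative assumptions, not the exact values used:&lt;/p>

```python
import pandas as pd

def downsize_csv(path, frac=0.02, chunksize=1_000_000, seed=42):
    """Read a large half-month trip file in chunks, keeping a random
    sample of each chunk so the full file never sits in memory at once."""
    sampled_chunks = [
        chunk.sample(frac=frac, random_state=seed)
        for chunk in pd.read_csv(path, chunksize=chunksize)
    ]
    # Concatenate the downsized chunks into one half-month dataframe
    return pd.concat(sampled_chunks, ignore_index=True)
```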
&lt;/li>
&lt;li>
&lt;p>&lt;em>Data Cleaning&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Combined all 12 monthly dataframes into a single dataframe&lt;/li>
&lt;li>Replaced and dropped NaNs&lt;/li>
&lt;li>Transformed object-type data into datetime&lt;/li>
&lt;li>Set the index as dates for time series analysis&lt;/li>
&lt;li>Dropped unnecessary columns&lt;/li>
&lt;/ul>
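&lt;p>A minimal sketch of the cleaning steps above, assuming a TLC-style request timestamp column named request_datetime (the column name is a hypothetical stand-in):&lt;/p>

```python
import pandas as pd

def clean_trips(monthly_frames):
    """Combine monthly dataframes, drop rows missing a request time,
    parse the request time, and index by it for time series analysis."""
    df = pd.concat(monthly_frames, ignore_index=True)
    df = df.dropna(subset=["request_datetime"])
    df["request_datetime"] = pd.to_datetime(df["request_datetime"])
    return df.set_index("request_datetime").sort_index()
```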
&lt;/li>
&lt;li>
&lt;p>&lt;em>Exploratory Data Analysis (EDA):&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Reset the df index for time series analysis&lt;/li>
&lt;li>Created heatmaps to uncover any unexpected correlations&lt;/li>
&lt;li>Explored data by day, week, and month to spot trends&lt;/li>
&lt;li>Analyzed data by NYC Borough through bar graphs and box plots&lt;/li>
&lt;li>Investigated distribution of driver pay and more via histograms&lt;/li>
&lt;li>Looked for correlations via scatter plots&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Feature Engineering&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Created a &amp;ldquo;driver_made&amp;rdquo; column by combining &amp;lsquo;tips&amp;rsquo; with &amp;lsquo;driver_pay&amp;rsquo;&lt;/li>
&lt;li>Extracted hour, minute, and date columns to explore distributions and trends&lt;/li>
&lt;li>Used ordinal mapping on borough names and precipitation&lt;/li>
&lt;li>Built a &amp;lsquo;trip_duration&amp;rsquo; column by subtracting pickup_time from dropoff_time&lt;/li>
&lt;li>OneHotEncoded zones in a preprocessing pipeline&lt;/li>
&lt;li>Dropped missing values and encoded others&lt;/li>
&lt;li>Scaled features to more accurately find patterns when modeling&lt;/li>
&lt;/ul>
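&lt;p>The engineered columns above can be sketched like so; the driver_pay, pickup_datetime, and dropoff_datetime column names are assumed stand-ins, while the precipitation codes mirror the data dictionary:&lt;/p>

```python
import pandas as pd

# Ordinal mapping for preciptype, taken from the data dictionary
PRECIP_MAP = {
    "missing": 0, "rain": 1, "snow": 2, "rain,snow": 3,
    "rain,freezingrain,snow": 4, "rain,freezingrain,snow,ice": 5,
    "rain,freezingrain": 6,
}

def engineer_features(df):
    """Add driver_made, trip_duration (minutes), and ordinal preciptype."""
    df = df.copy()
    df["driver_made"] = df["driver_pay"] + df["tips"]
    df["trip_duration"] = (
        df["dropoff_datetime"] - df["pickup_datetime"]
    ).dt.total_seconds() / 60
    df["preciptype"] = df["preciptype"].fillna("missing").map(PRECIP_MAP)
    return df
```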
&lt;/li>
&lt;li>
&lt;p>&lt;em>Sampling&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Further sampled data with .sample() method to speed model exploration process&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Train-Test-Split&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Divided data into training and testing sets to assess model performance and accuracy&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Model Selection&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Built and tested six different model types&lt;/li>
&lt;li>Tuned the parameters of those models across many pipelines to improve accuracy scores&lt;/li>
&lt;li>Models tested:
&lt;ul>
&lt;li>GradientBoost (winner)&lt;/li>
&lt;li>RandomForest&lt;/li>
&lt;li>LassoCV&lt;/li>
&lt;li>RidgeRegression&lt;/li>
&lt;li>LinearRegression&lt;/li>
&lt;li>XGBoost&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
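&lt;p>A sketch of the winning setup described above: a pipeline that OneHotEncodes zones, scales numeric features, and feeds a gradient boosting estimator (scikit-learn&amp;rsquo;s GradientBoostingRegressor; the hyperparameters here are assumptions):&lt;/p>

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_fare_pipeline(categorical_cols, numeric_cols):
    """One-hot encode categorical columns, scale numeric ones, then boost."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", GradientBoostingRegressor(random_state=42)),
    ])
```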
&lt;/li>
&lt;li>
&lt;p>&lt;em>Model Evaluation&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Fit models with training data and ran models with testing data&lt;/li>
&lt;li>Analyzed model performance via metrics including R^2 score, MSE, and RMSE&lt;/li>
&lt;li>Pickled the model with the best evaluation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Streamlit Tools&lt;/em>&lt;/p>
&lt;ul>
&lt;li>Constructed a Streamlit web application for real-time use of the model&lt;/li>
&lt;li>Illustrated and themed the application&lt;/li>
&lt;li>Deployed the application&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="key-findings--plots">Key Findings &amp;amp; Plots&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The following features, via a GradientBoostRegressor, supported the most accurate predictive model, with an R^2 score of 0.87:&lt;/p>
&lt;ul>
&lt;li>trip_miles, congestion_surcharge, temp, preciptype, zone, borough_name, trip_duration, month, day_of_month, day_of_week, hour, minute&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Driver pay per trip tends to increase during the first 5 months of the year, before plateauing at an average of about $21
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip" srcset="
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1135bd31adf2578e09fef6ce49f78d94.webp 400w,
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_0909b1f6f7279cfa5b0c521795bd8e14.webp 760w,
/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_pay_per_trip_huc589bbb35dcca0daa83b2607b44099db_145210_1135bd31adf2578e09fef6ce49f78d94.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The borough with the highest driver-pay per trip is Queens (by a slight margin), then Manhattan, Brooklyn, Staten Island, and the Bronx
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip" srcset="
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_fcdc06f98915440a61349c8eaa92241c.webp 400w,
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_8f19a42fa65f7fa91a942e6f5b9aed4b.webp 760w,
/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_trip_pay_by_burough_hu7266580b56d3cfc31a511924d55f2d58_27742_fcdc06f98915440a61349c8eaa92241c.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>This is most likely because JFK International Airport is in Queens and on average has the highest driver-pay per trip by month&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Average driver pay by request hour is bimodal, with the highest average pay around 7am and 4pm and a slight spike around 9pm
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip by Hour" srcset="
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_2b0bce1e047c6c9058e3dc022b284712.webp 400w,
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_baeec4e591ff864cb7819a75d0d7c481.webp 760w,
/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_pay_trip_hour_hubfae2250e9491859c0375697959a422d_103601_2b0bce1e047c6c9058e3dc022b284712.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Average tip size peaks every 5 minutes, beginning at the bottom of the hour, with trips requested at minute 10 yielding the largest average tip per trip
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Avg. Driver Makes per Trip by Minute" srcset="
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_7a9c7edd616d3398d75a812b830f942a.webp 400w,
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_9b5661c9c33936ea92f4c35b1273a17a.webp 760w,
/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/avg_tip_by_min_hub7976860877786c12df1203fdbe47afe_27744_7a9c7edd616d3398d75a812b830f942a.webp"
width="760"
height="633"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Most trips are under 25 minutes long, with the greatest number of trips falling in the 8-10 minute range
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Trip Length" srcset="
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_fd734596090bc2b76241f3d2fe480965.webp 400w,
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_3e4044813520105c288881e202bdb400.webp 760w,
/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_trip_duration_hu5572333f2137d0f0ab6044912bfff14d_33469_fd734596090bc2b76241f3d2fe480965.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ride requests hit a low around the 30 minute mark of the hour and peak around the 45 minute mark
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Ride Requests by Minute" srcset="
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_a9b964c68e45b153cbd67bd605c36c8a.webp 400w,
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_db0a23711f29ddbd431a265447ef54b6.webp 760w,
/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_rides_by_minute_of_hour_hud9488859dbc87f91196eefc65931d48b_35894_a9b964c68e45b153cbd67bd605c36c8a.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Drivers make between $5-$15 for the majority of their trips
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Pay by Trip" srcset="
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_9124020d9b8ae17de014652d97dfe232.webp 400w,
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_d54bf0180c2a3a8fb2446d7edf28f793.webp 760w,
/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/taxi/dist_driver_make_per_ride_hu6251595d2da3749dd16ff11709a11c37_43304_9124020d9b8ae17de014652d97dfe232.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion-and-recommendations">Conclusion and Recommendations&lt;/h2>
&lt;h4 id="conclusion">Conclusion&lt;/h4>
&lt;p>The analysis of this 212M-row dataset revealed a number of key insights regarding for-hire vehicle metrics in NYC.&lt;/p>
&lt;p>First and foremost, I hit an immediate bump in the road, finding it necessary to sample the data down from 212M rows to about 4M, which, while sufficient, is only about 2% of the original data. Left unsampled, the dataset proved difficult and time-consuming to work with––underscoring the importance of tools like Spark, TensorFlow, Databricks, and BigQuery.&lt;/p>
&lt;p>By manipulating the data with Python, Pandas, Matplotlib, and more, I was able to determine, by borough, which taxi zone would produce the highest average revenue per trip. The model that performed best was a pipeline of StandardScaler and OneHotEncoder transformers paired with a GradientBoostRegressor estimator. This model achieved an R^2 score of 0.87 on the testing set when predicting the revenue a driver might make per trip in a given taxi zone––down to a given minute.&lt;/p>
&lt;p>Pickling this model and pairing it with Streamlit tools proved an efficient way to build an audience-facing application that, given the borough, weather, and more, can return the taxi zone in that borough where a driver can make the most money.&lt;/p>
&lt;p>Overall, the findings reinforce the importance of location and time in determining a driver&amp;rsquo;s average revenue per trip, with less heavily weighted, though still important, features like congestion and temperature.&lt;/p>
&lt;h4 id="recommendations--next-steps">Recommendations &amp;amp; Next Steps&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>&lt;ins>TOOLS&lt;/ins> - Because of the sheer size of the data available, future experiments should utilize tools built for large-scale datasets, like Spark or BigQuery, to ensure analysis of all data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>TIMESPAN&lt;/ins> - The for-hire vehicle data was limited to a single year (2022). Improved metrics, as well as deeper and more confident insights, can be found through multiple years of data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>GEO-GRANULARITY&lt;/ins> - While currently specified by one of 265 NYC taxi zones, more granular geolocation data could yield more specific recommendations within taxi zones, possibly down to requests by block.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>BIASES&lt;/ins> - Though initially unbiased, the data was left a bit skewed towards certain days and months after downsizing and sampling. In the future, one could try different approaches to parameter tuning, sampling, and downsizing to get the least biased results.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>DEMAND&lt;/ins> - Currently, this model does not take demand into consideration. To optimize in real time, one could incorporate demand data by taxi zone and time.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;p>Additional EDA Plots
&lt;div class="gallery-grid">
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_fare_week.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_fare_week_hu64af1b25011d21fb047aefdde3d7a796_79582_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_fare_week.png" width="750" height="536">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_temp.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_temp_hue45c29d09a09155699d169d4275fff04_27693_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_pay_per_temp.png" width="750" height="600">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_trip_day_of_month.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/avg_pay_per_trip_day_of_month_huce24d853ccad2b6789b0cb00ad48f641_34082_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="avg_pay_per_trip_day_of_month.png" width="750" height="600">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/box_plot_avg_pay_borough_no_outliers.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/box_plot_avg_pay_borough_no_outliers_hu84330c4a16bcd6d61247c2f877c366f5_26428_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="box_plot_avg_pay_borough_no_outliers.png" width="640" height="480">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/busiest_borough_when_cold.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/busiest_borough_when_cold_huf54de82af5ff44c98bdde430b91a564a_39755_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="busiest_borough_when_cold.png" width="750" height="500">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/corr_tips_tripduration.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/corr_tips_tripduration_huc60caada8309bf8c871ae69a9826334c_442552_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="corr_tips_tripduration.png" width="750" height="500">
&lt;/a>
&lt;/div>
&lt;div class="gallery-item gallery-item--medium">
&lt;a data-fancybox="gallery-taxi_plots" href="https://datadillon.netlify.app/media/albums/taxi_plots/sheet-1.png" >
&lt;img src="https://datadillon.netlify.app/media/albums/taxi_plots/sheet-1_hua3637c06c3e49e0ac3c73aab90788191_510237_750x750_fit_q75_h2_lanczos_3.webp" loading="lazy" alt="sheet-1.png" width="750" height="372">
&lt;/a>
&lt;/div>
&lt;/div>
&lt;/p>
&lt;hr></description></item><item><title>Predicting Successful Songs</title><link>https://datadillon.netlify.app/project/spotify/</link><pubDate>Mon, 08 Apr 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/spotify/</guid><description>&lt;h1 id="popularity-prediction-using-spotify-features">Popularity Prediction using Spotify Features&lt;/h1>
&lt;p>Authors: Dillon Diatlo, Elaine Chen, Harnish Shah, Rajashree Choudhary&lt;/p>
&lt;h2 id="role">Role&lt;/h2>
&lt;ul>
&lt;li>Data Cleaning&lt;/li>
&lt;li>Data Visualization&lt;/li>
&lt;li>Data Modeling (build, train, validate)&lt;/li>
&lt;li>Presentation&lt;/li>
&lt;/ul>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>In the expansive domain of &lt;a href="https://www.spotify.com/" target="_blank" rel="noopener">Spotify&lt;/a>, a platform that serves as a hub for global music communities and boasts a diverse repertoire of genres, the objective is to predict song popularity using songs released prior to 2019. This entails analyzing song features within Spotify&amp;rsquo;s structure to anticipate their acclaim. Utilizing Spotify&amp;rsquo;s vast database and insights into each song&amp;rsquo;s characteristics, the aim is to discern the elements contributing to a song&amp;rsquo;s success. By exploring factors such as genre, danceability, valence, and tempo, this predictive analysis seeks to unveil prevalent patterns and trends in the music landscape up to 2018. Ultimately, this exploration offers a means to comprehend and forecast the resonance of songs within Spotify&amp;rsquo;s diverse and dynamic ecosystem.&lt;/p>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Range&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>acousticness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>danceability&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>duration_min&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The duration of the track in minutes.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>energy&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>instrumentalness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Predicts whether a track contains no vocals. &amp;ldquo;Ooh&amp;rdquo; and &amp;ldquo;aah&amp;rdquo; sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly &amp;ldquo;vocal&amp;rdquo;. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>key&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>The key the track is in. Integers map to pitches using standard Pitch Class notation. If no key was detected, the value is -1.&lt;/td>
&lt;td>-1 - 11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>liveness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>loudness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.&lt;/td>
&lt;td>-60 - 0 dB&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>mode&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>Mode indicates the modality (major or minor) of a track, represented by 1 for major and 0 for minor.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>speechiness&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>Speechiness detects the presence of spoken words in a track.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>tempo&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>The overall estimated tempo of a track in beats per minute (BPM).&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>time_signature&lt;/td>
&lt;td>&lt;em>integer&lt;/em>&lt;/td>
&lt;td>An estimated time signature, ranging from 3 to 7.&lt;/td>
&lt;td>3 - 7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>track_href&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>A link to the Web API endpoint providing full details of the track.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>type&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>The object type, which must be &amp;ldquo;audio_features&amp;rdquo;.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>valence&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.&lt;/td>
&lt;td>0 - 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>population&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>A 0-to-100 score that ranks how popular an artist is relative to other artists on Spotify.&lt;/td>
&lt;td>0 - 100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>genre&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>The genre of the track&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project objective&lt;/h3>
&lt;p>We will leverage Spotify&amp;rsquo;s features database to categorize songs into popularity groups based on the following ranges:&lt;/p>
&lt;ul>
&lt;li>Popularity 0-20: Least popular&lt;/li>
&lt;li>Popularity 21-40: Less popular&lt;/li>
&lt;li>Popularity 41-60: Medium popular&lt;/li>
&lt;li>Popularity 61-80: More popular&lt;/li>
&lt;li>Popularity 81-100: Most popular&lt;/li>
&lt;/ul>
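&lt;p>The popularity ranges above can be expressed as a small helper (a sketch; the project may have binned scores differently, e.g. with pd.cut):&lt;/p>

```python
def popularity_group(score):
    """Map a 0-100 Spotify popularity score onto the five project bins."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Upper bound of each bin, checked in ascending order
    for upper, label in [(20, "Least popular"), (40, "Less popular"),
                         (60, "Medium popular"), (80, "More popular"),
                         (100, "Most popular")]:
        if score <= upper:
            return label
```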
&lt;h3 id="methodology">Methodology&lt;/h3>
&lt;p>The methodology involves several steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Data Cleaning:&lt;/p>
&lt;ul>
&lt;li>Impute missing track names&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Exploratory Data Analysis (EDA):&lt;/p>
&lt;ul>
&lt;li>Correlations among Spotify track features&lt;/li>
&lt;li>Popularity distribution&lt;/li>
&lt;li>Statistical summary of numerical features&lt;/li>
&lt;li>Distribution of categorical features&lt;/li>
&lt;li>Exploration of top artists and top songs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Feature Engineering&lt;/p>
&lt;ul>
&lt;li>Pivoted the genre feature to remove duplicate records where the same song was tagged with different genres&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Train-Test Split:&lt;/p>
&lt;ul>
&lt;li>The dataset is divided into training and testing sets to assess model performance accurately.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Model Selection:&lt;/p>
&lt;ul>
&lt;li>Several classification algorithms are utilized for model training, including Clustering, Decision Tree Classification, Random Forest Classification, ExtraTrees Classification, and Neural Networks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Model Evaluation:&lt;/p>
&lt;ul>
&lt;li>Each classification model is trained using the training data and evaluated using testing data.&lt;/li>
&lt;li>Model performance metrics such as training score, testing score, and accuracy are calculated to assess the effectiveness of each model.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
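&lt;p>The train-test split, model selection, and evaluation steps above can be sketched together; default hyperparameters are used here as an assumption:&lt;/p>

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, seed=42):
    """Fit several classifiers and report train/test accuracy for each."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
        "extra_trees": ExtraTreesClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = {"train": model.score(X_tr, y_tr),
                        "test": model.score(X_te, y_te)}
    return scores
```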
&lt;h3 id="plots">Plots&lt;/h3>
&lt;h4 id="features">Features&lt;/h4>
&lt;p>Heatmap: Numeric Feature Correlations&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations to narrow down which variables to use when predicting popularity&lt;/li>
&lt;li>Feature correlation can also help determine collinearity and bias
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heatmap: Numeric Feature Correlations" srcset="
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_be6d0b8ddb8f5c3e4803e5d23feccd47.webp 400w,
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_b7f0efd44d9ff3c3396043e2b90ef6a8.webp 760w,
/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Numeric_Features_correlation_heatmap_hucb4c2b8f5fa6d847e9d809f40854aa0e_158444_be6d0b8ddb8f5c3e4803e5d23feccd47.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Energy and Acousticness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: As acousticness goes up, energy decreases
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Energy and Acousticness" srcset="
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_bdd39ce958b77cb65d412ca980679906.webp 400w,
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_df1caee8482610bd0c9141ee6b0e718f.webp 760w,
/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/high_cor_Energy_by_Acousticness_hud63fd205c8b4edff91294e544e6c9802_147292_bdd39ce958b77cb65d412ca980679906.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Energy and Loudness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: Energy and loudness have a clear positive correlation
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Energy and Loudness" srcset="
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_5152ebca41054d1274fc2093ac797ac9.webp 400w,
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_69573e474978b432306f9f9625af8afa.webp 760w,
/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/correlation_Energy_Loudness_hud63fd205c8b4edff91294e544e6c9802_106719_5152ebca41054d1274fc2093ac797ac9.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Feature Relationships: Loudness and Acousticness&lt;/p>
&lt;ul>
&lt;li>Looking at feature correlations can help determine collinearity and bias&lt;/li>
&lt;li>Trend: As acousticness goes up, loudness decreases, confirming the previous two visualizations
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Feature Relationships: Loudness and Acousticness" srcset="
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_f1298fa5906a9151983dd0ccdedef635.webp 400w,
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_ce931f6b4780487053d47eaf8bfd262d.webp 760w,
/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/correlation_Loudness_Acousticness_hud63fd205c8b4edff91294e544e6c9802_119730_f1298fa5906a9151983dd0ccdedef635.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h4 id="genre">Genre&lt;/h4>
&lt;p>Distribution of Songs by Genre
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Songs by Genre" srcset="
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_e3bfa5707df2e68bde1d68000b383e70.webp 400w,
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_fbd4df37160fe9f4f796aaf671a11230.webp 760w,
/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Genre_Distribution_huc0ec0901b1553ed00a576508d8b99047_300940_e3bfa5707df2e68bde1d68000b383e70.webp"
width="760"
height="405"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Song Duration by Genre
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Song Duration by Genre" srcset="
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_f4e57f6c14ac78fbeb199b8c5e2b204c.webp 400w,
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_c841a28c9b1a7a81092a2c2fbee3639d.webp 760w,
/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Duration_of_Songs_in_Different_Genres_hu82303b6b86a747f2233b823e8ea8008f_220409_f4e57f6c14ac78fbeb199b8c5e2b204c.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Violin Plot: Energy by Genre&lt;/p>
&lt;ul>
&lt;li>Here you can see that a genre can often be identified by where a song&amp;rsquo;s energy level falls
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Violin Plot: Energy by Genre" srcset="
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_b9efb6749a8a336a1651d5db9b9685e3.webp 400w,
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_ec3b12b4fd9a376b0309eb275b9c4fe3.webp 760w,
/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Violin_Energy_Genre_hu80de1c058815d18a8461f9c656a2217a_544560_b9efb6749a8a336a1651d5db9b9685e3.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h4 id="popularity">Popularity&lt;/h4>
&lt;p>Distribution of Songs by Popularity Ranking&lt;/p>
&lt;ul>
&lt;li>Most songs have a mid-range popularity ranking, making it difficult to create a model that can accurately predict popularity&lt;/li>
&lt;li>We see a bimodal trend here because just under 8k songs have no ranking at all, most likely because so many songs on Spotify are never really listened to
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Songs by Popularity Ranking" srcset="
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_941f6a162eea73f3d063ea440e89639a.webp 400w,
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_bb8fe4834f5d9192112ea12ed6765b0c.webp 760w,
/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Popularity_Histgram_hu1da99b27d2f489088cb4848b279ea6e0_71429_941f6a162eea73f3d063ea440e89639a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Heatmap: Popularity Feature Correlations&lt;/p>
&lt;ul>
&lt;li>Loudness is the feature that correlates most strongly with popularity
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heatmap: Popularity Feature Correlations" srcset="
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_f82becc72c812ec1da49175e8e3acb7a.webp 400w,
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_9c6f9ad3bf4bf18f55be9cea9b2bb26e.webp 760w,
/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Spotify_Popul_Corrs_huf3a170f922b9625b1f258c0fbbc5b3c2_104756_f82becc72c812ec1da49175e8e3acb7a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Popularity by Loudness&lt;/p>
&lt;ul>
&lt;li>The most popular songs seem to land between -10 and -5 decibels
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Popularity by Loudness" srcset="
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_3d4031950a2971e6c83534da857f5e88.webp 400w,
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_50db2eb99a62929c94f60730c45fe342.webp 760w,
/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Relation_between_Loudness_and_Popularity_hud63fd205c8b4edff91294e544e6c9802_144592_3d4031950a2971e6c83534da857f5e88.webp"
width="760"
height="570"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Distribution of Key by Popularity&lt;/p>
&lt;ul>
&lt;li>Songs in the key of C tend to be the most popular, followed by G and D (though I suppose any punk music fan could have told you that!)
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Distribution of Key by Popularity" srcset="
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_638de71ea63bb20c4b6b3bb8bd7b588a.webp 400w,
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_a50c18538865176e64b033f153515ccf.webp 760w,
/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Key_Distribution_among_all_Popularity_hued37f61a7c6a9d144184504d97f70729_91120_638de71ea63bb20c4b6b3bb8bd7b588a.webp"
width="760"
height="570"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Most Popular Spotify Songs&lt;/p>
&lt;ul>
&lt;li>Here we see a bias in the data: because the same songs fall under multiple genres, the top 10 most popular entries are really only 5 distinct songs
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Most Popular Spotify Songs" srcset="
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_c3fb769c6fe79f7c79c3c73914a8986c.webp 400w,
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_921f206faf7f0c48f170147da9540f82.webp 760w,
/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Top_10_pop_hu5b71fb5a76d220cf93969e529e25cb7e_96721_c3fb769c6fe79f7c79c3c73914a8986c.webp"
width="760"
height="391"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Least Popular Spotify Songs&lt;/p>
&lt;ul>
&lt;li>In the wise words of Rodney Dangerfield, Ludwig van Beethoven gets &amp;ldquo;No respect!&amp;rdquo;
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Least Popular Spotify Songs" srcset="
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_90ba6c81e33f05fee3a6b46d6ffb9609.webp 400w,
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_9a7f152313c8f1d56e1351b4b4c5dfc0.webp 760w,
/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Least_popular_10_huce635e5488969f27ca75e0b699e271c0_54853_90ba6c81e33f05fee3a6b46d6ffb9609.webp"
width="760"
height="321"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 20 Most Popular Spotify Artists
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 20 Most Popular Spotify Artists" srcset="
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_ff1e57efc3500412bd9b813acc1de37d.webp 400w,
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_f64596c5ee9ec000f00465c976c93bd6.webp 760w,
/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/spotify/Top_20_Artists_hue23de5156996d1cfcb4a87424a1ecda9_42528_ff1e57efc3500412bd9b813acc1de37d.webp"
width="296"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="conclusion-and-recommendations">Conclusion and Recommendations&lt;/h3>
&lt;h4 id="conclusion">Conclusion&lt;/h4>
&lt;p>Our analysis revealed several key insights regarding the classification of song popularity using machine learning algorithms. Initially, through k-means clustering, we determined the optimal number of clusters, which informed our grouping of popularity into three classes. Subsequent experimentation with classification algorithms highlighted Random Forest as the top performer, achieving impressive training and testing scores. However, we identified a challenge with imbalanced popularity classes, particularly within the range of 34 to 66. To address this, we refined our approach by dividing popularity into five classes and employing a Dense Neural Network (DNN), resulting in improved accuracy rates. Overall, these findings underscore the importance of careful feature selection and addressing data imbalances to enhance the effectiveness of predictive models in classifying song popularity.&lt;/p>
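&lt;p>The popularity-binning step described above could be sketched with pandas. The cut points below are illustrative, not the project&amp;rsquo;s exact thresholds:&lt;/p>

```python
# Group a 0-100 popularity score into three, then five, classes.
import numpy as np
import pandas as pd

popularity = pd.Series(np.random.default_rng(1).integers(0, 101, size=1000))

# Three-class grouping (rough low/mid/high split)
three = pd.cut(popularity, bins=[-1, 33, 66, 100], labels=["low", "mid", "high"])

# Five-class refinement, as used with the Dense Neural Network
five = pd.cut(popularity, bins=[-1, 20, 40, 60, 80, 100],
              labels=["0-20", "21-40", "41-60", "61-80", "81-100"])

print(three.value_counts().sort_index())
print(five.value_counts().sort_index())
```

&lt;p>Inspecting the per-class counts this way also exposes the class imbalance noted in the recommendations below.&lt;/p>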
&lt;h4 id="recommendations">Recommendations&lt;/h4>
&lt;ul>
&lt;li>The dataset displays an imbalance, notably in the most popular group (Popularity: 81 to 100). Future studies should prioritize acquiring more up-to-date and balanced data to ensure robust analysis and accurate predictions.&lt;/li>
&lt;li>Given the limitations in data extraction due to Spotify&amp;rsquo;s updated user safety policy, we had restricted access to the latest dataset. For improved insights, future studies should collaborate directly with Spotify to access real-time data sources, enabling access to up-to-date information for analysis and decision-making.&lt;/li>
&lt;/ul></description></item><item><title>NLP &amp; Sentiment Analysis</title><link>https://datadillon.netlify.app/project/reddit/</link><pubDate>Fri, 15 Mar 2024 00:00:00 +0000</pubDate><guid>https://datadillon.netlify.app/project/reddit/</guid><description>&lt;h2 id="nlp--sentiment-analysis">NLP &amp;amp; Sentiment Analysis&lt;/h2>
&lt;p>by Dillon Diatlo&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>The objective of this project is to utilize Vectorization and Sentiment Analysis techniques to build a classification model that can accurately categorize which subreddit––r/running or r/Swimming––a user&amp;rsquo;s post belongs to. In doing so, I will:&lt;/p>
&lt;ul>
&lt;li>Find out which subreddit is less happy&lt;/li>
&lt;li>Use Natural Language Processing to build classification models that can accurately categorize Reddit posts&lt;/li>
&lt;/ul>
&lt;h2 id="data-dictionary">Data Dictionary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>Type&lt;/th>
&lt;th>Dataset&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>datetime&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The date and time the reddit user posted in the format, Y-MM-DD, H:M:S&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>all_text&lt;/strong>&lt;/td>
&lt;td>&lt;em>object&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The combined text of every post&amp;rsquo;s title and user&amp;rsquo;s post description&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>subreddit&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>Binary categorization of subreddits r/running [0] and r/Swimming [1]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>sentiment&lt;/strong>&lt;/td>
&lt;td>&lt;em>float&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>A number ranked between -1 and 1 suggesting the overall sentiment of the all_text copy&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>post_word_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of words in each individual all_text copy&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>title_word_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of words in each individual post title&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>upper_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of upper case letters found in individual all_text posts&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>lower_count&lt;/strong>&lt;/td>
&lt;td>&lt;em>int&lt;/em>&lt;/td>
&lt;td>full&lt;/td>
&lt;td>The number of lower case letters found in individual all_text posts&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="executive-summary">Executive Summary&lt;/h2>
&lt;h3 id="project-objective">Project Objective&lt;/h3>
&lt;p>&lt;strong>Problem Statement&lt;/strong>:
The objective of this project is to build a model that accurately categorizes Reddit posts into two separate subreddits, r/running and r/Swimming, based on word frequency and classification techniques.&lt;/p>
&lt;p>&lt;strong>Goals&lt;/strong>:
The goals of this project are to analyze the subreddits r/running and r/Swimming in order to learn, based on sentiment analysis, which subreddit users are less happy, and then to build a classification model that can use post words to accurately predict which subreddit a post belongs to.&lt;/p>
&lt;h3 id="project-methodology">Project Methodology:&lt;/h3>
&lt;p>&lt;strong>Data Collection&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>PRAW Reddit API&lt;/li>
&lt;li>r/running&lt;/li>
&lt;li>r/Swimming&lt;/li>
&lt;li>full.csv&lt;/li>
&lt;/ul>
&lt;p>full.csv is an accumulation of the following CSVs:&lt;/p>
&lt;ul>
&lt;li>run feb 26.csv&lt;/li>
&lt;li>run feb 27.csv&lt;/li>
&lt;li>swim feb 26.csv&lt;/li>
&lt;li>swim feb 27.csv&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Data Cleaning&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Joined all csvs into a single csv&lt;/li>
&lt;li>Replaced NaN&amp;rsquo;s with empty strings and then combined title column and selftext column into a single column: all_text&lt;/li>
&lt;li>Turned created_utc column into processable dates and times&lt;/li>
&lt;li>Dropped weekly automatic r/Swimming rows&lt;/li>
&lt;li>Dropped extra columns and created columns to count uppercase, lowercase, and post/title words&lt;/li>
&lt;li>CountVectorized the all_text column for manipulation&lt;/li>
&lt;/ul>
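&lt;p>The cleaning steps listed above might be sketched as follows. The column names follow PRAW&amp;rsquo;s post fields, but the sample rows and timestamps are invented:&lt;/p>

```python
# Join per-day CSV frames, fill NaNs, build all_text, and parse dates.
import pandas as pd

run = pd.DataFrame({"title": ["First 5k!"], "selftext": [None],
                    "created_utc": [1708905600], "subreddit": [0]})
swim = pd.DataFrame({"title": ["Tech suit advice?"], "selftext": ["Any tips?"],
                     "created_utc": [1708992000], "subreddit": [1]})

full = pd.concat([run, swim], ignore_index=True)           # join the per-day CSVs
full[["title", "selftext"]] = full[["title", "selftext"]].fillna("")
full["all_text"] = full["title"] + " " + full["selftext"]  # combined text column
full["datetime"] = pd.to_datetime(full["created_utc"], unit="s")

print(full[["datetime", "all_text", "subreddit"]])
```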
&lt;p>&lt;strong>Exploratory Data Analysis (EDA)&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Created a lemmatizer function and sentiment analysis function&lt;/li>
&lt;li>Found the top 10 and 25 most frequently used words, bigrams, and trigrams for both subreddits&lt;/li>
&lt;li>Explored the correlation between datetime and subreddit sentiment&lt;/li>
&lt;li>Explored the correlation between post/title word count and subreddit sentiment&lt;/li>
&lt;li>Explored the distribution of sentiment for posts within both subreddits&lt;/li>
&lt;li>Found the mean positivity of both subreddits&lt;/li>
&lt;/ul>
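&lt;p>The sentiment function maps each post to a score between -1 and 1. The following is a hand-rolled, purely illustrative stand-in for a lexicon-based scorer such as VADER; the tiny word lists are invented:&lt;/p>

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists only).
POSITIVE = {"great", "happy", "love", "win", "goal", "finished"}
NEGATIVE = {"injury", "pain", "sad", "quit", "worst"}

def sentiment(text: str) -> float:
    """Return a score in [-1, 1]: (pos - neg) over matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

print(sentiment("finished my first marathon so happy"))  # → 1.0
print(sentiment("knee pain made me quit"))               # → -1.0
```

&lt;p>A production scorer weighs intensity, negation, and punctuation as well, but the interface is the same: text in, a score in [-1, 1] out.&lt;/p>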
&lt;p>&lt;strong>Modeling Techniques&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Found my baseline accuracy of .45&lt;/li>
&lt;li>A CountVectorizer paired with a Multinomial Naive Bayes model seemed to perform poorly&lt;/li>
&lt;li>RandomForest and Logistic Regression performed the best overall&lt;/li>
&lt;li>When paired with a TfidfVectorizer and a GridSearchCV, both Logistic Regression and RandomForest performed noticeably better&lt;/li>
&lt;li>After adding in a lemmatizer and seeing only a slight change, I decided to settle on my nearly 97% accurate model&lt;/li>
&lt;/ul>
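&lt;p>The best-performing setup described above (a TfidfVectorizer feeding Logistic Regression, tuned with GridSearchCV) can be sketched as a scikit-learn pipeline. The toy corpus and the small parameter grid are placeholders, not the project&amp;rsquo;s real data or search space:&lt;/p>

```python
# TfidfVectorizer -> LogisticRegression inside GridSearchCV.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

posts = ["long run marathon training today", "race day new shoes pace",
         "lap pool freestyle technique", "goggles tech suit swim meet"] * 5
labels = [0, 0, 1, 1] * 5  # 0 = r/running, 1 = r/Swimming

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("logreg", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"tfidf__ngram_range": [(1, 1), (1, 2)],
                           "logreg__C": [0.1, 1.0]}, cv=2)
grid.fit(posts, labels)

print(grid.best_params_)
print(grid.predict(["marathon training today"]))  # → [0]
```

&lt;p>The pipeline keeps the vectorizer inside the cross-validation loop, so TF-IDF weights are refit on each training fold rather than leaking test-fold vocabulary.&lt;/p>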
&lt;h3 id="key-findings">Key Findings:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Based on the analysis conducted, putting lemmatized data through a TfidfVectorizer with my LogisticRegression model and fitting that with a GridSearchCV built the most accurate classification model at 96.8%. Looking at the confusion matrix, this performed only marginally better than non-lemmatized data run through a TfidfVectorizer and RandomForest, fit with a GridSearchCV.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>r/running users seem to be generally happier, with a mean positive sentiment score of .55 compared to r/Swimming&amp;rsquo;s .31. This could mean swimmers are less happy, but it could also mean swimmers are less concerned with accomplishments like &amp;ldquo;finish lines&amp;rdquo; and &amp;ldquo;goals&amp;rdquo; and more concerned with technique.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>There is a correlation between year and posting sentiment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>All models outperformed my baseline of 45% accuracy.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="plots">Plots&lt;/h3>
&lt;p>Top 10 Words Being Used in r/running
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Words Being Used in r/running" srcset="
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_293adbed78f8ca55b20710907ccb5121.webp 400w,
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_821aa7bbffd6022fc14d50fb40e1a604.webp 760w,
/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Top_10_Words_rRun_hu3f7d0a1223261fa00839f3666e605317_27711_293adbed78f8ca55b20710907ccb5121.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Top 10 Bigrams Being Used in r/running&lt;/p>
&lt;ul>
&lt;li>Bigrams can give us more insight into context and what type of things users in r/running are talking about&lt;/li>
&lt;li>You can see here &amp;ldquo;half marathon&amp;rdquo;, &amp;ldquo;finish line&amp;rdquo;, and &amp;ldquo;feel like&amp;rdquo; are all emotionally driven bigrams, which could mean runners are more focused on accomplishment
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/running" srcset="
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_b5b8ea5c90aea4301906bdd52f2a4da7.webp 400w,
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_c9f7f80b93c11dc457a4ad6e28664410.webp 760w,
/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Bi_10_Run_hu52540a4e4d0bfe8e090a5b3a8f1f926b_38329_b5b8ea5c90aea4301906bdd52f2a4da7.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Trigrams Being Used in r/running&lt;/p>
&lt;ul>
&lt;li>Trigrams can give us &lt;strong>even more&lt;/strong> insight into context and what type of things users in r/running are talking about&lt;/li>
&lt;li>You can see here r/runner users are very focused on goals and accomplishments
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/running" srcset="
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_9afbd7e77901e0f47733b447af285f56.webp 400w,
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_06cb1a7279252e59e94a6288309ee1e9.webp 760w,
/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Tri_10_Run_hu477042bb75c8ca3e642b6d7ede8645b3_69524_9afbd7e77901e0f47733b447af285f56.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Words Being Used in r/Swimming
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/Swimming" srcset="
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_0533a2dbb894b8b5e3843e27cd3efe6a.webp 400w,
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_366699189eb3c94268436ee5ae9041b6.webp 760w,
/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Top_10_Words_rSwim_hu09725d954fe752638974a5bb25561ba1_28028_0533a2dbb894b8b5e3843e27cd3efe6a.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Top 10 Bigrams Being Used in r/Swimming&lt;/p>
&lt;ul>
&lt;li>You can see here bigrams like &amp;ldquo;first time&amp;rdquo;, &amp;ldquo;tech suit&amp;rdquo;, and &amp;ldquo;any advice&amp;rdquo;, which suggest r/Swimming users may be focused more on technique than on accomplishment, unlike r/running users
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Bigrams Being Used in r/Swimming " srcset="
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_04ed2aede5713c2c8366d951c867dd33.webp 400w,
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_44a616d578dfafcb5aa7aa8e460f3dab.webp 760w,
/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Bi_10_Swim_hu204e705e1c7c8241f216cce1927486a2_38053_04ed2aede5713c2c8366d951c867dd33.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Top 10 Trigrams Being Used in r/Swimming&lt;/p>
&lt;ul>
&lt;li>You can see here trigrams like &amp;ldquo;does anyone know&amp;rdquo;, &amp;ldquo;would greatly appreciated&amp;rdquo;, and &amp;ldquo;wa wondering anyone&amp;rdquo;, which suggest curiosity-driven, advice-seeking users in r/Swimming&lt;/li>
&lt;li>You may also notice &amp;ldquo;drive usp sharing&amp;rdquo;, &amp;ldquo;format pjpg webp&amp;rdquo;, and other odd, not-really-English trigrams. These appear because the post content contained many links, which can be further pre-processed and filtered out given more time
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 10 Trigrams Being Used in r/Swimming " srcset="
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_aac47f1d93b696d42f7924fdbe3f51c0.webp 400w,
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_6f702748dd2020902a0558efde501afe.webp 760w,
/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Tri_10_Swim_hu8b6b3fb7e8f64f518c13e616d8c4bd2a_55128_aac47f1d93b696d42f7924fdbe3f51c0.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Average Positive Sentiment&lt;/p>
&lt;ul>
&lt;li>Sentiment analysis of user posts in r/running and r/Swimming suggested that, on average, r/running user posts are .24 more positive than r/Swimming user posts&lt;/li>
&lt;li>This does not necessarily mean r/running users are happier; as we previously saw, it could simply be an indicator of what the two groups are posting about
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Average Positive Sentiment" srcset="
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_60e0e1739d2379d27b3daea4eff40b3e.webp 400w,
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_894c3ead41d9d78bd4d345ff8c48b6ea.webp 760w,
/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Mean_Sent_Compare_hu6c738c5d241cd375c8dba37296279fad_18448_60e0e1739d2379d27b3daea4eff40b3e.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/Swimming: Distribution of Sentiment&lt;/p>
&lt;ul>
&lt;li>Distribution here is not normal; rather, it&amp;rsquo;s left-skewed with a bimodal curve&lt;/li>
&lt;li>The sentiment of most r/Swimming posts falls between ~ -.2 and 1, with high accumulations of post sentiment in those same points, suggesting user posts generally fall between neutral or positive
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/Swimming: Distribution of Sentiment" srcset="
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_7d02f92cc791cc27772753d37433da6f.webp 400w,
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_a32faef6a9637aaeaa5a42a29da25e36.webp 760w,
/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Swim_DistSentiment_hu93be25b24d387adfb2c3f63d2cbfd6da_17979_7d02f92cc791cc27772753d37433da6f.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/running: Distribution of Sentiment&lt;/p>
&lt;ul>
&lt;li>Distribution here is not normal; rather, it&amp;rsquo;s left-skewed&lt;/li>
&lt;li>Analysis of the distribution suggests most r/running user posts are positive, with the majority falling between ~.6 and 1.00
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/running: Distribution of Sentiment" srcset="
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_c503388aa979cd25834f1d744c8146a2.webp 400w,
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_743168d2e7cacbaa1de071ca4f26c1ef.webp 760w,
/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Run_DistSentiment_hu61d20699f0080ee52e74dda6d8f06d10_16966_c503388aa979cd25834f1d744c8146a2.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/running: Year vs. Sentiment&lt;/p>
&lt;ul>
&lt;li>Here you can see a high accumulation of positive posts forming between 2020 and 2022; this correlates with the Covid-19 pandemic, a time when many people began running outside since gyms and facilities were closed
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/running: Year vs. Sentiment" srcset="
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_3643ad6e5096f71de084f7e1532d1704.webp 400w,
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_804b70f40022cf4e41134a0c054d848a.webp 760w,
/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Run_Year_Sentiment_hu0c3553ed86c14d9f5940f11d932a7fb4_79761_3643ad6e5096f71de084f7e1532d1704.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>r/Swimming: Year vs. Sentiment&lt;/p>
&lt;ul>
&lt;li>Here you can see the same bimodal split with a heavy accumulation of neutral-sentiment posts and positive posts
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="r/Swimming: Year vs. Sentiment" srcset="
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_0de4af5c87e2648f490765b8903dd782.webp 400w,
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_dddbb0bf3c5afa983c7ce1a47c54391e.webp 760w,
/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://datadillon.netlify.app/project/reddit/Swim_Year_Sentiment_hub3ebfe7c1426ef3e5cbb6830ec42a189_88991_0de4af5c87e2648f490765b8903dd782.webp"
width="640"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h3 id="implications-and-conclusion">Implications and Conclusion:&lt;/h3>
&lt;p>A TfidfVectorizer paired with LogisticRegression and GridSearchCV seems to work better than anything with a CountVectorizer, since TfidfVectorizers weigh both term frequency and rarity across the corpus; paired with the best hyperparameters, this results in a solid prediction. A CountVectorizer, by contrast, captures raw frequency only and may therefore give weight to many less important words.&lt;/p>
&lt;p>In terms of sentiment analysis, runners skew more positive. Based on the &amp;lsquo;Top 10 r/running Trigrams&amp;rsquo;, they appear more accomplishment-focused, concentrating on wins, goals, and races, all emotionally driven terms. At the same time, running has a lower barrier to entry than swimming: it&amp;rsquo;s something most of us can do intuitively, so it&amp;rsquo;s easier to start, practice, improve, and try races. Runners may not be happier, but they are more emotionally focused.&lt;/p>
&lt;p>Comparatively, posts in r/Swimming are more neutral, as seen in &amp;lsquo;Year vs Sentiment r/Swimming&amp;rsquo; and in the multiple spikes in &amp;lsquo;Sentiment Distribution r/Swimming&amp;rsquo;. Additionally, r/Swimming users seem less accomplishment-focused and more improvement-focused, as shown in &amp;lsquo;Top 10 r/Swimming Bigrams&amp;rsquo; with phrases like &amp;ldquo;any advice&amp;rdquo;, &amp;ldquo;first time&amp;rdquo;, &amp;ldquo;tech suit&amp;rdquo;, and &amp;ldquo;started swimming&amp;rdquo;.&lt;/p>
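&lt;p>The write-up doesn&amp;rsquo;t name the sentiment tool it used, but the negative/neutral/positive binning discussed above can be illustrated with a toy lexicon-based scorer. The word lists and posts below are invented for illustration; a real analysis would use a full lexicon or a trained model.&lt;/p>

```python
# Toy lexicon-based sentiment scorer: counts positive vs. negative lexicon hits
# and bins each post. Purely illustrative; word lists are invented.
POSITIVE = {"great", "goal", "win", "love", "happy", "accomplished"}
NEGATIVE = {"injury", "pain", "slow", "tired", "struggling"}

def sentiment(post: str) -> str:
    """Score a post by lexicon hits; ties and zero hits are neutral."""
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

posts = [
    "hit my goal pace today and felt great",
    "any advice on a new tech suit",
    "struggling with shoulder pain lately",
]
print([sentiment(p) for p in posts])  # ['positive', 'neutral', 'negative']
```

&lt;p>A corpus where most posts score near zero under a scheme like this produces exactly the heavy neutral accumulation seen in the r/Swimming distribution.&lt;/p>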
&lt;h3 id="next-steps">Next Steps&lt;/h3>
&lt;p>Next steps are to narrow down our variables and build a model with lower error and an even better fit:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;ins>DATETIME&lt;/ins> - Models were based on post word frequency; I wonder whether also incorporating post datetime would yield a better model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>STEMMING&lt;/ins> - Data was run through a lemmatizer, though not a stemmer. We should go back and see if stemming affects our models at all.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>FILTER&lt;/ins> - Swimming&amp;rsquo;s trigrams include many long, uninterpretable strings. Go back and add a stop-word filter that removes words longer than a certain length.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;ins>PINPOINT USERS&lt;/ins> - If we are advertising happiness to sad users, we&amp;rsquo;ll need more than just the subreddit they browse. Let&amp;rsquo;s collect their usernames and target them directly. We can also scrape their data to see where else they post and begin to build classes within our swimmer target audience.&lt;/p>
&lt;/li>
&lt;/ol></description></item></channel></rss>