Predicting Successful Songs

Last updated on May 30, 2024

gif by Dillon Diatlo

Popularity Prediction using Spotify Features

Authors: Dillon Diatlo, Elaine Chen, Harnish Shah, Rajashree Choudhary

Role

Data Cleaning
Data Visualization
Data Modeling (build, train, validate)
Presentation

Problem Statement

In the expansive domain of Spotify, a platform that serves as a hub for global music communities and boasts a diverse repertoire of genres, the objective is to predict the popularity using songs released prior to 2019. This entails analyzing song features within Spotify’s structure to anticipate their acclaim. Utilizing Spotify’s vast database and insights into each song’s characteristics, the aim is to discern the elements contributing to a song’s success. Exploring factors such as genre, danceability, valency, and tempo etc, this predictive analysis seeks to unveil prevalent patterns and trends in the music landscape up to 2018. Ultimately, this exploration offers a means to comprehend and forecast the resonance of songs within Spotify’s diverse and dynamic ecosystem.

Data Dictionary

Feature	Type	Description	Range
acousticness	float	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.	0 - 1
danceability	float	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.	0 - 1
duration_min	float	The duration of the track in minutes.
energy	float	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.	0 - 1
instrumentalness	float	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.	0 - 1
key	integer	The key the track is in. Integers map to pitches using standard Pitch Class notation. If no key was detected, the value is -1.	-1 - 11
liveness	float	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.	0 - 1
loudness	float	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.	-60 - 0 dB
mode	integer	Mode indicates the modality (major or minor) of a track, represented by 1 for major and 0 for minor.	0 - 1
speechiness	float	Speechiness detects the presence of spoken words in a track.	0 - 1
tempo	float	The overall estimated tempo of a track in beats per minute (BPM).
time_signature	integer	An estimated time signature, ranging from 3 to 7.	3 - 7
track_href	object	A link to the Web API endpoint providing full details of the track.
type	object	The object type, which must be “audio_features”.
valence	float	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.	0 - 1
population	object	A 0-to-100 score that ranks how popular an artist is relative to other artists on Spotify.	0 - 100
genre	object	The genre of the track

Executive Summary

Project objective

We will leverage Spotify’s features database to categorize songs into popularity groups based on the following ranges:

Popularity 0-20: Least popular
Popularity 21-40: Less popular
Popularity 41-60: Medium popular
Popularity 61-80: More popular
Popularity 81-100: Most popular

Methodology

The methodology involves several steps:

Data Cleaning:
- Impute missing track name
Exploratory Data Analysis (EDA):
- Each Spotify track features correlation
- Popularity distribution
- Numerical features statisticical summary
- Categorical features distribution
- Top artists, top songs exploratory
Feature Engineering
- Transpose feature genre, to remove records that same song marked in different genre
Train-Test Split:
- The dataset is divided into training and testing sets to assess model performance accurately.
Model Selection:
- Several classification algorithms are utilized for model training, including Clustering, Desicion Tree Classification, Random Forest Classification, ExtraTrees Classification and Neural Network
Model Evaluation:
- Each classification model is trained using the training data and evaluated using testing data.
- Model performance metrics such as training score, testing score, and accuracy is calculated to assess the effectiveness of each model.

Plots

Features

Heatmap: Numeric Feature Correlations

Looking at feature correlations to narrow down which variables to use when predicting popularity
Feature correlation can also help determine co-linearity and bias

Feature Relationships: Energy and Acousticness

Looking at feature correlations can help to determine co-linearity and bias
Trend: As acousticness goes up, energy decreases

Feature Relationships: Energy and Loudness

Looking at feature correlations can help to determine co-linearity and bias
Trend: Energy and loudness have a clear positive correlation

Feature Relationships: Loudness and Acousticness

Looking at feature correlations can help to determine co-linearity and bias
Trend: As acousticness goes up, loudness decreases, confirming previous two visualizations

Genre

Distribution of Songs by Genre

Song Duration by Genre

Violin Plot: Energy by Genre

Here you can see that each genre can often be determined by where a songs energy level fits

Popularity

Distribution of Songs by Popularity Ranking

Most songs have a mid-range popularity ranking, making it difficult to create a model that can accurately predict popularity
We see a bimodal trend here because just under 8k songs have not been ranked at all, most likely due to the nature of of how many songs are on Spotify and never really listened to

Heatmap: Popularity Feature Correlations

Loudness is the feature that correlates with popularity most

Popularity by Loudness

Most popular songs seem to land between -10 and -5 decibels

Distribution of Key by Popularity

Songs in the key of C tend to be the most popular, followed by G and D (though I suppose any punk music fan could have told you that!)

Top 10 Most Popular Spotify Songs

We see here a bias in the data, since the same songs fall under multiple genres the top 10 most popular songs are only 5 songs

Top 10 Least Popular Spotify Songs

In the wise words of Rodney Dangerfield, Ludwig van Beethoven gets “No respect!”

Top 20 Most Popular Spotify Artists

Conclusion and Recommendations

Conclusion

Our analysis revealed several key insights regarding the classification of song popularity using machine learning algorithms. Initially, through k-means clustering, we determined the optimal number of clusters, which informed our grouping of popularity into three classes. Subsequent experimentation with classification algorithms highlighted Random Forest as the top performer, achieving impressive training and testing scores. However, we identified a challenge with imbalanced popularity classes, particularly within the range of 34 to 66. To address this, we refined our approach by dividing popularity into five classes and employing a Dense Neural Network (DNN), resulting in improved accuracy rates. Overall, these findings underscore the importance of careful feature selection and addressing data imbalances to enhance the effectiveness of predictive models in classifying song popularity.

Recommendations

The dataset displays an imbalance, notably in the most popular group (Popularity: 81 to 100). Future studies should prioritize acquiring more up-to-date and balanced data to ensure robust analysis and accurate predictions.
Given the limitations in data extraction due to Spotify’s updated user safety policy, we had restricted access to the latest dataset. For improved insights, future studies should collaborate directly with Spotify to access real-time data sources, enabling access to up-to-date information for analysis and decision-making.

Deep Learning