Part Two - Data Cleaning, Preparation, and Initial Training Results

Introduction

“Data cleaning”, as used in this post, is the act of removing unnecessary columns from the dataset and making the remaining data easier for a model to learn from through techniques like scaling and normalization. I’ll get into specifics below, but through these steps we ended up with 10 classes, 16 columns, and significantly better training results.

Removing Unnecessary Columns

The dataset started with 21 columns, but several were clearly unnecessary from the start. One column, “Unnamed: 0”, was simply the index of the row in the dataset. Likewise, “track_id” was the unique identifier assigned to the track by Spotify. Neither was needed for training, and leaving them in would likely have reduced the efficacy of the model.

We also chose to remove the “artists”, “track_name”, and “album_name” columns. Although a field like “artists” could be a significant indicator of genre (most artists stick to a single genre), the purpose of our model was to make predictions based on the characteristics of the music, not the track metadata. Including these fields could even reduce accuracy for some genres in cases where an artist spans multiple genres or changes genre mid-career.
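As a rough sketch of this step, the columns can be dropped with pandas. The toy DataFrame below stands in for the real dataset: its values are made up, and the “danceability” and “track_genre” columns are assumptions included only so that something remains after the drop.

```python
import pandas as pd

# Toy frame standing in for the real dataset; all values are made up.
# "danceability" and "track_genre" are assumed column names for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "track_id": ["id_a", "id_b"],
    "artists": ["Artist A", "Artist B"],
    "album_name": ["Album A", "Album B"],
    "track_name": ["Track A", "Track B"],
    "danceability": [0.61, 0.42],
    "track_genre": ["punk", "jazz"],
})

# Identifier and metadata columns named in the post.
columns_to_drop = ["Unnamed: 0", "track_id", "artists", "album_name", "track_name"]
df = df.drop(columns=columns_to_drop)

print(df.columns.tolist())
```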

Data Encoding

The categorical columns (key, time_signature, and mode) were one-hot encoded, which can improve performance in some models by removing a false ordinal relationship from the data. For example, if a “color” column with the options red, green, and blue is stored as the integers 0, 1, and 2, some models will treat the values as ordered, as if red were less than green and green less than blue. One-hot encoding prevents this by creating a binary feature for each value. In our project, we took a column such as “key”, which has values like “key_f”, “key_g”, and “key_a”, and one-hot encoded it using pandas’ get_dummies function. This converts each possible value into its own binary column (0 or 1), with a 1 in the column that matches the row’s original value. For example:

Row ID	Key
1	key_a
2	key_f
3	key_g

Became:

Row ID	Key_a	Key_f	Key_g
1	1	0	0
2	0	1	0
3	0	0	1
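The transformation in the tables above can be reproduced roughly as follows. This is a minimal sketch on a three-row toy frame, not our actual preprocessing code; the “row_id” column name and the empty prefix arguments are choices made here so the output columns match the raw values.

```python
import pandas as pd

# Toy data mirroring the example table.
df = pd.DataFrame({"row_id": [1, 2, 3], "key": ["key_a", "key_f", "key_g"]})

# prefix="" and prefix_sep="" keep the new column names equal to the raw values;
# by default pandas would name them "key_key_a", "key_key_f", and so on.
encoded = pd.get_dummies(df, columns=["key"], prefix="", prefix_sep="", dtype=int)

print(encoded)
```

Each row ends up with exactly one 1 among the new columns, and the original “key” column is removed.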

Following one-hot encoding, we took all the numerical columns and applied normalization (specifically, standardization) using scikit-learn’s StandardScaler class, which rescales each column to zero mean and unit variance so that no column is unfairly weighted in comparison to others.
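To illustrate what StandardScaler does, here is a small sketch on two made-up numeric features with very different ranges. The values are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical numeric features on very different scales.
X = np.array([
    [120.0, 0.2],
    [90.0, 0.8],
    [150.0, 0.5],
])

scaler = StandardScaler()
# Each column becomes (x - column mean) / column standard deviation.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a train/test workflow, the scaler is usually fit on the training split only and then applied to the test split with transform, so that test data does not influence the scaling.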

Initial Model Training

After encoding and normalizing the dataset, we trained several initial models: a Random Forest classifier, a Logistic Regression classifier, a k-Nearest Neighbor classifier, and a Support Vector classifier. None of these models performed well, with accuracy scores in the low-to-mid 20% range. Re-examining the dataset, we suspected that the poor performance was due to many of the genres being quite similar. For example, what’s really the difference between “alt-rock”, “punk”, and “punk-rock”? With this in mind, we manually mapped the 114 genres down to just ten: world, electronic, rock, pop, instrumental, metal, kids, folk, jazz, and hip-hop.
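A manual remapping like this can be expressed as a dictionary lookup. The sketch below is illustrative only: the “track_genre” column name is an assumption, and the three entries shown are a hypothetical fragment rather than the full 114-entry mapping we used.

```python
import pandas as pd

# Hypothetical fragment of the mapping; the full 114-entry mapping is not shown here.
genre_map = {
    "alt-rock": "rock",
    "punk": "rock",
    "punk-rock": "rock",
}

# "track_genre" is an assumed column name for the original genre label.
df = pd.DataFrame({"track_genre": ["alt-rock", "punk", "punk-rock"]})
df["genre"] = df["track_genre"].map(genre_map)

# Any subgenre missing from the mapping becomes NaN, so it is worth checking.
unmapped = df["genre"].isna().sum()
print(df["genre"].value_counts())
print("unmapped rows:", unmapped)
```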

Class Rebalancing

Remapping the genres caused an immediate issue with the data, however. The subgenres were not spread evenly across the new genres: some genres had 7 subgenres mapped into them and some had 20. This created a significant class imbalance in the new dataset. A class imbalance occurs when one or several classes have many more samples than the rest, which biases the model toward those classes at the expense of the others. In our newly remapped training dataset, the “world” genre had nearly 4 times as many samples as the “jazz” genre and double the number of samples of the “instrumental” genre. This would likely cause the models we train to predict “world” more often than they should and, conversely, to miss “jazz” tracks.

To remediate this, we used the SMOTE class from imbalanced-learn (imblearn) to generate synthetic samples and balance each class (genre) of music. We ended up with 14,000 samples in each genre.

Results after Rebalancing the Classes

The accuracy of each model improved after remapping the genres and rebalancing the classes. When reading the accuracy scores below, it’s important to remember that with 10 equally sized genres, random guessing would be correct 10% of the time.

Model	Accuracy Score
Logistic Regression	33%
Random Forest Classifier	57%
k-Nearest Neighbor	45%
Support Vector Classifier (RBF Kernel)	47%
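To make the 10% chance level concrete, the sketch below trains the same four model types alongside a uniform-random baseline (scikit-learn’s DummyClassifier). It uses synthetic 10-class data, not the Spotify dataset, and default hyperparameters, which are assumptions for illustration. The numbers it prints will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 10-class data; it is NOT the Spotify dataset.
X, y = make_classification(
    n_samples=2000, n_features=16, n_informative=10,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are library defaults, chosen only to keep the sketch short.
models = {
    "Chance baseline": DummyClassifier(strategy="uniform", random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "k-Nearest Neighbor": KNeighborsClassifier(),
    "Support Vector Classifier (RBF Kernel)": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.1%}")
```

Comparing every model against the baseline row is a quick way to see how much of a score is real signal rather than luck.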

In the next post, I’ll cover the steps we took to try to improve the performance of each of these models, including feature engineering and dimensionality reduction.