Part Three - Feature Engineering, Dimensionality Reduction, and Final Model Results

Introduction

After training our initial models on the dataset exactly as it came out of the data prep step, we saw only marginal success. The next step in the project was to try to improve our models by reducing the complexity of the dataset. We settled on four methods: Backward Feature Elimination, Principal Component Analysis, Linear Discriminant Analysis, and Kernel Principal Component Analysis. After applying each method, we retrained and retested every model and recorded the results.

Backward Feature Elimination

Backward Elimination starts by training a model with all columns. It then removes one column at a time, retraining the model after each removal, to find the column that contributes least to the model’s performance. That column is dropped, and the test is repeated over the remaining columns to find the next least important one. This continues until a threshold is met (the threshold is a hyperparameter chosen by the team). In the end, a subset of columns remained: the “most important” columns. Interestingly, in our case, all the categorical columns were dropped. The remaining columns were popularity, duration_ms, danceability, energy, loudness, speechiness, instrumentalness, liveness, valence, and tempo. Retraining each model on the reduced column set changed accuracy by no more than two percentage points in either direction.

Model	Original Score	New Score
Logistic Regression	33%	32%
Random Forest Classifier	57%	56%
k-Nearest Neighbor	45%	47%
Support Vector Classifier (RBF Kernel)	47%	48%
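
The elimination loop described above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the synthetic data, the `backward_elimination` helper, the use of cross-validated accuracy, and the `min_features` stopping rule are all assumptions standing in for the project's real dataset and threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def backward_elimination(model, X, y, feature_names, min_features=3, cv=3):
    """Repeatedly drop the column whose removal hurts CV accuracy the least.

    `min_features` is a hypothetical stopping threshold chosen for this sketch.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > min_features:
        scores = []
        for col in remaining:
            # Retrain without this one column and record the resulting accuracy.
            trial = [c for c in remaining if c != col]
            score = cross_val_score(model, X[:, trial], y, cv=cv).mean()
            scores.append((score, col))
        # The column whose removal leaves the highest score contributed least.
        _, least_useful = max(scores)
        remaining.remove(least_useful)
    return [feature_names[c] for c in remaining]


# Synthetic stand-in data: 8 columns, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

kept = backward_elimination(LogisticRegression(max_iter=1000), X, y, names)
print("Columns kept:", kept)
```

Scikit-Learn also ships `SequentialFeatureSelector` with `direction="backward"`, which wraps the same idea and may be more convenient than a hand-written loop.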

Principal Component Analysis

Principal Component Analysis (PCA) is another method of dimensionality reduction. It replaces the original features with a new set of variables called “Principal Components,” each a linear combination of the original features. The components are ordered by how much of the variance in the original data they capture, so keeping only the first few retains as much of the spread as possible. PCA isn’t usually something that you’d write yourself, as there are multiple well-documented libraries that can handle the transformation for you. In our project we used Scikit-Learn’s PCA module. Unfortunately, none of the models improved: k-Nearest Neighbor was unchanged and the other three lost accuracy, with the Random Forest Classifier dropping the most (from 57% to 49%).

Model	Original Score	New Score
Logistic Regression	33%	30%
Random Forest Classifier	57%	49%
k-Nearest Neighbor	45%	45%
Support Vector Classifier (RBF Kernel)	47%	45%
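
For readers who haven't used it, here is a rough sketch of how Scikit-Learn's `PCA` can sit in front of a classifier. The synthetic data, the scaling step, and the choice to keep 95% of the variance are assumptions made for illustration; they are not the settings from our project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 12 features, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale first, because PCA is sensitive to the scale of each feature.
# A float n_components keeps just enough components to explain that
# fraction of the variance (0.95 here is an assumed value).
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    RandomForestClassifier(random_state=0),
)
pca_model.fit(X_train, y_train)

pca = pca_model.named_steps["pca"]
print("Components kept:", pca.n_components_)
print("Variance explained per component:", pca.explained_variance_ratio_)
print("Test accuracy:", pca_model.score(X_test, y_test))
```

Putting the scaler and PCA inside a pipeline means both are fit on the training split only, which avoids leaking information from the test split into the transformation.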

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) uses both the features and the dataset’s labels to combine multiple features into new ones, chosen to maximize the separation between classes. Another way to think about it is that LDA tries to create more distinct borders between classes, making it clearer when a sample belongs to one class versus another. Like our previous attempts, LDA did not improve any model: three stayed within a point of their original scores, and the Support Vector Classifier dropped from 47% to 41%.

Model	Original Score	New Score
Logistic Regression	33%	33%
Random Forest Classifier	57%	56%
k-Nearest Neighbor	45%	45%
Support Vector Classifier (RBF Kernel)	47%	41%
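
As a small sketch of the idea, the snippet below applies Scikit-Learn's `LinearDiscriminantAnalysis` as a transformer before a k-Nearest Neighbor classifier. The data is synthetic and the three-class setup is an assumption; note that LDA can produce at most (number of classes - 1) components, so three classes allow at most two.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 12 features, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unlike PCA, LDA is supervised: fitting requires the labels.
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

# Train a downstream model on the two discriminant components.
knn = KNeighborsClassifier().fit(X_train_lda, y_train)
acc = knn.score(X_test_lda, y_test)
print("Reduced shape:", X_train_lda.shape)
print("Test accuracy:", acc)
```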

Kernel Principal Component Analysis

Kernel PCA (KPCA) is a non-linear variant of PCA. It captures complex, non-linear structure by implicitly mapping the data into a higher-dimensional feature space through a kernel function and performing PCA there. We selected the ‘rbf’ kernel as a common general-purpose choice, but unfortunately KPCA performed worse across all models, with drops of 4 to 11 percentage points, likely because our data wasn’t that complex to begin with.

Model	Original Score	New Score
Logistic Regression	33%	27%
Random Forest Classifier	57%	46%
k-Nearest Neighbor	45%	41%
Support Vector Classifier (RBF Kernel)	47%	39%
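
A minimal sketch of KPCA with the ‘rbf’ kernel in Scikit-Learn might look like the following. Only the kernel choice comes from our project; the synthetic data, the scaling step, and `n_components=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 12 features, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The RBF kernel compares samples by distance, so scale the features first.
# n_components=5 is an assumed value, not a tuned one.
kpca_model = make_pipeline(
    StandardScaler(),
    KernelPCA(n_components=5, kernel="rbf"),
    SVC(kernel="rbf"),
)
kpca_model.fit(X_train, y_train)

# Slicing the pipeline applies only the scaler and KPCA steps.
X_test_kpca = kpca_model[:-1].transform(X_test)
kpca_acc = kpca_model.score(X_test, y_test)
print("Reduced shape:", X_test_kpca.shape)
print("Test accuracy:", kpca_acc)
```

One practical caveat: KPCA builds a kernel matrix over all training samples, so its memory and compute cost grow with the number of rows rather than the number of columns.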

The Silver Lining

While none of our feature engineering or dimensionality reduction techniques yielded meaningful performance increases, they are far from useless. One important benefit is that they shrink the dataset needed for training: fewer columns means less data to store and process. Where storage and memory costs are significant, that reduction can translate into real savings on both.

Future Work

This was the end of my project for the class, but I plan to publish one more post in which I take this further with a deep learning model in PyTorch, hopefully achieving better performance than the methods tried above.