Model Training

Data 311 Final Project

Alyssa, Bryan, and Vivian

Data Cleaning & Processing

Our cleaning and processing steps are explained in our Exploratory notebook, linked here.

Model Selection

We decided to compare 12 different regression models using our predetermined evaluation metrics to determine which model to use on our testing data.

To do this effectively, we upgraded the find_train() function we made in our Exploratory notebook. It now takes in a "zoo" of different regression models, trains each one on X and y, and outputs our evaluation metrics (R2, MAE, RMSE) for each model, formatted into a DataFrame. We used a StandardScaler() to center and scale our data.
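Roughly, the upgraded function looks like the sketch below. The exact implementation lives in our notebook; the zoo argument is just a dict mapping model names to estimators, and X_train, y_train, X_val, y_val are placeholder names for our training and validation splits.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def find_train(zoo, X_train, y_train, X_val, y_val):
    """Fit every model in the zoo on scaled data and collect R2, MAE, and RMSE."""
    scaler = StandardScaler().fit(X_train)
    X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

    rows = []
    for name, model in zoo.items():
        model.fit(X_train_s, y_train)
        for split, X_s, y in [("train", X_train_s, y_train), ("val", X_val_s, y_val)]:
            pred = model.predict(X_s)
            rows.append({
                "model": name,
                "split": split,
                "R2": r2_score(y, pred),
                "MAE": mean_absolute_error(y, pred),
                "RMSE": np.sqrt(mean_squared_error(y, pred)),
            })
    return pd.DataFrame(rows)
```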

Our zoo includes our baseline model (which always predicts the mean) and a variety of regression models. We tuned hyperparameters on the models that seemed most promising in order to improve their scores.
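A zoo along these lines might look like the following. The estimators and hyperparameter values shown here are illustrative rather than the exact ones we settled on, and we are assuming BR stands for Bayesian Ridge.

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

zoo = {
    "Baseline": DummyRegressor(strategy="mean"),    # always predicts the training mean
    "LR": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "BR": BayesianRidge(),
    "SVR": SVR(kernel="rbf", C=1.0, epsilon=0.05),  # example hyperparameters only
    "RF": RandomForestRegressor(n_estimators=200, max_depth=8, random_state=0),
}
```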

We made a second evaluation environment that performs polynomial feature expansion on X and computes the same evaluation metrics in a DataFrame. This lets us compare each model's scores with and without polynomial feature expansion, as well as compare the models against one another.
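One simple way to build this second environment, assuming it just expands X before reusing the find_train() evaluation above (the degree is an assumption):

```python
from sklearn.preprocessing import PolynomialFeatures

def find_train_poly(zoo, X_train, y_train, X_val, y_val, degree=2):
    """Expand X with polynomial features, then reuse the find_train() evaluation."""
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X_train)
    return find_train(zoo, poly.transform(X_train), y_train,
                      poly.transform(X_val), y_val)
```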

The function make_table() returns a DataFrame with each model in the zoo and its evaluation metrics on both training and validation sets.
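One possible shape for make_table(), assuming it pivots the long-format metrics from find_train() so each model gets a single row with its training and validation scores side by side:

```python
def make_table(zoo, X_train, y_train, X_val, y_val):
    """One row per model, with train/validation R2, MAE, and RMSE side by side."""
    long = find_train(zoo, X_train, y_train, X_val, y_val)
    return long.pivot(index="model", columns="split", values=["R2", "MAE", "RMSE"])
```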

Initial scores

This table shows the initial scores of all our models.

The top model is SVR, with an R2 training score of 0.5618 and a validation score of 0.4936. We picked the top model as the one with the highest R2 score and the smallest gap between its training and validation R2. To make model selection easier, we display the top model and its scores beneath the table.

Top model: SVR 0.5618 0.4936
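As an illustration of that selection rule (one interpretation of it, using the make_table() sketch above), the top model could be pulled out programmatically like this:

```python
scores = make_table(zoo, X_train, y_train, X_val, y_val)
r2 = scores["R2"].copy()
r2["gap"] = (r2["train"] - r2["val"]).abs()
# Highest validation R2 first; ties broken toward the smallest train/validation gap.
best = r2.sort_values(["val", "gap"], ascending=[False, True]).index[0]
print("Top model:", best, r2.loc[best, "train"], r2.loc[best, "val"])
```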

Polynomial feature expansion

We ran the same data through our polynomial evaluation environment and found that, with polynomial feature expansion applied, BR is the best model.

Top model: BR 0.5094 0.4810

This model has a smaller gap between the training and validation scores, which we liked. However, both of the SVR scores from the initial run were higher.

Top 10 cols

Next, based on the regression coefficients and PCA analysis from our Exploratory notebook, we decided to train the models on only the ten columns with the strongest relationships with precipitation amount. We ran this through our first evaluation environment.
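A sketch of one way to pick those ten columns, using absolute Pearson correlation as a stand-in for relationship strength (our notebook also used regression coefficients and PCA). The names train_df and "precip_amount" are placeholders for our cleaned training DataFrame and target column.

```python
# Hypothetical names: 'train_df' is the cleaned training data, "precip_amount" the target.
target = "precip_amount"

corrs = train_df.corr(numeric_only=True)[target].drop(target).abs()
top10_cols = corrs.sort_values(ascending=False).head(10).index.tolist()

X_train_top10 = X_train[top10_cols]
X_val_top10 = X_val[top10_cols]
```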

Best model: SVR 0.5192 0.4879

This SVR model has higher scores than BR and a smaller gap than the initial SVR model. This suggests that dropping the rest of the columns helps our model.

Polynomial & top 10 cols

Next, we combined both strategies and ran our top ten columns through our polynomial fit environment. The highest-scoring model is Linear Regression, which we ended up using as our final model.

Best model: LR 0.5416 0.5158

This model has higher scores than the polynomial SVR and a smaller gap than our initial model. Even though the initial model scored higher on the training set (0.5618), it scored lower on the validation set (0.4936), so we opted to use this Linear Regression model; the smaller gap between its scores helps avoid potential overfitting.

Our next step was to create our linear regression model with polynomial feature expansion and dropped columns.
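As a minimal sketch, the final model can be written as a single scikit-learn pipeline. The degree-2 expansion is an assumption, and top10_cols refers to the column-selection sketch above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

final_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # assumed degree
    StandardScaler(),
    LinearRegression(),
)
final_model.fit(X_train[top10_cols], y_train)
```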

We created graphs of our linear regression model for our training and validation sets. Overall, our model performed the best on the training set.
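The graphs were produced along these lines; a predicted-versus-actual scatter for each split is one common choice, and the variable names here are the placeholders from the earlier sketches.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, (X, y, title) in zip(axes, [(X_train[top10_cols], y_train, "Training"),
                                    (X_val[top10_cols], y_val, "Validation")]):
    pred = final_model.predict(X)
    ax.scatter(y, pred, alpha=0.4)
    ax.plot([y.min(), y.max()], [y.min(), y.max()], "k--")  # perfect-prediction line
    ax.set(title=title, xlabel="Actual precipitation", ylabel="Predicted precipitation")
plt.tight_layout()
plt.show()
```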

Testing our Model

Our next step was to run our model on both of our testing sets.
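Scoring a held-out test set follows the same pattern as validation; a quick sketch, with X_test and y_test as placeholder names:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

pred_test = final_model.predict(X_test[top10_cols])
print("R2:  ", round(r2_score(y_test, pred_test), 3))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, pred_test)), 3))
```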

On our first testing set, our R2 score was 0.504, with a root mean squared error of 0.172. Compared to our validation set, it scored slightly worse, but had a smaller RMSE.

Below is a scatterplot of how our model performed on the first testing set.

Next, we ran our model on our testing set of 2021 data, which was a year our model had never seen before.

This time, our model had an R2 score of 0.435 and an RMSE of 0.176. The R2 score was noticeably lower than on any of our other sets, but the RMSE was similar to that of the first testing set.