Exploratory Analysis

Data 311 Final Project

Alyssa, Bryan, and Vivian

Data Collection

We requested data from NOAA's Local Climatological Data (LCD) database. The data was collected at Seattle-Tacoma International Airport (SeaTac), and we started with three datasets: one spanning 2010-2019, one from 2020, and one from 2021. We combined the first two into a dataset covering 2010 to 2020, which we used for training, validating, and testing our model, and held out the 2021 dataset for evaluating our final model.
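A minimal sketch of this step with pandas (the filenames here are placeholders for whatever the LCD exports were saved as):

    import pandas as pd

    # Placeholder filenames; the actual LCD export names will differ.
    df_2010s = pd.read_csv("seatac_lcd_2010_2019.csv", low_memory=False)
    df_2020 = pd.read_csv("seatac_lcd_2020.csv", low_memory=False)
    df_2021 = pd.read_csv("seatac_lcd_2021.csv", low_memory=False)

    # Stack the 2010-2019 and 2020 files into a single 2010-2020 dataset.
    df = pd.concat([df_2010s, df_2020], ignore_index=True)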

Cleaning & Processing

Because this dataset contains hourly, daily, and monthly records, we first had to omit the data we didn't need. We made a new dataset containing all of the columns with daily measurements and used .dropna() to remove rows with NaN values. This brought our dataset down from 144,259 rows to 3,873.
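Assuming the LCD convention where the daily summary columns share a "Daily" prefix (e.g., DailyPrecipitation), the filtering might look like:

    # Keep only the daily-summary columns, then drop the hourly/monthly rows,
    # which have NaN in every daily column.
    daily_cols = [c for c in df.columns if c.startswith("Daily")]
    daily = df[daily_cols].dropna()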

Next, we cleaned our data. Three of our columns contained the letter "T", representing trace amounts of snow depth, snowfall, and precipitation. We replaced these values with 0.0, as we did in Lab 2, and ensured that each column contained only floats.
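A sketch of that replacement, assuming the LCD column names for the three trace-valued measurements:

    # "T" marks a trace amount; treat it as 0.0 and cast the column to float.
    for col in ["DailySnowDepth", "DailySnowfall", "DailyPrecipitation"]:
        daily[col] = daily[col].replace("T", 0.0).astype(float)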

We repeated these data cleaning steps on our 2021 testing dataset:
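The same operations, reusing the names from the sketches above:

    daily_2021 = df_2021[daily_cols].dropna()
    for col in ["DailySnowDepth", "DailySnowfall", "DailyPrecipitation"]:
        daily_2021[col] = daily_2021[col].replace("T", 0.0).astype(float)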

Exploratory Analysis

Because all of the remaining data is daily, we renamed the columns to shortened versions of their names (dropping the redundant "Daily" prefix) to improve the readability of the plots we wanted to make.

To explore our data, we first used pairplot() to display scatter plots of each input feature against daily precipitation. These showed that several features have fairly strong relationships with precipitation amounts. The scatter plots are sorted by the coefficients we calculate below. As a sanity check, the last three graphs all relate to wind, which intuitively shouldn't have a large impact on precipitation amounts, and the distributions appear progressively more scattered as the coefficients decrease.
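A sketch of the renaming and the pairplot call; stripping the "Daily" prefix and the resulting "Precipitation" column name are assumptions about our shortened names:

    import seaborn as sns

    # Shorten the column names by dropping the "Daily" prefix, then scatter
    # every feature (x-axis) against precipitation (y-axis) in one row of plots.
    daily = daily.rename(columns=lambda c: c.replace("Daily", ""))
    features = [c for c in daily.columns if c != "Precipitation"]
    sns.pairplot(daily, x_vars=features, y_vars=["Precipitation"])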

To quantify these relationships, we fit a basic LinearRegression() model and examined its coefficients.

Next, we displayed the coefficients, sorted by absolute value, in a bar graph and in a table for comparison.
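A sketch of fitting the model and displaying its coefficients, with column names following the renaming above:

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    X = daily.drop(columns=["Precipitation"])
    y = daily["Precipitation"]
    lr = LinearRegression().fit(X, y)

    # Sort the coefficients by absolute value, largest first.
    coefs = pd.Series(lr.coef_, index=X.columns)
    coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
    print(coefs.to_frame("coefficient"))  # table
    coefs.plot.bar()  # bar graph
    plt.ylabel("regression coefficient")
    plt.show()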

This shows us that daily average station pressure and sea level pressure have the strongest relationships with daily precipitation; the remaining columns show only slight relationships.

Next, we used PCA to graph the explained variance ratio, i.e., the percentage of variance explained by each of the components.
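A sketch of the PCA step; standardizing the features first is an assumption, since PCA is sensitive to feature scale:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    pca = PCA().fit(StandardScaler().fit_transform(X))

    # Cumulative share of variance explained by the first k components.
    plt.plot(range(1, X.shape[1] + 1),
             pca.explained_variance_ratio_.cumsum(), marker="o")
    plt.xlabel("number of components")
    plt.ylabel("cumulative explained variance ratio")
    plt.show()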

This shows us that the cumulative explained variance levels off at around 9 or 10 components, after which additional components contribute little. Combining this with our coefficient analysis, it seems likely that training our model on the ten columns with the highest regression coefficients could improve its accuracy.

Evaluation Metrics & Baseline Performance

We decided to make our baseline a model that always predicts the mean amount of daily precipitation. We created this baseline model and calculated its R2, MAE, and RMSE scores to give us an idea of what we wanted our final model to outperform.
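A sketch of this baseline using scikit-learn's DummyRegressor, reusing X and y from above; the train/validation split parameters are assumptions:

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # The baseline always predicts the training-set mean.
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    for name, X_, y_ in [("train", X_train, y_train), ("val", X_val, y_val)]:
        pred = baseline.predict(X_)
        print(f"{name}: R2={r2_score(y_, pred):.2f}  "
              f"MAE={mean_absolute_error(y_, pred):.2f}  "
              f"RMSE={np.sqrt(mean_squared_error(y_, pred)):.2f}")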

The R2 score is the percentage of variance in the response variable that is explained by the model. MAE, or mean absolute error, is the mean of the absolute differences between the true and predicted values for each data point. RMSE, or root mean squared error, is the square root of the average squared difference between the predicted and actual values; due to the squaring, RMSE gives higher weight to large errors.
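In symbols, with y_i the observed values, \hat{y}_i the predictions, \bar{y} the mean of the observed values, and n the number of data points:

    R^2  = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    MAE  = \frac{1}{n} \sum_i |y_i - \hat{y}_i|
    RMSE = \sqrt{\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2}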

The mean daily precipitation is 0.1159 inches, which is our baseline prediction.

Because we predict the mean, the R2 score of 0.0 on our training set makes sense: variance is defined as the deviation of the observations from the mean, so a constant mean prediction explains none of it. The negative score on our validation set shows that the baseline performed worse there, because the mean daily precipitation in the validation set differs from the mean of the training set. On the training set, MAE = 0.15 means that the baseline's predictions are, on average, 0.15 away from the true values; on the validation set, the average distance from the ground truth is 0.16. The RMSE, the square root of the average squared difference between the ground truth and the baseline predictions, is 0.49 on the training set and 0.55 on the validation set; because of the squaring, it penalizes larger residuals more heavily.

Initial Model Evaluation Environment

We created an initial evaluation environment. It takes in X and y, fits both a baseline regression model and a simple linear regression model, and prints our evaluation metrics for the training and validation sets.
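A sketch of such an environment; the function name evaluate and the split parameters are assumptions:

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    def evaluate(X, y):
        """Fit a mean baseline and a linear regression, then print
        R2, MAE, and RMSE on the training and validation splits."""
        X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
        for name, model in [("baseline", DummyRegressor(strategy="mean")),
                            ("linear regression", LinearRegression())]:
            model.fit(X_train, y_train)
            for split, X_, y_ in [("train", X_train, y_train),
                                  ("val", X_val, y_val)]:
                pred = model.predict(X_)
                print(f"{name} ({split}): R2={r2_score(y_, pred):.2f}  "
                      f"MAE={mean_absolute_error(y_, pred):.2f}  "
                      f"RMSE={np.sqrt(mean_squared_error(y_, pred)):.2f}")

    evaluate(X, y)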

This shows that a basic linear regression model on our data has a coefficient of determination of 40% on the training set and 35% on the validation set. The model's MAE is 0.11 on the training set and 0.13 on the validation set, both lower than the baseline's MAE scores, and the same holds for its RMSE scores of 0.42 and 0.49.

Based on our exploratory analysis, we will next try training on the ten columns with the highest regression coefficients to improve our model. The initial linear regression model's scores leave room for improvement, so we also want to try other strategies to improve them and potentially try other models.