Predicting Bat Speed

CSCI 0451 Final Project: Create a machine learning model that predicts a player’s bat speed using Recursive Feature Elimination, Linear Regression, and Random Forest Regression models.
Author

Cece Ziegler, David Byrne, Julia Fairbank, Sam Ehrsam

Published

May 14, 2023

Source Code

Here are the two links to the source code for our linear regression/RFE and random forest models.

https://github.com/CeceZiegler1/ML_Final_Proj/blob/main/LinearRegressionAnalytic.py

https://github.com/CeceZiegler1/ML_Final_Proj/blob/main/RandomForestRegressor.py

Abstract and Overview of Significance of Topic

Bat speed is a measure of player performance that has become increasingly popular for player development over the past decade. It is typically measured as the speed at which the sweet spot of the bat (about 6 inches from the end of the barrel) is traveling when contact is made with the baseball. Its popularity stems from its high correlation with exit velocity and, in turn, with hitting metrics such as expected batting average (xBA) and expected weighted on-base average (xWOBA). These modern metrics are used to quantify a player’s ability to contribute offensively. The growing interest in bat speed has led to a related increase in training methodologies built around developing it. Coaches across all levels of play use bat speed as a key performance indicator (KPI) to validate and inform individualized coaching decisions.

Formal Introduction to Topic

For our final project, we are using data from Driveline Baseball Research’s Open Biomechanics Project (OBP). Our data captures the biomechanical breakdown of a baseball player’s swing by measuring the forces produced by different body parts in three dimensions at exact instants in time. For example, at instance X, player 103’s 2nd swing has a max lead hip force in the y direction of 111.73. The data was captured using a combination of ground force plates, a marker motion capture system, bat speed sensors, and ball flight monitors. The available data is rather robust, covering a wide range of measurements that could contribute to a baseball swing.

For our project, our goal is to create a machine learning model that uses this OBP data to identify the most important features of a player’s swing when generating bat speed, and then use those features to accurately predict a player’s bat speed. By comparing an athlete’s true bat speed to the bat speed predicted by our model, the player could identify how efficiently they are swinging. We hope that this model could be used by baseball players and coaches to address the unique aspects of a player’s swing that could contribute to a higher bat speed, which, in turn, would help players reach their fullest potential based on where their inefficiencies lie.

Our project can be broken down into two main processes: (1) identifying the key features that contribute to bat speed, and (2) creating a model that uses those key features to predict bat speed. For the first step, we have decided to run Recursive Feature Elimination (RFE) on the 60 potential features from our dataset to pull out a smaller number of strong predictive features to use in our model. Next, using those selected features, we will run a regression analysis to create a model that can be used to predict a player’s bat speed. Let’s take a closer look at these analyses.

Values Statement

This project will mainly be utilized by coaches or anybody concerned with player development in baseball. The information that our model would provide to coaches would allow them to make better coaching decisions. In addition to helping coaching staffs, the players themselves would also benefit from more directed training and better evaluation standards. The baseball offseason is very short, so being able to make the most of this time is extremely valuable.

This dataset is the first of its kind to be released as open source. When open-source pitch physics data first became available, it fundamentally changed the way in which baseball was viewed and played. This information allowed both pitchers and hitters to have reliable and measurable feedback on every pitch. The availability of pitch-level biomechanics data has the potential to once again fundamentally change baseball. By working on this project, we are contributing to the larger effort of the baseball community to better understand exactly what makes a baseball swing productive.

Materials and Methods:

Data

From the OBP dataset, we will be focusing on baseball-hitting data, specifically force plate and joint velocity data, to predict bat speed. The original datasets can be found here. Driveline Baseball Research collected this data using a combination of ground force plates, a marker motion capture system, bat speed sensors, and ball flight monitors. Originally, both the force plate and joint velocity datasets had over 1,000,000 observations, with each individual swing including thousands of observations because the swing was broken down by milliseconds. We felt it was unnecessary to keep the time aspect of the dataset, as the values recorded for each feature variable were very similar from millisecond to millisecond, and the large datasets were difficult to work with. To get rid of the time component and obtain a more reasonably sized dataset, we found the minimum, maximum and range of each feature variable in the dataset for every swing.

Each swing is labeled by session_swing in our dataset, and each row is a different swing. The session_swing label combines the player ID and swing number; for example, session_swing 111_3 is player 111’s third swing. Not all players have the same number of swings in the dataset, but we don’t expect this to have a meaningful impact on our results. After eliminating the time aspect, each swing has 60 potential feature variables. These include the min, max and range of the forces produced by many different body parts in the x, y and z directions during a player’s swing. Some examples include lead shoulder, which is the player’s shoulder closest to the pitcher, and rear hip, which is the player’s hip furthest from the pitcher.

Our data has some limitations, as it exclusively represents male baseball players and doesn’t include any data from female softball players. We think it would be interesting for Driveline Baseball Research to expand to softball to eliminate some of the gender bias they have inadvertently caused.

Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a recursive algorithm used to find the most impactful features in a large feature set. It works by training a machine learning model on all of the features and then removing the least impactful ones. This process is repeated on the smaller feature set until it reaches the desired size. Reducing the feature set can help prevent overfitting and makes the model easier to train and use. RFE does require that the underlying model provide some measure of each feature’s effect, such as coefficients or feature importances, so it won’t work with every model unless such a measure is constructed for it. Another drawback is that, unless proper testing is done to determine how many features are truly impactful, removing too many features can cost more accuracy than is gained by avoiding overfitting.
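As a rough illustration of the loop described above, the sketch below runs scikit-learn’s RFE with a plain linear model on synthetic data. The data, the choice to keep three features, and the variable names are assumptions made for this example, not values from our project.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 200 rows, 10 features,
# of which only features 0, 3 and 7 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 2.0 * X[:, 7] + rng.normal(scale=0.1, size=200)

# RFE fits the estimator, drops the feature with the weakest coefficient
# (step=1 removes one feature per round), and repeats until 3 remain.
selector = RFE(LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the kept features
print(selected)
```

Because the three informative features have much larger coefficients than the noise features, they are the ones that survive elimination here. On real data the features should be on comparable scales for this ranking to be meaningful.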

Multiple Linear Regression

Because our project is concerned with predicting bat speed, we require a numeric prediction model rather than a classification model. We decided to use Multiple Linear Regression, which allows us to take two or more features in a dataset to predict a single numeric dependent variable, bat speed. Multiple Linear Regression differs from simple linear regression in that it uses more than one feature to predict the target variable. Once the model is built, we can isolate each individual feature and evaluate its impact on the target variable.

With our linear regression model, we will be using the Mean Squared Error (MSE) loss function to determine the accuracy and performance of our model.
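To make the idea concrete, here is a minimal sketch of a least-squares fit and the MSE loss using NumPy. The function names and the toy data are assumptions for this example; this is not the code from our LinearRegressionAnalytic.py.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return weights (last entry is the intercept) that minimize squared error."""
    X_aug = np.column_stack([X, np.ones(len(X))])  # append a bias column
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w

def predict(X, w):
    """Apply the fitted weights to new rows."""
    return np.column_stack([X, np.ones(len(X))]) @ w

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy data (made up): the target is exactly 2*x1 - 1*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0

w = fit_least_squares(X, y)
print(np.round(w, 3))         # weights for x1, x2 and the intercept
print(mse(y, predict(X, w)))  # close to zero because the data is noise-free
```

Each weight can then be read as the change in the predicted target for a one-unit change in that feature, holding the other features fixed.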

Random Forest Regression

Random Forest Regression is a technique that builds multiple decision trees and averages their outputs to give a final prediction, which often achieves high predictive accuracy. The user first chooses the number, n, of decision trees to build. For each tree, the bootstrapping method is used to randomly select k data samples from the training set. (Bootstrapping is the process of repeatedly drawing random samples from a dataset with replacement; a result is computed on each sample, and the results are then averaged, which tends to give a more stable estimate.) A decision tree is then fit to each of the n bootstrap samples. To predict on unseen data, each tree produces its own regression output, and these outputs are averaged together to return the final regression output. Random Forests generally provide high accuracy and are useful for feature selection because the trees are largely uncorrelated: each is built from a different bootstrap sample and considers random subsets of the features.
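The two ingredients described above, bootstrap sampling and averaging, can be sketched in a few lines of NumPy. The sample size and the three tree outputs below are made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# One bootstrap sample: draw n row indices with replacement.
boot = rng.choice(indices, size=n, replace=True)

# Some rows appear several times and others not at all ("out-of-bag" rows).
unique_fraction = len(np.unique(boot)) / n
print(f"fraction of distinct rows in the sample: {unique_fraction:.2f}")

# Averaging step: the forest's prediction is the mean of the tree predictions.
tree_predictions = np.array([71.2, 69.8, 70.5])  # made-up outputs of 3 trees
forest_prediction = tree_predictions.mean()
print(forest_prediction)
```

For large n, a bootstrap sample contains roughly 63% of the distinct rows, which is why each tree sees a noticeably different view of the training set.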

Variable Overview

The features in our dataset fall into two main categories: biomechanics data and force plate data. Beginning with the biomechanics data, we have a set of joints and their associated angular velocities in three planes of motion. We have information on lead hip, lead shoulder, rear hip, rear shoulder, pelvis, torso, and the torso-pelvis complex. For each of these joints, we calculated the range of velocities and maximum velocities for each swing captured. With the force plate data, the lead foot and rear foot are independently observed, and the data is split among the three planes of motion along the x, y, and z axes. For each pairing of foot and plane of motion, we calculated the minimum, maximum, and range of the force produced.

The Process

Step 1: Cleaning the Data

Our original dataset contained over 1,000,000 observations that were grouped by session_swing. Each swing contained hundreds of observations that analyzed a variety of features over time (~0.0028 seconds between captures). For our project, we wanted to remove this time aspect and instead create a simplified dataset that contained the minimum, maximum, and range values of the features of interest for each swing.

To do so, we imported our dataset in R and grouped it by the session_swing variable, and, by using base R functions, calculated the minimum, maximum, and range of each feature variable of interest. We repeated this for the force plate dataset and joint velocity dataset, then used a left join to combine the two datasets into a single dataset with all potential predictive features for each session_swing.

We then added our target vector, max bat speed, from POI_metrics.csv to create our full dataset that includes our target vector.

This process allowed us to reduce the size of our dataset from over 1,000,000 observations to 638 session_swings.
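We did this step in R, but the same aggregation can be sketched in pandas. The column names (lead_hip_x, rfx, max_bat_speed) and all values below are hypothetical stand-ins, not rows from the OBP files.

```python
import pandas as pd

# Hypothetical time-series rows: several frames per swing.
frames = pd.DataFrame({
    "session_swing": ["111_1", "111_1", "111_1", "111_2", "111_2"],
    "lead_hip_x": [10.0, 40.0, 25.0, 5.0, 30.0],
    "rfx": [-20.0, -5.0, -12.0, -8.0, -1.0],
})

features = ["lead_hip_x", "rfx"]
grouped = frames.groupby("session_swing")[features]

# min, max and range (max - min) of every feature, one row per swing
summary = pd.concat(
    [
        grouped.min().add_prefix("min_"),
        grouped.max().add_prefix("max_"),
        (grouped.max() - grouped.min()).add_prefix("range_"),
    ],
    axis=1,
).reset_index()

# Hypothetical target table, attached with a left join on the swing identifier
targets = pd.DataFrame({
    "session_swing": ["111_1", "111_2"],
    "max_bat_speed": [70.1, 68.4],
})
dataset = summary.merge(targets, on="session_swing", how="left")
print(dataset)
```

The result has one row per session_swing, with the time dimension collapsed into summary columns and the target in the last column.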

Step 2: RFE Feature Selection

We used SKLearn’s RFE feature selection class, which can be found here.

The RFE model from the SKLearn class allowed us to gain the base understanding we needed to implement our own version of RFE. After reading through the API and playing around with the RFE feature from SKLearn, we decided to implement our own version of RFE to use with the linear regression model we also implemented. Our RFE function takes in three parameters: a feature matrix, X, a target vector, y, and the number of features we want to be selected, k. The function uses a nested for loop to run through all values i in 1:k and, at each iteration, j, checks the weight of all the remaining features that were fit on the linear regression model. From here, the best features are selected as the features with the minimum absolute value of weight. We use this function in conjunction with our linear regression model to find the number of features within our dataset that most accurately predicts the bat speed of a player.
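For readers who want to see the general shape of such a function, here is a minimal sketch of backward elimination with the same (X, y, k) signature. It is not our rfe function: the name rfe_sketch, the use of np.linalg.lstsq, and the rule of dropping the feature with the smallest absolute weight each round are assumptions based on the standard form of the algorithm.

```python
import numpy as np

def rfe_sketch(X, y, k):
    """Return the column indices of the k features kept by backward elimination."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > k:
        # Fit least squares on the remaining features plus an intercept column.
        X_sub = np.column_stack([X[:, remaining], np.ones(len(X))])
        w, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
        # Ignore the intercept (last weight) and drop the weakest feature.
        weakest = int(np.argmin(np.abs(w[:-1])))
        remaining.pop(weakest)
    return remaining

# Synthetic check (made-up data): only features 2 and 5 drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 5.0 * X[:, 2] - 4.0 * X[:, 5] + rng.normal(scale=0.1, size=300)

kept = rfe_sketch(X, y, 2)
print(kept)
```

Comparing weights this way only makes sense when the features are on similar scales, so standardizing the columns first is usually a good idea.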

Step 3: Building the Models — Multiple Linear Regression and Random Forest

We decided to implement our own linear regression model similar to the work we did in class, selecting just the analytic version. We were able to pull out score and predict functions from our previous blog post implementation. We had to modify our fit function by adding a regularization term when solving for the weight vector to help avoid overfitting the model.

We also decided to build our own random forest model. To do this, we used a fit, predict and score method. In the fit method, we performed bootstrapping, in which at each iteration an observation is selected at random, with replacement. We then built the decision tree and recursively called our method. To get our predictions, we took the average of the predictions from all of the trees.
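The fit, predict and score structure can be sketched as follows. This is not our RandomForest class: it borrows scikit-learn’s DecisionTreeRegressor for the individual trees instead of building them recursively, it omits the per-split feature subsampling of a full random forest, and the class name and synthetic data are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class BaggedTreesSketch:
    """Bootstrap-aggregated regression trees with fit, predict and score."""

    def __init__(self, n_trees=25, max_depth=None, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        self.trees = []
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)  # sample rows with replacement
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=0)
            self.trees.append(tree.fit(X[rows], y[rows]))
        return self

    def predict(self, X):
        # Average the predictions of all trees.
        return np.mean([t.predict(X) for t in self.trees], axis=0)

    def score(self, X, y):
        # Coefficient of determination (R^2).
        resid = y - self.predict(X)
        return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

# Synthetic non-linear data (made up for the example)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = 3.0 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

model = BaggedTreesSketch(n_trees=25).fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(round(test_score, 3))
```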

Step 4: Testing the Models: Linear Regression and Random Forest

To train and test the models, we used an 80/20 train/test split. For both models, we ran a loop to show the training and testing scores while increasing the number of features selected by recursive feature elimination. Once we identified the optimal range of features to use on the Multiple Linear Regression and Random Forest Regression models, we created subsets of our training and testing data containing only the selected features. Finally, we trained and tested our models on the data subsets (80% train/20% test).

Results and Conclusion

Linear Regression

The plot below shows the effect of increasing the number of features on the accuracy scores produced by the Linear Regression model during RFE. As we can see, our training and testing accuracy scores tend to increase as the number of features increases.

We noticed that our model tends to have a higher testing accuracy when it is run on fewer features, and has the best training score when it uses all 60 features. However, because we wanted to identify 10-20 key features, we decided to train and test our model on the 13 best features, the feature count that yields the second-best testing score.

The table shows the 13 most important features as selected by our RFE function. These results were quite unexpected, including values like min_rfx and min_rfy as selected variables of importance. min_rfx and min_rfy represent the minimum rear foot force produced in the x and y directions, which essentially represent the load, or backwards movement, prior to the actual swing itself. Other variables make sense as being some of the most important features, such as range_lead_hip_z and max_torso_pelvis_x, as these are body parts that are essential in creating the rotational force of a swing to help produce a better bat speed.

Using this subset of 13 features, we trained and tested our model, which produced a testing accuracy score of 38.7%. Unfortunately, this low accuracy indicates that our model isn’t performing as we had hoped. It could mean that our features don’t have a strong linear relationship with bat speed, which would indicate that there is nothing wrong with our model, but rather that it isn’t the best model option for our data. Because of this, we decided to see if we could produce stronger results by using the Random Forest Regression model.

from LinearRegressionAnalytic import LinearRegressionAnalytic
from LinearRegressionAnalytic import rfe
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv("biomechanics_dataset_v1.csv")

np.random.seed(1)
X = data.iloc[:, 1:60]  # feature columns
y = data.iloc[:, 60]    # target column (bat speed)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.2)


val_scores = []
train_scores = []
for k in range(60):
    selected_features = rfe(x_train, y_train, k)
    # boolean mask that keeps only the selected features
    feature_mask = np.zeros(x_train.shape[1], dtype=bool)
    feature_mask[selected_features] = True
    # subset x train and test to include only the selected features
    X_train_selected = x_train.loc[:, feature_mask]
    X_test_selected = x_test.loc[:, feature_mask]
    lr = LinearRegressionAnalytic()
    # fit the model on the selected features
    lr.fit(X_train_selected, y_train)
    # record the testing and training scores
    val_scores.append(lr.score(X_test_selected, y_test))
    train_scores.append(lr.score(X_train_selected, y_train))

# plot the results
plt.plot(range(1, 61), val_scores, label='Testing accuracy')
plt.plot(range(1, 61), train_scores, label='Training accuracy')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Accuracy Scores Produced by Linear Regression Model During RFE")
plt.show()


feat_select = rfe(x_train, y_train, 14)
feat_select
# column positions of the 13 features used for the final linear model
lin_features = [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]
display(data.iloc[:, lin_features])


x_train_lin = x_train.iloc[:, lin_features]
x_test_lin = x_test.iloc[:, lin_features]

# refit a fresh model on the 13-feature subset
lr = LinearRegressionAnalytic()
lr.fit(x_train_lin, y_train)
print("Testing Accuracy of Subset:", lr.score(x_test_lin, y_test))

max_rear_shoulder_x max_rear_shoulder_y range_lead_hip_z max_rear_shoulder_z max_torso_y range_lead_shoulder_z max_torso_pelvis_x min_rfx min_rfy range_rear_shoulder_y range_torso_pelvis_x max_lead_hip_z max_pelvis_y
0 550.0243 514.1198 778.1339 1188.6807 182.7302 867.6362 182.4743 -232.2776 -88.0097 689.2249 233.0842 384.3450 88.3858
1 638.6019 535.2822 960.4793 1278.5380 196.9712 1054.5098 236.0902 -189.7241 -106.2254 812.9988 306.7874 520.8627 106.4238
2 580.0406 472.9189 784.0413 1588.7207 248.4432 988.9415 222.8233 -124.4299 -84.5785 708.1030 313.2967 433.6955 82.5397
3 635.8561 484.2663 1036.2757 888.1270 166.9048 1472.7250 168.7606 -175.8547 -122.1629 732.5588 228.6738 489.4716 81.4764
4 566.9714 502.2202 1093.3019 1487.6143 191.2448 1130.6572 220.7400 -219.5387 -72.5831 699.1772 286.4758 597.7220 75.9968
... ... ... ... ... ... ... ... ... ... ... ... ... ...
632 631.5529 488.3580 980.3030 1575.2948 165.8830 1150.6032 147.9856 -114.1301 -173.2356 804.6660 239.9022 354.5130 124.7927
633 571.2316 477.7701 748.3298 1604.9299 145.5400 1026.0944 188.9410 -113.4915 -157.5923 735.1128 276.8293 324.9995 137.1521
634 549.3600 407.3251 526.3367 1393.4961 128.0184 1029.3547 257.2261 -112.7565 -111.9854 584.3304 348.2130 207.2101 128.8111
635 623.2650 463.8467 1248.0062 1715.0544 136.8013 892.8699 177.4202 -122.3425 -161.2802 725.1355 266.6244 282.0038 157.1024
636 599.2501 505.9937 1433.3273 1480.4099 143.7898 1233.8176 169.0549 -165.5618 -132.0637 700.1916 234.5590 500.9032 107.8579

637 rows × 13 columns

Testing Accuracy of Subset: 0.386607442390802

Random Forest Regression

Below is a graph that displays the training and testing scores produced by the Random Forest Regression model when run with RFE over all of the features. As we can see from the graph, this model performed much better than the Linear Regression model. Unlike the Linear Regression model, whose scores varied considerably across feature counts, the Random Forest model reached a peak around 10 features and stayed consistent at that score.

Because the Random Forest Regression model creates a series of trees using the bootstrapping method, we expected the model to have a better training accuracy and weren’t as concerned with overfitting. We also expected it to have a better accuracy score than the Linear Regression model because it can capture non-linear relationships, which we expect our data to have given the poor performance of the Multiple Linear Regression model. Additionally, because of the way the trees are built, the random forest model is less likely to be heavily affected by outliers in the data, which should allow it to have a better testing accuracy score.

We found that the Random Forest Regression model had higher training and validation scores than the Linear Regression model. The training score reached nearly 100% when run with more than 10 features. The testing accuracy reached around 65% at 10 features and increased slightly as the number of features increased. To compare this model with the Linear Regression model, we chose the 15 most important features to test and train the Random Forest model. After training and testing our model on the subset of 15 features, we got a testing accuracy score of 55.4%, which is significantly better than our multiple linear regression model. Considering there is no option for our model to randomly guess since we are predicting a continuous numeric value, we are satisfied with the amount our random forest model learned.
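One way to give a regression score a baseline is the coefficient of determination (R^2), which is what scikit-learn regressors return from score. Whether our custom score methods compute exactly this quantity is an assumption here, and the bat speeds below are made up. Under R^2, always predicting the mean of the target scores 0 and a perfect prediction scores 1.

```python
import numpy as np

def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([60.0, 65.0, 70.0, 75.0])     # made-up bat speeds
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean

print(r2_score_sketch(y_true, baseline))  # 0.0
print(r2_score_sketch(y_true, y_true))    # 1.0
```

Read this way, a score of 0.55 would mean the model explains a little over half of the variance in bat speed that a mean-only predictor leaves unexplained.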

Interestingly, there is only one feature that the Linear Regression model and the Random Forest Regression model both selected: max_rear_shoulder_y. This feature captures the top hand on the bat, so if the shoulder isn’t moving in the swing, it will hinder the ability to produce enough rotational force from the torso.

We were surprised to find that, of the 13 features selected by the Linear Regression model and the 15 features selected by the Random Forest Regression model, there weren’t more similarities between the two sets. We were hoping to discover a few select features that both models identified as important, and were interested to find that the majority of the selected features were different.
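The overlap can be checked directly with a set intersection. The names below are copied from the column headers of the two feature tables displayed in this post.

```python
# Features shown in the Linear Regression table
linear_features = {
    "max_rear_shoulder_x", "max_rear_shoulder_y", "range_lead_hip_z",
    "max_rear_shoulder_z", "max_torso_y", "range_lead_shoulder_z",
    "max_torso_pelvis_x", "min_rfx", "min_rfy", "range_rear_shoulder_y",
    "range_torso_pelvis_x", "max_lead_hip_z", "max_pelvis_y",
}

# Features shown in the Random Forest table
forest_features = {
    "range_lead_hip_x", "range_pelvis_y", "range_rear_shoulder_x",
    "range_torso_z", "range_torso_pelvis_z", "max_lead_shoulder_z",
    "max_pelvis_z", "max_rear_shoulder_y", "max_torso_z", "max_rfz",
    "max_lfx", "range_rfy", "range_rfz", "range_lfy",
}

shared = linear_features & forest_features
print(sorted(shared))  # ['max_rear_shoulder_y']
```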

We hypothesize that the dimensionality of the forces produced by each body part may be a factor that contributes to the challenge of distinguishing significant features. For example, in the Linear Regression model, max_torso_y was selected as an important feature, and in the Random Forest Regression model, max_torso_z was selected as an important feature. This suggests that the max force produced by the torso is important when creating and predicting bat speed, regardless of direction. If we were to run this experiment again, we may try getting rid of the x, y and z components and just use the average of the forces produced by that body part.
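A minimal sketch of that idea in pandas is shown below. The values are made up, and max_torso_x is assumed to exist alongside the max_torso_y and max_torso_z columns mentioned above. The sketch shows both the simple average and a vector magnitude as two possible ways to collapse the axes. Because per-axis maxima can occur at different instants of the swing, either combination is only an approximation of the true resultant.

```python
import numpy as np
import pandas as pd

# Made-up per-swing values for one body part on three axes
swings = pd.DataFrame({
    "max_torso_x": [3.0, 6.0],
    "max_torso_y": [4.0, 8.0],
    "max_torso_z": [12.0, 24.0],
})

axes = ["max_torso_x", "max_torso_y", "max_torso_z"]

# Option 1: average of the three components
swings["max_torso_mean"] = swings[axes].mean(axis=1)

# Option 2: Euclidean magnitude of the three components
swings["max_torso_magnitude"] = np.sqrt((swings[axes] ** 2).sum(axis=1))

print(swings)
```

Either option would reduce the three torso columns to one, cutting the feature count roughly by a factor of three before RFE is run.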

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

model = RandomForestRegressor()
scores_test = []
scores_train = []

# run scikit-learn's RFE for every target number of features
for x in range(60):
    estm = RFE(model, n_features_to_select=x + 1, step=1)
    estm.fit(x_train, y_train)
    scores_test.append(estm.score(x_test, y_test))
    scores_train.append(estm.score(x_train, y_train))

# entry i of each score list was fit with i + 1 selected features
x_val = list(range(1, 61))

plt.plot(x_val, scores_test, label='Testing accuracy')  # Plot the chart
plt.plot(x_val, scores_train, label='Training accuracy')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy Score")
plt.title("Random Forest Regression")
plt.legend()
plt.show()

from RandomForestRegressor import RandomForest
rf = RandomForest()

# column positions of the features used for the final random forest model
rf_features = [1, 8, 13, 18, 21, 26, 29, 34, 38, 46, 48, 55, 56, 58]
x_train_rf = x_train.iloc[:, rf_features]
x_test_rf = x_test.iloc[:, rf_features]

rf.fit(x_train_rf, y_train, 1000, 500)
print(rf.score(x_test_rf, y_test))

feat_select = rfe(x_train, y_train, 14)
feat_select
display(data.iloc[:, rf_features])
0.5541368058167476
range_lead_hip_x range_pelvis_y range_rear_shoulder_x range_torso_z range_torso_pelvis_z max_lead_shoulder_z max_pelvis_z max_rear_shoulder_y max_torso_z max_rfz max_lfx range_rfy range_rfz range_lfy
0 590.6812 371.0611 838.0101 848.3957 743.5585 617.1386 733.6451 514.1198 775.7749 1101.3711 121.2052 240.6389 1052.5648 545.3695
1 536.1970 393.4254 947.9660 814.2556 642.8480 751.1699 799.8748 535.2822 775.3766 1092.3006 111.2187 297.5680 1040.9895 603.8510
2 586.8320 396.8130 801.1592 823.2495 853.6754 723.6880 740.7065 472.9189 793.0441 1117.9434 178.4852 351.2961 1002.5322 599.4007
3 628.4384 402.3244 958.8471 870.6640 541.5395 810.9479 741.3719 484.2663 819.9890 1102.4140 170.5486 344.0314 1094.4491 617.4273
4 595.3172 348.1626 840.4242 809.9368 756.6446 862.6313 770.4950 502.2202 774.5865 1119.0327 176.3782 262.0008 1100.7345 597.7741
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
632 825.4631 350.8260 1015.7101 919.6701 587.4149 838.6431 679.5463 488.3580 859.0192 947.5325 69.2794 380.5133 916.4106 464.0069
633 768.7166 377.7951 1020.7830 910.0289 716.7429 799.7780 715.3288 477.7701 846.6447 958.0700 60.6210 365.7219 931.9138 471.4245
634 667.8735 366.3885 978.3792 880.3159 516.4786 815.2906 701.0455 407.3251 820.1794 998.6667 56.2369 331.5945 954.3035 441.1844
635 698.0434 390.7154 1029.8385 896.8795 616.6955 604.2217 681.9455 463.8467 838.6638 939.1254 67.6610 383.2294 909.9346 533.1048
636 764.1730 348.0633 1024.9879 907.0956 606.3088 830.5895 677.4534 505.9937 846.7994 935.7064 91.0192 346.1590 916.5282 403.0329

637 rows × 14 columns

Concluding Discussion

Overall, our project was successful, as we built two models and an RFE function to help us determine the most important features of a swing while predicting bat speed. When we formulated the idea for our project, our goal was to deliver Python source code that we constructed, along with two Jupyter notebooks: one that contained our initial exploration, and one that held our final write-up and experiments. We were successful in meeting this goal, as we finished with more than two Jupyter notebooks, and two Python source code files that contained our linear regression with RFE and our random forest model.

If we had more time, we would work on finding ways to improve the accuracy of our models. One idea we have to improve the accuracy score is to get rid of the dimension factor, as we mentioned above, so each biomechanical force only has one representation instead of three. We hope this would help our models narrow down the important features and produce a better accuracy score. Additionally, we would like to bring in more data to train and test our model on. Driveline Baseball technology is relatively new, so the data is sparse. If the technology were more accessible, we could have more data, which would allow our model to improve its success.

Bias

We created a model that predicts a player’s bat speed with approximately 65% accuracy, using around 10 features. As with any algorithm, we must consider any assumptions or biased data that could have resulted in systematic error throughout our process. Because our training data came from Driveline Research and Development, our model was only trained on players with close proximity or eligibility to their biomechanics facility. According to the U.S. Census, 43.3% of the population of Kent, Washington (home of Driveline R&D) are white individuals, contrasted with 12.2% of the population being black individuals. While the data’s goal is to highlight specific bodily forces and movements that contribute to predicting bat speed rather than demographics like race, age, height, or weight, we must acknowledge that this data is most likely skewed toward white individuals and could be less accurate in predicting the bat speed of players of different races.

Additionally, we must highlight that baseball is a male-dominated sport, with the rare exception of a few women playing the sport — see article on Alexis Hopkins. While sports are typically gender segregated for the sake of “fairness” and an acknowledgment that male and female bodies are inherently different and will perform as such, factors like absolute strength and size are not as important in the sport of baseball, as they might be in gender-segregated sports like football and soccer. Rather, the Women’s Sports Foundation explains that baseball involves skills that are combinations of timing, coordination, strength, knowledge of the game, strategies, and control, and argues that bat speed and bat control are more important than absolute strength.

Yet, despite all of this, the Driveline R&D data only contains the biomechanics of male batters. Therefore, if our model were to be improved and implemented, it would only perpetuate the segregation of men and women in this sport. If the data to improve a player’s bat speed can only improve male players, women will continue to be left in the dust.

Driveline Baseball describes its mission as developing the future of baseball training by finding industry-leading insights; all while publishing those insights back to the baseball community where they belong. However, because baseball is a historically white and male-dominated sport, the “insights” that will be found will only contribute to solidifying that the “baseball community” remains dominated by players that fit those demographics.

It is our duty to expand this research and development into more marginalized player communities, such as female athletes and athletes of other races. Then, we can use these insights to create unique training programs that empower and embrace their unique features and help them become the best athletes they can be.

Application

Our bat speed model could be used by coaching staffs to better inform the decisions that they make. For example, if a given player’s predicted bat speed is higher than their actual recorded bat speed, this would indicate a mechanical inefficiency in their swing. Coaches, with this knowledge, could then direct their focus to finding these mechanical inefficiencies and correcting them. Players in this group would spend more time on skill acquisition training. On the other hand, if a player has a predicted bat speed which is equal to or lower than their actual bat speed, this would indicate above average efficiency in their swing, and their training could be directed more towards general strength and power to increase force production. In either case, our model would help streamline the process from data collection to impact, giving coaches and players the power to have impactful training sessions tailored to each individual.
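The decision rule described above can be written as a small helper. The function name, the labels, and the example speeds are hypothetical. In practice a coach would likely want a tolerance band around the prediction, since the model’s own error is not negligible.

```python
def training_focus(actual_speed, predicted_speed):
    """Suggest a training emphasis from actual vs. predicted bat speed."""
    if predicted_speed > actual_speed:
        # The swing produces less bat speed than the mechanics suggest.
        return "skill acquisition"
    # The swing is at or above the efficiency the model expects.
    return "strength and power"

print(training_focus(actual_speed=68.0, predicted_speed=72.5))  # skill acquisition
print(training_focus(actual_speed=71.0, predicted_speed=69.0))  # strength and power
```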

Group Contribution Statement

Cece Ziegler: Helped with data cleaning. Built RFE function and Linear Regression model. Performed RFE and model experiments. Led writing “Results and Conclusion” sections.

David Byrne: Introduced topic. Managed data cleaning. Led writing of “Abstract and Overview of Significance of Topic”, “Data”, “Variable Overview”, “Values Statement”, and “Application” sections.

Julia Fairbank: Led writing of “Formal Introduction to Topic”, “Recursive Feature Elimination”, “Multiple Linear Regression”, “Random Forest Regression”, “The Process”, and “Bias” sections.

Sam Ehrsam: Conducted initial RFE experiments with SKLearn library. Built Random Forest Regression model. Performed RFE experiments.

Personal Reflection

I learned a lot in the process of completing our final project. First, I learned about recursive feature elimination and how it can be used in conjunction with different models to select the most important features. It was beneficial to learn how this function worked through the SKLearn package, and it was rewarding to build my own version of the function in our implementation of the linear regression model. I also learned how difficult it can be to find a model that works best for predicting on different data sets. We felt good about our different models going into the building and testing, as we had researched what models worked well for predicting continuous numeric values. However, as we see from our results, our models weren’t super successful. I learned that it can be a long process that involves a lot of trial and error to build a model that performs well on your specific data set. I was hoping to learn more about which features are most important in predicting bat speed to potentially help me with my own swing. However, due to the low accuracy scores and lack of coherence between the features each model pulled as being most important, I didn’t learn as much as I had hoped to in this sense.

I am very satisfied with what I achieved during this project. My goal was to complete all of my tasks on time and help with the implementations and revisions of the project. I met this goal, as I finished all of my parts in a timely manner, and I built the linear regression model and RFE function we used with the linear regression implementation. I was an active participant in all of our group meetings and I helped with various written portions of the final report.

I will carry what I learned from this project into various aspects of my life. First, I think it is important to be able to work in a group, especially in the technology world, and I learned skills for working on a project using GitHub from this final. Additionally, I enhanced my skills in building machine learning models and experimenting with data. I learned skills to determine which models are better suited for which data set, along with ways to pick out the best features. These are skills I foresee myself using in future job opportunities and potential individual projects I have interest in working on. I am very excited about what I learned throughout this course and how I was able to apply and enhance those skills through this final project. Machine learning and data science is something I have a strong interest in pursuing as I move forward with my education and post-grad life, and I am excited to apply the skills I learned throughout this class and during this project.