CSCI 0451 Final Project: Create a machine learning model that predicts a player's bat speed using Recursive Feature Elimination, Linear Regression, and Random Forest Regression models.
Authors
Cece Ziegler, David Byrne, Julia Fairbank, Sam Ehrsam
Published
May 14, 2023
Source Code
Here are the two links to the source code for our linear regression/RFE and random forest model.
Bat speed is a measure of player performance that has become increasingly popular for player development over the past decade. It is typically measured as the speed at which the sweet spot of the bat (about 6 inches from the end of the barrel) is traveling when contact is made with the baseball. Bat speed has grown in popularity due to its high correlation with exit velocity and, subsequently, with hitting metrics such as expected batting average (xBA) and expected weighted on-base average (xwOBA). Metrics such as xBA and xwOBA are modern statistics used to quantify a player's ability to contribute offensively in a positive manner. This rising interest in bat speed has led to a corresponding increase in training methodologies built around developing it. Coaches across all levels of play use bat speed as a KPI to validate and inform individualized coaching decisions.
Formal Introduction to Topic
For our final project, we are using data from the Open Biomechanics Project (OBP) driveline baseball research. Our data captures the biomechanical breakdown of a baseball player’s swing by measuring forces produced by different body parts in three dimensions over exact instances in time, for example, at instance X, player 103’s 2nd swing has a max lead hip force in the y direction of 111.73. The data was captured using a combination of ground force plates, a marker motion capture system, bat speed sensors, and ball flight monitors. The available data is rather robust, accounting for every piece of information that could be responsible for a baseball swing.
For our project, our goal is to create a machine learning model that uses this OBP data to identify the most important features of a player's swing for generating bat speed, and then use those features to accurately predict a player's bat speed. By comparing an athlete's true bat speed to the bat speed predicted by our model, a player could gauge how efficiently they are swinging. We hope that this model could be used by baseball players and coaches to address the unique aspects of a player's swing that could contribute to a higher bat speed, which in turn would help players reach their fullest potential based on where their inefficiencies lie. Our project can be broken down into two main processes: (1) identifying the key features that contribute to bat speed, and (2) creating a model that uses those key features to predict bat speed. For the first step, we decided to run Recursive Feature Elimination (RFE) on the 60 potential features in our dataset to pull out a smaller number of strongly predictive features to use in our model. Next, using those selected features, we run a regression analysis to create a model that can be used to predict a player's bat speed. Let's take a closer look at these analyses.
Values Statement
This project will mainly be utilized by coaches or anybody concerned with player development in baseball. The information that our model would provide to coaches would allow them to make better coaching decisions. In addition to helping coaching staffs, the players themselves would also benefit from more directed training and better evaluation standards. The baseball offseason is very short, so being able to make the most of this time is extremely valuable.
This dataset is the first of its kind to be released as open source. When open-source pitch physics data first became available, it fundamentally changed the way in which baseball was viewed and played. This information allowed both pitchers and hitters to have reliable and measurable feedback on every pitch. The availability of pitch-level biomechanics data has the potential to once again fundamentally change baseball. By working on this project, we are contributing to the larger effort of the baseball community to better understand exactly what makes a baseball swing productive.
Materials and Methods
Data
From the OBP dataset, we will be focusing on baseball hitting data, specifically force plate and joint velocity data, to predict bat speed. The original datasets can be found here. Driveline baseball research collected this data using a combination of ground force plates, a marker motion capture system, bat speed sensors, and ball flight monitors. Originally, both the force plate and joint velocity datasets had over 1,000,000 observations, with each individual swing including thousands of observations because the swing was broken down by milliseconds. We felt it was unnecessary to keep the time aspect of the dataset, as the velocities produced for each feature variable were very similar from millisecond to millisecond, and the large datasets were difficult to work with. To remove the time component and obtain a more reasonably sized dataset, we found the minimum, maximum, and range of each feature variable for every swing. Each swing is labeled by session_swing in our dataset, and each row is a different swing. The session_swing is labeled by player ID and swing number; for example, session_swing 111_3 is player 111's third swing. Not all players have the same number of swings in the dataset, but we don't think this should have any impact on our results. After eliminating the time aspect, each swing has 60 potential feature variables. These 60 feature variables include the min, max, and range of the forces produced by many different body parts in the x, y, and z directions during a player's swing. Some examples include the lead shoulder, which is the player's shoulder closest to the pitcher, and the rear hip, which is the player's hip furthest from the pitcher.
Our data possesses some limitations as it exclusively represents male baseball players and doesn’t include any data from female softball players. We think it would be interesting for Driveline baseball research to expand to softball to eliminate some of the gender bias they have inadvertently caused.
Recursive Feature Elimination
Recursive Feature Elimination (RFE) is a recursive algorithm used to find the most impactful features in a large feature set. This is accomplished by training a machine learning model on all of the features and then removing the least impactful ones. The process is repeated with the smaller feature set until the feature set reaches the desired size. This can help prevent overfitting and makes the model easier to use and train. RFE does require that the model being used to select features has some way to measure the effect of each feature, which means it won't work with every model unless some such measure is supplied. Another drawback is that, unless proper testing is done to find the right number of impactful features to keep, accuracy can be diminished by more than the benefit gained from avoiding overfitting.
Multiple Linear Regression
Because our project is concerned with predicting bat speed, we require a numeric prediction model, rather than a classification prediction model. We decided to use Multiple Linear Regression, which allows us to take two or more features in a dataset to predict a single numeric dependent variable, bat speed. Multiple Linear Regression differs from regular Linear Regression in that you can use more than one feature to predict the target variable. Once built, we can isolate each individual feature and evaluate its impact on the target variable.
With our linear regression model, we will be using the Mean Squared Error (MSE) loss function to determine the accuracy and performance of our model.
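Written out with p selected features per swing, the model and the loss take the standard form below (this is just the textbook formulation, not code from our repository):

$$
\hat{y}_i = b + \sum_{j=1}^{p} w_j x_{ij},
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

where $y_i$ is the measured bat speed of swing $i$, $x_{ij}$ is the value of feature $j$ for that swing, and $w_j$ and $b$ are the learned weights and intercept.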
Random Forest Regression
Random Forest Regression is a technique that creates multiple decision trees and averages their outputs to give a final result that often has a high prediction/classification rate. The process involves the user selecting the number, n, of decision trees to be created, then using the bootstrapping method to randomly select k data samples from the training set. (Bootstrapping is simply the process of randomly selecting subsets from a dataset over a certain number of iterations and a certain number of variables; the results from each subset are then averaged together, which returns a more powerful result.) Next, n decision trees are created, each using a different random bootstrap sample. The decision trees each produce regression predictions that can be applied to unseen data. Finally, those predictions are averaged together, returning a final regression output. Random forests generally provide high accuracy for feature selection, as they generate uncorrelated decision trees built by choosing a random set of features for each tree.
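For intuition, the two choices described above (the number of trees n and the bootstrap sampling) map directly onto parameters of scikit-learn's RandomForestRegressor. The snippet below is only an illustration and uses synthetic data in place of our swing dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for a swing dataset: 638 swings, 60 features
X, y = make_regression(n_samples=638, n_features=60, noise=10, random_state=0)

# n_estimators is the number of trees n; bootstrap=True resamples the training rows
# (with replacement) for each tree, and max_samples controls the bootstrap sample size
forest = RandomForestRegressor(n_estimators=200, bootstrap=True, max_samples=0.8, random_state=0)
forest.fit(X, y)

# each prediction is the average of the individual trees' predictions
print(forest.predict(X[:3]))
```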
Variable Overview
The features with which we created our dataset fall into two main categories: biomechanics data and force plate data. Beginning with the biomechanics data, we have a set of joints and their associated angular velocities in three planes of motion. We have information on the lead hip, lead shoulder, rear hip, rear shoulder, pelvis, torso, and the torso-pelvis complex. For each of these joints, we calculated the range of velocities and maximum velocities for each swing captured. With the force plate data, the lead foot and rear foot are independently observed, and the data is split among the three planes of motion along the x, y, and z axes. For each pairing of foot and plane of motion, we calculated the minimum, maximum, and range of the force produced.
The Process
Step 1: Cleaning the Data
Our original dataset contained over 1,000,000 observations that were grouped by session_swing. Each swing contained hundreds of observations that analyzed a variety of features over time (~0.0028 seconds between captures). For our project, we wanted to remove this time aspect and instead create a simplified dataset that contained the minimum, maximum, and range values of the features of interest for each swing.
To do so, we imported our dataset in R and grouped it by the session_swing variable, and, by using base R functions, calculated the minimum, maximum, and range of each feature variable of interest. We repeated this for the force plate dataset and joint velocity dataset, then used left-join to combine the two datasets to create a conglomerate dataset with all potential predictive features for each session_swing variable.
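We did this cleaning step in R, but the same collapse from millisecond samples to per-swing summaries could be written in pandas along the following lines. The file and column names here are illustrative placeholders, not the exact names in the OBP files:

```python
import pandas as pd

# long-format data: one row per millisecond sample, one column per measured quantity
raw = pd.read_csv("force_plate.csv")  # e.g. columns: session_swing, time, lead_hip_z, rfx, rfy, ...
feature_cols = [c for c in raw.columns if c not in ("session_swing", "time")]

# collapse the time dimension: one row per swing, with the min and max of every feature
summary = raw.groupby("session_swing")[feature_cols].agg(["min", "max"])
summary.columns = [f"{stat}_{col}" for col, stat in summary.columns]

# range = max - min for each feature
for col in feature_cols:
    summary[f"range_{col}"] = summary[f"max_{col}"] - summary[f"min_{col}"]
```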
We then added our target vector, max bat speed, from POI_metrics.csv to create our full dataset, including our target vector.
This process allowed us to reduce the size of our dataset from over 1,000,000 observations to 638 session_swings.
Step 2: RFE Feature Selection
We used SKLearn’s RFE feature selection class, which can be found here.
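A minimal example of that API, run here on synthetic data rather than our swing dataset (the estimator and the feature count are chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=638, n_features=60, noise=10, random_state=0)

# repeatedly drop the least important feature (step=1) until 13 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=13, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask over the 60 features: True = kept
print(selector.ranking_)   # 1 for kept features; higher numbers were dropped earlier
```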
The RFE model from the SKLearn class allowed us to gain the base understanding we needed to implement our own version of RFE. After reading through the API and experimenting with the RFE feature from SKLearn, we decided to implement our own version of RFE to use with the linear regression model we also implemented. Our RFE function takes in three parameters: a feature matrix, X, a target vector, y, and the number of features we want selected, k. The function uses a nested loop: at each iteration it fits the linear regression model on the remaining features, checks the weight of each remaining feature, and eliminates the feature with the minimum absolute value of weight, repeating until only the k strongest features remain. We use this function in conjunction with our linear regression model to find the features within our dataset that most accurately predict a player's bat speed.
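A condensed sketch of what that function does, assuming a pandas feature matrix and a model class with a fit method and a per-feature weight vector w. This is our paraphrase of the idea, not the exact code in the LinearRegressionAnalytic module:

```python
import numpy as np

def rfe_sketch(X, y, k, model_cls):
    """Eliminate the feature with the smallest absolute weight until only k remain."""
    remaining = list(range(X.shape[1]))            # column indices still in play
    while len(remaining) > k:
        model = model_cls()
        model.fit(X.iloc[:, remaining], y)         # refit on the surviving features
        # per-feature weights (slice off any trailing bias term the model may append)
        weights = np.abs(np.asarray(model.w)[:len(remaining)])
        remaining.pop(int(np.argmin(weights)))     # drop the least influential feature
    return remaining                               # indices of the k selected features
```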
Step 3: Building the Models — Multiple Linear Regression and Random Forest
We decided to implement our own linear regression model, similar to the work we did in class, selecting just the analytic version. We were able to reuse the score and predict functions from our previous blog post implementation. We had to modify our fit function by adding a regularization term to the weight vector to help avoid over/under-fitting the model.
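A minimal sketch of an analytic fit with an L2 (ridge) regularization term, roughly in the shape of our class. This is a reconstruction under our own assumptions (the regularization strength alpha, the appended bias column, and the R² score are our choices here), not the exact source code:

```python
import numpy as np

class LinearRegressionSketch:
    """Analytic linear regression with L2 regularization: w = (X^T X + alpha * I)^(-1) X^T y."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def _design(self, X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([X, np.ones(len(X))])   # append a bias column

    def fit(self, X, y):
        X_ = self._design(X)
        I = np.eye(X_.shape[1])
        I[-1, -1] = 0.0                                # leave the bias term unpenalized
        self.w = np.linalg.solve(X_.T @ X_ + self.alpha * I, X_.T @ np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        return self._design(X) @ self.w

    def score(self, X, y):
        # coefficient of determination R^2, one common "accuracy" score for regression
        y = np.asarray(y, dtype=float)
        resid = y - self.predict(X)
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```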
We also decided to build our own random forest model. To do this, we used fit, predict, and score methods. In the fit method, we performed bootstrapping: at each iteration, observations are selected at random, with replacement, from the training data, and a decision tree is built recursively on that sample. To get our predictions, we took the average of the predictions from all of the trees.
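A condensed sketch of that fit/predict/score structure, using scikit-learn decision trees as the base learners. Again, this is a hypothetical reconstruction of the idea rather than the exact source code:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class RandomForestSketch:
    def __init__(self, n_trees=100, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.seed)
        self.trees = []
        for _ in range(self.n_trees):
            # bootstrap: draw len(X) row indices at random, with replacement
            idx = rng.integers(0, len(X), size=len(X))
            self.trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # the forest's prediction is the average of the individual trees' predictions
        return np.mean([t.predict(np.asarray(X)) for t in self.trees], axis=0)

    def score(self, X, y):
        y = np.asarray(y)
        resid = y - self.predict(X)
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```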
Step 4: Testing the Models: Linear Regression and Random Forest
To test and train the models, we used an 80/20 train/test split. For both models, we ran a loop to track the training and testing scores while increasing the number of features selected by the recursive feature elimination model. Once we identified the optimal range of features to use for the Multiple Linear Regression and Random Forest Regression models, we created subsets of our training and testing data containing only the selected features. Finally, we trained and tested our models on those data subsets (again with an 80/20 train/test split).
Results and Conclusion
Linear Regression
The plot below shows the effect of increasing the number of features on the accuracy scores produced by the Linear Regression model during RFE. As we can see, our training and testing accuracy scores tend to increase as the number of features increases.
We noticed that our model tends to have a higher testing accuracy when it is run on fewer features, and has the best training score when it uses all 60 features. However, because we wanted to identify 10-20 key features, we decided to train and test our model on the 13 best features, which we selected as the value that yields the second-best testing score.
The table below shows the 13 most important features as selected by our RFE function. Some of these results were quite unexpected, including values like min_rfx and min_rfy as selected variables of importance. min_rfx and min_rfy represent the minimum rear force produced in the x and y directions, which essentially capture the load, or backwards movement, prior to the actual swing itself. Other variables make sense as some of the most important features, such as range_lead_hip_z and max_torso_pelvis_x, as these are body parts that are essential in creating the rotational force of a swing that helps produce a better bat speed.
Using a subset of these 13 features, we trained and tested our model, which produced a testing accuracy score of 38.7%. Unfortunately, this low accuracy indicates that our model isn't performing as we had hoped. It could mean that our data doesn't have a strongly linear relationship with bat speed, which would suggest there is nothing wrong with our model itself, but rather that it isn't the best model option for our data. Because of this, we decided to see if we could produce stronger results by using the Random Forest Regression model.
```python
from LinearRegressionAnalytic import LinearRegressionAnalytic
from LinearRegressionAnalytic import rfe
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

data = pd.read_csv("biomechanics_dataset_v1.csv")

np.random.seed(1)
X = data.iloc[:, 1:60]
y = data.iloc[:, 60]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.2)

val_scores = []
train_scores = []
for k in range(1, 61):  # number of features to keep, from 1 to 60
    selected_features = rfe(x_train, y_train, k)
    # masking to include only the selected features
    feature_mask = np.zeros(x_train.shape[1], dtype=bool)
    feature_mask[selected_features] = True
    # subsetting x train and test to include only selected features
    X_train_selected = x_train.loc[:, feature_mask]
    X_test_selected = x_test.loc[:, feature_mask]
    lr = LinearRegressionAnalytic()
    # fitting model on selected features
    lr.fit(X_train_selected, y_train)
    # appending score to score list
    val_scores.append(lr.score(X_test_selected, y_test))
    train_scores.append(lr.score(X_train_selected, y_train))

# plot the results
import matplotlib.pyplot as plt
plt.plot(range(1, 61), val_scores, label='Testing accuracy')
plt.plot(range(1, 61), train_scores, label='Training accuracy')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Accuracy Scores Produced by Linear Regression Model During RFE")
plt.show()

feat_select = rfe(x_train, y_train, 14)
feat_select

display(data.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]])

x_train_lin = x_train.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]]
x_test_lin = x_test.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]]
lr.fit(x_train_lin, y_train)
print("Testing Accuracy of Subset:", lr.score(x_test_lin, y_test))
```
| | max_rear_shoulder_x | max_rear_shoulder_y | range_lead_hip_z | max_rear_shoulder_z | max_torso_y | range_lead_shoulder_z | max_torso_pelvis_x | min_rfx | min_rfy | range_rear_shoulder_y | range_torso_pelvis_x | max_lead_hip_z | max_pelvis_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 550.0243 | 514.1198 | 778.1339 | 1188.6807 | 182.7302 | 867.6362 | 182.4743 | -232.2776 | -88.0097 | 689.2249 | 233.0842 | 384.3450 | 88.3858 |
| 1 | 638.6019 | 535.2822 | 960.4793 | 1278.5380 | 196.9712 | 1054.5098 | 236.0902 | -189.7241 | -106.2254 | 812.9988 | 306.7874 | 520.8627 | 106.4238 |
| 2 | 580.0406 | 472.9189 | 784.0413 | 1588.7207 | 248.4432 | 988.9415 | 222.8233 | -124.4299 | -84.5785 | 708.1030 | 313.2967 | 433.6955 | 82.5397 |
| 3 | 635.8561 | 484.2663 | 1036.2757 | 888.1270 | 166.9048 | 1472.7250 | 168.7606 | -175.8547 | -122.1629 | 732.5588 | 228.6738 | 489.4716 | 81.4764 |
| 4 | 566.9714 | 502.2202 | 1093.3019 | 1487.6143 | 191.2448 | 1130.6572 | 220.7400 | -219.5387 | -72.5831 | 699.1772 | 286.4758 | 597.7220 | 75.9968 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 632 | 631.5529 | 488.3580 | 980.3030 | 1575.2948 | 165.8830 | 1150.6032 | 147.9856 | -114.1301 | -173.2356 | 804.6660 | 239.9022 | 354.5130 | 124.7927 |
| 633 | 571.2316 | 477.7701 | 748.3298 | 1604.9299 | 145.5400 | 1026.0944 | 188.9410 | -113.4915 | -157.5923 | 735.1128 | 276.8293 | 324.9995 | 137.1521 |
| 634 | 549.3600 | 407.3251 | 526.3367 | 1393.4961 | 128.0184 | 1029.3547 | 257.2261 | -112.7565 | -111.9854 | 584.3304 | 348.2130 | 207.2101 | 128.8111 |
| 635 | 623.2650 | 463.8467 | 1248.0062 | 1715.0544 | 136.8013 | 892.8699 | 177.4202 | -122.3425 | -161.2802 | 725.1355 | 266.6244 | 282.0038 | 157.1024 |
| 636 | 599.2501 | 505.9937 | 1433.3273 | 1480.4099 | 143.7898 | 1233.8176 | 169.0549 | -165.5618 | -132.0637 | 700.1916 | 234.5590 | 500.9032 | 107.8579 |

637 rows × 13 columns
Testing Accuracy of Subset: 0.386607442390802
Random Forest Regression
Below is a graph that displays the training and testing scores produced by the Random Forest Regression model when run with RFE over all of the features. As we can see from the graph, this model performed much better than the Linear Regression model. Unlike the Linear Regression model, the Random Forest model reached a peak around 10 features and maintained a consistent score from there, whereas the Linear Regression model had greater variance between scores across all of the features.
Because the Random Forest Regression model creates a series of trees using the bootstrapping method, we expected the model to have a better training accuracy and weren't as concerned with overfitting. We also expected it to have a better accuracy score than the Linear Regression model because it can capture non-linear relationships, which we expect our data to have given the poor performance of the Multiple Linear Regression model. Additionally, because of the way the trees are built, the random forest model is less likely to be heavily affected by outliers in the data, which should allow it to achieve a better testing accuracy score.
We found that the Random Forest Regression model had higher training and validation scores than the Linear Regression model. The training score reached nearly 100% when run with more than 10 features. The testing accuracy reached around 65% at 10 features and increased slightly as the number of features increased. To compare this model with the Linear Regression model, we chose the 15 most important features to test and train the Random Forest model. After training and testing our model on the subset of 15 features, we got a testing accuracy score of 55.4%, which is significantly better than our multiple linear regression model. Considering there is no option for our model to randomly guess, since we are predicting a continuous numeric value, we are satisfied with how much our random forest model learned.
Interestingly, there is only one feature that the Linear Regression model and the Random Forest Regression model both selected: max_rear_shoulder_y. This feature corresponds to the shoulder connected to the top hand on the bat; if that shoulder isn't moving well in the swing, it will hinder the ability to produce enough rotational force from the torso.
We were surprised to find that, between the 13 features selected by the Linear Regression model and the 15 features selected by the Random Forest Regression model, there weren't more similarities. We were hoping to discover a few select features that both models identified as important, and were interested to find that the majority of the selected features were different.
We hypothesize that the dimensionality of the forces produced by each body part may be a factor that contributes to the challenge of distinguishing significant features. For example, in the Linear Regression model, max_torso_y was selected as an important feature, and in the Random Forest Regression model, max_torso_z was selected as an important feature. This indicates that the maximum force produced by the torso is important when creating and predicting bat speed. If we were to run this experiment again, we might try getting rid of the separate x, y, and z components and instead use the average of the forces produced by each body part.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

model = RandomForestRegressor()
scores_test = []
scores_train = []
for x in range(60):
    estm = RFE(model, n_features_to_select=x + 1, step=1)
    estm.fit(x_train, y_train)
    scores_test.append(estm.score(x_test, y_test))
    scores_train.append(estm.score(x_train, y_train))

import matplotlib.pyplot as plt
x_val = list(range(1, 61))  # number of selected features at each step

# Plot the chart
plt.plot(x_val, scores_test, label='Testing accuracy')
plt.plot(x_val, scores_train, label='Training accuracy')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy Score")
plt.title("Random Forest Regression")
plt.legend()
plt.show()
```
Overall, our project was successful, as we built two models and an RFE function to help us determine the most important features of a swing while predicting bat speed. When we formulated the idea for our project, our goal was to deliver Python source code that we constructed, along with two Jupyter notebooks: one containing our initial exploration, and one holding our final write-up and experiments. We were successful in meeting this goal, as we finished with more than two Jupyter notebooks and two Python source code files containing our linear regression with RFE and our random forest model.
If we had more time, we would work on finding ways to improve the accuracy of our models. One idea to improve the accuracy score is to get rid of the dimension factor, as mentioned above, so each biomechanical force has only one representation instead of three. We hope this would help our models narrow down the important features and produce a better accuracy score. Additionally, we would like to bring in more data to train and test our models on. Driveline's baseball technology is relatively new, so the data is sparse. If the technology were more accessible, we could have more data, which would allow our models to improve.
Bias
We created a model that predicts a player's bat speed with approximately 65% accuracy, using around 10 features. As with any algorithm, we must consider any assumptions or biased data that could have resulted in systematic error throughout our process. Because our training data came from Driveline's research and development facility, our model was only trained on players with close proximity or access to their biomechanics facility. According to the U.S. Census, 43.3% of the population of Kent, Washington (home of Driveline R&D) are white individuals, contrasted with 12.2% of the population being black individuals. While the data's goal is to highlight specific bodily forces and movements that contribute to predicting bat speed, rather than demographics like race, age, height, or weight, we must acknowledge that this data is most likely skewed toward white individuals and could be less accurate in predicting the bat speed of players of different races.
Additionally, we must highlight that baseball is a male-dominated sport, with the rare exception of a few women playing the sport — see article on Alexis Hopkins. While sports are typically gender segregated for the sake of “fairness” and an acknowledgment that male and female bodies are inherently different and will perform as such, factors like absolute strength and size are not as important in the sport of baseball, as they might be in gender-segregated sports like football and soccer. Rather, the Women’s Sports Foundation explains that baseball involves skills that are combinations of timing, coordination, strength, knowledge of the game, strategies, and control, and argues that bat speed and bat control are more important than absolute strength.
Yet, despite all of this, the Driveline R&D data only contains the biomechanics of male batters. Therefore, if our model were to be improved and implemented, it would only perpetuate the segregation of men and women in this sport. If the data to improve a player’s bat speed can only improve male players, women will continue to be left in the dust.
Driveline Baseball describes its mission as developing the future of baseball training by finding industry-leading insights; all while publishing those insights back to the baseball community where they belong. However, because baseball is a historically white and male-dominated sport, the “insights” that will be found will only contribute to solidifying that the “baseball community” remains dominated by players that fit those demographics.
It is our duty to expand this research and development into more marginalized player communities, such as female athletes and athletes of other races. Then, we can use these insights to create unique training programs that empower and embrace their unique features and help them become the best athletes they can be.
Application
Our bat speed model could be used by coaching staffs to better inform the decisions that they make. For example, if a given player's predicted bat speed is higher than their actual recorded bat speed, this would indicate a mechanical inefficiency in their swing. Coaches, with this knowledge, could then direct their focus to finding these mechanical inefficiencies and correcting them. Players in this group would spend more time on skill acquisition training. On the other hand, if a player has a predicted bat speed that is equal to or lower than their actual bat speed, this would indicate above-average efficiency in their swing, and their training could be directed more toward general strength and power to increase force production. In either case, our model would help streamline the process from data collection to impact, giving coaches and players the power to have impactful training sessions tailored to each individual.
Group Contribution Statement
Cece Ziegler: Helped with data cleaning. Built RFE function and Linear Regression model. Performed RFE and model experiments. Led writing “Results and Conclusion” sections.
David Byrne: Introduced topic. Managed data cleaning. Led writing of “Abstract and Overview of Significance of Topic”, “Data”, “Variable Overview”, “Values Statement”, and “Application” sections.
Julia Fairbank: Led writing of “Formal Introduction to Topic”, “Recursive Feature Elimination”, “Multiple Linear Regression”, “Random Forest Regression”, “The Process”, and “Bias” sections.
Sam Ehrsam: Conducted initial RFE experiments with SKLearn library. Built Random Forest Regression model. Performed RFE experiments.
Personal Reflection
I learned a lot in the process of completing our final project. First, I learned about recursive feature elimination and how it can be used in conjunction with different models to select the most important features. It was beneficial to learn how this function works through the SKLearn package, and it was rewarding to build my own version of the function for our implementation of the linear regression model. I also learned how difficult it can be to find the model that works best for a particular dataset. We felt good about our models going into the building and testing phase, as we had researched which models work well for predicting continuous numeric values. However, as we see from our results, our models weren't very successful. I learned that building a model that performs well on a specific dataset can be a long process involving a lot of trial and error. I was hoping to learn more about which features are most important in predicting bat speed to potentially help with my own swing; however, due to the low accuracy scores and the lack of agreement between the features each model selected as most important, I didn't learn as much as I had hoped in this regard.
I am very satisfied with what I achieved during this project. My goal was to complete all of my tasks on time and help with the implementations and revisions of the project. I met this goal, as I finished all of my parts in a timely manner, and I built the linear regression model and the RFE function we used with the linear regression implementation. I was an active participant in all of our group meetings, and I helped with various written portions of the final report.
I will carry what I learned from this project into various aspects of my life. First, I think it is important to be able to work in a group, especially in the technology world, and I learned skills for collaborating on a project using GitHub through this final. Additionally, I enhanced my skills in building machine learning models and experimenting with data. I learned how to determine which models are better suited to which datasets, along with ways to pick out the best features. These are skills I foresee myself using in future job opportunities and in individual projects I am interested in working on. I am very excited about what I learned throughout this course and how I was able to apply and enhance those skills through this final project. Machine learning and data science are fields I have a strong interest in pursuing as I move forward with my education and post-grad life, and I am excited to apply the skills I learned throughout this class and during this project.