from LinearRegressionAnalytic import LinearRegressionAnalytic
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

data = pd.read_csv("biomechanics_dataset_v1.csv")
np.random.seed(1)
X = data.iloc[:, 1:60]
y = data.iloc[:, 60]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Source code
https://github.com/CeceZiegler1/ML_Final_Proj/blob/main/LinearRegressionAnalytic.py
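The linked file contains the full implementation of the model and the feature-elimination helper. As a rough guide only, here is a minimal sketch of what a closed-form (normal-equation) linear regression with an R²-style score method might look like; the class name, attribute names, and details below are illustrative assumptions, not a copy of the linked source.

```python
# Minimal sketch of a closed-form (normal-equation) linear regression.
# This is an assumption about what LinearRegressionAnalytic provides,
# not the actual source code.
import numpy as np

class AnalyticLinearRegressionSketch:
    def fit(self, X, y):
        # prepend a column of ones so the intercept is learned with the weights
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        # least-squares solution of A w = y: w = (A^T A)^{-1} A^T y
        # (np.linalg.solve avoids forming the inverse explicitly)
        self.w = np.linalg.solve(A.T @ A, A.T @ np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        return A @ self.w

    def score(self, X, y):
        # coefficient of determination: R^2 = 1 - SS_res / SS_tot
        y = np.asarray(y, dtype=float)
        residual = y - self.predict(X)
        return 1 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
```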
Below, we fit our model on the x_train and y_train datasets, and then print out the training and validation scores. This model is fitted on all 60 features in the dataset. We can see from the scores that it is not performing well: the validation score sits below 0.5, meaning the model explains less than half of the variance in the held-out data. We are going to perform a recursive feature elimination, which we also implemented in our source code, to see if we can find the number of features that yields the best score.
from LinearRegressionAnalytic import LinearRegressionAnalytic
from LinearRegressionAnalytic import rfe

# Seeing how the model performs without RFE
LR = LinearRegressionAnalytic()  # I used the analytical formula as my default fit method
LR.fit(x_train, y_train)

print(f"Training score = {LR.score(x_train, y_train).round(4)}")
print(f"Validation score = {LR.score(x_test, y_test).round(4)}")
Training score = 0.5258
Validation score = 0.4605
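As an optional sanity check (not part of the original analysis), scikit-learn's LinearRegression also solves ordinary least squares, so its training and validation scores should land close to the numbers above:

```python
from sklearn.linear_model import LinearRegression

# fit the same split with scikit-learn's ordinary least squares
sk_lr = LinearRegression().fit(x_train, y_train)
print(f"sklearn training score = {sk_lr.score(x_train, y_train):.4f}")
print(f"sklearn validation score = {sk_lr.score(x_test, y_test):.4f}")
```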
Below, we create lists to store the training and validation scores produced for each number of features selected by our RFE. We use a for loop to iterate over each value from 1 to 60 and plot the score at each iteration. What we see from the graph is that, even with fewer features, our validation score never gets above roughly 0.45. The best scores come at around 12-15 features and 55-60 features, and even at these points the scores aren't very good. Although we were hoping linear regression would perform well on our dataset, that doesn't appear to be the case. Because of this, we are going to implement a random forest to see if we can obtain a better validation score on our dataset.
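Before the loop itself, note that the rfe helper used below lives in the linked source file and is not reproduced here. For orientation, a minimal sketch of one common way to implement recursive feature elimination for a linear model, by repeatedly refitting and dropping the feature with the smallest coefficient magnitude, might look like the following. This is an assumption about the approach, not the actual implementation (a real version would likely standardize the features before comparing coefficient magnitudes).

```python
import numpy as np

def rfe_sketch(X, y, k):
    """Return the column positions of the k features that survive after
    recursively eliminating the weakest remaining feature one at a time."""
    Xv = np.asarray(X, dtype=float)
    yv = np.asarray(y, dtype=float)
    remaining = list(range(Xv.shape[1]))
    while len(remaining) > k:
        # least-squares fit (with intercept) on the currently remaining columns
        A = np.column_stack([np.ones(len(Xv)), Xv[:, remaining]])
        w, *_ = np.linalg.lstsq(A, yv, rcond=None)
        # drop the feature whose (non-intercept) coefficient is smallest in magnitude
        weakest = int(np.argmin(np.abs(w[1:])))
        remaining.pop(weakest)
    return remaining
```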
# compute the training and validation score for each number of features k
val_scores = []
train_scores = []
for k in range(1, 61):
    selected_features = rfe(x_train, y_train, k)
    # build a boolean mask that includes only the selected features
    feature_mask = np.zeros(x_train.shape[1], dtype=bool)
    feature_mask[selected_features] = True
    # subset x_train and x_test to include only the selected features
    X_train_selected = x_train.loc[:, feature_mask]
    X_test_selected = x_test.loc[:, feature_mask]
    # fit the model on the selected features
    lr = LinearRegressionAnalytic()
    lr.fit(X_train_selected, y_train)
    # append the scores to the score lists
    val_scores.append(lr.score(X_test_selected, y_test))
    train_scores.append(lr.score(X_train_selected, y_train))
# plot the results
import matplotlib.pyplot as plt

plt.plot(range(1, 61), val_scores, label='Testing accuracy')
plt.plot(range(1, 61), train_scores, label='Training accuracy')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
feat_select = rfe(x_train, y_train, 14)
feat_select
[33, 34, 3, 35, 37, 6, 39, 2, 43, 45, 14, 19, 23, 28]
Below, we will show the 13 most important features as obtained through our rfe.
data.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]]
| | max_rear_shoulder_x | max_rear_shoulder_y | range_lead_hip_z | max_rear_shoulder_z | max_torso_y | range_lead_shoulder_z | max_torso_pelvis_x | min_rfx | min_rfy | range_rear_shoulder_y | range_torso_pelvis_x | max_lead_hip_z | max_pelvis_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 550.0243 | 514.1198 | 778.1339 | 1188.6807 | 182.7302 | 867.6362 | 182.4743 | -232.2776 | -88.0097 | 689.2249 | 233.0842 | 384.3450 | 88.3858 |
1 | 638.6019 | 535.2822 | 960.4793 | 1278.5380 | 196.9712 | 1054.5098 | 236.0902 | -189.7241 | -106.2254 | 812.9988 | 306.7874 | 520.8627 | 106.4238 |
2 | 580.0406 | 472.9189 | 784.0413 | 1588.7207 | 248.4432 | 988.9415 | 222.8233 | -124.4299 | -84.5785 | 708.1030 | 313.2967 | 433.6955 | 82.5397 |
3 | 635.8561 | 484.2663 | 1036.2757 | 888.1270 | 166.9048 | 1472.7250 | 168.7606 | -175.8547 | -122.1629 | 732.5588 | 228.6738 | 489.4716 | 81.4764 |
4 | 566.9714 | 502.2202 | 1093.3019 | 1487.6143 | 191.2448 | 1130.6572 | 220.7400 | -219.5387 | -72.5831 | 699.1772 | 286.4758 | 597.7220 | 75.9968 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
632 | 631.5529 | 488.3580 | 980.3030 | 1575.2948 | 165.8830 | 1150.6032 | 147.9856 | -114.1301 | -173.2356 | 804.6660 | 239.9022 | 354.5130 | 124.7927 |
633 | 571.2316 | 477.7701 | 748.3298 | 1604.9299 | 145.5400 | 1026.0944 | 188.9410 | -113.4915 | -157.5923 | 735.1128 | 276.8293 | 324.9995 | 137.1521 |
634 | 549.3600 | 407.3251 | 526.3367 | 1393.4961 | 128.0184 | 1029.3547 | 257.2261 | -112.7565 | -111.9854 | 584.3304 | 348.2130 | 207.2101 | 128.8111 |
635 | 623.2650 | 463.8467 | 1248.0062 | 1715.0544 | 136.8013 | 892.8699 | 177.4202 | -122.3425 | -161.2802 | 725.1355 | 266.6244 | 282.0038 | 157.1024 |
636 | 599.2501 | 505.9937 | 1433.3273 | 1480.4099 | 143.7898 | 1233.8176 | 169.0549 | -165.5618 | -132.0637 | 700.1916 | 234.5590 | 500.9032 | 107.8579 |
637 rows × 13 columns
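For reference, the column names shown in the table above can be recovered directly from the selected positions, for example:

```python
# list the names of the columns displayed above
selected_idx = [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]
print(list(data.columns[selected_idx]))
```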
# keep only the selected features in the train and test sets, then refit and score
x_train = x_train.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]]
x_test = x_test.iloc[:, [33, 34, 3, 35, 37, 6, 39, 43, 45, 14, 19, 23, 28]]

lr.fit(x_train, y_train)
lr.score(x_test, y_test)
0.386607442390802