
Importance of TPE stats via Decision Trees



Motivation

 

Relatively new to the league, but wanted to take a stab at trying to better understand how the TPE of certain stats affects the results of the sim. This type of analysis has been done before. See here, here and here. Those posts served mostly as the inspiration for me to try something new. I wanted to see if I could improve on some things by taking a different approach.

 

Some drawbacks to the analysis above that I hoped to correct:

  • Prediction should be done on rate stats, not on counting stats. More minutes played means higher counting stats, but that doesn’t necessarily mean the player is better. Things like goals per 60 mins, hits per 60 mins, etc. can be looked at instead to hopefully give a clearer picture.
  • Linear models perform poorly with highly correlated variables. In one of the above posts, EX is shown to be the third most significant predictor of goals, and it was even hypothesized that this was because the only players who actually invest in EX are already well established in the other stats that matter. In reality, EX might not actually be important.

  • There are lots of non-linear heuristics about TPE floating around the forum. Things like “PA will lead to less goals/total offensive output when it is within 10 points of SC” and “DI only matters up to 50 then it makes no difference” etc. etc. I wanted to pick a model that could do a decent job at picking up on these trends, if they existed

  • There was no assessment of how well the model worked for prediction purposes. It was used purely as a variable importance tool, but it could have been terrible at prediction. A model that predicts poorly on unseen data is often overfitted and not actually capturing the true relationships between TPE and output.
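To make the rate-stat point from the first bullet concrete, here is a minimal Python sketch. The function name and the two players are made up for illustration; this is not code from the analysis itself.

```python
def per_60(count, minutes_played):
    """Convert a counting stat into a per-60-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return count * 60.0 / minutes_played

# Two hypothetical players: B has more goals, but A scores at a higher rate.
player_a = {"goals": 20, "minutes": 900}
player_b = {"goals": 30, "minutes": 1800}

rate_a = per_60(player_a["goals"], player_a["minutes"])
rate_b = per_60(player_b["goals"], player_b["minutes"])
print(round(rate_a, 2), round(rate_b, 2))
```

Ranking by raw goals would put B first, while the per-60 rate puts A first, which is the distinction the bullet is making.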

 

Decision Trees

 

Decision trees are an efficient way to capture non-linearities in data, and they also don’t do a bad job of capturing linear relationships. They work by finding the split in an independent variable that leads to the biggest reduction in the variance of the dependent variable. For example, an SC of 70 might split a tree predicting GoalsPer60, where, collectively, players with SC above 70 average 1 goal per 60 and those below average 0.5. The tree continues to grow by recursively selecting the next best split until it is told to stop (by certain tuning parameters). Perhaps the most annoying drawback to using trees for prediction is figuring out how to properly select the tuning parameters. Stop the tree too early and you don’t get enough interesting relationships, but let it grow too big and you risk overfitting to your training data. There are a bunch of different strategies you can take to further improve decision trees (boosting, random forests, etc.) but the tradeoff is more predictive power in exchange for a model that is harder to interpret. I didn’t want to go too fancy for this, so I stayed with your run-of-the-mill regression trees.
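For anyone curious what "finding the best split" looks like mechanically, here is a rough Python sketch of a single split search. It assumes the usual regression-tree criterion (reduction in the sum of squared errors), and the SC and goals numbers are invented for the example. It is a sketch of the idea, not the rpart implementation.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, sse_reduction) for the split 'x < threshold'
    that most reduces the total squared error of ys."""
    total = sse(ys)
    best = (None, 0.0)
    candidates = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        reduction = total - sse(left) - sse(right)
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Made-up data: SC values and goals per 60 for six players.
sc = [50, 60, 65, 75, 80, 90]
goals_per_60 = [0.4, 0.5, 0.6, 0.9, 1.0, 1.1]

threshold, reduction = best_split(sc, goals_per_60)
print(threshold, round(reduction, 3))
```

A full tree repeats this search over every variable, picks the best one, and then recurses on each side until a stopping rule kicks in.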

 

Data

 

The data used was regular season VHL player data from seasons 59 to 66 (these are the only complete seasons I could find on the portal to scrape). CPU players and players who played 0 minutes in a season were omitted. I used seasons 59 to 65 for my training data, and season 66 for my test data set. Like the analyses linked above, the TPE used for each player is still the end-of-season TPE. This definitely hurts the results a bit, but I do not know of a way to get the TPE of a player at each game he played.
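The filtering and train/test split described above could look something like this in Python. The field names (`season`, `minutes`, `is_cpu`) and the sample rows are assumptions for illustration, not the real scraped schema.

```python
def prepare(rows, test_season=66):
    """Drop CPU players and players with 0 minutes, then split by season."""
    kept = [r for r in rows if not r["is_cpu"] and r["minutes"] > 0]
    train = [r for r in kept if r["season"] < test_season]
    test = [r for r in kept if r["season"] == test_season]
    return train, test

# Hypothetical rows, just to exercise the function.
rows = [
    {"name": "A", "season": 59, "minutes": 1200, "is_cpu": False},
    {"name": "B", "season": 65, "minutes": 0, "is_cpu": False},
    {"name": "C", "season": 66, "minutes": 900, "is_cpu": False},
    {"name": "D", "season": 66, "minutes": 800, "is_cpu": True},
]

train, test = prepare(rows)
print([r["name"] for r in train], [r["name"] for r in test])
```

Splitting on whole seasons (rather than randomly) keeps the test set a season the model has never seen.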

 

I started out trying to actually predict the raw stat numbers. This didn’t do poorly per se, but given the amount of variability from season to season it didn’t perform amazingly (e.g. one year a player can average 3 hits per 60, and the next he’ll average < 1.5 with similar minutes, team, etc.). At a glance it looks like the sim/team strategies change enough from year to year that prediction of stats is hard. I chose to handle this variability by predicting stat ranks within each season: instead of predicting “player A will get X goals per 60”, I chose to predict “player A will be in the Xth percentile of goals per 60 among all players that season”. In my data, a higher percentile means a higher stat (i.e. the 100th percentile is the league leader of that stat in the season).
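As a rough illustration of the within-season ranking, here is a Python sketch. The row layout and the tie rule (ties share the higher rank) are assumptions, since the post does not say how ties were handled.

```python
from collections import defaultdict

def percentile_ranks(rows, stat):
    """Add '<stat>_pct' in (0, 1] to each row, ranked within its own season.
    Tied players share the higher rank (an assumed tie rule)."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)
    for season_rows in by_season.values():
        values = [r[stat] for r in season_rows]
        n = len(values)
        for r in season_rows:
            # Fraction of players in the season at or below this value.
            r[stat + "_pct"] = sum(v <= r[stat] for v in values) / n
    return rows

# Hypothetical data: scoring is higher across the board in season 60.
rows = [
    {"name": "A", "season": 59, "GoalsPer60": 1.2},
    {"name": "B", "season": 59, "GoalsPer60": 0.6},
    {"name": "C", "season": 59, "GoalsPer60": 0.9},
    {"name": "D", "season": 59, "GoalsPer60": 0.3},
    {"name": "E", "season": 60, "GoalsPer60": 2.0},
    {"name": "F", "season": 60, "GoalsPer60": 1.0},
]

ranked = {r["name"]: r["GoalsPer60_pct"] for r in percentile_ranks(rows, "GoalsPer60")}
print(ranked)
```

Note that player F has a higher raw GoalsPer60 than player C but a lower percentile, because each is only compared against players from the same season.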

 

If anyone wants access to the raw data set, just give me a shout. 

 

Results

 

I looked at GoalsPer60, AssistsPer60, HitsPer60, HitsTakenPer60, PIMPer60, ShotsPer60, ShotsBlockedPer60, FaceoffWinPercentage (min 50 faceoffs taken), Fights (total), PlusMinus (total)

 

I will walk through the output for GoalsPer60 in detail, and then just quickly list the results for the other stats.


 

[Figure: pruned decision tree for GoalsPer60]

 

This is a (fairly pruned) decision tree that resulted from running GoalsPer60 on every TPE stat. At each branch there is a split in some variable that creates two more branches. The data points where that split is true go left, and where that split is false go right. Before each split there is a summary of the data (the blue bubble above). The top number is the average (mean) of the output (percentile of GoalsPer60) for all data points in that split. The bottom number is the percentage of data in that split. 

 

For example, we start with an average rank of 0.5 (50th percentile) and 100% of the data. The first split is “SC < 84”. 52% of the training data had SC < 84, and collectively these players had an average rank in GoalsPer60 of the 31st percentile (lower percentile = fewer goals per 60). Alternatively, the 48% of players that had SC >= 84 had an average rank of the 70th percentile. SC >= 84 is good to have if you want your player to score relatively more goals.

 

The tree can be followed down through a bunch more splits until it ends at a terminal node (a leaf). Whatever the average value is for that leaf is what gets used as the prediction. For this particular tree, there are only 8 possible percentiles that can be predicted for any given player (because there are only 8 leaves). This is not super realistic, but as we will see below, it does a decent job for what we need.
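In code, following the tree down to a prediction is just a loop. The Python sketch below only encodes the first split from the post (SC < 84, with averages of 0.31 and 0.70); the deeper splits of the real tree are left out, and the dictionary layout is an assumption made for the example.

```python
def predict(node, player):
    """Walk the tree until a leaf is reached; return that leaf's mean percentile."""
    while "leaf" not in node:
        # A true split condition goes left, a false one goes right.
        branch = "left" if player[node["var"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# Only the root split of the GoalsPer60 tree; the real tree has 8 leaves.
stump = {
    "var": "SC",
    "threshold": 84,
    "left": {"leaf": 0.31},
    "right": {"leaf": 0.70},
}

low_sc = predict(stump, {"SC": 80})
high_sc = predict(stump, {"SC": 90})
print(low_sc, high_sc)
```

Because every player ends up in one of a handful of leaves, the model can only ever output as many distinct predictions as it has leaves.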

 

Error

 

TrainingError = 0.11

TestError = 0.14

 

These errors are the average absolute deviation of the predicted values from the actual values in each data set. In this case the training error is saying that, on the data that the tree trained on (seasons 59 to 65), the average deviation in rank was about 11 percentiles. 

 

As an example, player #1 in the data (John Locke in season 59) was ranked in the 83rd percentile for GoalsPer60, but was predicted to be in the 86th percentile. He would have only contributed 0.03 to the error. 

 

The biggest deviation was Mikka Pajari in season 64. He actually scored in the 6th percentile for GoalsPer60, but the decision tree predicted him to place in the 53rd percentile based on his TPE. It would be interesting to know why he performed so poorly; his TPE seems quite fair, so this tells me there are a lot of things other than just TPE at play in the sim.
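The error measure described above is a plain mean absolute error on the percentile scale. Here is a small Python sketch using only the two players mentioned (John Locke and Mikka Pajari), so the resulting number is for illustration and is not the 0.11 training error.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted percentiles."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Percentiles quoted in the post: Locke (S59) and Pajari (S64).
actual = [0.83, 0.06]
predicted = [0.86, 0.53]

locke_error = abs(actual[0] - predicted[0])
pajari_error = abs(actual[1] - predicted[1])
mae = mean_absolute_error(actual, predicted)
print(round(locke_error, 2), round(pajari_error, 2), round(mae, 2))
```

Running the same function on the season 66 rows instead of the training rows gives the test error.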

 


 

The test error will almost always be worse than the training error, because the test data has no impact on the decisions the model chose to make. Overall, this model is not amazing for prediction purposes (it would be nice to get the test error down to under 0.05), but it could be a lot worse. The fact that the test error is not too much higher than the training error suggests that it didn’t overfit too badly (if we upped our tree from 8 leaves to 40, we would likely see lower training error, but even higher test error). In terms of figuring out which stats are important, the model does well enough.

 

Variable Importance

 

The importance of variables within any given decision tree can be assessed and scored pretty easily. One way to do this is to sum up the “goodness of split” for each split for each variable. See here for more info. Unlike the chart above, where the tree was pruned to only show very meaningful splits, variable importance is calculated at every possible split, for all variables included in the model, i.e. even if a variable was not chosen as a split by the tree, there will still be some contribution made to that variable’s importance. Here is the output for GoalsPer60:
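A simplified version of that bookkeeping, sketched in Python: it sums the improvement from each split per variable and rescales so the scores add up to 100. The split values are invented, and the sketch only counts splits the tree actually made, so it leaves out the extra contribution from variables that were never chosen.

```python
from collections import defaultdict

def variable_importance(splits):
    """Sum split improvements per variable and scale them to total 100."""
    totals = defaultdict(float)
    for var, improvement in splits:
        totals[var] += improvement
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    scaled = {var: 100 * t / grand_total for var, t in totals.items()}
    # Most important variable first.
    return dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical (variable, goodness-of-split) pairs from one tree.
splits = [("SC", 6.0), ("PH", 2.0), ("SC", 1.0), ("SK", 1.0)]

importance = variable_importance(splits)
print(importance)
```

A variable that is split on several times (SC here) accumulates importance from each of those splits.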


 

[Figure: variable importance for GoalsPer60]



 

The top 3 most important stats were SC, PH, and SK, with SC being the predominant variable. Note that just because a variable is important, it does not mean that the relationship is direct/positive. A highly important variable might have a negative relationship, or it might not have a linear relationship at all. The best way to assess the nature of the relationship is by looking at the tree itself, or by looking at partial plots (i.e. a scatterplot of the dependent variable against one TPE stat at a time).

 

In this case, all three important TPE stats scale positively with GoalsPer60 (higher SC, PH, or SK generally lead to better GoalsPer60).


 

Summary

 

I will quickly summarize the results in the table below, but all of the decision trees can be found externally (linked here).

 

Stat                  Training Error  Test Error  Important 1    Important 2    Important 3    Important 4    Important 5
GoalsPer60            0.11            0.14        SC (23)        PH (16)        SK (15)        DF (11)        EX (9)
AssistsPer60          0.15            0.2         PH (18)        SK (12)        DF (12)        SC (9)         PA (9)
PIMPer60              0.13            0.14        CK (28)        ST (9)         DF (6)         SK (4)         Position (4)
HitsPer60             0.13            0.14        CK (29)        ST (9)         DF (6)         SK (4)         FG (3)
HitsTakenPer60        0.14            0.14        FO (16)        Position (14)  DF (6)         SK (4)         SC (4)
ShotsPer60            0.1             0.13        SC (22)        PH (15)        SK (14)        DF (14)        Position (9)
ShotsBlockedPer60     0.12            0.13        Position (28)  SC (7)         PH (4)         DF (3)         FG (2)
Fights                0.16            0.16        FO (9)         FG (7)         Position (7)   CK (1)         PS (0)
PlusMinus             0.17            0.25        PH (14)        DF (8)         SK (7)         SC (6)         ST (5)
FaceoffWinPercentage  0.1             0.11        FO (20)        Position (13)  ST (2)         EX (2)         PA (2)

 

Overall I think the forums have it right. PH, SK, SC, DF are the top stats, some worth getting higher faster than the others depending on how you want to build.

FO, ST, PA, CK are the top secondary stats to pump stuff into. The rest are kind of disappointing.

 

 

Edited by studentized

@studentized  I am a big stats guy. I work with a lot of stats every day and a lot of raw data in studies I am a part of. Honestly, I've never heard of this type of analysis before. What is the difference between a Linear regression model (like I ran here and here) vs. your regression tree? lol to me this seems like a poor man's regression mixed with a poor man's SEM?

 

You can use all the fancy stats words, I do understand them lol. Just trying to interpret why this may be better/worse. Also, what program did you use?

 

Just to be clear, I'm not saying this is bad. This was a great read, I just want to understand the test more! I am planning on doing some SEM with VHL data soon once I figure out some kinks in my coding for the tests. Nice to see a stats bro :)

Edited by Motzaburger

@Motzaburger Good question! I'll do my best to answer.

 

For starters, it doesn't assume a linear relationship between the variables or come with any of the other requirements that linear regression does (normal error terms, uncorrelated independent variables, etc.). When you don't have those things in a linear regression, the stats don't necessarily hold anymore (interpreting p-values, etc.).

 

In general, most parametric predictive models come with some assumptions, so it usually takes a whole bunch of work to meet them. Principal component analysis is one technique that could probably get this data to fit a linear model (PCA essentially replaces your independent variables with linear combinations of them that are orthogonal/uncorrelated).

 

As for why this might be better... It might not be. The standard way to assess model accuracy is to train on one set, test on another and compare test error rates. It's very possible that the one I came up with is worse (it's definitely far from being great).

 

The main reason I picked decision trees was because they are fast to run and easy to interpret. The fact that it hadn't been done before and could capture any potential nonlinearities in the data came second.

 

The stats part was all done in R (rpart is the trees library), and a lot of the data prep was done in Node.js.



Sweet! Makes sense! So basically a non-normal data regression. With most real-world data, assumptions are ignored and/or not met anyway because real data don't work like that lol.

 

Nice to see R as the program tho instead of Excel or something. I use SPSS myself, but I'm taking a course on R in the fall. I do know a lot of Mplus though so it could be interesting. Good stuff here once again. Looking forward to seeing more if you do it. I'm sure I'll do some more sometime soon too, just a matter of finding the time for data collection and actually doing it lol.

Link to comment
Share on other sites

On 8/3/2019 at 8:31 PM, studentized said:

1645 words. Will claim for next 3 weeks (Aug 4th to Aug 18th)

 

claiming for week of 18th (last one)

 

Edit: will bump this to next week instead actually, so I can get theme week in for the 18th

Edited by studentized
