Who Had the Best Season? (science)

Nykonax · September 20, 2022

Hi,

I don't usually write articles anymore, but I learned something cool in my stats class and wanted to apply it here. So there's this thing called a z-score, which is essentially how many standard deviations away from the mean a data point is. What's cool about z-scores is that they allow for comparisons between different things. For example, you could check whether your score on the SAT is better than your friend's score on the ACT. So I think this could also be applied to different VHL seasons and eras to fairly adjust for the differences between them. I do want to talk to my prof about this and see how valid it is, especially if VHL scoring isn't normally distributed. But I do think there is still some merit to this, and even if it's somewhat statistically wrong, it's still somewhat statistically right.
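To make the definition concrete, here is a minimal Python sketch of the SAT vs. ACT comparison. The exam means and standard deviations are made-up placeholder numbers, not real exam statistics, and `z_score` is just an illustrative helper name.

```python
def z_score(value, mean, std_dev):
    """How many standard deviations `value` sits above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Placeholder numbers for illustration only (not real exam statistics).
sat_z = z_score(1300, mean=1050, std_dev=200)
act_z = z_score(27, mean=21, std_dev=5)

# The two scores are on different scales, but their z-scores are comparable.
better = "SAT" if sat_z > act_z else "ACT"
print(f"SAT z = {sat_z:.2f}, ACT z = {act_z:.2f}, stronger result: {better}")
```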

Anyways, I wrote some Python code to calculate the z-scores of each player in each season and then found the highest z-score, which essentially tells you which player had the most outlier season, which you could also call the most outstanding. However, this is limited to S49-84 for now because of data acquisition and index formatting. When I have more time I want to gather the data for the earlier seasons, but it's a little more involved.
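The actual code isn't posted in the thread, so the following is only a rough sketch of how such a calculation might look, not the author's script. The seasons, player names, and point totals are hypothetical, and the choice of `statistics.pstdev` (population standard deviation) is an assumption based on the whole league being in the data.

```python
from statistics import mean, pstdev

# Hypothetical point totals, keyed by season then player (not real VHL data).
points_by_season = {
    "S1": {"Player A": 120, "Player B": 110, "Player C": 100, "Player D": 90},
    "S2": {"Player E": 100, "Player F": 60, "Player G": 55, "Player H": 45},
}

def season_z_scores(points):
    """Return {player: z-score} for one season's point totals."""
    mu = mean(points.values())
    sigma = pstdev(points.values())  # population SD (assumption, see lead-in)
    return {player: (pts - mu) / sigma for player, pts in points.items()}

# Find the highest z-score within each season.
season_leaders = {}
for season, points in points_by_season.items():
    z = season_z_scores(points)
    leader = max(z, key=z.get)
    season_leaders[season] = (leader, z[leader])

# The most outstanding season is the largest of those season-leading z-scores.
best_season = max(season_leaders, key=lambda s: season_leaders[s][1])
print(season_leaders)
print(best_season, season_leaders[best_season])
```

In this toy data, Player E's 100 points in the lower-scoring S2 comes out ahead of Player A's 120 points in S1, which is the kind of era adjustment the z-score is meant to provide.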

The best season in the 36 seasons from S49 to S84 was Joel Jarvi's S53 (@CowboyinAmerica wrote a fantastic article about him here), when he scored 162 points and had a z-score of 3.405. The mean of the season-leading z-scores is about 2.53. The next highest was 3.127, for Max Molholt's 145 point campaign in S49.

I think this is an interesting statistic, and would love to expand it out to earlier seasons and overall careers. Could be very useful in HOF discussions, especially if it does adjust for era like I think it would. If people are curious about the messy code I wrote I'm willing to share it, and I can calculate other stats too (goals, hits, +/-, whatever).

Spoiler

If people are interested in the highest z-score of each season, here is a list of them. The first is S49, the last is S84.

zscores = [3.127, 2.5, 2.884, 2.884, 3.405, 2.826, 2.67, 2.714, 3.04, 2.731, 2.851, 2.284, 2.356, 3.051, 2.796, 2.192, 2.32, 2.349, 2.721, 2.972, 2.274, 2.405, 2.078, 2.133, 2.214, 2.482, 1.979, 2.517, 2.391, 2.392, 2.482, 2.059, 1.993, 2.07, 2.425, 2.498]
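As a quick sanity check on the list above, a few lines of Python can label each value with its season and pull out the mean and the top entries. The only assumption is the mapping from list position to season (the first value is S49), which comes from the post; `FIRST_SEASON` and `by_season` are just illustrative names.

```python
from statistics import mean

zscores = [3.127, 2.5, 2.884, 2.884, 3.405, 2.826, 2.67, 2.714, 3.04, 2.731,
           2.851, 2.284, 2.356, 3.051, 2.796, 2.192, 2.32, 2.349, 2.721, 2.972,
           2.274, 2.405, 2.078, 2.133, 2.214, 2.482, 1.979, 2.517, 2.391, 2.392,
           2.482, 2.059, 1.993, 2.07, 2.425, 2.498]
FIRST_SEASON = 49  # the list starts at S49, per the post

# Pair each season-leading z-score with its season label.
by_season = {f"S{FIRST_SEASON + i}": z for i, z in enumerate(zscores)}

# Sort seasons from most to least outstanding leader and keep the top three.
top_three = sorted(by_season.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(len(zscores), "seasons, mean of leaders:", round(mean(zscores), 3))
print("top three:", top_three)
```

Run as written, the two highest entries land on S53 and S49, matching the seasons named in the post.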

Edited September 20, 2022 by Nykonax

Gustav · September 21, 2022

Is this just for points? Might be useful to clarify that because there's going to be a clear positional bias. Also, are bots being included in this? Anyone with very low ice time is going to severely throw off the averages if they're still in the index. I'm also not surprised that you're seeing a lot of outlier performances in the late S40s-50s as member numbers were way low and player numbers went all over the place as a result.

Apart from any of that, well done--it's always good to see people going beyond the usual "here are the numbers on my player page" for their articles.

Nykonax · September 21, 2022

21 minutes ago, Gustav said:

Is this just for points? Might be useful to clarify that because there's going to be a clear positional bias. Also, are bots being included in this? Anyone with very low ice time is going to severely throw off the averages if they're still in the index. I'm also not surprised that you're seeing a lot of outlier performances in the late S40s-50s as member numbers were way low and player numbers went all over the place as a result.

Yeah, just points. There's no positional adjustment, although I guess you could find the z-scores of just defensemen and just forwards and then compare those for some sort of positional adjustment. I excluded anyone who played fewer than 72 games or had 0 points. Low averages or point totals don't necessarily throw off the numbers; they're still valid data points when thought of in the context of the problem. If you were to apply this to the NHL and wanted to know who had the most outstanding season, you wouldn't exclude 4th liners from the calculations. The same can be said about the VHL if you're viewing it as a real league and bots as 4th liners. I removed people with 0 points or fewer than 72 games mostly because they would've only played like 10 seconds a game and aren't actual players, but rather just random STHS fillers which wouldn't exist in real life.
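A rough sketch of that filter plus the suggested positional split might look like this. The player records are hypothetical, the tuple layout and the `z_scores` helper are made up for illustration, and only the 72-game and 0-point cutoffs come from the post.

```python
from statistics import mean, pstdev

# Hypothetical season records (not real VHL data): (name, position, games, points).
players = [
    ("Forward A", "F", 72, 150),
    ("Forward B", "F", 72, 110),
    ("Forward C", "F", 72, 100),
    ("Forward D", "F", 72, 80),
    ("Defender A", "D", 72, 140),
    ("Defender B", "D", 72, 70),
    ("Defender C", "D", 72, 60),
    ("Defender D", "D", 72, 50),
    ("Filler", "F", 10, 0),  # dropped by the filter below
]

MIN_GAMES = 72  # cutoff from the post

# Keep only players with a full season and at least one point.
eligible = [p for p in players if p[2] >= MIN_GAMES and p[3] > 0]

def z_scores(records):
    """Return {name: z-score of points} within the given group of records."""
    pts = [r[3] for r in records]
    mu, sigma = mean(pts), pstdev(pts)
    return {r[0]: (r[3] - mu) / sigma for r in records}

pooled = z_scores(eligible)  # everyone compared against everyone

by_position = {}             # each player compared only against their position
for pos in ("F", "D"):
    by_position.update(z_scores([p for p in eligible if p[1] == pos]))

for name in ("Forward A", "Defender A"):
    print(f"{name}: pooled z = {pooled[name]:.2f}, "
          f"within-position z = {by_position[name]:.2f}")
```

With these made-up numbers the forward leads on the pooled z-score, while the defender leads once each position is scored against its own group.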

Shindigs · September 21, 2022

The distribution should be very close to normal for VHL scoring, so using z-scores should be valid, and since you have access to the whole population you obviously don't need to use a t-score to correct for only working off a sample. Anything that involves a very large number of randomized events (like a VHL season) should inherently end up close to a normal distribution. It won't be perfect, but it should be good enough that there shouldn't be any issue with using this. Also, since you have access to all the data points, you can simply create a graph of them in Excel/Sheets and visually confirm whether it is or isn't normal.
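If the data is already sitting in Python rather than a spreadsheet, the same visual check can be roughed out as a text histogram. This is only a sketch: the point totals are hypothetical and the 20-point bin width is an arbitrary choice.

```python
from collections import Counter

# Hypothetical point totals for one season (not real VHL data).
points = [4, 8, 35, 42, 48, 51, 55, 58, 60, 63, 66, 70, 74, 79, 85, 93, 110]

BIN_WIDTH = 20  # arbitrary choice for this sketch

# Count how many players fall in each bin (0-19, 20-39, ...), keyed by bin start.
bins = Counter((p // BIN_WIDTH) * BIN_WIDTH for p in points)

# Print one row of '#' marks per bin; a roughly normal season shows one central hump.
for start in range(0, max(points) + 1, BIN_WIDTH):
    label = f"{start:>3}-{start + BIN_WIDTH - 1:<3}"
    print(label, "#" * bins[start])
```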

I actually used to use z-score as part of the sheet I run to evaluate Vegas' performance, it's a pretty useful tool to have when dealing with any stats. Though for my specific use case it didn't really show anything beyond what my scatterplots already do visually.

Will reiterate what's already been mentioned: you really should split not only forwards and dmen, but if you ever take a look at hybrid era only in the future, you also have to split centers and wingers, since the quite substantial investment in FO these days means they really aren't on an even playing field. There's a reason wingers make up almost all the Boulet finalists every season: they don't need to spend ~100 TPE on FO and can put it into more SC/DF/CK instead, making them stronger than their center peers by default, unless you care about FO% which the VHL usually doesn't.

Daniel Janser · September 21, 2022

10 minutes ago, Shindigs said:

unless you care about FO% which the VHL usually doesn't.

It does not help either that the FO% shown on the index stats site is broken and copies the numerical value of the TOI... you need to dig deeper to get to those values, and who has got the time (except for their own player)...

here today's extract for my player:

Whereas it should read as follows:

Edited September 21, 2022 by Daniel Janser

Shindigs · September 21, 2022

11 minutes ago, Daniel Janser said:

It does not help either that the FO% shown on the index stats site is broken and copies the numerical value of the TOI... you need to dig deeper to get to those values, and who has got the time (except for their own player)...

here today's extract for my player:

Whereas it should read as follows:

Just Simon things, the FO% leaderboards at least pull the correct value. Also that FO% bug is only in the VHL index, the VHLM index doesn't have it and the stat works.

Nykonax · September 21, 2022

4 hours ago, Shindigs said:

The distribution should be very close to normal for VHL scoring, so using z-scores should be valid, and since you have access to the whole population you obviously don't need to use a t-score to correct for only working off a sample. Anything that involves a very large number of randomized events (like a VHL season) should inherently end up close to a normal distribution. It won't be perfect, but it should be good enough that there shouldn't be any issue with using this. Also, since you have access to all the data points, you can simply create a graph of them in Excel/Sheets and visually confirm whether it is or isn't normal.

Seems like S76 (and I'd guess most of the 70s) is pretty normally distributed. Otherwise, a lot of seasons have a bunch of players in the 0 to about 10-15 point range, which I guess is just a lot of bots that I didn't filter out for the reasons I gave above. Gonna talk to my professor about this and find out more about the impact of it.

5 hours ago, Shindigs said:

but if you ever take a look at hybrid era only in the future, you also have to split centers and wingers, since the quite substantial investment in FO these days means they really aren't on an even playing field

I don't necessarily think that's true. Pretty sure centres just inherently get more points in STHS, and that even playing field doesn't really exist or matter. Sure, they might have 100 less TPA, but there's no need to adjust for that, especially since there wouldn't be any adjusting for TPE differences in general. Centres effectively have 100 less TPA, but some players also just straight up have 100-200-500 less TPA than others. That's not being adjusted for (and shouldn't be), so why should the centre difference be adjusted for? I think the positional adjustment between defensemen and forwards does make sense though, because right now a 150 point season from a forward would have a higher z-score than a 140 point season from a defenseman, even though I think nearly everyone would say 140 from a D is much more outstanding than 150 from a forward.

Shindigs · September 21, 2022

1 minute ago, Nykonax said:

Seems like S76 (and I'd guess most of the 70s) is pretty normally distributed. Otherwise, a lot of seasons have a bunch of players in the 0 to about 10-15 point range, which I guess is just a lot of bots that I didn't filter out for the reasons I gave above. Gonna talk to my professor about this and find out more about the impact of it.

I don't necessarily think that's true. Pretty sure centres just inherently get more points in STHS, and that even playing field doesn't really exist or matter. Sure, they might have 100 less TPA, but there's no need to adjust for that, especially since there wouldn't be any adjusting for TPE differences in general. Centres effectively have 100 less TPA, but some players also just straight up have 100-200-500 less TPA than others. That's not being adjusted for (and shouldn't be), so why should the centre difference be adjusted for? I think the positional adjustment between defensemen and forwards does make sense though, because right now a 150 point season from a forward would have a higher z-score than a 140 point season from a defenseman, even though I think nearly everyone would say 140 from a D is much more outstanding than 150 from a forward.

Yeah, the difference between dmen and forwards is clearly much bigger, but looking at both forwards as a whole and wingers/centers split could be interesting. If you split them and still see basically the same results, that supports the hypothesis that the difference isn't relevant. But if it *does* massively skew the output you get, then it is instead some degree of evidence that the difference does matter. Remember that we're looking at the whole sample of all wingers vs. all centers, not just individual ones, so assuming a large enough sample size we expect to see a roughly equivalent TPA spread in both sets, with effectively a -100 TPA modifier applied to one set but not the other.

For the graphs, as long as we ignore the clear outliers (bots/very low TPA IAs), the distributions are all roughly normal, with the S62 one being somewhat right-skewed. There's some test you can do to make sure it isn't too right-skewed for z-scores to still be applicable, but I sadly can't remember the details. Haven't had to do it for a hot minute. It's a bit hard to tell since the outliers screw up the scale.
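One simple numeric check, not necessarily the test meant here, is the skewness coefficient: the third central moment divided by the standard deviation cubed. A value near 0 means a roughly symmetric distribution and a clearly positive value means a long right tail. The sketch below uses hypothetical point totals and deliberately doesn't claim any particular cutoff for "too skewed".

```python
from statistics import mean

def skewness(values):
    """Population skewness: third central moment divided by SD cubed."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # variance (second central moment)
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical point totals (not real VHL data).
symmetric = [40, 50, 60, 70, 80]
long_right_tail = [40, 45, 50, 55, 60, 150]

print(f"symmetric: {skewness(symmetric):.2f}")
print(f"long right tail: {skewness(long_right_tail):.2f}")
```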

If you still have sets that are too irregular and don't seem to be applicable with z-score, I think you can still do basically the same analysis using Chi Squared instead, which if you just got to z-score, you probably won't get to until the end of the term or so.
