Welcome on in everyone. I am T and today I want to talk about the beta distribution.
Now, a lot of people have been asking on Twitter and YouTube what a beta distribution is, and there have been a lot of complicated answers out there. I’m hoping to explain this as simply as possible.
The best way to describe a beta distribution is as a probability of probabilities. Say the season is just starting. I have a bunch of historical information, so I can make an assumption about a player.
Finding the Beta Distribution of Batting .280
Let’s take a baseball player and I’m just going to say I think that their batting average for the season is going to probably be about .280.
Then I go ahead and load a model up, run the prediction, and it says yes, .280. Now, as the season goes on, maybe that player is doing exceptionally well or exceptionally poorly, and .280 may not be true anymore. If my model continues to say that the player has a 28 percent chance of getting a hit, maybe that’s wrong, but I won’t see that from the way my model is running. What the beta distribution allows you to do is take that assumption, add newly gained information from the games that have been played, and tell you the probability that your prediction still holds.
If we have a .280 and there have been maybe 150 at-bats since the season began, I can check how likely it is that my .280 still holds. It tells me the probability that this batter is hitting .280 or better, and that number adjusts as results come in throughout the season.
Let’s go ahead and take a look at the spreadsheet to learn a little bit more about the beta distribution.
As I’ve mentioned, the beta distribution again is a probability of probabilities.
So if we look at this info really quickly, I have a predicted probability. I’m going to say I think that this player has a batting average of .250 and that is what I’m expecting.
That’s my prediction and I think that’s where they’re going to be throughout the season. That’s my expected.
I then have my alpha. The alpha is simply the number of successes that I have seen. I have a beta, which is the number of failures that I have seen. Here we’re going to say that there have been 100 successes and 463 failures, which means this fictitious player has had 563 at-bats.
Right now I’m saying that their batting average is .250. The probability of that given this information is extremely low. However, if I were to change this and say maybe their batting average is actually only .100, there’s a 99 percent chance that the true average is at least that high.
If we went back to .250, I could change some of these numbers around. I could change this to 200 successes. And now we see that this number has been modified.
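The spreadsheet cell described above can be approximated outside Excel. What follows is a minimal sketch in Python, assuming the cell computes one minus the cumulative beta distribution. The function name prob_at_least and the Simpson's-rule integration are illustrative choices, not something taken from the spreadsheet.

```python
import math

def prob_at_least(x, alpha, beta, steps=20000):
    """Approximate P(true rate >= x) under Beta(alpha, beta).

    Integrates the beta density over [x, 1] with Simpson's rule.
    Assumes alpha > 1 and beta > 1 (true for the hit/miss counts used here).
    """
    # Log of the normalizing constant, kept in log space to avoid overflow.
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)

    def pdf(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return math.exp(log_norm
                        + (alpha - 1) * math.log(p)
                        + (beta - 1) * math.log(1 - p))

    h = (1.0 - x) / steps          # steps must be even for Simpson's rule
    total = pdf(x) + pdf(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(x + i * h)
    return total * h / 3

# The fictitious player from the transcript: 100 successes, 463 failures.
print("P(avg >= .250):", round(prob_at_least(0.250, 100, 463), 4))
print("P(avg >= .100):", round(prob_at_least(0.100, 100, 463), 4))
# Same guess after changing the successes to 200.
print("P(avg >= .250) with 200 successes:", round(prob_at_least(0.250, 200, 463), 4))
```

The first value should come out close to zero and the second close to one, matching what the spreadsheet shows.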
Update Your Priors
What’s great about this is that you can use it throughout a season to course correct what your numbers are showing.
If a model is saying that there’s a high probability that a player can do something and you start using the beta distribution, you can see that maybe your model is inaccurate or it’s extremely accurate.
In order to use a beta distribution, whatever you’re trying to evaluate must be a Bernoulli trial, which means each attempt has a yes-or-no outcome and that is it. The attempts are also treated as independent of one another.
So you could think of flipping a coin, that would work. A lot of different things in different sports can work. You can do hitting percentages and things like that as well.
Let’s go ahead and look at a batting average for two players. Here are some numbers that I pulled from the prior season that is helping me in my assumption.
There were 586 at bats, 160 hits and 426 misses. The way I came up with this information was I put in the at-bat. I put in how many hits had occurred and I just merely subtracted the hits from the at-bat to give me how many misses.
You’ll notice that down here it says misses and drops. This is just because it can be used for a lot of different types of events in different sports. It works for baseball, football, basketball, things like that. So that is just a way that you can label it.
Let’s just change the label to only misses, since we are only talking about baseball right now. I did the same thing here. I went ahead and got the targets, which here are basically the at-bats, then how many hits and how many misses.
Now we’re looking at these two players and looking at their information. I’m saying the batting average is probably going to be .270. Looking at this information, there’s about a 56 percent chance that I am accurate.
My number is sitting in there and that’s pretty good. That’s not terrible. Now down here, looking at this information, we’re closer to the 50 percent line. I might want to reevaluate how I came up with this probability to pull that number up.
Before we start modifying these numbers, let’s look at the beta distribution formula in Excel. The function is BETA.DIST. You pass in your assumed probability (what you think it will be), then the successes, then the failures, and you set cumulative to TRUE.
The one other thing you’re going to want to do is subtract this output from one. With cumulative set to TRUE, BETA.DIST gives the probability that the true rate is at or below your number, so one minus the output is the probability that it is at least your number.
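Written out as math, the quantity the spreadsheet reports is the upper tail of a beta distribution. The symbols are my own notation rather than anything in the spreadsheet: x is the assumed probability, \alpha is the number of successes, \beta is the number of failures, and p is the player's true rate.

```latex
\[
P(p \ge x)
  \;=\; 1 - \mathrm{BETA.DIST}(x, \alpha, \beta, \mathrm{TRUE})
  \;=\; 1 - \frac{\int_0^{x} t^{\alpha-1}(1-t)^{\beta-1}\,dt}
                 {\int_0^{1} t^{\alpha-1}(1-t)^{\beta-1}\,dt}
\]
```

For the first player above, x = .270, \alpha = 160, and \beta = 426, which is the setup that produced the figure of about 56 percent.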
Build Your Case
Now that we have those, let’s go ahead and open up Baseball Reference and get some new statistics.
For (Shohei) Ohtani, we are going to go ahead and look at the at-bats. So far there’s been 311 additional at bats with 96 hits. We’re going to go ahead and add in 311 and 96. We’re going to go ahead and let you see what those are right there.
Remember, the information I was originally showing was from 2022. This is the current 2023 season. With the new numbers added in, the sheet is still tracking correctly: there’s now about an 84 percent chance that Ohtani’s batting average is at least .270.
Now I could think about moving this up. Maybe, as I’ve been watching games, I’m saying, man, he’s hitting phenomenally. Is he hitting .300? Not likely, because now we see that it’s about 12 percent. We jumped way over our number. Let’s go ahead and pull that back in.
Maybe it’s .280. We’re still above the 50 percent mark, so we’re doing pretty well. Now, what’s great about this is that as the season goes on and you modify these numbers, you can see the beta distribution numbers change and shift.
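The update walked through above can be sketched in a few lines of Python. This assumes the spreadsheet simply adds the new hits to alpha and the new misses to beta, which is the standard way to update a beta prior but is my reading of the sheet, not something stated outright. The helper prob_at_least is an illustrative stand-in for one minus BETA.DIST.

```python
import math

def prob_at_least(x, alpha, beta, steps=20000):
    """Approximate P(true rate >= x) under Beta(alpha, beta) via Simpson's rule.

    Assumes alpha > 1 and beta > 1.
    """
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)

    def pdf(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return math.exp(log_norm
                        + (alpha - 1) * math.log(p)
                        + (beta - 1) * math.log(1 - p))

    h = (1.0 - x) / steps
    total = pdf(x) + pdf(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(x + i * h)
    return total * h / 3

# Prior: the 2022 season numbers from the transcript.
prior_hits, prior_misses = 160, 426
# New information: 2023 at-bats and hits so far.
new_at_bats, new_hits = 311, 96

# Assumed update rule: add new successes to alpha, new failures to beta.
alpha = prior_hits + new_hits
beta = prior_misses + (new_at_bats - new_hits)

for guess in (0.270, 0.280, 0.300):
    print(f"P(avg >= {guess:.3f}) = {prob_at_least(guess, alpha, beta):.2f}")
```

Under that assumption the .270 line lands in the mid-80s percent, in line with the roughly 84 percent shown on the sheet. The .300 line may differ a little from the spreadsheet depending on exactly which counts were entered.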
The beta distribution is a way in which you can evaluate some of your prior assumptions about a probability.
Sometimes we will look at an athlete or a sporting event and predict that a certain event has a certain probability of occurring. We’re basing that information off of historical information.
As we know, in the future maybe the player plays better or plays worse, so our numbers could shift around. Our probability is never going to be 100 percent accurate or just sit on one number.
Adjust for Success and Failure
Let’s go ahead and briefly explain how this works one more time. Let’s say that before the season begins, Ohtani has a .280 batting average. I will have used some old numbers to come up with this assumption, but this is my prior estimation. As the season goes on, I’m predicting that Ohtani would have a .280 batting average.
However, maybe Ohtani starts to have a really good season or a really bad season. That means that this probability should be shifting around. The probability will never be completely accurate, nor will it be static throughout the season.
As information is learned game after game, plate appearance after plate appearance and at-bat after at-bat, we start to get more information. Each time that he comes up to the plate, we figure out did he get a hit or did he get a miss? What this number here, the beta distribution number is telling me, is what is the probability that Ohtani is going to have at least a .280 batting average.
As the season goes on we gain more information. If we instead plug in 68 hits in 311 at-bats, the probability drops down to 3 percent, because we’ve added so many additional at-bats with very few successes.
The probability that Ohtani finishes with at least a .280 batting average is now very low, at 3 percent. However, if he had some other numbers and got hits in 168 of an additional 311 at-bats, that does change things. Now my original assumption that he will have at least a .280 batting average has a 99 percent chance of being true.
One thing that I can do then is think about the forecast in a different way. Okay, well, if that’s very, very possible, what about a .290? Well, a .290 is possible as well. What about .300? Pretty possible as well. What if I say .400?
Now it drastically drops because this is an absurd number for a batting average, and we’ve now come down to a 1 percent.
What this allows you to do is evaluate your initial probability, which is your prior, and as you add in new information, you get to see what the probability is that your original number is still accurate. And if it is not accurate anymore, or extremely accurate, you can start adjusting this number in this calculator to figure out what is it most likely going to be, based upon the new information we have received?
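The trial-and-error search described above (try .290, try .300, and so on) can also be automated. This is a rough sketch rather than anything from the calculator itself: the hypothetical find_threshold function uses bisection to find the batting average that the player has a 50 percent chance of meeting or beating, given the hits and misses seen so far. The 50 percent target is my choice for illustration.

```python
import math

def prob_at_least(x, alpha, beta, steps=4000):
    """Approximate P(true rate >= x) under Beta(alpha, beta) via Simpson's rule.

    Assumes alpha > 1 and beta > 1.
    """
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)

    def pdf(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return math.exp(log_norm
                        + (alpha - 1) * math.log(p)
                        + (beta - 1) * math.log(1 - p))

    h = (1.0 - x) / steps
    total = pdf(x) + pdf(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(x + i * h)
    return total * h / 3

def find_threshold(alpha, beta, target=0.5, tol=1e-4):
    """Bisection search for the average x where P(true rate >= x) equals target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_at_least(mid, alpha, beta) > target:
            lo = mid  # still likely to clear mid, so the threshold is higher
        else:
            hi = mid  # unlikely to clear mid, so the threshold is lower
    return (lo + hi) / 2

# 2022 prior (160 hits, 426 misses) plus 96 hits in 311 at-bats from 2023.
alpha, beta = 160 + 96, 426 + (311 - 96)
even_money = find_threshold(alpha, beta)
print(f"50/50 batting average: {even_money:.3f}")
```

Lowering target, for example to 0.10, returns a more optimistic average that the player has only a 10 percent chance of reaching.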
Once again, these are priors, and then you use more recent information to update your prior.
Beta Distribution vs. Binomial Distribution
A beta distribution is different than a binomial or a negative binomial. A beta distribution is a probability of a probability.
In a binomial distribution, we know how many trials we’re expecting (how many at-bats) and we have the probability. The way I have set this one up, as one minus the binomial distribution, it tells us the probability of getting at least a given number of hits.
Here we’re saying if there are 100 at-bats at a .400 batting average, the probability to get one hit out of 100 is one, and the probability of getting two is one and three is one and so on and so forth. But then when we start to hit 12, it becomes 99 percent.
If we use 50 at-bats and a .280 average, now we start to see things shift around. If I only have 50 at-bats and I have a .280 batting average, the probability of me getting at least one is about 99 percent. Same thing for two, three, four. Now it’s starting to dwindle down and you can see it starting to drop. When I hit nine right now, we see a pretty good drop off, 40 percent, then to 10.
The binomial distribution is when we know how many trials we have, we have a known probability, and we’re trying to see how many times an event will occur based upon this information.
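For comparison with the beta examples, here is a small Python sketch of the one-minus-binomial calculation. It assumes "at least k hits" means one minus the probability of k - 1 or fewer hits. The function name binom_at_least is illustrative, and the exact percentages depend on how the spreadsheet column is set up, so no specific outputs are claimed here.

```python
import math

def binom_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1.0 - below

# Known number of trials and a known probability, as in the transcript.
at_bats, average = 50, 0.280

for hits in range(1, 21):
    print(f"P(at least {hits:2d} hits in {at_bats} at-bats) = "
          f"{binom_at_least(hits, at_bats, average):.3f}")
```

The contrast with the beta calculation is in what is treated as known. Here the batting average is fixed and the number of hits is uncertain. In the beta calculation the hits and misses are observed and the batting average is the uncertain quantity.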
In a beta distribution, we’re saying we have a probability and we want to see how accurate our probability is as we learn new information. Again, this can be used for different sports. It does not have to only be in baseball, but again, it should be a did-it-occur type event. Things like catches in football, field goals made, passes and completions made, things like that.
It cannot be yardage or anything like that. Yardage has high variance and is not a count of single yes-or-no occurrences. Think of flipping a coin. It’s either heads or it’s tails.
So that is it for the beta distribution. Again, it is a probability of probabilities. Hopefully this will allow you to take a look at your model, understand whether it’s predicting and forecasting accurately, or see some early warning signs that maybe you need to change either your data sets or your model.
If you have any questions, you can reach me in the Unabated Discord. You can drop all questions into the article’s channel. That’s it for now. Until next time: Happy wagering.