Intro to Data Science: Using Sports Betting Data With Game Data

# Intro to Data Science Part 4: Testing Your Hypothesis

Jason Scavone
Intro to Data Science
Modeling
August 7, 2023

When last we met, we had sourced a trove of batter data from the 2021 MLB season. We started to look at some basic correlations among variables for hitters. Now it’s time to introduce sports betting data to the mix.

By doing so we can begin to answer the most crucial question of our whole experiment. The one you ask yourself on a daily basis: Will the game go Over or stay Under?

Finding out what makes a game go Over is our lodestone. So far, we think hard-hit rate will be a predictor of games that go over their total.

To start testing that hypothesis, we acquired team performance data. But we’re missing the other key component here: sports betting data. It’s time to bring the lines into the picture.

## Adding Sports Betting Data to Game Data

The good news is: if you’re doing this for football, it’s going to be easy. The NFL and college R and Python packages we mentioned earlier include betting data as part of what they track.

With baseball, it’s a bit trickier. Though we don’t endorse any particular site over another, there are options available for paid data and some historical data.  We’ll leave it to you to find a solution that works. If you can’t find an R package that includes the information, there are various websites that post daily lines and some that include line history.

We’ve pulled some historical information from SportsBookReviewsOnline.com. These are 2021 lines from a market-making book.

To start with we’ll combine our sports betting data and our game data. We also will include some select pitcher data. For that, we’ve sourced game data from Fangraphs.

Here are the component parts we’ll need to work with.

And here’s the full Excel file that puts it all together:

## Shaping the Data

There are a few steps to get our Excel file into shape. First, put all of the individual files into one workbook.

Then, create a key for home and away teams by joining the data in our date and team cells using the ampersand (“&”). Though you could also use the concatenate function in Excel.

This allows you to assign a unique identifier to individual games. It’s vital to have unique IDs for individual entries so you can precisely locate data, particularly using Excel’s functions.

In columns X through AM, we’ve shaped the odds data.

Use the VLOOKUP function to pull in odds data from the odds sheet. Repeat the steps for the away moneyline, home moneyline, total, Over price and Under price.

Next, get the implied probabilities of our betting prices. The formula for plus-money probabilities is below, where X is the price:

For minus-money probabilities, the formula is:

If you click on cell AC2, you can see the way to build the formula in Excel using a conditional statement is:

=IF(Y2>0,(100 / (Y2 + 100)), (-Y2 / (-Y2 + 100)))

Now remove the vig from these implied probabilities. The formula to do that is below, where P1 is the probability of the first event (in this case the home team winning) and P2 is the probability of the second (the away team winning).

In our first game, the line is -125/+115. That gives us implied probabilities of 46.5 percent and 55.5 percent. If you add those together it comes out to 102 percent. The extra is the overround, which is where the vig is. We’re just adjusting to remove that overround.

Ours looks like this in the sheet:

The true implied probabilities then, are 45.6 percent and 54.4 percent. The vig has been cleared out, and the probabilities now add up to 100 percent.

Before we go any further, we need to know if we’re going to be looking for a binary result (ie: did the game go Over or Under?) or a continuous variable (what is the score of the game likely to be?). Here, we’re going to use both.

To get the actual total, obviously just add together the home (column F) and away scores (column H).

To note whether a game went Over or Under, compare the actual total to the betting line. We’re using a “1” for games that went Over and a “0” for games that stayed Under. For this example, we’re counting pushes in with the Under, though you may want to make the distinction down the line in your own work.

The formula here, where we have total score in column AM is the following. Note that multiplying by one in the formula just changes the output from “True/False” to “1/0”:

=(AL2>Z2)*1

Repeat the same process in column AK to check to see if the home team or away team won.

Columns AN through AU append Fangraphs pitching data. The first step it to get player IDs for home and away pitchers using VLOOKUP to reference our pitcher sheets. Remember that player IDs may be different depending on what data you source.We talked about FanGraphs IDs vs. MLB Advanced Media player IDs in Part 2 of the series.

The last step is to again use VLOOKUP to pull in hard-hit data. This allows us to start building a more complete picture of team hard-hit data in a game vs. individual pitcher hard-hit data. It lets us start putting together correlations

## What Else Factors in?

Here’s what those correlations looked like in Part 3 when we built the plot in R.

Darker red means two variables have more positive correlation. So hard-hit percentage is positively correlated to slugging percentage, and slugging percentage is positively correlated to runs scored.

But we want to be able to look at this information from a few different angles. We can ask more questions and build out our hypotheses.

Overall, our “Y” in this experiment is: Did the total go Over? To get there, we’re testing certain “X” variables. We think that hard-hit rate is  important. But we can consider other factors – other X’s. Do teams that have a higher hard-hit percentage translate to winning more Overs?

These are some other X’s we can consider that might affect our Y:

• Vulnerable pitchers: Pitchers who tend to give up more hard-hit balls we expect would compound with teams more likely to rake. Does this lead to an even greater chance to hit the Over? This is why we’re including pitcher hard-hit data.
• Hot/cold streaks: We think teams that have been hitting the ball hard over the last 10 days are more likely to keep doing so. Does the data bear this out? We’ll take a look
• Weather: We have a pretty good idea that more runs are scored in the warmer months overall. While we don’t have weather data included here, you can explore it on your own. Do teams that hit the ball harder outperform the average team’s run production in the summer months, or is it a non-factor?

## Checking the Hypothesis

We could use the CORREL function in Excel to build out correlations in our data But we’re taking a little shortcut and show a visual representation of the correlation plot. Skip ahead to the next section to get the code to generate this plot using R.

Here are all of our variables together in one chart:

Remember, the darker red a cell is shaded, the more positive correlation there is between variables. The darker blue, the more negative correlation we have.

There is some some correlation between home hard-hit balls per game and the actual total.  Looks like we’re not way off base in our hypothesis.

You may have already spotted one place to expand on this work. Home and away hard-hit per game, and home and away hard-hit allowed per game are all listed separately. Since they both affect the total, combining into overall “per game” and “allowed per game” numbers, may allow you to build a more holistic picture about how hard-hit balls interact with game totals.

## Extra Credit

For those of you who are doing the work in R, here’s our code to put it all together. The code asks for a additional betting data from SportsBookReview, which we’ve included here for download. It also asks for the batting data. If you didn’t download that above, you can grab it here.

Here’s what’s going on inside all that code. The first thing you need to do is set your work directory (defaulted here to “D:/Downloads”) to the appropriate local path on your machine. This is where you want to store the .csv files the code will be looking for.

Line 23 loads the data we created previously for hard-hit percentages, slugging, etc.

Lines 28-171 add the .csv files of odds data to our dataframe. In addition, these lines clean up the data (including fixing some non-standard team abbreviations in our odds data, like “BRS” instead of “BOS” for “Boston Red Sox”).

Lines 174-229 convert prices to implied probabilities and remove the vig. They also create the binary information to determine if a game went over, and which team won.

Lines 232-364 brings in pitcher data and hard-hit rate for starters.

Finally, the code builds out our correlations and generates a plot. You can export the image of the plot to study where the correlations are.

## What’s Next

Now that we’ve got all our data in place and cleaned up, it’s time to have some fun. In the next installment we’ll look at linear regression, whether we think teams can run hot or cold, and logistic regression on totals.

### What came first

If you missed our previous installments, here’s where you can find those: