Intro To Data Science Part 5: Linear And Logistic Regression

Jason Scavone

April 19, 2024

So you’ve found a trove of data. And you’ve found a couple of correlations. What do you do with it all when you’re trying to build a sports betting model? It’s time to apply a linear or logistic regression.

First, let’s take a quick look back at how we got here:

In Part 1 of this series, we introduced the course and laid out what you’ll need to follow along.
In Part 2, we talked about where to find data.
Part 3 was when we looked at some basic ways to find correlations in your data.
And most recently in Part 4 we started to test out our hypothesis. We came into this exercise thinking that there might be a strong relationship between a team’s hard-hit rate and its runs scored.

Now it’s time to take the information we’ve gathered and use it to build a model around our hypothesis.

What Is a Sports Betting Model?

At the most fundamental level, a model yields a prediction based on the data and variables we used to create it.

To build your model, you need a dependent variable – or “response variable.” In this case, it’s runs scored. The independent variables are the ones we think explain the dependent variable. We’re going to start with a team’s hard-hit rate on the season, and then we’ll add in hard-hit rate over the last 10 games and see if streaks matter.

When you’re building a model, you don’t normally want to throw all possible independent variables into a blender and see what you get. It can lead to some messy, inconclusive answers. There are already enough blind alleys in sports betting without making more for yourself.

If you’re just starting out in modeling, a better approach is to simplify. Use one or a small handful of variables, then analyze the results. You can always iterate after the fact.

Checking the Results

We’ll talk about analyzing your results in a later article, but there are a number of ways to do it. The most commonly used is R-squared, which measures how much of the dependent variable’s variance is explained by the model.

R-squared exists on a scale of 0 to 1, where an R-squared of 0 means your independent variables explain none of the variance in your dependent variable, and an R-squared of 1 means they explain all of it. If your model has an R-squared of 0.5, that means your model explains 50 percent of the variance of the thing you’re trying to predict.
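To make that definition concrete, here is a minimal sketch in R that computes R-squared by hand and checks it against what R reports. The data is simulated and the variable names (hard_hit, runs) are ours, not columns from the article’s file.

```r
# Simulated, hypothetical data: hard-hit rate and runs scored
set.seed(1)
hard_hit <- runif(200, 0.30, 0.50)
runs     <- 1 + 9 * hard_hit + rnorm(200, sd = 2)

fit <- lm(runs ~ hard_hit)

# R-squared = 1 - (unexplained variation / total variation)
ss_res    <- sum(residuals(fit)^2)        # what the model misses
ss_tot    <- sum((runs - mean(runs))^2)   # total variation in runs
r_squared <- 1 - ss_res / ss_tot

# The hand-computed value should match what summary() reports
all.equal(r_squared, summary(fit)$r.squared)
```

With noisy data like this, expect a number much closer to 0 than to 1. That’s normal for a single-variable model of something as noisy as runs in a baseball game.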

Linear vs. Logistic

Once the dependent and independent variables are in place, the next step is to apply a regression.

There are two bread-and-butter regressions: linear and logistic. They’re usually the first tools most statisticians reach for when analyzing data.

(In fact, most machine learning algorithms are just fancy regressions. Don’t tell ChatGPT we said that. We don’t want to be on the hook after the singularity happens.)

These two methods are easy to use, and your results with linear and logistic regressions can help point you to which more advanced techniques or refinements are appropriate for your model.

Linear regressions are usually used for continuous outcome variables, and logistic regressions are better for binary outcomes. A continuous variable is something like “How many runs will be scored in this game?” A binary variable is a yes/no proposition, like “Will this game go over the total?”

If you think of your data on a plot where one axis is your independent variable (in our case, hard-hit rate) and the other is your dependent variable (runs scored), a linear regression will find the line that fits those data points best: the one that minimizes the squared distances between the line and the points.

Here’s a very simplified representation of a linear regression. You have a line that shows the best fit between your variables. It shows us the relationship between the x-axis (the independent variable) and the y-axis (the dependent variable). In this case, as the value of the x-axis increases, we can expect the value of the y-axis to increase as well.
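If you’d like to draw a picture like that yourself, the sketch below fits a line to simulated data and plots it. Everything here is made up for illustration: the numbers are random, and hard_hit and runs are placeholder names rather than columns from our data set.

```r
# Simulated, hypothetical data: hard-hit rate (x) and runs scored (y)
set.seed(7)
hard_hit <- runif(150, 0.30, 0.50)
runs     <- 20 * hard_hit - 3.5 + rnorm(150, sd = 1.5)

# Fit the best-fit line: runs = intercept + slope * hard_hit
fit       <- lm(runs ~ hard_hit)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["hard_hit"]]

# Scatter the points and draw the fitted line through them
plot(hard_hit, runs, xlab = "Hard-hit rate", ylab = "Runs scored")
abline(fit, col = "red")
```

A positive slope means that as hard-hit rate goes up, the model expects runs to go up with it.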

Planning Out the Variables

We’ll do both kinds of regressions by the end. We want to use the same variables in each regression. The data we’ve already gathered has the information we need on runs scored and hard-hit rate. Those are our dependent and independent variables, respectively. First, we’ll run a regression for the home team, then the away.

Once that’s all in place, we’ll introduce a second variable: hard hit over a team’s last 10 games. We’ll use this to see if our hypothesis about hot/cold streaks has any merit.

Starting Your Analysis

If you’re working in Excel, you can use built-in linear regression functions to analyze your data. This is where we have to step in and cajole you into taking up R, though. It’s a much faster way to process large data sets. And we promise, even if you’ve never done any coding before, it’s not as painful as it looks. Pinky swear. Re-read the previous installments in this series where we explain all the code line by line.

If you want to stick with Excel, though, you can do all the work in a spreadsheet. To perform a linear regression, use the Data Analysis tool under the Data tab (it’s part of the Analysis ToolPak add-in, so enable that first if you don’t see it) and select “Regression.” Use our dependent variable (home or away runs scored) as the Input Y Range and our independent variable (home or away hard hit) as the Input X Range.

You can create a summary and a plot that will allow you to analyze how well the variables explain what’s happening.

Here, we’re just using a truncated sample of our data with a few away scores and home hard-hit data points to generate an R-squared value.

We can also create a line fit plot to see the relationship between our variables.

It’s also possible to perform logistic regressions in Excel. But again, if you’re going to be pulling seasons worth of game data, it could end up being a bit of a beast.

OK. Cajoling over. If you choose to stay in Excel, at least keep reading to learn about how some of the pieces fit together. It will be helpful in

Extra Credit

If you’re doing the work in R, here’s the code:

Part 5 code

This requires loading in a .csv file that you already have if you’ve completed the previous modules. But if you can’t find it buried in a nest of folders, or you just want to cut to the chase, here’s the file for download. Just make sure you plant it in the working directory you set for the project using the command setwd("C:/ExampleDirectory"). (R wants forward slashes or doubled backslashes in Windows paths, and plain straight quotes rather than curly ones.)

Data After Article 4

We’ve also made these files available on Github, if you’re nerding out over there.

Let’s take a look inside all that code. We’re using the numbering from the standalone code, so if you’re using the full ride, start instead from line 716.

Line 17

Here we’re loading in all our data from previous installments into a data table, called “dat.” It’s a shorthand way of calling all the data in our .csv file every time we use “dat” in a different command.
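As a rough sketch of what that loading step looks like, the example below writes a tiny stand-in file to a temporary location and reads it back into “dat.” The file and its contents are invented so the example runs on its own; the real code may load the data differently (with a different function or file name), so treat read.csv here as our assumption.

```r
# Hypothetical stand-in for the article's .csv, written to a temp location
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(home_score = c(5, 3, 7), away_score = c(2, 4, 1)),
  csv_path,
  row.names = FALSE
)

# Load the file into a data frame called "dat"
dat <- read.csv(csv_path)

# Quick sanity checks: column types and the first few rows
str(dat)
head(dat)
```

From here on, any command that mentions dat is working with the contents of that file.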

Line 24

By “fitting” a model, we’re looking for the best way for the model to describe how the independent variable relates to the dependent variable. The “lm” function here is the “linear model” command. We’re telling the program to build a linear regression where home_score is our dependent variable. The tilde (“~” symbol) separates the dependent variable from the two independent variables we’re choosing from the data table: the home team’s hard-hit in previous games, and the away team’s runs allowed.
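Here is a minimal sketch of that kind of call, run on simulated data so it works without the article’s file. Only home_score is a name taken from the walkthrough above; home_hard_hit and away_runs_allowed are placeholder column names we made up, and the numbers are random.

```r
# Simulated, hypothetical game data
set.seed(42)
n <- 300
dat <- data.frame(
  home_hard_hit     = runif(n, 0.30, 0.50),
  away_runs_allowed = runif(n, 3.5, 5.5)
)
dat$home_score <- 0.5 + 6 * dat$home_hard_hit +
  0.5 * dat$away_runs_allowed + rnorm(n, sd = 2)

# Dependent variable on the left of the tilde, independent variables on the right
fit1 <- lm(home_score ~ home_hard_hit + away_runs_allowed, data = dat)

# One coefficient per independent variable, plus the intercept
coef(fit1)
```

The plus sign in the formula doesn’t mean the columns are summed. It just says “include this variable too.”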

Line 27

You can use the “summary” command in the console to call up the model’s coefficients. We’ll get into detail about what they all mean in the next article, but broadly, this allows us to analyze how well our model explains the variance in our dependent variable.

Among other analyses, you’ll find the model’s R-squared numbers in this summary.
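If you want to pull individual numbers out of the summary rather than read them off the console, something like the sketch below should work. The data is simulated and the column names other than home_score are our own placeholders.

```r
# Simulated, hypothetical game data
set.seed(42)
n <- 300
dat <- data.frame(
  home_hard_hit     = runif(n, 0.30, 0.50),
  away_runs_allowed = runif(n, 3.5, 5.5)
)
dat$home_score <- 0.5 + 6 * dat$home_hard_hit +
  0.5 * dat$away_runs_allowed + rnorm(n, sd = 2)

fit1 <- lm(home_score ~ home_hard_hit + away_runs_allowed, data = dat)

# Store the summary so we can pick pieces out of it
s <- summary(fit1)

s$coefficients    # estimates, standard errors, t values and p-values
s$r.squared       # share of variance explained
s$adj.r.squared   # R-squared penalized for the number of variables
```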

Line 30

Now we’re using our model in fit1 to predict what we think the home team will score. “dat$home_score_pred1” creates a column in our data frame where we’ll put our predicted score.

The “predict” function applies the model to the independent variables from our data, which, again, in “fit1” are home team hard-hit rate and away team runs allowed.
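A sketch of that step, again on simulated data with placeholder column names (home_score and home_score_pred1 come from the walkthrough; the rest are our assumptions):

```r
# Simulated, hypothetical game data
set.seed(42)
n <- 300
dat <- data.frame(
  home_hard_hit     = runif(n, 0.30, 0.50),
  away_runs_allowed = runif(n, 3.5, 5.5)
)
dat$home_score <- 0.5 + 6 * dat$home_hard_hit +
  0.5 * dat$away_runs_allowed + rnorm(n, sd = 2)

fit1 <- lm(home_score ~ home_hard_hit + away_runs_allowed, data = dat)

# New column: the score the model predicts for each game in the data
dat$home_score_pred1 <- predict(fit1, newdata = dat)

# Compare actual and predicted scores side by side
head(dat[, c("home_score", "home_score_pred1")])
```

Predictions will rarely be whole numbers. A predicted 4.6 runs is the model’s average expectation, not a claim that anyone scores six-tenths of a run.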

Lines 33-41

We’re creating two plots here to help us analyze our data.

Both plots look at the model’s errors (or “residuals”). Errors are just the difference between the predicted results and the observed results. If we predicted four runs and the actual result was five, that’s an error of one run. It’s where messy real life deviates from the model.

We want to know if our errors are normally distributed. The first method, in line 33, is a Q-Q plot. We want to know that most of our distribution falls along the line. Looks good here.

We’ll also plot out the density of the errors. Here, we can see that it’s possibly a normal distribution, but it’s definitely left-skewed.
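The two plots described above can be sketched with base R as follows. The data is simulated (scores are drawn as counts so the residuals aren’t perfectly bell-shaped), the column names are placeholders, and the skewness line is an extra numeric check we added rather than something from the article’s code.

```r
# Simulated, hypothetical game data; scores are counts, like real runs
set.seed(3)
n <- 400
dat <- data.frame(home_hard_hit = runif(n, 0.30, 0.50))
dat$home_score <- rpois(n, lambda = 1 + 8 * dat$home_hard_hit)

fit1 <- lm(home_score ~ home_hard_hit, data = dat)
res  <- residuals(fit1)   # observed score minus predicted score

# Q-Q plot: points hugging the line suggest roughly normal errors
qqnorm(res)
qqline(res)

# Density plot: look for a symmetric, bell-like shape
plot(density(res), main = "Density of residuals")

# Rough skewness: negative leans left, positive leans right, near 0 is symmetric
skew <- mean((res - mean(res))^3) / sd(res)^3
skew
```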

In the next article, we’ll look at how to address this.

Lines 47-50

Now we can add in more variables. In this case, we want to build “fit2” off of “fit1” and simply add in the home team hard-hit over the last 10 games. This is how we test the hypothesis that a team full of guys who’ve been swinging the hot bat will put up more runs overall.
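One way to sketch this step is to refit the model with the extra variable and compare it against the smaller one. The data below is simulated, home_hard_hit_l10 is a placeholder name for the last-10-games column, and the anova comparison is our addition, not necessarily what the article’s code does.

```r
# Simulated, hypothetical game data
set.seed(5)
n <- 300
dat <- data.frame(
  home_hard_hit     = runif(n, 0.30, 0.50),
  away_runs_allowed = runif(n, 3.5, 5.5),
  home_hard_hit_l10 = runif(n, 0.25, 0.55)
)
dat$home_score <- 0.5 + 6 * dat$home_hard_hit +
  0.5 * dat$away_runs_allowed + 2 * dat$home_hard_hit_l10 + rnorm(n, sd = 2)

# fit1: the original two-variable model
fit1 <- lm(home_score ~ home_hard_hit + away_runs_allowed, data = dat)

# fit2: the same model plus hard-hit rate over the last 10 games
fit2 <- lm(home_score ~ home_hard_hit + away_runs_allowed + home_hard_hit_l10,
           data = dat)

# Did the extra variable help? Compare R-squared, then run an F-test
c(fit1 = summary(fit1)$r.squared, fit2 = summary(fit2)$r.squared)
anova(fit1, fit2)
```

R-squared never goes down when you add a variable, so a small bump on its own doesn’t prove that streaks matter. The p-value in the anova table is a better first check.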

Lines 53-70

This should start to look familiar. The section here repeats some of our previous steps.

First, on Line 56 we’re adding predictions back into the data set after adding a third independent variable.

Then, Lines 60-70 repeat all of the steps we did to analyze the home score, but do it instead for the away score.

Lines 75-81

It’s all coming together. We’re using six independent variables – our three each for home and away – and building a linear regression for the full game total.

Once we have it all together, we run “predict” again to analyze our residuals. This is also the command you would use in season to get the daily outputs once you have a fully fleshed out and functioning model.
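As a sketch of how the in-season use might look, the example below fits a six-variable model for the game total on simulated data and then predicts a single upcoming game. All column names, the fit_total name, and the numbers for “today’s” game are invented for illustration.

```r
# Simulated, hypothetical game data with three variables per side
set.seed(9)
n <- 400
dat <- data.frame(
  home_hard_hit     = runif(n, 0.30, 0.50),
  home_hard_hit_l10 = runif(n, 0.25, 0.55),
  home_runs_allowed = runif(n, 3.5, 5.5),
  away_hard_hit     = runif(n, 0.30, 0.50),
  away_hard_hit_l10 = runif(n, 0.25, 0.55),
  away_runs_allowed = runif(n, 3.5, 5.5)
)
dat$total_score <- 1 + 5 * dat$home_hard_hit + 5 * dat$away_hard_hit +
  0.4 * dat$home_runs_allowed + 0.4 * dat$away_runs_allowed + rnorm(n, sd = 3)

# Linear regression for the full game total using all six variables
fit_total <- lm(
  total_score ~ home_hard_hit + home_hard_hit_l10 + home_runs_allowed +
    away_hard_hit + away_hard_hit_l10 + away_runs_allowed,
  data = dat
)

# One upcoming game (made-up inputs), with the same column names as the model
today <- data.frame(
  home_hard_hit = 0.42, home_hard_hit_l10 = 0.45, home_runs_allowed = 4.1,
  away_hard_hit = 0.38, away_hard_hit_l10 = 0.33, away_runs_allowed = 4.8
)
pred_total <- predict(fit_total, newdata = today)
pred_total
```

The key detail is that the new data frame needs a column for every independent variable in the model, spelled exactly the same way.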

Lines 90-97

Now it’s time to apply a logistic regression to attempt to predict whether a game will go over or under the total.

The command “glm” creates a generalized linear model with our dataset, which is the preferred command for logistic regressions. By specifying “family=binomial” we’re asking for an either/or output as it relates to our dependent variable.

“Tot_over” is our dependent variable – a yes/no flag for whether the game went over the total. For our independent variables, we’re testing home and away runs against, and home hard-hit over the last 10 games.

In the final line, we’re again using the “predict” function, this time to predict the probability a game will go over or under the total based on our independent variables.
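Here is a minimal sketch of a logistic regression along those lines, on simulated data. The column names (including tot_over in lowercase) are placeholders, and the use of type = "response" is our assumption about how to get probabilities out of predict; by default, predict on a glm returns log-odds instead.

```r
# Simulated, hypothetical game data
set.seed(11)
n <- 500
dat <- data.frame(
  home_runs_allowed = runif(n, 3.5, 5.5),
  away_runs_allowed = runif(n, 3.5, 5.5),
  home_hard_hit_l10 = runif(n, 0.25, 0.55)
)

# Simulate a 0/1 outcome: 1 = the game went over the total
true_logit <- -4 + 0.4 * dat$home_runs_allowed +
  0.4 * dat$away_runs_allowed + 1.5 * dat$home_hard_hit_l10
dat$tot_over <- rbinom(n, size = 1, prob = plogis(true_logit))

# family = binomial makes glm fit a logistic regression
fit4 <- glm(tot_over ~ home_runs_allowed + away_runs_allowed + home_hard_hit_l10,
            data = dat, family = binomial)

# type = "response" returns probabilities between 0 and 1
dat$over_prob <- predict(fit4, newdata = dat, type = "response")
range(dat$over_prob)
```

The probability of the under is just 1 minus the probability of the over (ignoring pushes).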

Lines 101-102

Finally, in these last two lines we’re taking the probabilities generated by our predictions in fit4 and converting them to odds so we can compare our model’s price against the current prices at sportsbooks.

The functions called here to convert probabilities to odds (and vice-versa) were created in Article 4. If you’re running the code for all articles, those functions are already loaded in RStudio. If you don’t have them loaded, paste in Lines 176-211 from the Article 4 code before these final lines and run it again.
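If you don’t have the Article 4 code handy, the conversion itself is simple arithmetic. The sketch below is our own stand-in, not the Article 4 functions: the names prob_to_american and american_to_prob are hypothetical, and the prices it produces are fair (no-vig) prices.

```r
# Hypothetical helper: probability (between 0 and 1) to American odds
prob_to_american <- function(p) {
  stopifnot(all(p > 0 & p < 1))
  # Favorites (p >= 0.5) get a negative price, underdogs a positive one
  ifelse(p >= 0.5, -100 * p / (1 - p), 100 * (1 - p) / p)
}

# Hypothetical helper: American odds back to an implied probability
american_to_prob <- function(odds) {
  ifelse(odds < 0, -odds / (-odds + 100), 100 / (odds + 100))
}

# Example: model probabilities for three games going over
over_prob  <- c(0.60, 0.50, 0.40)
over_price <- round(prob_to_american(over_prob))
over_price   # roughly -150, -100 and +150
```

If the model’s fair price on an over is -150 and a sportsbook is offering -120, the model thinks that bet has value. Whether the model deserves that trust is a separate question.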

Expand it Out

There was a lot going on here this time, but this is the fun stuff. Now we’ve got game data. We’ve got betting odds. We’ve got a hypothesis and the means to test it. And we’ve got game scores we’re actually predicting. What we’ve got here is a working model.

And if you have your own hypotheses, you hopefully are starting to think about ways you can take a concept like linear regression and apply it to different variables.

Maybe you want to add in slugging percentage or weighted runs created plus. Or possibly you want to go the other direction and analyze pitchers instead of lineups.

If you really want to get in the weeds, you could pull pitch-type data from a team and compare it to an opponent’s strengths against different pitches. The Red Sox staff isn’t throwing many fastballs this year. Do you think the Orioles are going to have a field day against sinker/slider types who eschew the heater? Now you have some techniques to build a model to test out your hypothesis.

Of course, is the model we’ve built here any good? That’s a different question altogether.

What’s next

Once we have a model in place, we need to figure out if we can trust it. We’ll talk a little bit about how to put a model through its paces and evaluate its performance.

One thing we’re going to want to look at is that our model as it currently exists shows a significant p-value for home team hard hit rate, but a much less significant one for away team hard hit rate. We’ll need to think through what that means. Is it logical based on our data, or is something else going on?

What came first

Have questions or want to talk to other bettors on your modeling journey? Come on over to the free Discord and hop into the discussion.
