# Intro to Data Science: Glossary

Jason Scavone
Intro to Data Science
April 1, 2023

### Binomial distribution

A type of distribution that describes how many times one of two mutually exclusive outcomes (a success or a failure) occurs over a series of trials.

Binomial distributions apply when you have a fixed number of independent trials and the probability of success is the same for every trial. You run X trials and record Y, the number of successes.

If we were to flip a coin 100 times to see how often heads comes up, the number of heads would follow a binomial distribution.
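The coin-flip example can be worked through with the standard binomial formula. The following is a minimal Python sketch, assuming a fair coin; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, p_heads = 100, 0.5  # assumption: a fair coin

# Chance of exactly 50 heads in 100 flips
exactly_50 = binomial_pmf(50, n_flips, p_heads)

# Chance of landing anywhere from 45 to 55 heads, inclusive
between_45_and_55 = sum(binomial_pmf(k, n_flips, p_heads) for k in range(45, 56))

print(f"P(exactly 50 heads) = {exactly_50:.4f}")
print(f"P(45 to 55 heads)   = {between_45_and_55:.4f}")
```

Even with a fair coin, landing on exactly 50 heads is fairly unlikely; most of the probability is spread across the totals near 50.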

### Distribution

A description of all possible outcomes for an event, and how often each individual outcome occurs in the overall set of possibilities.

For example, there are 36 specific outcomes when you roll two dice together, but only one way to roll a two (by rolling a one on one die, and a one on the other). One divided by 36 is about 2.8 percent. There are two ways to roll a three (a two and a one, and a one and a two). That happens about 5.6 percent of the time.

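The full distribution of rolls looks like this:

| Total | Ways to roll | Probability (percent) |
|---|---|---|
| 2 | 1 | 2.8 |
| 3 | 2 | 5.6 |
| 4 | 3 | 8.3 |
| 5 | 4 | 11.1 |
| 6 | 5 | 13.9 |
| 7 | 6 | 16.7 |
| 8 | 5 | 13.9 |
| 9 | 4 | 11.1 |
| 10 | 3 | 8.3 |
| 11 | 2 | 5.6 |
| 12 | 1 | 2.8 |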
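The two-dice counts can also be produced by enumerating every pair of faces. A minimal Python sketch, assuming two fair six-sided dice:

```python
from collections import Counter
from itertools import product

# Every ordered pair of faces; (1, 2) and (2, 1) count as separate outcomes
rolls = list(product(range(1, 7), repeat=2))

# Count how many of the 36 pairs produce each total
ways = Counter(a + b for a, b in rolls)

# Convert counts to probabilities
distribution = {total: ways[total] / len(rolls) for total in sorted(ways)}

for total, prob in distribution.items():
    print(f"{total:>2}: {ways[total]} way(s), {prob:.1%}")
```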

### Monte Carlo

Simulations used to help predict the outcomes of uncertain events. Instead of specific, fixed values, these simulations take inputs drawn at random from a range of values.

Run over and over again, a Monte Carlo simulation reveals a pattern in the results, which gives you a distribution of possible outcomes. Unabated’s prop simulators, for example, use Monte Carlo simulations to arrive at the likelihood of potential outcomes based on a range created from prop projections.
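To make the idea concrete, here is a minimal Python sketch of a Monte Carlo simulation for a strikeout prop. It is not how Unabated's simulators work; the workload, the strikeout-rate range, and the prop line are all hypothetical numbers picked for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

N_SIMS = 20_000
BATTERS_FACED = 24            # hypothetical workload for one start
K_RATE_RANGE = (0.22, 0.30)   # hypothetical range for strikeout rate per batter
PROP_LINE = 5.5               # hypothetical over/under line

def simulate_game() -> int:
    """Simulate one start and return the strikeout total."""
    # Draw the input from a range rather than fixing a single value
    k_rate = random.uniform(*K_RATE_RANGE)
    # Each batter faced is a strikeout with probability k_rate
    return sum(random.random() < k_rate for _ in range(BATTERS_FACED))

results = [simulate_game() for _ in range(N_SIMS)]

mean_ks = sum(results) / N_SIMS
p_over = sum(r > PROP_LINE for r in results) / N_SIMS

print(f"Average strikeouts: {mean_ks:.2f}")
print(f"P(over {PROP_LINE}): {p_over:.3f}")
```

The list `results` is the simulated distribution; any question about the prop (over, under, exactly six) becomes a matter of counting how often it happened across the runs.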

### Multivariable regression

A type of regression that is used to establish the relationship of an outcome, or dependent variable, to two or more independent variables.

For example, if you want to learn how a baseball team scores runs (the dependent variable), you might use slugging percentage and on-base percentage as your independent variables.
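The baseball example can be sketched as an ordinary least-squares fit with two independent variables. The Python below uses only the standard library and solves the normal equations directly; the team figures are synthetic, generated from a made-up formula so the fit has a known answer, and the helper names `solve` and `fit_ols` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(xs, ys):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in xs]  # prepend an intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic (on-base percentage, slugging percentage) pairs, not real teams
teams = [(0.300, 0.380), (0.310, 0.420), (0.320, 0.390),
         (0.330, 0.450), (0.340, 0.400), (0.315, 0.430)]

# Made-up "true" relationship used to generate runs per game
runs = [-4.0 + 16.0 * obp + 8.0 * slg for obp, slg in teams]

intercept, b_obp, b_slg = fit_ols(teams, runs)
print(f"runs = {intercept:.2f} + {b_obp:.2f} * OBP + {b_slg:.2f} * SLG")
```

Because the synthetic data contains no noise, the fit recovers the made-up coefficients. With real team data the coefficients would be estimates, and a statistics package would be a more robust choice than hand-rolled linear algebra.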

### Negative binomial

The chief difference between a binomial distribution and a negative binomial distribution is what gets counted. Instead of recording how many successes we get over a fixed number of trials, we record how many trials it takes to reach a fixed number of successes.

A binomial distribution asks “Out of 100 coin flips, how many times will it come up heads?” A negative binomial distribution asks “How many times will you need to flip a coin to get heads 50 times?”
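The 50-heads question can be worked through with the standard negative binomial formula. A minimal Python sketch, assuming a fair coin; the function name `neg_binomial_pmf` is illustrative.

```python
from math import comb

def neg_binomial_pmf(n_trials: int, r_successes: int, p: float) -> float:
    """Probability that the r-th success arrives on exactly trial n."""
    if n_trials < r_successes:
        return 0.0
    # The final trial must be a success; the other r - 1 successes
    # can fall anywhere among the first n - 1 trials.
    return (comb(n_trials - 1, r_successes - 1)
            * p**r_successes
            * (1 - p) ** (n_trials - r_successes))

r, p = 50, 0.5  # 50 heads wanted, fair coin assumed

# Chance the 50th head lands on exactly the 100th flip
p_exactly_100 = neg_binomial_pmf(100, r, p)

# Chance of reaching 50 heads within the first 100 flips
p_within_100 = sum(neg_binomial_pmf(n, r, p) for n in range(r, 101))

# Average number of flips needed
expected_flips = r / p

print(f"P(50th head on flip 100) = {p_exactly_100:.4f}")
print(f"P(50 heads within 100)   = {p_within_100:.4f}")
print(f"Expected flips           = {expected_flips:.0f}")
```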

### Package

In the programming language R, a package is an extension you can add to your R installation. There are several packages that can be used to analyze sports data. We go into more depth on installing packages in our Sports Betting Data Basics section.

### Poisson distribution

A type of distribution that measures the likelihood of a number of events happening over a set period of time, using an average of how often those events typically occur.

A Poisson distribution could be used, for example, to assess the likelihood that a pitcher who averages seven strikeouts a game records exactly five, six, seven, or eight punchouts in a given matchup.
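The strikeout example can be computed with the standard Poisson formula. A minimal Python sketch, taking the seven-strikeout average as the only input; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average is lam."""
    return exp(-lam) * lam**k / factorial(k)

avg_strikeouts = 7  # the pitcher's average from the example

# Probability of exactly 5, 6, 7, or 8 strikeouts
probs = {k: poisson_pmf(k, avg_strikeouts) for k in (5, 6, 7, 8)}

for k, prob in probs.items():
    print(f"P(exactly {k} strikeouts) = {prob:.3f}")

print(f"P(5 through 8 strikeouts) = {sum(probs.values()):.3f}")
```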

### Regression

A statistical process for estimating the relationship between a dependent variable and one or more independent variables.

The most commonly used is a linear regression. This shows whether independent variables are good predictors of a dependent variable. It can also be used to analyze which variables are the best predictors of an outcome. A linear regression finds the best-fitting “line” through the data points, typically by least squares, which reveals underlying correlations. Linear regressions can be simple, exploring the relationship between one independent variable and the dependent variable. Or they can be multivariable, using multiple independent variables.

For example, if we think weighted on-base average will have a direct impact on runs scored in a baseball game, or average depth of target will mean more points in a football game, we could analyze this using a linear regression.
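A simple linear regression of runs on weighted on-base average can be sketched in a few lines of Python. The five data points below are made up for illustration, and the helper names `fit_line` and `r_squared` are hypothetical; the fit uses ordinary least squares.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up team wOBA and runs per game, for illustration only
woba = [0.300, 0.310, 0.320, 0.330, 0.340]
runs_per_game = [4.0, 4.3, 4.4, 4.8, 5.0]

slope, intercept = fit_line(woba, runs_per_game)
r2 = r_squared(woba, runs_per_game, slope, intercept)

print(f"runs per game = {intercept:.2f} + {slope:.2f} * wOBA")
print(f"R-squared = {r2:.3f}")
```

An R-squared close to 1 means the line explains most of the variation in the dependent variable, which is one way to judge whether the independent variable is a good predictor.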
