Construction of a Predictive Model for MLB Matches

Abstract: The main purpose of this article was to define a model that could beat the online bookmakers' odds, where the betting item considered was the first five innings of major league baseball (MLB) matches. The betting odds of online bookmakers have two purposes: first, they are used to quantify the amount of profit made by the bettors; second, they are regarded as a market equilibrium point between multiple bookmakers and bettors. If the bettors have a more accurate prediction model than the system used to produce the betting odds, it will create a positive expected return for the bettors. In this article, we used the Markov process method and the runner advancement model to estimate the expected runs in an MLB match for the teams based on the batting lineup and


Background
Predicting patterns of behavior plays a pivotal role in many walks of life, including COVID-19 time series, animal movement, the stock market and the sports industry. However, the quality of a prediction depends heavily on the field being forecast. Jeffrey Ma [1], the inspiration for the main character of the movie "21", mentioned in his book The House Advantage: Playing the Odds to Win Big in Business that it is much simpler to predict the unseen cards in a dealer's hand than it is to predict the future return of the stock market, because there are only 52 cards in a deck. The target of prediction discussed in this article is the probability of winning in baseball. The difficulty of predicting the probability of winning in baseball lies between that of predicting the unseen cards and that of predicting the stock market return. The probability of guessing the next card in the dealer's hand depends only on the cards that have already appeared on the table and their number, while the prediction of the future stock market return involves fundamental analysis factors and technical analysis indicators, as well as political and economic factors at home and abroad. Fans of baseball may have heard that a pitcher's impact on a game is greater than a hitter's impact. Nevertheless, pitchers are certainly not the only factor affecting the outcome of baseball matches.
As far as sports activities are concerned, improving the accuracy of match outcome predictions not only helps coaches and athletes decide how to manage their players, but also creates lucrative opportunities in sports wagering. At present, more than 120 countries around the world have issued sports lottery tickets. In Nevada alone, home of the so-called "city of gambling" in the United States, sports lottery sales in 2017 amounted to as much as 4.9 billion USD, a rise of 440% compared with 1984. In 2017, baseball-related bets accounted for 23% of all sports betting, second only to football [2]. In order to make a profit from sports wagering, many individuals and institutions have devoted themselves to the study of match predictions. However, it is difficult to improve the accuracy of predictions. Among the many sources of information that can be used to predict match outcomes, the betting odds offered by bookmakers are one of the easiest sources from which the public can obtain professional predictions, and they are now readily available from online bookmakers. In order to make an ideal profit, bookmakers need to predict game outcomes accurately; therefore, they have developed several models that estimate the probabilities of the possible results in various sports competitions [3,4]. At present, there are three major betting odds systems: European, British and American odds. Most online bookmakers allow the display to be switched between the different odds formats in order to satisfy the bettors' preferences. The betting odds used in this article are European odds, also known as decimal odds, which are calculated as the reciprocal of the probability of an outcome: assuming that the probability of outcome i is $\pi_i$, the European odds can be expressed as $1/\pi_i$.
There are many sources of betting odds. Fixed odds alone are offered by many online bookmakers, such as Bet365 [5] and Betclic, among others. Each bookmaker announces different betting odds, and when the odds vary widely, it implies that violations of market efficiency exist. In fact, if everyone were to bet on just one of the outcomes, the money coming in would be skewed one way for the bookmaker. In response, the bookmaker will increase its margin on the popular line to discourage betting, and will reduce its margin on the less popular line to encourage betting. Such hedging action changes the implied probability of the bookmaker, and is one of the factors that can change the implied probability [6]. However, the announced betting odds can still inform a betting decision. Utilizing the situation where bookmakers are forced to increase or decrease their betting odds, Kaunitz et al. [7] proposed a strategy to beat football bookmakers with their own numbers. Instead of building a forecasting model to compete with bookmakers' predictions, they exploited the probability information implicit in the odds publicly available in the marketplace to find bets with mispriced odds. Shin [6,8,9] proposed Shin probabilities based on the assumption that bookmakers quote odds that maximize their expected profit in the presence of uninformed bettors and a known proportion of "insider" traders.

Literature Review
The motivation for this article is to independently develop an implied probability with even higher accuracy. In the past, there have been many studies on how to beat bookmakers' betting odds. Some have built statistical models using power scores, Elo ratings, Maher-Poisson approaches, Bayesian networks, and pi-ratings [10][11][12][13][14]. The method we propose here is to use a Markov chain [15], based on the D'Esopo and Lefkowitz runner advancement model (RAM) [16], to calculate the expected number of runs scored (ENRS) and the probability of leading in the first five innings of a baseball game. An inning is the basic unit of play in baseball, and a full game is typically scheduled for nine innings. We deal with a stochastic process characterized by the rule that only the current state of the process can influence the choice of the next state; that is, the process has no memory of its previous states. Such a process is called a Markov process, after the prominent Russian mathematician Andrey Markov. In baseball models, the states are usually the various runners and outs situations. The Markov chain assumption means that we are not interested in how we arrived at a particular situation. The relevant literature on these theories over recent years includes Hirotsu [17] and Hirotsu and Bickel [18]; the latter found the batting order that achieves a probability of winning of more than 0.5 against the other possible batting orders. Hirotsu and Wright [19] proposed a formulation for obtaining the optimal pinch-hitting strategy under the designated hitter rule. Fritz and Bukiet [20] extended the Markov chain model to introduce an objective criterion for selecting the MLB most valuable player (MVP) award winners and Cy Young award winners. Smith [21] presented a Markov chain model for predicting the scores and the winning team of MLB games.
Chang [22] modified the runner advancement model and presented the significant factors that impact the ENRS. Chang [23,24] measured the number of contributed runs per game for each player. In this paper, we predicted the "runs of the first five innings" of an MLB match before the game using the Markov process method and the runner advancement model. We used historical data of the batting lineup vs. the pitcher from the MLB official website to estimate the expected number of runs for the teams. We tested the efficiency of MLB betting markets by examining the predictive ability of the following two kinds of probability:

1. The Bet365 [5] online bookmaker announces its first five innings betting odds at 10 p.m. China Standard Time. The odds are pre-game odds obtained before the start of the match. This information is relevant, since a starting pitcher in MLB usually rests four or five days after pitching a game before pitching another. Therefore, most MLB teams have five or six starting pitchers on their rosters. These pitchers, and the sequence in which they pitch, are known as the rotation. For the most part, the starting lineups stay the same once posted. The choice of bookmaker probably does not have a large effect on the homogeneity of prices, owing to the transparency of odds among online bookmakers and the high competition in the market. The betting odds already include the profits of the bookmaker. Therefore, the inverse odds cannot be regarded as the implied probability directly. In the next section, we will introduce how to adjust for bookmakers' profits using basic normalization (BN) based on the literature.

2. We calculated the ENRS (based on batter vs. pitcher career statistics) and the probability of leading in the first five innings, which is the new implied probability (NIP) proposed in this article.
In a betting market, forecasting models are judged in terms of their accuracy and profitability. Wunderlich and Memmert [25] presented the counterintuitive relationship between accuracy and profitability in probabilistic forecasts in relation to betting markets. They argued that betting accuracy should not be treated as a valid measure of a forecasting model. This article therefore evaluates the probabilities from the two perspectives of accuracy and profitability: first, the ranked probability score (RPS) [26] is a measure of how similar two probability distributions are and is used to evaluate the quality of a probabilistic prediction; second, the expected value (EV) is a measure of what a bettor can expect to win or lose per bet placed at the same odds. The structure of the remainder of this article is as follows: in Section 2, the calculation methods of BN and NIP are introduced; the data sources of betting odds and competition scoring are introduced in Section 3, together with an explanation of how to use RPS and EV to evaluate the three kinds of probability (BN, NIP and NIP-NBD); in Section 4, the results for the three kinds of probability are presented; and suggestions and a discussion are provided in Section 5.

Basic Normalization
Assume that $o = (o_1, o_2, \cdots, o_n)$ is the betting odds for a match with n outcomes, where $n \geq 2$, and the inverse odds are $\pi = (\pi_1, \pi_2, \cdots, \pi_n)$ with $\pi_i = 1/o_i$. The value $\pi_i$ could be regarded as the occurrence intensity of an outcome, but it cannot be regarded as the implied probability directly, as the bookmaker has set the total sum of the $\pi_i$ to be greater than 1, which means

$$\Pi = \sum_{i=1}^{n} \pi_i > 1;$$

we can regard $\Pi - 1$ as the bookmaker's profit. In order to normalize the $\pi_i$, we divide $\pi_i$ by $\Pi$ to obtain $p_i = \pi_i/\Pi$. We call $p_i$ the implied probability subject to basic normalization.
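The normalization above can be illustrated with a minimal sketch. The function name `basic_normalization` is hypothetical; the odds 1.6 and 2.35 are the pair quoted later in this article for one match, used here only as sample input.

```python
def basic_normalization(odds):
    """Return (implied probabilities, bookmaker margin) for decimal odds."""
    if len(odds) < 2 or any(o <= 1.0 for o in odds):
        raise ValueError("need at least two decimal odds, each greater than 1")
    inverse = [1.0 / o for o in odds]      # pi_i = 1 / o_i
    total = sum(inverse)                   # Pi, greater than 1 for a bookmaker
    probs = [pi / total for pi in inverse] # p_i = pi_i / Pi
    return probs, total - 1.0              # Pi - 1 is the bookmaker's profit


probs, margin = basic_normalization([1.6, 2.35])
print([round(p, 4) for p in probs], round(margin, 4))
```

For these odds the inverse odds sum to about 1.05, so the raw reciprocals overstate both outcomes until they are divided by that sum.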

New Implied Probability (NIP)
Unlike BN, the NIP is not derived from the bookmaker's betting odds, but uses a Markov chain and the RAM to calculate the ENRS in the first five innings. To calculate the NIP, we first define a matrix $U_n$ containing the probabilities of the scoring, base and out situations of the current half-inning, where n denotes the nth batter of the inning. The columns of $U_n$ represent the current scoring; there are 21 columns (0, 1, ..., 20) in total, and the last one was set to 20 points because the Boston Red Sox and Detroit Tigers once scored 19 points in 1953, among MLB's highest scoring records in a single inning. The rows of $U_n$ represent the current number of outs and base states. According to the framework of the transition matrix required by the Markov process, the outs and base states faced by the batting team can be summarized as the 25 states in Table 1 [16]: eight base states (bases empty, runner on first, runner on second, runner on third, runners on first and second, runners on second and third, runners on first and third, and runners on first, second and third) for each of 0 outs, 1 out and 2 outs, and finally the state of 3 outs. As Bukiet's [16] transition matrix $T_{25 \times 25}$ is the basis of the Markov chain, we split it according to the scoring effect it generates into the five categories $T_0$, $T_1$, $T_2$, $T_3$ and $T_4$. For example, $T_0^{(n)}$ represents the probability matrix of scoring 0 points for the hitting outcome generated by the nth batter in the inning. Likewise, $T_4^{(n)}$ represents the probability matrix of scoring 4 points for the hitting outcome generated by the nth batter in the inning. Therefore, $U_n$ provides information on the scoring outcome and the probability generated by the nth batter during the inning. As a result, before the start of the game $U_0[1,1] = 1$ and the other matrix elements are all 0, which means that the probability of scoring 0 points with 0 runners on base and 0 outs is 1.
To make the $U_n$ matrix easier to understand, consider the $U_1$ matrix (21 × 25) after the first batter has completed their plate appearance in the first inning. Only the vectors of the first column (scoring 0 points) and the second column (scoring 1 point) contain non-zero probabilities; the other vectors are all zero vectors. RAM simplifies the runner advancement and scoring outcomes into six cases, as shown in Table 2, which can be used to estimate the effect of runner advancement. Therefore, with the states ordered as listed above, the non-zero elements of the $U_1$ matrix for the first batter in the first inning are expressed as follows:

$$U_1[1,2] = p(\mathrm{BB}) + p(\mathrm{1B}), \quad U_1[1,3] = p(\mathrm{2B}), \quad U_1[1,4] = p(\mathrm{3B}), \quad U_1[1,9] = p(\mathrm{Out}), \quad U_1[2,1] = p(\mathrm{HR}).$$

The expected scoring of the first batter in the first inning is: 0 × the sum of the probabilities of the 1st column + 1 × the sum of the probabilities of the 2nd column + ... + 20 × the sum of the probabilities of the 21st column = 1 × p(HR).

Hitting Conditions | Outcome
Base on Balls (BB) | Batter safely reaches first base; runners advance only when forced, so a point is scored only when the bases are loaded.
One Base Hit (1B) | Batter safely reaches first base; the runner on first base safely reaches second base, and the remaining runners score.
Two Base Hit (2B) | Batter safely reaches second base; the runner on first base safely reaches third base, and the remaining runners score.
Three Base Hit (3B) | Batter safely reaches third base, and all runners score.
Home Run (HR) | Batter scores a point, and all runners score.
Therefore, the expected score contributed by all the batters within a half-inning is as follows:

$$\sum_{k=0}^{20} k \times \left(\text{sum of the probabilities of the } (k+1)\text{th column of } U_0 T^{(1)} T^{(2)} \cdots T^{(9)} T^{(1)} T^{(2)} \cdots\right).$$

This rule is followed until the sum of the probabilities of $U_n[,25]$ is greater than 0.999; that is, given the 21 states of scoring, if the marginal probability of three outs is greater than 0.999, the half-inning is immediately discontinued. Proceeding in this way through the first five innings of the match, the expected scores at the end of the first five innings can be obtained.
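The half-inning calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the array is indexed as `U[points][state]` with zero-based indices (so the article's $U_1[2,1]$ is `U[1][0]`), states follow the ordering listed for the 25 states, a walk advances only forced runners, runners hold on an out (the usual D'Esopo and Lefkowitz convention, assumed here because Table 2 does not list the out case), and the batter probabilities are hypothetical.

```python
EVENTS = ("BB", "1B", "2B", "3B", "HR", "Out")
# Base states (first, second, third) in the order used for the 25 states.
BASES = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
         (1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
MAX_RUNS, END = 20, 24  # 21 score levels; state 24 is three outs


def advance(bases, event):
    """Return (new_bases, points) for a non-out event under the RAM rules."""
    f, s, t = bases
    if event == "BB":  # assumption: only forced runners move
        if f and s and t:
            return (1, 1, 1), 1
        if f and s:
            return (1, 1, 1), 0
        if f:
            return (1, 1, t), 0
        return (1, s, t), 0
    if event == "1B":
        return (1, f, 0), s + t
    if event == "2B":
        return (0, 1, f), s + t
    if event == "3B":
        return (0, 0, 1), f + s + t
    return (0, 0, 0), f + s + t + 1  # HR


def step(U, probs):
    """Apply one batter's outcome probabilities to the distribution U."""
    V = [[0.0] * 25 for _ in range(MAX_RUNS + 1)]
    for r in range(MAX_RUNS + 1):
        V[r][END] += U[r][END]  # three outs is absorbing
        for state in range(END):
            mass = U[r][state]
            if mass == 0.0:
                continue
            outs, b = divmod(state, 8)
            for event in EVENTS:
                if event == "Out":  # assumption: runners hold on an out
                    nxt, pts = (END if outs == 2 else (outs + 1) * 8 + b), 0
                else:
                    new_bases, pts = advance(BASES[b], event)
                    nxt = outs * 8 + BASES.index(new_bases)
                V[min(r + pts, MAX_RUNS)][nxt] += mass * probs[event]
    return V


def half_inning(lineup, threshold=0.999):
    """Cycle through the lineup until P(three outs) exceeds the threshold.

    Every batter needs a positive "Out" probability, or the loop never ends.
    """
    U = [[0.0] * 25 for _ in range(MAX_RUNS + 1)]
    U[0][0] = 1.0  # 0 points, 0 outs, bases empty
    n = 0
    while sum(row[END] for row in U) <= threshold:
        U = step(U, lineup[n % len(lineup)])
        n += 1
    return U


def expected_runs(U):
    """Sum over score levels of points times their marginal probability."""
    return sum(r * sum(row) for r, row in enumerate(U))


# Hypothetical batter, repeated nine times to form a lineup.
batter = {"BB": 0.08, "1B": 0.15, "2B": 0.05, "3B": 0.005, "HR": 0.03, "Out": 0.685}
print(round(expected_runs(half_inning([batter] * 9)), 3))
```

Capping the score at 20 mirrors the 21 score levels of $U_n$, and the 0.999 threshold is the stopping rule stated above. Carrying the batting order across innings is left out of this sketch.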
Next, we explain how to use the $U_n[,25]$ at the end of the first five innings, namely the scoring distribution of each team, to calculate the leading, tied or behind probabilities of the match. By adjusting the nine-inning winning expression proposed by Bukiet [16], we can obtain the leading probability as follows:

$$P(\text{leading}) = \sum_{i=1}^{20} f(x = i) \sum_{j=0}^{i-1} f(y = j),$$

while the behind probability is determined as follows:

$$P(\text{behind}) = \sum_{j=1}^{20} f(y = j) \sum_{i=0}^{j-1} f(x = i),$$

where $f(x = i)$ represents the probability of team X scoring i points at the end of the first five innings, and $f(y = j)$ represents the probability of team Y scoring j points at the end of the first five innings. In addition, Dolinar [27] proposed that $f(x = i)$ and $f(y = j)$ can be estimated using the negative binomial distribution (NBD), where the success event of the NBD is defined as the 15 outs and the failure event of the NBD is defined as scoring i points. However, as the BN used in this article does not consider the tied probability, to compare on the same basis we used BN to define the NIP as follows:

$$\mathrm{NIP}_{\text{leading}} = \frac{P(\text{leading})}{P(\text{leading}) + P(\text{behind})}, \qquad \mathrm{NIP}_{\text{behind}} = \frac{P(\text{behind})}{P(\text{leading}) + P(\text{behind})}.$$
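As a minimal sketch of this step, the leading, behind and tied probabilities can be computed from two score distributions and then normalized without the tie. The sketch assumes the two teams' five-inning scores are independent; the function names and the short distributions `fx` and `fy` are hypothetical.

```python
def lead_behind_tie(fx, fy):
    """fx[i] = P(team X scores i), fy[j] = P(team Y scores j)."""
    lead = behind = tie = 0.0
    for i, px in enumerate(fx):
        for j, py in enumerate(fy):
            if i > j:
                lead += px * py
            elif i < j:
                behind += px * py
            else:
                tie += px * py
    return lead, behind, tie


def new_implied_probability(fx, fy):
    """Normalize leading and behind probabilities so that they sum to 1."""
    lead, behind, _ = lead_behind_tie(fx, fy)
    total = lead + behind
    return lead / total, behind / total


# Hypothetical score distributions over 0..3 points.
fx = [0.30, 0.30, 0.25, 0.15]
fy = [0.50, 0.30, 0.15, 0.05]
print(lead_behind_tie(fx, fy), new_implied_probability(fx, fy))
```

Dropping the tied mass before normalizing keeps the result comparable with BN, which is built from two-outcome odds.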

Research Tools
The data source was the 70 MLB matches with the most batter vs. pitcher matchup stats in 2018, selected from the MLB statistics from 15 September 2018 to 30 September 2018. The betting odds were the first five innings odds of MLB matches announced by the Bet365 [5] online bookmaker. The starting pitchers and batting order of a match were taken from the MLB website [28], and the batter vs. pitcher matchup stats were taken from [29]. Among the indicators considered in RAM, we used the following formulas to calculate the different probabilities, for example:

$$\mathrm{BB\%} = \mathrm{BB}/(\mathrm{BB} + \mathrm{1B} + \mathrm{2B} + \mathrm{3B} + \mathrm{HR} + \mathrm{Out})$$
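The conversion from matchup counts to RAM probabilities can be sketched as below. Only the BB% formula appears in the text; applying the same denominator to the other five outcomes is an assumption of this sketch, and the counts are hypothetical.

```python
OUTCOMES = ("BB", "1B", "2B", "3B", "HR", "Out")


def ram_probabilities(counts):
    """Turn batter vs. pitcher counts into outcome probabilities."""
    total = sum(counts[k] for k in OUTCOMES)
    if total == 0:
        raise ValueError("no plate appearances recorded for this matchup")
    # Assumption: every outcome shares the denominator shown for BB%.
    return {k: counts[k] / total for k in OUTCOMES}


# Hypothetical career matchup counts for one batter against one pitcher.
counts = {"BB": 3, "1B": 6, "2B": 2, "3B": 0, "HR": 1, "Out": 18}
probs = ram_probabilities(counts)
print(probs["BB"], probs["Out"])
```

With only a handful of plate appearances per matchup these estimates are noisy, which is one reason the matches with the most matchup data were selected.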

Ranked Probability Score (RPS)
Epstein [26] proposed that the RPS should be used to evaluate the difference between the prediction probability and the real outcome. Constantinou and Fenton [30] stated that the RPS is an agreed scoring rule for determining a forecasting model's accuracy. In fact, RPS is a formula for calculating the linear distance between the prediction probability and the real outcome, so RPS is always greater than or equal to 0, and the closer RPS is to 0, the more accurate the prediction is. Taking the betting item discussed in this article as an example, we compare BN and NIP at the end of the first five innings with the real outcomes and evaluate their differences. If we consider the total accuracy of the betting on n matches, the calculation formulas are as follows: where $1_{i,l}$ represents the indicator function of whether the team in match i actually led.
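For orientation, the sketch below implements the standard definition of the RPS for ordered outcomes, averaged over matches. It is not necessarily the exact formula used in this article, whose equations are not reproduced above. The probability 0.83 is the NIP value quoted later for one match; the outcomes passed in are hypothetical.

```python
def rps(probs, outcome_index):
    """Standard ranked probability score for one forecast over ordered outcomes."""
    r = len(probs)
    cum_forecast = cum_observed = total = 0.0
    for k in range(r - 1):
        cum_forecast += probs[k]
        cum_observed += 1.0 if k == outcome_index else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (r - 1)


def mean_rps(forecasts, outcomes):
    """Average RPS over n matches."""
    return sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Two outcomes (leading, behind): the score reduces to a squared error.
print(rps([0.83, 0.17], 0))  # hypothetical: the favoured team leads
print(rps([0.83, 0.17], 1))  # hypothetical: the favoured team is behind
```

With two outcomes, a forecast of 0.5 scores 0.25 whatever happens, which gives a reference point for reading average RPS values.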

Expected Value (EV)
The expected value (EV) is used to measure whether the betting odds for an outcome are a valuable betting opportunity. In short, a bet is considered valuable if the probability of winning multiplied by the net profit, minus the probability of losing multiplied by the bet cost, is greater than 0. The larger this value is, the greater the expected value of the betting odds under the implied probability. As far as BN is concerned, because the bookmaker has reduced the betting odds in order to create its own profits, the EV has to be negative: with $p_i = \pi_i/\Pi$ and $o_i = 1/\pi_i$, the expected return of a unit bet is $p_i o_i = 1/\Pi < 1$. This means that betting on the basis of BN is not an appropriate strategy. Therefore, regarding the comparison of EV, we present only the EV subject to the NIP. Thus, assuming that $o_i = (o_{i,l}, o_{i,b})$ is the betting odds for the two outcomes (leading and behind) of match i, we obtain: where n represents the total number of matches for betting, and apart from the different implied probabilities used, the above two equations are consistent; $1_{i,l}$ represents the real leading indicator function of match i; and $1_{i,b}$ represents the real behind indicator function of match i.
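The sign argument for BN can be checked numerically with the standard expected value of a unit bet. This is a sketch of the textbook definition, not of the article's own EV equations, which are not reproduced above. The odds 1.6 and 2.35 are the pair quoted in this article; the probability 0.65 is hypothetical.

```python
def expected_value(p_win, odds, stake=1.0):
    """Expected profit of one bet: win (odds - 1) * stake or lose the stake."""
    return p_win * (odds - 1.0) * stake - (1.0 - p_win) * stake


odds = (1.6, 2.35)
inverse = [1.0 / o for o in odds]
overround = sum(inverse)
bn = [pi / overround for pi in inverse]  # basic normalization

# Under BN every outcome has the same negative EV, equal to 1 / Pi - 1.
print([round(expected_value(p, o), 4) for p, o in zip(bn, odds)])

# A model that rates the first team higher than BN does can see positive EV.
print(round(expected_value(0.65, 1.6), 4))  # 0.65 is a hypothetical probability
```

A bet only has positive expected value when the bettor's probability exceeds the reciprocal of the odds, which is why an independent estimate such as the NIP is needed.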

Comparison of the Prediction Probability and the Real Outcomes
We compared the accuracy of the three prediction probabilities (Table A1). The leading probability in Table A1 refers to the team in the left column; for example, the first row refers to the match held on 15 September 2018 between the New York Yankees (NYY) and the Toronto Blue Jays (TOR). NYY's first five-inning leading probabilities according to BN, NIP and NIP-NBD were 0.71, 0.83, and 0.84, respectively. The values presented in the last column are the real match outcomes, where 1 indicates that the team in the left column (NYY) leads, 0 indicates that the team in the right column (TOR) leads, and "-" indicates a tie. Tables 3 and 4 present the RPS and EV results for a specific date (15 September 2018) and team (NYY), respectively. The average RPS values over the 70 matches were $RPS_{BN} = 0.24$, $RPS_{NIP} = 0.17$ and $RPS_{NIP\_NBD} = 0.15$, and a Wilcoxon signed-rank test revealed that these differences were statistically significant (p = 0.02).

Comparison of the Expected Value
Without considering the tied situations (leaving a total of 60 matches), we compared the prediction accuracy of NIP and NIP-negative binomial distribution (NIP-NBD) according to the information provided in Table A1, which was NIP = 45/60 = 75% and NIP-NBD = 46/60 = 76.7%. The NIP-NBD exhibited higher consistency with the real outcomes. In terms of the expected value, the betting odds $(o_1, o_2)$ for the match in the second row were 1.6 and 2.35, respectively, while in the five-inning outcome the party with betting odds of 2.35 led, and the expected values calculated according to NIP and NIP-NBD were −0.44 and −0.48, respectively. The expected values using the predictions for the 60 matches were $EV_{NIP} = 10.15$ and $EV_{NIP\_NBD} = 11.51$, which means that if one unit bet was placed on each match, the 60-match expected values subject to NIP and NIP-NBD were 10.15 and 11.51, respectively.

Discussion and Application
In this article, we introduced the betting odds information most accessible to the general public, namely the betting odds announced by bookmakers. To account for the profits of the bookmakers, we used BN to recover the implied probabilities from the betting odds. The Markov chain and RAM are important theoretical foundations for predicting baseball scoring. In this article, we used these models to predict MLB match scoring and to calculate the implied probability NIP. By evaluating the probability values in terms of both prediction accuracy and expected value, we showed that, for the matches considered (n = 70), NIP has advantages in terms of both RPS and EV. In fact, in the theoretical analysis of the 70 matches in this article, if we go back to the moment of betting, when the outcomes were unknown, the total returns for the 70 matches according to the prediction probability models NIP and NIP-NBD are 23.89 and 22.69, respectively, which correspond to returns on investment (ROI) of 34.13% and 32.41%.
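The ROI figures follow from dividing the total return by the total amount staked. A minimal check, assuming one unit was staked on each of the 70 matches (the function name is hypothetical):

```python
def roi_percent(total_return, n_bets, stake=1.0):
    """Return on investment as a percentage of the total amount staked."""
    return 100.0 * total_return / (n_bets * stake)


print(round(roi_percent(23.89, 70), 2))  # NIP
print(round(roi_percent(22.69, 70), 2))  # NIP-NBD
```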
By its very nature, the sport of baseball is highly suited to being modeled as a Markov chain, as it is split into discrete standalone plays [31][32][33]. Through this paper, we have been able to derive first five-inning scores for a match. The transition probabilities, which are derived from the hitting conditions, are used for our modeling of baseball as a Markov process. The input factors to our model are the roster of the home team, the roster of the away team, and the statistics of all players on each team leading up to a game. Baseball is unique in how much data is available online. Statistics on every plate appearance in MLB since 1921 are available for free. Bookmakers can use this abundance of data to improve their odds.
NIPs were not derived from the bookmaker's betting odds but used the batter vs. pitcher matchup history stats between any pair of players. Fans of baseball may have heard that a pitcher's impact on a game is greater than a hitter's impact. A team manager usually wants to get 5-6 innings from his starting pitchers. Sometimes, despite having good arms, good quality pitches and high throwing velocity, pitchers do not have the stamina for those 5-6 innings. Moreover, a starting pitcher must pitch at least five innings to qualify for the win. In summary, a first five innings bet allows us to focus on a much smaller range of factors when searching for value in wagers.
As the first five-inning betting market is small, there are not many historical data sources that can be obtained directly. Therefore, the 70-match betting odds, the historical batter vs. pitcher stats and the starting batting orders referred to in this article are difficult for general researchers to collect automatically. It is thus a limitation of this article that, in practical applications, the relevant information, such as the batting order and betting odds, needs to be input manually before betting, so it takes a lot of time to obtain the NIP. In addition, compared with similar articles, the number of samples in this study was low. In future studies, we would increase the number of matches considered.

Data Availability Statement: Data available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. These data can be found here: www.mlb.com.

Conflicts of Interest:
The author declares no conflict of interest.