Cricket is one of many sports played with a ball and a bat, and its set of rules makes it distinct from the others. The game has evolved over the years from Test matches to one-day matches, and in the past few years T20 cricket has attracted a lot of attention. To date, however, the ICC Cricket World Cup, a limited-overs tournament played over 50 overs per side, remains the most prestigious tournament of all.
The ICC Cricket World Cup is an international sporting event that has been held approximately every four years since its inception in 1975, with preliminary qualification rounds leading up to the finals. Studies of cricket have examined the physiological, psychological, and physical demands on batsmen, wicket-keepers, spinners, and pace bowlers in different formats of play; more recently, a few studies have focused on the performance analysis of individual players or a whole team by calculating the effect size. To the best of our knowledge, however, none has focused on developing a model to predict the outcome of a match from the historical performance of both the team and its individual players.
We believe this predictive analysis strategy would be very useful for viewers, sponsors, and team strategists. This would also give insights to various cricket analysts and commentators about the features that play a crucial role in statistical analysis.
A model could be built to predict the outcome of every match of the 2019 World Cup and, from those outcomes, the winner of the tournament. In this context, we feel that by closely studying the historical performance of the players in one-day international matches, we should be able to assign a performance score to each player. We will decide on a methodology to derive the team performance score from the individual players’ scores. The team performance scores will then determine each team’s chances of winning a match. Through this article, we hope to identify the key player-level parameters that have a significant impact on the team’s outcome in a match.
Data Source
The data for this predictive analysis could be obtained from Wikipedia and the ESPN Cricinfo website. The ‘Statsguru’ service provided by ESPN Cricinfo could also be used to extract individual player statistics.
The variables most needed to describe batting and bowling performance are listed below.
Batting
Player* – Name of the player
Mats – No of matches played
Inns – No of innings played
NO – No of not outs
Runs – No of runs scored
HS – Highest score
Ave – Average of the player
BF – No of balls faced
SR – Strike rate
100s – No of 100s scored
50s – No of 50s scored
0 – No of duck outs
4s – No of fours hit
6s – No of sixes hit

Bowling
Player* – Name of the player
Mats – No of matches played
Inns – No of innings played
Overs – No of overs bowled
Mdns – No of maidens
Runs – No of runs given
Wkts – No of wickets taken
BBI – Best bowling figures in an innings
Ave – Bowling average (Runs/Wickets)
Econ – Bowling economy (runs given per over)
SR – Bowling strike rate (balls bowled per wicket)
5W – No of 5-wicket hauls taken
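Several of the variables above are derived from the raw counts, so they can be recomputed or checked after extraction. The following is a minimal Python sketch using the standard cricket definitions of average, strike rate, and economy. The class names, field names, and the `overs_to_balls` helper are hypothetical and chosen only for illustration; they are not part of any Statsguru export format.

```python
from dataclasses import dataclass
from typing import Optional


def overs_to_balls(overs: str) -> int:
    """Convert scorecard overs notation ('9.4' = 9 overs and 4 balls) to balls."""
    whole, _, part = overs.partition(".")
    return int(whole) * 6 + int(part or 0)


@dataclass
class BattingRecord:
    inns: int         # Inns: innings played
    not_outs: int     # NO: not outs
    runs: int         # Runs: runs scored
    balls_faced: int  # BF: balls faced

    def average(self) -> Optional[float]:
        """Ave: runs per dismissal; undefined if never dismissed."""
        dismissals = self.inns - self.not_outs
        return self.runs / dismissals if dismissals > 0 else None

    def strike_rate(self) -> Optional[float]:
        """SR: runs scored per 100 balls faced."""
        return 100 * self.runs / self.balls_faced if self.balls_faced else None


@dataclass
class BowlingRecord:
    overs: str    # Overs: overs bowled, in scorecard notation
    runs: int     # Runs: runs given
    wickets: int  # Wkts: wickets taken

    def average(self) -> Optional[float]:
        """Ave: runs given per wicket; undefined with no wickets."""
        return self.runs / self.wickets if self.wickets else None

    def economy(self) -> Optional[float]:
        """Econ: runs given per six-ball over."""
        balls = overs_to_balls(self.overs)
        return 6 * self.runs / balls if balls else None

    def strike_rate(self) -> Optional[float]:
        """SR: balls bowled per wicket; undefined with no wickets."""
        return overs_to_balls(self.overs) / self.wickets if self.wickets else None
```

Keeping overs as a string avoids treating '9.4' as the decimal number 9.4, which would misstate the number of balls bowled.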
Toss, result, and ground details should also be collected, since they help the analysis: a team playing on its home ground, or batting first, may hold an advantage over the opposition.
Methodology
A team is a combination of batsmen and bowlers. Each team squad has 15 selected players, but only 11 are in the playing eleven. So, we need to model 22 players per match to predict the winner. The playing eleven may change because of match tactics, injuries, venue, etc., so we cannot consider one fixed set of 11 players; the set needs to be revised as per the schedule, and the prediction should be made taking into account every individual who is playing. Apart from that, the outcome of the toss and the ground also play a major role in whether a team wins or loses the match. The problem is treated as a supervised learning task, as per the model diagram below.
Feature Construction
The choice of the right features plays a key role in the success of a prediction model. For the problem at hand, which is predicting the winner of the ODI Cricket World Cup, we choose two other important features along with the relative strength of one team against the other. The first is the venue of the match, and the second is the outcome of the toss. The venue of the match is important because of the ‘home team advantage,’ which means that the team playing at its home ground has an advantage over the visiting team.
This advantage is directly attributed to the psychological support that the home team gets from the audience in the ground, to the familiarity of the ground, environment, etc. The second feature is the outcome of the toss, which has been observed and believed to have a major role in deciding the outcome of a match. The toss is directly associated with the nature of the pitch and the environment. For instance, a green pitch supports the pace bowlers, so winning the toss and opting to bowl first could give the team an upper hand over the opponent team. Similarly, in humid conditions it becomes difficult for the bowlers to control the wet ball, so batting first is an optimal decision in that case.
Therefore, every match played between team A and team B in our dataset has three features: toss, venue, and StrengthA/B. StrengthA/B is a numeric feature, venue is a categorical feature encoded as a number, and toss is a binary feature. The value of toss is 1 if team A has won the toss, and 0 otherwise. The value of venue is 1 if the match is played at a home ground of team A, 0 if it is played at a home ground of team B, and 2 otherwise. The value of StrengthA/B is the relative strength of team A against team B which is calculated as
The target variable defines the winner of a match and is a binary variable. The value of winner is 1 if the winner of the match is team A, and 0 if the winner is team B. Notice that either of the two competing teams could be considered as team A, and all the feature values and the target value would update accordingly.
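The encoding described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the article's implementation: the function `encode_match`, its parameter names, and the team labels are hypothetical, and the relative-strength calculation is passed in as a caller-supplied function because its formula is not fixed here. The `ratio_strength` placeholder and its scores are invented purely to make the example runnable.

```python
from typing import Callable, Optional


def encode_match(
    team_a: str,
    team_b: str,
    toss_winner: str,
    home_team: Optional[str],
    winner: str,
    relative_strength: Callable[[str, str], float],
) -> dict:
    """Encode one match from team A's point of view."""
    if home_team == team_a:
        venue = 1  # home ground of team A
    elif home_team == team_b:
        venue = 0  # home ground of team B
    else:
        venue = 2  # neither team is at home
    return {
        "toss": 1 if toss_winner == team_a else 0,
        "venue": venue,
        "strength": relative_strength(team_a, team_b),
        "winner": 1 if winner == team_a else 0,
    }


# Hypothetical team scores, used only to make the example runnable.
scores = {"Team X": 120.0, "Team Y": 100.0}


def ratio_strength(a: str, b: str) -> float:
    # Placeholder assumption: a simple ratio, NOT the article's StrengthA/B formula.
    return scores[a] / scores[b]


# The same match seen with Team X as team A, then with the roles swapped.
row = encode_match("Team X", "Team Y", "Team Y", "Team Y", "Team X", ratio_strength)
mirrored = encode_match("Team Y", "Team X", "Team Y", "Team Y", "Team X", ratio_strength)
```

Calling the function again with the two teams swapped yields the mirrored row, which reflects the point that either team may be treated as team A.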
Modelling
Several machine learning models can be implemented using R libraries. Naive Bayes and SVM are available in the e1071 package, decision trees in the rpart package, random forests in the randomForest package, logistic regression through glm, XGBoost in the xgboost package, and k-NN in the class package. Model performance can then be measured with the help of a confusion matrix.
The above methodology could also be applied to other sports.
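To make the evaluation step concrete, here is a minimal sketch of a confusion matrix for the binary winner label (1 = team A wins, 0 = team B wins). It is written in plain Python for illustration and is not the R workflow named above. The function names and the label lists are hypothetical, and the predictions are made-up values rather than output from any fitted model.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary winner label."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # team A won, predicted team A
        "fn": counts[(1, 0)],  # team A won, predicted team B
        "fp": counts[(0, 1)],  # team B won, predicted team A
        "tn": counts[(0, 0)],  # team B won, predicted team B
    }


def accuracy(cm):
    """Share of matches whose winner was predicted correctly."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# Hypothetical labels for six matches, for illustration only.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
```

Because either team can be labelled team A, the two kinds of error are symmetric in this setting, so overall accuracy is a reasonable first summary alongside the four counts.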