Introduction

What is Call of Duty/Warzone?

Call of Duty is a very popular video game series published by Activision. Its free-to-play title Warzone has recently surged in popularity alongside the rise of the battle royale genre, and with that rise has come a large, highly competitive community. These players began to notice that the game's lack of a visible ranking system did not match their matchmaking experience, leading many to wonder whether a hidden skill-based matchmaking system is present in the game.

Why would someone care about their matchmaking?

In general, gamers care about their gaming experience. It's obviously not fun to consistently lose, but it's also not fun to consistently win. Finding the balance is very important to staying interested in a game over the long term, so from both a consumer and a developer standpoint, matchmaking is integral to keeping a video game relevant. That said, not being able to see your performance in relation to matchmaking removes a significant part of the experience. Other very popular video games such as League of Legends, Apex Legends, Valorant, and even FIFA all show players their ranking and progression through a ranked tier system. No such system exists in Warzone, which leads players to question the skill levels of their opponents and of themselves.

In addition, Activision may have a financial incentive to modify matchmaking, especially at the content creator level: content creators hold great influence over potential customers, and giving them a good experience might attract more players.

What exactly is skill based matchmaking?

Skill-based matchmaking is a system that matches players in a game based on some ranking. This ranking can be whatever metric the developers choose, but it is often implemented as an Elo rating or a custom MMR (matchmaking rating) tailored to the game's qualities.
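For intuition, here is a minimal sketch of a textbook Elo update. This is purely illustrative; we have no knowledge of what rating formula, if any, Warzone uses internally.

```python
# Illustrative Elo-style rating update (not Warzone's actual system).

def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating: float, expected: float, actual: float, k: float = 32) -> float:
    """Move a rating toward the observed result; K controls the step size."""
    return rating + k * (actual - expected)

# Example: a 1200-rated player beats a 1400-rated player (actual score = 1).
exp = elo_expected(1200, 1400)
print(round(elo_update(1200, exp, 1.0)))  # rating rises to ~1224
```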

What is the goal of our project?

We will be exploring two main questions for this project.

The first question is "Is there skill-based matchmaking in Call of Duty: Warzone?". This question is a very common one amongst the COD user base, and as members of that user base ourselves, we wanted to find an answer.

The second question is arguably juicier because it puts Activision in the hot seat. We will be trying to answer "Does Activision purposefully lower the matchmaking difficulty of content creators?".

Because the game has a very passionate community, the answers to these questions can inform the feedback fans give to developers. As for the content creators, players who are also avid viewers of COD on Twitch or YouTube might rethink their opinions of whoever they watch.

Data Collection

Due to the specific nature of this project, we had to find creative ways to collect data regarding Call of Duty matchmaking information. Luckily, there is a Call of Duty API that enables developers to look at past match data and stats for specific players. However, Call of Duty's official API requires accounts to set their visibility to public for their profiles to be viewable. Fortunately, some third-party APIs aggregate data across games and paint a clearer picture of players' statistics. One of these is WZStats.gg (Warzone Stats), which shows detailed per-match data. By using their website and its API, we are able to get data on many players to help inform our research questions.

Because setting your profile to public visibility is a manual process, it is likely that more skilled players are the ones with visible profiles. This potentially limits our visibility into the player skill spectrum. If there is skill-based matchmaking, the initial accounts we analyze, and their respective game histories, will be biased towards higher skill tiers, because these players care more about their stats than lower-skilled players and are more likely to set their profiles to public. However, if there is no skill-based matchmaking, then game lobbies will be entirely random as far as skill is concerned (there could be other factors such as network latency and geographic location). With random lobbies, we should hypothetically be able to tap into the entire spectrum of players if we analyze enough games.

As noted, there is no existing database, so we needed to write code to create one. We first assembled a list of profiles with public visibility, including some of our own accounts as well as those of pro players and content creators. As mentioned previously, looking at content creators' accounts could provide insight into our second question: whether the matchmaking skill level of content creators' lobbies is lower.

The data collection process ended up being quite complicated for us. In fact, we spent 6 hours on this and had to try it about seven times. Oops. So what went wrong? Our initial collection process was built on a graph-theoretic view of the problem. Specifically, we wanted to perform a breadth-first traversal over accounts to sample the player base as effectively as possible. The plan was to start with the 10 seed accounts described earlier, treat each player as a new node (eliminating those already visited), and continue to analyze each person's previous 20 matches. This analysis, or rather data gathering, included capturing all (up to) 150 players per lobby and the lifetime KDs for each player in the lobby. Where was our logic faulty? Breadth-first search only works under the assumption that there is no skill-based matchmaking. If there is no skill-based matchmaking, then the breadth-first search lets us branch away from the current lobby to various different skill levels quite quickly, with few degrees of separation, if any at all. However, if there is skill-based matchmaking, then we would be stuck in the same region of the skill distribution and unable to reach the rest of the player base unless our initial seed accounts were perfectly distributed across the spectrum of players, which they are not.

So, we moved to method two. Method two pivots away from the breadth-first search towards a restricted graph traversal with a split factor of 2. More specifically, to achieve better sampling, starting at one player we randomly sample 2 players from their most recent match lobby. If we can successfully sample 2 accounts with public data settings, we add them to our queue and then repeat the process we performed on the initial player. We aim to do this repeatedly to achieve 11 degrees of separation from the original account. We arbitrarily selected a professional content creator's account, NICKMERCS, as the initial account and let the program run overnight, sampling 2^11 accounts. The following diagram shows how we designed our system to collect data.

[Diagram: the split-factor-2 sampling traversal, starting from the seed account]
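In code, the core of this traversal looks roughly like the sketch below. The lobby and privacy lookups here are synthetic stand-ins of our own naming (not real wzstats.gg calls), so only the control flow is meaningful.

```python
from collections import deque
import random

MAX_DEPTH = 11     # degrees of separation from the seed account
SPLIT_FACTOR = 2   # players sampled per lobby

# Synthetic stand-ins for the wzstats.gg lookups described below, so the
# traversal logic can be run and inspected on its own.
def get_lobby_players(player: str) -> list[str]:
    return [f"{player}-{i}" for i in range(150)]  # fake 150-player lobby

def is_public(player: str) -> bool:
    return random.random() < 0.5  # assume roughly half of profiles are public

def traverse(seed: str) -> set[str]:
    visited = {seed}
    queue = deque([(seed, 0)])  # (player, degrees of separation)
    while queue:
        player, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue
        candidates = [p for p in get_lobby_players(player)
                      if p not in visited and is_public(p)]
        # Randomly sample up to 2 public accounts from the lobby and recurse.
        for nxt in random.sample(candidates, min(SPLIT_FACTOR, len(candidates))):
            visited.add(nxt)
            queue.append((nxt, depth + 1))
    return visited

print(len(traverse("NICKMERCS")))  # ~2^11 accounts at depth 11 on this synthetic data
```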

We wrote a script, "wzstats.py", to scrape data from wzstats.gg, the data-aggregating site mentioned previously in this writeup.
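A condensed sketch of wzstats.py follows; the endpoint URL and JSON field names here are illustrative placeholders, not wzstats.gg's documented API.

```python
# wzstats.py (abridged sketch). The base URL and JSON field names below are
# illustrative placeholders, not wzstats.gg's documented API.
import requests

BASE_URL = "https://wzstats.gg/api"  # placeholder base URL

def get_player_matches(platform: str, username: str, limit: int = 20) -> list[dict]:
    """Fetch a player's most recent matches (up to `limit`)."""
    resp = requests.get(f"{BASE_URL}/player/{platform}/{username}/matches",
                        params={"limit": limit}, timeout=30)
    resp.raise_for_status()
    return resp.json()["matches"]

def get_lobby(match_id: str) -> list[dict]:
    """Fetch every player (up to 150) in a match lobby with lifetime stats."""
    resp = requests.get(f"{BASE_URL}/match/{match_id}", timeout=30)
    resp.raise_for_status()
    # Each entry is assumed to carry username, platform, and lifetime_kd;
    # private accounts come back with those fields missing (NaN downstream).
    return resp.json()["players"]

def getTopPlayers() -> list[dict]:
    """Fetch the site's top-rated players (function name kept from our script)."""
    resp = requests.get(f"{BASE_URL}/leaderboard", timeout=30)
    resp.raise_for_status()
    return resp.json()["players"]
```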

The following is a snippet of the dataset that we built.
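That snippet comes from a quick pandas inspection along these lines; the CSV name and the match_id column are our own naming, while username, platform, and lifetime_kd are the fields discussed throughout this writeup.

```python
import pandas as pd

# Columns assumed: match_id, username, platform, lifetime_kd.
# Private accounts show NaN for username/platform.
df = pd.read_csv("warzone_lobbies.csv")
print(df.head())
print(df.shape)
```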

For our second question, we want to look specifically at the games of pro and content creator players. The following code iterates through the top accounts and gets their games. Luckily, we wrote a function above called getTopPlayers() that interacts with the Warzone Stats API to get the top-rated players in the Warzone system. Let's get their data and write it to a CSV for future use.
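A sketch of that loop, assuming getTopPlayers() returns records with username and platform fields, and reusing the hypothetical get_player_matches() helper from the wzstats.py sketch above:

```python
import pandas as pd
from wzstats import getTopPlayers, get_player_matches  # our scraper module

rows = []
for player in getTopPlayers():  # assumed to yield {"username": ..., "platform": ...}
    for match in get_player_matches(player["platform"], player["username"]):
        match["top_player"] = player["username"]  # tag each game with its source account
        rows.append(match)

pd.DataFrame(rows).to_csv("top_player_games.csv", index=False)
```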

Missing Data

From looking at the head of our dataset, we can immediately see that there are lots of NaN values. This might be alarming at first, but there is actually a very good reason for it. As we mentioned previously, not all accounts have public data available. Specifically, a player must go into their account settings and toggle this for every console linked to their Activision/Call of Duty account. This means that people who care about their stats will probably go through the trouble of toggling this setting so that websites like Warzone Stats can display their aggregated history.

This raises the question of what type of missing data this is. Our initial hunch is that the data is Missing at Random: specifically, that the missing username and platform fields are related to a player's lifetime_kd. This reasoning is plausible because players who care about their statistics probably also play the game a lot, and thus might have a higher lifetime_kd. Let's check this theory with some code!
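A minimal version of that check, assuming (as in our dataset) that private accounts are missing username and platform but still carry a lifetime_kd scraped from the lobby data:

```python
# Split on the username field: private accounts have it blank, but their
# lifetime_kd is still recorded from the lobby data.
private_kd = df.loc[df["username"].isna(), "lifetime_kd"].dropna()
public_kd = df.loc[df["username"].notna(), "lifetime_kd"].dropna()

print("public:  mean", public_kd.mean(), "std", public_kd.std())
print("private: mean", private_kd.mean(), "std", private_kd.std())
```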

We currently have two distinct datasets: entries that are private and have missing data, and entries that are public and do not. Since we created this database, we know that the missing data results from whether or not someone has toggled their data privacy settings to public within the Call of Duty account settings. However, we believe there is a deeper correlation: a relationship between a player's skill and whether or not they have public data. From a practical standpoint, it makes sense that players who play the game a lot care about their stats being publicly available. Playing a lot and being passionate about the game does not mean someone is good at it, but there is an argument to be made that skill scales with time played, at least to an extent (some people might forever be bad at the game, unfortunately).

So how do we decide whether two datasets are different? We will use a T-test, since we will effectively be comparing the mean KDs of players who are public with those of players who are private.

What are the assumptions made by a T-test?

  1. Data sets are independent.
  2. Data sets are (approximately) normally distributed.
  3. Data sets have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

Regarding our data sets, we know that they are independent because a player cannot be both public and private. We will also assume that the KDs are approximately normally distributed. This assumption is definitely more of a stretch, as the KDs are somewhat skewed right, but there is no distinct heavy tail and there is a heavy concentration of KDs within 3 standard deviations of the mean, which allows us to be more confident in this assumption. We will write a little code to test this assumption below. The third assumption refers to the amount of variance in each data set, which we can also check below with some simple code.
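A sketch of those checks, using skewness as a rough normality proxy and Levene's test for the variance assumption (Levene's test is our choice here; any similar homogeneity test would do):

```python
from scipy import stats

# Assumption 2: approximate normality (skew near 0 is encouraging).
for name, kd in [("public", public_kd), ("private", private_kd)]:
    print(name, "skew:", stats.skew(kd), "variance:", kd.var())

# Assumption 3: homogeneity of variance. A large p-value here is
# consistent with the two groups having similar variances.
print(stats.levene(public_kd, private_kd))
```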

We can see that the variances here are nearly proportional to the means of each data set and are quite similar to each other. This supports the third assumption, especially given the very similar standard deviations. Next we can check whether the Empirical Rule (that 95% of data lies within 2 standard deviations of the mean) holds.
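A simple way to check this, pooling all lifetime KDs in the dataset:

```python
# Empirical Rule check: fraction of lifetime KDs within k standard deviations.
kd = df["lifetime_kd"].dropna()
mu, sigma = kd.mean(), kd.std()
for k in (1, 2, 3):
    within = ((kd - mu).abs() <= k * sigma).mean()
    print(f"within {k} sd: {within:.1%}")
```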

From the outputted print statements, we can see that the Empirical Rule holds; in fact, if you test the other components of the rule, over 70% of our data lies within 1 standard deviation of the mean and just over 99% lies within 3 standard deviations.

With these three assumptions addressed, let's establish some basic statistics before moving on to the T-test.

For additional resources regarding T-tests and other concepts for this section, see the following links:
https://www.investopedia.com/terms/e/empirical-rule.asp
https://www.investopedia.com/terms/t/t-test.asp
https://www.statisticshowto.com/probability-and-statistics/t-test/
https://www.scribbr.com/statistics/t-test/
https://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm

Here comes the T-test! The final step is to decide which type of T-test we want: paired, two-sample, or one-sample. We will go with a two-sample test since our sets are independent. Since we only care to show that the two sets differ, specifically that their means are different enough, we will use a two-tailed T-test. See the Scribbr link above for clearer explanations of when to use each type of test.
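The call itself is a one-liner with scipy; we use Welch's variant (equal_var=False), which is what yields the fractional degrees of freedom reported below.

```python
from scipy import stats

# Two-sample, two-tailed T-test on lifetime KD: public vs. private accounts.
t_stat, p_value = stats.ttest_ind(public_kd, private_kd, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```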

From analyzing the average KDs of users with missing data and users without, we can see that the averages are very different. But just how significant is the difference? We performed a two-sided T-test on our two sets of data and found a T statistic of -135.158 with 53232.47 degrees of freedom for the difference between the means of these two subsets. This corresponds to a p-value indistinguishable from zero, which is extremely strong evidence of a difference between the average KDs of players with public and private data settings. In conclusion, the missing data is Missing at Random (MAR).

Exploratory Data Analysis (EDA) and Data Visualization

A basic histogram doesn't actually show us much. We can see that most KDs fall between 0 and 5, but there are definitely some significantly higher outliers. In practice, this could mean that in the 2048 games we analyzed, we encountered players who are either insanely good, better than any professional ever, or players who are hacking. Maintaining an incredibly high lifetime KD is very difficult because of the randomness of games; even some of the best players still have KDs around the 6-10 mark.

When we look at players whose KDs are over 6, we still see a high concentration in the 6-10 KD range. However, we also see a second cluster of players at the 20+ KD mark. Because of the very low frequency of these players, we can reasonably treat them as outliers. A single such player can shift a lobby's average KD by up to 35/150 ≈ 0.23, which is a large amount of skew, but because we only see 15 of them across 2048 games, the overall effect is negligible.

We can see that there are 15 users with a lifetime KD over 10. All 15 are actually private accounts. This is quite interesting because we previously showed that players with higher KDs tend to set their data settings to public. These players seem to be the best 15 by far, and yet their privacy settings are still set to private. While it is entirely possible that some extremely good players do not care enough to change the setting, it is unlikely that ALL 15 follow the same logic.

Moreover, in practice, having a KD that high as a LIFETIME KD, not a SINGLE GAME KD, is extremely unlikely. It would require players to drop 10+, 20+, or even 30+ kills per game consistently while limiting their deaths to 1 or 2. Note that for KD calculations, to avoid divide-by-zero errors, COD counts 0 deaths as 1 death (i.e. 35 kills and 0 deaths = 35 kills and 1 death). Because even professional players and content creators are unable to achieve this level of success, we can reasonably assume that these 15 players are one of two things: either brand new accounts with maybe 1 or 2 insanely good games, in which case their lifetime KD and single game KDs might be very similar, OR hackers who rack up lots of kills over many games using aimbot and other cheats.

Also note that we said brand new account and not brand new player. Players can have numerous accounts, and it is possible that a professional player, content creator, or anyone else created a new account and had a very good first game (or several), but the odds of this are low for another reason. In Call of Duty: Warzone, one way to improve your chances of winning is to level up your guns and unlock new attachments and other perks (https://www.dexerto.com/call-of-duty/best-warzone-loadouts-class-setup-1342383/). This can only be done by playing the game for an extensive amount of time, usually requiring tens if not hundreds of games to complete all the achievements necessary to level up your equipment and profile. A very good player on a brand new account still faces this hurdle and is severely disadvantaged when entering a game for the first time. Thus, from a practical standpoint, it is more likely that these users are hackers or bots and not legitimate players.

Now that we have talked about the outliers, let's look back towards the more realistic end of the player spectrum.

We are curious about the frequency of certain KDs: are there KD values that are very frequent and others that are very infrequent? We can see from the histogram that KDs generally cluster around 1, but we wonder if we can get a clearer picture by looking a bit more closely.

We can see that there are 2 major outliers here where roughly 250 people share the exact same KD, which is very unlikely to happen at this scale. Let's take a look at what those KDs are.
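A quick frequency count along these lines surfaces them:

```python
# Count how many players share each exact lifetime KD value.
kd_counts = df["lifetime_kd"].dropna().value_counts()
print(kd_counts.head(10))          # most common exact KD values
print(kd_counts[kd_counts > 100])  # KD values shared by more than 100 players
```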

We ended up looking for all KD values shared by over 100 people. Very surprisingly, the most common KDs by far fall on {0, 1/3, 1/2, 2/3, 1}. In practice, this is probably a sign of players who have played their first game or only a handful of games, as these KDs are very common over a small number of games. Another possibility is some rounding on the API side for players with limited data. Interestingly, if you expand the range to query KDs shared by over 75 people, the values that appear are also still very round numbers/decimals.

Let's also do a bit of analysis specifically on the lobbies of games involving high-skill players.

Hypothesis Testing and Evaluation of Null Model

If we assume that COD Warzone does not matchmake lobbies on the basis of skill, or KD, we should expect that lobbies are a random sampling of individuals from the distribution of KD. We will call this the null model. We will analyze how well actual observations correspond to this theoretical model to determine its fitness.

To read more about convolutions and the statistical theory behind them, see the following links:
https://www.statlect.com/glossary/convolutions
https://www.youtube.com/watch?v=P3ZcJEy84ps

Given our null model, if the ~2000 actual observations of lobby average lifetime KDs fall under this model, they should be uniformly distributed over the percentiles (the CDF at each observed value) given by the model. To test this, we calculate the percentile of each observation by computing the theoretical distribution of the lobby average for each lobby size and then evaluating that CDF at the actual observed lobby average KD.

Again, if the null model is correct, and lobbies are a simple random sample of the population of players, we should see uniformly distributed percentiles according to this model over the sample of lobby average KDs. The idea behind this method is that if F(x) is the CDF of a continuous random variable X, then F(X) is a uniformly distributed random variable over [0,1]. This means that for a sample X1, ..., Xn, the values F(X1), ..., F(Xn) should be a sample from the uniform distribution over [0,1].
https://math.stackexchange.com/questions/868400/showing-that-y-has-a-uniform-distribution-if-y-fx-where-f-is-the-cdf-of-contin
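A sketch of the percentile computation follows, with one simplification: where our analysis computes each lobby-average distribution exactly via convolutions, this sketch substitutes the CLT normal approximation N(mu, sigma^2/n). It also assumes lobbies are identified by a match_id column, and the Kolmogorov-Smirnov test at the end is one way to quantify departure from uniformity.

```python
import numpy as np
from scipy import stats

# Population distribution of lifetime KD, pooling all sampled players.
kd = df["lifetime_kd"].dropna().to_numpy()
mu, sigma = kd.mean(), kd.std()

# Size and average lifetime KD of each observed lobby.
lobby = df.dropna(subset=["lifetime_kd"]).groupby("match_id")["lifetime_kd"]
sizes, means = lobby.size().to_numpy(), lobby.mean().to_numpy()

# Under the null model, the average KD of a lobby of size n is approximately
# N(mu, sigma^2 / n); each lobby's percentile is the CDF at its observed mean.
percentiles = stats.norm.cdf(means, loc=mu, scale=sigma / np.sqrt(sizes))

# If lobbies were random samples, these percentiles would be Uniform(0, 1).
print(stats.kstest(percentiles, "uniform"))
```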

These percentiles are not uniformly distributed, so it is extremely unlikely that lobbies are generated in a way that randomly samples the population of players, i.e. one that ignores skill.

So, from the above statistical analysis, we can see that, regarding question 1, there is definitely some factor influencing the creation of game lobbies. This does not prove that there is skill-based matchmaking; rather, it shows that since lobbies are not random samples of the player base, something else is going on, which could well be skill-based matchmaking.

Now, let's do some EDA regarding question two.

Let's refresh our memories as to how the Top Players' DataFrame and the General Population DataFrame are formatted.

Similar to the procedure we performed for the missing data section, we again want to test whether we can confidently determine that the Top Players' data differs from the General Population's data. Specifically, we want to see whether Top Players' games are different from the General Population's games, and whether this can be determined with confidence (i.e., the probability that the difference is due to chance is very low).

So we will resort to another T-test. Let's go over the assumptions again.

  1. Data sets are independent.
  2. Data sets are (approximately) normally distributed.
  3. Data sets have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

By the construction of our sample, these data sets are independent.

We can see at face value that the means and standard deviations are very different when we compare top players' games to the general population's games. Let's now conduct a T-test again to check whether we can confidently say that the two data sets differ.
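A sketch of that comparison, reusing the Welch T-test from the missing-data section and assuming the top players' games were saved with the same schema as the general dataset:

```python
import pandas as pd
from scipy import stats

top_games = pd.read_csv("top_player_games.csv")  # written earlier; assumed same schema as df

# Per-lobby average lifetime KD for each group.
top_lobby_kd = (top_games.dropna(subset=["lifetime_kd"])
                         .groupby("match_id")["lifetime_kd"].mean())
gen_lobby_kd = (df.dropna(subset=["lifetime_kd"])
                  .groupby("match_id")["lifetime_kd"].mean())

# Same two-tailed Welch T-test as before.
print(stats.ttest_ind(top_lobby_kd, gen_lobby_kd, equal_var=False))
```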

Based on these T-test results, we can say with strong likelihood that the top players' games and the general population's games are significantly different. We will address this more directly in the conclusion.

Conclusion

So, now that we have done all of this analysis and exploration, what are the results?

We started the project inspired by two main questions:
1) Is there skill-based matchmaking in Call of Duty: Warzone lobbies?
2) Do Top Players and Content Creators face easier lobbies than the general population, to help Activision boost sales?

This created the two null hypotheses:
1) Lobbies are random samples of the general population
2) Top Players' and Content Creators' lobbies are no different in difficulty (as measured by average lobby lifetime KD) from the general population's

To address these questions, we first examined our datasets for missing data. We learned about the difference between public and private data settings on a player's account and how this creates missing data in our dataset. We were curious about the relationship of KD to the missing data, so after looking at some basic statistics for the two populations of players, we performed a T-test to see if we could confidently say that the two sets were significantly different.

Once we confirmed the type of missing data, we moved on to Exploratory Data Analysis. This mostly involved creating histograms for various frames of our dataset. We noticed a really long right tail in the data and decided to look at it more closely, since most of the data was around 1 yet we had KDs over 20. We discussed what this meant in practice and identified data that most likely comes from hackers. While we could have discarded this data, hackers are unfortunately a part of the game and can make real lobbies much harder. They are also not impossible to defeat, as very good players can outplay hackers frequently. So, we decided to leave them in the dataset, since they are still valid data points, albeit ones ruining the game, which is not central to our analysis.

We then looked at the presumably non-hacker data with KDs <= 6. We created a scatter plot of the number of people with each unique KD and found that the most common KDs were suspiciously round numbers. We explained that this might be caused by newer players with fewer games, and then took a brief graphical look at the Top Players' games before moving on to hypothesis testing.

To examine the null hypothesis that lobbies are not matchmade using skill-based criteria, we first constructed a null model for lobby construction that ignores skill: one that randomly samples from the entire population of KDs. We calculated the percentile of each observed lobby within this model and found that these percentiles were not uniformly distributed over 0%-100%, so it is likely that lobbies are made using criteria that consider skill. In essence, the model poorly predicted the observed lobbies, which indicates some influential factor in matchmaking that may or may not directly involve skill. Thus, we rejected the null hypothesis, since we showed that lobbies are not actually random samples of the player base.

After concluding that hypothesis test and model argument for question 1, we moved to question 2 to see what insights we could develop. Similar to the missing-data T-test procedure, we performed another T-test, this time looking to show that the mean lobby lifetime KDs of Top Players' games were in fact different from those of general players' games. Our T-test confirmed this. Comparing the averages of the Top Players' games and the general players' games, we saw that the Top Players actually face significantly more difficult opponents, since their lobbies had a much higher average lifetime KD. Thus, we rejected the null hypothesis, since we can now confidently determine that Top Players actually face harder lobbies than the general population. Interestingly, this also further supports our conclusion to question 1, since we have shown that being higher skilled yields higher skilled lobbies. While this does not let us conclude that the matchmaking factor is actually skill, it definitely supports that argument and could serve as a basis for further testing.

Important Note
Throughout these conclusions, you will see that we never directly claim that skill-based matchmaking exists; rather, we argue that some form of matchmaking bias does exist. The reason is that, while KD is the main metric we used for analysis, we cannot actually prove that KD alone explains the distribution and differences between lobbies. For example, the time of day when a game occurs could heavily influence the spectrum of players online. A lobby created during the late-night/early-morning hours is likely to contain better players staying up late to grind the game, whereas a lobby created during the early evening hours is likely to be a better sample of the player base, since more casual players are likely to be online. Alternatively, weeknight games might have significantly fewer casual players, who might only play on weekends. Unfortunately, we do not have datetime data for the lobbies we sampled, so analyzing this aspect was beyond the scope of our analysis. However, it remains grounds for further study, because we have certainly shown that some factor heavily influences the creation of game lobbies, despite Activision Blizzard claiming that there is no Skill Based Matchmaking.

Thank you!

Written by Alex Coppens, Luke Stuart, and Sandeep Ramesh