Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.

Since the NCAA Men's Basketball Tournament moved to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an event must therefore be 0. But we know better than that.

In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases. I was able to download tournament data from the 1930s through 2012 from hoopstournament.net/Database.html. The tournament expanded to 64 teams in 1985, so those are the years I used for this post. I only used match-ups in which one seed was higher than the other, because this was the easiest way to remove duplicates. (The database lists each game twice, once with the winner as the first team and once with the loser as the first team. The vast majority of games, 98.9%, had one team as a higher seed, because equal seeds can only meet at the Final Four or later.)

library(ggplot2); theme_set(theme_bw())
brackets=read.csv("NCAAHistory.csv")
 
# use only data from 1985 on in which the first team has the higher seed
brackets=subset(brackets,Seed<Opponent.Seed & Year>=1985 & Round!="Opening Round")
 
# seed difference and, from the higher (better) seed's perspective, whether it won and its margin of victory
brackets$SeedDiff=abs(brackets$Opponent.Seed-brackets$Seed)
brackets$HigherSeedWon=ifelse(brackets$Opponent.Seed>brackets$Seed,brackets$Wins,brackets$Losses)
brackets$HigherSeedScoreDiff=ifelse(brackets$Opponent.Seed>brackets$Seed,1,-1)*(brackets$Score-brackets$Opponent.Score)

Use Frequencies

The first way is the simplest: look at the historical record when a 16 seed plays a 1 seed (a seed difference of 15). As you can see from the plot below, when the seed difference is 15, the higher seeded team has won every time. The same is true when the seed difference is 12, although there have only been 4 games in that scenario. Another oddity is that when the seed difference is 10, the higher seed has won only 50% of the time. Again, this is largely because there have only been 6 games with that seed difference.

# empirical proportion of games won by the higher seed at each seed difference
seed.diffs=sort(unique(brackets$SeedDiff))
win.pct=sapply(seed.diffs,function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,win.pct),aes(seed.diffs,win.pct))+geom_point()+
  geom_hline(yintercept=0.5,linetype=2)+
  geom_line()+labs(x="Seed Difference",y="Proportion of Games Won by Higher Seed")


Use Score Difference

In many applications, margin of victory has been shown to be much more reliable than wins and losses alone. For example, in the computer rankings of college football teams, using score differences is more accurate, but it is outlawed for fear that teams would run up the score on weaker opponents. So those rankings are not as strong as they could be.

We have no such conflict of interest, so we should try to make use of any information available. A simple way to do that is to look at the mean and standard deviation of the margin of victory when the 16 seed plays the 1 seed. Below is a plot of the mean score difference at each seed difference, with a ribbon covering +/- 2 standard deviations.

# mean and standard deviation of the higher seed's margin of victory at each seed difference
seed.diffs=sort(unique(brackets$SeedDiff))
means=sapply(seed.diffs,function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
sds=sapply(seed.diffs,function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,means))+
  geom_ribbon(aes(ymin=means-2*sds,ymax=means+2*sds),alpha=.5)+geom_point()+geom_line()+
  geom_hline(yintercept=0,linetype=2)+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed")


You can see that the ribbon includes zero for all seed differences except 15. If we assume that the score differences are roughly normal, we can calculate the probability that the score difference will be greater than 0. The results are largely the same as before, but we see now that there are no 100% estimates. Also, the 50% win percentage for a seed difference of 10 now looks a little more reasonable, albeit still out of line with the rest.

ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,1-pnorm(0,means,sds)))+
  geom_point()+geom_line()+geom_hline(yintercept=0.5,linetype=2)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning Based on Margin of Victory")

Model Win Percentage as a Function of Seed Difference

It is always good to incorporate as much knowledge as possible into an analysis. In this case, we have information from games other than the 16 versus 1 match-up that can help us estimate it. For example, it is reasonable to assume that the larger the difference in seeds, the more likely the higher seed is to win. We can build a logistic regression model that uses the outcomes of all of the games and predicts the probability of winning from the seed difference. When the two teams have the same seed, I force the probability of the higher seed winning to be 0.5 by setting the intercept to 0.
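For concreteness, here is that zero-intercept fit on its own (the same model is fit again in the summary code at the end of the post), along with its predicted probabilities at a few seed differences:

# no-intercept logistic regression: a seed difference of 0 maps to a win probability of 0.5
logit.fit=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
# predicted probability that the higher seed wins at seed differences of 0, 14, and 15
predict(logit.fit,data.frame(SeedDiff=c(0,14,15)),type="response")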

In the plot below, you can see that the logistic model predicts that the probability of winning increases steadily with seed difference, reaching about 90% for the 16 versus 1 game. I also included a non-linear generalized additive model (GAM) for comparison. The GAM believes that being a big favorite (16 vs 1 or 15 vs 2) gives a little extra boost in win probability. An advantage of modeling is that we can make predictions for match-ups that have never occurred (like a seed difference of 14).

ggplot(data=brackets,aes(SeedDiff,HigherSeedWon))+
  # newer versions of ggplot2 take the family through method.args rather than a bare family= argument
  stat_smooth(method="gam",method.args=list(family=binomial()),se=F,formula=y~0+x,aes(colour="Logistic"),size=1)+
  stat_smooth(method="gam",method.args=list(family=binomial()),se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.15,position = position_jitter(height = .025,width=.25))+
  labs(x="Seed Difference",y="Game Won by Higher Seed",colour="Model")

Model Score Difference as a Function of Seed Difference

We can also do the same thing with margin of victory. Here, I constrain the linear model to have an intercept of 0, meaning that two teams with the same seed should be evenly matched. Again, I included the GAM fit for comparison. The interpretation is similar to before: the heavily favored teams seem to get an extra bump in margin of victory.

ggplot(data=brackets,aes(SeedDiff,HigherSeedScoreDiff))+
  stat_smooth(method="lm",se=F,formula=y~0+x,aes(colour="Linear"),size=1)+
  stat_smooth(method="gam",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.25,position = position_jitter(height = 0,width=.25))+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed",colour="Model")

From these models of margin of victory we can infer the probability of the higher seed winning (again, assuming normality).

library(gam)
lm.seed=lm(HigherSeedScoreDiff~0+SeedDiff,data=brackets)
gam.seed=gam(HigherSeedScoreDiff~s(SeedDiff),data=brackets)
 
pred.lm.seed=predict(lm.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
pred.gam.seed=predict(gam.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
# residual standard deviation of each fit, combined below with the prediction standard errors
se.lm=sqrt(mean(lm.seed$residuals^2))
se.gam=sqrt(mean(gam.seed$residuals^2))
 
df1=data.frame(SeedDiff=0:15,ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)))
ggplot(df1)+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(SeedDiff,ProbLM,colour="Linear"),size=1)+
  geom_line(aes(SeedDiff,ProbGAM,colour="GAM"),size=1)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")


Summary

Putting all of the estimates together, you can easily spot the differences between the models. The two estimates that only used the data from each specific seed difference (the raw frequencies and the score-difference approach) look pretty similar; of the two, the one based on score differential seems a little more reasonable. The two GAMs have a similar trend, as do the linear and logistic models. If someone asks you the probability that a 16 seed beats a 1 seed, you have at least 6 different answers.

This post highlights the many different ways someone can analyze the same data. Simply Statistics talked a bit about this in a recent podcast. In this case, the differences are not huge, but there are noticeable changes. So the next time you read about an analysis that someone did, keep in mind all the decisions they had to make and how sensitive the results might be to those decisions.

logit.seed=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
logit.seed.gam=gam(HigherSeedWon~s(SeedDiff),data=brackets,family=binomial(logit))
 
df2=data.frame(SeedDiff=0:15,
               ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)),
               ProbLogit=predict(logit.seed,data.frame(SeedDiff=0:15),type="response"),
               ProbLogitGAM=predict(logit.seed.gam,data.frame(SeedDiff=0:15),type="response"))
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbFreq=win.pct),all.x=T)
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbScore=1-pnorm(0,means,sds)),all.x=T)
ggplot(df2,aes(SeedDiff))+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(y=ProbLM,colour="Linear"),size=1)+
  geom_line(aes(y=ProbGAM,colour="GAM"),size=1)+
  geom_line(aes(y=ProbLogit,colour="Logistic"),size=1)+
  geom_line(aes(y=ProbLogitGAM,colour="Logistic GAM"),size=1)+
  geom_line(aes(y=ProbFreq,colour="Frequencies"),size=1)+
  geom_line(aes(y=ProbScore,colour="Score Diff"),size=1)+
  geom_point(aes(y=ProbFreq,colour="Frequencies"),size=3)+
  geom_point(aes(y=ProbScore,colour="Score Diff"),size=3)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
 
ggplot(df2)+geom_hline(yintercept=0.5,linetype=2)+
  geom_point(aes(x=SeedDiff,y=ProbFreq,colour="Frequencies"),size=1)
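To see all six answers side by side, you can pull out the row of df2 for a seed difference of 15 (assuming the code above has been run):

# compare the empirical and model-based estimates for the 16 vs. 1 match-up
subset(df2,SeedDiff==15)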


Note that the GAM functions do not have an easy way to restrict the win probability to be exactly 0.5 when the seed difference is 0. That is why you may notice the GAM models are a bit above 0.5 at 0.
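As a quick check, you can predict the logistic GAM's win probability at a seed difference of 0 using logit.seed.gam from the summary code above; as noted, it comes out a bit above 0.5:

# the smooth-term GAM is not constrained to pass through 0.5 at SeedDiff = 0
predict(logit.seed.gam,data.frame(SeedDiff=0),type="response")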