Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.

Since the NCAA Men's Basketball Tournament has moved to 64 teams, a 16 seed as never upset a 1 seed. You might be tempted to say that the probability of such an event must be 0 then. But we know better than that.

In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases. I was able to download tournament data going back to the 1930s until 2012 from hoopstournament.net/Database.html. The tournament expanded to 64 teams in 1985, which is what I used for this post. I only used match ups in which one of the seeds was higher than the other because this was the easiest way to remove duplicates. (The database has each game listed twice, once with the winner as the first team and once with the loser as the first team. The vast majority (98.9%) of games had one team as a higher seed because an equal seed can only happen at the Final Four or later.)

library(ggplot2); theme_set(theme_bw())
brackets=read.csv("NCAAHistory.csv")
 
# use only data from 1985 on in which the first team has the higher seed
brackets=subset(brackets,Seed<Opponent.Seed & Year>=1985 & Round!="Opening Round")
 
brackets$SeedDiff=abs(brackets$Opponent.Seed-brackets$Seed)
brackets$HigherSeedWon=ifelse(brackets$Opponent.Seed>brackets$Seed,brackets$Wins,brackets$Losses)
brackets$HigherSeedScoreDiff=ifelse(brackets$Opponent.Seed>brackets$Seed,1,-1)*(brackets$Score-brackets$Opponent.Score)
Created by Pretty R at inside-R.org

Use Frequencies

The first way is the most simple: look at the historical records when a 16 seed is playing a 1 seed (where the seed difference is 15). As you can see from the plot below, when the seed difference is 15, the higher seeded team has won every time. This is also true when the seed difference is 12, although there have only been 4 games in this scenario. Another oddity is that when the seed difference is 10, the higher seed only has only won 50% of the time. Again, this is largely due to the fact that there have only been 6 games with this seed difference.

seed.diffs=sort(unique(brackets$SeedDiff))
win.pct=sapply(seed.diffs,function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,win.pct),aes(seed.diffs,win.pct))+geom_point()+
  geom_hline(yintercept=0.5,linetype=2)+
  geom_line()+labs(x="Seed Difference",y="Proportion of Games Won by Higher Seed")
Created by Pretty R at inside-R.org


Use Score Difference

In many applications, it has been shown that using margin of victory is much more reliable than just wins and losses. For example, in the computer ranking of College Football teams, using score differences is more accurate, but outlawed for fear that teams would run up the score on weaker opponents. So the computer rankings are not as strong as they could be.

We have no such conflict of interest, so we should try to make use of any information available. A simple way to do that is to look at the mean and standard deviation of the margin of victory when the 16 seed is playing the 1 seed. Below is a plot of the mean score difference with a ribbon for the +/- 2 standard deviations.

seed.diffs=sort(unique(brackets$SeedDiff))
means=sapply(seed.diffs,function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
sds=sapply(seed.diffs,function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,means))+
  geom_ribbon(aes(ymin=means-2*sds,ymax=means+2*sds),alpha=.5)+geom_point()+geom_line()+
  geom_hline(yintercept=0,linetype=2)+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed")
Created by Pretty R at inside-R.org


You can see that the ribbon includes zero for all seed differences except 15. If we assume that the score differences are roughly normal, we can calculate the probability that the score difference will be greater than 0. The results are largely the same as before, but we see now that there are no 100% estimates. Also, the 50% win percentage for a seed difference of 10 now looks a little more reasonable, albeit still out of line with the rest.

ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,1-pnorm(0,means,sds)))+
  geom_point()+geom_line()+geom_hline(yintercept=0.5,linetype=2)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning Based on Margin of Victory")
Created by Pretty R at inside-R.org

Model Win Percentage as a Function of  Seed Difference

It is always good to incorporate as much knowledge as possible into an analysis. In this case, we have information on other games besides the 16 versus 1 seed game which help us estimate the 16 versus 1 game. For example, it is reasonable to assume that the larger the difference in seed is, the more likely the higher seed will win. We can build a logistic regression model which looks at all of the outcomes of all of the games and predicts the probability of winning based on the difference in seed. When the two teams have the same seed, I enforced the probability of the higher seed winning to be 0.5 by making the intercept 0.

In the plot below, you can see that the logistic model predicts that the probability of winning increases throughout until reaching about 90% for the 16 versus 1. I also included a non-linear generalized additive model (GAM) model for comparison. The GAM believes that being a big favorite (16 vs 1 or 15 vs 2) gives an little boost in win probability. An advantage of modeling is that we can make predictions for match-ups that have never occurred (like a seed difference of 14).

ggplot(data=brackets,aes(SeedDiff,HigherSeedWon))+
  stat_smooth(method="gam",family="binomial",se=F,formula=y~0+x,aes(colour="Logistic"),size=1)+
  stat_smooth(method="gam",family="binomial",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.15,position = position_jitter(height = .025,width=.25))+
  labs(x="Seed Difference",y="Game Won by Higher Seed",colour="Model")
Created by Pretty R at inside-R.org

Model Score Difference as a Function of  Seed Difference

We can also do the same thing with margin of victory. Here, I constrain the linear model to have an intercept of 0, meaning that two teams with the same seed should be evenly matched. Again, I included the GAM fit for comparison. The interpretations are similar to before, in that it seems that there is an increase in margin of victory for the heavily favored teams.

ggplot(data=brackets,aes(SeedDiff,HigherSeedScoreDiff))+
  stat_smooth(method="lm",se=F,formula=y~0+x,aes(colour="Linear"),size=1)+
  stat_smooth(method="gam",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.25,position = position_jitter(height = 0,width=.25))+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed",colour="Model")
Created by Pretty R at inside-R.org

From these models of margin of victory we can infer the probability of the higher seed winning (again, assuming normality).

library(gam)
lm.seed=lm(HigherSeedScoreDiff~0+SeedDiff,data=brackets)
gam.seed=gam(HigherSeedScoreDiff~s(SeedDiff),data=brackets)
 
pred.lm.seed=predict(lm.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
pred.gam.seed=predict(gam.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
se.lm=sqrt(mean(lm.seed$residuals^2))
se.gam=sqrt(mean(gam.seed$residuals^2))
 
df1=data.frame(SeedDiff=0:15,ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)))
ggplot(df1)+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(SeedDiff,ProbLM,colour="Linear"),size=1)+
  geom_line(aes(SeedDiff,ProbGAM,colour="GAM"),size=1)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
Created by Pretty R at inside-R.org


Summary

Putting all of the estimates together, you can easily spot the differences between the models. The two assumptions that just used the data between specific seeds look pretty similar. It looks like using score differential is a little more reasonable of the two. The two GAMs have a similar trend and so did the  linear and logistic models. If someone asks you what the probability that a 16 seed beats a 1 seed, you have at least 6 different answers.

This post highlights the many different ways someone can analyze the same data. Simply statistics talked a bit about this in a recent podcast. In this case, the differences are not huge, but there are noticeable changes. So the next time you read about an analysis that someone did, keep in mind all the decisions that they had to make and what type a sensitivity they would have on the results.

logit.seed=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
logit.seed.gam=gam(HigherSeedWon~s(SeedDiff),data=brackets,family=binomial(logit))
 
df2=data.frame(SeedDiff=0:15,
               ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)),
               ProbLogit=predict(logit.seed,data.frame(SeedDiff=0:15),type="response"),
               ProbLogitGAM=predict(logit.seed.gam,data.frame(SeedDiff=0:15),type="response"))
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbFreq=win.pct),all.x=T)
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbScore=1-pnorm(0,means,sds)),all.x=T)
ggplot(df2,aes(SeedDiff))+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(y=ProbLM,colour="Linear"),size=1)+
  geom_line(aes(y=ProbGAM,colour="GAM"),size=1)+
  geom_line(aes(y=ProbLogit,colour="Logistic"),size=1)+
  geom_line(aes(y=ProbLogitGAM,colour="Logistic GAM"),size=1)+
  geom_line(aes(y=ProbFreq,colour="Frequencies"),size=1)+
  geom_line(aes(y=ProbScore,colour="Score Diff"),size=1)+
  geom_point(aes(y=ProbFreq,colour="Frequencies"),size=3)+
  geom_point(aes(y=ProbScore,colour="Score Diff"),size=3)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
 
ggplot(df2)+geom_hline(yintercept=0.5,linetype=2)+
  geom_point(aes(x=SeedDiff,y=ProbFreq,colour="Frequencies"),size=1)
Created by Pretty R at inside-R.org


Note that the GAM functions did not have a way to easily restrict the win probability be equal to exactly 0.5 when the seed difference is 0. That is why you may notice the GAM model is a bit above 0.5 at 0.
1

View comments

  1. Update: I have moved my blog to andland.github.io. Check it out for more recent posts. Thanks!

    A lot of times we are given a data set in Excel format and we want to run a quick analysis using R's functionality to look at advanced statistics or make better visualizations. There are packages for importing/exporting data from/to Excel, but I have found them to be hard to work with or only work with old versions of Excel (*.xls, not *.xlsx). So for a one time analysis, I usually save the file as a csv and import it into R.

    This can be a little burdensome if you are trying to do something quick and creates a file that needs to be cleaned up later. An easier option is to copy and paste the data directly into R. This can be done by using "clipboard" as the file and specifying that it is tab delimited, since that is how Excel's clipboard stores the data.

    For example, say you have a table in excel you want to copy into R. First, copy it in Excel.

    Then go into R and use this function.

    read.excel <- function(header=TRUE,...) {
      read.table("clipboard",sep="\t",header=header,...)
    }
     
    dat=read.excel()

    This function specifies that you are reading data from the clipboard, that it is tab delimited, and that it has a header.

    Similarly, you can copy from R to Excel using the same logic. Here I also make row.name=FALSE as default since I rarely have meaningful row names and they mess up the header alignment.

    write.excel <- function(x,row.names=FALSE,col.names=TRUE,...) {
      write.table(x,"clipboard",sep="\t",row.names=row.names,col.names=col.names,...)
    }
     
    write.excel(dat)
    Created by Pretty R at inside-R.org

    These functions can be added to you .RProfile so that they are always ready for a quick analysis!

    Obviously, this technique does not encourage reproducible research. It is meant to be used for quick, ad hoc analysis and plotting; not something you would use for an analysis that needs to be done on a regular basis.

    18

    View comments

  2. In case you haven't noticed, the blog has been less active lately. I have moved the blog to andland.github.io and have a couple of new posts up there.

    Thanks,
    Andrew
    0

    Add a comment

  3. In a previous post, I showed you how to scrape playlist data from Columbus, OH alternative rock station CD102.5. Since it's the end of the year and best-of lists are all the fad, I thought I would share the most popular songs and artists of the year, according to this data. In addition to this, I am going to make an interactive graph using Shiny, where the user can select an artist and it will graph the most popular songs from that artist.

    First off, I am assuming that you have scraped the appropriate data using the code from the previous post.

    library(lubridate)
    library(sqldf)
    
    playlist=read.csv("CD101Playlist.csv",stringsAsFactors=FALSE)
    dates=mdy(substring(playlist[,3],nchar(playlist[,3])-9,nchar(playlist[,3])))
    times=hm(substring(playlist[,3],1,nchar(playlist[,3])-10))
    playlist$Month=ymd(paste(year(dates),month(dates),"1",sep="-"))
    playlist$Day=dates
    playlist$Time=times
    playlist=playlist[order(playlist$Day,playlist$Time),]

    Next, I will select just the data from 2013 and find the songs that were played most often.
    playlist=subset(playlist,Day>=mdy("1/1/13"))
    playlist$ArtistSong=paste(playlist$Artist,playlist$Song,sep="-")
    top.songs=sqldf("Select ArtistSong, Count(ArtistSong) as Num
          From playlist
          Group By ArtistSong
          Order by Num DESC
          Limit 10")
    

    The top 10 songs are the following:
                                  Artist-Song Number Plays
    1  FITZ AND THE TANTRUMS-OUT OF MY LEAGUE 809
    2                      ALT J-BREEZEBLOCKS 764
    3              COLD WAR KIDS-MIRACLE MILE 759
    4                      ATLAS GENIUS-IF SO 750
    5                         FOALS-MY NUMBER 687
    6                         MS MR-HURRICANE 679
    7       THE NEIGHBOURHOOD-SWEATER WEATHER 657
    8           CAPITAL CITIES-SAFE AND SOUND 646
    9             VAMPIRE WEEKEND-DIANE YOUNG 639
    10             THE FEATURES-THIS DISORDER 632
    

    I will make a plot similar to the plots made in the last post to show when the top 5 songs were played throughout the year.
        
    plays.per.day=sqldf("Select Day, Count(Artist) as Num
          From playlist
          Group By Day
          Order by Day")
    
    playlist.top.songs=subset(playlist,ArtistSong %in% top.songs$ArtistSong[1:5])
    
    song.per.day=sqldf(paste0("Select Day, ArtistSong, Count(ArtistSong) as Num
                              From [playlist.top.songs]
                              Group By Day, ArtistSong
                              Order by Day, ArtistSong"))
    dspd=dcast(song.per.day,Day~ArtistSong,sum,value.var="Num")
    
    song.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
    song.per.day[is.na(song.per.day)]=0
    
    song.per.day=melt(song.per.day,1,variable.name="ArtistSong",value.name="Num")
    song.per.day$Alpha=ifelse(song.per.day$Num>0,1,0)
    
    library(ggplot2)
    ggplot(song.per.day,aes(Day,Num,colour=ArtistSong))+geom_point(aes(alpha=Alpha))+
      geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
      labs(x="Date",y="Plays Per Day",title="Top Songs",colour=NULL)+
      scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
    
    Alt-J was more popular in the beginning of the year and the Foals have been more popular recently.

    I can similarly summarize by artist as well.
    top.artists=sqldf("Select Artist, Count(Artist) as Num
                    From playlist
                    Group By Artist
                    Order by Num DESC
                    Limit 10")
    

                        Artist  Num
    1                     MUSE 1683
    2          VAMPIRE WEEKEND 1504
    3        SILVERSUN PICKUPS 1442
    4                    FOALS 1439
    5                  PHOENIX 1434
    6            COLD WAR KIDS 1425
    7                JAKE BUGG 1316
    8  QUEENS OF THE STONE AGE 1296
    9                    ALT J 1233
    10     OF MONSTERS AND MEN 1150
    

    playlist.top.artists=subset(playlist,Artist %in% top.artists$Artist[1:5])
    
    artists.per.day=sqldf(paste0("Select Day, Artist, Count(Artist) as Num
                              From [playlist.top.artists]
                              Group By Day, Artist
                              Order by Day, Artist"))
    dspd=dcast(artists.per.day,Day~Artist,sum,value.var="Num")
    
    artists.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
    artists.per.day[is.na(artists.per.day)]=0
    
    artists.per.day=melt(artists.per.day,1,variable.name="Artist",value.name="Num")
    artists.per.day$Alpha=ifelse(artists.per.day$Num>0,1,0)
    
    ggplot(artists.per.day,aes(Day,Num,colour=Artist))+geom_point(aes(alpha=Alpha))+
      geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
      labs(x="Date",y="Plays Per Day",title="Top Artists",colour=NULL)+
      scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
    
    The pattern for the artists are not as clear as it is for the songs.

    Finally, I wrote a Shiny interactive app. They are surprisingly easy to create and if you are thinking about experimenting with it, I suggest you try it. I will leave the code for the app in a gist. In the app, you can enter any artist you want, and it will show you the most popular songs on CD102.5 for that artist. You can also select the number of songs that it plots with the slider.

    For example, even though Muse did not have one of the most popular songs of the year, they were still the band that was played the most. By typing in "MUSE" in the Artist text input, you will get the following output.

    They had two songs that were very popular this year and a few others that were decently popular as well.

    Play around with it and let me know what you think.
    0

    Add a comment

  4. 0

    Add a comment

  5. CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining radio stations that are independently owned and they take great pride in it. For data nerds like me, they also put a real time list of recently played songs on their website. The page has the most recent 50 songs played, but you can also click on “Older Tracks” to go back in time. When you do this, the URL ends “now-playing/?start=50”. If you got back again, it says “now-playing/?start=100”.

    Using this structure, I decided to see if I could download all of their historical data and see how far it goes back. In the code below, I use the XML package to go to the website and download the 50 songs and then increment the number by 50 to find the previous 50 songs. I am telling the code to keep doing this until I get to January 1, 2012.
    library(ggplot2)
    theme_set(theme_bw())
    library(XML)
    library(lubridate)
    library(sqldf)
    startNum = 0
    while (TRUE) {
        theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=", 
            startNum)
        table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
        if (startNum == 0) {
            playlist = table[, -1]
        } else {
            playlist = rbind(playlist, table[, -1])
        }
        dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
        print(dt)
        if (dt < mdy("1/1/12")) {
            break
        }
        startNum = startNum + 50
    }
    
    playlist = unique(playlist)  # Remove Dupes
    
    write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
    
    This takes a while and is fairly large. My file has over 150,000 songs. If you want just a little data, change the date to last week or so. The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that data only goes back to March of 2012.
    dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 
        3])))
    times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
    playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
    playlist$Day = dates
    playlist$Time = times
    playlist = playlist[order(playlist$Day, playlist$Time), ]
    head(playlist)
    
    ##                     Artist                Song       Last.Played
    ## 151638 DEATH CAB FOR CUTIE   YOU ARE A TOURIST 12:34am03/01/2012
    ## 151637       SLEEPER AGENT          GET BURNED 12:38am03/01/2012
    ## 151636          WASHED OUT           AMOR FATI 12:41am03/01/2012
    ## 151635            COLDPLAY       CHARLIE BROWN 12:45am03/01/2012
    ## 151634           GROUPLOVE         TONGUE TIED 12:49am03/01/2012
    ## 151633               SUGAR YOUR FAVORITE THING 12:52am03/01/2012
    ##             Month        Day   Time
    ## 151638 2012-03-01 2012-03-01 34M 0S
    ## 151637 2012-03-01 2012-03-01 38M 0S
    ## 151636 2012-03-01 2012-03-01 41M 0S
    ## 151635 2012-03-01 2012-03-01 45M 0S
    ## 151634 2012-03-01 2012-03-01 49M 0S
    ## 151633 2012-03-01 2012-03-01 52M 0S
    
    Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.
    sqldf("Select Artist, Count(Artist) as PlayCount
           From playlist
           Group By Artist
           Order by PlayCount DESC
           Limit 10")
    
    ##                   Artist PlayCount
    ## 1      SILVERSUN PICKUPS      2340
    ## 2         THE BLACK KEYS      2203
    ## 3                   MUSE      1988
    ## 4              THE SHINS      1885
    ## 5    OF MONSTERS AND MEN      1753
    ## 6            PASSION PIT      1552
    ## 7              GROUPLOVE      1544
    ## 8  RED HOT CHILI PEPPERS      1514
    ## 9                 METRIC      1495
    ## 10          ATLAS GENIUS      1494
    
    
    sqldf("Select Artist, Song, Count(Song) as PlayCount
          From playlist
          Group By Artist, Song
          Order by PlayCount DESC
          Limit 10")
    
    ##                 Artist                    Song PlayCount
    ## 1          PASSION PIT             TAKE A WALK       828
    ## 2    SILVERSUN PICKUPS                PIT, THE       825
    ## 3         ATLAS GENIUS                 TROJANS       819
    ## 4        WALK THE MOON                ANNA SUN       742
    ## 5       THE BLACK KEYS LITTLE BLACK SUBMARINES       736
    ## 6          DIVINE FITS  WOULD THAT NOT BE NICE       731
    ## 7        THE LUMINEERS                  HO HEY       722
    ## 8       CAPITAL CITIES          SAFE AND SOUND       712
    ## 9  OF MONSTERS AND MEN          MOUNTAIN SOUND       711
    ## 10               ALT J            BREEZEBLOCKS       691
    
    I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently and The Shins and Grouplove have lost some steam.
    artist.month=sqldf("Select Month, Artist, Count(Song) as Num
          From playlist
          Group By Month, Artist
          Order by Month, Artist")
    artist=sqldf("Select Artist, Count(Artist) as Num
          From playlist
          Group By Artist
          Order by Num DESC")
    p=ggplot(subset(artist.month,Artist %in% head(artist$Artist,8)),aes(Month,Num))
    p+geom_bar(stat="identity",aes(fill=Artist),position='fill',colour="grey")+
     labs(y="Percentage of Plays")

    For the play count of the top artists, I see some odd numbers in June and July of 2012. The number of plays went way down.
    p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")
    


    Looking into this further, I plotted the date and time that the song was played by the cumulative number of songs played since the beginning of the list. The plot should be a line with a constant slope, meaning that the plays per day are relatively constant. You can see in June and July of 2012 there are flat spots where there is no playlist history.
    qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")
    

    There are also smaller flat spots in September and December, but I am going to decide that those are small enough not to affect any further analyses. From this, I am going to use data only from August 2012 to present.
    playlist = subset(playlist, Day >= mdy("8/1/12"))
    
    Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
    0

    Add a comment

  6. Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.

    Since the NCAA Men's Basketball Tournament has moved to 64 teams, a 16 seed as never upset a 1 seed. You might be tempted to say that the probability of such an event must be 0 then. But we know better than that.

    In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases. I was able to download tournament data going back to the 1930s until 2012 from hoopstournament.net/Database.html. The tournament expanded to 64 teams in 1985, which is what I used for this post. I only used match ups in which one of the seeds was higher than the other because this was the easiest way to remove duplicates. (The database has each game listed twice, once with the winner as the first team and once with the loser as the first team. The vast majority (98.9%) of games had one team as a higher seed because an equal seed can only happen at the Final Four or later.)

    library(ggplot2); theme_set(theme_bw())
    brackets=read.csv("NCAAHistory.csv")
     
    # use only data from 1985 on in which the first team has the higher seed
    brackets=subset(brackets,Seed<Opponent.Seed & Year>=1985 & Round!="Opening Round")
     
    brackets$SeedDiff=abs(brackets$Opponent.Seed-brackets$Seed)
    brackets$HigherSeedWon=ifelse(brackets$Opponent.Seed>brackets$Seed,brackets$Wins,brackets$Losses)
    brackets$HigherSeedScoreDiff=ifelse(brackets$Opponent.Seed>brackets$Seed,1,-1)*(brackets$Score-brackets$Opponent.Score)
    Created by Pretty R at inside-R.org

    Use Frequencies

    The first way is the most simple: look at the historical records when a 16 seed is playing a 1 seed (where the seed difference is 15). As you can see from the plot below, when the seed difference is 15, the higher seeded team has won every time. This is also true when the seed difference is 12, although there have only been 4 games in this scenario. Another oddity is that when the seed difference is 10, the higher seed only has only won 50% of the time. Again, this is largely due to the fact that there have only been 6 games with this seed difference.

    seed.diffs=sort(unique(brackets$SeedDiff))
    win.pct=sapply(seed.diffs,function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff==x]))
    ggplot(data=data.frame(seed.diffs,win.pct),aes(seed.diffs,win.pct))+geom_point()+
      geom_hline(yintercept=0.5,linetype=2)+
      geom_line()+labs(x="Seed Difference",y="Proportion of Games Won by Higher Seed")
    Created by Pretty R at inside-R.org


    Use Score Difference

    In many applications, it has been shown that using margin of victory is much more reliable than just wins and losses. For example, in the computer ranking of College Football teams, using score differences is more accurate, but outlawed for fear that teams would run up the score on weaker opponents. So the computer rankings are not as strong as they could be.

    We have no such conflict of interest, so we should try to make use of any information available. A simple way to do that is to look at the mean and standard deviation of the margin of victory when the 16 seed is playing the 1 seed. Below is a plot of the mean score difference with a ribbon for the +/- 2 standard deviations.

    seed.diffs=sort(unique(brackets$SeedDiff))
    means=sapply(seed.diffs,function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
    sds=sapply(seed.diffs,function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
    ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,means))+
      geom_ribbon(aes(ymin=means-2*sds,ymax=means+2*sds),alpha=.5)+geom_point()+geom_line()+
      geom_hline(yintercept=0,linetype=2)+
      labs(x="Seed Difference",y="Margin of Victory by Higher Seed")
    Created by Pretty R at inside-R.org


    You can see that the ribbon includes zero for all seed differences except 15. If we assume that the score differences are roughly normal, we can calculate the probability that the score difference will be greater than 0. The results are largely the same as before, but we see now that there are no 100% estimates. Also, the 50% win percentage for a seed difference of 10 now looks a little more reasonable, albeit still out of line with the rest.

    ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,1-pnorm(0,means,sds)))+
      geom_point()+geom_line()+geom_hline(yintercept=0.5,linetype=2)+
      labs(x="Seed Difference",y="Probability of Higher Seed Winning Based on Margin of Victory")
    Created by Pretty R at inside-R.org

    Model Win Percentage as a Function of  Seed Difference

    It is always good to incorporate as much knowledge as possible into an analysis. In this case, we have information on other games besides the 16 versus 1 seed game which help us estimate the 16 versus 1 game. For example, it is reasonable to assume that the larger the difference in seed is, the more likely the higher seed will win. We can build a logistic regression model which looks at all of the outcomes of all of the games and predicts the probability of winning based on the difference in seed. When the two teams have the same seed, I enforced the probability of the higher seed winning to be 0.5 by making the intercept 0.

    In the plot below, you can see that the logistic model predicts that the probability of winning increases throughout until reaching about 90% for the 16 versus 1. I also included a non-linear generalized additive model (GAM) model for comparison. The GAM believes that being a big favorite (16 vs 1 or 15 vs 2) gives an little boost in win probability. An advantage of modeling is that we can make predictions for match-ups that have never occurred (like a seed difference of 14).

    ggplot(data=brackets,aes(SeedDiff,HigherSeedWon))+
      stat_smooth(method="gam",family="binomial",se=F,formula=y~0+x,aes(colour="Logistic"),size=1)+
      stat_smooth(method="gam",family="binomial",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
      geom_jitter(alpha=.15,position = position_jitter(height = .025,width=.25))+
      labs(x="Seed Difference",y="Game Won by Higher Seed",colour="Model")
    Created by Pretty R at inside-R.org

    Model Score Difference as a Function of  Seed Difference

    We can also do the same thing with margin of victory. Here, I constrain the linear model to have an intercept of 0, meaning that two teams with the same seed should be evenly matched. Again, I included the GAM fit for comparison. The interpretations are similar to before, in that it seems that there is an increase in margin of victory for the heavily favored teams.

    ggplot(data=brackets,aes(SeedDiff,HigherSeedScoreDiff))+
      stat_smooth(method="lm",se=F,formula=y~0+x,aes(colour="Linear"),size=1)+
      stat_smooth(method="gam",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
      geom_jitter(alpha=.25,position = position_jitter(height = 0,width=.25))+
      labs(x="Seed Difference",y="Margin of Victory by Higher Seed",colour="Model")
    Created by Pretty R at inside-R.org

    From these models of margin of victory we can infer the probability of the higher seed winning (again, assuming normality).

    library(gam)
    lm.seed=lm(HigherSeedScoreDiff~0+SeedDiff,data=brackets)
    gam.seed=gam(HigherSeedScoreDiff~s(SeedDiff),data=brackets)
     
    pred.lm.seed=predict(lm.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
    pred.gam.seed=predict(gam.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
    se.lm=sqrt(mean(lm.seed$residuals^2))
    se.gam=sqrt(mean(gam.seed$residuals^2))
     
    df1=data.frame(SeedDiff=0:15,ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
                   ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)))
    ggplot(df1)+geom_hline(yintercept=0.5,linetype=2)+
      geom_line(aes(SeedDiff,ProbLM,colour="Linear"),size=1)+
      geom_line(aes(SeedDiff,ProbGAM,colour="GAM"),size=1)+
      labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
    Created by Pretty R at inside-R.org


    Summary

    Putting all of the estimates together, you can easily spot the differences between the models. The two assumptions that just used the data between specific seeds look pretty similar. It looks like using score differential is a little more reasonable of the two. The two GAMs have a similar trend and so did the  linear and logistic models. If someone asks you what the probability that a 16 seed beats a 1 seed, you have at least 6 different answers.

    This post highlights the many different ways someone can analyze the same data. Simply statistics talked a bit about this in a recent podcast. In this case, the differences are not huge, but there are noticeable changes. So the next time you read about an analysis that someone did, keep in mind all the decisions that they had to make and what type a sensitivity they would have on the results.

    logit.seed=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
    logit.seed.gam=gam(HigherSeedWon~s(SeedDiff),data=brackets,family=binomial(logit))
     
    df2=data.frame(SeedDiff=0:15,
                   ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
                   ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)),
                   ProbLogit=predict(logit.seed,data.frame(SeedDiff=0:15),type="response"),
                   ProbLogitGAM=predict(logit.seed.gam,data.frame(SeedDiff=0:15),type="response"))
    df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbFreq=win.pct),all.x=T)
    df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbScore=1-pnorm(0,means,sds)),all.x=T)
    ggplot(df2,aes(SeedDiff))+geom_hline(yintercept=0.5,linetype=2)+
      geom_line(aes(y=ProbLM,colour="Linear"),size=1)+
      geom_line(aes(y=ProbGAM,colour="GAM"),size=1)+
      geom_line(aes(y=ProbLogit,colour="Logistic"),size=1)+
      geom_line(aes(y=ProbLogitGAM,colour="Logistic GAM"),size=1)+
      geom_line(aes(y=ProbFreq,colour="Frequencies"),size=1)+
      geom_line(aes(y=ProbScore,colour="Score Diff"),size=1)+
      geom_point(aes(y=ProbFreq,colour="Frequencies"),size=3)+
      geom_point(aes(y=ProbScore,colour="Score Diff"),size=3)+
      labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
     
    ggplot(df2)+geom_hline(yintercept=0.5,linetype=2)+
      geom_point(aes(x=SeedDiff,y=ProbFreq,colour="Frequencies"),size=1)
    Created by Pretty R at inside-R.org


    Note that the GAM functions did not have a way to easily restrict the win probability be equal to exactly 0.5 when the seed difference is 0. That is why you may notice the GAM model is a bit above 0.5 at 0.
    1

    View comments

    1. Throw this whole study out. FDU beat Purdue!

      ReplyDelete
  7. As a grad student, I do lots of searches for research related to my own. When I am off campus, a lot of the relevant results are not open access. In that case, I have to log onto my school's library website and search for the journal or article there. It is quite a hassle. Luckily, I recently noticed that the website is predictably modified after I log into the library. I go to Ohio State University, and before and after logging in the websites will be http://www.jstor.org/... and http://www.jstor.org.proxy.lib.ohio-state.edu/..., for example. It doesn't matter which journal or website I am looking at. If OSU has access, it will show up with .proxy.lib.ohio-state.edu after the .com/.org/etc.

    So I was able to save myself a few steps by typing that in instead of going through the library's website. I should also be sure to note that typing this in brings you to an OSU log in screen. Only those with valid access can use this shortcut, but I assume other schools have the same setup.

    This is still more busy work than I would like, so I searched for ways to automatically do this. Lifehacker has some address bar shortcuts, but nothing that quite works. I came across a  Firefox add-on called URL Swap that seems like it could do the trick. However, when setting it up, I happened to find this post on Stack Overflow. The second answer has exactly what I am looking for, changing .ExtraPart to .proxy.lib.ohio-state.edu.

    I created a bookmark on my bookmarks toolbar with the javascript:
    javascript:(function(){%20var%20curloc%20=%20window.location.toString();%20var%20newloc%20=%20curloc.substring(0,%20curloc.indexOf("/",%208))%20+%20".proxy.lib.ohio-state.edu"%20+%20curloc.substring(curloc.indexOf("/",%208));%20location.href=newloc;%20})()

    and it works perfectly. This trick is so useful I wanted to share this trick to save others some time.
    0

    Add a comment

  8. The famous probabilist and statistician Persi Diaconis wrote an article not too long ago about the "Markov chain Monte Carlo (MCMC) Revolution." The paper describes how we are able to solve a diverse set of problems with MCMC. The first example he gives is a text decryption problem solved with a simple Metropolis Hastings sampler.

    I was always stumped by those cryptograms in the newspaper and thought it would be pretty cool if I could crack them with statistics. So I decided to try it out on my own. The example Diaconis gives is fleshed out in more details by its original authors in its own article.

    The decryption I will be attempting is called substitution cipher, where each letter of the alphabet corresponds to another letter (possibly the same one). This is the rule that you must follow for the cryptogram game too. There are 26! = 4e26 possible mappings of the letters of the alphabet to one another. This seems like a hopeless problem to solve, but I will show you that a relatively simple solution can be found as long as your reference text is large enough and the text you are trying to decipher is also large enough.

    The strategy is to use a reference text to create transition probabilities from each letter to the next. This can be stored in a 26 by 26 matrix, where the ith row and jth column is the probability of the jth letter, given the ith letter preceded it. For example, the given the previous letter was a Q, U almost always follows it, so the Q-row and U-column would be close to probability of 1. Assuming these one step transition probabilities are what matter, the likelihood of any mapping is the product of the transition probabilities observed.

    To create a transition matrix, I downloaded War and Peace from Project Gutenberg. I looped through each letter and kept track of the number of number of times each letter followed the previous. I also kept track of a 27th character, which was anything that was not a letter -- for example periods, commas, spaces, etc. This lets me know which letters are more likely to start a word or end a word. Consecutive entries of these special characters are not considered.

    After I have these counts, I normalize by dividing by the row total. Before normalizing, I add 1 to each cell so that there are no probabilities of 0. This also corresponds to prior information of each transition being equally likely. I could have tried to give more informative prior information (like Q to U being likely), but it would have been difficult and probably inaccurate for the whole matrix.

    Below is the code for creating the transition probability matrix. Note that I loop through each line and within each line I loop through each character.

    reference=readLines("warandpeace.txt")
    reference=toupper(reference)
     
    trans.mat=matrix(0,27,27)
    rownames(trans.mat)=colnames(trans.mat)=c(toupper(letters),"")
    lastletter=""
    for (ln in 1:length(reference)) {
      if (ln %% 1000 ==0) {cat("Line",ln,"\n")}
      for (pos in 1:nchar(reference[ln])) {
        curletter=substring(reference[ln],pos,pos)
        if (curletter %in% toupper(letters)) {
          trans.mat[rownames(trans.mat)==lastletter,
                    colnames(trans.mat)==curletter]=
            trans.mat[rownames(trans.mat)==lastletter,
                      colnames(trans.mat)==curletter]+1
          lastletter=curletter
        } else {
          if (lastletter!="") {
            trans.mat[rownames(trans.mat)==lastletter,27]=
              trans.mat[rownames(trans.mat)==lastletter,27]+1
            lastletter=""
          }
        }
      }
      curletter=""
      if (lastletter!="") {
        trans.mat[rownames(trans.mat)==lastletter,27]=
          trans.mat[rownames(trans.mat)==lastletter,27]+1
      }
      lastletter=""
    }
     
    trans.prob.mat=sweep(trans.mat+1,1,rowSums(trans.mat+1),FUN="/")
    Created by Pretty R at inside-R.org


    I like to visualize my data to make sure everything looks correct. Below is a plot of the transition matrix.  The blank space corresponds to non-letter character. A lot of insights can be garnered from this plot. The Q-U relationship pops out. T is the most likely letter to start a word. E is a popular letter to follow many others. Y is likely to end a word.

    ggplot(melt(trans.prob.mat),aes(Var2,Var1))+geom_tile(aes(fill=value))+
      scale_fill_gradient(low="white",high="black",limits=c(0,1))+
      labs(x="Probability of Second Letter",y="Conditioning on First Letter",fill="Prob")+
      scale_y_discrete(limits = rev(levels(melt(trans.prob.mat)$Var1)))+
      coord_equal()
    Created by Pretty R at inside-R.org


    The desired solution will be the one that gives the highest likelihood. This is an NP-hard problem so the only way to find the solution is to try all 26! combinations of mappings, which is infeasible, and report the one with the highest likelihood.

    With the MCMC approach you start with a random mapping of letters. Next you propose a new mapping by randomly switching 2 of the characters in the mapping. So if A mapped to G and L mapped to Z and you switched those two, A would map to Z and L would map to G. With this new mapping, you measure the likelihood and divide it by the likelihood of the previous mapping. If this ratio is greater than 1, then move to this new mapping. If the ratio is less than 1, meaning the new mapping is less likely, then move to this new mapping with a probability equal to the ratio. (Practically, this is done by drawing a random uniform number between 0 and 1 and moving to the proposed mapping if the uniform number is less than the ratio.) Repeat this process until you think you have found a solution.

    To do this, I first wrote a few functions. One to decode the coded text based on the mapping. The other was to calculate the log likelihood of the decoded text.

    decode <- function(mapping,coded) {
      coded=toupper(coded)
      decoded=coded
      for (i in 1:nchar(coded)) {
        if (substring(coded,i,i) %in% toupper(letters)) {
          substring(decoded,i,i)=toupper(letters[mapping==substring(coded,i,i)])
        }
      }
      decoded
    }
     
     
    log.prob <- function(mapping,decoded) {
      logprob=0
     
      lastletter=""
      for (i in 1:nchar(decoded)) {
        curletter=substring(decoded,i,i)
        if (curletter %in% toupper(letters)) {
          logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,
                                             colnames(trans.mat)==curletter])
          lastletter=curletter
        } else {
          if (lastletter!="") {
            logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
            lastletter=""
          }
        }
      }
     
      if (lastletter!="") {
        logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
        lastletter=""
      }
      logprob
    }
    Created by Pretty R at inside-R.org

    To show how this works, I will use the same Shakespeare text that is used in the MCMC Revolution paper. I let it run until it tries out 2000 different mappings (not the same as 2000 iterations, because rejected proposals are not counted). Along the way I am keeping track of the most likely mapping visited, and that will be the final estimate. It should be noted that I am only considering the mapping of letters and it is assumed that the special characters are known. For example spaces and periods are assumed to be the same.

    correctTxt="ENTER HAMLET HAM TO BE OR NOT TO BE THAT IS THE QUESTION WHETHER TIS NOBLER IN THE MIND TO SUFFER THE SLINGS AND ARROWS OF OUTRAGEOUS FORTUNE OR TO TAKE ARMS AGAINST A SEA OF TROUBLES AND BY OPPOSING END"
    coded=decode(sample(toupper(letters)),correctTxt) # randomly scramble the text
     
    mapping=sample(toupper(letters)) # initialize a random mapping
    i=1
    iters=2000
    cur.decode=decode(mapping,coded)
    cur.loglike=log.prob(mapping,cur.decode)
    max.loglike=cur.loglike
    max.decode=cur.decode
    while (i<=iters) {
      proposal=sample(1:26,2) # select 2 letters to switch
      prop.mapping=mapping
      prop.mapping[proposal[1]]=mapping[proposal[2]]
      prop.mapping[proposal[2]]=mapping[proposal[1]]
     
      prop.decode=decode(prop.mapping,coded)
      prop.loglike=log.prob(prop.mapping,prop.decode)
     
      if (runif(1)<exp(prop.loglike-cur.loglike)) {
        mapping=prop.mapping
        cur.decode=prop.decode
        cur.loglike=prop.loglike
     
        if (cur.loglike>max.loglike) {
          max.loglike=cur.loglike
          max.decode=cur.decode
        }
     
        cat(i,cur.decode,"\n")
        i=i+1
      }
    }
    Created by Pretty R at inside-R.org


    The output of this example is below. You can see that it comes close but it doesn't quite find the correct mapping. I attribute this to the fact that the text I was trying to decode only had 203 characters. In the paper mentioned above, they decoded the whole play. They further say that their methods work just as well if you randomly sample a text snippet 2000 characters long. Obviously my example had well less than this. This is also a problem in trying to solve a cryptogram because those are even smaller than the Hamlet example I used. So unfortunately this simple method cannot be used for that. (Note that I ran the MCMC chain a several times, using different random starting points, until it came reasonably close to the solution. This is something that the authors also suggested doing.)



    I want to also note that I originally used Alice and Wonderland as the reference text. It had more trouble decoding since this book is much smaller than War and Peace, and therefore the transition probabilities were not as good.

    The MCMC method is a greedy approach in that you are moving to any mapping that increases the likelihood. Greedy methods are not optimal in general because they can get stuck at local modes, especially in high dimensional problems. MCMC avoids this be moving to less likely mappings with some probability, which ensures that it will find the correct solution with enough time.
    7

    View comments

  9. Update: I have moved my blog to andland.github.io. Check it out for more recent posts. Thanks!

    Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic and undirected graphical model. They are becoming more popular in machine learning due to recent success in training them with contrastive divergence. They have been proven useful in collaborative filtering, being one of the most successful methods in the Netflix challenge (paper). Furthermore, they have been tantamount to training deep learning models, which appear to be the best current models for image and digit recognition.

    I do not want to go into too much detail, but I would like to give a high level overview of RBMs. Edwin Chen has an introduction that is much better. The usual version of an RBM is a probabilistic model for binary vectors. An image can be represented as a binary vector if each pixel that is dark enough is represented as a 1 and the non-dark pixels are 0's. In addition to the visible binary vector, it is assumed that there is a hidden binary vector, and each element of the hidden unit is connected to each unit of the visible unit by symmetric weights. The weights are represented by the matrix W, where the i,jth component is the weight between hidden unit i and visible unit j. It is important that there are no connections between hidden units or between visible units. The probability that visible unit j is 1 is the inverse logistic function of the sum of the weights connected to visible unit j, in which the hidden units are 1. In math notation (where σ is the inverse logistic, or sigmoid, function):

    Pr(vj=1|h,W)=σ(ihiWi,j).

    The weights are symmetric, so

    Pr(hi=1|v,W)=σ(jvjWi,j).
    As you can see, the model is defined by its conditional probabilities. The task is to find the weight matrix W that best explains the visible data for a given number of hidden units.

    I have been taking Geoff Hinton's coursera course on neural networks, which I would recommend to anyone interested. One of the programming assignments was to fill in code to write an RBM in Matlab/Octave. I have since tried to find a version for R, but have not had any luck, so I decided to translate the code myself. Below is the code to calculate the weight matrix. It is fairly simple and I only use contrastive divergence 1. The input data is p×n instead of the usual transpose.

    # rbm_w is a matrix of size <number of hidden units> by <number of visible units>
    # visible_state is matrix of size <number of visible units> by <number of data cases>
    # hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>
     
    visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
      1/(1+exp(-rbm_w %*% visible_state))
    }
     
    hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
      1/(1+exp(-t(rbm_w) %*% hidden_state))
    }
     
    configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
      out=0
      for (i in 1:dim(visible_state)[2]) {
        out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
      }
      out/dim(visible_state)[2]
    }
     
    configuration_goodness_gradient <- function(visible_state, hidden_state) {
      hidden_state %*% t(visible_state)/dim(visible_state)[2]
    }
     
    sample_bernoulli <- function(mat) {
      dims=dim(mat)
      matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
    }
     
    cd1 <- function(rbm_w, visible_data) {
      visible_data = sample_bernoulli(visible_data)
      H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
      vh0=configuration_goodness_gradient(visible_data, H0)
      V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
      H1=visible_state_to_hidden_probabilities(rbm_w, V1)
      vh1=configuration_goodness_gradient(V1, H1)
      vh0-vh1
    }
     
    rbm <- function(num_hidden, training_data, learning_rate, n_iterations, mini_batch_size=100, momentum=0.9, quiet=FALSE) {
    #   This trains a model that's defined by a single matrix of weights.
    #   <num_hidden> is the number of hidden units
    #   cd1 is a function that takes parameters <model> and <data> and returns the gradient (or approximate gradient in the case of CD-1) of the function that we're maximizing. Note the contrast with the loss function that we saw in PA3, which we were minimizing. The returned gradient is an array of the same shape as the provided <model> parameter.
    #   This uses mini-batches no weight decay and no early stopping.
    #   This returns the matrix of weights of the trained model.
      n=dim(training_data)[2]
      p=dim(training_data)[1]
      if (n %% mini_batch_size != 0) {
        stop("the number of test cases must be divisable by the mini_batch_size")
      }
      model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
      momentum_speed = matrix(0,num_hidden,p)
     
      start_of_next_mini_batch = 1;
      for (iteration_number in 1:n_iterations) {
        if (!quiet) {cat("Iter",iteration_number,"\n")}
        mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
        start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
        gradient = cd1(model, mini_batch)
        momentum_speed = momentum * momentum_speed + gradient
        model = model + momentum_speed * learning_rate
      }
      return(model)
    }
    Created by Pretty R at inside-R.org

    I loaded the hand written digit data that was given in the class. To train the RBM, use the syntax below.

    weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
                mini_batch_size=100, momentum=0.9)
    Created by Pretty R at inside-R.org

    After training the weights, I can visualize them. Below is a plot of strength of the weights going to each pixel. Each facet displays the weights going to/coming from one of the hidden units. I trained 30 units so that it would be easy to show them all on one plot. You can see that most of the hidden units correspond strongly to one digit or another. I think this means it is finding the joint structure of all of the pixels and that pixels representing those numbers are darkened together often.

    library(ggplot2)
    library(reshape2)
    mw=melt(weights); mw$Var3=floor((mw$Var2-1)/16)+1; mw$Var2=(mw$Var2-1)%%16 + 1; mw$Var3=17-mw$Var3;
    ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
      scale_fill_continuous(low='white',high='black')+coord_equal()+
      labs(x=NULL,y=NULL,title="Visualization of Weights")+
      theme(legend.position="none")
    Created by Pretty R at inside-R.org

    16

    View comments

  10. Factor Analysis of Baseball's Hall of Fame Voters
    Recently, Nate Silver wrote a post which analyzed how voters who voted for and against Barry Bonds for Baseball's Hall of Fame differed. Not surprisingly, those who voted for Bonds were more likely to vote for other suspected steroids users (like Roger Clemens). This got me thinking that this would make an interesting case study for factor analysis to see if there are latent factors that drive hall of fame voting.

    The Twitter user @leokitty has kept track of all the known ballots of the voters in a spreadsheet. The spreadsheet is a matrix that has one row for each voter and one column for each player being voted upon. (Players need 75% of the vote to make it to the hall of fame.) I removed all players that had no votes and all voters that had given a partial ballot.

    (This matrix either has a 1 or a 0 in each entry, corresponding to whether a voter voted for the player or not. Note that this kind of matrix is similar to the data that is analyzed in information retrieval. I will be decomposing the (centered) matrix using singular value decomposition to run the factor analysis. This is the same technique used for latent semantic indexing in information retrieval.)

    Starting with the analysis, there is a drop off of the variance after the first 2 factors, which means it might make sense to look only at the first 2 (which is good because I was only planning on doing so).

    votes = read.csv("HOF votes.csv", row.names = 1, header = TRUE)
    pca = princomp(votes)
    screeplot(pca, type = "l", main = "Scree Plot")
    

    Looking at the loadings, it appears that the first principal component corresponds strongly to steroid users, which Bonds and Clemens having large negative values and other suspected steroid users being on the negative end. The players on the positive end have no steroid suspicions.

    dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")
    
    The second component isn't as easy to decipher. The players at the negative end seem to players that are preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters and he isn't that far separated from them.

    There also may be some secondary steroid association in the component as well separating players who have proof of steroid use versus those which have no proof but “look like” they took steroids. For example, there is no hard evidence that Bagwell or Piazza took steroids, but they were very muscular and hit a lot of home runs, so they are believed to have taken steroids. There is some sort of evidence the top five players of this component did take steroids.

    dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")
    

    Projecting the votes onto two dimensions, we can look at how the voters for Bonds and Clemens split up. You can see there is a strong horizontal split between the voters for and against Bonds/Clemens. There are also 3 voters that voted for Bonds, but not Clemens.

    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
        Comp.2, colour = as.factor(Barry.Bonds), shape = as.factor(Roger.Clemens)), 
        size = 4) + coord_equal() + labs(colour = "Bonds", shape = "Clemens")
    

    Similarly, I can look at how the voters split up on the issue of steroids by looking at both Bonds and Bagwell. The voters in the upper left do not care about steroid use, but believe that Bagwell wasn't good enough to make it to the hall of fame. The voters in the lower right do care about steroid use, but believe that Bagwell was innocent of any wrongdoing.

    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
        Comp.2, colour = as.factor(paste(Roger.Clemens, "/", Jeff.Bagwell))), size = 4) + 
        geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2) + coord_equal() + 
        labs(colour = "Bonds / Bagwell")
    

    We can also look at a similar plot with Schilling instead of Bagwell. The separation here appears to be stronger.

    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
        Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) + 
        geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2) + coord_equal() + 
        labs(colour = "Bonds / Schilling")
    

    Finally, we can look at a biplot (using code from here).

    PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
        # PC being a prcomp object
        library(grid)
        data <- data.frame(obsnames = row.names(PC$x), PC$x)
        plot <- ggplot(data, aes_string(x = x, y = y)) + geom_text(alpha = 0.4, 
            size = 3, aes(label = obsnames))
        plot <- plot + geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2)
        datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
        mult <- min((max(data[, y]) - min(data[, y])/(max(datapc[, y]) - min(datapc[, 
            y]))), (max(data[, x]) - min(data[, x])/(max(datapc[, x]) - min(datapc[, 
            x]))))
        datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult * 
            (get(y)))
        plot <- plot + coord_equal() + geom_text(data = datapc, aes(x = v1, y = v2, 
            label = varnames), size = 5, vjust = 1, color = "red")
        plot <- plot + geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1, 
            yend = v2), arrow = arrow(length = unit(0.2, "cm")), alpha = 0.75, color = "red")
        plot
    }
    
    fit <- prcomp(votes, scale = F)
    PCbiplot(fit)
    
    I could have also attempted to rotate the factors to make them more interpretable, but they appeared to have easy interpretation as is. Since we were looking at 2-d plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate factors. I chose to use the principal component method because the data are definitely not normal (being 0's and 1's).
    0

    Add a comment

Total Pageviews
Total Pageviews
279026
Blog Archive
About Me
About Me
Blogroll
Blogroll
Loading
Powered by Blogger.