Recently, Chris Perez, the closer for the Indians, voiced some frustration with the fans for not supporting the team. The Indians currently have the lowest attendance in the majors -- by a decent margin. They are averaging about 15,000 fans per home game, while the next closest team, the Oakland A's, is averaging 19,000. It seemed like an odd time for Perez to bring this up, because attendance was in the 29,000s at each of the last two home games. That prompted me to look into what causes attendance to vary.

I looked at 2011 attendance data for the Cleveland Indians only. I strongly suspected that popular opponents and weekend games would increase attendance. Also, there is usually some press at the beginning of the season claiming that no one wants to go to the games because it is too cold for baseball. (There is also more competing entertainment at the beginning of the season.)

What I found to be significant (based on an exploratory approach) is summarized in the graph below. The plot explores the relationship between attendance and four other variables: date, opponent, weekend, and temperature. I plotted attendance on the y-axis and the date on the x-axis. I don't expect date to have any effect, but it organizes the other aspects well (and you can see that opening day had the highest attendance of the year). Instead of plotting points, I plotted the name of the opponent. You can see some larger attendances when the Indians play the New York Yankees and the Cincinnati Reds, for example. The color of the team name indicates whether the game is on a weekend, and the size indicates the temperature. Probably the biggest effect is the weekend: weekend games consistently outdraw weekday games. The colder temperatures occur only at the beginning of the season, and they seem to have a noticeable effect (at least on the coldest days).


I also looked into how many games above .500 the team was and how close they were in the division race. Neither of these showed any correlation with attendance, at least at the marginal level. This is interesting because the main reason Chris Perez is frustrated is that the team is winning, so the fans should be supporting them. It suggests that winning did not make much of a difference within a single year. I would expect the effect of winning to be more apparent over multiple years.
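As a rough sketch of that kind of marginal check: the data frame below is made up purely for illustration (it is not the 2011 data), and the column names `Result` and `AboveEven` are my own. It computes the team's record going into each game and correlates it with attendance. A real version would need results for all games, not just home games.

```r
# Made-up results and attendance for six games, for illustration only.
games <- data.frame(
  Result = c("W", "W", "L", "W", "L", "W"),
  Attendance = c(14000, 16000, 15000, 29000, 13000, 18000)
)

# +1 for a win, -1 for a loss.
steps <- ifelse(games$Result == "W", 1, -1)

# Games above .500 going into each game (excludes that game's own result).
games$AboveEven <- cumsum(steps) - steps

# Marginal correlation between the record and attendance.
cor(games$AboveEven, games$Attendance)
```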

Some other information that might be useful is the quality of the opponent or whether the ace of the opposing pitching staff is starting. I only included temperature and not precipitation or any other weather information.

Here is the basic R code I used:
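library(ggplot2)
# Opponent names act as the points; colour = weekend flag, size = temperature
ggplot(data = home.attend, aes(x = Date, y = Attendance, colour = Weekend, label = Opp, size = Temp)) +
  geom_text() +
  scale_size(range = c(2, 5)) +  # current ggplot2 uses `range`; older versions used `to`
  theme_bw()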
 
Update (9/21/2013):
Since attendance is still a hot topic, I created the same plot for the 2013 season so far.



Update: I have moved my blog to andland.github.io. Check it out for more recent posts. Thanks!

A lot of times we are given a data set in Excel format and we want to run a quick analysis using R's functionality to look at advanced statistics or make better visualizations. There are packages for importing/exporting data from/to Excel, but I have found them hard to work with, or they only work with old versions of Excel (*.xls, not *.xlsx). So for a one-time analysis, I usually save the file as a csv and import it into R.

This can be a little burdensome if you are trying to do something quick, and it creates a file that needs to be cleaned up later. An easier option is to copy and paste the data directly into R.
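One common way to do this, sketched here as an assumption rather than as the exact method from the full post: cells copied from Excel land on the clipboard as tab-separated text, which `read.delim` can parse. The string `pasted` stands in for the clipboard so the example runs anywhere, and its values are invented.

```r
# Simulated clipboard contents: copied Excel cells are tab-separated text.
pasted <- "Opp\tAttendance\tWeekend\nNYY\t30000\tTRUE\nOAK\t15000\tFALSE"
dat <- read.delim(text = pasted, stringsAsFactors = FALSE)

# On Windows the same function can read the real clipboard:
#   dat <- read.delim("clipboard")
# On macOS, pipe("pbpaste") plays the same role:
#   dat <- read.delim(pipe("pbpaste"))

str(dat)
```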

In case you haven't noticed, the blog has been less active lately. I have moved the blog to andland.github.io and have a couple of new posts up there.

In a previous post, I showed you how to scrape playlist data from Columbus, OH alternative rock station CD102.5. Since it's the end of the year and best-of lists are all the fad, I thought I would share the most popular songs and artists of the year, according to this data. In addition to this, I am going to make an interactive graph using Shiny, where the user can select an artist and it will graph the most popular songs from that artist.


CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining radio stations that are independently owned and they take great pride in it. For data nerds like me, they also put a real time list of recently played songs on their website. The page has the most recent 50 songs played, but you can also click on “Older Tracks” to go back in time. When you do this, the URL ends “now-playing/?start=50”. If you go back again, it says “now-playing/?start=100”.
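Because the offset in the URL is predictable, the list of pages to scrape can be generated up front. This is a minimal sketch: `base_url` is a placeholder (only the tail of the URL is given above), and stopping at 200 is an arbitrary choice.

```r
# Placeholder base; only the "now-playing/?start=..." tail is known.
base_url <- "http://example.com/now-playing/"

# Each page holds 50 songs, so the offsets step by 50.
offsets <- seq(0, 200, by = 50)

# The first page has no start parameter; later pages append ?start=<offset>.
urls <- ifelse(offsets == 0, base_url, paste0(base_url, "?start=", offsets))
urls
```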

Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.

Since the NCAA Men's Basketball Tournament expanded to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an event must be 0 then. But we know better than that.

In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases.
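One simple possibility, sketched here on simulated games rather than real tournament results, is a logistic regression of the outcome on the seed difference. Leaving out the intercept forces the win probability to be 0.5 when the seeds are equal. The slope of 0.2 used to simulate the data is invented, and the variable names are my own.

```r
set.seed(1)

# Simulated games: seed difference, and whether the better seed won.
seed_diff <- sample(1:15, 500, replace = TRUE)
better_seed_won <- rbinom(500, 1, plogis(0.2 * seed_diff))  # 0.2 is invented

# No-intercept logistic regression: P(win) = 0.5 when seed_diff is 0.
fit <- glm(better_seed_won ~ 0 + seed_diff, family = binomial)

# Estimated probability that a 1 seed beats a 16 seed (difference of 15).
p15 <- predict(fit, newdata = data.frame(seed_diff = 15), type = "response")
p15
```

Under a model like this the estimated upset probability, `1 - p15`, is small but never exactly 0, even if no such upset appears in the data.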

As a grad student, I do lots of searches for research related to my own. When I am off campus, a lot of the relevant results are not open access. In that case, I have to log onto my school's library website and search for the journal or article there. It is quite a hassle. Luckily, I recently noticed that the website is predictably modified after I log into the library. I go to Ohio State University, and before and after logging in the websites will be http://www.jstor.org/...

The famous probabilist and statistician Persi Diaconis wrote an article not too long ago about the "Markov chain Monte Carlo (MCMC) Revolution." The paper describes how we are able to solve a diverse set of problems with MCMC. The first example he gives is a text decryption problem solved with a simple Metropolis Hastings sampler.

I was always stumped by those cryptograms in the newspaper and thought it would be pretty cool if I could crack them with statistics.
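To give a flavor of the sampler, here is a minimal sketch of a Metropolis algorithm over substitution keys, where each proposal swaps two letters of the key. The scoring function is a toy stand-in that I made up so the example is self-contained: it rewards letters that match a known key. A real decoder would instead score how plausible the decoded text looks, for instance using letter-pair frequencies from a reference text.

```r
# Metropolis sampler over permutations with a symmetric "swap two entries" proposal.
metropolis_perm <- function(log_score, init, n_iter = 5000) {
  current <- init
  current_score <- log_score(current)
  best <- current
  best_score <- current_score
  for (i in seq_len(n_iter)) {
    # Propose a new key by swapping two randomly chosen positions.
    swap <- sample(length(current), 2)
    proposal <- current
    proposal[swap] <- current[rev(swap)]
    proposal_score <- log_score(proposal)
    # Accept with probability min(1, exp(proposal_score - current_score)).
    if (log(runif(1)) < proposal_score - current_score) {
      current <- proposal
      current_score <- proposal_score
      if (current_score > best_score) {
        best <- current
        best_score <- current_score
      }
    }
  }
  list(best = best, best_score = best_score)
}

set.seed(42)
true_key <- sample(letters)

# Toy log-score: 3 units per correctly placed letter (hypothetical stand-in).
toy_score <- function(key) 3 * sum(key == true_key)

result <- metropolis_perm(toy_score, init = letters)
sum(result$best == true_key)  # number of letters recovered
```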

Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic and undirected graphical model. They are becoming more popular in machine learning due to recent success in training them with contrastive divergence.
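As a rough illustration of contrastive divergence, here is a sketch of a single CD-1 weight update for a binary RBM. It is simplified on purpose: bias terms are left out, and the toy data, learning rate, and function names are my own choices rather than anything from the full post.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# One CD-1 weight update for a binary RBM (biases omitted to keep it short).
# W: visible x hidden weight matrix; v0: batch of binary rows (cases x visible).
cd1_update <- function(W, v0, lr = 0.1) {
  h0_prob <- sigmoid(v0 %*% W)                   # P(h = 1 | data)
  h0 <- (runif(length(h0_prob)) < h0_prob) * 1   # sample hidden states
  v1_prob <- sigmoid(h0 %*% t(W))                # one-step reconstruction
  h1_prob <- sigmoid(v1_prob %*% W)              # P(h = 1 | reconstruction)
  # Positive phase minus negative phase, averaged over the batch.
  grad <- (t(v0) %*% h0_prob - t(v1_prob) %*% h1_prob) / nrow(v0)
  W + lr * grad
}

set.seed(1)
# Toy data: two binary patterns, each repeated ten times.
v <- rbind(c(1, 1, 0, 0), c(0, 0, 1, 1))[rep(1:2, 10), ]
W <- matrix(rnorm(4 * 2, sd = 0.1), nrow = 4)
for (epoch in 1:200) W <- cd1_update(W, v)
round(W, 2)
```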