1. Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods, based on asymptotic normal theory, that help you decide whether to add or remove one variable at a time. The problem is that this is a greedy approach, so you can easily get stuck at a local optimum. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).

    Heuristics are required if the number of variables, p, gets large (>40), because the number of possible subsets is 2^p, excluding interactions. One method that works well is called tabu search (TS). It is a more general optimization method, but for finding the best subset it behaves much like stepwise methods. At each step, it tries to improve the objective function as much as possible by adding or removing one variable. One difference is that TS will not add or remove variables that have been recently added or removed, which keeps the search from getting stuck at a local maximum. There are more details to the algorithm that you can read up on if you would like.
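
    Conceptually, one move of the search looks something like the sketch below. This is only my illustration of the idea, not the tabuSearch implementation: the objective function (such as the evaluate() defined later in this post), the tabu tenure, and the bookkeeping are all simplified.

    # One conceptual tabu-search move for subset selection (illustration only).
    # 'current' is a 0/1 vector of included variables, 'tabu.until' records the
    # iteration until which each variable is forbidden, objFunc() scores a subset.
    tabu.move <- function(current, tabu.until, iter, objFunc, tenure = 5) {
      best.score <- -Inf
      best.flip <- NA
      for (j in seq_along(current)) {
        if (tabu.until[j] > iter) next      # skip recently added/removed variables
        candidate <- current
        candidate[j] <- 1 - candidate[j]    # flip one variable in or out
        score <- objFunc(candidate)
        if (score > best.score) {
          best.score <- score
          best.flip <- j
        }
      }
      if (is.na(best.flip)) return(list(config = current, tabu.until = tabu.until))
      current[best.flip] <- 1 - current[best.flip]
      tabu.until[best.flip] <- iter + tenure  # mark the flipped variable as tabu
      list(config = current, tabu.until = tabu.until, score = best.score)
    }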

    There is a tabuSearch package in R that implements this algorithm. I wanted to find the best subset regression using generalized additive models (GAMs) from the mgcv package. An issue arose because, to use mgcv, you need to specify in the formula which terms are smooth and which are not. The general form of the formula looks like:
    gam(y ~ s(x1) + s(x2) + x3, data=train)
    where x1 and x2 are smooth terms and x3 is treated like it is in a linear model.

    My data set has a combination of continuous variables, which I want to treat as smooth terms, and categorical variables, which I want to treat as factors. For each variable in the subset, I need to identify whether it is a factor or not, and then create a string of the variable name with or without s() around it.

    For example, th is a vector of 0's and 1's that indicates whether each variable (column) is in the regression. I used the housing data set from the UCI ML repository. With 13 explanatory variables, there are 2^13 = 8192 possible main-effects models. I am predicting MEDV, so I start the formula string with "MEDV ~". Then I loop through each element of th. If it is 1, I add the corresponding variable to the formula: if the variable is a factor, I add just its name; if it is continuous, I add its name wrapped in "s()". A " + " is appended between terms. Finally, I convert the string to a formula using as.formula(), and I can plug this formula into the gam function.

    num.cols=sum(th)
    fo.str="MEDV ~"
    cum.cols=0
    for (i in 1:length(th)) {
      if (th[i]>0) {
        if (is.factor(train[,i])) {
          fo.str=paste(fo.str,colnames(train)[i],sep=" ")
        } else {
          fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
        }
        cum.cols=cum.cols+1
        if (cum.cols<num.cols) {
          fo.str=paste(fo.str,"+")
        }
      }
    }
    fo=as.formula(fo.str)
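
    For instance, if th selects only CRIM (continuous) and CHAS (a factor, once the data have been set up as in the full code below), the loop produces the string "MEDV ~ s(CRIM) + CHAS":

    th <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # CRIM and CHAS only
    # ... run the loop above ...
    fo
    # MEDV ~ s(CRIM) + CHAS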

    To evaluate the subsets, I split the data into training, validation, and test sets. I fit each subset on the training data and measured the R-squared of its predictions on the validation set. The full code can be found below.

    library(mgcv)
    library(tabuSearch)
    # http://archive.ics.uci.edu/ml/datasets/Housing
    housing=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
    colnames(housing)=c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
     
    housing$CHAS=as.factor(housing$CHAS)
    housing$RAD=as.factor(housing$RAD) # Changed to factor because it has only 9 unique values
    summary(housing)
     
    set.seed(20120823)
    cv=sample(nrow(housing))
    train=housing[cv[1:300],]
    valid=housing[cv[301:400],]
    test=housing[cv[401:nrow(housing)],]
     
    ssto=sum((valid$MEDV-mean(valid$MEDV))^2)
    # Objective for tabuSearch: build the GAM formula from the 0/1 vector th,
    # fit it on the training set, and return the R-squared on the validation set.
    evaluate <- function(th){ 
      num.cols=sum(th)
      if (num.cols == 0) return(0)
      fo.str="MEDV ~"
      cum.cols=0
      for (i in 1:length(th)) {
        if (th[i]>0) {
          if (is.factor(train[,i])) {
            fo.str=paste(fo.str,colnames(train)[i],sep=" ")
          } else {
            fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
          }
          cum.cols=cum.cols+1
          if (cum.cols<num.cols) {
            fo.str=paste(fo.str,"+")
          }
        }
      }
      fo=as.formula(fo.str)
      gam1 <- gam(fo,data=train)
      pred1 <- predict(gam1,valid,se=FALSE)
      sse <- sum((pred1-valid$MEDV)^2,na.rm=TRUE)
      return(1-sse/ssto)
    }
     
    res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5,
                      config = rbinom(ncol(train)-1,1,.5), nRestarts = 4,verbose=TRUE)

    It was able to find a subset with a 0.8678 R-squared on the validation set (and 0.8349 on the test set). The formula found was:

    MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT).
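
    To see which configuration the search kept as the best, the components of the result object that the plotting code below uses (configKeep and eUtilityKeep) can be queried directly. This snippet is my own addition, not part of the original run:

    # Pull out the best subset kept by the search (uses the configKeep and
    # eUtilityKeep components returned by tabuSearch).
    best.iter <- which.max(res$eUtilityKeep)
    best.th   <- res$configKeep[best.iter, ]
    colnames(train)[which(best.th == 1)]   # variables in the best model
    res$eUtilityKeep[best.iter]            # its validation R-squared

    The test-set R-squared can be computed the same way as in evaluate(): rebuild the formula for best.th, refit the GAM on train, and score the predictions against test$MEDV.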


    Visualizing the results

    The tabu search gave me the subset it thinks is best, but I would like a better handle on how it got there: whether there were lots of models of similar quality built from different variables, or variables that appeared in nearly all of the top-performing models. To explore this, I created a heat plot that shows whether each variable was in the regression at each iteration.

    Below you can see which variables were included at each iteration. A few variables seem to be more important because they are included in almost every iteration, such as LSTAT, PTRATIO, and RM. But this doesn't tell me which iterations were the best.


    In the chart below, I shaded each region by the rank of the model at that iteration; a higher rank means the model did better. It is not easy to read, but we can glean a little more information from this chart. The models with RAD and DIS do noticeably better than the models without them, even though those variables are not in every iteration. Further, the models with AGE seem a bit worse than those without it.

    The R code to make these plots is below.

    library(reshape)
    library(ggplot2); theme_set(theme_bw())
     
    tabu.df=data.frame(res$configKeep)
    colnames(tabu.df)=colnames(train)[1:(ncol(train)-1)]
    tabu.df$Iteration=1:nrow(tabu.df)
    tabu.df$RSquared=res$eUtilityKeep
    tabu.df$Rank=rank(tabu.df$RSquared)
    tabu.melt=melt(tabu.df,id=c("Iteration","RSquared","Rank"))
    tabu.melt$RSquared=ifelse(tabu.melt$value==1,tabu.melt$RSquared,0)
    tabu.melt$Rank=ifelse(tabu.melt$value==1,tabu.melt$Rank,0)
    (pHeat01 <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = value)) +
      scale_fill_gradient(low = "white", high = "steelblue",guide=FALSE))
    (pHeatRank <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = Rank)) +
      scale_fill_gradient(low = "white", high = "steelblue"))

  2. Introduction

    Matrix factorization has been proven to be one of the best ways to do collaborative filtering. The most common example of collaborative filtering is to predict how much a viewer will like a movie. The power of matrix factorization was a key development of the Netflix Prize (see http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf).

    Using the movie rating example, the idea is that there are some underlying features of the movie and underlying attributes of the user that interact to determine if the user will like the movie. So if the user typically likes comedies, and the movie is a comedy, these will interact and the user will like the movie. The best part is that in order to determine these features, we do not need to use any background information on the movie or the user. For example, we don't need someone to manually code whether each movie is a comedy or not. Matrix factorization just needs a set of historical ratings, which is usually easily available.
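
    In matrix terms, the predicted rating is just the dot product of the user's feature vector and the movie's feature vector. A toy illustration with two features and made-up numbers:

    # Toy illustration: two latent features, made-up numbers.
    user.features  <- c(comedy = 0.9, drama = 0.1)   # this user mostly likes comedies
    movie.features <- c(comedy = 0.8, drama = 0.2)   # the movie is mostly a comedy
    sum(user.features * movie.features)              # predicted affinity: 0.74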

    Using only a few features, we could probably interpret what each of them means (comedy, drama, foreign, blockbuster, ...). In the Netflix competition, they found that increasing the number of underlying features that make up a user and a movie almost always improved accuracy (they used up to 1500 features). With that many features, they can no longer be interpreted as in the example.

    Baseball Data

    I find this method interesting, and I wanted to apply it to some real data. One idea I had was to predict the result of a pitcher-batter match-up in baseball. Usually in baseball, analysts cry "small sample size" if someone reports that a pitcher is particularly good or bad versus a batter. To get an accurate assessment that does not suffer from random variation, we need 100 or more at bats, which usually takes at least a few years.

    It seems that matrix factorization might be a natural fit, because we are not predicting the specific interaction of the batter/pitcher. Instead, we are estimating some underlying features of the batter and the pitcher. We can estimate these features from all of the match-ups that the batter and pitcher have been in -- not just the ones against each other.

    One example of a feature that I hoped to extract is batter and pitcher handedness. It is well known that batters fare poorly against pitchers of the same handedness and do better against pitchers of the opposite handedness. Other examples might relate to whether the pitcher throws hard and how well the batter does against fast pitches; similarly with breaking balls, and so on.

    In the Netflix competition, each user rates a movie on a 1 to 5 star scale and rates it only once. The outcome of a match-up is different, since there is no obvious "rating" of the outcome. I used the change in expected number of runs scored that results from the outcome of a plate appearance as the "rating" to be predicted (from here). So a single is worth 0.474 runs and an out is worth -0.299 runs. Also, in baseball there will often be many match-ups between the same batter and pitcher. I could have averaged the outcomes, but I left the multiple records in the data as is.
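
    To make the "rating" concrete, the mapping works roughly like the sketch below. This is only my illustration: the data frame pa of plate appearances and its event column are assumptions, and only the two run values quoted above are filled in.

    # Sketch of the run-value "rating" lookup (only the two values quoted above
    # are shown; the rest would come from the same linear-weights table).
    run.value <- c(single = 0.474, out = -0.299)
    # pa is an assumed data frame of plate appearances with an 'event' column:
    # pa$rating <- run.value[pa$event]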

    I used 2010 data and removed all pitchers who were hitting and batters who were pitching. According to my data, there were 65,128 unique batter-pitcher match-ups (with about 173,000 plate appearances). I used 5,000 match-ups for validation, 5,000 for testing, and the rest for training.

    Model Fitting

    The first model I used as a baseline just included "biases" for the batter and pitcher, as well as an overall league average bias. In the baseline, there are no factors or interactions. So the result of the match-up between a batter i and pitcher j would be estimated as the overall average bias plus the batter i bias plus the pitcher j bias. In math terms:

    r̂_ij = μ + b_i + b_j,

    where r̂_ij is the predicted rating for batter i against pitcher j, μ is the overall average bias, b_i is the bias for batter i, and b_j is the bias for pitcher j.
    Higher values of the batter bias mean the batter is better than average, and lower values of the pitcher bias mean the pitcher is better than average.

    To avoid overfitting the data, I used L2 regularization on the batter and pitcher biases. For this and all later estimation, I used the validation data to choose the penalty, and I used a gradient descent method to solve for the parameters.
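
    As a rough sketch of what that estimation looks like, a stochastic gradient descent pass over the bias-only model is below. This is my own illustration, not the exact code I used: the data frame pa (with integer batter and pitcher ids and the run-value rating), the learning rate, and the penalty lambda are all assumptions.

    # Sketch: stochastic gradient descent for the bias-only baseline with L2
    # regularization. mu, b.bat, b.pit are the overall, batter, and pitcher biases.
    fit.biases <- function(pa, n.bat, n.pit, lrate = 0.005, lambda = 0.1, epochs = 20) {
      mu <- mean(pa$rating)
      b.bat <- rep(0, n.bat)
      b.pit <- rep(0, n.pit)
      for (e in 1:epochs) {
        for (k in sample(nrow(pa))) {
          i <- pa$batter[k]; j <- pa$pitcher[k]
          err <- pa$rating[k] - (mu + b.bat[i] + b.pit[j])
          b.bat[i] <- b.bat[i] + lrate * (err - lambda * b.bat[i])
          b.pit[j] <- b.pit[j] + lrate * (err - lambda * b.pit[j])
        }
      }
      list(mu = mu, b.bat = b.bat, b.pit = b.pit)
    }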

    Adding in the factor interaction, the result of the match-up between a batter i and pitcher j is

    r̂_ij = μ + b_i + b_j + p_i · q_j,

    where p_i is a vector of offensive features for batter i and q_j is a vector of defensive features for pitcher j. The two vectors must be of the same length, which we are free to choose (the number of features). Again, I used L2 regularization on the feature vectors.
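
    Extending the same stochastic updates to the factor vectors gives something like the sketch below (again my own illustration under the same assumptions; K is the number of features, and the rows of P and Q hold the batter and pitcher factor vectors).

    # Sketch: SGD updates once K-dimensional factor vectors are added.
    # P[i, ] holds the offensive features of batter i, Q[j, ] the defensive
    # features of pitcher j; both are L2-penalized like the biases.
    fit.mf <- function(pa, n.bat, n.pit, K = 5, lrate = 0.005, lambda = 0.1, epochs = 20) {
      mu <- mean(pa$rating)
      b.bat <- rep(0, n.bat); b.pit <- rep(0, n.pit)
      P <- matrix(rnorm(n.bat * K, sd = 0.01), n.bat, K)
      Q <- matrix(rnorm(n.pit * K, sd = 0.01), n.pit, K)
      for (e in 1:epochs) {
        for (k in sample(nrow(pa))) {
          i <- pa$batter[k]; j <- pa$pitcher[k]
          err <- pa$rating[k] - (mu + b.bat[i] + b.pit[j] + sum(P[i, ] * Q[j, ]))
          b.bat[i] <- b.bat[i] + lrate * (err - lambda * b.bat[i])
          b.pit[j] <- b.pit[j] + lrate * (err - lambda * b.pit[j])
          p.old <- P[i, ]
          P[i, ] <- P[i, ] + lrate * (err * Q[j, ] - lambda * P[i, ])
          Q[j, ] <- Q[j, ] + lrate * (err * p.old - lambda * Q[j, ])
        }
      }
      list(mu = mu, b.bat = b.bat, b.pit = b.pit, P = P, Q = Q)
    }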

    Results

    I fit the matrix factorization model using 0, 1, 2, 5, and 10 features. The model with 0 features corresponds to the baseline model. As stated above, I was hoping to see that handedness would show up as one of the first features.

    I fit the models on the training set, tuned the regularization parameters on the validation set, and finally compared the models on the test set. Unfortunately, the baseline model performed best of all, and performance degraded as more features were added. A plot of the test mean squared errors is below. The differences between the models are small, but I believe that is because the regularization parameter is shrinking the factor vectors toward 0.

    A simpler baseline could also have been used -- an overall mean with no biases. The MSEs of all but the 10-factor model were better than this overall mean. So the biases help, but unfortunately the factorization did not.

    I compared the factor vectors of the pitchers by their handedness and I found no difference between righties and lefties.
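
    The handedness check amounted to something like the following sketch (my illustration: the fitted pitcher factor matrix Q from the model above and a pitcher.hand vector of "L"/"R" labels aligned with its rows are assumptions).

    # Compare pitcher factor vectors by throwing hand.
    aggregate(as.data.frame(Q), by = list(hand = pitcher.hand), FUN = mean)
    boxplot(Q[, 1] ~ pitcher.hand,
            xlab = "Pitcher hand", ylab = "First factor")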

    Summary

    The outcome of this analysis was not what I hoped for, but I still learned something about matrix factorization and its implementation. I think the moral of the story is that it is hard to predict batter-pitcher match-ups, probably because there is a lot of variation in every at bat. Matrix factorization has been successfully applied in situations where the data is much more sparse than this application, so this difference may be a reason for its failure. I plan to try this again with the libFM software to see if I get the same results.

    Update:
    I implemented this in libFM and basically got the same results, so I guess I was doing it correctly. If I work on this further, I wonder if adding attributes about the pitcher explicitly would help. I would include an indicator for righty/lefty or maybe home/away.