# read a table copied to the clipboard from Excel (Windows)
read.excel <- function(header=TRUE,...) {
  read.table("clipboard",sep="\t",header=header,...)
}
dat=read.excel()

# write a data frame to the clipboard so it can be pasted into Excel
write.excel <- function(x,row.names=FALSE,col.names=TRUE,...) {
  write.table(x,"clipboard",sep="\t",row.names=row.names,col.names=col.names,...)
}
write.excel(dat)
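These functions use the "clipboard" connection, which only exists on Windows. On a Mac, a rough equivalent (an untested sketch, not part of the original tip) pipes through pbpaste and pbcopy:
# macOS sketch: pbpaste/pbcopy stand in for the Windows "clipboard" connection
read.excel.mac <- function(header=TRUE,...) {
  read.table(pipe("pbpaste"),sep="\t",header=header,...)
}
write.excel.mac <- function(x,row.names=FALSE,col.names=TRUE,...) {
  con <- pipe("pbcopy","w")
  write.table(x,con,sep="\t",row.names=row.names,col.names=col.names,...)
  close(con)
}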
library(lubridate)
library(sqldf)
playlist=read.csv("CD101Playlist.csv",stringsAsFactors=FALSE)
dates=mdy(substring(playlist[,3],nchar(playlist[,3])-9,nchar(playlist[,3])))
times=hm(substring(playlist[,3],1,nchar(playlist[,3])-10))
playlist$Month=ymd(paste(year(dates),month(dates),"1",sep="-"))
playlist$Day=dates
playlist$Time=times
playlist=playlist[order(playlist$Day,playlist$Time),]
playlist=subset(playlist,Day>=mdy("1/1/13"))
playlist$ArtistSong=paste(playlist$Artist,playlist$Song,sep="-")
top.songs=sqldf("Select ArtistSong, Count(ArtistSong) as Num
From playlist
Group By ArtistSong
Order by Num DESC
Limit 10")
ArtistSong Num
1 FITZ AND THE TANTRUMS-OUT OF MY LEAGUE 809
2 ALT J-BREEZEBLOCKS 764
3 COLD WAR KIDS-MIRACLE MILE 759
4 ATLAS GENIUS-IF SO 750
5 FOALS-MY NUMBER 687
6 MS MR-HURRICANE 679
7 THE NEIGHBOURHOOD-SWEATER WEATHER 657
8 CAPITAL CITIES-SAFE AND SOUND 646
9 VAMPIRE WEEKEND-DIANE YOUNG 639
10 THE FEATURES-THIS DISORDER 632
plays.per.day=sqldf("Select Day, Count(Artist) as Num
From playlist
Group By Day
Order by Day")
playlist.top.songs=subset(playlist,ArtistSong %in% top.songs$ArtistSong[1:5])
song.per.day=sqldf(paste0("Select Day, ArtistSong, Count(ArtistSong) as Num
From [playlist.top.songs]
Group By Day, ArtistSong
Order by Day, ArtistSong"))
library(reshape2) # for dcast and melt
dspd=dcast(song.per.day,Day~ArtistSong,sum,value.var="Num")
song.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
song.per.day[is.na(song.per.day)]=0
song.per.day=melt(song.per.day,1,variable.name="ArtistSong",value.name="Num")
song.per.day$Alpha=ifelse(song.per.day$Num>0,1,0)
library(ggplot2)
ggplot(song.per.day,aes(Day,Num,colour=ArtistSong))+geom_point(aes(alpha=Alpha))+
geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
labs(x="Date",y="Plays Per Day",title="Top Songs",colour=NULL)+
scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
Alt-J was more popular at the beginning of the year, and Foals have been more popular recently.
top.artists=sqldf("Select Artist, Count(Artist) as Num
From playlist
Group By Artist
Order by Num DESC
Limit 10")
Artist Num
1 MUSE 1683
2 VAMPIRE WEEKEND 1504
3 SILVERSUN PICKUPS 1442
4 FOALS 1439
5 PHOENIX 1434
6 COLD WAR KIDS 1425
7 JAKE BUGG 1316
8 QUEENS OF THE STONE AGE 1296
9 ALT J 1233
10 OF MONSTERS AND MEN 1150
playlist.top.artists=subset(playlist,Artist %in% top.artists$Artist[1:5])
artists.per.day=sqldf(paste0("Select Day, Artist, Count(Artist) as Num
From [playlist.top.artists]
Group By Day, Artist
Order by Day, Artist"))
dspd=dcast(artists.per.day,Day~Artist,sum,value.var="Num")
artists.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
artists.per.day[is.na(artists.per.day)]=0
artists.per.day=melt(artists.per.day,1,variable.name="Artist",value.name="Num")
artists.per.day$Alpha=ifelse(artists.per.day$Num>0,1,0)
ggplot(artists.per.day,aes(Day,Num,colour=Artist))+geom_point(aes(alpha=Alpha))+
geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
labs(x="Date",y="Plays Per Day",title="Top Artists",colour=NULL)+
scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
The patterns for the artists are not as clear as they are for the songs.
library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)
startNum = 0
while (TRUE) {
    theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=",
        startNum)
    table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
    if (startNum == 0) {
        playlist = table[, -1]
    } else {
        playlist = rbind(playlist, table[, -1])
    }
    dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
    print(dt)
    if (dt < mdy("1/1/12")) {
        break
    }
    startNum = startNum + 50
}
playlist = unique(playlist) # Remove Dupes
write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
This takes a while, and the result is fairly large; my file has over 150,000 songs. If you want just a little data, change the cutoff date in the loop to last week or so (see the sketch above). The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that the data only goes back to March of 2012.
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)
## Artist Song Last.Played
## 151638 DEATH CAB FOR CUTIE YOU ARE A TOURIST 12:34am03/01/2012
## 151637 SLEEPER AGENT GET BURNED 12:38am03/01/2012
## 151636 WASHED OUT AMOR FATI 12:41am03/01/2012
## 151635 COLDPLAY CHARLIE BROWN 12:45am03/01/2012
## 151634 GROUPLOVE TONGUE TIED 12:49am03/01/2012
## 151633 SUGAR YOUR FAVORITE THING 12:52am03/01/2012
## Month Day Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S
Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.
sqldf("Select Artist, Count(Artist) as PlayCount
From playlist
Group By Artist
Order by PlayCount DESC
Limit 10")
## Artist PlayCount
## 1 SILVERSUN PICKUPS 2340
## 2 THE BLACK KEYS 2203
## 3 MUSE 1988
## 4 THE SHINS 1885
## 5 OF MONSTERS AND MEN 1753
## 6 PASSION PIT 1552
## 7 GROUPLOVE 1544
## 8 RED HOT CHILI PEPPERS 1514
## 9 METRIC 1495
## 10 ATLAS GENIUS 1494
sqldf("Select Artist, Song, Count(Song) as PlayCount
From playlist
Group By Artist, Song
Order by PlayCount DESC
Limit 10")
## Artist Song PlayCount
## 1 PASSION PIT TAKE A WALK 828
## 2 SILVERSUN PICKUPS PIT, THE 825
## 3 ATLAS GENIUS TROJANS 819
## 4 WALK THE MOON ANNA SUN 742
## 5 THE BLACK KEYS LITTLE BLACK SUBMARINES 736
## 6 DIVINE FITS WOULD THAT NOT BE NICE 731
## 7 THE LUMINEERS HO HEY 722
## 8 CAPITAL CITIES SAFE AND SOUND 712
## 9 OF MONSTERS AND MEN MOUNTAIN SOUND 711
## 10 ALT J BREEZEBLOCKS 691
I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently, and The Shins and Grouplove have lost some steam.
artist.month=sqldf("Select Month, Artist, Count(Song) as Num
From playlist
Group By Month, Artist
Order by Month, Artist")
artist=sqldf("Select Artist, Count(Artist) as Num
From playlist
Group By Artist
Order by Num DESC")
p=ggplot(subset(artist.month,Artist %in% head(artist$Artist,8)),aes(Month,Num))
p+geom_bar(stat="identity",aes(fill=Artist),position='fill',colour="grey")+
labs(y="Percentage of Plays")
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")
# plot the cumulative play count against time to check how evenly plays were logged
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")
# the early months look sparse, so keep only plays from August 2012 on
playlist = subset(playlist, Day >= mdy("8/1/12"))
Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
library(ggplot2); theme_set(theme_bw())
brackets=read.csv("NCAAHistory.csv")
# use only data from 1985 on in which the first team has the higher seed
brackets=subset(brackets,Seed<Opponent.Seed & Year>=1985 & Round!="Opening Round")
brackets$SeedDiff=abs(brackets$Opponent.Seed-brackets$Seed)
brackets$HigherSeedWon=ifelse(brackets$Opponent.Seed>brackets$Seed,brackets$Wins,brackets$Losses)
brackets$HigherSeedScoreDiff=ifelse(brackets$Opponent.Seed>brackets$Seed,1,-1)*(brackets$Score-brackets$Opponent.Score)
seed.diffs=sort(unique(brackets$SeedDiff))
win.pct=sapply(seed.diffs,function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,win.pct),aes(seed.diffs,win.pct))+geom_point()+
  geom_hline(yintercept=0.5,linetype=2)+
  geom_line()+labs(x="Seed Difference",y="Proportion of Games Won by Higher Seed")
seed.diffs=sort(unique(brackets$SeedDiff))
means=sapply(seed.diffs,function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
sds=sapply(seed.diffs,function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,means))+
  geom_ribbon(aes(ymin=means-2*sds,ymax=means+2*sds),alpha=.5)+geom_point()+geom_line()+
  geom_hline(yintercept=0,linetype=2)+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed")
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,1-pnorm(0,means,sds)))+
  geom_point()+geom_line()+geom_hline(yintercept=0.5,linetype=2)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning Based on Margin of Victory")
ggplot(data=brackets,aes(SeedDiff,HigherSeedWon))+
  stat_smooth(method="gam",method.args=list(family="binomial"),se=F,formula=y~0+x,aes(colour="Logistic"),size=1)+
  stat_smooth(method="gam",method.args=list(family="binomial"),se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.15,position=position_jitter(height=.025,width=.25))+
  labs(x="Seed Difference",y="Game Won by Higher Seed",colour="Model")
ggplot(data=brackets,aes(SeedDiff,HigherSeedScoreDiff))+
  stat_smooth(method="lm",se=F,formula=y~0+x,aes(colour="Linear"),size=1)+
  stat_smooth(method="gam",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
  geom_jitter(alpha=.25,position=position_jitter(height=0,width=.25))+
  labs(x="Seed Difference",y="Margin of Victory by Higher Seed",colour="Model")
library(gam)
lm.seed=lm(HigherSeedScoreDiff~0+SeedDiff,data=brackets)
gam.seed=gam(HigherSeedScoreDiff~s(SeedDiff),data=brackets)
pred.lm.seed=predict(lm.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
pred.gam.seed=predict(gam.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
se.lm=sqrt(mean(lm.seed$residuals^2))
se.gam=sqrt(mean(gam.seed$residuals^2))
# P(higher seed wins) = P(margin of victory > 0) under a normal approximation,
# combining the residual standard error with the standard error of the fit
df1=data.frame(SeedDiff=0:15,
               ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)))
ggplot(df1)+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(SeedDiff,ProbLM,colour="Linear"),size=1)+
  geom_line(aes(SeedDiff,ProbGAM,colour="GAM"),size=1)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
logit.seed=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
logit.seed.gam=gam(HigherSeedWon~s(SeedDiff),data=brackets,family=binomial(logit))
df2=data.frame(SeedDiff=0:15,
               ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
               ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)),
               ProbLogit=predict(logit.seed,data.frame(SeedDiff=0:15),type="response"),
               ProbLogitGAM=predict(logit.seed.gam,data.frame(SeedDiff=0:15),type="response"))
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbFreq=win.pct),all.x=T)
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbScore=1-pnorm(0,means,sds)),all.x=T)
ggplot(df2,aes(SeedDiff))+geom_hline(yintercept=0.5,linetype=2)+
  geom_line(aes(y=ProbLM,colour="Linear"),size=1)+
  geom_line(aes(y=ProbGAM,colour="GAM"),size=1)+
  geom_line(aes(y=ProbLogit,colour="Logistic"),size=1)+
  geom_line(aes(y=ProbLogitGAM,colour="Logistic GAM"),size=1)+
  geom_line(aes(y=ProbFreq,colour="Frequencies"),size=1)+
  geom_line(aes(y=ProbScore,colour="Score Diff"),size=1)+
  geom_point(aes(y=ProbFreq,colour="Frequencies"),size=3)+
  geom_point(aes(y=ProbScore,colour="Score Diff"),size=3)+
  labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
ggplot(df2)+geom_hline(yintercept=0.5,linetype=2)+
  geom_point(aes(x=SeedDiff,y=ProbFreq,colour="Frequencies"),size=1)
This bookmarklet rewrites the current page's URL to go through the OSU library proxy:
javascript:(function(){ var curloc = window.location.toString(); var newloc = curloc.substring(0, curloc.indexOf("/", 8)) + ".proxy.lib.ohio-state.edu" + curloc.substring(curloc.indexOf("/", 8)); location.href=newloc; })()
reference=readLines("warandpeace.txt") reference=toupper(reference) trans.mat=matrix(0,27,27) rownames(trans.mat)=colnames(trans.mat)=c(toupper(letters),"") lastletter="" for (ln in 1:length(reference)) { if (ln %% 1000 ==0) {cat("Line",ln,"\n")} for (pos in 1:nchar(reference[ln])) { curletter=substring(reference[ln],pos,pos) if (curletter %in% toupper(letters)) { trans.mat[rownames(trans.mat)==lastletter, colnames(trans.mat)==curletter]= trans.mat[rownames(trans.mat)==lastletter, colnames(trans.mat)==curletter]+1 lastletter=curletter } else { if (lastletter!="") { trans.mat[rownames(trans.mat)==lastletter,27]= trans.mat[rownames(trans.mat)==lastletter,27]+1 lastletter="" } } } curletter="" if (lastletter!="") { trans.mat[rownames(trans.mat)==lastletter,27]= trans.mat[rownames(trans.mat)==lastletter,27]+1 } lastletter="" } trans.prob.mat=sweep(trans.mat+1,1,rowSums(trans.mat+1),FUN="/")
# decode applies a substitution mapping to a coded string
decode <- function(mapping,coded) {
  coded=toupper(coded)
  decoded=coded
  for (i in 1:nchar(coded)) {
    if (substring(coded,i,i) %in% toupper(letters)) {
      substring(decoded,i,i)=toupper(letters[mapping==substring(coded,i,i)])
    }
  }
  decoded
}

# log.prob scores a decoded string by the log-likelihood of its letter transitions
log.prob <- function(mapping,decoded) {
  logprob=0
  lastletter=""
  for (i in 1:nchar(decoded)) {
    curletter=substring(decoded,i,i)
    if (curletter %in% toupper(letters)) {
      logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,
                                         colnames(trans.mat)==curletter])
      lastletter=curletter
    } else {
      if (lastletter!="") {
        logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
        lastletter=""
      }
    }
  }
  if (lastletter!="") {
    logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
    lastletter=""
  }
  logprob
}
correctTxt="ENTER HAMLET HAM TO BE OR NOT TO BE THAT IS THE QUESTION WHETHER TIS NOBLER IN THE MIND TO SUFFER THE SLINGS AND ARROWS OF OUTRAGEOUS FORTUNE OR TO TAKE ARMS AGAINST A SEA OF TROUBLES AND BY OPPOSING END" coded=decode(sample(toupper(letters)),correctTxt) # randomly scramble the text mapping=sample(toupper(letters)) # initialize a random mapping i=1 iters=2000 cur.decode=decode(mapping,coded) cur.loglike=log.prob(mapping,cur.decode) max.loglike=cur.loglike max.decode=cur.decode while (i<=iters) { proposal=sample(1:26,2) # select 2 letters to switch prop.mapping=mapping prop.mapping[proposal[1]]=mapping[proposal[2]] prop.mapping[proposal[2]]=mapping[proposal[1]] prop.decode=decode(prop.mapping,coded) prop.loglike=log.prob(prop.mapping,prop.decode) if (runif(1)<exp(prop.loglike-cur.loglike)) { mapping=prop.mapping cur.decode=prop.decode cur.loglike=prop.loglike if (cur.loglike>max.loglike) { max.loglike=cur.loglike max.decode=cur.decode } cat(i,cur.decode,"\n") i=i+1 } }
# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is a matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>
visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
  1/(1+exp(-rbm_w %*% visible_state))
}
hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
  1/(1+exp(-t(rbm_w) %*% hidden_state))
}
configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
  out=0
  for (i in 1:dim(visible_state)[2]) {
    out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
  }
  out/dim(visible_state)[2]
}
configuration_goodness_gradient <- function(visible_state, hidden_state) {
  hidden_state %*% t(visible_state)/dim(visible_state)[2]
}
sample_bernoulli <- function(mat) {
  dims=dim(mat)
  matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
}
cd1 <- function(rbm_w, visible_data) {
  # one step of contrastive divergence (CD-1)
  visible_data = sample_bernoulli(visible_data)
  H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
  vh0=configuration_goodness_gradient(visible_data, H0)
  V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
  H1=visible_state_to_hidden_probabilities(rbm_w, V1)
  vh1=configuration_goodness_gradient(V1, H1)
  vh0-vh1
}
rbm <- function(num_hidden, training_data, learning_rate, n_iterations,
                mini_batch_size=100, momentum=0.9, quiet=FALSE) {
  # This trains a model that's defined by a single matrix of weights.
  # <num_hidden> is the number of hidden units.
  # cd1 is a function that takes parameters <model> and <data> and returns the
  # gradient (or approximate gradient in the case of CD-1) of the function that
  # we're maximizing. Note the contrast with the loss function that we saw in
  # PA3, which we were minimizing. The returned gradient is an array of the
  # same shape as the provided <model> parameter.
  # This uses mini-batches, no weight decay, and no early stopping.
  # It returns the matrix of weights of the trained model.
  n=dim(training_data)[2]
  p=dim(training_data)[1]
  if (n %% mini_batch_size != 0) {
    stop("the number of test cases must be divisible by the mini_batch_size")
  }
  model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
  momentum_speed = matrix(0,num_hidden,p)
  start_of_next_mini_batch = 1
  for (iteration_number in 1:n_iterations) {
    if (!quiet) {cat("Iter",iteration_number,"\n")}
    mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
    start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
    gradient = cd1(model, mini_batch)
    momentum_speed = momentum * momentum_speed + gradient
    model = model + momentum_speed * learning_rate
  }
  return(model)
}
# train is assumed to be a binary matrix with one 16x16 image (256 pixels) per column
weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
            mini_batch_size=100, momentum=0.9)
library(ggplot2)
library(reshape2)
mw=melt(weights)
# map each visible unit back to its position in the 16x16 image
mw$Var3=floor((mw$Var2-1)/16)+1
mw$Var2=(mw$Var2-1)%%16 + 1
mw$Var3=17-mw$Var3
ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
  scale_fill_continuous(low='white',high='black')+coord_equal()+
  labs(x=NULL,y=NULL,title="Visualization of Weights")+
  theme(legend.position="none")
votes = read.csv("HOF votes.csv", row.names = 1, header = TRUE)
pca = princomp(votes)
screeplot(pca, type = "l", main = "Scree Plot")
dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")
The second component isn't as easy to decipher. The players at the negative end seem to be players that are preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters, and he isn't that far separated from them.
dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
Comp.2, colour = as.factor(Barry.Bonds), shape = as.factor(Roger.Clemens)),
size = 4) + coord_equal() + labs(colour = "Bonds", shape = "Clemens")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
    Comp.2, colour = as.factor(paste(Roger.Clemens, "/", Jeff.Bagwell))), size = 4) +
    geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
    labs(colour = "Clemens / Bagwell")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
    Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) +
    geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
    labs(colour = "Bonds / Schilling")
PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
    # PC being a prcomp object
    library(grid)
    data <- data.frame(obsnames = row.names(PC$x), PC$x)
    plot <- ggplot(data, aes_string(x = x, y = y)) + geom_text(alpha = 0.4,
        size = 3, aes(label = obsnames))
    plot <- plot + geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2)
    datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
    # scale the loading arrows so they fit within the spread of the scores
    mult <- min((max(data[, y]) - min(data[, y]))/(max(datapc[, y]) - min(datapc[, y])),
                (max(data[, x]) - min(data[, x]))/(max(datapc[, x]) - min(datapc[, x])))
    datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult * (get(y)))
    plot <- plot + coord_equal() + geom_text(data = datapc, aes(x = v1, y = v2,
        label = varnames), size = 5, vjust = 1, color = "red")
    plot <- plot + geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1,
        yend = v2), arrow = arrow(length = unit(0.2, "cm")), alpha = 0.75, color = "red")
    plot
}
fit <- prcomp(votes, scale = F)
PCbiplot(fit)
I could have also attempted to rotate the factors to make them more interpretable, but they appeared to have an easy interpretation as is. Since we were looking at 2-d plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate factors. I chose to use the principal component method because the data are definitely not normal (being 0's and 1's).
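If you do want to try a rotation, base R makes it easy; here is a minimal sketch using varimax on the first two loading vectors (and, commented out, factanal as the likelihood-based route, which I would distrust here for the same non-normality reason):
# rotate the first two principal component loadings with varimax
rot = varimax(pca$loadings[, 1:2])
dotchart(sort(rot$loadings[, 1]), main = "First Rotated Component")
# likelihood-based alternative; assumes normality, which is questionable for 0/1 votes
# fa = factanal(votes, factors = 2, rotation = "varimax")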