Click on my name above to go back to the portfolio homepage:)

Music contains so much information. We can not only compare all the hit songs to conclude the popular music trend, but also analyse the song behaviour of a particular person and get to know more about him/her through just a small music application.

Spotify is one of the top music streaming service around . With 35+ million songs and 170+ million monthly active users, musicians can spread their music and reach the audience easily. Also, it is an ideal platform for users to find their favorite music quickly and conveniently due to its large catalog, collaborative playlists, podcasts, and other attractive features. On the app, music can be browsed or searched via various parameters — such as artists, album, genre, playlist, or record label. Users can create, edit, and share playlists, share tracks on social media, and make playlists with other users.

Recently, I discovered that the Spotify API provides audio features for each song in its database, with which we are able to quantify a song. Therefore, I would like to conduct some music analysis based on it.

Intro

This project is divided into two parts. The goal in Part I is to analyze what song characteristics would affect its popularity. To achieve this, I built up a regression model and set the popularity as the dependent variable. The goal in Part II is to create a like song prediction system for a specific spotify user.

For the study, I will access the Spotify Web API, which provides data from the Spotify music catalog. This can be accessed via standard HTTPS requests to an API endpoint.

The Spotify API provides, among other things, track information for each song, including audio statistics such as danceability, instrumentalness, or tempo. Each feature measures an aspect of a song. Detailed information on how each feature is calculated can be found in the Spotify API Website.

Part ♪

Web Scraping I

First, I got my Client ID and the Client Secret credentials from the Spotify for Developers to connect to the Spotify API.

Sys.setenv(SPOTIFY_CLIENT_ID = 'client_id')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'client_secret')
access_token <- get_spotify_authorization_code()

Then, I created two new functions for personal usage based on some functions in the spotifyr package.

search_spotify1(): It is like searching in the spotify application. The returned results should be album/artist/playlist/track search result for one specifc genre.
main(): I used the search_spotify1() function I created above on 7 selected genres (country, jazz, classical, hip-hop, rap, pop and rock) repeatedly to get 10 thousand tracks for each genre and intend to bind them in one large list.

track_lists <- data.frame()
# function1
search_spotify1 <- function(genre = character(), 
                            type = c('album', 'artist', 'playlist', 'track'), 
                            market = NULL, 
                            limit = 20, 
                            offset = 0, 
                            include_external = NULL, 
                            authorization = get_spotify_access_token(),
                            include_meta_info = FALSE) {
  
  base_url <- 'https://api.spotify.com/v1/search'
  
  if (!is.character(genre)) {
    stop('"genre" must be a string')
  }
  
  if (!is.null(market)) {
    if (!str_detect(market, '^[[:alpha:]]{2}$')) {
      stop('"market" must be an ISO 3166-1 alpha-2 country code')
    }
  }
  
  if ((limit < 1 | limit > 50) | !is.numeric(limit)) {
    stop('"limit" must be an integer between 1 and 50')
  }
  
  if ((offset < 0 | offset > 10000) | !is.numeric(offset)) {
    stop('"offset" must be an integer between 1 and 10,000')
  }
  
  if (!is.null(include_external)) {
    if (include_external != 'audio') {
      stop('"include_external" must be "audio" or an empty string')
    }
  }
  
  params <- list(
    q =  str_replace_all(str_glue('genre:"{genre}"'), ' ', '+'),
    type = paste(type, collapse = ','),
    market = market,
    limit = limit,
    offset = offset,
    include_external = include_external,
    access_token = authorization
  )
  res <- RETRY('GET', base_url, query = params, encode = 'json')
  stop_for_status(res)
  
  res <- fromJSON(content(res, as = 'text', encoding = 'UTF-8'), flatten = TRUE)
  
  if (!include_meta_info && length(type) == 1) {
    res <- res[[str_glue('{type}s')]]$items %>% 
      as_tibble %>% 
      mutate(genre = genre,market=market)
  }
  
  return(res)
}

# function2
track_lists <- data.frame()
main <- function(n){
  for (genre1 in c('country','jazz', 'classical', 'hip-hop', 'rap', 'pop', 'rock')){
    
    for (offset1 in 0:n){
      track_list <- search_spotify1(genre=genre1,
                                    type='track',
                                    market='US',
                                    limit = 50,
                                    offset = offset1*50)
      track_lists <- rbind.fill(track_lists,track_list)
      print(paste0('Call #',offset1+1,' for the tracks in ',genre1))
    }
  }
  return(track_lists)
  Sys.sleep(5)
}

# get the full list of songs
track_lists1 <- lapply(199,main)
track_lists2 <- track_lists1[[1]]

With the help of the spotifyr package, I was able to collect audio features information using the get_track_audio_features() function and collect artist information using the get_artists() function. I stored all the information I extracted in one .csv file.

# artist information
track_lists2['artist.id'] <- NA
track_lists2['artist.name'] <- NA
for (i in 1:nrow(track_lists2)){
  track_lists2[i,31] <- track_lists2$artists[[i]][1,][["id"]]
  track_lists2[i,32] <- track_lists2$artists[[i]][1,][["name"]]
}

# unique
track_lists3 <- track_lists2 %>% 
  group_by(id,name,album.id,album.name,album.release_date,album.total_tracks,artist.id,
           artist.name,popularity) %>% 
  filter(genre == max(genre)) %>% 
  ungroup() %>% 
  select(id,name,album.id,album.name,album.release_date,album.total_tracks,artist.id,
         artist.name,popularity,genre) %>% 
  distinct()

# get the feasures of songs (limit: 100)
features <- data.frame()
for (i in 0:floor(nrow(track_lists3)/100)){
  if (i<floor(nrow(track_lists3)/100)){
features_temp <- get_track_audio_features(track_lists3$id[(i*100+1):((i+1)*100)])
features <- rbind.fill(features,features_temp)
  }
  else{
    features_temp <- get_track_audio_features(track_lists3$id[(i*100+1):(nrow(track_lists3))])
    features <- rbind.fill(features,features_temp)
  }
}

features <- features %>% select(-uri,-track_href,-analysis_url,-type,-id)
track_listsa <- cbind(track_lists3,features)

# get the artist information of songs (limit: 50)
artists <- data.frame()
for (i in 0:floor(nrow(track_lists3)/50)){
  if (i<floor(nrow(track_lists3)/50)){
    artists_temp <- get_artists(track_lists3$artist.id[(i*50+1):((i+1)*50)])
    artists <- rbind.fill(artists,artists_temp)
  }
  else{
    artists_temp <- get_artists(track_lists3$artist.id[(i*50+1):(nrow(track_lists3))])
    artists <- rbind.fill(artists,artists_temp)
  }
}

artists <- artists %>% 
  select(popularity,followers.total) %>% 
  rename(`artists.popularity`=popularity)
track_listsb <- cbind(track_listsa,artists)

# output
write.csv(track_listsb, file = "spotify.csv",row.names=FALSE)

Data Processing I

I have collected 55557 observations with 26 variables, among which there are 14 numerical variables and 12 categorical variables. Also, I outputed the summary information of all the variables.

We want to analyze the what would impact the popularity of a song. That is to say, we set the popularity as our dependent variable.

sp <- read_csv("e:/Documents/spotify.csv")

sp <- sp %>% 
  mutate(genre=as.factor(genre),
         album.release_year = substring(album.release_date,1,4),
         album.release_year = ifelse(album.release_year == '0000','2009',album.release_year),
         album.release_year = as.factor(album.release_year),
         key=as.factor(key),
         mode=as.factor(mode),
         time_signature=as.factor(time_signature)
  )
dim(sp)

## [1] 55557    26

summary(sp)

##       id                name             album.id          album.name       
##  Length:55557       Length:55557       Length:55557       Length:55557      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  album.release_date album.total_tracks  artist.id         artist.name       
##  Length:55557       Min.   :  1.00     Length:55557       Length:55557      
##  Class :character   1st Qu.: 10.00     Class :character   Class :character  
##  Mode  :character   Median : 12.00     Mode  :character   Mode  :character  
##                     Mean   : 17.01                                          
##                     3rd Qu.: 16.00                                          
##                     Max.   :531.00                                          
##                                                                             
##    popularity         genre       danceability        energy         
##  Min.   : 0.0   classical:9260   Min.   :0.0598   Min.   :0.0000203  
##  1st Qu.:41.0   country  :8136   1st Qu.:0.4410   1st Qu.:0.3220000  
##  Median :52.0   hip-hop  :3631   Median :0.5780   Median :0.5710000  
##  Mean   :50.8   jazz     :9868   Mean   :0.5628   Mean   :0.5262837  
##  3rd Qu.:61.0   pop      :5088   3rd Qu.:0.6980   3rd Qu.:0.7420000  
##  Max.   :99.0   rap      :9627   Max.   :0.9870   Max.   :0.9990000  
##                 rock     :9947                                       
##       key           loudness       mode       speechiness    
##  0      : 6401   Min.   :-47.432   0:18764   Min.   :0.0223  
##  7      : 6285   1st Qu.:-12.768   1:36793   1st Qu.:0.0354  
##  1      : 5834   Median : -7.894             Median :0.0461  
##  2      : 5813   Mean   :-10.426             Mean   :0.0913  
##  9      : 5136   3rd Qu.: -5.629             3rd Qu.:0.0905  
##  5      : 4982   Max.   :  1.593             Max.   :0.9530  
##  (Other):21106                                               
##   acousticness       instrumentalness       liveness         valence      
##  Min.   :0.0000014   Min.   :0.0000000   Min.   :0.0134   Min.   :0.0000  
##  1st Qu.:0.0534000   1st Qu.:0.0000000   1st Qu.:0.0951   1st Qu.:0.2450  
##  Median :0.2610000   Median :0.0000335   Median :0.1190   Median :0.4420  
##  Mean   :0.3945907   Mean   :0.1885753   Mean   :0.1749   Mean   :0.4522  
##  3rd Qu.:0.7640000   3rd Qu.:0.1220000   3rd Qu.:0.2030   3rd Qu.:0.6480  
##  Max.   :0.9960000   Max.   :0.9940000   Max.   :0.9900   Max.   :0.9850  
##                                                                           
##      tempo         duration_ms      time_signature artists.popularity
##  Min.   : 32.85   Min.   :  20467   0:    1        Min.   : 16.00    
##  1st Qu.: 92.48   1st Qu.: 184328   1:  539        1st Qu.: 60.00    
##  Median :115.14   Median : 218733   3: 5367        Median : 70.00    
##  Mean   :117.36   Mean   : 238368   4:48665        Mean   : 69.22    
##  3rd Qu.:139.35   3rd Qu.: 264585   5:  985        3rd Qu.: 79.00    
##  Max.   :232.69   Max.   :3391040                  Max.   :100.00    
##                                                                      
##  followers.total    album.release_year
##  Min.   :      13   Length:55557      
##  1st Qu.:  121037   Class :character  
##  Median :  558210   Mode  :character  
##  Mean   : 2368187                     
##  3rd Qu.: 2332368                     
##  Max.   :55418461                     
##

EDA

We firstly want to explore the relationship between popularity and several categorical variables to get a better sense of our data.

▶ Mode & Popularity

Null Hypothesis: There is no difference in the mean popularity of different music mode.

I used both graphical and statistical method to conduct hypothesis testing.

Surprisingly, a song with minor mode (a song contains more darkness, sadness, ugliness and feminitiy) would tend to be more popular.

ttest <- t.test(popularity~mode,sp)
tidy(ttest)

ggplot(sp)+
  geom_boxplot(aes(x=mode, y= popularity,color=mode),size=1.6)+
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())+
  labs(title = 'Relationship b/w Mode & Popularity')

▶ Key & Popularity

Null Hypothesis: There is no difference in the mean popularity of different music key.

Based on our ANOVA test result, the null hypothesis should be rejected. Artists may get some insights from the TukeyHSD result when considering whether shifting from one certain key to another key for a new song. For example, if the artist is debating between key C and key D for his/her new song, our suggestion based on the TukeyHSD would be that the debating is pointless. When other things remain the same, change between key C and key D for a song makes almost no different because the p-value for the 2-0 comparison (the first row in the table) is 1.

aov <- aov(popularity ~ key,sp)
tidy((aov))

ggplot(sp)+
  geom_boxplot(aes(x=key, y= popularity,color=key,middle=mean(popularity)),size=1.6)+
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())+
  labs(title = 'Relationship b/w Key & Popularity')+
  geom_hline(aes(yintercept=mean(popularity)),size=1.6)

tidy(TukeyHSD(aov)) %>% arrange(desc(adj.p.value))

▶ Time Signature & Popularity

The song with 4 time signature is dominating the market. If an artist wants to create a popular song, he/she might want to follow the main market trend and focus on the 4 time signature.

sp %>% 
  filter(time_signature!='0') %>% 
  ggplot()+
  geom_boxplot(aes(x=time_signature, y= popularity,color=time_signature,
                   middle=mean(popularity)),size=1.2)+
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())+
  labs(title = 'Relationship b/w Time_Signature & Popularity')+
  geom_hline(aes(yintercept=mean(popularity)),size=1.2)

Therefore, we can come up with a basic conclusion that, if an artist wants to make a popular song, he/she would better focus on a song with minor mode, C♯, D♭, E, F♯, G♭, G♯, A and B key,and 4 time signature.

Feature Selection

There are too many varibables at present and feature selection is needed. Therefore, I made the correlation plot to detect the multicollinearity as well as used Lasso Test to add pentalty on variables in order to get the most valuable ones.

sp_n <- sp %>% keep(is.numeric) %>% select(-popularity)
library(corrplot)
corrplot.mixed(cor(sp_n), lower = "number", upper = "square", tl.pos='lt',order = "FPC",
               tl.cex=0.7,tl.srt=45,number.cex=0.9,diag='l')

From the plot, we can find that the squares in the corner show the stronger relationship. It can help us to remove the acousticness, energy, instrumentalness and artists.popularity variables.

Then, I used the lasso test to add a penalty on the rest variables to get the most valuable ones.

# feature selection 1 (based on correlation)
sp1 <- sp %>% select(-acousticness,-energy,-instrumentalness,-artists.popularity)

# feature selection 2 (based on lasso)
library(penalized)
lassoTest <- penalized(response = popularity, 
                       penalized = ~ danceability + loudness + speechiness +
                                   liveness + valence + tempo + duration_ms +
                                     followers.total + mode + time_signature + key, 
                       data = sp1, lambda1 = 13000, 
                       model = "linear", trace = FALSE)
coefficients(lassoTest)

##     (Intercept)    danceability        loudness           tempo     duration_ms 
##    6.166056e+01    6.977324e-01    9.890076e-01    3.882204e-03   -1.124181e-05 
## followers.total 
##    5.405840e-07

Finally, I got 5 variables: danceability, loudness, tempo, duration_ms and followers.total, and now I can start to build my model.

Linear Regression

▶ Centering

In order to increase the interpretation power of my model, I centered some of my variables so that the intercept would have a meaning even though the variable equals zero.

sp1 <- sp1 %>% 
  mutate(tempo=tempo-mean(tempo,na.rm=TRUE),
         duration_ms=duration_ms-mean(duration_ms,na.rm=TRUE))

▶ Model

I built the simple linear regression model.

First of all, we can look at the R-squared value to learn how much the variability within the popularity of a song can be accounted for our predictor variables. The R-squared value is 0.33, which means the current independent variables we choose can explain approximately 33% of the variablitify of the song popularity.

Then, we can find that the p-value for F-test is quite small, which suggested that the model has passed the F-test.The overall significance of our model is significant.

Furthermore, we can see that all the coefficients here have pass the t-test and the p-values suggest that they are all very significant.

# linear regression - 5 variable
lm <- lm(popularity~danceability+loudness+tempo+duration_ms+followers.total,sp1)
summary(lm)

## 
## Call:
## lm(formula = popularity ~ danceability + loudness + tempo + duration_ms + 
##     followers.total, data = sp1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.127  -7.449   0.424   7.927  45.153 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.256e+01  2.566e-01 204.811  < 2e-16 ***
## danceability     1.057e+01  3.362e-01  31.453  < 2e-16 ***
## loudness         8.607e-01  8.922e-03  96.477  < 2e-16 ***
## tempo            8.431e-03  1.726e-03   4.884 1.04e-06 ***
## duration_ms     -7.948e-06  5.016e-07 -15.846  < 2e-16 ***
## followers.total  5.333e-07  9.656e-09  55.231  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 55551 degrees of freedom
## Multiple R-squared:  0.3304, Adjusted R-squared:  0.3304 
## F-statistic:  5482 on 5 and 55551 DF,  p-value: < 2.2e-16

▶ Post Power Analysis

We conduct an a post power analysis to determine the sample size we used is effective.Using \(f^2 = \frac{R^2_{adjusted}}{1 - R^2_{adjusted}}\) as our effect size, the power we got is 1. It suggests that it is safe for us to move forward.

pwr::pwr.f2.test(u=1, v=55551, f2=0.3304/(1-0.3304), sig.level=0.05)

## 
##      Multiple regression power calculation 
## 
##               u = 1
##               v = 55551
##              f2 = 0.4934289
##       sig.level = 0.05
##           power = 1

▶ Outliers

Then, we want to confirm that whether there are outliers in our model. The plot below is a Residuals vs Leverage plot in which no evidence of outliers have shown up. Those “Cook’s distance” dashed curves don’t even appear on the plot. None of the points come close to having both high residual and leverage.

plot(lm,5)

▶ Heteroscadesticity

In order to detect the heteroscadesticity, I used both graphical and statistical methods. From the graph, we can identify that there is a declining trend. The null hypothesis of the bptest is that the residuals are constant. We can see here the p-value of both models are quite small, which means that we will reject the null hypothesis.

plot(lm$fitted.values, lm$residuals)

lmtest::bptest(lm)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm
## BP = 1666.6, df = 5, p-value < 2.2e-16

lmtest::coeftest(lm, vcov = sandwich::vcovHC(lm))

## 
## t test of coefficients:
## 
##                    Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept)      5.2558e+01  2.5388e-01 207.0214 < 2.2e-16 ***
## danceability     1.0575e+01  3.2444e-01  32.5932 < 2.2e-16 ***
## loudness         8.6072e-01  1.0386e-02  82.8732 < 2.2e-16 ***
## tempo            8.4309e-03  1.7622e-03   4.7842 1.721e-06 ***
## duration_ms     -7.9484e-06  5.8680e-07 -13.5453 < 2.2e-16 ***
## followers.total  5.3331e-07  1.0145e-08  52.5696 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above analysis, we know that outliers are fine but there is a heteroscadesticity in my model. By using a heteroscedasticity-consistent covariance matrix to test our coefficients, we have more reasonable estimates.

Mixed Model

▶ The Genre Variable

I want to explore on the genre variable since I think music in different genre would have different performance. We can see from the left visual below that the popularity distribution within each genre is diffferent, and the right one shows that the music characteristics of each genre are different as well (This radar chart is interactive. You can click on the genres to choose whether to show or hide it).

ggplot(sp)+
  geom_density(aes(x=popularity,fill=genre),alpha=0.4)+
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())+
  labs(title = 'Relationship b/w Genre & Popularity')

quantile <- sp1 %>% 
  dplyr::group_by(genre) %>% 
  dplyr::summarise(popularity = list(enframe(quantile(popularity, probs=0.5)))) %>% 
  unnest(cols=c(popularity))

quantile_set <- sp1 %>% left_join(quantile,by=c('genre')) %>% filter(popularity==value)

quantile_set1 <- quantile_set %>% 
  group_by(genre) %>% 
  summarise_all(mean) %>%  
  map(~.x) %>%
  discard(~all(is.na(.x))) %>%
  map_df(~.x) %>% 
  select(-value,-popularity)

quantile_norm <- cbind(quantile_set1[1], apply(quantile_set1[-1],2, function(x){(x-min(x)) / diff(range(x))})) 

radar <- gather(quantile_norm, key=Attribute, value=Score, -genre) %>%
  spread(key=genre, value=Score)

library(radarchart) 

chartJSRadar(scores = radar, scaleStartValue = -1, maxScale =1, polyAlpha=0.2,lineAlpha = 0.9, showToolTipLabel = TRUE)

▶ Mixed Model

We are not going to add genre as another predictor, but we are going to include it as another level to our model with the mixed model.

# mixed model with genre

library(lme4)
intMod <- lmer(popularity ~  1 + (1|genre), data = sp1)
summary(intMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: popularity ~ 1 + (1 | genre)
##    Data: sp1
## 
## REML criterion at convergence: 407444.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.1547 -0.5096 -0.0617  0.5585  4.9853 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  genre    (Intercept) 148.95   12.205  
##  Residual              89.56    9.463  
## Number of obs: 55557, groups:  genre, 7
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   52.061      4.613   11.29

ICC means the intraclass correlation and it ranges from 0 (no variance between clusters) to 1 (variance between clusters). It shows that 62% of variance in song popularity is accounted for by the genre alone.

ICC <- 148.95 / (148.95 + 89.56)
ICC

## [1] 0.6245021

By integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level. I put the final interpretation of the model at the end of this section.

The standard deviation is telling us how much the effect moves around based upon genre after getting the information from our fixed effects. For example, the effect move around nearly 11.663 from genre alone).

riMod <- lmer(popularity ~ danceability+loudness+tempo+duration_ms+followers.total + (1|genre), data = sp1)
summary(riMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: popularity ~ danceability + loudness + tempo + duration_ms +  
##     followers.total + (1 | genre)
##    Data: sp1
## 
## REML criterion at convergence: 407266.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.7652 -0.5284 -0.0467  0.5493  4.9582 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  genre    (Intercept) 136.03   11.663  
##  Residual              89.14    9.442  
## Number of obs: 55557, groups:  genre, 7
## 
## Fixed effects:
##                   Estimate Std. Error t value
## (Intercept)      5.243e+01  4.414e+00  11.880
## danceability    -1.161e-01  3.007e-01  -0.386
## loudness         6.246e-02  9.519e-03   6.562
## tempo           -2.303e-03  1.350e-03  -1.707
## duration_ms     -1.834e-06  3.958e-07  -4.633
## followers.total  1.131e-07  8.003e-09  14.135
## 
## Correlation of Fixed Effects:
##             (Intr) dncblt lodnss tempo  drtn_m
## danceabilty -0.042                            
## loudness     0.026 -0.129                     
## tempo       -0.006  0.090 -0.130              
## duration_ms -0.008  0.179 -0.055  0.042       
## follwrs.ttl -0.006  0.027 -0.023  0.004 -0.043
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling

The fixed effects account for nearly 0.30% of the variation within the effects while the random effects account for nearly 60.53% of the variation within the effects.

MuMIn::r.squaredGLMM(riMod)

##              R2m       R2c
## [1,] 0.003069619 0.6053332

We also plot simulated random effect ranges for each of the random effect groups. We can see that the effect range is quite large.

library(merTools)
plotREsim(REsim(riMod), labs = TRUE)

We can see that after our mixed model, our plot becomes more condensed. It means that the mixed model is better.

mixedPred <- predict(riMod)
slimPred <- predict(lm)
allPred <- cbind(actual = sp1$popularity, 
                 mixed = mixedPred, 
                 slim = slimPred)

par(mfrow=c(1,2)) 
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])

Now we have a better understanding of how different genre groups affect the dependent variable and the globalised effect on the popularity variable.

From the regression result, we know that if the artist wants to produce a popular song, he/she should consider to make it louder. It may be beneficial for a song to spread out. Also, a popular song should have lower tempo and durations. It may be resulted from that people tend to lose patience on long songs. Besides, it is important for artists to get more followers to make their songs more popular.

I also noticed that the coefficients are a little bit different from my original linear regression model. I think my mixed model would be more reliable since it is more generalized and it has included the genre variable for the random effect. Also, random slope model should also be considered to improve the model.

Part ♪♪

Based on my current song database, I want to collect song data from one certain person and build a model to predict whether she would like a certain song or not. I got a volunteer from my class and I scrapped all the songs she liked in her spotify account.

Web Scraping II

I collected audio features as well as the artist information for all the songs that my volunteer has liked. Then I stored all the information I extracted in one .csv file.

mine <- data.frame()
main1 <- function(n){
  for (offset1 in 0:n){
    mine_temp <- get_my_saved_tracks1(limit = 20,offset = offset1*20)
    mine <- rbind(mine,mine_temp)
    print(paste0('Call #',offset1+1,' for the tracks'))
  }
  return(mine)
  Sys.sleep(5)
}
mine_kat <- lapply(163,main1)
mine_kat1 <- mine_kat[[1]]

# artist information
mine_kat1['artist.id'] <- NA
mine_kat1['artist.name'] <- NA
for (i in 1:nrow(mine_kat1)){
  mine_kat1[i,31] <- mine_kat1$track.artists[[i]][1,][["id"]]
  mine_kat1[i,32] <- mine_kat1$track.artists[[i]][1,][["name"]]
}

mine_kat2 <- mine_kat1 %>% 
  select(added_at,track.duration_ms,track.id,track.name,track.popularity,track.album.id,track.album.name,track.album.release_date,track.album.total_tracks,artist.id,artist.name) %>% rename(track.artist.id=artist.id,track.artist.name=artist.name)

features_kat <- data.frame()
for (i in 0:floor(nrow(mine_kat2)/100)){
  if (i<floor(nrow(mine_kat2)/100)){
    features_temp1 <- get_track_audio_features(mine_kat2$track.id[(i*100+1):((i+1)*100)]) #less than 100
    features_kat <- rbind.fill(features_kat,features_temp1)
  }
  else{
    features_temp1 <- get_track_audio_features(mine_kat2$track.id[(i*100+1):(nrow(mine_kat2))]) #less than 100
    features_kat <- rbind.fill(features_kat,features_temp1)
  }
}

features_kat <- features_kat %>% select(-uri,-track_href,-analysis_url,-type,-id)
mine_kata <- cbind(mine_kat2,features_kat)

# get the artist information of songs
artists_kat <- data.frame()
for (i in 0:floor(nrow(mine_kata)/50)){
  if (i<floor(nrow(mine_kata)/50)){
    artists_temp1 <- get_artists(mine_kata$track.artist.id[(i*50+1):((i+1)*50)]) #less than 50
    artists_kat <- rbind.fill(artists_kat,artists_temp1)
  }
  else{
    artists_temp1 <- get_artists(mine_kata$track.artist.id[(i*50+1):(nrow(mine_kata))]) #less than 50
    artists_kat <- rbind.fill(artists_kat,artists_temp1)
  }
}

artists_kat <- artists_kat %>% select(popularity,followers.total) %>% rename(artists.popularity=popularity)
mine_katb <- cbind(mine_kata,artists_kat)

# output
write.csv(mine_katb, file = "kat.csv",row.names=FALSE)

Data Processing II

I combined the two datasets and labeled the song in my old database as not-like(0) and the song my volunteer has already liked as like(1).

# data processing -
kat <- read_csv("e:/Documents/kat.csv") 
spc <- sp %>% dplyr::select(-genre,-album.release_year) %>% mutate(like=0)


mine_katc <- kat %>% dplyr::select(`track.id`,`track.name`,`track.album.id`,
                                   `track.album.name`,`track.album.release_date`,
                                   `track.album.total_tracks`,`track.artist.id`,
                                   `track.artist.name`,`track.popularity`,
                                   danceability,energy,key,loudness,mode,speechiness,
                                   acousticness,instrumentalness,liveness,valence,
                                   tempo,duration_ms,time_signature,`artists.popularity`,
                                   `followers.total`) %>% mutate(like=1)
colnames(mine_katc) <- colnames(spc)
new <- rbind(spc,mine_katc)

new <- new %>% 
  mutate(album.release_year = substring(album.release_date,1,4),
         album.release_year = ifelse(album.release_year == '0000','2009',album.release_year),
         album.release_year = as.factor(album.release_year),
         key=as.factor(key),
         mode=as.factor(mode),
         time_signature=as.factor(time_signature),
         like=as.numeric(like)
  )
summary(new)

##       id                name             album.id          album.name       
##  Length:58826       Length:58826       Length:58826       Length:58826      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  album.release_date album.total_tracks  artist.id         artist.name       
##  Length:58826       Min.   :  1.00     Length:58826       Length:58826      
##  Class :character   1st Qu.:  9.00     Class :character   Class :character  
##  Mode  :character   Median : 12.00     Mode  :character   Mode  :character  
##                     Mean   : 16.63                                          
##                     3rd Qu.: 16.00                                          
##                     Max.   :531.00                                          
##                                                                             
##    popularity     danceability        energy               key       
##  Min.   : 0.00   Min.   :0.0598   Min.   :0.0000203   0      : 6815  
##  1st Qu.:41.00   1st Qu.:0.4420   1st Qu.:0.3290000   7      : 6600  
##  Median :52.00   Median :0.5780   Median :0.5720000   1      : 6144  
##  Mean   :50.55   Mean   :0.5627   Mean   :0.5285491   2      : 6107  
##  3rd Qu.:61.00   3rd Qu.:0.6970   3rd Qu.:0.7420000   9      : 5452  
##  Max.   :99.00   Max.   :0.9870   Max.   :0.9990000   5      : 5299  
##                                                       (Other):22409  
##     loudness       mode       speechiness       acousticness      
##  Min.   :-47.432   0:20021   Min.   :0.02230   Min.   :0.0000014  
##  1st Qu.:-12.515   1:38805   1st Qu.:0.03530   1st Qu.:0.0527000  
##  Median : -7.855             Median :0.04590   Median :0.2570000  
##  Mean   :-10.306             Mean   :0.08993   Mean   :0.3908617  
##  3rd Qu.: -5.619             3rd Qu.:0.08840   3rd Qu.:0.7530000  
##  Max.   :  1.593             Max.   :0.95300   Max.   :0.9960000  
##                                                                   
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0134   Min.   :0.0000   Min.   : 32.85  
##  1st Qu.:0.0000000   1st Qu.:0.0952   1st Qu.:0.2400   1st Qu.: 92.85  
##  Median :0.0000366   Median :0.1190   Median :0.4340   Median :115.44  
##  Mean   :0.1846149   Mean   :0.1745   Mean   :0.4471   Mean   :117.42  
##  3rd Qu.:0.1060000   3rd Qu.:0.2020   3rd Qu.:0.6420   3rd Qu.:139.01  
##  Max.   :0.9940000   Max.   :0.9900   Max.   :0.9850   Max.   :232.69  
##                                                                        
##   duration_ms      time_signature artists.popularity followers.total   
##  Min.   :  20467   0:    1        Min.   :  0.0      Min.   :      13  
##  1st Qu.: 185340   1:  562        1st Qu.: 60.0      1st Qu.:  122650  
##  Median : 219291   3: 5603        Median : 71.0      Median :  566451  
##  Mean   : 238260   4:51614        Mean   : 69.3      Mean   : 2504080  
##  3rd Qu.: 264274   5: 1046        3rd Qu.: 79.0      3rd Qu.: 2368724  
##  Max.   :3391040                  Max.   :100.0      Max.   :55469919  
##                                                                        
##       like         album.release_year
##  Min.   :0.00000   Length:58826      
##  1st Qu.:0.00000   Class :character  
##  Median :0.00000   Mode  :character  
##  Mean   :0.05557                     
##  3rd Qu.:0.00000                     
##  Max.   :1.00000                     
##

sp_new1 <- new %>%
  distinct() %>% 
  keep(is.numeric) %>% 
  mutate(like=as.factor(like))
sp_new2 <- data.frame(scale(sp_new1[1:14]))
sp_new2[15] <- sp_new1[15]
sp_new1 <- sp_new2

Split and Balance

I decided to include 70% of the data in the training dataset and the rest of them in the test dataset. Then I found that the training dataset is not well-balanced and it may cause the accruacy paradox. Therefore, I used the SMOTE() function to balance it.

set.seed(1)
library(caret)
dsub <- createDataPartition(sp_new1$like,p=0.7,list = F)
sp_train <- sp_new1[dsub,]
sp_test <- sp_new1[-dsub,]

round(prop.table(table(dplyr::select(sp_train, like), exclude = NULL)), 4) * 100

## 
##     0     1 
## 94.44  5.56

sp_train <- DMwR::SMOTE(like ~ ., data.frame(sp_train), perc.over = 100, perc.under = 200)

round(prop.table(table(dplyr::select(sp_train, like), exclude = NULL)), 4) * 100

## 
##  0  1 
## 50 50

Logistic Regression

I want to predict whether my volunteer would like a certain song or not, so I choose to build a logistic regression model.

First of all, I used the backward regression method to come up with the model with the lowest AIC. The best logistic regression model should have 14 independent variables.

logTest = glm(like ~ ., 
              data = na.omit(sp_train), 
              family = binomial(link="logit"))

step(logTest, direction = "backward",trace=FALSE)

## 
## Call:  glm(formula = like ~ album.total_tracks + popularity + danceability + 
##     energy + loudness + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms + artists.popularity + 
##     followers.total, family = binomial(link = "logit"), data = na.omit(sp_train))
## 
## Coefficients:
##        (Intercept)  album.total_tracks          popularity        danceability  
##           -0.79835            -2.22399            -0.52929             0.13995  
##             energy            loudness         speechiness        acousticness  
##           -0.10337             0.95500            -0.49872             0.07332  
##   instrumentalness            liveness             valence               tempo  
##           -0.13736            -0.11467            -0.74772            -0.03772  
##        duration_ms  artists.popularity     followers.total  
##           -0.15085             0.07057             0.43481  
## 
## Degrees of Freedom: 9155 Total (i.e. Null);  9141 Residual
## Null Deviance:       12690 
## Residual Deviance: 10130     AIC: 10160

Then, the logistic regression model was built up with the selected variables. We can explain the coefficients and describe the listening behaviour of our volunteer.

From the result of my logistic regression, we can that this person likes songs with high danceability and loudness, from which we can infer that she might like EDM (Electronic dance music) a lot. Also, we can see that she does not like songs with high speechiness, which might indicate she does not like rap songs. Lastly, we found that the popularity of a song would have a negative impact on the log-odds of liking the song while the popularity of an artist has a positive impact. This person is like a treasure hunter who want to dig deep and find songs that are not popular from an artist.

logreg=glm(like ~ 
             album.total_tracks + 
             popularity+
             danceability+
             loudness+
             speechiness+
             acousticness+
             instrumentalness+
             liveness+
             valence+
             tempo+
             duration_ms+
             artists.popularity+
             followers.total, 
           data = na.omit(sp_train),
           family = binomial(link="logit"))
summary(logreg)

## 
## Call:
## glm(formula = like ~ album.total_tracks + popularity + danceability + 
##     loudness + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms + artists.popularity + 
##     followers.total, family = binomial(link = "logit"), data = na.omit(sp_train))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7429  -0.9403   0.0364   0.9316   5.4732  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.79661    0.03427 -23.245  < 2e-16 ***
## album.total_tracks -2.23137    0.13565 -16.449  < 2e-16 ***
## popularity         -0.52793    0.02692 -19.612  < 2e-16 ***
## danceability        0.15046    0.03385   4.445 8.81e-06 ***
## loudness            0.88069    0.05569  15.814  < 2e-16 ***
## speechiness        -0.50355    0.03469 -14.515  < 2e-16 ***
## acousticness        0.11091    0.03751   2.957  0.00311 ** 
## instrumentalness   -0.14888    0.03505  -4.247 2.16e-05 ***
## liveness           -0.11904    0.02714  -4.387 1.15e-05 ***
## valence            -0.76649    0.03251 -23.576  < 2e-16 ***
## tempo              -0.03963    0.02652  -1.494  0.13510    
## duration_ms        -0.15302    0.03260  -4.694 2.68e-06 ***
## artists.popularity  0.07154    0.03091   2.314  0.02065 *  
## followers.total     0.43464    0.02971  14.628  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12693  on 9155  degrees of freedom
## Residual deviance: 10133  on 9142  degrees of freedom
## AIC: 10161
## 
## Number of Fisher Scoring iterations: 6

The ideal cut-off value is 0.5032. It means that if a song gets a value over 0.5032, then we will say that our volunteer would like this song. Otherwise, the probability that our volunteer would like this song is very low. It can be a good reference for the song recommendation system.

like_pred <- predict(logreg, sp_test, type = "response")

ideal_cutoff <- InformationValue::optimalCutoff(
  actuals = sp_test$like,
  predictedScores = like_pred,
  optimiseFor = "Both")
ideal_cutoff

## [1] 0.503283

As we calculated, the accuracy of our logistic model is 70%, which means that the probability that our prediction is right is over 70 percent.

like_pred_value <- ifelse(like_pred > ideal_cutoff,1,0)

table <- table(sp_test$like, like_pred_value)
sum(diag(table)) / nrow(sp_test)

## [1] 0.7063524

Shiny App

A ShinyApp has been created for my Spotify Analysis.

To use it, we should firstly get a link of a song. We can access it by clicking on the ‘Copy Song Link’ in the song menu (See Below).

The id consisting of numbers and letters after the last slash in the song link is the id of the song we choose.

For example, https://open.spotify.com/track/0bYg9bo50gSsH3LtXe2SQn is the song link of ‘All I Want for Christmas Is You’. Then its id is 0bYg9bo50gSsH3LtXe2SQn.

We input it in the ‘track id’ blank (See Below), and then click ‘Submit’.

The output is as below. First of all, we would know the prediction result that whether our volunteer Katherine would like this song or not. The next part is the audio features information of this song. The graph here shows the position of this song among all the songs that our volunteer has liked. The red point is the song we select and the black points represent all the liked song. You are able to look at different dimensions by selecting different x-axis and y-axis.

Last Thoughts

My analysis focuses on both the market and user. There are some parties that may be benefited from my analysis.

Agent Companies Based on my anlysis, agent companies would know what kind of songs and artists would be the most popular in the market.
Artists When artists are makeing new songs, they can learn from my analysis to choose the optimal parameter to get their songs more popular.
Users They can analyse all their liked song to better understand their own listening behaviour.

Since there is so many numerical variables in my datasets, a clustering model can be built to generate clusters for songs and artists and get interesting findings.

A work by

Yun (Jessica) Yan - 17 December 2019 with many thanks to Katherine Li❤

Spotify Analysis

by Yun Yan – January 17 2020