Music contains so much information. We can not only compare all the hit songs to conclude the popular music trend, but also analyse the song behaviour of a particular person and get to know more about him/her through just a small music application.
Spotify is one of the top music streaming service around . With 35+ million songs and 170+ million monthly active users, musicians can spread their music and reach the audience easily. Also, it is an ideal platform for users to find their favorite music quickly and conveniently due to its large catalog, collaborative playlists, podcasts, and other attractive features. On the app, music can be browsed or searched via various parameters — such as artists, album, genre, playlist, or record label. Users can create, edit, and share playlists, share tracks on social media, and make playlists with other users.
Recently, I discovered that the Spotify API provides audio features for each song in its database, with which we are able to quantify a song. Therefore, I would like to conduct some music analysis based on it.
This project is divided into two parts. The goal in Part I is to analyze what song characteristics would affect its popularity. To achieve this, I built up a regression model and set the popularity as the dependent variable. The goal in Part II is to create a like song prediction system for a specific spotify user.
For the study, I will access the Spotify Web API, which provides data from the Spotify music catalog. This can be accessed via standard HTTPS requests to an API endpoint.
The Spotify API provides, among other things, track information for each song, including audio statistics such as danceability, instrumentalness, or tempo. Each feature measures an aspect of a song. Detailed information on how each feature is calculated can be found in the Spotify API Website.
First, I got my Client ID and the Client Secret credentials from the Spotify for Developers to connect to the Spotify API.
Sys.setenv(SPOTIFY_CLIENT_ID = 'client_id')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'client_secret')
access_token <- get_spotify_authorization_code()
Then, I created two new functions for personal usage based on some functions in the spotifyr
package.
search_spotify1(): It is like searching in the spotify application. The returned results should be album/artist/playlist/track search result for one specifc genre.
main(): I used the search_spotify1()
function I created above on 7 selected genres (country, jazz, classical, hip-hop, rap, pop and rock) repeatedly to get 10 thousand tracks for each genre and intend to bind them in one large list.
track_lists <- data.frame()
# function1
search_spotify1 <- function(genre = character(),
type = c('album', 'artist', 'playlist', 'track'),
market = NULL,
limit = 20,
offset = 0,
include_external = NULL,
authorization = get_spotify_access_token(),
include_meta_info = FALSE) {
base_url <- 'https://api.spotify.com/v1/search'
if (!is.character(genre)) {
stop('"genre" must be a string')
}
if (!is.null(market)) {
if (!str_detect(market, '^[[:alpha:]]{2}$')) {
stop('"market" must be an ISO 3166-1 alpha-2 country code')
}
}
if ((limit < 1 | limit > 50) | !is.numeric(limit)) {
stop('"limit" must be an integer between 1 and 50')
}
if ((offset < 0 | offset > 10000) | !is.numeric(offset)) {
stop('"offset" must be an integer between 1 and 10,000')
}
if (!is.null(include_external)) {
if (include_external != 'audio') {
stop('"include_external" must be "audio" or an empty string')
}
}
params <- list(
q = str_replace_all(str_glue('genre:"{genre}"'), ' ', '+'),
type = paste(type, collapse = ','),
market = market,
limit = limit,
offset = offset,
include_external = include_external,
access_token = authorization
)
res <- RETRY('GET', base_url, query = params, encode = 'json')
stop_for_status(res)
res <- fromJSON(content(res, as = 'text', encoding = 'UTF-8'), flatten = TRUE)
if (!include_meta_info && length(type) == 1) {
res <- res[[str_glue('{type}s')]]$items %>%
as_tibble %>%
mutate(genre = genre,market=market)
}
return(res)
}
# function2
track_lists <- data.frame()
main <- function(n){
for (genre1 in c('country','jazz', 'classical', 'hip-hop', 'rap', 'pop', 'rock')){
for (offset1 in 0:n){
track_list <- search_spotify1(genre=genre1,
type='track',
market='US',
limit = 50,
offset = offset1*50)
track_lists <- rbind.fill(track_lists,track_list)
print(paste0('Call #',offset1+1,' for the tracks in ',genre1))
}
}
return(track_lists)
Sys.sleep(5)
}
# get the full list of songs
track_lists1 <- lapply(199,main)
track_lists2 <- track_lists1[[1]]
With the help of the spotifyr
package, I was able to collect audio features information using the get_track_audio_features()
function and collect artist information using the get_artists()
function. I stored all the information I extracted in one .csv file.
# artist information
track_lists2['artist.id'] <- NA
track_lists2['artist.name'] <- NA
for (i in 1:nrow(track_lists2)){
track_lists2[i,31] <- track_lists2$artists[[i]][1,][["id"]]
track_lists2[i,32] <- track_lists2$artists[[i]][1,][["name"]]
}
# unique
track_lists3 <- track_lists2 %>%
group_by(id,name,album.id,album.name,album.release_date,album.total_tracks,artist.id,
artist.name,popularity) %>%
filter(genre == max(genre)) %>%
ungroup() %>%
select(id,name,album.id,album.name,album.release_date,album.total_tracks,artist.id,
artist.name,popularity,genre) %>%
distinct()
# get the feasures of songs (limit: 100)
features <- data.frame()
for (i in 0:floor(nrow(track_lists3)/100)){
if (i<floor(nrow(track_lists3)/100)){
features_temp <- get_track_audio_features(track_lists3$id[(i*100+1):((i+1)*100)])
features <- rbind.fill(features,features_temp)
}
else{
features_temp <- get_track_audio_features(track_lists3$id[(i*100+1):(nrow(track_lists3))])
features <- rbind.fill(features,features_temp)
}
}
features <- features %>% select(-uri,-track_href,-analysis_url,-type,-id)
track_listsa <- cbind(track_lists3,features)
# get the artist information of songs (limit: 50)
artists <- data.frame()
for (i in 0:floor(nrow(track_lists3)/50)){
if (i<floor(nrow(track_lists3)/50)){
artists_temp <- get_artists(track_lists3$artist.id[(i*50+1):((i+1)*50)])
artists <- rbind.fill(artists,artists_temp)
}
else{
artists_temp <- get_artists(track_lists3$artist.id[(i*50+1):(nrow(track_lists3))])
artists <- rbind.fill(artists,artists_temp)
}
}
artists <- artists %>%
select(popularity,followers.total) %>%
rename(`artists.popularity`=popularity)
track_listsb <- cbind(track_listsa,artists)
# output
write.csv(track_listsb, file = "spotify.csv",row.names=FALSE)
I have collected 55557 observations with 26 variables, among which there are 14 numerical variables and 12 categorical variables. Also, I outputed the summary information of all the variables.
We want to analyze the what would impact the popularity of a song. That is to say, we set the popularity as our dependent variable.
sp <- read_csv("e:/Documents/spotify.csv")
sp <- sp %>%
mutate(genre=as.factor(genre),
album.release_year = substring(album.release_date,1,4),
album.release_year = ifelse(album.release_year == '0000','2009',album.release_year),
album.release_year = as.factor(album.release_year),
key=as.factor(key),
mode=as.factor(mode),
time_signature=as.factor(time_signature)
)
dim(sp)
## [1] 55557 26
## id name album.id album.name
## Length:55557 Length:55557 Length:55557 Length:55557
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## album.release_date album.total_tracks artist.id artist.name
## Length:55557 Min. : 1.00 Length:55557 Length:55557
## Class :character 1st Qu.: 10.00 Class :character Class :character
## Mode :character Median : 12.00 Mode :character Mode :character
## Mean : 17.01
## 3rd Qu.: 16.00
## Max. :531.00
##
## popularity genre danceability energy
## Min. : 0.0 classical:9260 Min. :0.0598 Min. :0.0000203
## 1st Qu.:41.0 country :8136 1st Qu.:0.4410 1st Qu.:0.3220000
## Median :52.0 hip-hop :3631 Median :0.5780 Median :0.5710000
## Mean :50.8 jazz :9868 Mean :0.5628 Mean :0.5262837
## 3rd Qu.:61.0 pop :5088 3rd Qu.:0.6980 3rd Qu.:0.7420000
## Max. :99.0 rap :9627 Max. :0.9870 Max. :0.9990000
## rock :9947
## key loudness mode speechiness
## 0 : 6401 Min. :-47.432 0:18764 Min. :0.0223
## 7 : 6285 1st Qu.:-12.768 1:36793 1st Qu.:0.0354
## 1 : 5834 Median : -7.894 Median :0.0461
## 2 : 5813 Mean :-10.426 Mean :0.0913
## 9 : 5136 3rd Qu.: -5.629 3rd Qu.:0.0905
## 5 : 4982 Max. : 1.593 Max. :0.9530
## (Other):21106
## acousticness instrumentalness liveness valence
## Min. :0.0000014 Min. :0.0000000 Min. :0.0134 Min. :0.0000
## 1st Qu.:0.0534000 1st Qu.:0.0000000 1st Qu.:0.0951 1st Qu.:0.2450
## Median :0.2610000 Median :0.0000335 Median :0.1190 Median :0.4420
## Mean :0.3945907 Mean :0.1885753 Mean :0.1749 Mean :0.4522
## 3rd Qu.:0.7640000 3rd Qu.:0.1220000 3rd Qu.:0.2030 3rd Qu.:0.6480
## Max. :0.9960000 Max. :0.9940000 Max. :0.9900 Max. :0.9850
##
## tempo duration_ms time_signature artists.popularity
## Min. : 32.85 Min. : 20467 0: 1 Min. : 16.00
## 1st Qu.: 92.48 1st Qu.: 184328 1: 539 1st Qu.: 60.00
## Median :115.14 Median : 218733 3: 5367 Median : 70.00
## Mean :117.36 Mean : 238368 4:48665 Mean : 69.22
## 3rd Qu.:139.35 3rd Qu.: 264585 5: 985 3rd Qu.: 79.00
## Max. :232.69 Max. :3391040 Max. :100.00
##
## followers.total album.release_year
## Min. : 13 Length:55557
## 1st Qu.: 121037 Class :character
## Median : 558210 Mode :character
## Mean : 2368187
## 3rd Qu.: 2332368
## Max. :55418461
##
We firstly want to explore the relationship between popularity and several categorical variables to get a better sense of our data.
Null Hypothesis: There is no difference in the mean popularity of different music mode.
I used both graphical and statistical method to conduct hypothesis testing.
Surprisingly, a song with minor mode (a song contains more darkness, sadness, ugliness and feminitiy) would tend to be more popular.
Null Hypothesis: There is no difference in the mean popularity of different music key.
Based on our ANOVA test result, the null hypothesis should be rejected. Artists may get some insights from the TukeyHSD result when considering whether shifting from one certain key to another key for a new song. For example, if the artist is debating between key C and key D for his/her new song, our suggestion based on the TukeyHSD would be that the debating is pointless. When other things remain the same, change between key C and key D for a song makes almost no different because the p-value for the 2-0 comparison (the first row in the table) is 1.
ggplot(sp)+
geom_boxplot(aes(x=key, y= popularity,color=key,middle=mean(popularity)),size=1.6)+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())+
labs(title = 'Relationship b/w Key & Popularity')+
geom_hline(aes(yintercept=mean(popularity)),size=1.6)
The song with 4 time signature is dominating the market. If an artist wants to create a popular song, he/she might want to follow the main market trend and focus on the 4 time signature.
sp %>%
filter(time_signature!='0') %>%
ggplot()+
geom_boxplot(aes(x=time_signature, y= popularity,color=time_signature,
middle=mean(popularity)),size=1.2)+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())+
labs(title = 'Relationship b/w Time_Signature & Popularity')+
geom_hline(aes(yintercept=mean(popularity)),size=1.2)
Therefore, we can come up with a basic conclusion that, if an artist wants to make a popular song, he/she would better focus on a song with minor mode, C♯, D♭, E, F♯, G♭, G♯, A and B key,and 4 time signature.
There are too many varibables at present and feature selection is needed. Therefore, I made the correlation plot to detect the multicollinearity as well as used Lasso Test to add pentalty on variables in order to get the most valuable ones.
sp_n <- sp %>% keep(is.numeric) %>% select(-popularity)
library(corrplot)
corrplot.mixed(cor(sp_n), lower = "number", upper = "square", tl.pos='lt',order = "FPC",
tl.cex=0.7,tl.srt=45,number.cex=0.9,diag='l')
From the plot, we can find that the squares in the corner show the stronger relationship. It can help us to remove the acousticness, energy, instrumentalness and artists.popularity variables.
Then, I used the lasso test to add a penalty on the rest variables to get the most valuable ones.
# feature selection 1 (based on correlation)
sp1 <- sp %>% select(-acousticness,-energy,-instrumentalness,-artists.popularity)
# feature selection 2 (based on lasso)
library(penalized)
lassoTest <- penalized(response = popularity,
penalized = ~ danceability + loudness + speechiness +
liveness + valence + tempo + duration_ms +
followers.total + mode + time_signature + key,
data = sp1, lambda1 = 13000,
model = "linear", trace = FALSE)
coefficients(lassoTest)
## (Intercept) danceability loudness tempo duration_ms
## 6.166056e+01 6.977324e-01 9.890076e-01 3.882204e-03 -1.124181e-05
## followers.total
## 5.405840e-07
Finally, I got 5 variables: danceability, loudness, tempo, duration_ms and followers.total, and now I can start to build my model.
In order to increase the interpretation power of my model, I centered some of my variables so that the intercept would have a meaning even though the variable equals zero.
I built the simple linear regression model.
First of all, we can look at the R-squared value to learn how much the variability within the popularity of a song can be accounted for our predictor variables. The R-squared value is 0.33, which means the current independent variables we choose can explain approximately 33% of the variablitify of the song popularity.
Then, we can find that the p-value for F-test is quite small, which suggested that the model has passed the F-test.The overall significance of our model is significant.
Furthermore, we can see that all the coefficients here have pass the t-test and the p-values suggest that they are all very significant.
# linear regression - 5 variable
lm <- lm(popularity~danceability+loudness+tempo+duration_ms+followers.total,sp1)
summary(lm)
##
## Call:
## lm(formula = popularity ~ danceability + loudness + tempo + duration_ms +
## followers.total, data = sp1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.127 -7.449 0.424 7.927 45.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.256e+01 2.566e-01 204.811 < 2e-16 ***
## danceability 1.057e+01 3.362e-01 31.453 < 2e-16 ***
## loudness 8.607e-01 8.922e-03 96.477 < 2e-16 ***
## tempo 8.431e-03 1.726e-03 4.884 1.04e-06 ***
## duration_ms -7.948e-06 5.016e-07 -15.846 < 2e-16 ***
## followers.total 5.333e-07 9.656e-09 55.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.12 on 55551 degrees of freedom
## Multiple R-squared: 0.3304, Adjusted R-squared: 0.3304
## F-statistic: 5482 on 5 and 55551 DF, p-value: < 2.2e-16
We conduct an a post power analysis to determine the sample size we used is effective.Using \(f^2 = \frac{R^2_{adjusted}}{1 - R^2_{adjusted}}\) as our effect size, the power we got is 1. It suggests that it is safe for us to move forward.
##
## Multiple regression power calculation
##
## u = 1
## v = 55551
## f2 = 0.4934289
## sig.level = 0.05
## power = 1
Then, we want to confirm that whether there are outliers in our model. The plot below is a Residuals vs Leverage plot in which no evidence of outliers have shown up. Those “Cook’s distance” dashed curves don’t even appear on the plot. None of the points come close to having both high residual and leverage.
In order to detect the heteroscadesticity, I used both graphical and statistical methods. From the graph, we can identify that there is a declining trend. The null hypothesis of the bptest is that the residuals are constant. We can see here the p-value of both models are quite small, which means that we will reject the null hypothesis.
##
## studentized Breusch-Pagan test
##
## data: lm
## BP = 1666.6, df = 5, p-value < 2.2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2558e+01 2.5388e-01 207.0214 < 2.2e-16 ***
## danceability 1.0575e+01 3.2444e-01 32.5932 < 2.2e-16 ***
## loudness 8.6072e-01 1.0386e-02 82.8732 < 2.2e-16 ***
## tempo 8.4309e-03 1.7622e-03 4.7842 1.721e-06 ***
## duration_ms -7.9484e-06 5.8680e-07 -13.5453 < 2.2e-16 ***
## followers.total 5.3331e-07 1.0145e-08 52.5696 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above analysis, we know that outliers are fine but there is a heteroscadesticity in my model. By using a heteroscedasticity-consistent covariance matrix to test our coefficients, we have more reasonable estimates.
I want to explore on the genre variable since I think music in different genre would have different performance. We can see from the left visual below that the popularity distribution within each genre is diffferent, and the right one shows that the music characteristics of each genre are different as well (This radar chart is interactive. You can click on the genres to choose whether to show or hide it).
ggplot(sp)+
geom_density(aes(x=popularity,fill=genre),alpha=0.4)+
theme(plot.background = element_blank(),
panel.background = element_blank(),
legend.key = element_blank())+
labs(title = 'Relationship b/w Genre & Popularity')
quantile <- sp1 %>%
dplyr::group_by(genre) %>%
dplyr::summarise(popularity = list(enframe(quantile(popularity, probs=0.5)))) %>%
unnest(cols=c(popularity))
quantile_set <- sp1 %>% left_join(quantile,by=c('genre')) %>% filter(popularity==value)
quantile_set1 <- quantile_set %>%
group_by(genre) %>%
summarise_all(mean) %>%
map(~.x) %>%
discard(~all(is.na(.x))) %>%
map_df(~.x) %>%
select(-value,-popularity)
quantile_norm <- cbind(quantile_set1[1], apply(quantile_set1[-1],2, function(x){(x-min(x)) / diff(range(x))}))
radar <- gather(quantile_norm, key=Attribute, value=Score, -genre) %>%
spread(key=genre, value=Score)
library(radarchart)
chartJSRadar(scores = radar, scaleStartValue = -1, maxScale =1, polyAlpha=0.2,lineAlpha = 0.9, showToolTipLabel = TRUE)
We are not going to add genre as another predictor, but we are going to include it as another level to our model with the mixed model.
# mixed model with genre
library(lme4)
intMod <- lmer(popularity ~ 1 + (1|genre), data = sp1)
summary(intMod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: popularity ~ 1 + (1 | genre)
## Data: sp1
##
## REML criterion at convergence: 407444.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -7.1547 -0.5096 -0.0617 0.5585 4.9853
##
## Random effects:
## Groups Name Variance Std.Dev.
## genre (Intercept) 148.95 12.205
## Residual 89.56 9.463
## Number of obs: 55557, groups: genre, 7
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 52.061 4.613 11.29
ICC means the intraclass correlation and it ranges from 0 (no variance between clusters) to 1 (variance between clusters). It shows that 62% of variance in song popularity is accounted for by the genre alone.
## [1] 0.6245021
By integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level. I put the final interpretation of the model at the end of this section.
The standard deviation is telling us how much the effect moves around based upon genre after getting the information from our fixed effects. For example, the effect move around nearly 11.663 from genre alone).
riMod <- lmer(popularity ~ danceability+loudness+tempo+duration_ms+followers.total + (1|genre), data = sp1)
summary(riMod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: popularity ~ danceability + loudness + tempo + duration_ms +
## followers.total + (1 | genre)
## Data: sp1
##
## REML criterion at convergence: 407266.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -7.7652 -0.5284 -0.0467 0.5493 4.9582
##
## Random effects:
## Groups Name Variance Std.Dev.
## genre (Intercept) 136.03 11.663
## Residual 89.14 9.442
## Number of obs: 55557, groups: genre, 7
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.243e+01 4.414e+00 11.880
## danceability -1.161e-01 3.007e-01 -0.386
## loudness 6.246e-02 9.519e-03 6.562
## tempo -2.303e-03 1.350e-03 -1.707
## duration_ms -1.834e-06 3.958e-07 -4.633
## followers.total 1.131e-07 8.003e-09 14.135
##
## Correlation of Fixed Effects:
## (Intr) dncblt lodnss tempo drtn_m
## danceabilty -0.042
## loudness 0.026 -0.129
## tempo -0.006 0.090 -0.130
## duration_ms -0.008 0.179 -0.055 0.042
## follwrs.ttl -0.006 0.027 -0.023 0.004 -0.043
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
The fixed effects account for nearly 0.30% of the variation within the effects while the random effects account for nearly 60.53% of the variation within the effects.
## R2m R2c
## [1,] 0.003069619 0.6053332
We also plot simulated random effect ranges for each of the random effect groups. We can see that the effect range is quite large.
We can see that after our mixed model, our plot becomes more condensed. It means that the mixed model is better.
mixedPred <- predict(riMod)
slimPred <- predict(lm)
allPred <- cbind(actual = sp1$popularity,
mixed = mixedPred,
slim = slimPred)
par(mfrow=c(1,2))
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
Now we have a better understanding of how different genre groups affect the dependent variable and the globalised effect on the popularity variable.
From the regression result, we know that if the artist wants to produce a popular song, he/she should consider to make it louder. It may be beneficial for a song to spread out. Also, a popular song should have lower tempo and durations. It may be resulted from that people tend to lose patience on long songs. Besides, it is important for artists to get more followers to make their songs more popular.
I also noticed that the coefficients are a little bit different from my original linear regression model. I think my mixed model would be more reliable since it is more generalized and it has included the genre variable for the random effect. Also, random slope model should also be considered to improve the model.
Based on my current song database, I want to collect song data from one certain person and build a model to predict whether she would like a certain song or not. I got a volunteer from my class and I scrapped all the songs she liked in her spotify account.
I collected audio features as well as the artist information for all the songs that my volunteer has liked. Then I stored all the information I extracted in one .csv file.
mine <- data.frame()
main1 <- function(n){
for (offset1 in 0:n){
mine_temp <- get_my_saved_tracks1(limit = 20,offset = offset1*20)
mine <- rbind(mine,mine_temp)
print(paste0('Call #',offset1+1,' for the tracks'))
}
return(mine)
Sys.sleep(5)
}
mine_kat <- lapply(163,main1)
mine_kat1 <- mine_kat[[1]]
# artist information
mine_kat1['artist.id'] <- NA
mine_kat1['artist.name'] <- NA
for (i in 1:nrow(mine_kat1)){
mine_kat1[i,31] <- mine_kat1$track.artists[[i]][1,][["id"]]
mine_kat1[i,32] <- mine_kat1$track.artists[[i]][1,][["name"]]
}
mine_kat2 <- mine_kat1 %>%
select(added_at,track.duration_ms,track.id,track.name,track.popularity,track.album.id,track.album.name,track.album.release_date,track.album.total_tracks,artist.id,artist.name) %>% rename(track.artist.id=artist.id,track.artist.name=artist.name)
features_kat <- data.frame()
for (i in 0:floor(nrow(mine_kat2)/100)){
if (i<floor(nrow(mine_kat2)/100)){
features_temp1 <- get_track_audio_features(mine_kat2$track.id[(i*100+1):((i+1)*100)]) #less than 100
features_kat <- rbind.fill(features_kat,features_temp1)
}
else{
features_temp1 <- get_track_audio_features(mine_kat2$track.id[(i*100+1):(nrow(mine_kat2))]) #less than 100
features_kat <- rbind.fill(features_kat,features_temp1)
}
}
features_kat <- features_kat %>% select(-uri,-track_href,-analysis_url,-type,-id)
mine_kata <- cbind(mine_kat2,features_kat)
# get the artist information of songs
artists_kat <- data.frame()
for (i in 0:floor(nrow(mine_kata)/50)){
if (i<floor(nrow(mine_kata)/50)){
artists_temp1 <- get_artists(mine_kata$track.artist.id[(i*50+1):((i+1)*50)]) #less than 50
artists_kat <- rbind.fill(artists_kat,artists_temp1)
}
else{
artists_temp1 <- get_artists(mine_kata$track.artist.id[(i*50+1):(nrow(mine_kata))]) #less than 50
artists_kat <- rbind.fill(artists_kat,artists_temp1)
}
}
artists_kat <- artists_kat %>% select(popularity,followers.total) %>% rename(artists.popularity=popularity)
mine_katb <- cbind(mine_kata,artists_kat)
# output
write.csv(mine_katb, file = "kat.csv",row.names=FALSE)
I combined the two datasets and labeled the song in my old database as not-like(0) and the song my volunteer has already liked as like(1).
# data processing -
kat <- read_csv("e:/Documents/kat.csv")
spc <- sp %>% dplyr::select(-genre,-album.release_year) %>% mutate(like=0)
mine_katc <- kat %>% dplyr::select(`track.id`,`track.name`,`track.album.id`,
`track.album.name`,`track.album.release_date`,
`track.album.total_tracks`,`track.artist.id`,
`track.artist.name`,`track.popularity`,
danceability,energy,key,loudness,mode,speechiness,
acousticness,instrumentalness,liveness,valence,
tempo,duration_ms,time_signature,`artists.popularity`,
`followers.total`) %>% mutate(like=1)
colnames(mine_katc) <- colnames(spc)
new <- rbind(spc,mine_katc)
new <- new %>%
mutate(album.release_year = substring(album.release_date,1,4),
album.release_year = ifelse(album.release_year == '0000','2009',album.release_year),
album.release_year = as.factor(album.release_year),
key=as.factor(key),
mode=as.factor(mode),
time_signature=as.factor(time_signature),
like=as.numeric(like)
)
summary(new)
## id name album.id album.name
## Length:58826 Length:58826 Length:58826 Length:58826
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## album.release_date album.total_tracks artist.id artist.name
## Length:58826 Min. : 1.00 Length:58826 Length:58826
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median : 12.00 Mode :character Mode :character
## Mean : 16.63
## 3rd Qu.: 16.00
## Max. :531.00
##
## popularity danceability energy key
## Min. : 0.00 Min. :0.0598 Min. :0.0000203 0 : 6815
## 1st Qu.:41.00 1st Qu.:0.4420 1st Qu.:0.3290000 7 : 6600
## Median :52.00 Median :0.5780 Median :0.5720000 1 : 6144
## Mean :50.55 Mean :0.5627 Mean :0.5285491 2 : 6107
## 3rd Qu.:61.00 3rd Qu.:0.6970 3rd Qu.:0.7420000 9 : 5452
## Max. :99.00 Max. :0.9870 Max. :0.9990000 5 : 5299
## (Other):22409
## loudness mode speechiness acousticness
## Min. :-47.432 0:20021 Min. :0.02230 Min. :0.0000014
## 1st Qu.:-12.515 1:38805 1st Qu.:0.03530 1st Qu.:0.0527000
## Median : -7.855 Median :0.04590 Median :0.2570000
## Mean :-10.306 Mean :0.08993 Mean :0.3908617
## 3rd Qu.: -5.619 3rd Qu.:0.08840 3rd Qu.:0.7530000
## Max. : 1.593 Max. :0.95300 Max. :0.9960000
##
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0134 Min. :0.0000 Min. : 32.85
## 1st Qu.:0.0000000 1st Qu.:0.0952 1st Qu.:0.2400 1st Qu.: 92.85
## Median :0.0000366 Median :0.1190 Median :0.4340 Median :115.44
## Mean :0.1846149 Mean :0.1745 Mean :0.4471 Mean :117.42
## 3rd Qu.:0.1060000 3rd Qu.:0.2020 3rd Qu.:0.6420 3rd Qu.:139.01
## Max. :0.9940000 Max. :0.9900 Max. :0.9850 Max. :232.69
##
## duration_ms time_signature artists.popularity followers.total
## Min. : 20467 0: 1 Min. : 0.0 Min. : 13
## 1st Qu.: 185340 1: 562 1st Qu.: 60.0 1st Qu.: 122650
## Median : 219291 3: 5603 Median : 71.0 Median : 566451
## Mean : 238260 4:51614 Mean : 69.3 Mean : 2504080
## 3rd Qu.: 264274 5: 1046 3rd Qu.: 79.0 3rd Qu.: 2368724
## Max. :3391040 Max. :100.0 Max. :55469919
##
## like album.release_year
## Min. :0.00000 Length:58826
## 1st Qu.:0.00000 Class :character
## Median :0.00000 Mode :character
## Mean :0.05557
## 3rd Qu.:0.00000
## Max. :1.00000
##
I decided to include 70% of the data in the training dataset and the rest of them in the test dataset. Then I found that the training dataset is not well-balanced and it may cause the accruacy paradox. Therefore, I used the SMOTE()
function to balance it.
set.seed(1)
library(caret)
dsub <- createDataPartition(sp_new1$like,p=0.7,list = F)
sp_train <- sp_new1[dsub,]
sp_test <- sp_new1[-dsub,]
round(prop.table(table(dplyr::select(sp_train, like), exclude = NULL)), 4) * 100
##
## 0 1
## 94.44 5.56
sp_train <- DMwR::SMOTE(like ~ ., data.frame(sp_train), perc.over = 100, perc.under = 200)
round(prop.table(table(dplyr::select(sp_train, like), exclude = NULL)), 4) * 100
##
## 0 1
## 50 50
I want to predict whether my volunteer would like a certain song or not, so I choose to build a logistic regression model.
First of all, I used the backward regression method to come up with the model with the lowest AIC. The best logistic regression model should have 14 independent variables.
logTest = glm(like ~ .,
data = na.omit(sp_train),
family = binomial(link="logit"))
step(logTest, direction = "backward",trace=FALSE)
##
## Call: glm(formula = like ~ album.total_tracks + popularity + danceability +
## energy + loudness + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms + artists.popularity +
## followers.total, family = binomial(link = "logit"), data = na.omit(sp_train))
##
## Coefficients:
## (Intercept) album.total_tracks popularity danceability
## -0.79835 -2.22399 -0.52929 0.13995
## energy loudness speechiness acousticness
## -0.10337 0.95500 -0.49872 0.07332
## instrumentalness liveness valence tempo
## -0.13736 -0.11467 -0.74772 -0.03772
## duration_ms artists.popularity followers.total
## -0.15085 0.07057 0.43481
##
## Degrees of Freedom: 9155 Total (i.e. Null); 9141 Residual
## Null Deviance: 12690
## Residual Deviance: 10130 AIC: 10160
Then, the logistic regression model was built up with the selected variables. We can explain the coefficients and describe the listening behaviour of our volunteer.
From the result of my logistic regression, we can that this person likes songs with high danceability and loudness, from which we can infer that she might like EDM (Electronic dance music) a lot. Also, we can see that she does not like songs with high speechiness, which might indicate she does not like rap songs. Lastly, we found that the popularity of a song would have a negative impact on the log-odds of liking the song while the popularity of an artist has a positive impact. This person is like a treasure hunter who want to dig deep and find songs that are not popular from an artist.
logreg=glm(like ~
album.total_tracks +
popularity+
danceability+
loudness+
speechiness+
acousticness+
instrumentalness+
liveness+
valence+
tempo+
duration_ms+
artists.popularity+
followers.total,
data = na.omit(sp_train),
family = binomial(link="logit"))
summary(logreg)
##
## Call:
## glm(formula = like ~ album.total_tracks + popularity + danceability +
## loudness + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms + artists.popularity +
## followers.total, family = binomial(link = "logit"), data = na.omit(sp_train))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7429 -0.9403 0.0364 0.9316 5.4732
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79661 0.03427 -23.245 < 2e-16 ***
## album.total_tracks -2.23137 0.13565 -16.449 < 2e-16 ***
## popularity -0.52793 0.02692 -19.612 < 2e-16 ***
## danceability 0.15046 0.03385 4.445 8.81e-06 ***
## loudness 0.88069 0.05569 15.814 < 2e-16 ***
## speechiness -0.50355 0.03469 -14.515 < 2e-16 ***
## acousticness 0.11091 0.03751 2.957 0.00311 **
## instrumentalness -0.14888 0.03505 -4.247 2.16e-05 ***
## liveness -0.11904 0.02714 -4.387 1.15e-05 ***
## valence -0.76649 0.03251 -23.576 < 2e-16 ***
## tempo -0.03963 0.02652 -1.494 0.13510
## duration_ms -0.15302 0.03260 -4.694 2.68e-06 ***
## artists.popularity 0.07154 0.03091 2.314 0.02065 *
## followers.total 0.43464 0.02971 14.628 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12693 on 9155 degrees of freedom
## Residual deviance: 10133 on 9142 degrees of freedom
## AIC: 10161
##
## Number of Fisher Scoring iterations: 6
The ideal cut-off value is 0.5032. It means that if a song gets a value over 0.5032, then we will say that our volunteer would like this song. Otherwise, the probability that our volunteer would like this song is very low. It can be a good reference for the song recommendation system.
like_pred <- predict(logreg, sp_test, type = "response")
ideal_cutoff <- InformationValue::optimalCutoff(
actuals = sp_test$like,
predictedScores = like_pred,
optimiseFor = "Both")
ideal_cutoff
## [1] 0.503283
As we calculated, the accuracy of our logistic model is 70%, which means that the probability that our prediction is right is over 70 percent.
like_pred_value <- ifelse(like_pred > ideal_cutoff,1,0)
table <- table(sp_test$like, like_pred_value)
sum(diag(table)) / nrow(sp_test)
## [1] 0.7063524
A ShinyApp has been created for my Spotify Analysis.
To use it, we should firstly get a link of a song. We can access it by clicking on the ‘Copy Song Link’ in the song menu (See Below).
The id consisting of numbers and letters after the last slash in the song link is the id of the song we choose.
For example, https://open.spotify.com/track/0bYg9bo50gSsH3LtXe2SQn is the song link of ‘All I Want for Christmas Is You’. Then its id is 0bYg9bo50gSsH3LtXe2SQn.
We input it in the ‘track id’ blank (See Below), and then click ‘Submit’.
The output is as below. First of all, we would know the prediction result that whether our volunteer Katherine would like this song or not. The next part is the audio features information of this song. The graph here shows the position of this song among all the songs that our volunteer has liked. The red point is the song we select and the black points represent all the liked song. You are able to look at different dimensions by selecting different x-axis and y-axis.
My analysis focuses on both the market and user. There are some parties that may be benefited from my analysis.
Agent Companies Based on my anlysis, agent companies would know what kind of songs and artists would be the most popular in the market.
Artists When artists are makeing new songs, they can learn from my analysis to choose the optimal parameter to get their songs more popular.
Users They can analyse all their liked song to better understand their own listening behaviour.
Since there is so many numerical variables in my datasets, a clustering model can be built to generate clusters for songs and artists and get interesting findings.
A work by