An earnings call is a conference call among the management of a public company, analysts, investors, and the media to discuss the company's financial results for a given reporting period, such as a quarter or a fiscal year.
In this analysis, I dig into WWE earnings call transcripts and perform sentiment analysis with various lexicons. In addition, I retrieved WWE's stock prices through the Alpha Vantage API to explore the closing price over the 10 trading days around each call's date, and I analyze the relationship between closing prices and sentiment scores to look for investment behavior patterns driven by a call.
I use functions within R (e.g., unzip or unz) to unzip the data and load the parsed data.
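Since the setup chunk is not shown, here is a minimal sketch of it: it loads the packages whose functions are used throughout this analysis and extracts the archive (the file name wweCalls.zip is only a placeholder for the actual zip file).
# packages used throughout this analysis
library(tidyverse)     # dplyr, ggplot2, stringr, forcats, ...
library(magrittr)      # %T>% tee pipe
library(tm)            # stopwords(), removeWords()
library(stringdist)
library(sentimentr)
library(lexicon)
library(textdata)
library(tidytext)
library(treemap)
library(yarrr)         # pirateplot()
library(ggsci)         # scale_colour_startrek()
library(plotly)
library(alphavantager)
# extract the transcripts; "wweCalls.zip" is a placeholder for the actual archive name
unzip("wweCalls.zip", exdir = ".")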
rm(list=ls())
parsed <- list.files(pattern = ".*parse.*csv")
wweCalls <- parsed %>%
lapply(read.csv, stringsAsFactors=F) %>%
bind_rows
glimpse(wweCalls)
## Observations: 2,239
## Variables: 12
## $ name <chr> "operator", "wayne rappaport", "linda mcmahon", "mic...
## $ firstName <chr> "Operator", "Wayne", "Linda", "Michael", "Wayne", "O...
## $ firstLast <chr> "NA NA", "wayne rappaport", "linda mcmahon", "michae...
## $ organization <chr> NA, "World Wrestling EntertainmentInc.", "World Wres...
## $ title <chr> NA, "Director Planning & Analysis", "CEO", "CFO"...
## $ text <chr> " Good day. All sites are now online in listen-only ...
## $ gender <chr> NA, "male", "female", "male", "male", NA, "male", "f...
## $ likelyRace <chr> NA, "meanwhite", "meanwhite", "meanwhite", "meanwhit...
## $ likelyRaceProb <dbl> NA, 90.4780, 92.9470, 94.3630, 90.4780, NA, 94.3630,...
## $ ticker <chr> "WWE", "WWE", "WWE", "WWE", "WWE", "WWE", "WWE", "WW...
## $ date <chr> "1-Dec-05", "1-Dec-05", "1-Dec-05", "1-Dec-05", "1-D...
## $ quarter <chr> "Q2", "Q2", "Q2", "Q2", "Q2", "Q2", "Q2", "Q2", "Q2"...
First, I dealt with some malformed speaker names in the dataset. In addition, I converted quarter into a numeric variable and date into Date format for later use.
Next, I removed the operator's and the media's speeches because they mainly serve to moderate the call and do not contain much useful information.
Finally, I grouped all the different titles into three main categories: Analyst, Financial Lead, and Management.
# Quarter & Date
# helper that blanks out organization/title values corrupted by a stray transcript fragment ("...it used to be Taj...")
text_taj <- function(x) ifelse(grepl('it used to be Taj',as.character(x)),NA,as.character(x))
wweCalls1 <- wweCalls %>%
mutate(quarter=as.numeric(substr(quarter,2,2)),
date=as.Date(date,'%d-%b-%y'),
name=as.character(name),
# blank out placeholder or garbled speaker names
name = ifelse(name %in% c('unidentified participant','unidentified audience member','unknown speaker','unidentified company representative','india is the other up and coming for us. we have a wonderful tv deal with zee tv. it used to be -- i think it was -- its taj','as a reminder'),NA,name),
firstName=as.character(firstName),
firstName = ifelse(is.na(name),NA,firstName),
firstLast=as.character(firstLast),
firstLast = ifelse(is.na(name),NA,firstLast),
organization = ifelse(organization==',',NA,as.character(organization))) %>%
mutate_at(c('organization','title'),text_taj)
# Drop the Operator & Media
wweCalls1 <- wweCalls1 %>%
filter(firstName!='Operator')%>%
filter(title!='Media')
# Title
wweCalls2 <- wweCalls1 %>%
mutate(title1=ifelse(grepl('VP|Planning|IR|CFO|Financial', title),'Financial Lead',ifelse(grepl('Analyst',title),'Analyst','Management')))
This part is about cleansing the organization variable in the data.
I notice that there are different names for the same company, which likely results from manual data entry. I used a string distance matrix to make the names consistent for each company. However, I also found that some distinct companies have very similar names. For example, Carrier Partners and Terrier Partners are highly similar to each other, but they are actually two different companies. Therefore, I understand that my grouping result may not be completely accurate.
My solution is to conduct the grouping twice, the first time with a loose standard and the second time with a stricter one. This seems to reduce misclassification a bit, but it still does not perform perfectly without manual checking; I simply try to make the data as clean as possible. Since the grouping result is not satisfying enough, this variable is not used in the following analysis.
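To illustrate the problem, a quick check (just a sketch; the exact value depends on the preprocessing applied below) shows how close these two distinct firms score under the Jaro-Winkler distance used in the second pass:
# two different companies with nearly identical names
stringdist::stringdist("carrier partners", "terrier partners", method = "jw", p = 0.01)
# the distance is well below the 0.3 cut height used for the clustering below,
# so a purely distance-based grouping would happily merge them into one company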
# Organization
oz_df <- data.frame(organization= unique(na.omit(wweCalls2$organization)))
oz_df$organization %>%
tolower %>%
trimws %>%
str_replace_all('&','and') %>%
removeWords(words = c(stopwords("en"))) %>%
stringdist::stringdistmatrix(method = "jaccard", q = 3.5) %>%
as.dist %>%
`attr<-`("Labels", oz_df$organzation)%>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> oz_df$group
## [1] 1 2 3 4 5 6 7 1 8 4 9 10 3 11 12 13 10 14 2 15 16 17 18 19 20
## [26] 21 18 22 23 1 18 4 24 25 26 27 28 29 30 31 32 33 18 34 35 36 37 18 38 1
## [51] 39 40 41 42 43 20 30 44 45 46 47 1 48 49 50 7 51 20 52 53 54 55 56 57 1
## [76] 41 9 13 5
oz_df$organization <- as.character(oz_df$organization)
# replace each cluster id with the first organization name in that cluster
for (n in 1:max(as.integer(oz_df$group))){
oz_df[oz_df[,'group']==as.character(n),2] <- oz_df[oz_df[,'group']==as.character(n),][1,1]
}
## do second time to make it precise
oz_df$organization %>%
tolower %>%
trimws %>%
removeWords(words = c(stopwords("en"))) %>%
stringdist::stringdistmatrix(method = "jw",p=0.01) %>%
as.dist %>%
`attr<-`("Labels", oz_df$organzation)%>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> oz_df$group1
## [1] 1 2 3 4 5 6 7 1 8 4 9 10 3 11 12 13 10 14 2 4 15 4 16 17 15
## [26] 18 16 6 19 1 16 4 20 21 22 1 15 17 23 3 6 23 16 24 25 7 26 16 14 1
## [51] 8 13 3 6 27 15 23 23 2 28 23 1 12 29 30 7 31 15 5 11 32 31 9 13 1
## [76] 3 9 13 5
# again, replace each cluster id with the first organization name in that cluster
for (n in 1:max(as.integer(oz_df$group1))){
oz_df[oz_df[,'group1']==as.character(n),3] <- oz_df[oz_df[,'group1']==as.character(n),][1,1]
}
# distance between each original name and its assigned group name; where the
# match looks dubious, keep the original name instead of the group name
oz_df$dis <- stringdist(oz_df$organization, oz_df$group1, method = "jw")
oz_df$group2 <- ifelse((oz_df$dis > 0.2 & oz_df$dis < 0.221) | oz_df$dis > 0.235,
oz_df$organization, oz_df$group1)
oz_df <- oz_df %>% select(organization, group2)
wweCalls3 <- wweCalls2 %>%
left_join(oz_df,by=c('organization'='organization')) %>%
mutate(group2=str_replace_all(group2,'&','and'))
As for sentiment analysis, I first conducted a general analysis of all the sentences with the Loughran-McDonald lexicon. The visualization shows the sentiment trend lines for people with different titles in the company.
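For reference, hash_sentiment_loughran_mcdonald is a polarity table shipped with the lexicon package; a quick peek (just a sketch) shows the word/polarity pairs that sentimentr scores the sentences against:
# x = word, y = polarity score used by sentimentr
head(lexicon::hash_sentiment_loughran_mcdonald)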
wweCalls3$label <- seq.int(nrow(wweCalls3))
callSentiment = sentiment(get_sentences(wweCalls3),
polarity_dt = hash_sentiment_loughran_mcdonald) %>%
group_by(label) %>%
summarize(meanSentiment = mean(sentiment))
cleanCalls = left_join(wweCalls3, callSentiment, by = "label")
cleanCalls %>%
group_by(title1,date) %>%
na.omit() %>%
summarize(meanSentiment = mean(meanSentiment)) %>%
ggplot(., aes(date, meanSentiment,color=title1)) +
geom_smooth(se=F) +
theme_minimal()+
geom_hline(yintercept=0,size = 0.3,color="black")
We can see that the management board's speeches are consistently more positive than the others', since management may need to show confidence in the company and its products to attract investors.
The analysts' sentiment scores are relatively stable over time. This is because the analysts are mainly responsible for asking questions and have little need to show emotion.
Also, we notice that the sentiment in the financial leads' speeches was negative for several years and has increased in recent years. My guess is that this variability is related to stock market volatility. The financial leads may also have been told to stay positive to attract more investors.
I extracted the ten most frequent words for each title group to get a sense of the calls.
nrcWord <- textdata::lexicon_nrc()
nrcValues <- lexicon::hash_sentiment_nrc
nrcDominance <- textdata::lexicon_nrc_vad()
# process the text(stopword, token, combine)
wweCalls3$text <- wweCalls3$text %>%
str_replace_all("-", "") %>%
str_squish(.) %>%
textstem::lemmatize_strings(.) %>%
textclean::replace_contraction(.) %>%
tm::removeWords(words = stopwords("en"))
# top 10
wweCalls_top <- wweCalls3 %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words) %>%
group_by(title1) %>%
count(word) %>%
arrange(title1) %>%
na.omit() %>%
top_n(10)
treemap(wweCalls_top,
index=c("title1","word"),
vSize="n",
title="Key words for different title",
palette = "Set2",
type="index"
)
In the calls, analysts are responsible for raising questions about the company's performance and problems. Therefore, it is not surprising that question is the most frequent word in the analysts' speeches.
In addition, there are a lot of finance-related words in the financial leads' speeches, such as million, revenue and increase.
Lastly, the management board is in charge of the whole company, and what they care about most is surely its development. It is quite reasonable that they frequently mention words like growth and live.
Apart from the Loughran-McDonald lexicon above, this time I use the NRC VAD lexicon (loaded as nrcDominance) to examine the valence scores of the most frequent words in each title group.
wweCalls_top <- wweCalls_top %>%
left_join(nrcDominance,by=c('word'='Word'))
wweCalls_top_50 <- wweCalls3 %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words) %>%
group_by(title1) %>%
count(word) %>%
arrange(title1) %>%
na.omit() %>%
top_n(300)%>%
left_join(nrcDominance,by=c('word'='Word'))
pirateplot(formula = Valence ~ title1, #Formula
data = wweCalls_top_50, #Data frame
xlab = NULL, ylab = "Valence", #Axis labels
main = "Valence score for different title", #Plot title
pal = "google", #Color scheme
point.o = .2, #Points
avg.line.o = 1, #Turn on the Average/Mean line
theme = 0, #Theme
point.pch = 16, #Point `pch` type
point.cex = 1.5, #Point size
jitter.val = .1, #Turn on jitter to separate the individual points
cex.lab = .9, cex.names = .7) #Axis label size
From the visualization, we can see that the distributions for the financial lead group and the management group are very similar, while the analyst group contains some points with relatively lower valence scores. Analysts often pose challenging questions to the company, so it is inevitable that some questions may not sound very pleasant.
Another finding is that the average valence score of the management group is slightly higher than that of the other two groups. When management people speak, they may need to keep a tone that is as positive (e.g., happy, cheerful, euphoric) as possible to cheer other people up.
I chose trust from among the emotions in the NRC lexicon and calculated the average valence score of all the trust words contained in each speech.
# select the emotion: trust
trust <- nrcWord[nrcWord[,'sentiment']=='trust',]
# join
tokens = wweCalls3 %>%
unnest_tokens(output = word, input = text) %>%
inner_join(trust) %>%
inner_join(nrcDominance,by=c('word'='Word')) %>%
group_by(label) %>%
summarize(meanSentiment = mean(Valence))
wweCalls4 <- wweCalls3 %>%
left_join(tokens) %>%
mutate(meanSentiment = ifelse(is.na(meanSentiment),0,meanSentiment),
year=format(as.Date(date, format="%d/%m/%Y"),"%Y"))
# final output
wweCalls_title <- wweCalls4 %>%
group_by(title1) %>%
summarize(meanSentiment = mean(meanSentiment))
g <- wweCalls4 %>%
mutate(title2=fct_reorder(title1,-meanSentiment)) %>%
mutate(meanSentiment_all=mean(meanSentiment)) %>%
group_by(title1) %>%
mutate(meanSentiment_g = mean(meanSentiment)) %>%
ggplot(aes(text=paste("Name: ", name, "\n","Date: ", date, "\n", "Sentiment Score:",round(meanSentiment,2),"\n","Overall Avg:",round(meanSentiment_all,2),"\n","Title Avg:",round(meanSentiment_g,2)))) +
coord_flip() +
scale_colour_startrek()+
labs(x=NULL,y="Valence")+
theme(legend.position = "none",
axis.title= element_text(size=12),
axis.text.x= element_text(size =10),
panel.grid = element_blank(),
panel.background=element_rect(fill="white",color="grey50")) +
geom_jitter(aes(title2,meanSentiment,color= title2),size=2,alpha = 0.13,width = 0.2) +
geom_point(aes(title2,meanSentiment_g,color=title2),size=5)+
geom_hline(aes(yintercept=meanSentiment_all),color="gray70",size = 0.6) +
geom_segment(aes(x = title2,xend=title2,y=meanSentiment_all,yend = meanSentiment_g,color = title2),size=0.8)+
ggtitle("Valence for people with different title")
ggplotly(g,tooltip="text")
The visualization shows the valence score within each title group. Each small circle represents one speech. There are a lot of zero points, meaning those speeches contain no word linked to the trust emotion. Since it is an interactive graph, we can view the details of each point by hovering over it.
In the visual, we can see that management has a higher valence score when expressing the trust emotion. This may result from the fact that the management board ought to stay as positive and happy as possible to give confidence to investors. It may also imply that the company's operations are in relatively healthy condition. In addition, the analysts' words tend to have a lower valence score: the analysts may address questions and problems with negative words, and some of their analysis results may not be pleasant.
#percentage that contains trust word
wweCalls_title_per <- wweCalls4 %>%
mutate(count = ifelse(meanSentiment==0, 0, 1)) %>%
group_by(title1) %>%
summarize(perSentiment = sum(count)/n())
wweCalls_title_per %>%
ggplot( aes(title1, perSentiment, fill = title1)) +
geom_col() +
geom_hline(yintercept=0,size = 0.3,color="black")+
theme_minimal()+
theme(plot.title = element_text(size = 11))
As mentioned above, there are many zero points indicating that a speech contains no trust words. I decided to check the percentage of speeches containing at least one trust word. The conclusion aligns with the one above: the management group has the highest percentage while the analyst group has the lowest.
The next visualization shows the valence score of the trust words for all three title groups over time.
wweCalls_title_date <- wweCalls4 %>%
group_by(title1,date) %>%
summarize(meanSentiment = mean(meanSentiment))
wweCalls_title_date %>%
ggplot(aes(date, meanSentiment, color = title1)) +
geom_smooth(se=F)+
theme_minimal()
In general, we can see that from 2002 to 2019, the trend lines first declined and then increased. My guess is that this has something to do with the subprime mortgage crisis and the ensuing financial crisis.
Also, we can tell that there is a gap between each pair of lines. The management board always has the highest valence score, followed by the financial lead group and then the analyst group. This is consistent with our conclusion above that management people tend to use words with stronger emotion.
First of all, I used the alphavantager package to get the full list of WWE's daily stock prices.
library(alphavantager)
av_api_key("IGQ8SBW7F5DN58D8")
full_list <- av_get(symbol = "WWE", av_fun = "TIME_SERIES_DAILY", interval = "daily", outputsize = "full")
After getting the closing stock price for the 10 trading days around each call's date, I was able to build an interactive visualization with plotly showing the time-series trend lines under different sentiment scores.
# join
tokens1 = wweCalls3 %>%
unnest_tokens(output = word, input = text) %>%
inner_join(trust) %>%
inner_join(nrcDominance,by=c('word'='Word')) %>%
group_by(label) %>%
summarize(meanSentiment = mean(Valence))
wweCalls5 <- wweCalls3 %>%
inner_join(tokens1) %>%
mutate(meanSentiment = ifelse(is.na(meanSentiment),0,meanSentiment),
year=format(as.Date(date, format="%d/%m/%Y"),"%Y"))
wweCalls_date <- wweCalls5 %>%
group_by(date) %>%
summarize(meanSentiment = mean(meanSentiment)) %>%
right_join(full_list,by=c('date'='timestamp'))
# for each call date, grab the closing prices of the 5 trading days before and
# after the call (11 rows in total), tagged with the call's date and sentiment score
temp2 <- list()
i <- 1
for (n in 1:length(which(!is.na(wweCalls_date$meanSentiment)))){
temp1 <- wweCalls_date[(which(!is.na(wweCalls_date$meanSentiment))[n]-5):(which(!is.na(wweCalls_date$meanSentiment))[n]+5),c('date','close','meanSentiment')]
temp1$date1 <- temp1$date[6] # the call date itself
temp1$date <- c(1:11) # relabel the window as days 1-11 (6 = call day)
temp1$meanSentiment <- temp1$meanSentiment[6] # carry the call's sentiment across the window
temp2[[i]] <- temp1
i <- i+1
}
Combined <- do.call("rbind", temp2)
tx <- highlight_key(Combined, ~date1)
# initiate a plotly object
base <- plot_ly(tx, color = I("black"))
time_series <- base %>%
group_by(date1) %>%
add_lines(x = ~date, y = ~close) %>%
add_segments(x = 6, xend = 6, y = 8, yend = 20,color='red')%>%
add_trace(x = ~date, y = ~close, type = 'scatter', mode = 'markers', size=1)
hist <- add_lines(base, x=~date1,y=~meanSentiment) %>%
add_trace(x=~date1,y=~meanSentiment,type = 'scatter', mode = 'markers', size=1)
subplot( hist, time_series, widths = c(.5, .5)) %>%
layout(barmode = "overlay", showlegend = FALSE) %>%
highlight(
dynamic = TRUE,
selectize = TRUE,
selected = attrs_selected(opacity = 0.3))
The line graph on the left shows the average sentiment score for each call date, and the one on the right shows the closing price over the 10 trading days around each call date. I also added a red vertical line to the right-hand plot to mark the call date. When we click on a sentiment score on the left or a closing price trend line on the right, the corresponding call date is highlighted accordingly.
The insight from the visualization is that when the sentiment score is above roughly 0.78, the closing price seems to show a slight increase after the call (the calls in Aug 2003, Aug 2008 and Nov 2010).
However, I am not very confident about this positive relationship between sentiment and closing price, because the stock market is affected by many factors and is hard to predict overall.
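Still, as a rough sanity check (a sketch rather than a formal test; it reuses the Combined data frame built above, where date runs from 1 to 11 and day 6 is the call date), we can compare the average closing-price change over the five trading days after a call for high- and low-sentiment calls:
# average post-call price change, split by the 0.78 sentiment threshold noted above
Combined %>%
group_by(date1) %>%
summarize(sentiment = first(meanSentiment),
change = close[date == 11] - close[date == 6]) %>%
mutate(high_sentiment = sentiment > 0.78) %>%
group_by(high_sentiment) %>%
summarize(mean_change = mean(change), calls = n())
If the impression above holds, the mean change should be larger for the high-sentiment calls.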
There are two calls within the zip file that I did not use in the previous steps because they have not yet been parsed. I would like to parse them, incorporate them into the rest of the data, and determine whether any new information comes to light. It's basically a big data wrangling assignment.
raw27 <- read.csv(list.files(pattern = ".*raw.*27.*csv"),stringsAsFactors=F)
raw28 <- read.csv(list.files(pattern = ".*raw.*28.*csv"),stringsAsFactors=F)
# rows 5-7 of the raw call 27 transcript list the WWE participants ("Name – Title")
people1 <- do.call('rbind',str_split(raw27[5:7,],'–')) %>%
as.data.frame(stringsAsFactors=FALSE)
people1[3] <- people1[2]
people1[2] <- 'WWE'
colnames(people1) <- c('firstLast','organization','title')
# rows 9-14 list the analysts ("Name – Organization")
people2 <- do.call('rbind',str_split(raw27[9:14,],'–')) %>%
as.data.frame(stringsAsFactors=FALSE)
people2[3] <- 'Analyst'
colnames(people2) <- c('firstLast','organization','title')
# rows 12-13 of the raw call 28 transcript list additional analysts ("Name - Organization")
people3 <- do.call('rbind',str_split(raw28[12:13,],'-')) %>%
as.data.frame(stringsAsFactors=FALSE)
people3[3] <- 'Analyst'
colnames(people3) <- c('firstLast','organization','title')
people <- rbind(people1,people2,people3)
people$firstLast <- gsub("\\s+$", "", people$firstLast) # strip trailing whitespace
people[12,1] <- 'Operator' # add the operator as a recognised speaker
people[13,] <- people[9,] # alias row for Robert Routh (same organization and title)
people[13,1] <- 'Robert Routh'
people[,4] <- 1 # marker column: 1 flags a speaker-name line, used to find where each speech starts
f5 <- data.frame()
# loop over the two raw transcripts (call 27, then call 28) and parse each one
for (y in c(1,2)){
# the spoken text starts at row 15 of each raw file
if (y==1){
e <- raw27[15:172,] %>% as.data.frame(stringsAsFactors=FALSE) %>% setNames('firstLast')
} else{
e <- raw28[15:134,] %>% as.data.frame(stringsAsFactors=FALSE) %>% setNames('firstLast')
}
e$firstLast <- gsub("\\s+$", "", e$firstLast)
f <- left_join(e,people[c(1,4)])
g <- which(f[2]==1) # rows where a roster name appears, i.e. where a new speech begins
h <- data.frame()
i <- 1
# paste all the lines between one speaker name and the next into a single speech
for (n in 1:(length(g))){
if (n<length(g)){
h[i,1] <- paste(sapply(f[as.integer(g[n]+1):as.integer(g[n+1]-1),1], paste, collapse=":"), collapse=" ")
} else {
h[i,1] <- paste(sapply(f[as.integer(g[n]+1):nrow(f),1], paste, collapse=":"), collapse=" ")
}
i <- i+1
}
f1 <- f %>%
filter(V4 == 1) %>% select(1) # the speaker of each speech
f2 <- cbind(f1,h) # speaker plus the assembled speech text
f3 <- f2 %>%
left_join(people) # attach organization and title
# call header: row 2 of each raw file holds the quarter, row 3 the date
if (y==1){
f3['quarter'] <- substring(raw27[2,1],1,2)
f3['date'] <- str_extract(raw27[3,1], "\\w+\\s\\d{1,2}.*\\d{4}")
} else{
f3['quarter'] <- substring(raw28[2,1],1,2)
f3['date'] <- str_extract(raw28[3,1], "\\w+\\s\\d{1,2}.*\\d{4}")
}
f3['gender'] <- NA
f3['likelyRace'] <- NA
f3['likelyRaceProb'] <- NA
f3['ticker'] <- 'WWE'
f3['firstName'] <- str_extract(f3[,1], "^\\w+\\s")
f3['name'] <- f3['firstLast']
f4 <- f3 %>%
select(name,firstName,firstLast,organization,title,V1,gender,likelyRace,likelyRaceProb,ticker,date,quarter) %>%
rename('text'='V1') %>%
mutate(quarter=as.numeric(substr(quarter,2,2)),
date=as.Date(date,'%b %d, %Y'))
f5 <- rbind(f5,f4)
}
wweCallsN <- rbind(f5,wweCalls1)
# may want to drop the operators
wweCallsN <- wweCallsN %>%
filter(firstName!='Operator')%>%
filter(title!='Media')
# title
wweCallsN2 <- wweCallsN %>%
mutate(title1=ifelse(grepl('VP|Planning|IR|CFO|Financial', title),'Financial Lead',ifelse(grepl('Analyst',title),'Analyst','Management')))
# process the text(stopword, token, combine)
wweCallsN2$text <- wweCallsN2$text %>%
str_replace_all("-", "") %>%
str_squish(.) %>%
textstem::lemmatize_strings(.) %>%
textclean::replace_contraction(.) %>%
tm::removeWords(words = stopwords("en"))
wweCallsN2$label <- seq.int(nrow(wweCallsN2))
wweCallsN_top_50 <- wweCallsN2 %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words) %>%
group_by(title1) %>%
count(word) %>%
arrange(title1) %>%
na.omit() %>%
top_n(300)%>%
left_join(nrcDominance,by=c('word'='Word'))
pirateplot(formula = Valence ~ title1, #Formula
data = wweCallsN_top_50, #Data frame (now includes the two newly parsed calls)
xlab = NULL, ylab = "Valence", #Axis labels
main = "Valence score for different title", #Plot title
pal = "google", #Color scheme
point.o = .2, #Points
avg.line.o = 1, #Turn on the Average/Mean line
theme = 0, #Theme
point.pch = 16, #Point `pch` type
point.cex = 1.5, #Point size
jitter.val = .1, #Turn on jitter to separate the individual points
cex.lab = .9, cex.names = .7) #Axis label size
This graph of the valence score within each title group does not change noticeably after adding the new data.
# join
tokens = wweCallsN2 %>%
unnest_tokens(output = word, input = text) %>%
inner_join(trust) %>%
inner_join(nrcDominance,by=c('word'='Word')) %>%
group_by(label) %>%
summarize(meanSentiment = mean(Valence))
wweCallsN3 <- wweCallsN2 %>%
left_join(tokens) %>%
mutate(meanSentiment = ifelse(is.na(meanSentiment),0,meanSentiment),
year=format(as.Date(date, format="%d/%m/%Y"),"%Y"))
# final output
wweCallsN_title <- wweCallsN3 %>%
group_by(title1) %>%
summarize(meanSentiment = mean(meanSentiment))
set.seed(123)
g <- wweCallsN3 %>%
mutate(title2=fct_reorder(title1,-meanSentiment)) %>%
mutate(meanSentiment_all=mean(meanSentiment)) %>%
group_by(title1) %>%
mutate(meanSentiment_g = mean(meanSentiment)) %>%
ggplot(aes(text=paste("Name: ", name, "\n","Date: ", date, "\n", "Sentiment Score:",round(meanSentiment,2),"\n","Overall Avg:",round(meanSentiment_all,2),"\n","Title Avg:",round(meanSentiment_g,2)))) +
coord_flip() +
scale_colour_startrek()+
labs(x=NULL,y="Valence")+
theme(legend.position = "none",
axis.title= element_text(size=12),
axis.text.x= element_text(size =10),
panel.grid = element_blank(),
panel.background=element_rect(fill="white",color="grey50")) +
geom_jitter(aes(title2,meanSentiment,color= title2),size=2,alpha = 0.13,width = 0.2) +
geom_point(aes(title2,meanSentiment_g,color=title2),size=5)+
geom_hline(aes(yintercept=meanSentiment_all),color="gray70",size = 0.6) +
geom_segment(aes(x = title2,xend=title2,y=meanSentiment_all,yend = meanSentiment_g,color = title2),size=0.8)+
ggtitle("Valence for people with different title")
ggplotly(g,tooltip="text")
This graph does not change much, except that the average valence score of the financial lead group increased from 0.44 (below the overall average) to 0.46 (above it). This means the raw data we added contains many speeches from financial leads with higher trust emotion.
#percentage that contains trust word
wweCallsN_title_per <- wweCallsN3 %>%
mutate(count = ifelse(meanSentiment==0, 0, 1)) %>%
group_by(title1) %>%
summarize(perSentiment = sum(count)/n())
wweCallsN_title_per %>%
ggplot( aes(title1, perSentiment, fill = title1)) +
geom_col() +
geom_hline(yintercept = 0, size=0.3,color = "black") +
theme_minimal() +
theme(plot.title = element_text(size = 11))
After adding the new raw data to the original dataset, we can see that the proportion of speeches containing trust words has increased. However, the ranking among the three title groups has not changed.
wweCallsN_title_date <- wweCallsN3 %>%
group_by(title1,date) %>%
summarize(meanSentiment = mean(meanSentiment))
wweCallsN_title_date %>%
ggplot(aes(date, meanSentiment, color = title1)) +
geom_smooth(se=F)+
theme_minimal()
Wow, there is a big difference here! Instead of having a low valence score all the time, the analyst group shows a giant jump between 2010 and 2015. Since I do not know much about WWE, my assumption is that there was a big and profitable event or competition during that period, which led most analysts to hold a positive attitude towards the company.
However, we should also notice that there are missing call records between 2010 and 2016. We would need to collect more data for that period to make the smoothed lines more accurate.
A work by Yun (Jessica) Yan