library(tidyverse)
library(survival)
library(survminer)
library(riskRegression)
library(ggfortify)
library(randomForestSRC)
Survival and Longitudinal Data Analysis Project
Introduction
Frequent employment turnover can create a major loss in the company. We want to predict an employee’s risk of quitting the company, for example, within a year. To do this, we will compare survival analysis methods (Cox models, Survival Random Forests) to classification methods. To compare performance, we will spare 25% of the data as a test sample.
Packages
Data Wrangling
Data Importation
Code
<- read.csv("turnover2.csv", sep = ";", header = TRUE)
turnover # head(turnover)
turnover
Name | Type | Description |
---|---|---|
duration | numeric | experience in months |
event | numeric | censorship flag: 1 if quit, 0 otherwise |
gender | categorical | gender |
age | numeric | age in years |
industry | categorical | employee’s industry |
profession | categorical | employee’s profession |
traffic | categorical | how employee came to the company |
coach | categorical | presence of a coach on probation |
head_gender | categorical | gender of the supervisor |
greywage | categorical | whether the salary is fully registered with tax authorities |
transport | categorical | employee’s means of transportation |
extraversion | numeric | extraversion score |
indepedent | numeric | independent score |
selfcontrol | numeric | selfcontrol score |
anxiety | numeric | anxiety score |
novator | numeric | novator score |
The code for the traffic variable is given as follows:
- advert (direct contact of one’s own initiative)
- recNErab (direct contact on the recommendation of a friend, not an employ of the company)
- referal (direct contact on the recommendation of a friend, an employee of the company)
- youjs (applied on a job site)
- KA (recruiting agency brought)
- rabrecNErab (employer contacted on the recommendation of a person who knows the employee)
- empjs (employer reached on the job site)
Data Cleaning
The following variables types gender, industry, profession, traffic, coach, head_gender, greywage
and transport
are not coded correctly. They should be Categorical variables, so we have to make the necessary changes.
Code
<- turnover %>%
turnover mutate(gender = as.factor(gender), industry = as.factor(industry), profession = as.factor(profession),
traffic = as.factor(traffic), coach = as.factor(coach), head_gender = as.factor(head_gender),
greywage = as.factor(greywage), Transportation = as.factor(Transportation)) %>%
rename(transport = Transportation)
now we look for missing data or duplicate observations.
Code
cat(sum(is.na(turnover)), "missing data\n")
0 missing data
Code
cat(turnover %>%
duplicated() %>%
sum(), "duplicated rows")
13 duplicated rows
There are no missing values in this dataset, but we found 13 duplicate observations. To maintain the quality of the data, we will therefore delete them.
Code
<- unique(turnover) turnover
Data Exploration
We are interested in the variable duration
according to the variable event
.
Code
%>%
turnover ggplot(aes(x=duration, color=factor(event),
fill=factor(event))) +
geom_histogram(alpha=0.5) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Looking at the graph, we can see that almost half of the data is censored.
To be more precise, let’s calculate the percentage of censorship in the dataset.
Code
<- dim(turnover)[1]
n print((n - sum(turnover$event))/n * 100)
[1] 49.82079
Thus, more precisely the percentage of censorship in the dataset is equal to 49.82%.
If the turnover depended only on this duration
variable, we could say that the number of employees who leave a company is equal to half of the total number of employees in a company. But this is not the case, since turnover also depends on the other categorical and quantitative variables. We will therefore focus on the other variables of the model.
Continuous variables
- duration (we just did the study previously)
- event (the dimension of the dataset is equal to (1116, 16) of which half of the data (558) is censored).
- age
- extraversion
- independ
- selfcontrol
- anxiety
- novator
Code
%>%
turnover ggplot(aes(x = age, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
%>%
turnover ggplot(aes(x = extraversion, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
%>%
turnover ggplot(aes(x = independ, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
%>%
turnover ggplot(aes(x = selfcontrol, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
%>%
turnover ggplot(aes(x = anxiety, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
%>%
turnover ggplot(aes(x = novator, color = factor(event), fill = factor(event))) +
geom_histogram(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Discrete variables
- gender
- industry
- profession
- traffic
- coach
- head_gender
- greywage
- transport
Code
<- c("#32908F", "#553A41")
palette
<- turnover %>%
categorial_variables select_if(is.factor) %>%
mutate(event = turnover$event)
%>%
categorial_variables ggplot(aes(x = gender,color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
%>%
categorial_variables ggplot(aes(x = industry, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(fill="event", color="event") +
scale_fill_manual(values=palette) +
scale_colour_manual(values=palette)
%>%
categorial_variables ggplot(aes(x = profession, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(fill="event", color="event") +
scale_fill_manual(values=palette) +
scale_colour_manual(values=palette)
%>%
categorial_variables ggplot(aes(x = traffic, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
%>%
categorial_variables ggplot(aes(x = coach, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
%>%
categorial_variables ggplot(aes(x = head_gender, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
%>%
categorial_variables ggplot(aes(x = greywage, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
%>%
categorial_variables ggplot(aes(x = transport, color = factor(event), fill = factor(event))) +
geom_bar(alpha = 0.5) +
labs(fill = "event", color = "event") +
scale_fill_manual(values = palette) +
scale_colour_manual(values = palette) +
theme_minimal()
Let’s check if there are any correlations between the variables.
Code
<- categorial_variables[-9]
turnover_corr
levels(turnover_corr$gender) <- 1:length(levels(turnover_corr$gender))
levels(turnover_corr$industry) <- 1:length(levels(turnover_corr$industry))
levels(turnover_corr$profession) <- 1:length(levels(turnover_corr$profession))
levels(turnover_corr$traffic) <- 1:length(levels(turnover_corr$traffic))
levels(turnover_corr$coach) <- 1:length(levels(turnover_corr$coach))
levels(turnover_corr$head_gender) <- 1:length(levels(turnover_corr$head_gender))
levels(turnover_corr$greywage) <- 1:length(levels(turnover_corr$greywage))
levels(turnover_corr$transport) <- 1:length(levels(turnover_corr$transport))
<- lapply(turnover, as.numeric)
turnover_corr <- cbind(turnover_corr, turnover %>%
turnover_corr select(-colnames(turnover_corr)))
::corrplot(cor(turnover_corr[, 1:16]), method = "color", type = "lower",
corrplotorder = "hclust", addCoef.col = "black", tl.col = "black", tl.srt = 45, sig.level = 0.01,
insig = "blank", diag = FALSE)
Finally, we can look at survival functions.
Survival plots
Code
for (col in colnames(categorial_variables %>% select(-event))) {
<- as.formula(paste("Surv(duration, event) ~ ", col))
f
print(autoplot(survfit(f, data = turnover)) +
theme_minimal() +
ggtitle(paste("Survival plot for the", col, "variable")))
}
We notice that the survival function of an employee who has as greywage = white
is superior to that of an employee who has as greywage = grey
. But as we said before, if the turnover depended only on this greywage
variable, we could say that an employee with greywage = white
has a greater probability of continuing in a company than an employee with greywage = grey
. But this is not the case, since the turnover also depends on the other variables.
On the other hand we notice that it is difficult to interpret the survival functions of the industry
variable as we can see from the graph, because it has several categories.
Survival Models
Train - Test data split
we will split our dataset into train and test, 75% and 25% respectively taking into consideration that we obtain the same percentage of censored and uncensored data, for training our models well in order to obtain good predictions.
Code
set.seed(2022)
<- turnover
test $id <- 1:nrow(turnover)
test
<- test[turnover$event == 0, ]
turnover_event_0 <- test[turnover$event == 1, ]
turnover_event_1
<- sample(turnover_event_0$id, size = 0.25 * nrow(turnover_event_0))
test_index_event_0 <- sample(turnover_event_1$id, size = 0.25 * nrow(turnover_event_1))
test_index_event_1
<- c(test_index_event_0, test_index_event_1)
test_index <- turnover[test_index, ]
test <- turnover[-test_index, ]
train
cat(sum(train$event)/dim(train)[1] * 100, "% of train data is censored\n")
50.17921 % of train data is censored
Code
cat(sum(test$event)/dim(test)[1] * 100, "% of test data is censored")
50.17921 % of test data is censored
Cox Model
Code
<- coxph(Surv(duration, event) ~ ., data = train, x = TRUE, y = TRUE)
cox_train cox_train
Call:
coxph(formula = Surv(duration, event) ~ ., data = train, x = TRUE,
y = TRUE)
coef exp(coef) se(coef) z p
genderm -0.243464 0.783907 0.148889 -1.635 0.10201
age 0.008931 1.008971 0.008177 1.092 0.27477
industryAgriculture 0.615680 1.850915 0.609920 1.009 0.31276
industryBanks 0.395962 1.485813 0.489001 0.810 0.41809
industryBuilding 0.469967 1.599941 0.516896 0.909 0.36324
industryConsult 0.204054 1.226365 0.504697 0.404 0.68598
industryetc 0.066571 1.068837 0.496985 0.134 0.89344
industryIT -0.519765 0.594660 0.505878 -1.027 0.30421
industrymanufacture -0.173233 0.840941 0.488930 -0.354 0.72311
industryMining 0.313733 1.368524 0.573355 0.547 0.58425
industryPharma -0.557050 0.572897 0.626831 -0.889 0.37418
industryPowerGeneration -0.215471 0.806162 0.566024 -0.381 0.70344
industryRealEstate -1.471277 0.229632 0.769054 -1.913 0.05574
industryRetail -0.408494 0.664651 0.481503 -0.848 0.39623
industryState -0.283116 0.753433 0.548065 -0.517 0.60545
industryTelecom -0.740431 0.476908 0.583587 -1.269 0.20453
industrytransport 0.096289 1.101077 0.545799 0.176 0.85997
professionBusinessDevelopment 0.666513 1.947435 0.604867 1.102 0.27050
professionCommercial 1.041342 2.833017 0.569083 1.830 0.06727
professionConsult 0.886547 2.426736 0.581191 1.525 0.12716
professionEngineer 0.908162 2.479760 0.614100 1.479 0.13918
professionetc 0.757982 2.133966 0.533294 1.421 0.15522
professionFinanñe 0.481479 1.618466 0.648115 0.743 0.45755
professionHR 0.210087 1.233786 0.478358 0.439 0.66053
professionIT 0.225704 1.253205 0.551199 0.409 0.68219
professionLaw 0.346763 1.414482 0.776939 0.446 0.65537
professionmanage 1.572460 4.818488 0.549829 2.860 0.00424
professionMarketing 0.605931 1.832958 0.543643 1.115 0.26503
professionPR 1.061753 2.891436 0.722354 1.470 0.14160
professionSales 0.615852 1.851233 0.530015 1.162 0.24525
professionTeaching 0.686938 1.987620 0.667503 1.029 0.30343
trafficempjs 0.892777 2.441902 0.351344 2.541 0.01105
trafficfriends -0.009859 0.990189 0.384720 -0.026 0.97956
trafficKA -0.116284 0.890222 0.406714 -0.286 0.77495
trafficrabrecNErab 0.515694 1.674800 0.350365 1.472 0.14105
trafficrecNErab -0.253653 0.775961 0.450749 -0.563 0.57361
trafficreferal 0.065158 1.067328 0.375697 0.173 0.86231
trafficyoujs 0.524092 1.688925 0.347587 1.508 0.13161
coachno 0.072837 1.075556 0.131964 0.552 0.58098
coachyes 0.190859 1.210289 0.178583 1.069 0.28519
head_genderm 0.171275 1.186817 0.120412 1.422 0.15491
greywagewhite -0.458269 0.632377 0.158851 -2.885 0.00392
transportcar -0.137584 0.871461 0.120956 -1.137 0.25534
transportfoot -0.418908 0.657765 0.197924 -2.117 0.03430
extraversion 0.006959 1.006983 0.042521 0.164 0.87000
independ -0.000399 0.999601 0.041229 -0.010 0.99228
selfcontrol -0.025556 0.974767 0.041261 -0.619 0.53566
anxiety -0.066980 0.935214 0.040597 -1.650 0.09897
novator 0.026133 1.026477 0.036280 0.720 0.47134
Likelihood ratio test=142.1 on 49 df, p=5.244e-11
n= 837, number of events= 420
Hypothesis of The Cox Model
H0 : The Cox Model is equal to the Null Model
H1 : The Cox Model is different from the Null Model
We see that the p-value is significantly lower than 5%, so the model is different from the Null model.
Let’s represent the Brier score as a function of time.
Brier Score are used to evaluate the accuracy of probabilistic predictions (probabilistic predictions made by machine learning algorithms). The best possible Brier Score will be 0, and the worst possible Brier Score would be a 1
Code
<- Score(list("cox model" = cox_train),
cox.test.brier formula = Surv(duration, event) ~ 1, data = test,
metrics = "brier", times = sort(unique(test$duration)))
$Brier$score %>%
cox.test.brierselect(model, times, Brier) %>%
ggplot(aes(x = times, y = Brier, color = model)) +
geom_line() +
xlab("Time") +
theme_minimal() +
scale_colour_manual(values=c("darkgrey", "orange")) +
ggtitle("Brier Score for the Cox Model")
it is difficult to see directly from the graph if the Cox model has a lower Brier score compared to that of the Null model, this is why we will calculate the Integrated Brier Score.
Integrated Brier Score
In survival analysis, The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.
Code
<- function(brier_score, model_name) {
IBrierScore
# extracting data
<- brier_score$Brier$score %>%
brier_score.data select("model", "times", "Brier") %>%
filter(model == model_name) %>%
select("times", "Brier")
<- brier_score.data$times
x <- brier_score.data$Brier
y <- order(x)
id
# area under the curve
<- sum(diff(x[id]) * zoo::rollmean(y[id],2))
area
return(area/x[length(x)])
}
<- IBrierScore(cox.test.brier, "cox model")
coxIBS coxIBS
[1] 0.1649923
Thus, the Cox model has an Integrated Brier Score of 16.499%.
Random Forest Model
Code
<- rfsrc(Surv(duration, event) ~ ., data = train)
rand_forest.model plot(vimp(rand_forest.model))
Importance Relative Imp
traffic 0.0629 1.0000
age 0.0558 0.8882
industry 0.0520 0.8269
profession 0.0443 0.7052
independ 0.0401 0.6382
greywage 0.0396 0.6293
selfcontrol 0.0340 0.5407
anxiety 0.0322 0.5119
transport 0.0290 0.4608
extraversion 0.0217 0.3446
novator 0.0167 0.2657
head_gender 0.0062 0.0985
gender 0.0030 0.0482
coach 0.0015 0.0242
The Structure of a Decision Tree of The Random Forest Model
Code
plot(get.tree(rand_forest.model, 5))
Code
<- Score(list("rf model" = rand_forest.model),
rand_forest.brier formula = Surv(duration, event) ~ 1, data = test,
metrics = "brier", times = sort(unique(test$duration)))
$Brier$score %>%
rand_forest.brierselect(model, times, Brier) %>%
full_join(cox.test.brier$Brier$score %>% select(model, times, Brier),
by = c("model", "times", "Brier")) %>%
ggplot(aes(x = times, y = Brier, color = model)) +
geom_line() +
xlab("Time") +
theme_minimal() +
scale_colour_manual(values=c("darkgrey", "darkgreen", "orange")) +
ggtitle("Brier Score for the Different Models")
Code
<- round(IBrierScore(rand_forest.brier, "rf model"), 4) rfIBS
As we can see, from the graph we see that the Random Forest model has a lower Brier Score than that the Cox model.
Integrated Brier Score
To be more precisely, the Random Forest model has an Integrated Brier Score of 14.73%, which is lower than that of the Cox model (16.499) .
This implies that the Random Forest model is better than the Cox model in terms of predictions.
Predictions
We want to predict the probability of an employee will stay for longer than 3 years, whose features are Female of age 30, referred by an employee of the company (referral) in IT industry, profession HR, commuting by bus, having a coach during the probation, with male supervisor, whose characteristic scores are 5 for all categories.
Code
<- data.frame(event = 0,
employee duration = 0,
gender = "f",
age = 30,
industry = "IT",
profession = "HR",
traffic = "referal",
coach = "yes",
head_gender = "m",
greywage = "white",
transport = "bus",
extraversion = 5.0,
independ = 5.0,
selfcontrol = 5.0,
anxiety = 5.0,
novator = 5.0, stringsAsFactors=TRUE)
str(employee)
'data.frame': 1 obs. of 16 variables:
$ event : num 0
$ duration : num 0
$ gender : Factor w/ 1 level "f": 1
$ age : num 30
$ industry : Factor w/ 1 level "IT": 1
$ profession : Factor w/ 1 level "HR": 1
$ traffic : Factor w/ 1 level "referal": 1
$ coach : Factor w/ 1 level "yes": 1
$ head_gender : Factor w/ 1 level "m": 1
$ greywage : Factor w/ 1 level "white": 1
$ transport : Factor w/ 1 level "bus": 1
$ extraversion: num 5
$ independ : num 5
$ selfcontrol : num 5
$ anxiety : num 5
$ novator : num 5
Let’s compute the probability that this employee will stay for longer than 3 years :
Code
<- predict(rand_forest.model, employee)
p
<- which(p$time.interest > 36)
index <- p$survival[index] prob
So the probability that the employee will stay for longer than:
- 3 years is 0.771
- 3 years and 1 month is 0.768
- 3 years and 2 months is 0.767
- 3 years and 6 months is 0.736
now we want to see if there is an impact if the same employee worked in another company.
Code
<- levels(turnover$industry)
industries <- c()
prob
for (industry in industries) {
$industry <- industry
employee<- predict(rand_forest.model, employee)
p
<- which(p$time.interest >= 36)
index <- c(prob, p$survival[index][1])
prob
}
<- data.frame(industry = industries, prob = prob)
estimation
<- function(estimation, title) {
highligh_bar
$color <- rep("gray", nrow(estimation))
estimationwhich.min(estimation$prob), ]$color <- "min"
estimation[which.max(estimation$prob), ]$color <- "max"
estimation[
<- estimation %>%
barplot ggplot(aes(x = industry, y = prob, fill = color)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c(max = "darkgreen",
min = "darkred",
gray = "gray")) +
ylab("Probability") +
theme_minimal() +
coord_flip() +
ggtitle(title) +
theme(plot.title = element_text(hjust = 0.15))
return (barplot)
}
highligh_bar(estimation, "The probability that the employee will stay for longer than 3 years")
We can see that if the employee works in a Retail sector, he will have the greatest probability of staying more than 3 years in the company than another employee with the same characteristics but in another sector.
We can also notice that if an employee works in a Consult sector, he will have the lowest probability of staying more than 3 years in the company than another employee with the same characteristics but in another sector.
Now we will consider that this employee has already worked for one year, we want to estimate the probability that this employee will stay for another 2 years
Code
$duration <- 12
employee<- c()
prob
for (industry in industries) {
$industry <- industry
employee<- predict(rand_forest.model, employee)
p
<- which(p$time.interest >= 24)
index <- c(prob, p$survival[index][1])
prob
}
<- data.frame(industry = industries, prob = prob)
estimation2
highligh_bar(estimation2, "The probability that the employee will stay for another 2 years")
We can see that if the employee works in a Retail sector, he will have a greater probability of staying another 2 years in the company than another employee with the same characteristics but working in another sector.
We can also notice that if an employee works in an Agriculture industry, he will have the lowest probability of staying another 2 years in the company compared to another employee with the same characteristics but working in another industry.
We notice that the probability that the employee continues another two years in a company knowing that he has already worked one year in it is higher than the probability of an employee with the same characteristics who would work three years knowing that he has just started in the company.
For example, if we look at the graph for an employee who has been working in the IT sector for one year, we notice that the probability that this employee will stay in the company for another two years is equal to 0.834 which is higher than the probability of another employee with the same profile who will stay for three years but who has just started working in this company (0.771).
This means that employees are more likely to stay with the same company if they have already worked there for at least one year than other employees with the same profile who have just started working. But in these days, this is not really the case, employees tend not to stay for a long time in the same company (especially in sectors where there are a lot of offers like in IT). IT employees can start as junior employees in startups or small companies and when they gain experience, they move to bigger companies that offer higher salaries.
Contrary to our previous conclusion, our predictions are not consistent with our conclusion. This may be due to the fact that the data is biased, or we may look at it from another point of view, that the data is not up to date with the current generations.
Conclusion
Turnover calculations are an important metric in most businesses. They are useful to provide a method to track the movement of employee’s out of your business and identifying potential risks.
Turnover and tenure have a complex relationship. A business with a high turnover will logically have a lower average tenure, and at the same time tenure is an important factor in the decision to exit.