Survival and Longitudinal Data Analysis Project

Author

Ahmed OSMAN

Introduction

Frequent employment turnover can create a major loss in the company. We want to predict an employee’s risk of quitting the company, for example, within a year. To do this, we will compare survival analysis methods (Cox models, Survival Random Forests) to classification methods. To compare performance, we will spare 25% of the data as a test sample.

Packages

library(tidyverse)
library(survival)
library(survminer)
library(riskRegression)
library(ggfortify)
library(randomForestSRC)

Data Wrangling

Data Importation

Code
turnover <- read.csv("turnover2.csv", sep = ";", header = TRUE)
# head(turnover)
turnover
Name Type Description
duration numeric experience in months
event numeric censorship flag: 1 if quit, 0 otherwise
gender categorical gender
age numeric age in years
industry categorical employee’s industry
profession categorical employee’s profession
traffic categorical how employee came to the company
coach categorical presence of a coach on probation
head_gender categorical gender of the supervisor
greywage categorical whether the salary is fully registered with tax authorities
transport categorical employee’s means of transportation
extraversion numeric extraversion score
indepedent numeric independent score
selfcontrol numeric selfcontrol score
anxiety numeric anxiety score
novator numeric novator score

The code for the traffic variable is given as follows:

  • advert (direct contact of one’s own initiative)
  • recNErab (direct contact on the recommendation of a friend, not an employ of the company)
  • referal (direct contact on the recommendation of a friend, an employee of the company)
  • youjs (applied on a job site)
  • KA (recruiting agency brought)
  • rabrecNErab (employer contacted on the recommendation of a person who knows the employee)
  • empjs (employer reached on the job site)

Data Cleaning

The following variables types gender, industry, profession, traffic, coach, head_gender, greywage and transport are not coded correctly. They should be Categorical variables, so we have to make the necessary changes.

Code
turnover <- turnover %>%
    mutate(gender = as.factor(gender), industry = as.factor(industry), profession = as.factor(profession),
        traffic = as.factor(traffic), coach = as.factor(coach), head_gender = as.factor(head_gender),
        greywage = as.factor(greywage), Transportation = as.factor(Transportation)) %>%
    rename(transport = Transportation)

now we look for missing data or duplicate observations.

Code
cat(sum(is.na(turnover)), "missing data\n")
0 missing data
Code
cat(turnover %>%
    duplicated() %>%
    sum(), "duplicated rows")
13 duplicated rows

There are no missing values in this dataset, but we found 13 duplicate observations. To maintain the quality of the data, we will therefore delete them.

Code
turnover <- unique(turnover)

Data Exploration

We are interested in the variable duration according to the variable event.

Code
turnover %>% 
  ggplot(aes(x=duration, color=factor(event),
             fill=factor(event))) + 
  geom_histogram(alpha=0.5) +
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

Looking at the graph, we can see that almost half of the data is censored.

To be more precise, let’s calculate the percentage of censorship in the dataset.

Code
n <- dim(turnover)[1]
print((n - sum(turnover$event))/n * 100)
[1] 49.82079

Thus, more precisely the percentage of censorship in the dataset is equal to 49.82%.

If the turnover depended only on this duration variable, we could say that the number of employees who leave a company is equal to half of the total number of employees in a company. But this is not the case, since turnover also depends on the other categorical and quantitative variables. We will therefore focus on the other variables of the model.

Continuous variables

  • duration (we just did the study previously)
  • event (the dimension of the dataset is equal to (1116, 16) of which half of the data (558) is censored).
  • age
  • extraversion
  • independ
  • selfcontrol
  • anxiety
  • novator
Code
turnover %>% 
  ggplot(aes(x = age, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()
turnover %>% 
  ggplot(aes(x = extraversion, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
turnover %>% 
  ggplot(aes(x = independ, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()
turnover %>% 
  ggplot(aes(x = selfcontrol, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
turnover %>% 
  ggplot(aes(x = anxiety, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
turnover %>% 
  ggplot(aes(x = novator, color = factor(event), fill = factor(event))) + 
  geom_histogram(alpha = 0.5) + 
  labs(fill = "event", color = "event") + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

Discrete variables

  • gender
  • industry
  • profession
  • traffic
  • coach
  • head_gender
  • greywage
  • transport
Code
palette <- c("#32908F", "#553A41")

categorial_variables <- turnover %>% 
  select_if(is.factor) %>% 
  mutate(event = turnover$event)

categorial_variables %>% 
  ggplot(aes(x = gender,color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()
categorial_variables %>% 
  ggplot(aes(x = industry, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5)  +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(fill="event", color="event") +
  scale_fill_manual(values=palette) + 
  scale_colour_manual(values=palette)
categorial_variables %>% 
  ggplot(aes(x = profession, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5)  +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(fill="event", color="event") +
  scale_fill_manual(values=palette) + 
  scale_colour_manual(values=palette)
categorial_variables %>% 
  ggplot(aes(x = traffic, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()
categorial_variables %>% 
  ggplot(aes(x = coach, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()
categorial_variables %>% 
  ggplot(aes(x = head_gender, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()
categorial_variables %>% 
  ggplot(aes(x = greywage, color = factor(event), fill = factor(event))) + 
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()
categorial_variables %>% 
  ggplot(aes(x = transport, color = factor(event), fill = factor(event))) +
  geom_bar(alpha = 0.5) +
  labs(fill = "event", color = "event") +
  scale_fill_manual(values = palette) + 
  scale_colour_manual(values = palette) + 
  theme_minimal()

Let’s check if there are any correlations between the variables.

Code
turnover_corr <- categorial_variables[-9]

levels(turnover_corr$gender) <- 1:length(levels(turnover_corr$gender))
levels(turnover_corr$industry) <- 1:length(levels(turnover_corr$industry))
levels(turnover_corr$profession) <- 1:length(levels(turnover_corr$profession))
levels(turnover_corr$traffic) <- 1:length(levels(turnover_corr$traffic))
levels(turnover_corr$coach) <- 1:length(levels(turnover_corr$coach))
levels(turnover_corr$head_gender) <- 1:length(levels(turnover_corr$head_gender))
levels(turnover_corr$greywage) <- 1:length(levels(turnover_corr$greywage))
levels(turnover_corr$transport) <- 1:length(levels(turnover_corr$transport))

turnover_corr <- lapply(turnover, as.numeric)
turnover_corr <- cbind(turnover_corr, turnover %>%
    select(-colnames(turnover_corr)))

corrplot::corrplot(cor(turnover_corr[, 1:16]), method = "color", type = "lower",
    order = "hclust", addCoef.col = "black", tl.col = "black", tl.srt = 45, sig.level = 0.01,
    insig = "blank", diag = FALSE)

Correlation plot

Finally, we can look at survival functions.

Survival plots

Code
for (col in colnames(categorial_variables %>% select(-event))) {
  
  f <- as.formula(paste("Surv(duration, event) ~ ", col))
  
  print(autoplot(survfit(f, data = turnover)) + 
          theme_minimal() + 
          ggtitle(paste("Survival plot for the", col, "variable")))
}

We notice that the survival function of an employee who has as greywage = white is superior to that of an employee who has as greywage = grey. But as we said before, if the turnover depended only on this greywage variable, we could say that an employee with greywage = white has a greater probability of continuing in a company than an employee with greywage = grey. But this is not the case, since the turnover also depends on the other variables.

On the other hand we notice that it is difficult to interpret the survival functions of the industry variable as we can see from the graph, because it has several categories.

Survival Models

Train - Test data split

we will split our dataset into train and test, 75% and 25% respectively taking into consideration that we obtain the same percentage of censored and uncensored data, for training our models well in order to obtain good predictions.

Code
set.seed(2022)

test <- turnover
test$id <- 1:nrow(turnover)

turnover_event_0 <- test[turnover$event == 0, ]
turnover_event_1 <- test[turnover$event == 1, ]

test_index_event_0 <- sample(turnover_event_0$id, size = 0.25 * nrow(turnover_event_0))
test_index_event_1 <- sample(turnover_event_1$id, size = 0.25 * nrow(turnover_event_1))

test_index <- c(test_index_event_0, test_index_event_1)
test <- turnover[test_index, ]
train <- turnover[-test_index, ]

cat(sum(train$event)/dim(train)[1] * 100, "% of train data is censored\n")
50.17921 % of train data is censored
Code
cat(sum(test$event)/dim(test)[1] * 100, "% of test data is censored")
50.17921 % of test data is censored

Cox Model

Code
cox_train <- coxph(Surv(duration, event) ~ ., data = train, x = TRUE, y = TRUE)
cox_train
Call:
coxph(formula = Surv(duration, event) ~ ., data = train, x = TRUE, 
    y = TRUE)

                                   coef exp(coef)  se(coef)      z       p
genderm                       -0.243464  0.783907  0.148889 -1.635 0.10201
age                            0.008931  1.008971  0.008177  1.092 0.27477
industryAgriculture            0.615680  1.850915  0.609920  1.009 0.31276
industryBanks                  0.395962  1.485813  0.489001  0.810 0.41809
industryBuilding               0.469967  1.599941  0.516896  0.909 0.36324
industryConsult                0.204054  1.226365  0.504697  0.404 0.68598
industryetc                    0.066571  1.068837  0.496985  0.134 0.89344
industryIT                    -0.519765  0.594660  0.505878 -1.027 0.30421
industrymanufacture           -0.173233  0.840941  0.488930 -0.354 0.72311
industryMining                 0.313733  1.368524  0.573355  0.547 0.58425
industryPharma                -0.557050  0.572897  0.626831 -0.889 0.37418
industryPowerGeneration       -0.215471  0.806162  0.566024 -0.381 0.70344
industryRealEstate            -1.471277  0.229632  0.769054 -1.913 0.05574
industryRetail                -0.408494  0.664651  0.481503 -0.848 0.39623
industryState                 -0.283116  0.753433  0.548065 -0.517 0.60545
industryTelecom               -0.740431  0.476908  0.583587 -1.269 0.20453
industrytransport              0.096289  1.101077  0.545799  0.176 0.85997
professionBusinessDevelopment  0.666513  1.947435  0.604867  1.102 0.27050
professionCommercial           1.041342  2.833017  0.569083  1.830 0.06727
professionConsult              0.886547  2.426736  0.581191  1.525 0.12716
professionEngineer             0.908162  2.479760  0.614100  1.479 0.13918
professionetc                  0.757982  2.133966  0.533294  1.421 0.15522
professionFinanñe              0.481479  1.618466  0.648115  0.743 0.45755
professionHR                   0.210087  1.233786  0.478358  0.439 0.66053
professionIT                   0.225704  1.253205  0.551199  0.409 0.68219
professionLaw                  0.346763  1.414482  0.776939  0.446 0.65537
professionmanage               1.572460  4.818488  0.549829  2.860 0.00424
professionMarketing            0.605931  1.832958  0.543643  1.115 0.26503
professionPR                   1.061753  2.891436  0.722354  1.470 0.14160
professionSales                0.615852  1.851233  0.530015  1.162 0.24525
professionTeaching             0.686938  1.987620  0.667503  1.029 0.30343
trafficempjs                   0.892777  2.441902  0.351344  2.541 0.01105
trafficfriends                -0.009859  0.990189  0.384720 -0.026 0.97956
trafficKA                     -0.116284  0.890222  0.406714 -0.286 0.77495
trafficrabrecNErab             0.515694  1.674800  0.350365  1.472 0.14105
trafficrecNErab               -0.253653  0.775961  0.450749 -0.563 0.57361
trafficreferal                 0.065158  1.067328  0.375697  0.173 0.86231
trafficyoujs                   0.524092  1.688925  0.347587  1.508 0.13161
coachno                        0.072837  1.075556  0.131964  0.552 0.58098
coachyes                       0.190859  1.210289  0.178583  1.069 0.28519
head_genderm                   0.171275  1.186817  0.120412  1.422 0.15491
greywagewhite                 -0.458269  0.632377  0.158851 -2.885 0.00392
transportcar                  -0.137584  0.871461  0.120956 -1.137 0.25534
transportfoot                 -0.418908  0.657765  0.197924 -2.117 0.03430
extraversion                   0.006959  1.006983  0.042521  0.164 0.87000
independ                      -0.000399  0.999601  0.041229 -0.010 0.99228
selfcontrol                   -0.025556  0.974767  0.041261 -0.619 0.53566
anxiety                       -0.066980  0.935214  0.040597 -1.650 0.09897
novator                        0.026133  1.026477  0.036280  0.720 0.47134

Likelihood ratio test=142.1  on 49 df, p=5.244e-11
n= 837, number of events= 420 

Hypothesis of The Cox Model

H0 : The Cox Model is equal to the Null Model
H1 : The Cox Model is different from the Null Model

We see that the p-value is significantly lower than 5%, so the model is different from the Null model.

Let’s represent the Brier score as a function of time.

Brier Score are used to evaluate the accuracy of probabilistic predictions (probabilistic predictions made by machine learning algorithms). The best possible Brier Score will be 0, and the worst possible Brier Score would be a 1

Code
cox.test.brier <- Score(list("cox model" = cox_train),
                        formula = Surv(duration, event) ~ 1, data = test,
                        metrics = "brier", times = sort(unique(test$duration)))

cox.test.brier$Brier$score %>% 
  select(model, times, Brier) %>% 
  ggplot(aes(x = times, y = Brier, color = model)) +
  geom_line() + 
  xlab("Time") +
  theme_minimal() +
  scale_colour_manual(values=c("darkgrey", "orange")) +
  ggtitle("Brier Score for the Cox Model")

it is difficult to see directly from the graph if the Cox model has a lower Brier score compared to that of the Null model, this is why we will calculate the Integrated Brier Score.

Integrated Brier Score

In survival analysis, The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.

Code
IBrierScore <- function(brier_score, model_name) {
  
  # extracting data
  brier_score.data <- brier_score$Brier$score %>% 
    select("model", "times", "Brier") %>% 
    filter(model == model_name) %>% 
    select("times", "Brier")
  
  x <- brier_score.data$times
  y <- brier_score.data$Brier
  id <- order(x)
  
  # area under the curve
  area <- sum(diff(x[id]) * zoo::rollmean(y[id],2))
  
  return(area/x[length(x)])
}

coxIBS <- IBrierScore(cox.test.brier, "cox model")
coxIBS
[1] 0.1649923

Thus, the Cox model has an Integrated Brier Score of 16.499%.

Random Forest Model

Code
rand_forest.model <- rfsrc(Surv(duration, event) ~ ., data = train)
plot(vimp(rand_forest.model))


               Importance   Relative Imp
traffic            0.0629         1.0000
age                0.0558         0.8882
industry           0.0520         0.8269
profession         0.0443         0.7052
independ           0.0401         0.6382
greywage           0.0396         0.6293
selfcontrol        0.0340         0.5407
anxiety            0.0322         0.5119
transport          0.0290         0.4608
extraversion       0.0217         0.3446
novator            0.0167         0.2657
head_gender        0.0062         0.0985
gender             0.0030         0.0482
coach              0.0015         0.0242

The Structure of a Decision Tree of The Random Forest Model

Code
plot(get.tree(rand_forest.model, 5))
Code
rand_forest.brier <- Score(list("rf model" = rand_forest.model),
                        formula = Surv(duration, event) ~ 1, data = test,
                        metrics = "brier", times = sort(unique(test$duration)))


rand_forest.brier$Brier$score %>% 
  select(model, times, Brier) %>%
  full_join(cox.test.brier$Brier$score %>% select(model, times, Brier),
            by = c("model", "times", "Brier")) %>% 
  ggplot(aes(x = times, y = Brier, color = model)) +
  geom_line() + 
  xlab("Time") +
  theme_minimal() +
  scale_colour_manual(values=c("darkgrey", "darkgreen", "orange")) +
  ggtitle("Brier Score for the Different Models")

Code
rfIBS <- round(IBrierScore(rand_forest.brier, "rf model"), 4)

As we can see, from the graph we see that the Random Forest model has a lower Brier Score than that the Cox model.

Integrated Brier Score

To be more precisely, the Random Forest model has an Integrated Brier Score of 14.73%, which is lower than that of the Cox model (16.499) .

This implies that the Random Forest model is better than the Cox model in terms of predictions.

Predictions

We want to predict the probability of an employee will stay for longer than 3 years, whose features are Female of age 30, referred by an employee of the company (referral) in IT industry, profession HR, commuting by bus, having a coach during the probation, with male supervisor, whose characteristic scores are 5 for all categories.

Code
employee <- data.frame(event = 0,
                       duration = 0,
                       gender = "f",
                       age = 30,
                       industry = "IT", 
                       profession = "HR",
                       traffic = "referal",
                       coach = "yes",
                       head_gender = "m",
                       greywage = "white",
                       transport = "bus",
                       extraversion = 5.0,
                       independ = 5.0,
                       selfcontrol = 5.0, 
                       anxiety = 5.0,
                       novator = 5.0, stringsAsFactors=TRUE)

str(employee)
'data.frame':   1 obs. of  16 variables:
 $ event       : num 0
 $ duration    : num 0
 $ gender      : Factor w/ 1 level "f": 1
 $ age         : num 30
 $ industry    : Factor w/ 1 level "IT": 1
 $ profession  : Factor w/ 1 level "HR": 1
 $ traffic     : Factor w/ 1 level "referal": 1
 $ coach       : Factor w/ 1 level "yes": 1
 $ head_gender : Factor w/ 1 level "m": 1
 $ greywage    : Factor w/ 1 level "white": 1
 $ transport   : Factor w/ 1 level "bus": 1
 $ extraversion: num 5
 $ independ    : num 5
 $ selfcontrol : num 5
 $ anxiety     : num 5
 $ novator     : num 5

Let’s compute the probability that this employee will stay for longer than 3 years :

Code
p <- predict(rand_forest.model, employee)

index <- which(p$time.interest > 36)
prob <- p$survival[index]

So the probability that the employee will stay for longer than:

  • 3 years is 0.771
  • 3 years and 1 month is 0.768
  • 3 years and 2 months is 0.767
  • 3 years and 6 months is 0.736

now we want to see if there is an impact if the same employee worked in another company.

Code
industries <- levels(turnover$industry)
prob <- c()

for (industry in industries) {
  
  employee$industry <- industry
  p <- predict(rand_forest.model, employee)
  
  index <- which(p$time.interest >= 36)
  prob <- c(prob, p$survival[index][1])
  
}

estimation <- data.frame(industry = industries, prob = prob)


highligh_bar <- function(estimation, title) {
  
  estimation$color <- rep("gray", nrow(estimation))
  estimation[which.min(estimation$prob), ]$color <- "min"
  estimation[which.max(estimation$prob), ]$color <- "max"

  barplot <- estimation %>% 
    ggplot(aes(x = industry, y = prob, fill = color)) +
    geom_bar(stat = "identity") + 
    scale_fill_manual(values = c(max = "darkgreen",
                                 min = "darkred",
                                 gray = "gray")) +
    ylab("Probability") + 
    theme_minimal() + 
    coord_flip() +
    ggtitle(title) +
    theme(plot.title = element_text(hjust = 0.15))
  
  return (barplot)
}

highligh_bar(estimation, "The probability that the employee will stay for longer than 3 years")

We can see that if the employee works in a Retail sector, he will have the greatest probability of staying more than 3 years in the company than another employee with the same characteristics but in another sector.

We can also notice that if an employee works in a Consult sector, he will have the lowest probability of staying more than 3 years in the company than another employee with the same characteristics but in another sector.

Now we will consider that this employee has already worked for one year, we want to estimate the probability that this employee will stay for another 2 years

Code
employee$duration <- 12
prob <- c()

for (industry in industries) {
  
  employee$industry <- industry
  p <- predict(rand_forest.model, employee)
  
  index <- which(p$time.interest >= 24)
  prob <- c(prob, p$survival[index][1])

}

estimation2 <- data.frame(industry = industries, prob = prob)

highligh_bar(estimation2, "The probability that the employee will stay for another 2 years")

We can see that if the employee works in a Retail sector, he will have a greater probability of staying another 2 years in the company than another employee with the same characteristics but working in another sector.

We can also notice that if an employee works in an Agriculture industry, he will have the lowest probability of staying another 2 years in the company compared to another employee with the same characteristics but working in another industry.

We notice that the probability that the employee continues another two years in a company knowing that he has already worked one year in it is higher than the probability of an employee with the same characteristics who would work three years knowing that he has just started in the company.

For example, if we look at the graph for an employee who has been working in the IT sector for one year, we notice that the probability that this employee will stay in the company for another two years is equal to 0.834 which is higher than the probability of another employee with the same profile who will stay for three years but who has just started working in this company (0.771).

This means that employees are more likely to stay with the same company if they have already worked there for at least one year than other employees with the same profile who have just started working. But in these days, this is not really the case, employees tend not to stay for a long time in the same company (especially in sectors where there are a lot of offers like in IT). IT employees can start as junior employees in startups or small companies and when they gain experience, they move to bigger companies that offer higher salaries.

Contrary to our previous conclusion, our predictions are not consistent with our conclusion. This may be due to the fact that the data is biased, or we may look at it from another point of view, that the data is not up to date with the current generations.

Conclusion

Turnover calculations are an important metric in most businesses. They are useful to provide a method to track the movement of employee’s out of your business and identifying potential risks.
Turnover and tenure have a complex relationship. A business with a high turnover will logically have a lower average tenure, and at the same time tenure is an important factor in the decision to exit.