This note describes R code to create and display a support vector machine (SVM) model for predicting recidivism, using the COMPAS data (compas-scores-two-years) published by ProPublica. Even though the model uses only two risk factors or predictors (age and priors count), its performance is relatively good. The model, however, performs differently across racial groups, despite race not being a risk factor or predictor in the model.

Dataset

We begin by loading the dataset:

dataset <- read.csv("data/compas-scores-two-years.csv") # you might need to change the path

Select appropriate columns

The goal is to construct a simple model that uses only two risk factors (age and priors count) to predict the outcome, and then to test the performance of this model across racial groups. So we restrict our focus to just four columns in the dataset, corresponding to the four variables of interest: age, priors count, race, and the outcome variable (is_recid), coded as 1 or 0.

dataset <- dataset[,c("age","priors_count","is_recid", "race")]
dataset

Straightening up the data

The variables “age” and “priors count” are stored as integers, as expected, since they take whole-number values. The variable “is_recid”, however, should not be treated as a number: it only takes the two values 1 and 0, which mark class membership rather than quantities. We should therefore tell R to consider “is_recid” a factor and modify our dataset accordingly.
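To see how R currently stores these variables, we can inspect the structure of the data frame:

str(dataset)

At this point age, priors_count and is_recid should all be listed as integer columns.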

# Setting is_recid explicitly as factor 
dataset$is_recid <- as.factor(dataset$is_recid)
dataset

Plotting

We are now ready to plot our data. The variables “age” and “priors count” are plotted on the x-axis and y-axis (in both orders, one per panel). The binary variable “is_recid” is the outcome variable and is plotted using two different colors.

library(ggplot2)
library(gridExtra) 
library(ggpubr)
p1 <-    ggplot(dataset, aes(age, priors_count, col=is_recid)) + 
          geom_point(size = 0.5) +
          xlab("Age") + 
          ylab("Prior counts") +
          labs(title = "Age, Prior counts and Recidivism", color = "Reoffender (1) or not (0)") +
          scale_color_discrete(breaks=c("1","0")) +
          theme_bw()
p2 <-     ggplot(dataset, aes(priors_count, age, col=is_recid)) + 
          geom_point(size = 0.5) +
          ylab("Age") + 
          xlab("Prior counts") +
          labs(title = "Age, Prior counts and Recidivism", color = "Reoffender (1) or not (0)") +
          scale_color_discrete(breaks=c("1","0")) +
          theme_bw()
ggarrange(p1, p2, nrow=1, ncol=2, common.legend = TRUE, legend="bottom") 

A pattern

The data cannot be cleanly separated by a linear model – like most real data, they are not “linearly separable”. No matter where the line is drawn, it is impossible to always classify an individual correctly as re-offender or not, so false positives and false negatives are inevitable.

The data do possess a pattern, though. Re-offenders tend to be younger and to have a higher priors count. So the trick will be to draw a line through the data that makes a good predictive model while inevitably committing a number of errors. This is what our SVM model will do.
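As a quick numerical check on this pattern, we can compare the mean age and mean priors count of re-offenders and non re-offenders (re-offenders should come out younger, with more priors):

aggregate(cbind(age, priors_count) ~ is_recid, data = dataset, FUN = mean)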

Training and test data

The next step is to divide the data into training data and test data. We perform a split of the data with a ratio of 75-to-25 for training-to-test data.

library(caTools) # install.packages('caTools') if the package is not already installed
set.seed(123)
split = sample.split(dataset$is_recid, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

There will be about 5,400 data points for training and 1,800 for testing.

nrow(training_set)
## [1] 5410
nrow(test_set)
## [1] 1804
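Because sample.split stratifies on the label vector it is given, the proportion of re-offenders should be roughly the same in both sets. We can check:

prop.table(table(training_set$is_recid))
prop.table(table(test_set$is_recid))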

Plotting training and test data

Just out of curiosity, we can plot the training and test data and satisfy ourselves that they are roughly the same but not identical.

tr <-  ggplot(training_set, aes(priors_count, age, col=is_recid)) + 
    geom_point(size = 0.5) +
    ylab("Age") + 
    xlab("Prior counts") +
    labs(title = "Training data", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw()
ts <-  ggplot(test_set, aes(priors_count, age, col=is_recid)) + 
    geom_point(size = 0.5) +
    ylab("Age") + 
    xlab("Prior counts") +
    labs(title = "Test data", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw()
ggarrange(tr, ts, nrow=1, ncol=2, common.legend = TRUE, legend="bottom") 

Support vector machine (SVM) model

We now train an SVM linear model on our training data. Note that only the variables “age” and “priors_count” are used to train the model (is_recid ~ age + priors_count). The variable “race” plays no role in the model.

library(e1071) # Support vector machine package -- install.packages("e1071")
svmfit = svm(is_recid ~ age + priors_count, data = training_set, type = 'C-classification', kernel = "linear", cost=1, scale = FALSE)

We can print and plot the basic features of our SVM predictive model. We see that it rests on about 4,200 support vectors. The plot shows where the linear model (essentially a line through the data points) is located. The plot is not very nice looking, but it gives the idea.

print(svmfit)
## 
## Call:
## svm(formula = is_recid ~ age + priors_count, data = training_set, 
##     type = "C-classification", kernel = "linear", cost = 1, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  4230
plot(svmfit, subset(training_set, select = -c(race)))

Can we make a nicer looking plot?

Since the standard plot of the SVM linear model provided by R is not very nice looking, we can try to make it look nicer for presentation purposes. This section uses ideas from An intuitive introduction to support vector machines using R.

Plotting support vectors

We want to identify the coordinates of the support vectors, namely those points in the training data that determine the position of the decision boundary (in the linear case, just a line across the data points).

# identify support vectors in training set
df_sv <- training_set[svmfit$index,] # index of support vectors in training dataset -- svmfit$index
# Support vectors
# svmfit$SV

We can then add the support vector points (shaded in blue) to the plot depicting the training data:

# Plot of training data as before
p <- ggplot(training_set, aes(age, priors_count, col=is_recid)) + 
      geom_point(size = 0.5) +
      xlab("Age") + 
      ylab("Prior counts") +
      labs(title = "Support vectors", color = "Reoffender (1) or not (0)") +
      scale_color_discrete(breaks=c("1","0")) +
      theme_bw()

# Adding support vectors points 
pdec  <- p + geom_point(data=df_sv,aes(x=age,y=priors_count),colour= "blue",size = 3, alpha=0.1)

# display plot
pdec

Plotting decision boundary (linear model)

Next, we plot the decision boundary (line) across the data. To do so we need to extract the parameters of the line from the SVM model, which is somewhat cumbersome. For a linear kernel the decision function has the form f(x) = w · x − rho, and the weight vector w is recovered by summing the support vectors weighted by their coefficients (svmfit$coefs):

#build weight vector
w <- t(svmfit$coefs) %*% svmfit$SV
w
##             age priors_count
## [1,] 0.04997593   -0.1999037

Given the weight vector, we can calculate the slope and intercept of the predicted decision boundary. The boundary is the set of points where w[1]*age + w[2]*priors_count − rho = 0; solving for priors_count gives a line with slope −w[1]/w[2] and intercept rho/w[2]:

#calculate slope
slope_1 <- -w[1]/w[2]
slope_1
## [1] 0.25
#calculate intercept
intercept_1 <- svmfit$rho/w[2]
intercept_1
## [1] -4.748555
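As a sanity check, any point lying on this line should sit exactly on the decision boundary, i.e. receive a decision value of (approximately) zero. The choice of age = 40 below is arbitrary:

# a point on the line: pick an age, compute the matching priors count
x0 <- c(age = 40, priors_count = slope_1 * 40 + intercept_1)
# decision value w . x0 - rho should be ~ 0
as.numeric(w %*% x0 - svmfit$rho)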

Next, plot the decision boundary using slope and intercept just calculated.

# decision boundary using calculated slope and intercept
p_all <- pdec + 
      geom_abline(slope=slope_1,intercept = intercept_1) +
      labs(title = "Support vectors and linear model", color = "Reoffender (1) or not (0)")  

p_line <- p + geom_abline(slope=slope_1,intercept = intercept_1) +
      labs(title = "Linear model (only)", color = "Reoffender (1) or not (0)")  

ggarrange(p_all, p_line, nrow=1, ncol=2, common.legend = TRUE, legend="bottom") 

The plot is the same as the one automatically generated by R (a good sign!), but with the axes switched. The plot that displays the decision boundary (line) without the support vectors clearly shows that the linear model captures the pattern: re-offenders tend to be concentrated among younger people and people with a higher priors count.

Testing the SVM model

Now that we have our SVM linear model, we check its performance against the test data.

Visual check

As a preliminary step, we can do a visual check by plotting the decision boundary (line) against the test data.

ggplot(test_set, aes(age, priors_count, col=is_recid)) + 
    geom_point(size = 0.5) +
    xlab("Age") +
    ylab("Priors count") + 
    labs(title = "Linear model against test data", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw() + 
    geom_abline(slope=slope_1, intercept = intercept_1)

All in all, the linear model seems to be placed in the right position, often correctly classifying individuals as re-offenders or not.

Predictions

Next, given the test data (test_set), we compute the predictions the model makes for the outcome variable (is_recid) and store them in y_pred.

y_pred <- predict(svmfit, newdata = test_set)
# y_pred
length(y_pred)
## [1] 1804
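A quick look at the distribution of the predictions shows that the model predicts “no re-offense” (0) considerably more often than “re-offense” (1); the two counts are the column sums of the confusion matrix below:

table(y_pred)
## y_pred
##    0    1 
## 1108  696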

Confusion matrix

We now create a confusion matrix that compares the predicted values of the outcome variable we just calculated (y_pred) with the actual values of the outcome variable in the test set (test_set[, 3], the is_recid column).

cm <- table(actual = test_set[, 3], y_pred)
cm
##       y_pred
## actual   0   1
##      0 738 198
##      1 370 498

Just to get a feel of how the confusion matrix works, here is a toy example with just five data points.

true <- c(1, 0, 1, 1, 0)
predicted <- c(0, 1, 0, 1, 0)
table(true, predicted)
##     predicted
## true 0 1
##    0 1 1
##    1 2 1

TP, TN, FP, FN

We can read the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) off the appropriate cells of the confusion matrix. The rows hold the actual values and the columns the predicted values, so a false positive (actual 0, predicted 1) sits in cell [1, 2] and a false negative (actual 1, predicted 0) in cell [2, 1].

TP <- cm[2, 2]
TN <- cm[1, 1]
FP <- cm[1, 2]
FN <- cm[2, 1]

TP
## [1] 498
TN
## [1] 738
FP
## [1] 198
FN
## [1] 370

Sensitivity and specificity

Sensitivity and specificity are then calculated in the standard way.

sen <- TP/(TP + FN)
spe <- TN/(TN + FP)
sen
## [1] 0.5737327
spe
## [1] 0.7884615

The model does not perform too badly, especially considering that it is based on just two predictors and is linear. We will see later whether we can improve its performance. The SVM linear model has a sensitivity of 57% (that is, it correctly classifies actual re-offenders 57% of the time) and a specificity of 79% (that is, it correctly classifies non re-offenders 79% of the time).

Classification errors

To reflect the terminology of the debate between ProPublica and Northpointe, we should think in terms of classification and prediction errors. The false negative classification (FNC) and false positive classification (FPC) rates (which are used in the ProPublica article about COMPAS) are the complements of sensitivity and specificity. Their values should therefore be about 43% and 21% respectively.

FNC <- FN/(TP+FN)
FNC
## [1] 0.4262673
FPC <- FP/(TN+FP)
FPC
## [1] 0.2115385

Predictive errors

In its response, Northpointe used predictive errors. Predictive errors are divided into false positive prediction (FPP) and false negative predictions (FNP). They are calculated as follows:

FPP <- FP/(TP+FP)
FPP
## [1] 0.2844828
FNP <- FN/(FN+TN)
FNP
## [1] 0.333935

The predictive errors are not too bad either. The false positive prediction error is 28%: 28% of those who are classified as (or predicted to be) re-offenders do not actually re-offend. The false negative prediction error is somewhat higher at 33%: 33% of those classified as (or predicted to be) non re-offenders actually re-offend.

Note the difference in the denominator compared to classification errors. The denominator here includes all positive predictions, whether correct or incorrect (TP + FP), and all negative predictions, whether correct or incorrect (FN + TN). In the case of classification errors, instead, the denominator includes all positive cases, whether predicted or not (TP + FN), and all negative cases, whether predicted or not (TN + FP).
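Since all these quantities are ratios of the same four cells, we can double-check the bookkeeping by verifying the complement relationships directly:

# classification errors complement sensitivity and specificity
all.equal(sen, 1 - FNC)
## [1] TRUE
all.equal(spe, 1 - FPC)
## [1] TRUE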

Confusion matrix (quick way)

Fortunately, there is a quicker way to get the confusion matrix without computing all the quantities manually. Here it is:

library(caret)
## Loading required package: lattice
# caret expects the predicted values first and the actual values second
confusionMatrix(y_pred, test_set[, 3], positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 738 370
##          1 198 498
##                                           
##                Accuracy : 0.6851          
##                  95% CI : (0.6631, 0.7065)
##     No Information Rate : 0.5188          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3648          
##                                           
##  Mcnemar's Test P-Value : 7.231e-13       
##                                           
##             Sensitivity : 0.5737          
##             Specificity : 0.7885          
##          Pos Pred Value : 0.7155          
##          Neg Pred Value : 0.6661          
##              Prevalence : 0.4812          
##          Detection Rate : 0.2761          
##    Detection Prevalence : 0.3858          
##       Balanced Accuracy : 0.6811          
##                                           
##        'Positive' Class : 1               
## 

The table above gives the “positive predictive value” and the “negative predictive value”, which are the complements of the “false positive prediction” and “false negative prediction” errors respectively. Sensitivity and specificity, as noted earlier, are the complements of the “false negative classification” and “false positive classification” errors. The values in the manual calculations match the ones calculated automatically.
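The individual statistics can also be extracted programmatically from the object that confusionMatrix returns (cm_caret is a new variable introduced here for illustration):

cm_caret <- confusionMatrix(y_pred, test_set[, 3], positive = "1")
cm_caret$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]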

Comparing the SVM linear model across races

We have seen that the SVM linear model does not perform too badly in terms of classification or prediction errors. But what happens if we evaluate its performance by race? Does the model perform better in making predictions about white versus black people? We can apply the same test that ProPublica used for COMPAS to our SVM linear model.

First, we need to divide our test sets and predictions by race.

# test set and predictions for black people
test_set_b <- subset(test_set, race == "African-American")
y_pred_b <- predict(svmfit, newdata = test_set_b)

# test set and predictions for white people
test_set_w <- subset(test_set, race == "Caucasian")
y_pred_w <- predict(svmfit, newdata = test_set_w)

# test set and predictions for hispanic people
test_set_h <- subset(test_set, race == "Hispanic")
y_pred_h <- predict(svmfit, newdata = test_set_h)

Next, we calculate the error rates for predictions and classifications using confusion matrices by race.

# confusion matrix black

cm_b <- table(actual_b = test_set_b[, 3], y_pred_b)

TP <- cm_b[2, 2]
TN <- cm_b[1, 1]
FP <- cm_b[1, 2]
FN <- cm_b[2, 1]

FNC_b <- FN/(TP+FN)
FPC_b <- FP/(TN+FP)
FPP_b <- FP/(TP+FP)
FNP_b <- FN/(FN+TN)

# confusion matrix white

cm_w <- table(actual_w = test_set_w[, 3], y_pred_w)

TP <- cm_w[2, 2]
TN <- cm_w[1, 1]
FP <- cm_w[1, 2]
FN <- cm_w[2, 1]

FNC_w <- FN/(TP+FN)
FPC_w <- FP/(TN+FP)
FPP_w <- FP/(TP+FP)
FNP_w <- FN/(FN+TN)

# confusion matrix hispanic

cm_h <- table(actual_h = test_set_h[, 3], y_pred_h)

TP <- cm_h[2, 2]
TN <- cm_h[1, 1]
FP <- cm_h[1, 2]
FN <- cm_h[2, 1]

FNC_h <- FN/(TP+FN)
FPC_h <- FP/(TN+FP)
FPP_h <- FP/(TP+FP)
FNP_h <- FN/(FN+TN)
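The three blocks above repeat the same computation three times. As an aside, a small helper function (a sketch; error_rates is a name introduced here, not part of the original code) would avoid the duplication:

# compute all four error rates from actual labels and predictions
error_rates <- function(actual, predicted) {
  cm <- table(actual, predicted)
  TP <- cm[2, 2]; TN <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]
  c(FNC = FN/(TP+FN), FPC = FP/(TN+FP), FPP = FP/(TP+FP), FNP = FN/(FN+TN))
}

# for example, error_rates(test_set_b[, 3], y_pred_b) reproduces FNC_b, FPC_b, FPP_b and FNP_b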

We plot the values of classification errors by race:

fnc <- c(FNC, FNC_b, FNC_w, FNC_h)
race <- c("all", "black", "white", "hispanic")
fnc_race <- data.frame(fnc, race)

fpc <- c(FPC, FPC_b, FPC_w, FPC_h)
race <- c("all", "black", "white", "hispanic")
fpc_race <- data.frame(fpc, race)

fc <- c(fnc, fpc) 
race2 <- rep(c("all", "black", "white", "hispanic"), 2)
error <- c(rep("FNC", 4), rep("FPC", 4))
fc_race <- data.frame(fc, race2, error)

ggplot(fc_race, aes(x=race2, y=fc, fill=error)) + geom_bar(stat="identity", position="dodge", alpha=.6, width=.4) +
    coord_flip() + theme_bw() + labs(title = "False classifications rates by race") + xlab("race") + ylab("FC rate")

We plot the values of prediction errors by race:

fnp <- c(FNP, FNP_b, FNP_w, FNP_h)
race <- c("all", "black", "white", "hispanic")
fnp_race <- data.frame(fnp, race)

fpp <- c(FPP, FPP_b, FPP_w, FPP_h)
race <- c("all", "black", "white", "hispanic")
fpp_race <- data.frame(fpp, race)

fp <- c(fnp, fpp) 
race2 <- rep(c("all", "black", "white", "hispanic"), 2)
error <- c(rep("FNP", 4), rep("FPP", 4))
fp_race <- data.frame(fp, race2, error)

ggplot(fp_race, aes(x=race2, y=fp, fill=error)) + geom_bar(stat="identity", position="dodge", alpha=.6, width=.4) +
    coord_flip() + theme_bw() + labs(title = "False prediction rates by race") + xlab("race") + ylab("FP rate")

Interestingly, the classification errors differ markedly across races, while the prediction errors are more nearly similar. This is the same asymmetry that surfaced in the debate over the COMPAS algorithm: ProPublica pointed to the disparity in classification errors, while Northpointe pointed to the rough parity in prediction errors.

Race-relative models

We can divide the data by race to see if there are race-specific patterns. Let’s start by dividing the training data by race and plotting the results.

# training set for black people
training_set_b <- subset(training_set, race == "African-American")
test_set_b <- subset(test_set, race == "African-American")

# training set for white people
training_set_w <- subset(training_set, race == "Caucasian")
test_set_w <- subset(test_set, race == "Caucasian")

# plot training set for black people
b <-  ggplot(training_set_b, aes(age, priors_count, col=is_recid)) + 
    geom_point(size = 0.5) +
    xlab("Age") + 
    ylab("Prior counts") +
    labs(title = "Training data - black people", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw() +
    xlim(15, 90)

# plot training set for white people
w <-  ggplot(training_set_w, aes(age, priors_count, col=is_recid)) + 
    geom_point(size = 0.5) +
    xlab("Age") + 
    ylab("Prior counts") +
    labs(title = "Training data - white people", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw() +
    xlim(15, 90)

# printing the two plots side by side
ggarrange(b, w, nrow=1, ncol=2, common.legend = TRUE, legend="bottom") 

It looks like the same trend as in the overall data is present: younger people with a higher priors count tend to be re-offenders more often than others. But the strength of the association seems to differ across racial groups, so the location of the decision boundary (the slope and intercept of the line) is likely to differ for the two groups as well.
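To put a rough number on this impression, we can compare group means within each training set (in both groups re-offenders should come out younger with more priors, but the gaps need not be the same size):

aggregate(cbind(age, priors_count) ~ is_recid, data = training_set_b, FUN = mean)
aggregate(cbind(age, priors_count) ~ is_recid, data = training_set_w, FUN = mean)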

SVM models by race

If we construct two SVM linear models – same procedure as used above, one model per racial group – the slope and intercept of the line are indeed quite different for white and black people.

# SVM models by race
svmfit_b <- svm(is_recid ~ age + priors_count, data = training_set_b, type = 'C-classification', kernel = "linear", cost=1, scale = FALSE)

svmfit_w <- svm(is_recid ~ age + priors_count, data = training_set_w, type = 'C-classification', kernel = "linear", cost=1, scale = FALSE)

# Predictions by race
y_pred_b <- predict(svmfit_b, newdata = test_set_b)
y_pred_w <- predict(svmfit_w, newdata = test_set_w)

# Weight vector by race
w_b <- t(svmfit_b$coefs) %*% svmfit_b$SV
w_w <- t(svmfit_w$coefs) %*% svmfit_w$SV

# Slope of the line by race
slope_1_b <- -w_b[1]/w_b[2]
slope_1_w <- -w_w[1]/w_w[2]

# intercept by race
intercept_1_b <- svmfit_b$rho/w_b[2]
intercept_1_w <- svmfit_w$rho/w_w[2]

# decision boundary using calculated slope and intercept by race
line_b <- b + geom_abline(slope=slope_1_b,intercept = intercept_1_b) +
      labs(title = "Black people", color = "Reoffender (1) or not (0)")  

line_w <- w + geom_abline(slope=slope_1_w,intercept = intercept_1_w) +
      labs(title = "White people", color = "Reoffender (1) or not (0)")  

# linear model for all people
line_all <- ggplot(training_set, aes(age, priors_count, col=is_recid)) + 
    geom_point(size = 0.5) +
    xlab("Age") + 
    ylab("Priors count") +
    labs(title = "All", color = "Reoffender (1) or not (0)") +
    scale_color_discrete(breaks=c("1","0")) +
    theme_bw() + 
    geom_abline(slope=slope_1, intercept = intercept_1) +   # slope and intercept calculated earlier
    xlim(15, 90)

# plot three SVM models together for comparison
plot <- ggarrange(line_b, line_w, line_all, nrow=1, ncol=3, common.legend = TRUE, legend="bottom") 
## Warning: Removed 1 rows containing missing values (geom_point).
annotate_figure(plot, top = text_grob("Race-based SVM linear models", 
               color = "black", face = "bold", size = 14))

Confusion matrices by race

How do the models perform? Let’s look at the confusion matrices by race:

# cm for SVM model for black people
confusionMatrix(y_pred_b, test_set_b[, 3], positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 179  86
##          1 249 436
##                                          
##                Accuracy : 0.6474         
##                  95% CI : (0.616, 0.6778)
##     No Information Rate : 0.5495         
##     P-Value [Acc > NIR] : 8.2e-10        
##                                          
##                   Kappa : 0.2625         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.8352         
##             Specificity : 0.4182         
##          Pos Pred Value : 0.6365         
##          Neg Pred Value : 0.6755         
##              Prevalence : 0.5495         
##          Detection Rate : 0.4589         
##    Detection Prevalence : 0.7211         
##       Balanced Accuracy : 0.6267         
##                                          
##        'Positive' Class : 1              
## 
# cm for SVM model for white people
confusionMatrix(y_pred_w, test_set_w[, 3], positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 327 166
##          1  27  81
##                                           
##                Accuracy : 0.6789          
##                  95% CI : (0.6399, 0.7161)
##     No Information Rate : 0.5890          
##     P-Value [Acc > NIR] : 4.6e-06         
##                                           
##                   Kappa : 0.2751          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.3279          
##             Specificity : 0.9237          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.6633          
##              Prevalence : 0.4110          
##          Detection Rate : 0.1348          
##    Detection Prevalence : 0.1797          
##       Balanced Accuracy : 0.6258          
##                                           
##        'Positive' Class : 1               
##