Prediction Probability Threshold and Confidence Determination for Binary Classification Using Decision Trees

Binil
4 min read · Nov 21, 2021

In this article I am covering two key things for binary classification, both based on Decision Trees.

a) Determine the threshold for splitting the classes based on the prediction probability

b) Determine the confidence thresholds for the prediction based on the prediction probability

In most binary classification problems, we predict the probability of an instance belonging to a particular class. The prediction ranges from 0 to 1, with values near 1 implying a high probability of belonging to the positive class and values near 0 a low probability. When the classes are balanced, the threshold for determining the class is usually set to 0.5: anything above 0.5 is assigned to one class and anything below 0.5 to the other.

But in most practical scenarios we need to deal with imbalanced datasets, where the data skews towards a particular class. Fraud detection, intrusion detection, anomaly detection and similar problems mostly involve imbalanced datasets, where the positive cases are relatively rare compared to the normal (negative) cases. In such cases, a prediction split threshold of 0.5 may not be right. There are different ways to determine an optimum threshold; here we consider a Decision Tree based approach. The idea is to split the predictions such that the concentration of a particular class in the child nodes is best optimized. The evaluation criterion for deciding the optimal split is the Gini impurity (not to be confused with the economic Gini coefficient). Details can be found at the URL below:

https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
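To make the criterion concrete, here is a minimal sketch of the Gini impurity computation for a node (the helper name gini_impurity is mine, not from the article). A node holding fractions p and 1−p of the two classes has impurity 1 − p² − (1−p)², which is 0 for a pure node and 0.5 for a perfect 50/50 mix.

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a node: 1 - sum over classes of p_k squared
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0 (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5 (worst case for binary)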

Consider a dataset where y_pred holds the prediction probabilities and y_train the actual labels. As indicated earlier, y_pred is the prediction probability, which gives the likelihood of an instance belonging to the positive class. We need to determine the threshold on y_pred for classifying an instance into the positive or negative class.

For this we fit a Decision Tree classifier to predict y_train from y_pred. By nature, a Decision Tree splits nodes to reduce impurity, so that the concentration of one class in each child node is high and the other is low.
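The article does not include the underlying data, so the snippet below fabricates a stand-in y_pred and y_train purely so the code that follows can be run end to end; the sample size, class ratio, and noise level are all assumptions on my part.

import numpy as np

rng = np.random.default_rng(100)

# Hypothetical imbalanced data: ~10% positives, with prediction
# probabilities that are noisy but correlated with the true label
y_train = (rng.random(1000) < 0.10).astype(int)
y_pred = np.clip(0.2 + 0.5 * y_train + rng.normal(0.0, 0.15, 1000), 0.0, 1.0)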

Here is the Python code for the same:

import numpy as np
from io import StringIO
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus

# A depth-1 tree finds the single split on y_pred that best separates the classes
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=1, min_samples_leaf=3)
# sklearn expects a 2D feature matrix, so reshape the 1D probability vector
clf_gini.fit(np.array(y_pred).reshape(-1, 1), y_train)

# Render the fitted tree
dot_data = StringIO()
export_graphviz(clf_gini, out_file=dot_data, filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Here X0 represents the prediction probability, gini is the impurity computed over the instances in that node, samples is the number of instances, and value is the count of instances of each class in the node. We can see that the threshold is 0.48, the split that best optimizes the concentration of positive and negative instances in the child nodes.
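If you prefer to read the split point programmatically rather than off the plot, scikit-learn exposes it on the fitted estimator; for a depth-1 tree, node 0 is the only split node:

# The root node's threshold is the optimal probability cut-off
best_threshold = clf_gini.tree_.threshold[0]
print(best_threshold)  # 0.48 for the data shown in the article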

Another related problem we can solve is finding how much confidence the model has when it makes a prediction. As a general rule, the higher the prediction probability, the higher the chance that the instance belongs to the positive class (considering 0 for the normal class and 1 for the positive class). We can solve this using a Decision Tree as well; we just need to increase the tree depth. The code is given below.

import numpy as np
from io import StringIO
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus

# A deeper tree (depth 3) carves the probability range into finer bands
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=3)
clf_gini.fit(np.array(y_pred).reshape(-1, 1), y_train)

dot_data = StringIO()
export_graphviz(clf_gini, out_file=dot_data, filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Decision Tree to measure confidence of prediction

Here blue indicates dominance of the positive class and amber indicates dominance of the negative class. From the figure, we can see that an instance is much more likely to belong to the positive class when the prediction probability is greater than 0.891, and much more likely to belong to the negative class when the prediction probability is less than 0.114. For values in between, neither class dominates strongly.
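These band boundaries can also be recovered without the plot. This is my addition, not code from the article: the fitted tree stores one threshold per node, with a sentinel value of -2 at leaf nodes, so the genuine split points can be filtered out and sorted.

tree = clf_gini.tree_
# Leaf nodes carry threshold == -2; keep only genuine split points
split_points = sorted(tree.threshold[tree.threshold != -2])
print(split_points)  # for the article's tree this includes 0.114 and 0.891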

Based on the probability score, if we need to assign a confidence label, the split would be as follows (a small helper implementing these bands appears after the list):

>0.891: high probability that the prediction belongs to the positive class

≤0.891 and >0.114: medium probability that the prediction belongs to the positive class

≤0.114: low probability that the prediction belongs to the positive class.
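A minimal sketch applying these bands to new prediction probabilities; the helper name and the string labels are mine, and the cut-offs come from the tree above:

import numpy as np

def confidence_label(prob, low=0.114, high=0.891):
    # Map each prediction probability to a confidence band
    return np.select([prob > high, prob > low], ["high", "medium"], default="low")

print(confidence_label(np.array([0.95, 0.50, 0.05])))
# -> ['high' 'medium' 'low']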

Please let me know if you have any comments.


Binil

I am a Data Science practitioner. I have good experience in building and deploying ML models and using ML for Operations Research. My Email: binil_kg@yahoo.com