Binning variables before running logistic regression

Sneha Tody

2018-05-21

The logiBin package enables fast binning of multiple variables using parallel processing. It generates a summary of all binned variables that reports, among other things, the information value (IV), the entropy, and an indicator of whether the variable follows a monotonic trend. It supports rebinning of variables to force a monotonic trend, as well as manual binning based on pre-specified cuts.

Creating the bins for continuous and categorical variables

The getBins function uses parallel processing to compute bins for continuous and categorical variables. The splits are computed using the partykit package, which uses conditional inference trees. Refer to the package documentation for more details. A separate bin is created for NA values. This bin can be combined with another bin using the naCombine function. Categorical variables with a maximum of 10 distinct values are supported.

Eg: b1 <- getBins(loanData, "bad_flag", c("age", "LTV", "score", "balance"), minCr = 0.8, nCores = 2)

This returns a list containing 3 elements. One is a data frame called err, which lists the variables that could not be split and the reason in each case.

  var   error
9 score No significant splits


It can be seen that no significant splits were found for the variable 'score'. The other variables specified were split into bins. The summary of these splits is in another element of the list, a data frame called varSummary. For each variable it contains the IV (iv), the p value (pVal) and test statistic (stat) from the ctree function in the partykit package, the entropy (ent), a flag indicating whether the bad rate increases or decreases with the variable value (trend), a flag indicating whether a monotonic trend is present (monTrend), the proportion of bins which flip, i.e. do not follow a monotonic trend (flipRatio), the number of bins (numBins), and a flag indicating whether the variable includes pure nodes, i.e. nodes which do not have any defaults (purNode).

   var         iv      pVal     stat    ent trend monTrend flipRatio numBins purNode varType
4  age     0.8399 0.0356891 4.411899 0.7367     I        N       0.5       3       N integer
8  LTV     0.5241 0.0067388 7.341301 0.7567     D        Y       0.0       3       Y numeric
12 balance 0.3536 0.0360245 4.395943 0.7900     D        Y       0.0       2       N integer


The variables LTV & balance have a monotonic decreasing trend, which indicates that the bad rate decreases as the value of the variable increases. The variable age is flagged with an increasing trend. However, the trend is not monotonic and there is a flip in 50% of the bins. To check this, look at the remaining element of the list, a data frame called bin, which contains details of all the bins of the variables.

   var     bin                  count bads goods propn bad_rate     iv    ent
1  age     age <= 34               44   19    25    44    43.18 0.2602 0.9865
2  age     age > 34 & age <= 45    32    2    30    32     6.25 0.5772 0.3373
3  age     age > 45                24    6    18    24    25.00 0.0025 0.8113
4  age     Total                  100   27    73     1    27.00 0.8399 0.7367
5  LTV     LTV <= 0.77             24   13    11    24    54.17 0.3843 0.9950
6  LTV     LTV > 0.77              74   14    60    74    18.92 0.1398 0.6998
7  LTV     is.na(LTV)               2    0     2     2     0.00    Inf 0.0000
8  LTV     Total                  100   27    73     1    27.00 0.5241 0.7567
10 balance balance <= 6359         19   10     9    19    52.63 0.2718 0.9980
11 balance balance > 6359          81   17    64    81    20.99 0.0818 0.7412
12 balance Total                  100   27    73     1    27.00 0.3536 0.7900
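
The iv and ent columns for the age bins can be reproduced from the bads and goods counts alone. The sketch below uses base R only and assumes the usual definitions: a bin's IV contribution is (share of goods - share of bads) * log(share of goods / share of bads), and its entropy is the base-2 binary entropy of the bin's bad rate. These formulas are an assumption that happens to match the figures in the table; they are not taken from the logiBin source code.

```r
# Counts for the three age bins, copied from the bin table
bins <- data.frame(
  bin   = c("age <= 34", "age > 34 & age <= 45", "age > 45"),
  bads  = c(19, 2, 6),
  goods = c(25, 30, 18)
)

# Share of all goods and of all bads that falls in each bin
good_share <- bins$goods / sum(bins$goods)
bad_share  <- bins$bads  / sum(bins$bads)

# Information value contribution of each bin (assumed formula)
bins$iv <- (good_share - bad_share) * log(good_share / bad_share)

# Base-2 binary entropy of the bad rate within each bin (assumed formula)
p <- bins$bads / (bins$bads + bins$goods)
bins$ent <- -(p * log2(p) + (1 - p) * log2(1 - p))

# Variable-level figures: summed IV and count-weighted average entropy
bin_count <- bins$bads + bins$goods
total_iv  <- sum(bins$iv)
total_ent <- sum(bin_count / sum(bin_count) * bins$ent)

print(round(bins[, c("iv", "ent")], 4))
print(round(c(iv = total_iv, ent = total_ent), 4))
```

Under this formula a bin with zero bads, such as is.na(LTV), gives a bad share of 0 and therefore an infinite IV contribution, which is consistent with the Inf shown for that bin.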

Looking at the bins of the variable age, it can be seen that the first bin has a high bad rate and contains a large proportion of the population. The bad rate of the middle bin is lower than that of the last bin. However, if the second & third bins are combined, a monotonic decreasing trend can be forced. The function forceDecrTrend can be used for this.
Eg: b1 <- forceDecrTrend(b1, "age")

We can see that once a decreasing trend is forced, the variable age is now monotonically decreasing.

   var     bin                  count bads goods propn bad_rate     iv    ent
5  LTV     LTV <= 0.77             24   13    11    24    54.17 0.3843 0.9950
6  LTV     LTV > 0.77              74   14    60    74    18.92 0.1398 0.6998
7  LTV     is.na(LTV)               2    0     2     2     0.00    Inf 0.0000
8  LTV     Total                  100   27    73     1    27.00 0.5241 0.7567
10 balance balance <= 6359         19   10     9    19    52.63 0.2718 0.9980
11 balance balance > 6359          81   17    64    81    20.99 0.0818 0.7412
12 balance Total                  100   27    73     1    27.00 0.3536 0.7900
1  age     age <= 34               44   19    25    44    43.18 0.2602 0.9865
2  age     age > 34                56    8    48    56    14.29 0.2880 0.5917
3  age     Total                  100   27    73     1    27.00 0.5482 0.7654
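
The effect of combining the second and third age bins can be checked by hand. The sketch below, in base R, adds the counts of the two bins, recomputes the bad rate, and tests whether the bad rate now falls from bin to bin. The helper names bad_rate and is_decreasing are hypothetical and written for this illustration; they are not logiBin functions, and this is not necessarily how forceDecrTrend decides which bins to merge.

```r
# Age bin counts before merging, copied from the first bin table
bads  <- c(19, 2, 6)
goods <- c(25, 30, 18)

# Hypothetical helper: bad rate in percent for each bin
bad_rate <- function(bads, goods) 100 * bads / (bads + goods)

# Hypothetical helper: TRUE when the bad rate never rises from one bin to the next
is_decreasing <- function(rate) all(diff(rate) <= 0)

before <- bad_rate(bads, goods)

# Merge the second and third bins by adding their counts
merged_bads  <- c(bads[1],  bads[2]  + bads[3])
merged_goods <- c(goods[1], goods[2] + goods[3])
after <- bad_rate(merged_bads, merged_goods)

# Recompute each bin's IV contribution on the merged bins (assumed standard formula)
good_share <- merged_goods / sum(merged_goods)
bad_share  <- merged_bads  / sum(merged_bads)
merged_iv  <- (good_share - bad_share) * log(good_share / bad_share)

print(round(before, 2))
print(is_decreasing(before))
print(round(after, 2))
print(is_decreasing(after))
print(round(merged_iv, 4))
```

Merging trades information value for monotonicity: the per-bin contributions of the merged variable sum to roughly 0.548, down from 0.8399 before the merge.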

This function can also take multiple variables as input if a decreasing trend is to be forced on multiple variables.

Eg: forceDecrTrend(b1, c("age", "LTV"))


Similarly the function forceIncrTrend can be used to force a monotonically increasing trend if required. The function manualSplit can be used to manually split the variable based on specified cuts. The function naCombine can be used to combine the NA bin with either the bin having the closest bad rate or the average bad rate if the count of observations in NA bin is low.


Once this is done, the splits created can be replicated on a test data frame to check whether the same trend holds there.
Eg: b2 <- binTest(b1, testDf, "BAD_FLG", c("age", "LTV"))


If there are a lot of flips on the test data, the variable can be discarded. Otherwise, increasing/decreasing trends can be forced on b2 to ensure that there are no flips. This can then be tested on the original data.
Eg: b1 <- binTest(b2, loanData, "bad_flag", c("age", "LTV"))


Once the bins have been finalized, variables can be shortlisted based on IV and linearity. The bins of these shortlisted variables can be created in the data using the function createBins.
Eg: loanData1 <- createBins(b1, loanData, c("age", "LTV"))

The data frame loanData1 will have all the variables of data frame loanData along with binned variables which will be created with the prefix “b_” before the original name of the variable.
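
To give a feel for what a binned column looks like, the base R sketch below builds a b_age column by hand from the final age cut point (34). The data frame toy and its values are made up for illustration, and the bin labels are an assumption; the labels that createBins actually writes may differ.

```r
# Toy data frame standing in for loanData (values are made up for illustration)
toy <- data.frame(age = c(22, 34, 35, 51, 67))

# Final age bins after forcing the decreasing trend: age <= 34 and age > 34
age_breaks <- c(-Inf, 34, Inf)
age_labels <- c("age <= 34", "age > 34")  # assumed labels, not necessarily createBins output

# right = TRUE closes each interval on the right, so age 34 falls in the first bin
toy$b_age <- cut(toy$age, breaks = age_breaks, labels = age_labels, right = TRUE)

print(toy)
```

The original age column is kept next to the new b_age column, mirroring the way loanData1 retains all the variables of loanData.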