Binning variables before running logistic regression

The logiBin package enables fast binning of multiple variables using parallel processing. It generates a summary of all the binned variables which reports, among other things, the information value (IV), the entropy and an indicator of whether the variable follows a monotonic trend. It supports rebinning of variables to force a monotonic trend as well as manual binning based on pre-specified cuts.

Creating the bins for continuous and categorical variables

The getBins function uses parallel processing to compute bins for continuous and categorical variables. The splits are computed using the partykit package, which fits conditional inference trees. Refer to the package documentation for more details. A separate bin is created for NA values; it can be combined with another bin using the naCombine function. Categorical variables with at most 10 distinct values are supported.

Eg: b1 <- getBins(loanData, "bad_flag", c("age", "LTV", "score", "balance"), minCr = 0.8, nCores = 2)

This returns a list containing 3 elements. One is a data frame called err, which lists the variables that could not be split and the reason for each.

	var	error
9	score	No significant splits

It can be seen that no significant splits were found for the variable ‘score’. The other variables specified were split into bins. The summary of these splits is in another element of the list, a data frame called varSummary. For each variable it contains the IV, the entropy, the p value from the ctree function in the partykit package, a flag indicating whether the bad rate increases or decreases with the value of the variable (trend), a flag indicating whether a monotonic trend is present (monTrend), the proportion of bins which flip, i.e. do not follow a monotonic trend (flipRatio), the number of bins (numBins) and a flag indicating whether the variable includes pure nodes, i.e. nodes which do not have any defaults (purNode).

	var	iv	pVal	stat	ent	trend	monTrend	flipRatio	numBins	purNode	varType
4	age	0.8399	0.0356891	4.411899	0.7367	I	N	0.5	3	N	integer
8	LTV	0.5241	0.0067388	7.341301	0.7567	D	Y	0.0	3	Y	numeric
12	balance	0.3536	0.0360245	4.395943	0.7900	D	Y	0.0	2	N	integer

The variables LTV and balance have a monotonically decreasing trend, which means that the bad rate decreases as the value of the variable increases. The variable age is flagged with an increasing trend. However, the trend is not monotonic: the flip ratio is 0.5, i.e. there is a flip in 50% of the bins. To check this, look at the second element of the list, a data frame called bin, which contains the details of all the bins of each variable.

	var	bin	count	bads	goods	propn	bad_rate	iv	ent
1	age	age <= 34	44	19	25	44	43.18	0.2602	0.9865
2	age	age > 34 & age <= 45	32	2	30	32	6.25	0.5772	0.3373
3	age	age > 45	24	6	18	24	25.00	0.0025	0.8113
4	age	Total	100	27	73	1	27.00	0.8399	0.7367
5	LTV	LTV <= 0.77	24	13	11	24	54.17	0.3843	0.9950
6	LTV	LTV > 0.77	74	14	60	74	18.92	0.1398	0.6998
7	LTV	is.na(LTV)	2	0	2	2	0.00	Inf	0.0000
8	LTV	Total	100	27	73	1	27.00	0.5241	0.7567
10	balance	balance <= 6359	19	10	9	19	52.63	0.2718	0.9980
11	balance	balance > 6359	81	17	64	81	20.99	0.0818	0.7412
12	balance	Total	100	27	73	1	27.00	0.3536	0.7900
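
The iv and ent columns for age can be reproduced by hand from the bads and goods counts. The sketch below assumes the usual information value formula and the binary entropy in bits. These formulas appear to match the figures in the table, but the source does not confirm that logiBin computes them this way. The names age_bins, pct_bad and pct_good are illustrative only.

```r
# Counts for the three age bins, copied from the table above
age_bins <- data.frame(
  bin   = c("age <= 34", "age > 34 & age <= 45", "age > 45"),
  bads  = c(19, 2, 6),
  goods = c(25, 30, 18)
)

# Share of all bads and of all goods that falls in each bin
pct_bad  <- age_bins$bads  / sum(age_bins$bads)
pct_good <- age_bins$goods / sum(age_bins$goods)

# Information value contribution of each bin (assumed textbook formula)
age_bins$iv <- (pct_bad - pct_good) * log(pct_bad / pct_good)

# Binary entropy, in bits, of the bad rate within each bin
p <- age_bins$bads / (age_bins$bads + age_bins$goods)
age_bins$ent <- -(p * log2(p) + (1 - p) * log2(1 - p))

# Variable level: IV is the sum over bins, entropy is the count-weighted mean
count     <- age_bins$bads + age_bins$goods
total_iv  <- sum(age_bins$iv)
total_ent <- sum(count / sum(count) * age_bins$ent)

print(round(age_bins$iv, 4))
print(round(total_iv, 4))
print(round(total_ent, 4))
```

Under these assumptions the printed values should agree with the age rows of the table. A bin with no bads, such as is.na(LTV), gives an infinite IV contribution with this formula, which is consistent with the Inf shown for that bin. Its entropy evaluates to NaN here and would need special handling.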

Looking at the bins of the variable age, it can be seen that the first bin has a high bad rate and contains a large proportion of the population. The bad rate of the middle bin is lower than that of the last bin. However, if the second and third bins are combined, a monotonically decreasing trend can be forced. The function forceDecrTrend can be used for this.

Eg: b1 <- forceDecrTrend(b1, "age")

Once a decreasing trend is forced, the second and third bins are merged into a single bin (age > 34) and the variable age is monotonically decreasing.

	var	bin	count	bads	goods	propn	bad_rate	iv	ent
5	LTV	LTV <= 0.77	24	13	11	24	54.17	0.3843	0.9950
6	LTV	LTV > 0.77	74	14	60	74	18.92	0.1398	0.6998
7	LTV	is.na(LTV)	2	0	2	2	0.00	Inf	0.0000
8	LTV	Total	100	27	73	1	27.00	0.5241	0.7567
10	balance	balance <= 6359	19	10	9	19	52.63	0.2718	0.9980
11	balance	balance > 6359	81	17	64	81	20.99	0.0818	0.7412
12	balance	Total	100	27	73	1	27.00	0.3536	0.7900
1	age	age <= 34	44	19	25	44	43.18	0.2602	0.9865
2	age	age > 34	56	8	48	56	14.29	0.2880	0.5917
3	age	Total	100	27	73	1	27.00	0.5482	0.7654
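
The arithmetic behind this merge can be checked in a few lines of base R. This is only a sketch of the bin merge shown in the tables, not the way forceDecrTrend is implemented. The helper is_decreasing is a hypothetical name introduced for illustration.

```r
# Age bins before forcing the trend (counts taken from the first bin table)
bads  <- c(19, 2, 6)
goods <- c(25, 30, 18)

# Bad rate per bin, in percent; it falls and then rises, so it is not monotonic
bad_rate <- 100 * bads / (bads + goods)
is_decreasing <- function(x) all(diff(x) < 0)
print(is_decreasing(bad_rate))

# Merge the second and third bins by adding their counts
bads_m  <- c(bads[1],  sum(bads[2:3]))
goods_m <- c(goods[1], sum(goods[2:3]))
bad_rate_m <- 100 * bads_m / (bads_m + goods_m)

print(round(bad_rate_m, 2))
print(is_decreasing(bad_rate_m))
```

The merged bin has 8 bads out of 56 observations, which gives the 14.29 bad rate reported for age > 34.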

This function can also take multiple variables as input if a decreasing trend is to be forced on multiple variables.

Eg: forceDecrTrend(b1, c("age", "LTV"))

Similarly, the function forceIncrTrend can be used to force a monotonically increasing trend if required. The function manualSplit can be used to split a variable manually based on specified cuts. The function naCombine can be used to combine the NA bin with either the bin having the closest bad rate or, if the count of observations in the NA bin is low, the average bad rate.
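
To illustrate what binning on pre-specified cuts amounts to, here is a minimal base R sketch using cut(). It does not call manualSplit, whose arguments are not shown here. The ages and outcomes are made up for illustration, and the cut points 34 and 45 are borrowed from the age bins above.

```r
# Made-up ages and outcomes, for illustration only (1 = bad, 0 = good)
age <- c(22, 30, 34, 35, 41, 45, 46, 58, 63, NA)
bad <- c(1, 1, 0, 0, 0, 0, 1, 0, 0, 0)

# Pre-specified cut points; intervals are right-closed, e.g. (34,45]
cuts <- c(34, 45)
age_bin <- cut(age, breaks = c(-Inf, cuts, Inf))

# Keep missing values in a bin of their own
age_bin <- addNA(age_bin)

# Count and bad rate per bin
counts   <- table(age_bin)
bad_rate <- tapply(bad, age_bin, mean)
print(counts)
print(round(100 * bad_rate, 2))
```

The NA level added by addNA plays the role of the separate NA bin described earlier.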

Once this is done, the splits can be replicated on a test data frame to check whether the same trend holds there.
Eg: b2 <- binTest(b1, testDf, "BAD_FLG", c("age", "LTV"))

If there are a lot of flips on the test data, the variable can be discarded. Otherwise, increasing/decreasing trends can be forced on b2 to ensure that there are no flips. This can then be tested on the original data.
Eg: b1 <- binTest(b2, loanData, "bad_flag", c("age", "LTV"))

Once the bins have been finalized, variables can be shortlisted based on IV and linearity. The bins of these shortlisted variables can be created in the data using the function createBins.
Eg: loanData1 <- createBins(b1, loanData, c("age", "LTV"))

The data frame loanData1 will have all the variables of data frame loanData along with binned variables which will be created with the prefix “b_” before the original name of the variable.
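
The binned variables can then be used as predictors in a logistic regression. The sketch below uses simulated data as a stand-in for loanData1 and builds a b_age column by hand with cut(). Representing the binned column as a factor is an assumption, as are the data frame name sim and the simulated default rates.

```r
set.seed(1)

# Simulated stand-in for loanData1; the real data frame comes from createBins
n <- 500
age <- sample(20:70, n, replace = TRUE)
# Assumed pattern: applicants aged 34 or under default more often
p_bad <- ifelse(age <= 34, 0.45, 0.15)
sim <- data.frame(
  bad_flag = rbinom(n, size = 1, prob = p_bad),
  age = age
)

# Hand-made binned column named with the "b_" prefix described above
sim$b_age <- cut(sim$age, breaks = c(-Inf, 34, Inf),
                 labels = c("age <= 34", "age > 34"))

# Logistic regression on the binned variable; the first level is the reference
fit <- glm(bad_flag ~ b_age, data = sim, family = binomial)
print(coef(summary(fit)))
```

With the first bin as the reference level, a negative coefficient on the second bin corresponds to a lower bad rate for age > 34, in line with the decreasing trend forced earlier.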

Sneha Tody

2018-05-21