Author Topic: Machine Learning - Default or Not - Ideas? (Read 5139 times)

WorkingOnStubble · « **on:** November 19, 2013, 07:45:06 AM »

Hey all. So I'm taking a Machine Learning course in my CS degree this term, and one of the practicals is implementing an algorithm that takes around 150,000 data points, each a person with some credit information, and decides whether they're going to default on a loan within the next two years (here, "default" means a payment is 90 or more days late). There's a competition for whoever can get their algorithm to predict it best, and at the moment the winning score is around 88%.

I've done most of the programming so far and gotten my score to around 80%, but I need to get it higher. And you people are some fiscally knowledgeable ones, so I'm wondering if you guys could give me some ideas on when you think a person is more likely to default on a loan. Here is the information I have for each person:

Revolving Utilisation of Unsecured Lines - The total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans), divided by the sum of the credit limits
Age - In years
Number of Times the borrower has been 30-59 days past due but no worse in the last 2 years
Number of times the borrower has been 60-89 days past due but no worse in the last 2 years
Number of times the borrower has been 90+ days past due in the last 2 years
Debt Ratio - Monthly debt payments, alimony, living costs, divided by the monthly gross income
Monthly Income
Number of Open Credit Lines and Loans
Number of Real Estate Loans or Real Estate Lines of Credit
Number of Dependents

As an example of the kind of connection I mean: I gained some predictive accuracy by ignoring the separate counts of times they were 30-59, 60-89, and 90+ days late, and instead recording a single flag, 1 if they had ever been 30 or more days late, and 0 if not.
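A minimal sketch of that flag, assuming the three late-payment counts are available as plain integers (the function name and the sample rows are made up for illustration):

```python
def ever_late(n_30_59, n_60_89, n_90_plus):
    """Return 1 if the borrower was 30+ days past due at least once, else 0."""
    return 1 if (n_30_59 + n_60_89 + n_90_plus) > 0 else 0

# Hypothetical rows: (30-59 day count, 60-89 day count, 90+ day count)
rows = [(0, 0, 0), (2, 0, 0), (0, 1, 3)]
flags = [ever_late(*r) for r in rows]
print(flags)  # [0, 1, 1]
```

The flag could replace the three counts or sit alongside them; which works better is something to check on held-out data.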

I've been finding this really interesting, and want to see how good I can get this!

Thanks

footenote · « **Reply #1 on:** November 19, 2013, 08:10:32 AM »

Fascinating assignment! Machine learning is so cool.

From my background in credit collections and recovery, I would focus on Utilization and Debt Ratio.

(I'm surprised you got good prediction improvement bundling all overdue states; when I was working in collections, a single-digit % of 30 day delinquencies defaulted. 90 days was the red-light zone with a higher % eventually defaulting. But it may not be apples-to-apples to your simulation as accounts didn't "write off" until 150 days past due.)

No Name Guy · « **Reply #2 on:** November 19, 2013, 01:36:50 PM »

Is this data only single points, or a time series per person? If the data on a person is a time series, what about the rate of change of key factors like utilization? For example, I would expect that if their balance-to-limit ratio is going up, and especially if it is accelerating over time, the likelihood of ultimate default is higher than for a person with a steady or declining utilization ratio.

dragoncar · « **Reply #3 on:** November 19, 2013, 02:13:19 PM »

Correct me if I'm wrong, but isn't the point to build a statistical model so that the computer can find these associations? As opposed to hard coding heuristics based on your own logic and observations? Serious question, since I never took such a class and really don't understand it. It seems contrary to the purpose if you are tweaking weights and such to fit a particular data set.

WorkingOnStubble · « **Reply #4 on:** November 19, 2013, 04:01:42 PM »

Thanks footenote, I'll look into the utilisation and debt ratio and play around with them.

No Name Guy, each data point is one person, or one 'application' for a loan, I'd imagine.

Dragoncar, the model sadly has limits. In its present form, it can only predict things with weights in a linear way. So say the income squared is the actual predictor of a default, that means the predictor wouldn't be able to account for that. This means that if I want to actually get some improvement in prediction, I have to get the feature as close to linear as possible, in this case by square rooting the income. Hope that helped explain it!

StarswirlTheMustached · « **Reply #5 on:** November 20, 2013, 12:49:03 PM »

Quote from: WorkingOnStubble on November 19, 2013, 04:01:42 PM

Dragoncar, the model sadly has limits. In its present form, it can only predict things with weights in a linear way. So say the income squared is the actual predictor of a default, that means the predictor wouldn't be able to account for that. This means that if I want to actually get some improvement in prediction, I have to get the feature as close to linear as possible, in this case by square rooting the income. Hope that helped explain it!

I suddenly fear skynet much less. Thanks.

steveo · « **Reply #6 on:** November 20, 2013, 02:00:48 PM »

My answer depends on what you are using to build your model. Are you using something like R or Python? If so, I think you should just chuck all the data into a bunch of different models and see how it goes.

feelingroovy · « **Reply #7 on:** November 20, 2013, 07:10:16 PM »

Like Stevio said, I don't understand why this is an algorithm and not statistical model building, but I'm not a programmer.

That said, if it were a statistical model, I would try some interaction terms. I suspect, for example, that debt ratio has a different effect for low than for high incomes.
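One way an interaction term could look in code, as a rough sketch: the product of debt ratio and monthly income is appended as a third input (the function name and sample values are hypothetical).

```python
def add_interaction(row):
    """Append debt_ratio * monthly_income as an extra feature."""
    debt_ratio, income = row
    return [debt_ratio, income, debt_ratio * income]

# Hypothetical rows: same debt ratio, very different incomes
rows = [(0.5, 2000.0), (0.5, 8000.0)]
expanded = [add_interaction(r) for r in rows]
print(expanded)  # [[0.5, 2000.0, 1000.0], [0.5, 8000.0, 4000.0]]
```

With weights w1, w2, w3 on these three inputs, the effective slope on debt ratio becomes w1 + w3 * income, which is what lets a linear model treat low and high earners differently.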

I would try a CART model, personally. I'm guessing the point here is to program such a non-theoretical model? Get it to try different cut-points of predictors to find the optimal values?
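To make the cut-point idea concrete, here is a single-split sketch. It is a simplification, not full CART: it scores each candidate threshold by raw misclassification count rather than Gini impurity, and the data below are invented.

```python
def best_cut_point(values, labels):
    """Try the midpoint between each pair of adjacent distinct values.

    Returns (threshold, errors) for the best rule of the form
    'predict 1 when value > threshold' (or its mirror image).
    """
    distinct = sorted(set(values))
    best = (None, len(labels) + 1)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        errors = min(errors, len(labels) - errors)  # allow the flipped rule
        if errors < best[1]:
            best = (t, errors)
    return best

# Invented utilisation values and default outcomes
utilisation = [0.1, 0.2, 0.3, 0.8, 0.9, 1.1]
defaulted = [0, 0, 0, 1, 0, 1]
threshold, errors = best_cut_point(utilisation, defaulted)
print(threshold, errors)  # threshold lands between 0.3 and 0.8, with 1 error
```

A tree builder would apply this search to every feature, keep the best split, and then repeat on each side of the split.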

btw, this sounds very fun. :)

ender · « **Reply #8 on:** November 20, 2013, 07:52:06 PM »

Quote from: WorkingOnStubble on November 19, 2013, 04:01:42 PM

Dragoncar, the model sadly has limits. In its present form, it can only predict things with weights in a linear way. So say the income squared is the actual predictor of a default, that means the predictor wouldn't be able to account for that. This means that if I want to actually get some improvement in prediction, I have to get the feature as close to linear as possible, in this case by square rooting the income. Hope that helped explain it!

One way to do this is to create a duplicate set of inputs that are the square of each quantity, then have your model treat them linearly.

For example, do something like:

inc2 = income*income

Then your model can treat inc2 as a linear term, like:

2*income + 3*inc2 (etc)
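Spelled out as runnable Python, a sketch of the same idea might look like this (expand and linear_score are illustrative names, and the weights are just the 2 and 3 from the line above):

```python
def expand(income):
    """Return the original feature plus its square as a second input."""
    return [income, income * income]

def linear_score(features, weights, bias=0.0):
    """Plain weighted sum: the model stays linear in its inputs."""
    return bias + sum(w * x for w, x in zip(weights, features))

weights = [2.0, 3.0]
score = linear_score(expand(4.0), weights)
print(score)  # 2*4 + 3*16 = 56.0
```

The model never needs to know that the second input is a square; the nonlinearity lives entirely in the feature.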

dragoncar · « **Reply #9 on:** November 21, 2013, 03:09:33 PM »

Quote from: enderland on November 20, 2013, 07:52:06 PM

I think he's asking for ideas on new inputs, e.g. income^2, debt/income, etc.

WorkingOnStubble · « **Reply #10 on:** November 22, 2013, 04:00:05 PM »

Thanks for all the posts!

I'm aware I can just square various features and add them as new features, and that's what I'm doing so far. But the problem is that there are an infinite number of things I can do to each feature, and obviously a very limited amount of code I can write :P. At the moment I'm looking for ideas on how things would relate to each other, or ways in which they can be approximated linearly. An example from dragoncar was right: debt/income. If it reveals a linear pattern, or some other way of telling whether a person is more or less likely to default based on its value, then it's useful to add.
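One way to keep the amount of hand-written code down is to loop over a small menu of transforms and rank them by how linearly each transformed feature tracks the target. This is only a sketch under assumptions: the menu is arbitrary, the feature is assumed non-negative (so sqrt and log1p are defined), and the data are a toy example where the target is exactly income squared.

```python
import math

# Hypothetical menu of transforms; extend as needed.
TRANSFORMS = {
    "identity": lambda x: x,
    "sqrt": lambda x: math.sqrt(x),
    "log1p": lambda x: math.log1p(x),
    "square": lambda x: x * x,
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rank_transforms(values, target):
    """Sort transform names by |correlation| with the target, strongest first."""
    scores = {name: abs(pearson([f(v) for v in values], target))
              for name, f in TRANSFORMS.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

income = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [1.0, 4.0, 9.0, 16.0, 25.0]  # toy target: exactly income squared
ranking = rank_transforms(income, target)
print(ranking[0][0])  # square
```

Correlation with a 0/1 default label will be much noisier than in this toy case, so the ranking is a hint about what to try, not a verdict.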

I'm programming this in Python, for Steveo.

gimp · « **Reply #11 on:** November 22, 2013, 05:32:30 PM »

I'm a programmer, among other things...

I think you need to read about Linear Programming, and Integer Linear Programming.

Essentially, look, you can break this down into a huge equation, that works something like this:

Coefficient A * Data Point A + Coefficient B * Data Point B + ... = Result (Default)

Now you know all the data points, and you know the result, so you need to find the coefficients.

Can you solve precisely for the coefficients? Sure, for one person, of course. But you don't have one person, you have a huge matrix:

Coefficient A * Data Point 1A + Coefficient B * Data Point 1B + ... = Result 1
Coefficient A * Data Point 2A + Coefficient B * Data Point 2B + ... = Result 2
Coefficient A * Data Point 3A + Coefficient B * Data Point 3B + ... = Result 3
Coefficient A * Data Point 4A + Coefficient B * Data Point 4B + ... = Result 4

Can you solve that perfectly? It's possible, but fairly unlikely. In pure math, with n unknowns you need at least n independent equations to pin the unknowns down. Here you have far more equations (one per person) than unknowns, so the system is overdetermined: unless the data happen to be perfectly consistent, no single set of coefficients satisfies every row. Doesn't matter.

What you're actually trying to do is minimize the difference between your prediction, and the result, on average. You could do something like a root mean square to find your error, or you can do your error in other ways if you want to. I'm not claiming rms is right, just that it's a known method.
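For reference, a root-mean-square error takes only a few lines. This is a sketch with made-up predictions and outcomes, where outcomes are coded 1 for default and 0 otherwise:

```python
import math

def rmse(predictions, outcomes):
    """Root-mean-square difference between predictions and actual outcomes."""
    squared = sum((p - y) ** 2 for p, y in zip(predictions, outcomes))
    return math.sqrt(squared / len(outcomes))

preds = [0.9, 0.2, 0.4, 0.1]   # made-up model outputs
actual = [1, 0, 1, 0]          # made-up outcomes
error = rmse(preds, actual)
print(round(error, 3))  # 0.324
```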

Okay, so you end up with a problem: How do I pick A, B, C ... such that my error is minimal?

Linear programming, if not precisely the answer, is at the very least an excellent approach to thinking about this problem.

Read about linear programming, and integer programming, and let me know what you think.

There are of course other methods of solving this... you could quite literally pick random values and see what gives you the minimum answer. You could pick random values, then use a lowest-neighbor approach to see if one of the coefficients can be modified to improve the answer, again and again until you get a minimum. You could do the same thing, but modify multiple coefficients at once. You could even write a genetic algorithm to do it. Really, the world is your oyster, but I'd start with LP first.
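The neighbor-search idea can be sketched as below. This is one possible reading of it, not a recommended solver: it starts from all-zero coefficients instead of random ones so the run is repeatable, nudges one coefficient at a time, keeps any change that lowers the RMS error, and halves the step when nothing helps. The data are a toy set built so that the true coefficients are 2 and 3.

```python
import math

def rmse(coeffs, rows, outcomes):
    """RMS error of the linear prediction sum(c * x) against the outcomes."""
    total = 0.0
    for row, y in zip(rows, outcomes):
        pred = sum(c * x for c, x in zip(coeffs, row))
        total += (pred - y) ** 2
    return math.sqrt(total / len(rows))

def local_search(rows, outcomes, step=1.0, tol=1e-6, max_passes=10_000):
    coeffs = [0.0] * len(rows[0])
    best = rmse(coeffs, rows, outcomes)
    for _ in range(max_passes):
        if step < tol:
            break
        improved = False
        for i in range(len(coeffs)):
            for delta in (step, -step):
                trial = coeffs.copy()
                trial[i] += delta
                err = rmse(trial, rows, outcomes)
                if err < best:  # keep only strict improvements
                    coeffs, best, improved = trial, err, True
        if not improved:
            step /= 2  # no neighbor helped: search on a finer grid
    return coeffs, best

# Toy data generated from result = 2*a + 3*b
rows = [(1, 0), (0, 1), (1, 1), (2, 1)]
outcomes = [2, 3, 5, 7]
coeffs, best = local_search(rows, outcomes)
print(coeffs, best)  # [2.0, 3.0] 0.0
```

On real data the error will not reach zero, and a closed-form least-squares fit or gradient descent would usually get there much faster; the point here is only to show the search loop.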

dragoncar · « **Reply #12 on:** November 23, 2013, 02:43:41 PM »

Quote from: gimp on November 22, 2013, 05:32:30 PM

I'm partial to simulated annealing

gimp · « **Reply #13 on:** November 25, 2013, 11:38:00 AM »

Yep, simulated annealing would work too.

Actually, there are so many methods that would work, that we can start a trash talking thread. "Simulated annealing? Yeah, get that weak sauce outta here. Partitioning will whoop its ass!"
