The objective of this project is to determine which machine learning method best predicts credit risk. Six methods are used and compared.
We ran six machine learning algorithms on the loan data provided by Fast Lending. The first three pair the RandomOverSampler, SMOTE, and ClusterCentroids resamplers with the LogisticRegression classifier. The fourth uses SMOTEENN, which combines over- and undersampling, also with the LogisticRegression classifier. The final two use the BalancedRandomForestClassifier and the EasyEnsembleClassifier.
[Figures: balanced accuracy score, confusion matrix, and classification report for each of the six models]
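The three outputs reported for each model could be produced along the following lines. This is a minimal sketch using scikit-learn's metric functions; the labels below are hypothetical values chosen for illustration, not the project's predictions, and the project may have used a different report function.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)

# Hypothetical labels for illustration only; these are not the project's results.
y_true = ["low_risk"] * 8 + ["high_risk"] * 2
y_pred = ["low_risk"] * 6 + ["high_risk"] * 2 + ["high_risk", "low_risk"]

labels = ["high_risk", "low_risk"]

# Balanced accuracy: the mean of the per-class recalls.
score = balanced_accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall, and F1 as a formatted string.
report = classification_report(y_true, y_pred, labels=labels)

print(score)
print(matrix)
print(report)
```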
All models have low precision for the high-risk class. The undersampling model has the worst recall, with an avg/total of 0.40; the high-risk recall, also 0.40, contributes the most to it. The Easy Ensemble method has the highest recall at 0.94, with recalls of 0.94 and 0.91 for the high-risk and low-risk classes, respectively. The Easy Ensemble method also has the highest F1 score at 0.97 and the highest balanced accuracy score at 0.925. SMOTEENN, SMOTE, and Oversampling all have similar results, but fare worse than the Easy Ensemble and Random Forest methods. The Random Forest method has a high avg/total recall at 0.91, but its recall for the high-risk class is low at 0.67.
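Balanced accuracy is the mean of the per-class recalls, which ties the Easy Ensemble figures above together. A small sketch of that arithmetic follows; the helper name `balanced_accuracy_from_recalls` is made up for this example.

```python
def balanced_accuracy_from_recalls(recalls):
    """Balanced accuracy is the unweighted mean of the per-class recalls."""
    return sum(recalls) / len(recalls)


# Easy Ensemble recalls quoted above: 0.94 (high risk) and 0.91 (low risk).
easy_ensemble = balanced_accuracy_from_recalls([0.94, 0.91])
print(round(easy_ensemble, 3))  # 0.925
```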
Overall, the suggestion is to use the Easy Ensemble Classifier method due to its superiority in every category.
sklearn has a known issue where many of the larger machine learning algorithms will kill the kernel if too much memory is allotted to the process. The ClusterCentroids portion of the code had to be run in Google Colab to ensure adequate disk space and RAM for the process. No other algorithm had this issue.