Lately, we hear a lot that oversampling and SMOTE distort the probabilities that classifiers return.
👉🏻 And that's correct. But those are not the only methods that affect calibration.
👉🏻 Cost sensitive learning also does (see the sketch after this list).
👉🏻 Undersampling also affects probability distributions.
👉🏻 And in fact, many classifiers, like random forests, naive Bayes and GBMs, also return uncalibrated probabilities.
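To make the point concrete, here is a minimal sketch, entirely my own example and not part of the original post: it compares the average predicted probability of a plain logistic regression and a cost-sensitive one, using class_weight="balanced" as a simple stand-in for cost sensitive learning, on a synthetic imbalanced dataset. The reweighted model's average probability drifts well above the true positive rate.

```python
# Minimal sketch (my own example, not from the post): compare the average
# predicted probability of a plain vs. a cost-sensitive logistic regression
# on a synthetic imbalanced dataset (5% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

print("true positive rate:  ", y_test.mean())                                # roughly 0.05
print("plain mean proba:    ", plain.predict_proba(X_test)[:, 1].mean())     # close to 0.05
print("weighted mean proba: ", weighted.predict_proba(X_test)[:, 1].mean())  # far above 0.05
```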
Why does it matter? Because calibrated probabilities tell us how much confidence we can place in each prediction.
➡️ A well-calibrated classifier will correctly estimate the probability of an event occurring.
➡️ What this means is that, if a fraud classifier outputs 0.8 for an observation, that observation has an 80% chance of being fraudulent; or, in other words, roughly 80 out of 100 observations with a similar probability will indeed be fraudulent (the sketch below shows one way to check this).
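Here is a minimal sketch, my own example rather than anything from this post, of how you could check this with scikit-learn's calibration_curve. It bins the predicted probabilities of a random forest (chosen arbitrarily for illustration) and compares them with the observed fraction of positives in each bin.

```python
# Minimal sketch (my own example): check calibration with scikit-learn's
# calibration_curve. For a well-calibrated model, the observed fraction of
# positives in each bin should roughly match the mean predicted probability.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]

frac_positives, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positives):
    print(f"mean predicted {pred:.2f} -> observed fraction of positives {obs:.2f}")
```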
🤔 I think people like repeating that SMOTE returns uncalibrated probabilities because 1 or 2 recent articles claim that the probabilities returned by classifiers trained with SMOTE are uncalibrated beyond repair. Or, in other words, that you can't recalibrate a classifier if you trained it with SMOTE.
Anyhow, what I wanted to discuss is that we can recalibrate uncalibrated probabilities.
There are various methods. The 2 implemented in scikit-learn are Platt scaling and isotonic regression (see the sketch below).
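A minimal sketch of recalibration with scikit-learn's CalibratedClassifierCV; the random forest and the synthetic dataset are just placeholders I chose for illustration. method="sigmoid" gives Platt scaling, method="isotonic" gives isotonic regression.

```python
# Minimal sketch (my own example): recalibrate a classifier with
# CalibratedClassifierCV. method="sigmoid" is Platt scaling;
# method="isotonic" fits an isotonic regression instead.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each CV fold trains the forest, then fits the calibrator on the held-out part.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid",  # or "isotonic"
    cv=3,
)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
```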
If you want to learn more about probability calibration and how to recalibrate classifiers, check out my course Machine Learning with Imbalanced Data.
I hope this information was useful!
Wishing you a successful week ahead - see you next Monday! 👋🏻
Sole