Lately, we hear a lot that oversampling and SMOTE distort the probabilities that classifiers return.
👉🏻 And that's correct. But those are not the only methods that affect calibration.
👉🏻 Cost sensitive learning also does (see the sketch after this list).
👉🏻 Undersampling also affects probability distributions.
👉🏻 And in fact, many classifiers, like random forests, naive Bayes and GBMs, also return uncalibrated probabilities.
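To make the point concrete, here is a minimal sketch, entirely my own example and not part of the original post: it compares the average predicted probability of a plain logistic regression and a cost-sensitive one, using class_weight="balanced" as a simple stand-in for cost sensitive learning, on a synthetic imbalanced dataset. The reweighted model's average probability drifts well above the true positive rate.

```python
# Minimal sketch (my own example, not from the post): compare the average
# predicted probability of a plain vs. a cost-sensitive logistic regression
# on a synthetic imbalanced dataset (5% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

print("true positive rate:  ", y_test.mean())                                # roughly 0.05
print("plain mean proba:    ", plain.predict_proba(X_test)[:, 1].mean())     # close to 0.05
print("weighted mean proba: ", weighted.predict_proba(X_test)[:, 1].mean())  # far above 0.05
```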
Why does it matter? Because calibrated probabilities tell us how much confidence we can place in each prediction.
➡️ A well-calibrated classifier will correctly estimate the probability of an event occurring.
➡️ What this means is that, if a fraud classifier outputs 0.8 for an observation, that observation has an 80% chance of being fraudulent; or, in other words, roughly 80 out of 100 observations with a similar probability will indeed be fraudulent (the sketch below shows one way to check this).
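Here is a minimal sketch, my own example rather than anything from this post, of how you could check this with scikit-learn's calibration_curve. It bins the predicted probabilities of a random forest (chosen arbitrarily for illustration) and compares them with the observed fraction of positives in each bin.

```python
# Minimal sketch (my own example): check calibration with scikit-learn's
# calibration_curve. For a well-calibrated model, the observed fraction of
# positives in each bin should roughly match the mean predicted probability.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]

frac_positives, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positives):
    print(f"mean predicted {pred:.2f} -> observed fraction of positives {obs:.2f}")
```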
🤔 I think people like repeating that SMOTE returns uncalibrated probabilities because 1 or 2 recent articles claim that the probabilities returned by classifiers trained with SMOTE are uncalibrated beyond repair. Or, in other words, that you can't recalibrate a classifier if you trained it with SMOTE.
Anyhow, what I wanted to discuss is that we can recalibrate uncalibrated probabilities.
There are various methods. The 2 implemented in scikit-learn are Platt scaling and isotonic regression (see the sketch below).
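A minimal sketch of recalibration with scikit-learn's CalibratedClassifierCV; the random forest and the synthetic dataset are just placeholders I chose for illustration. method="sigmoid" gives Platt scaling, method="isotonic" gives isotonic regression.

```python
# Minimal sketch (my own example): recalibrate a classifier with
# CalibratedClassifierCV. method="sigmoid" is Platt scaling;
# method="isotonic" fits an isotonic regression instead.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each CV fold trains the forest, then fits the calibrator on the held-out part.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid",  # or "isotonic"
    cv=3,
)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
```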
If you want to learn more about probability calibration and how to recalibrate classifiers, check out my course Machine Learning with Imbalanced Data.
I hope this information was useful!
Wishing you a successful week ahead - see you next Monday! 👋🏻
Sole