Mounting evidence suggests that there are better ways to tackle imbalanced datasets that do not involve resampling.
For one, we can use strong classifiers like XGBoost and CatBoost.
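A minimal sketch of this first point, assuming xgboost and scikit-learn are installed; the simulated data and hyperparameters are illustrative placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Simulated imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# A strong gradient boosting classifier, trained without any resampling.
model = XGBClassifier(n_estimators=300, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Work with probabilities rather than hard 0/1 predictions.
proba = model.predict_proba(X_test)[:, 1]
```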
Then, we need to optimize the decision threshold for classification, instead of just using the default of 0.5.
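Continuing with `proba` and `y_test` from the sketch above, here is one possible way to search for the threshold that maximizes F1 (recent versions of scikit-learn, 1.5 and up, also ship TunedThresholdClassifierCV to automate this search):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# proba and y_test come from the previous sketch.
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# F1 at every candidate threshold (small epsilon avoids division by zero).
f1 = 2 * precision * recall / (precision + recall + 1e-12)

# The last precision/recall pair has no matching threshold, hence f1[:-1].
best_threshold = thresholds[np.argmax(f1[:-1])]
y_pred = (proba >= best_threshold).astype(int)
```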
There is also cost-sensitive learning, which optimizes the model at no extra cost: it comes baked into most model implementations.
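Most implementations expose this through a class-weight style parameter. A quick sketch; the weight value for XGBoost is illustrative:

```python
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# scikit-learn estimators: weight errors inversely to class frequency.
logit = LogisticRegression(class_weight="balanced", max_iter=1000)

# XGBoost: scale_pos_weight is commonly set to n_negatives / n_positives.
xgb = XGBClassifier(scale_pos_weight=19)  # illustrative value for a 95/5 split
```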
🤔 So then, when should we use resampling?
👉 Resampling can still be used when a strong classifier is not an option, say because of legacy systems, or for any other reason. It has been shown to improve the performance of weaker learners like random forests, AdaBoost, SVMs, and MLPs.
👍 Resampling can also be useful if the model outputs only a class label, not a probability (see the sketch below).
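If you do reach for resampling, a minimal sketch using the imbalanced-learn package might look like this (reusing X_train and y_train from the first sketch; SVC stands in for the weaker learners mentioned above):

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

# Oversampling happens only when fitting, never on the test data.
pipe = Pipeline([
    ("resample", RandomOverSampler(random_state=0)),
    ("svm", SVC()),  # a weaker learner that tends to benefit from resampling
])
pipe.fit(X_train, y_train)
```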
💡 So, as always, there is no one-size-fits-all solution. Depending on the project, the model, and the data, resampling may be a tool we can use, or it may be better to stay away from it.
I hope this information was useful!
Wishing you a successful week ahead - see you next Monday! 👋🏻
Sole