Ensemble methods combine machine learning models through bagging or boosting. Classical examples are random forests and AdaBoost, respectively.
They are known for delivering better results than their individual base models.
And bagging and boosting can be combined with undersampling or oversampling to improve performance on imbalanced datasets.
I compared bagging and boosting ensembles on 27 imbalanced datasets, and this is what I found: 👇
- In 7 datasets, AdaBoost clearly outperformed random forests, while random forests never outperformed AdaBoost.
This was expected, because boosting generally produces stronger models than plain bagging.
- Scikit-learn's implementation of GBM outperformed random forests in only 4 datasets.
🧐 I find this a bit unusual, because GBMs are generally better than AdaBoost, so I expected at least the same success as AdaBoost. Maybe the hyperparameter space we searched was not optimal. It might be worth comparing the results with those of XGBoost. I leave this with you.
- Balanced random forests and EasyEnsemble outperformed AdaBoost in 8 and 10 datasets, respectively.
👉 These 2 models are quite promising, and the good news is that they are relatively fast to train.
- Bagging of AdaBoost and RUSBoost showed good overall performance, but it is less clear whether they are significantly superior to AdaBoost.
👉 And those models are quite costly to train, so unless we have access to good computing resources, we can stick to balanced random forests and EasyEnsemble.
Models that combine bagging and boosting with undersampling are available in the open source Python library imbalanced-learn.
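Here's a minimal sketch of how you could fit the two models I'd recommend, balanced random forests and EasyEnsemble, using imbalanced-learn's scikit-learn-style API. The toy dataset, class imbalance ratio, and hyperparameters below are just for illustration, not the setup from my comparison:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Toy imbalanced dataset with roughly 5% positives (illustrative values only).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Random forest where each tree is grown on a class-balanced bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
print("Balanced RF ROC-AUC:", roc_auc_score(y_test, brf.predict_proba(X_test)[:, 1]))

# Bag of AdaBoost learners, each trained on a randomly undersampled subset.
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0)
easy.fit(X_train, y_train)
print("EasyEnsemble ROC-AUC:", roc_auc_score(y_test, easy.predict_proba(X_test)[:, 1]))
```

Both classifiers follow the usual fit / predict / predict_proba interface, so they slot straight into scikit-learn pipelines and cross-validation.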
If you want to know more about how these algorithms work and what else you can do to tackle imbalanced data, check out my course.
I hope this information was useful!
Wishing you a successful week ahead - see you next Monday! 👋🏻
Sole