Ensemble methods combine machine learning models through bagging or boosting. Classical examples are random forests and AdaBoost, respectively.
They are known for delivering better results than their individual base models.
And bagging and boosting can be combined with undersampling or oversampling to improve performance on imbalanced datasets.
I compared bagging and boosting ensembles on 27 imbalanced datasets, and this is what I found: 👇
- In 7 datasets, AdaBoost clearly outperformed random forests, while random forests never outperformed AdaBoost.
This was expected, because boosting generally produces stronger models than plain bagging.
- Scikit-learn's implementation of GBM outperformed random forests in only 4 datasets.
🧐 I find this a bit unusual, because GBMs are generally better than AdaBoost, so I expected at least the same success as AdaBoost. Maybe the hyperparameter space we searched was not optimal. It might be worth comparing the results with those of XGBoost. I leave this with you.
- Balanced random forests and EasyEnsemble outperformed AdaBoost in 8 and 10 datasets, respectively.
👉 These 2 models are quite promising, and the good news is that they are relatively fast to train.
- Bagging of AdaBoost and RUSBoost showed good overall performance, but it is less clear whether they are significantly superior to AdaBoost.
👉 And those models are quite costly to train, so unless we have access to good computing resources, we can stick to balanced random forests and EasyEnsemble.
Models that combine bagging and boosting with undersampling are available in the open source Python library imbalanced-learn.
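Here's a minimal sketch of how you could fit the two models I'd recommend, balanced random forests and EasyEnsemble, using imbalanced-learn's scikit-learn-style API. The toy dataset, class imbalance ratio, and hyperparameters below are just for illustration, not the setup from my comparison:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Toy imbalanced dataset with roughly 5% positives (illustrative values only).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Random forest where each tree is grown on a class-balanced bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
print("Balanced RF ROC-AUC:", roc_auc_score(y_test, brf.predict_proba(X_test)[:, 1]))

# Bag of AdaBoost learners, each trained on a randomly undersampled subset.
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0)
easy.fit(X_train, y_train)
print("EasyEnsemble ROC-AUC:", roc_auc_score(y_test, easy.predict_proba(X_test)[:, 1]))
```

Both classifiers follow the usual fit / predict / predict_proba interface, so they slot straight into scikit-learn pipelines and cross-validation.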
If you want to know more about how these algorithms work and what else you can do to tackle imbalanced data, check out my course.
I hope this information was useful!
Wishing you a successful week ahead - see you next Monday! 👋🏻
Sole