Train in Data, learn machine learning online

Bagging vs Boosting: Key Findings From 27 Imbalanced Datasets


Welcome to Data Bites!



Every Monday, I’ll drop a no-fluff, straight-to-the-point tip on a data science skill, tool, or
method to help you stay sharp in the field. I hope you find it useful!

Only 4 Days to Go...

Our Biggest Learning Event of the Year!

When we started Train in Data, our mission was simple: to make high-quality, practical machine learning education accessible to everyone.


Since then, thousands of learners have grown with us, mastering new tools, building real-world projects, and finding confidence in their skills.



Now, for the first time since we launched, we’re about to do something special to celebrate this journey with you. 🖤



In just 4 days, on November 28, we’ll be unveiling our first-ever Black Friday Sale: a full day dedicated to learning, growth, and opportunity.



You’ll get a 50% discount on all our courses, books, and specialisations... but only for one day.



✨ Be sure to check your inbox on November 28 to get your exclusive discount code and join the celebration.



Until then, take a moment to browse our courses and think about where you’d like your learning journey to go next.



📅 Mark your calendar: November 28 (one day only).

Plan Your Black Friday Learning

I compared bagging and boosting ensembles on 27 imbalanced datasets, and this is what I found.

Ensemble methods combine machine learning models through bagging or boosting. Classic examples are random forests and AdaBoost, respectively.



They are known for providing better results than the individual models they combine.



Both bagging and boosting can be combined with undersampling or oversampling to improve their performance on imbalanced datasets.
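As a rough illustration (not the exact setup from my experiments), here is a minimal sketch comparing a bagging ensemble (random forest) with a boosting ensemble (AdaBoost) on a synthetic imbalanced dataset; the dataset size, imbalance ratio, and hyperparameters below are placeholders.

# A minimal sketch: bagging (random forest) vs boosting (AdaBoost)
# on a synthetic imbalanced dataset, scored with cross-validated ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly a 95/5 class imbalance (illustrative values only).
X, y = make_classification(
    n_samples=5_000,
    n_features=20,
    weights=[0.95, 0.05],
    random_state=42,
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f}")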



I compared bagging and boosting ensembles on 27 imbalanced datasets, and this is what I found: 👇

  • In 7 datasets, AdaBoost clearly outperformed random forests, but random forests never outperformed AdaBoost.

We expected this, because boosting generally produces stronger models than plain bagging.

  • Scikit-learn's gradient boosting machine (GBM) outperformed random forests in only 4 datasets.
🧐 I find this a bit unusual, because GBMs generally outperform AdaBoost, so I expected at least the same success as AdaBoost. Maybe the hyperparameter space we searched was not optimal. It might be worth comparing the results with those of XGBoost (a minimal starting point is sketched after this list). I leave this with you.
  • Balanced random forests and EasyEnsemble outperformed AdaBoost in 8 and 10 datasets, respectively.
👉 These two models are quite promising, and the good news is that they are relatively fast to train.
  • Bagging of AdaBoost and RUSBoost showed good performance overall, but it is less clear whether their performance is significantly superior to that of AdaBoost alone.

👉 These models are also quite costly to train, so unless we have access to good computing resources, we can stick to balanced random forests and EasyEnsemble.
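If you want to take up the XGBoost comparison mentioned above, here is a hedged starting point; it assumes the xgboost package is installed, and the synthetic dataset and hyperparameters are placeholders rather than a tuned configuration.

# A minimal starting point for the XGBoost comparison suggested above
# (synthetic data and hyperparameters are placeholders, not a tuned setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic dataset with roughly a 95/5 class imbalance.
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=42)

# scale_pos_weight of about n_negative / n_positive is a common way
# to account for class imbalance in XGBoost.
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    scale_pos_weight=19,
    eval_metric="logloss",
)

scores = cross_val_score(xgb, X, y, scoring="roc_auc", cv=5)
print(f"xgboost: mean ROC-AUC = {scores.mean():.3f}")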



Models that combine bagging and boosting with undersampling are available in the open source Python library imbalanced-learn.
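For a quick sense of what these look like in code, here is a minimal sketch using the ensemble estimators from imbalanced-learn's imblearn.ensemble module; the hyperparameters are illustrative, not the values used in my experiments.

# A minimal sketch of the imbalanced-learn ensembles discussed above.
from imblearn.ensemble import (
    BalancedRandomForestClassifier,  # random forest with undersampling per tree
    EasyEnsembleClassifier,          # bagging of AdaBoost learners on undersampled subsets
    RUSBoostClassifier,              # boosting with random undersampling at each round
)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic dataset with roughly a 95/5 class imbalance (illustrative only).
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=42)

models = {
    "balanced_rf": BalancedRandomForestClassifier(n_estimators=200, random_state=42),
    "easy_ensemble": EasyEnsembleClassifier(n_estimators=10, random_state=42),
    "rusboost": RUSBoostClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f}")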



If you want to know more about how these algorithms work and what else you can do to tackle imbalanced data, check out my course.


I hope this information was useful!



Wishing you a successful week ahead - see you next Monday! 👋🏻


Sole

Ready to enhance your skills?

Our specializations, courses and books are here to assist you:

More courses

Did someone share this email with you? Think it's pretty cool? Then just hit the button and subscribe to Data Bites. Don’t miss out on any of our tips and propel your data science career to new heights.

Subscribe

Hi…I’m Sole



I’m the main instructor at Train in Data. My work as a data scientist includes creating and implementing machine learning models for evaluating insurance claims, managing credit risk, and detecting fraud. In 2018, I was honoured with a Data Science Leaders' award, and in 2019 and again in 2024, I was acknowledged as one of LinkedIn's voices in data science and analytics.

View

You are receiving this email because you subscribed to our newsletter, signed up on our website, or purchased or downloaded products from us.

Follow us on social media

Copyright (C) 2025 Train in Data. All rights reserved.

If you would like to unsubscribe, please click here.