Imbalanced data in Machine Learning Modeling

Overview

Imbalanced data refers to datasets in which the target outcome is unevenly distributed across categories, so that one category occurs far more often than the others. A classic example is a binary classification problem in which one class has many more observations than the other. This imbalance poses a modeling challenge: standard algorithms, which are usually designed for more evenly distributed data, tend to favor the more common class. The result is often poor performance in predicting the minority class, even when overall accuracy looks high. Imbalanced datasets are a practical concern in many fields, including research on adolescent health and drug use, where some outcomes or behaviors are uncommon but crucial to identify correctly.
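To make the problem concrete, here is a minimal base R sketch. The 95/5 class ratio and the "model" that always predicts the majority class are illustrative assumptions, not taken from the workshop data.

```r
# Simulated labels: 950 majority-class ("no") and 50 minority-class ("yes") cases.
truth <- factor(c(rep("no", 950), rep("yes", 50)), levels = c("no", "yes"))

# A naive "model" that ignores all predictors and always predicts the majority class.
pred <- factor(rep("no", length(truth)), levels = levels(truth))

# Overall accuracy: share of predictions that match the true label.
accuracy <- mean(pred == truth)

# Sensitivity (recall for the minority class): share of true "yes" cases found.
sensitivity <- sum(pred == "yes" & truth == "yes") / sum(truth == "yes")

table(predicted = pred, actual = truth)
accuracy     # 0.95
sensitivity  # 0
```

Accuracy alone hides the failure: the naive rule is right 95% of the time and never identifies a single minority case. This is why metrics such as sensitivity are usually reported alongside accuracy for imbalanced outcomes.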

In this workshop, we will explore modeling techniques and resampling methods suited to this challenge. Key topics include cross-validation, estimation of empirical error, and selection of model hyperparameters. Together, these topics give participants a practical foundation for handling imbalanced datasets.

This workshop will be held at the University of Miami on Tuesday, December 5, 2023. We will meet in room 1080A at the Don Soffer Clinical Research Center.

We will use the R programming language along with the tidymodels packages to implement our models. The workshop will include practical exercises and illustrative examples. Participants are expected to have a basic understanding of R and familiarity with regression analysis.

Workshop content

  1. Workshop document.

Organizers

Francisco Cardozo. University of Miami.
Eric C. Brown. University of Miami.
María Fernanda Reyes. Universidad de los Andes.

Preparation

  1. Open an account in Posit Cloud.

Dataset

We will use a dataset of elderly individuals from various regions across Colombia. The dataset is the result of extensive interviews conducted to assess each individual’s exposure to various forms of violence, including economic, sexual, psychological, and physical violence. It also records each participant’s overall well-being, mental health status, and functional abilities, together with a wide range of demographic characteristics.

Our goal is to identify key variables that may be associated with the experience of any form of violence within the last three years. This exploration aims to provide deeper insights into the factors influencing violence exposure among the elderly in Colombia, offering a unique opportunity for attendees to apply machine learning techniques to real-world, socially relevant data.

This dataset not only represents a significant resource for understanding the dynamics of violence among the elderly in Colombia but also serves as a valuable tool for developing predictive models that can aid in identifying at-risk individuals and inform targeted interventions.

The Mission

In this workshop, we will embark on a Tolkien-inspired journey to introduce the concepts and tools necessary for working with imbalanced data. The central quest for participants will be to collect three mythical rings, each symbolizing a key area of knowledge. To acquire these rings, participants must complete a series of tasks, each designed to provide practical experience and deepen understanding of the critical concepts required to effectively manage imbalanced datasets.

First Ring - The Two Paths of Destiny

Divide the data into two groups: one to train the model and another to test it.

In an age long past, when the world was still uncharted, the First Ring emerged, gleaming with the promise of unexplored knowledge. It was said to hold the power to split the fabric of reality into two distinct paths: one leading through the verdant forests of Training, lush with learning and growth, and the other winding into the misty valleys of Testing, where truth is revealed in the shadows. Those who embark on this journey, much like the brave heroes of old, are fated to traverse these paths, each step a new chapter in their quest for wisdom.
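In practice, the two paths can be sketched with the rsample package from tidymodels. The data below are simulated, and the column names `age` and `violence` are hypothetical stand-ins rather than the workshop dataset. The sketch assumes a stratified split is wanted so that the rare outcome appears in both sets in similar proportion.

```r
library(rsample)

set.seed(2023)

# Simulated stand-in data: 850 "no" and 150 "yes" cases (hypothetical columns).
# The minority share is kept at 15% because rsample may pool very small
# strata (see the `pool` argument of initial_split()).
n <- 1000
toy <- data.frame(
  age      = round(runif(n, 60, 95)),
  violence = factor(rep(c("no", "yes"), times = c(850, 150)),
                    levels = c("no", "yes"))
)

# Stratified split: roughly 75% of each class goes to training, the rest to testing.
split      <- initial_split(toy, prop = 0.75, strata = violence)
train_data <- training(split)
test_data  <- testing(split)

# Compare the class proportions in both sets.
prop.table(table(train_data$violence))
prop.table(table(test_data$violence))
```

Without stratification, a random split of a small, imbalanced dataset can leave the test set with very few minority cases, which makes the test metrics unstable.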

Second Ring - The Mirror of Fates

Validate the model with the data that was not used to train it.

The Second Ring was forged in the hidden realms, where magic and mystery intertwine. It was no ordinary artifact, for it held a mirror reflecting not just images but destinies. In the hands of the learned, this ring divided the waters of Training, creating a silent pool of Validation, serene yet profound. Here, seers and sages could gaze into the depths, discerning the hidden strengths and weaknesses of their creations, much like the ancient oracles who, in their sacred groves, saw truths beyond the ken of mortals.
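One common way to carve a validation pool out of the training data is k-fold cross-validation. The sketch below uses `vfold_cv()` from rsample on simulated data. The column names, the 15% minority share, and the choice of five folds are assumptions made for illustration.

```r
library(rsample)

set.seed(2023)

# Simulated training data (hypothetical columns): 510 "no" and 90 "yes" cases.
train_data <- data.frame(
  age      = round(runif(600, 60, 95)),
  violence = factor(rep(c("no", "yes"), times = c(510, 90)),
                    levels = c("no", "yes"))
)

# Five stratified folds: each fold keeps a similar share of the rare class.
folds <- vfold_cv(train_data, v = 5, strata = violence)

# For every fold, the model would be fitted on analysis() and checked on assessment().
fold_summary <- do.call(rbind, lapply(folds$splits, function(s) {
  data.frame(
    n_fit             = nrow(analysis(s)),
    n_validate        = nrow(assessment(s)),
    prop_yes_validate = mean(assessment(s)$violence == "yes")
  )
}))

fold_summary
```

Each observation is used for validation exactly once, so the average of the fold-level metrics estimates how the model behaves on data it has not seen, without touching the test set.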

Third Ring - The Balance of Light and Shadow

The balance between bias and variance is the key to the power of the model.

In the great halls of knowledge, the Third Ring was revered as the most elusive and powerful. It was the embodiment of the eternal struggle between Light and Shadow, an artifact that sought the perfect equilibrium. In the hands of a master, it could weave together the threads of Variance and Bias, creating a tapestry as balanced and harmonious as the cycle of day and night. The quest for this balance was akin to a legendary saga, where heroes journeyed through realms of dazzling brilliance and profound darkness, seeking the ancient wisdom that would bring peace to the land.
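The trade-off can be illustrated with a small base R simulation. This is a regression sketch under assumed settings (a sine-shaped signal, noise with standard deviation 0.3, polynomial degrees 1 to 8), chosen only because it makes the pattern easy to see. It is not the workshop's classification model.

```r
set.seed(42)

# Assumed true signal plus noise; both are illustrative choices.
f <- function(x) sin(2 * pi * x)
make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}

train <- make_data(40)    # small training set
test  <- make_data(1000)  # large test set to estimate out-of-sample error

# Fit polynomials of increasing flexibility and record both errors.
degrees <- 1:8
results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  data.frame(
    degree    = d,
    train_mse = mean((train$y - predict(fit, train))^2),
    test_mse  = mean((test$y  - predict(fit, test))^2)
  )
}))

results
```

Training error can only go down as the degree grows, because each model contains the previous one. Test error typically falls at first (less bias) and then stops improving or rises (more variance), and the degree where it bottoms out marks the balance this ring is about.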

Secret ring: Decision trees are very good at learning the distribution of the training data, but they tend to over-specialize on it, that is, to overfit. One way to address this is to grow many trees and average their predictions.
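The idea of averaging many trees can be sketched by hand with the rpart package. This is plain bagging (bootstrap aggregation) on simulated data, not a full random forest, since it does not sample predictors at each split. The variables `x1`, `x2`, and `y`, the number of trees, and the tree settings are all assumptions for illustration.

```r
library(rpart)

set.seed(7)

# Simulated data with a rare "yes" outcome (hypothetical predictors x1 and x2).
n  <- 600
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-2.5 + 1.5 * x1 - 1 * x2)
y  <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))
dat   <- data.frame(x1, x2, y)
train <- dat[1:400, ]
test  <- dat[401:600, ]

# Settings that let each tree grow deep enough to over-specialize.
deep <- rpart.control(cp = 0, minsplit = 2, minbucket = 1)

# One deep tree fitted on the full training set.
single_tree <- rpart(y ~ x1 + x2, data = train, method = "class", control = deep)
p_single    <- predict(single_tree, test, type = "prob")[, "yes"]

# Many deep trees, each fitted on a bootstrap resample of the training set.
n_trees <- 100
p_boot <- sapply(seq_len(n_trees), function(b) {
  idx  <- sample(nrow(train), replace = TRUE)
  tree <- rpart(y ~ x1 + x2, data = train[idx, ], method = "class", control = deep)
  predict(tree, test, type = "prob")[, "yes"]
})

# Average the predicted probabilities across trees.
p_bagged <- rowMeans(p_boot)

# Brier score (mean squared error of the probabilities): lower is better.
brier <- function(prob, truth) mean((prob - as.numeric(truth == "yes"))^2)
c(single = brier(p_single, test$y), bagged = brier(p_bagged, test$y))
```

A single deep tree tends to output probabilities close to 0 or 1, while the averaged ensemble usually gives smoother and more stable estimates. Random forests build on the same idea and add random predictor selection at each split.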

Additional Resources

  • tidyverse
  • tidymodels

Code of Conduct

This is our Code of Conduct (COC/coc.html). In summary, attendees of the workshop commit to creating and respecting an inclusive and safe environment for all participants. Unacceptable behaviors such as discrimination and harassment will not be tolerated. Any violation of these rules may result in actions ranging from warnings to expulsion from the event.

Past workshops

Future workshops

  • Mastering Effective Data Visualization with ggplot2 in R: A Hands-on Workshop for Social Work Researchers.

Acknowledgment

This workshop is possible thanks to the University of Miami, especially PERLA, the Frost Institute for Data Science and Computing, and the Universidad de los Andes. If you want to improve this workshop or fix something, please do so on GitHub.

License

This material is released under the CC BY-SA 4.0 license.