Abstract
Class imbalance is one of the biggest obstacles in supervised machine learning. The problem arises when the class distribution of the training data is skewed, i.e., one or more classes are underrepresented relative to the others. Class imbalance is far more common in real-world data than one might expect. It poses a dilemma: rare events are usually the ones of greater interest, and misclassifying them can be costly, yet they are hard to learn from because they are so sparsely represented among the observations. Over the past few decades, researchers have addressed class-imbalanced data by developing new methods or by modifying standard algorithms. The common goal is to extract as much information as possible from the limited minority examples while diminishing the influence of the overwhelming majority class. Besides classification, conditional probability estimation (CPE) is another essential task in supervised machine learning. Accurate probability estimates are vital in decision-making because they are interpretable and informative. In applications such as clinical medicine, both patients and healthcare professionals want to know the probability of a particular medical condition rather than an uncalibrated raw score produced by a model. A primary concern with most current methods for imbalanced data is that they focus heavily on classification and neglect CPE. Many of them distort the data distribution by removing or adding points, or by reassigning weights, according to ad hoc criteria. The probability estimates or probability-like scores produced by classifiers usually need calibration even when the classes are balanced, let alone when the classes are imbalanced and the data have been further altered by such adjustments. This thesis aims to show that CPE is a more demanding, yet more inclusive and essential, task than classification.
The latter is a trivial matter of thresholding if the former is ensured, whereas the converse is not true.
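The asymmetry between the two tasks can be illustrated with a minimal sketch. It is not taken from the thesis: the function name, the cost parameters, and the probability values below are hypothetical, chosen only for illustration. Given estimates of P(y = 1 | x), a label follows from one comparison against a threshold, here the standard cost-sensitive threshold cost_fp / (cost_fp + cost_fn). Two very different sets of estimates can yield identical labels, so the labels alone cannot recover the probabilities.

```python
from typing import List, Sequence


def classify_from_probabilities(
    probs: Sequence[float],
    cost_fp: float = 1.0,  # hypothetical cost of a false positive
    cost_fn: float = 1.0,  # hypothetical cost of a false negative
) -> List[int]:
    """Turn estimates of P(y=1 | x) into 0/1 labels by thresholding.

    Predicting 1 minimizes expected cost when
    p * cost_fn >= (1 - p) * cost_fp, i.e. p >= cost_fp / (cost_fp + cost_fn).
    """
    threshold = cost_fp / (cost_fp + cost_fn)
    return [1 if p >= threshold else 0 for p in probs]


# Two hypothetical sets of probability estimates for the same four inputs.
probs_a = [0.05, 0.30, 0.60, 0.90]  # confident estimates
probs_b = [0.45, 0.49, 0.51, 0.55]  # estimates hovering near 0.5

# With equal costs (threshold 0.5) both sets give the same labels,
# so the labels cannot tell the two sets of estimates apart.
labels_a = classify_from_probabilities(probs_a)
labels_b = classify_from_probabilities(probs_b)

# With a costly false negative (threshold 0.1) the decisions diverge,
# which only the probability estimates can reveal.
rare_a = classify_from_probabilities(probs_a, cost_fp=1.0, cost_fn=9.0)
rare_b = classify_from_probabilities(probs_b, cost_fp=1.0, cost_fn=9.0)
```

Under these assumptions, changing the misclassification costs only moves the threshold. The probability estimates are reused unchanged, whereas a model that outputs labels alone would have to be retrained.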