Abstract
Semi-supervised learning (SSL) on imbalanced data is largely under-explored and suffers from erroneous pseudo-labels, biased model training, or intolerable training costs. To alleviate these issues, we propose a meta-distribution-based ensemble sampler (MDSampler) approach for imbalanced SSL (code is publicly available on GitHub: https://github.com/CUHKSZ-NING/MDSampler). MDSampler is a unified framework that integrates SSL, imbalanced learning, and ensemble learning via iterative instance under-sampling and cascade classifier aggregation. Specifically, MDSampler considers the confidence-diversity distribution of both labeled and unlabeled samples and obtains the so-called meta-distribution via 2-D histogram discretization. Sampling on the meta-distribution (1) assigns pseudo-labels to unlabeled data for SSL, (2) alleviates class imbalance since the sampling process is unbiased, (3) improves the diversity of the ensemble learning framework, and (4) is highly efficient and flexible. Additionally, an adaptive instance interpolation strategy is presented to improve the quality of pseudo-labeled samples. Extensive experiments show that MDSampler can be readily combined with various classifiers to achieve superior performance in imbalanced SSL.
Highlights
• MDSampler samples on the meta-distribution for imbalanced semi-supervised learning.
• MDSampler unifies semi-supervised, imbalanced, and ensemble learning techniques.
• MDSampler conducts iterative data resampling and cascade classifier aggregation.
• MDSampler creates synthetic minority samples via adaptive instance interpolation.
• MDSampler can be organically and efficiently combined with various base classifiers.
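
The idea of sampling on a 2-D histogram of confidence and diversity scores, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_on_meta_distribution`, the equal-width binning, and the inverse-cell-population weighting are all assumptions made for illustration, and the confidence and diversity arrays are treated as precomputed inputs because their exact definitions are not given in this section.

```python
import numpy as np


def sample_on_meta_distribution(confidence, diversity, n_samples, n_bins=5, rng=None):
    """Hypothetical helper: draw indices spread evenly over a 2-D histogram grid."""
    rng = np.random.default_rng(rng)
    confidence = np.asarray(confidence, dtype=float)
    diversity = np.asarray(diversity, dtype=float)

    # Discretize each axis into n_bins equal-width bins (ids 0 .. n_bins - 1).
    # Only the interior edges are passed to digitize, so the extremes stay in range.
    c_edges = np.linspace(confidence.min(), confidence.max(), n_bins + 1)
    d_edges = np.linspace(diversity.min(), diversity.max(), n_bins + 1)
    c_bin = np.digitize(confidence, c_edges[1:-1])
    d_bin = np.digitize(diversity, d_edges[1:-1])
    cell = c_bin * n_bins + d_bin  # flat index of the 2-D histogram cell

    # Assumption: weight each instance by 1 / (population of its cell), so every
    # non-empty cell carries the same total probability mass.
    counts = np.bincount(cell, minlength=n_bins * n_bins)
    weights = 1.0 / counts[cell]
    weights /= weights.sum()

    # Sample without replacement; never ask for more instances than exist.
    n_samples = min(n_samples, confidence.size)
    return rng.choice(confidence.size, size=n_samples, replace=False, p=weights)


# Toy usage: 95 instances crowd one cell, 5 instances sit in another.
conf = np.concatenate([np.full(95, 0.9), np.full(5, 0.1)])
div = np.concatenate([np.full(95, 0.2), np.full(5, 0.8)])
idx = sample_on_meta_distribution(conf, div, n_samples=10, n_bins=2, rng=0)
```

Under this weighting, instances in sparsely populated cells are drawn with higher individual probability than instances in crowded cells, which is one simple way to under-sample dominant regions of the score space. In an ensemble setting, a sketch like this would be called once per iteration to build the training subset for each base classifier.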