Abstract
Random forest (RF) is an ensemble machine learning method that offers many advantages, including the ability to measure variable importance using the out-of-bag samples generated during model construction. Class balancing techniques are a well-known solution to the class imbalance problem; however, they have not been actively studied in combination with the RF permutation variable importance measure based on the area under the receiver operating characteristic curve (pAUC). Our simulation results show that combining pAUC with under-sampling, over-sampling, and the synthetic minority over-sampling technique (SMOTE) is beneficial for measuring variable importance under class imbalance, appropriately discriminating associated predictors from noise predictors. We then propose a novel variable selection algorithm that combines pAUC with class balancing techniques to target data with a skewed outcome. Our experimental studies demonstrate that the proposed algorithm, especially with over-sampling and SMOTE, effectively selects a subset of features, leading to improved prediction performance under class imbalance.