Abstract
Today, more and more practical problems are moving beyond the variable selection framework to focus on prediction at the subject or (small) sub-population level. Mixed model prediction (MMP) has been widely used in the health sciences and business, as well as in traditional fields such as genetics and agriculture. Significant increases in prediction accuracy can be achieved by identifying the group to which a new subject belongs, as was done in the Classified Mixed Model Prediction (CMMP) work of Jiang et al. (2018). New and challenging problems have emerged as the availability of data varies. When data are massive, high dimensional modeling becomes challenging because the number of predictors far exceeds the sample size. When training data are limited, the new data of prediction interest can fall outside the range of the training data. These settings give rise to two topics: classified mixed model prediction in high dimensional space, and prediction through extrapolation given a limited set of observations.
In the first part of the thesis, a new method called High Dimensional Classified Mixed Model Prediction (HDCMMP) is proposed, which allows classified mixed model prediction with high dimensional predictors. The method uses a four-step algorithm: a mixed model variable screening step, penalized estimation of both fixed and random effects, re-estimation under the restricted model, and CMMP prediction. Importantly, this work also extends the methodology to the challenging case in which the grouping of observations is unknown. Asymptotic and empirical studies are carried out, which demonstrate favorable properties of HDCMMP. Finally, an analysis of riboflavin production data provided by DSM (Kaiseraugst, Switzerland) shows the utility of the new methodology in practice.
In the second part of the thesis, a new method called Prediction Using Random-effects Extrapolation (PURE) is introduced. PURE involves constructing a generalized independent variable hull (gIVH) to isolate a minority training set that is "close" to the test data, followed by regrouping the minority data according to the response variable, which results in a new (but misspecified) random effect distribution. This misspecification mimics what we term "extrapolating random effects", which proves vital for capturing the information needed for accurate model projections. Projections are then made using CMMP with the regrouped minority data. Simulation studies and an analysis of data from the National Longitudinal Mortality Study (NLMS) show remarkable gains in prediction accuracy in these very challenging settings.