Abstract
As massive quantities of data are generated across all fields, subsampling has attracted growing attention because it allows useful information and knowledge to be extracted more efficiently and effectively. Subsampling saves computational resources without discarding important information, and it is essential for extraordinarily large datasets where full-data computation is infeasible. Various methods have been developed in this area, both supervised and unsupervised. Supervised methods sample data from the regions most informative for modeling the desired input-output relationship, while unsupervised methods make the best use of the design matrix alone. This work focuses on unsupervised subsampling techniques, which address the more general case where the response variable is unavailable. We develop a novel information-based subsampling method built on the D-optimality criterion, which minimizes estimation error and prediction error in the linear regression setting and has wide application in the design of experiments. Our starting point is information-based optimal sub-data selection (IBOSS), which outperforms random-subsampling methods, sometimes by orders of magnitude (Wang et al., 2018). We first propose principal component IBOSS, which improves IBOSS by handling the correlation structures present in real data. We then focus on the D-optimality maximization problem and develop several ways to solve it. First, we observe that the D-optimal subsampling problem can be cast as a determinant maximization problem, which can be solved using Lagrange multipliers and interior-point methods. Second, we derive an upper bound on the information and propose several algorithms that approach D-optimality. Finally, we show the benefit of a prescreening procedure applied before subsampling when the data contain noise.