Abstract
Crowdsourced information systems have made network data, including biological networks, social networks, and brain networks, ubiquitous in our daily lives. In recent years, there have been many attempts to gain insights from these complex network data by performing network learning tasks such as node classification, cluster detection, and link prediction. However, real-world data are complicated, as cross-domain interactions between different types of entities are widely observed. Analyzing these heterogeneous information networks therefore sheds light on multiple aspects of complex systems. Yet open issues remain, especially in learning representations for multilayer networks. In particular, embedding nodes based on different types of interactions remains challenging. In this dissertation, I implement several network representation learning techniques and use them to study the link prediction, node classification, and knowledge completion tasks in complex networks. Network representation learning (NRL) aims at learning a projection from the original network data, including nodes and edges, to a low-dimensional vector space while preserving a variety of structural and semantic features. The resulting vector representations can effectively support a wide range of tasks such as node classification, node clustering, link prediction, and graph classification. Recent years have witnessed a surge of approaches that automatically learn to encode network topological structures into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction.
The subsequent chapters detail NRL techniques in heterogeneous and multilayer networks. Chapter 1 gives a background introduction to current network representation learning techniques. Chapter 2 presents a statistical analysis of various types of meta-paths in academic networks, learning a weight for each individual meta-path. Chapter 3 discusses a random walk-based algorithm that generates meta-paths in the networks. The sequences of nodes along these paths are then treated as words in sentences and fed to the Skip-gram or the Continuous Bag of Words (CBOW) model to learn the node representations automatically. This process is regarded as self-supervised machine learning because it does not require manually labeled data. After that, I integrate protein amino acid features from the UniProt database so that intrinsic protein features are taken into account. As part of this process, Chapter 4 studies automatic feature extraction from raw protein sequences using a one-dimensional (1D) convolutional neural network (CNN) framework. Finally, Chapters 5 and 6 examine multilayer network embedding. Compared to simple network models, the study of complex networks is still a young and active area of scientific research (since 2000). Therefore, I propose a Motif Aware Deep Representation Learning framework for the Multilayer Networks (MARML) to learn the network representations. The framework takes into account the recurring motif patterns and the topological proximity among triplets within the distinct layers of the networks. In Chapter 6 in particular, I examine the drug repositioning and drug-target prediction challenges using the Hybrid Neural Tensor Network (Hybrid NTN). The results show that integrating drug-disease and drug-target networks can boost the interaction prediction accuracy, novel drug prediction precision, and novel target prediction performance.
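As an illustration of the random walk step summarized for Chapter 3, the following minimal sketch generates a meta-path-guided walk on a toy academic network and turns it into (center, context) pairs of the kind a Skip-gram model consumes. It is not the dissertation's implementation: the function names (metapath_walk, skipgram_pairs), the toy author/paper graph, and the uniform choice among type-matching neighbors are all assumptions made for this example, and the Skip-gram training itself is omitted.

```python
import random

def metapath_walk(adj, node_type, start, metapath, walk_length, rng):
    """Random walk that only steps to neighbors whose type matches the
    next symbol of a cyclic meta-path such as ["A", "P", "A"].
    (Hypothetical helper; uniform neighbor sampling is an assumption.)"""
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path symbol")
    pattern = metapath[:-1]  # drop the repeated end type so the pattern cycles
    walk = [start]
    for step in range(1, walk_length):
        wanted = pattern[step % len(pattern)]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == wanted]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk

def skipgram_pairs(walk, window):
    """Treat the walk as a sentence and emit (center, context) pairs
    for every node within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# Toy academic network (assumed data): authors a1, a2 and papers p1, p2.
adj = {
    "a1": ["p1"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2"],
    "p2": ["a2"],
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}

walk = metapath_walk(adj, node_type, "a1", ["A", "P", "A"], 5, random.Random(0))
pairs = skipgram_pairs(walk, window=2)
```

In practice the pairs (or the raw walks) would be passed to a Skip-gram or CBOW trainer; the pair generation is shown only to make the "nodes as words, walks as sentences" analogy concrete.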
Finally, I conclude the dissertation and envision future work in Chapter 7 and Chapter 8. My studies of these methods show that encoding networks into a continuous vector space helps with understanding different aspects of social life, such as the structure of societies, information diffusion, and communication patterns.