Workshop: Semi-Supervised Method for Countering Batch Effect
Abstract: Predictive modeling of patient tumor growth response to drug treatment is severely limited by a lack of experimental data for training. Combining experimental data from several studies is an attractive approach, but presents two problems: batch effects and limited data for individual drugs and cell lines. Batch effects are caused by systematic procedural differences among studies, which causes systematic experimental outcome differences. Directly using these experimental results as features for machine learning commonly causes problems when training on one study and testing on another. This severely limits a model’s ability to perform well on new experiments. Even after combining studies, predicting outcomes on new patient tumors remains an open challenge.
We propose a semi-supervised, autoencoder-based, machine learning procedure, which learns a smaller set of gene expression features that are robust to batch effects using background information on a cell line or tissue’s tumor type. We implemented this reduced feature representation and show that the new feature space clusters strongly according to tumor type. This experiment is carried out across multiple studies: CCLE, CTRP, gCSI, GDSC, NCI60, and patient derived tumors. We hypothesize that using a batch effect resistant feature set across studies will improve prediction performance.
Genome Data Commons (GDC) gene expression profiles for publicly available human tissues and cell lines from NCI60 and CCLE were processed using the semi-supervised learning procedure. Our autoencoder repurposes the ‘center loss’ (CL) cost function of Wen et. al. to learn a more generalized set of features using the cell line or tissue’s tumor type. Classification is performed by branching network the ‘pinch’ layer of the autoencoder. The ‘pinch’ layer now gets fed into a classification layer as well as the decoder portion of the autoencoder.
The new cost function balances the reconstruction performance, with the classification and ‘center loss’ performance. Reconstruction performance ensures that the ‘pinch’ layer retains information about original gene expression while classification performance shapes the space so tumors of the same type of close together regardless of the source study. Using the ‘pinch’ layer as new features reduces the number of features from 17,000 genes to approximately 1000 features or as few as 20 features. The performance of this method is compared with traditional batch correction methods (e.g. ComBat). Before applying these methods, individual samples clustered more strongly along study, a property that is not useful in many machine learning applications. We compare the new features from our ‘center loss’ autoencoder and ComBat using Silhouette score, the Calinski – Harabasz index, and the Davies – Bouldin index. All metrics show that using the prosed ‘center loss’ autoencoder features provide a latent space with better clusters than applying ComBat.