Sijian Fan

My name is 范思键 (fàn sī jiàn). I am currently a Ph.D. student in Statistics at the University of South Carolina, advised by Dr. Ray Bai. Before that, I obtained my M.S. degree in Biostatistics from Emory University where I was advised by Dr. Benjamin Risk, and my B.S. degree in Applied Bioscience from Zhejiang University.

My research focuses on Bayesian statistical learning and optimization for high-dimensional structured data, with an emphasis on biclustering and inductive matrix completion. I am also interested in developing scalable algorithms in computational biology and biomedical data analysis with strong theoretical guarantees. If you would like to contact me, please feel free to reach out.

news

Apr 30, 2025	Honored to receive the 2025 Huynh-Feldt Award, presented by Sarah Schroeder and Charlotte Dunn, daughters of Huynh Huynh.
Apr 18, 2025	I received the Best Student Talk Award at the 2025 Palmetto Symposium for our binary spike-and-slab lasso biclustering project (BiSSLB).

selected publications

arXiv
BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery

Sijian Fan, Liyan Xiong, Dayuan Wang, Guoshuai Cai, and Ray Bai

2026

Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features.
@misc{fan2026bvsimcbayesianvariableselectionguided,
  title = {BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery},
  author = {Fan, Sijian and Xiong, Liyan and Wang, Dayuan and Cai, Guoshuai and Bai, Ray},
  year = {2026},
  eprint = {2603.18957},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
}
arXiv
BiSSLB: Binary Spike-and-Slab Lasso Biclustering

Sijian Fan and Ray Bai

2026

Binary biclustering is a crucial analytical technique for identifying local patterns in binary data matrices, with applications spanning genomics, text mining, and market analysis. In this study, we propose a novel statistical methodology for binary biclustering called Binary Spike-and-Slab Lasso Biclustering (BiSSLB), which enhances both accuracy and interpretability. Our approach is based on a logistic matrix factorization model with spike-and-slab lasso priors that enable adaptive shrinkage of latent spaces to exact zeros. To automatically determine the number of biclusters, we incorporate Indian Buffet Process (IBP) priors, which induce column-wise sparsity in the latent space. Furthermore, we employ a highly efficient coordinate descent method with proximal steps, allowing for scalable estimation in large-scale real-world applications.

To assess the effectiveness of our method, we conduct extensive comparisons against established biclustering techniques, including Bimax, BiBit, iBBiG, and GBC. Performance is evaluated using both simulated datasets and real gene expression datasets, with key metrics including clustering error (CE), consensus score (CS), recovery, relevance, sensitivity, specificity, Matthews correlation coefficient (MCC), and the number of biclusters (K). Experimental results demonstrate that our proposed method accurately estimates the true number of biclusters, regardless of overlapping regions or background noise. Additionally, our method consistently outperforms state-of-the-art approaches in binary biclustering, achieving higher CE, CS, and MCC, three comprehensive metrics for overall comparison.

The proposed methodology advances binary biclustering by offering improved bicluster determination, enhanced robustness to noisy data, and greater interpretability in binary gene expression analysis. Future research directions include integrating side information into the model and extending it to multi-class data, further enhancing its applicability for differential gene expression analysis, as well as broader domains such as recommender systems, drug discovery, and bibliometric studies.
@misc{fan2026bisslbbinaryspikeandslablasso,
  title = {BiSSLB: Binary Spike-and-Slab Lasso Biclustering},
  author = {Fan, Sijian and Bai, Ray},
  year = {2026},
  eprint = {2603.18378},
  archiveprefix = {arXiv},
  primaryclass = {stat.ME},
}
Caries Res.
Predictors of Developmental Defects of Enamel in Primary Maxillary Central Incisors Using Bayesian Model Selection

Susan G Reed, Sijian Fan, Carol L Wagner, and Andrew B Lawson

Caries Research, 2024

Introduction: Localized non-inheritable developmental defects of tooth enamel (DDE) are classified as enamel hypoplasia (EH), opacity (OP), and post-eruptive breakdown (PEB) using the enamel defects index. To better understand the etiology of DDE, we assessed the linkages amongst exposome variables for these defects during the specific time duration for enamel mineralization of the human primary maxillary central incisor enamel crowns. In general, these two teeth develop between 13 and 14 weeks in utero and 3–4 weeks postpartum of a full-term delivery, followed by tooth eruption at about 1 year of age.

Methods: We utilized existing datasets for mother–child dyads that encompassed 12 weeks’ gestation through birth and early infancy, and child DDE outcomes from digital images of the erupted primary maxillary central incisor teeth. We applied a Bayesian modeling paradigm to assess the important predictors of EH, OP, and PEB.

Results: The results of Gibbs variable selection showed a key set of predictors: mother’s prepregnancy body mass index (BMI); maternal serum concentrations of calcium and phosphorus at gestational week 28; child’s gestational age; and both mother’s and child’s functional vitamin D deficiency (FVDD). In this sample of healthy mothers and children, significant predictors for OP included the child having a gestational period greater than 36 weeks and FVDD at birth, and for PEB included a mother’s prepregnancy BMI less than 21.5 and higher serum phosphorus concentration at week 28.

Conclusion: Our methodology and results provide a roadmap for assessing timely biomarker measures of exposures during specific periods of tooth development to better understand the etiology of DDE for future prevention.
@article{reed2024predictors,
  title = {Predictors of Developmental Defects of Enamel in Primary Maxillary Central Incisors Using Bayesian Model Selection},
  author = {Reed, Susan G and Fan, Sijian and Wagner, Carol L and Lawson, Andrew B},
  journal = {Caries Research},
  volume = {58},
  number = {1},
  pages = {30--38},
  year = {2024},
  publisher = {S. Karger AG},
}
Thesis
Improved Algorithm for Independent Component Analysis (ICA) with the Relax and Split Approximation

Sijian Fan

Emory University, 2020

Independent component analysis (ICA) has been increasingly used to separate sources and extract features in signal processing and neuroimaging studies. To overcome its computational problems with local optima, as well as problems with non-smooth and non-convex objective functions, relax and split optimization was applied in this study and comparisons were made between the refined algorithms and the popular FastICA algorithm. A tuning parameter was used to control the relaxation and sparsity level of the Relax-Laplace method (with an objective function derived from the Laplace density), and to control the relaxation level of the Relax-logistic method (with an objective function derived from the logistic density).

We conducted a simulation study to examine the impact of the tuning parameter on accuracy and sensitivity to initialization. We found that smaller values of the tuning parameter can lead to accurate estimates of the components while having fewer issues with local optima relative to FastICA, whereas larger values can result in inaccuracies. Across 1000 runs, each with a pool of 50 initializations, we found that the Relax-Laplace algorithm was the most accurate and consistent compared with Relax-logistic, FastICA-logistic, and FastICA-tanh.

We conducted a multi-subject analysis of functional magnetic resonance imaging (fMRI) data from the Human Connectome Project using Relax-Laplace, FastICA-logistic, and FastICA-tanh. In a pool of 50 initializations, the Relax-Laplace method returned the same result for all initializations, whereas both FastICA-logistic and FastICA-tanh converged to the argmax estimate in just over half of the initializations. Moreover, the Relax-Laplace method produced sparse representations of the resting-state fMRI (rs-fMRI) data that highlight features of resting-state networks.
@mastersthesis{fan2020improved,
  title = {Improved Algorithm for Independent Component Analysis (ICA) with the Relax and Split Approximation},
  author = {Fan, Sijian},
  school = {Emory University},
  year = {2020},
}