Gene regulatory networks (GRNs) are typically inferred from gene expression profiles that monitor a specific condition or treatment. Over the past decade, integrative approaches have successfully used complementary prior data to guide GRN inference from gene expression. However, prior knowledge and gold-standard validation datasets are often related to each other and limited to a subset of genes. Because no comprehensive and unbiased evaluation is currently available, new criteria are needed to robustly determine the optimal intensity of prior data integration during inference.
The authors address this problem for two regression-based GRN inference models: a weighted random forest (weightedRF) and a generalised linear model estimated under a weighted LASSO penalty with stability selection (weightedLASSO). Both methods are applied to data from the root response of Arabidopsis thaliana to nitrate induction. For every gene, the authors quantify how integrating transcription factor binding motifs affects model prediction. They then present a novel method, DIOgene, which optimises data integration strength in a hypothesis-driven, gene-specific manner using model prediction error and a simulated null hypothesis. This integration strategy performs well both in minimising prediction error and in retrieving experimentally supported interactions, and it reveals strong diversity in optimal integration intensities across genes.
The R code and notebooks demonstrating how to use the proposed approaches are available in the following repository: https://github.com/OceaneCsn/integrative_GRN_N_induction
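To make the idea of gene-specific integration strength more concrete, the following minimal Python sketch shows one way such a selection could work. It is not the authors' DIOgene implementation, which is written in R and lives in the repository. The function name `select_integration_strength`, the candidate grid `alphas`, the toy error values, and the decision rule are all assumptions made for illustration. The rule used here is to keep the strength whose prediction error improves most over a shuffled-prior null, and to fall back to no integration otherwise. Model fitting is left out: the sketch assumes prediction errors have already been computed for each candidate strength.

```python
from statistics import mean


def select_integration_strength(alphas, mse_true, mse_null):
    """Pick an integration strength for one gene (illustrative rule only).

    alphas   : candidate strengths; alphas[0] is assumed to mean "no prior".
    mse_true : mse_true[i] lists prediction errors obtained at alphas[i]
               with the real prior (e.g. one value per model replicate).
    mse_null : mse_null[i] lists prediction errors obtained at alphas[i]
               with a shuffled prior, i.e. the simulated null hypothesis.
    """
    best_alpha, best_gain = alphas[0], 0.0
    for alpha, true_errs, null_errs in zip(alphas, mse_true, mse_null):
        # Gain = how much lower the error is with the real prior than under the null.
        gain = mean(null_errs) - mean(true_errs)
        if gain > best_gain:
            best_alpha, best_gain = alpha, gain
    # If no strength beats the null, the first candidate (no prior) is returned.
    return best_alpha


# Toy, made-up errors for two hypothetical genes.
alphas = [0.0, 0.5, 1.0]

# Gene A: the real prior lowers the error most at alpha = 0.5.
gene_a = select_integration_strength(
    alphas,
    mse_true=[[1.0, 1.0], [0.6, 0.8], [0.9, 0.9]],
    mse_null=[[1.0, 1.0], [1.0, 1.0], [1.1, 1.1]],
)

# Gene B: the real prior never beats the shuffled prior.
gene_b = select_integration_strength(
    alphas,
    mse_true=[[1.0], [1.2], [1.5]],
    mse_null=[[1.0], [1.1], [1.2]],
)

print(gene_a, gene_b)
```

In this toy setting the two genes end up with different strengths, which mirrors the diversity of optimal integration intensities reported above. The actual criterion and error estimates used by DIOgene should be taken from the repository rather than from this sketch.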