Computational identification of genomic features that influence 3D chromatin domain formation
Few computational methods have been proposed to study the roles of DNA-binding proteins and functional elements in chromosome folding. A simple yet widely used statistical method consists of assessing the enrichment of a genomic feature around 3D domain borders using Fisher’s exact test or Pearson’s chi-squared test. An important caveat of the enrichment test is that it only identifies genomic features that colocalize at domain borders; it cannot determine which of them influence domain border establishment or maintenance. For instance, two genomic features might both be found significantly enriched at domain boundaries even though only one of them truly influences border establishment or maintenance, simply because the two features colocalize (are correlated) along the genome. Statistically speaking, correlation does not imply causation. Other works focused on predicting 3D domain borders using (semi-)non-parametric models and identified the subset of genomic features most predictive of topologically associating domains (TADs). However, a genomic feature can efficiently predict 3D domain borders without being influential.
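The caveat of the enrichment test can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the analysis from the original study: the bin count, the feature frequencies and the border probabilities are arbitrary, and the names fisher_enrichment_p, simulate_bins and contingency are hypothetical helpers. Feature A is simulated as the only driver of borders, while feature B merely colocalizes with A. A one-sided Fisher’s exact test, computed directly from the hypergeometric distribution, is then applied to each feature separately.

```python
import math
import random

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    2x2 table:        feature present   feature absent
      border                 a                b
      non-border             c                d
    Returns P(X >= a) under the hypergeometric null hypothesis.
    """
    n_border = a + b
    n_feature = a + c
    n_total = a + b + c + d
    upper = min(n_border, n_feature)
    tail = sum(
        math.comb(n_feature, k) * math.comb(n_total - n_feature, n_border - k)
        for k in range(a, upper + 1)
    )
    return tail / math.comb(n_total, n_border)

def simulate_bins(n_bins=2000, seed=0):
    """Simulate genomic bins (all probabilities are illustrative assumptions)."""
    rng = random.Random(seed)
    bins = []
    for _ in range(n_bins):
        has_a = rng.random() < 0.2
        # B colocalizes with A but has no effect on borders.
        has_b = rng.random() < (0.8 if has_a else 0.05)
        # Border probability depends on A only.
        is_border = rng.random() < (0.6 if has_a else 0.05)
        bins.append((has_a, has_b, is_border))
    return bins

def contingency(bins, idx):
    """Build the 2x2 table (a, b, c, d) for the feature at position idx."""
    a = sum(1 for x in bins if x[2] and x[idx])
    b = sum(1 for x in bins if x[2] and not x[idx])
    c = sum(1 for x in bins if not x[2] and x[idx])
    d = sum(1 for x in bins if not x[2] and not x[idx])
    return a, b, c, d

bins = simulate_bins()
p_a = fisher_enrichment_p(*contingency(bins, 0))
p_b = fisher_enrichment_p(*contingency(bins, 1))
print("enrichment p-value, feature A (true driver):", p_a)
print("enrichment p-value, feature B (colocalized only):", p_b)
```

Under these assumptions both features come out as strongly enriched at borders, although only A has any effect in the simulation, which is the confounding by colocalization described above.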
We proposed multiple logistic regression to identify the genomic features that positively or negatively influence domain border establishment or maintenance (PLOS Computational Biology). The model is flexible and can account for statistical interactions among multiple genomic features. Using both simulated and real data, we showed that our model outperforms the enrichment test and non-parametric models, such as random forests, for the identification of genomic features that influence domain borders. Using Drosophila Hi-C data at a very high resolution of 1 kb, our model suggested that, among architectural proteins, BEAF-32 and CP190 are the main positive drivers of 3D domain borders. In humans, our model identified the well-known architectural proteins CTCF and cohesin, as well as ZNF143 and Polycomb group proteins, as positive drivers of domain borders. The model also revealed the existence of several negative drivers that counteract the presence of domain borders, including P300, RXRA, BCL11A and ELK1.
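The idea behind multiple logistic regression can be sketched on simulated data: all features enter one model, logit P(border) = b0 + bA*xA + bB*xB, so each coefficient measures the effect of a feature conditional on the others. The code below is a minimal sketch under stated assumptions, not the published implementation: the simulation parameters are arbitrary, the helper names solve and fit_logistic are hypothetical, and the fit uses a plain Newton-Raphson iteration written with the standard library only. Feature A drives borders, and feature B only colocalizes with A.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression fitted by Newton-Raphson.

    Each row of X must start with 1.0 for the intercept.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(bj * xj for bj, xj in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        step = solve(hess, grad)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta

# Simulated bins (all probabilities are illustrative assumptions).
rng = random.Random(0)
X, y = [], []
for _ in range(2000):
    has_a = rng.random() < 0.2
    has_b = rng.random() < (0.8 if has_a else 0.05)   # B colocalizes with A
    border = rng.random() < (0.6 if has_a else 0.05)  # only A drives borders
    X.append([1.0, float(has_a), float(has_b)])
    y.append(1.0 if border else 0.0)

beta = fit_logistic(X, y)
print("intercept:", beta[0])
print("coefficient of A (true driver):", beta[1])
print("coefficient of B (colocalized only):", beta[2])
```

In this simulated setting the coefficient of A is expected to be clearly positive, while the coefficient of B should stay close to zero once A is accounted for. A statistical interaction could be explored in the same sketch by appending a product column xA*xB to each row of X.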