Deep feature selection: theory and application to identify enhancers and promoters

From National Research Council Canada

DOI	https://doi.org/10.1089/cmb.2015.0189
Author	Li, Yifeng; Chen, Chih-Yu; Wasserman, Wyeth W.
Format	Text, Article
Subject	Deep feature selection; Deep learning; Enhancer; Promoter
Abstract	Sparse linear models approximate target variable(s) by a sparse linear combination of input variables. Since they are simple, fast, and able to select features, they are widely used in classification and regression. Essentially they are shallow feed-forward neural networks that have three limitations: (1) incompatibility to model nonlinearity of features, (2) inability to learn high-level features, and (3) unnatural extensions to select features in a multiclass case. Deep neural networks are models structured by multiple hidden layers with nonlinear activation functions. Compared with linear models, they have two distinctive strengths: the capability to (1) model complex systems with nonlinear structures and (2) learn high-level representation of features. Deep learning has been applied in many large and complex systems where deep models significantly outperform shallow ones. However, feature selection at the input level, which is very helpful to understand the nature of a complex system, is still not well studied. In genome research, the cis-regulatory elements in noncoding DNA sequences play a key role in the expression of genes. Since the activity of regulatory elements involves highly interactive factors, a deep tool is strongly needed to discover informative features. In order to address the above limitations of shallow and deep models for selecting features of a complex system, we propose a deep feature selection (DFS) model that (1) takes advantage of deep structures to model nonlinearity and (2) conveniently selects a subset of features right at the input level for multiclass data. Simulation experiments convince us that this model is able to correctly identify both linear and nonlinear features. We applied this model to the identification of active enhancers and promoters by integrating multiple sources of genomic information. Results show that our model outperforms elastic net in terms of size of discriminative feature subset and classification accuracy.
Publication date	2016-01-22
Publisher	Mary Ann Liebert
In	Journal of Computational Biology 23, no. 5: 322–336.
Language	English
Peer reviewed	Yes
NPARC number	23000378
Record identifier	a3e7d7b1-dc03-4ce9-8d7e-0f7608d711ae
Record created	2016-07-12
Record modified	2020-03-16

Date modified: 2025-05-09