Deep learning in cancer genomics

Biomedical informatics includes all techniques regarding the development of data analytics, mathematical modeling, and computational simulation for the study of biological systems. In recent years, we've witnessed huge leaps in biological computing that have put large, information-rich resources at our disposal. These cover domains such as anatomy, modeling (3D printers), genomics, and pharmacology, among others.

One of the most famous success stories of biomedical informatics comes from the domain of genomics. The Human Genome Project (HGP) was an international research project with the objective of determining the full sequence of human DNA. This project has been one of the most important landmarks in computational biology and has served as a basis for other projects, including the Human Brain Project, which aims to model and simulate the human brain. The data used in this thesis is also an indirect result of the HGP.

The era of big data began in the last decade or so, marked by a flood of digital information relative to its analog counterpart. In 2016 alone, 16.1 zettabytes (ZB) of digital data were generated, and this figure is predicted to reach 163 ZB per year by 2025. Welcome as this is, some problems remain, especially in data storage and analysis. For the latter, the simple machine learning methods used for normal-size data analysis are often no longer effective and may need to be replaced by deep neural network methods. Deep learning is generally known to deal very well with these types of large and complex datasets.

Along with other crucial areas, the biomedical area has also been exposed to the big data phenomenon. One of the largest data sources is -omics data, such as genomics, metabolomics, and proteomics. Innovations in biomedical techniques and equipment, such as DNA sequencing and mass spectrometry, have led to a massive accumulation of -omics data.

Typically, -omics data is characterized by issues of veracity, variability, and high dimensionality. These datasets are sourced from multiple, and sometimes even incompatible, data platforms. These properties make this type of data suitable for DL approaches. Deep learning analysis of -omics data is one of the main tasks in the biomedical sector, as it has the potential to drive personalized medicine. With information about a person's -omics data, diseases can be handled better and treatment can focus on preventive measures.

Cancer is generally known to be one of the deadliest diseases in the world, mostly due to the complexity of its diagnosis and treatment. It is a genetic disease that involves multiple gene mutations. As the importance of genetic knowledge in cancer treatment is increasingly recognized, several projects to document the genetic data of cancer patients have emerged recently. One of the most well known is The Cancer Genome Atlas (TCGA) project, which is available on the TCGA research network: http://cancergenome.nih.gov/.

As mentioned before, there have been a number of deep learning implementations in the biomedical sector, including cancer research. For cancer research, most researchers use -omics or medical imaging data as inputs. Some works use either a histopathology image or a PET image as the source, and most of that research focuses on classifying the image data with convolutional neural networks (CNNs).

Many other works, however, use -omics data as their source. Fakoor et al. classified various types of cancer using patients' gene expression data. Because the dimensionality of the data differs from one cancer type to another, they first used principal component analysis (PCA) to reduce the dimensionality of the microarray gene expression data.

PCA is a statistical technique used to emphasize variation and extract the most significant patterns from a dataset; it is the simplest of the true eigenvector-based multivariate analyses. PCA is frequently used to make data easier to explore and visualize. As a result, PCA is one of the most widely used algorithms in exploratory data analysis and in building predictive models.
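To make the dimensionality reduction step concrete, here is a minimal sketch of PCA computed through a singular value decomposition with NumPy. It is not the code used by Fakoor et al.; the function name pca, the synthetic matrix X, and its shape (50 samples by 200 "genes") are assumptions made purely for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio

rng = np.random.default_rng(0)
# Hypothetical stand-in for an expression matrix: 50 samples, 200 features
# generated from 3 hidden factors plus a little noise
hidden = rng.normal(size=(50, 3))
loadings = rng.normal(size=(3, 200))
X = hidden @ loadings + 0.05 * rng.normal(size=(50, 200))

scores, components, ratio = pca(X, n_components=3)
print(scores.shape)   # reduced representation: one row per sample
print(ratio.sum())    # fraction of total variance kept by 3 components
```

Because the synthetic data is driven by only three hidden factors, three components retain almost all of the variance; real expression data would typically need more components, and the right number has to be chosen per dataset.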

After the PCA step, Fakoor et al. applied sparse and stacked autoencoders to classify various cancers, including acute myeloid leukemia, breast cancer, and ovarian cancer.

For detailed information, refer to the publication entitled Using deep learning to enhance cancer diagnosis and classification by R. Fakoor et al., in Proceedings of the International Conference on Machine Learning, 2013.
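As a rough illustration of what a sparse autoencoder does, the following sketch trains a single hidden layer with an L1 penalty on the hidden activations, using plain gradient descent in NumPy. It is an assumption-laden toy, not the architecture from the paper: the function train_sparse_autoencoder, all hyperparameters, and the synthetic input are hypothetical, and the stacking of several layers and the final classifier are left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden=4, l1=1e-3, lr=0.02, epochs=500, seed=0):
    """One-hidden-layer autoencoder with an L1 sparsity penalty (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)   # encoder: hidden code in (0, 1)
        X_hat = H @ W2 + b2        # linear decoder
        err = X_hat - X
        # reconstruction error plus sparsity penalty (H > 0, so |H| = H)
        losses.append(0.5 * np.sum(err ** 2) / n + l1 * np.sum(H) / n)
        # backpropagation of the loss above
        d_out = err / n
        dW2 = H.T @ d_out
        db2 = d_out.sum(axis=0)
        dH = d_out @ W2.T + l1 / n
        dZ = dH * H * (1.0 - H)
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        # gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def encode(A):
        return sigmoid(A @ W1 + b1)

    return encode, losses

rng = np.random.default_rng(1)
# Hypothetical input, e.g. PCA-reduced expression features: 60 samples x 20
X = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 20))
X = X + 0.1 * rng.normal(size=(60, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)

encode, losses = train_sparse_autoencoder(X)
codes = encode(X)
print(codes.shape)
print(losses[0], losses[-1])
```

A stacked version would train a second autoencoder on codes, and so on, before attaching a classifier to the deepest code; that step is omitted here to keep the sketch short.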

Ibrahim et al., on the other hand, used miRNA expression data from six types of cancer for gene/miRNA feature selection. They proposed a novel multilevel feature selection approach named MLFS (short for Multilevel gene/miRNA feature selection), which was based on Deep Belief Networks (DBNs) and unsupervised active learning.

You can read more in the publication entitled Multilevel gene/miRNA feature selection using deep belief nets and active learning by R. Ibrahim et al., in Proceedings of the 36th Annual International Conference Eng. Med. Biol. Soc. (EMBC), pp. 3957-3960, IEEE, 2014.

Finally, Liang et al. clustered ovarian and breast cancer patients using multiplatform genomics and clinical data. The ovarian cancer dataset contained gene expression, DNA methylation, and miRNA expression data across 385 patients, which were downloaded from The Cancer Genome Atlas (TCGA).

You can read more in the publication entitled Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach by M. Liang et al., in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, pp. 928-937, 2015.

The breast cancer dataset included gene expression (GE) data and corresponding clinical information, such as survival time and time to recurrence, collected by the Netherlands Cancer Institute. To deal with this multiplatform data, they used multimodal Deep Belief Networks (mDBN).

First, they trained a separate DBN on each data type to obtain its latent features. Then, another DBN, which performs the clustering, takes those latent features as its input. Beyond these studies, much research is under way to give cancer genomics, identification, and treatment a significant boost.
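The two-stage structure described for the multimodal approach (one encoder per data type, then a joint model over the latent features) can be sketched as follows. To keep the example short and runnable, PCA stands in for each DBN and plain k-means stands in for the clustering stage, so this shows only the data flow, not the method of Liang et al. The helper names, the two synthetic "modalities", and the subgroup structure are all hypothetical.

```python
import numpy as np

def latent_features(X, k):
    """Stand-in encoder: top-k PCA scores (a placeholder, NOT a DBN)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 20)                    # two hypothetical subgroups
shift = 4.0 * groups[:, None]
expression = rng.normal(size=(40, 30)) + shift    # hypothetical modality A
methylation = rng.normal(size=(40, 15)) + shift   # hypothetical modality B

# Stage 1: one encoder per data type
z_a = latent_features(expression, k=3)
z_b = latent_features(methylation, k=3)

# Stage 2: a joint encoder over the concatenated latent features, then clustering
joint = latent_features(np.hstack([z_a, z_b]), k=2)
labels = kmeans(joint, k=2)
print(labels)
```

The point of the design is that each data type is first compressed on its own terms, so that platforms with very different dimensionality contribute comparable latent features to the joint stage.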