
Cancer genomics dataset description
Genomics data covers all data related to the DNA of living organisms. Although this thesis also uses other types of data, such as transcriptomic data (RNA and miRNA), for convenience all of it will be referred to as genomics data. Research on human genetics has seen a major breakthrough in recent years thanks to the success of the Human Genome Project (HGP, 1990-2003) in sequencing the full human DNA sequence.
One of the areas that has benefited most from this is research on genetics-related diseases, including cancer. Because many kinds of biomedical analyses are performed on DNA, there are various types of -omics (genomics) data. The following types of -omics data are crucial to cancer analysis:
- Raw sequencing data: This corresponds to the DNA sequence of whole chromosomes. The human genome has 24 types of chromosomes (22 autosomes plus the X and Y sex chromosomes), and each chromosome consists of roughly 47 to 247 million base pairs. Each base is one of four types: adenine (A), cytosine (C), guanine (G), and thymine (T). Raw sequencing data therefore consists of billions of base pairs, each coded as one of these four types.
- Single-Nucleotide Polymorphism (SNP) data: Each human has a slightly different DNA sequence, and these differences are genetic variants (mutations). A variant can cause a disease, produce only a difference in physical appearance (such as hair color), or have no effect at all. When the variation affects a single base pair rather than a sequence of base pairs, it is called a Single-Nucleotide Polymorphism (SNP).
- Copy Number Variation (CNV) data: This corresponds to genetic mutations that affect a sequence of base pairs. Several types of mutation can occur, including deletion of a sequence of base pairs, duplication of a sequence of base pairs, and relocation of a sequence of base pairs to another part of the chromosome.
- DNA methylation data: This corresponds to the amount of methylation (a methyl group attached to a base) in regions of the chromosome. A large amount of methylation in the promoter region of a gene can cause gene repression. DNA methylation is one of the reasons our organs behave differently even though all of them share the same DNA sequence. In cancer, this DNA methylation pattern is disrupted.
- Gene expression data: This corresponds to the amount of product (mRNA transcripts and, ultimately, proteins) expressed from a gene at a given time. Cancer can arise from high expression of an oncogene (that is, a gene that promotes tumor growth), low expression of a tumor suppressor gene (a gene that prevents tumors), or both. Analyzing gene expression data can therefore help discover biomarkers in cancer. We will use this type of data in this project.
- miRNA expression data: This corresponds to the amount of microRNA (miRNA) expressed at a given time. miRNA plays a role in gene silencing at the mRNA stage. Analyzing miRNA expression data can therefore help discover miRNA biomarkers in cancer.
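To make the idea of a single-base difference concrete, the following is a minimal Java sketch that compares a sample sequence against a reference sequence of the same length and reports the positions where they differ. The class name, method name, and both toy sequences are made up for illustration. Real variant calling relies on read alignment and statistical models, so this is only a toy picture of the concept, not how SNP data is actually produced.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: position-by-position comparison of two equal-length sequences. */
final class SnpSketch {

    /** Returns the zero-based positions at which the sample differs from the reference. */
    static List<Integer> findSingleBaseDifferences(String reference, String sample) {
        if (reference.length() != sample.length()) {
            throw new IllegalArgumentException("Sequences must have the same length");
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < reference.length(); i++) {
            if (reference.charAt(i) != sample.charAt(i)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String reference = "ACGTTGCA"; // made-up toy sequence
        String sample = "ACGTAGCA";    // differs from the reference at index 4 (T -> A)
        System.out.println(findSingleBaseDifferences(reference, sample)); // prints [4]
    }
}
```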
There are several databases of genomics datasets, where the aforementioned data can be found. Some of them focus on the genomics data of cancer patients. These databases include:
- The Cancer Genome Atlas (TCGA): https://cancergenome.nih.gov/
- International Cancer Genome Consortium (ICGC): https://icgc.org/
- Catalog of Somatic Mutations in Cancer (COSMIC): https://cancer.sanger.ac.uk/cosmic
This genomics data is usually accompanied by the patient's clinical data, which can comprise general clinical information (for example, age or gender) and cancer status (for example, cancer location or cancer stage). All of this genomics data is high dimensional. For example, the gene expression data for each patient is structured by gene ID, of which there are around 60,000.
Moreover, some of the data comes in more than one format. For example, 70% of the DNA methylation data is collected from breast cancer patients and the remaining 30% is curated from different platforms, so there are two different structures in this dataset. To analyze genomics data while dealing with this heterogeneity, researchers have often used powerful machine learning techniques or even deep neural networks.
Now let's see what a real-life dataset that can be used for our purpose looks like. We will be using the gene expression cancer RNA-Seq dataset downloaded from the UCI Machine Learning Repository (see https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq# for more information).

The data collection pipeline for the pan-cancer analysis project (source: "Weinstein, John N., et al. 'The cancer genome atlas pan-cancer analysis project.' Nature Genetics 45.10 (2013): 1113-1120")
This dataset is a random subset of another dataset reported in the following paper: Weinstein, John N., et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics 45.10 (2013): 1113-1120. The preceding diagram shows the data collection pipeline for the pan-cancer analysis project.
The project, named the Pan-Cancer analysis project, assembled data from thousands of patients with primary tumors occurring at different sites of the body. It covered 12 tumor types (see the upper-left panel in the preceding figure):
- Glioblastoma Multiforme (GBM)
- Lymphoblastic Acute Myeloid Leukemia (LAML)
- Head and Neck Squamous Carcinoma (HNSC)
- Lung Adenocarcinoma (LUAD)
- Lung Squamous Carcinoma (LUSC)
- Breast Carcinoma (BRCA)
- Kidney Renal Clear Cell Carcinoma (KIRC)
- Ovarian Carcinoma (OV)
- Bladder Carcinoma (BLCA)
- Colon Adenocarcinoma (COAD)
- Uterine Cervical and Endometrial Carcinoma (UCEC)
- Rectal Adenocarcinoma (READ)
This collection of data is part of the RNA-Seq (HiSeq) PANCAN dataset. It is a random extraction of gene expression values from patients with five types of tumors: BRCA, KIRC, COAD, LUAD, and PRAD (prostate adenocarcinoma).
The dataset contains 801 samples from cancer patients, each with 20,531 attributes. Samples (instances) are stored row-wise. The variables (attributes) of each sample are RNA-Seq gene expression levels measured on the Illumina HiSeq platform. A dummy name (gene_XX) is given to each attribute, and the attributes are ordered consistently with the original submission. For example, gene_1 on sample_0 has an expression value of 2.01720929003.
When you download the dataset, you will see there are two CSV files:
- data.csv: Contains the gene expression data of each sample
- labels.csv: The labels associated with each sample
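As a rough illustration of how the row-wise layout described above could be read, here is a minimal plain-Java sketch that parses CSV lines into a map from sample ID to an array of expression values. It assumes one header line followed by rows of the form sampleId,value,value,...; the class name, method name, and toy values are hypothetical, and the sketch does not use DL4J. For the real file, the lines could be obtained with Files.readAllLines(Paths.get("data.csv")), where the path is an assumption about where you saved the download.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: parses "sampleId,v1,v2,..." rows into numeric feature arrays. */
final class ExpressionCsvSketch {

    static Map<String, double[]> parseExpression(List<String> lines) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        // Start at 1 to skip the header line (assumed to hold the gene names).
        for (int r = 1; r < lines.size(); r++) {
            String line = lines.get(r).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] cells = line.split(",");
            double[] values = new double[cells.length - 1];
            for (int c = 1; c < cells.length; c++) {
                values[c - 1] = Double.parseDouble(cells[c].trim());
            }
            rows.put(cells[0], values); // first cell is the sample ID
        }
        return rows;
    }

    public static void main(String[] args) {
        // A tiny made-up stand-in for data.csv; the values are not from the real dataset.
        List<String> toy = Arrays.asList(
                ",gene_0,gene_1,gene_2",
                "sample_0,0.0,2.5,3.1",
                "sample_1,0.0,0.6,1.6");
        Map<String, double[]> rows = parseExpression(toy);
        int features = rows.values().iterator().next().length;
        System.out.println(rows.size() + " samples x " + features + " features");
    }
}
```

Run against the full file, the same check should report the sample and attribute counts stated earlier, which is a quick way to confirm the download is complete.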
Let's take a look at the processed dataset. Given the high dimensionality, the following screenshot shows only a few selected features. The first column contains the sample IDs (that is, anonymous patient IDs), and the remaining columns contain the expression levels of individual genes in the patients' tumor samples:

Sample gene expression dataset
Now look at the labels in Figure 3. Here, id contains the sample IDs and Class represents the cancer labels:

Samples are classified into different cancer types
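Because a classifier works with numeric targets rather than strings, the class names usually have to be mapped to integer indices at some point. The following plain-Java sketch shows one way this could be done, assigning indices in alphabetical order. The class and method names are hypothetical, the sample labels are a made-up sequence of the five tumor codes, and the sketch is independent of any helper DL4J may provide for this step.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative only: maps class names such as "BRCA" to integer indices. */
final class LabelEncodingSketch {

    /** Assigns each distinct label an index, in alphabetical order of the labels. */
    static Map<String, Integer> buildIndex(List<String> labels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String label : new TreeSet<>(labels)) {
            index.put(label, index.size()); // size() is read before the entry is added
        }
        return index;
    }

    /** Replaces every label with its integer index. */
    static int[] encode(List<String> labels, Map<String, Integer> index) {
        int[] encoded = new int[labels.size()];
        for (int i = 0; i < labels.size(); i++) {
            encoded[i] = index.get(labels.get(i));
        }
        return encoded;
    }

    public static void main(String[] args) {
        // Made-up label column; the real one would come from labels.csv.
        List<String> labels = Arrays.asList("PRAD", "LUAD", "PRAD", "BRCA", "KIRC", "COAD");
        Map<String, Integer> index = buildIndex(labels);
        System.out.println(index); // prints {BRCA=0, COAD=1, KIRC=2, LUAD=3, PRAD=4}
        System.out.println(Arrays.toString(encode(labels, index))); // prints [4, 3, 4, 0, 2, 1]
    }
}
```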
Now you can see why I have chosen this dataset. Although it does not contain many samples, it is still very high dimensional, and this type of high-dimensional dataset is well suited to a deep learning algorithm.
So, given the features and labels, can we classify these samples based on the features and the ground truth? Why not? We will try to solve the problem with the DL4J library. First, we have to configure our programming environment so that we can start writing our code.