COSC 348: Lab 11

In this lab, you will re-analyse some of the real microarray data from the article by Scott Pomeroy et al., Nature, vol. 415, 24 Jan 2002, pp. 436-442.

IMPORTANT: Submit the results of classification and a report in which you address the questions posed at the end of this lab manual (worth 1% of the final mark).

You will use the dataset named A1.

In this dataset, there are:

7129 columns corresponding to individual genes, plus column 7130, which holds the class label.
40 rows corresponding to individual subjects or samples:
- 10 samples with medulloblastomas, MD (class 1)
- 10 malignant gliomas, Mglio (class 2)
- 10 samples with AT/RT (class 3)
- 4 normal cerebella, Ncer (class 4)
- 6 supratentorial PNETs (class 5)

Download and unzip Dataset A1.
(Note: this is a zipped ASCII file of about 720 MB.)

Use NeuCom to visualise and analyse the data. Instructions on how to use this application can be found here.

Load and inspect the data file.

Loading the data file.

When you open NeuCom, the first action is to load your data file.

In the File dropdown menu, choose the option Load data file.

When the dataset is loaded, its name appears in the list of available datasets. View the dataset with the View&Modify function (bottom taskbar).

Open the data set and inspect it visually.

In the NeuCom format, each row corresponds to one sample (patient) and each column to one feature (i.e. gene).

The last column is the code for the tumour type, i.e. the class label.

Check for yourself how many samples, genes and classes there are in the set.
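The same check can be sketched outside NeuCom. The snippet below is only an illustration: it assumes the unzipped file is whitespace-separated text with one sample per row and the class label in the last column, and the helper names parse_table and summarise are hypothetical.

```python
from collections import Counter

def parse_table(text):
    """Parse whitespace-separated text into rows of floats (assumed file layout)."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]

def summarise(rows):
    """Return (number of samples, number of genes, samples per class)."""
    n_samples = len(rows)
    n_genes = len(rows[0]) - 1          # the last column is the class label
    class_counts = Counter(int(row[-1]) for row in rows)
    return n_samples, n_genes, dict(class_counts)

# Toy stand-in for the real file: 3 samples, 2 genes, 2 classes.
toy_text = """
120.0 15.0 1
130.0 12.0 1
40.0 300.0 2
"""
n_samples, n_genes, class_counts = summarise(parse_table(toy_text))
print(n_samples, n_genes, class_counts)
```

For Dataset A1 the expected result is 40 samples, 7129 genes and five classes with the counts listed at the start of this manual.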

Data preprocessing: Normalisation

Because of the large range of gene expression values (which you may have noticed when visually inspecting the original data file), logarithmic rather than linear normalisation is recommended. Press Normalise (on the bottom taskbar) and answer yes to the question about output for classification (this refers to the presence of the last column of class labels in the data file). Choose the log10 normalisation method. View the log10-normalised data file and save it (Save button on the bottom taskbar). A new file with the normalised values appears in the list of available datasets.
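To see what a log10 transform does to the values, here is a minimal sketch. It is not NeuCom's implementation: in particular, the floor parameter used to handle zero or negative expression values is an assumption, since the manual does not say how NeuCom treats them.

```python
import math

def log10_normalise(rows, floor=1.0):
    """Apply log10 to every gene value and leave the class label untouched.

    Values below `floor` are raised to `floor` first (an assumption made
    here so that the logarithm is always defined).
    """
    normalised = []
    for row in rows:
        genes = [math.log10(max(value, floor)) for value in row[:-1]]
        normalised.append(genes + [row[-1]])   # keep the label column as is
    return normalised

# Toy data: two genes plus a class label; one value is negative.
toy = [[10.0, 1000.0, 1],
       [100.0, -5.0, 2]]
for row in log10_normalise(toy):
    print(row)
```

Note how values spanning several orders of magnitude end up on a comparable scale, which is the reason for preferring the logarithmic option here.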

Gene selection

Calculation of SNR. In the Data Analysis dropdown menu, choose the SNR (signal-to-noise ratio) option. In the available datasets window, choose the data file you want to analyse (i.e. the log10-normalised data). Set the SNR analysis to produce the 10 genes with the highest SNR. Save the resulting figure (click on it to save) and write down the numbers of the genes/variables for further processing.
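The idea behind SNR ranking can be sketched as follows. The manual does not give NeuCom's formula, so this example assumes a common definition: for each gene, the absolute difference between the mean in one class and the mean in all other samples, divided by the sum of the two standard deviations, taking the largest value over the classes. The function names are hypothetical.

```python
from statistics import mean, pstdev

def snr_scores(rows):
    """One-vs-rest signal-to-noise score per gene (assumed definition)."""
    labels = sorted({row[-1] for row in rows})
    n_genes = len(rows[0]) - 1
    scores = []
    for g in range(n_genes):
        best = 0.0
        for c in labels:
            inside = [row[g] for row in rows if row[-1] == c]
            outside = [row[g] for row in rows if row[-1] != c]
            denom = pstdev(inside) + pstdev(outside)
            if denom > 0:   # skip genes with no spread in either group
                best = max(best, abs(mean(inside) - mean(outside)) / denom)
        scores.append(best)
    return scores

def top_genes(rows, n):
    """Return the 1-based column numbers of the n highest-scoring genes."""
    scores = snr_scores(rows)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return [g + 1 for g in ranked[:n]]

# Toy data: gene 1 separates the classes well, gene 2 weakly, gene 3 not at all.
toy = [[1.0, 5.0, 3.0, 1],
       [1.2, 5.1, 9.0, 1],
       [5.0, 5.0, 3.1, 2],
       [5.2, 4.9, 8.9, 2]]
print(top_genes(toy, 2))
```

A gene scores highly when its class means are far apart relative to the spread within the groups, which is what makes it useful for classification.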

Gene selection. Use Extract (on the bottom taskbar) to extract the genes (variables) from the data file by specifying their column numbers separated by spaces, e.g. 1320 2496 ..., etc. Create files with N = 3, 5 and 10 selected genes, and name them 3, 5 and 10, or something similar. Always keep the class label column (i.e. the last, or output, column) in the data files. Keep track of which gene is which by creating a table that lists each new column number against the number from the original file; for instance, the new No. 1 is the original No. 1320, etc.
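The extraction step and the bookkeeping table can be illustrated with a short sketch. The helper extract_genes is hypothetical and the column numbers in the toy example are made up; it simply keeps the requested columns, appends the class label, and records which original column each new column came from.

```python
def extract_genes(rows, columns):
    """Keep the given 1-based gene columns plus the class label column.

    Returns the reduced rows and a mapping {new column number: original number}.
    """
    subset = [[row[c - 1] for c in columns] + [row[-1]] for row in rows]
    mapping = {new: old for new, old in enumerate(columns, start=1)}
    return subset, mapping

# Toy data: four genes plus a class label; keep original columns 3 and 1.
toy = [[0.1, 0.2, 0.3, 0.4, 1],
       [1.1, 1.2, 1.3, 1.4, 2]]
subset, mapping = extract_genes(toy, [3, 1])
print(subset)
print(mapping)
```

The mapping plays the role of the table described above: it lets you translate a column number in a reduced file back to the gene number in the original file.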

Removal of outliers. Outliers are data points that differ from the others by a large factor (i.e. ~10x). These are usually errors, so remove the affected variable (gene) manually by erasing that column from your datasets. Replace the removed gene with the next gene in the SNR ranking.
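One possible way to screen a gene column for such values is sketched below. The rule used here is an assumption made for illustration, not the manual's definition: a value counts as an outlier if its magnitude is more than 10 times larger or smaller than the median magnitude of the remaining values in that column.

```python
from statistics import median

def has_outlier(values, factor=10.0):
    """Return True if any value differs from the rest by more than `factor` times."""
    for i, value in enumerate(values):
        others = values[:i] + values[i + 1:]
        typical = median(abs(o) for o in others)   # typical magnitude of the rest
        if typical > 0 and (abs(value) > factor * typical
                            or abs(value) * factor < typical):
            return True
    return False

print(has_outlier([2.0, 2.1, 1.9, 25.0]))   # one value is far larger than the rest
print(has_outlier([2.0, 2.1, 1.9, 2.2]))    # all values are similar
```

A flagged column should still be inspected by eye before it is removed, since a large difference can also reflect a genuine class effect.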

Data visualisation

2D data visualisation. Visualisation is on the top taskbar/menu. Start with 2D visualisation of your selected genes. Use the single plot function to view the distribution of each variable (gene) across the samples. Move the Y-Axis cursor to change the variable (i.e. gene).

Use the multiplot function. The last column (class labels) shows how the gene expression values are distributed across classes. Observe how the expression values of the selected genes vary across the classes.

Class distribution of selected genes. This function is another visualisation method for checking whether the selected genes clearly separate the classes.

3D visualisation. Tick the option Classification data before pressing Start. Adjust Azimuth and Elevation to rotate the data. Determine whether any combination of 3 genes leads to a clear distinction between classes.

Clustering

K-means clustering. (Data Analysis -> Clustering -> K-means.) Tick Show cluster label on the right. The default number of clusters is 3 (in the field Number of). Perform the clustering for each pair of features (X and Y values). Do the clusters correspond to classes? Unfortunately, we cannot tell from the plot: the sample numbers would let us find out, but the graph resolution is too low to read them.
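Outside NeuCom, the question of whether clusters correspond to classes can be answered by cross-tabulating cluster assignments against class labels. The sketch below uses a plain textbook k-means with random initial centroids; it is not NeuCom's implementation, and the function names and the seed parameter are assumptions.

```python
import random
from collections import Counter

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means; returns the cluster index assigned to each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cross_tabulate(assignment, labels):
    """Count samples for every (cluster, class) pair."""
    return Counter(zip(assignment, labels))

# Toy data: two well-separated groups of two-gene samples.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [1, 1, 1, 2, 2, 2]
assignment = kmeans(points, k=2)
table = cross_tabulate(assignment, labels)
print(table)
```

If each cluster contains samples from only one class, the table has exactly one non-zero entry per cluster; mixed clusters show up as several entries sharing the same cluster index.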

Principal Component Analysis

PCA can be found under Visualisation. Tick the options Data contains output and Classification D under parameters. Can PCA help to classify samples in this particular case? Also check the graph of % variance captured, which shows the proportion of the variance in the data captured by each principal component.
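To make the "% variance captured" figure concrete, the following sketch computes the first principal component of a small data set and its share of the total variance. It is a generic illustration (covariance matrix plus power iteration), not NeuCom's PCA routine, and the function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    d = len(means)
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]        # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

# Toy data: two strongly correlated genes (class labels already removed).
toy = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
cov = covariance_matrix(toy)
direction, eigenvalue = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance
print(f"first component captures {100 * share:.1f}% of the variance")
```

When the first one or two components capture most of the variance, a 2D PCA plot is a faithful summary of the data; whether the classes separate in that plot is a different question, which is the one posed above.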

Classification

We will compare the performance of two classification methods, SVM and MLP, on the three datasets of selected genes (3, 5 and 10 genes). The comparison will use leave-one-out cross-validation.
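The leave-one-out procedure itself is independent of the classifier: each sample in turn is held out, the model is trained on the remaining samples, and the held-out sample is predicted. The sketch below shows this loop. To stay short it uses a nearest-centroid classifier as a stand-in, which is neither the SVM nor the MLP used in NeuCom; the function names are hypothetical.

```python
def nearest_centroid_predict(train, sample):
    """Predict the class whose mean vector is closest to `sample` (stand-in model)."""
    by_class = {}
    for row in train:
        by_class.setdefault(row[-1], []).append(row[:-1])
    best_label, best_dist = None, float("inf")
    for label, members in by_class.items():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(rows, predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]          # all samples except the i-th
        if predict(train, held_out[:-1]) == held_out[-1]:
            correct += 1
    return correct / len(rows)

# Toy data: two genes plus a class label, two well-separated classes.
toy = [[0, 0, 1], [0, 1, 1], [1, 0, 1],
       [10, 10, 2], [10, 11, 2], [11, 10, 2]]
accuracy = leave_one_out_accuracy(toy, nearest_centroid_predict)
print(accuracy)
```

With 40 samples, leave-one-out means 40 training runs per session, each tested on a single sample, so the reported accuracy is the fraction of the 40 held-out samples that were classified correctly.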

Cross-validation by means of SVM

In the Modelling&Discovery dropdown menu, choose cross-validation. Select the dataset. After clicking the Create Sessions button, choose Classification, then Support Vector Machine, then Linear with the N/A option (or polynomial with a specified degree of the polynomial boundary), then Leave-one-out cross-validation, sequential sampling, No normalisation, inductive (or transductive) training, and no feature selection. Keep all created sessions highlighted and click the Start button. When the cross-validation is finished, press Visualise results.

Cross-validation by means of MLP

In the Modelling&Discovery dropdown menu, choose cross-validation. Select the dataset. After clicking the Create Sessions button, choose Classification, then Multi-Layer Perceptron, and then set the network parameters. Keep all the default parameters except the number of hidden neurons, which you will experiment with: try 5, 10 and 20 hidden neurons. Then select Leave-one-out cross-validation, sequential sampling, No normalisation, inductive training (only inductive training is used for MLP), and no feature selection. Keep all created sessions highlighted and click the Start button. When the cross-validation is finished, press Visualise results.

Microarray data analysis using NeuCom
-- challenge for developing personalized medicine
