In this lab, you will re-analyse some of the real microarray data from the scientific article by Scott Pomeroy et al., Nature, vol. 415, 24 Jan 2002, pp. 436-442.
IMPORTANT: Submit the results of classification and a report in which you address the questions posed at the end of this lab manual (worth 1% of the final mark).
You will use the dataset named A1.
In this dataset, there are:
Download and unzip Dataset A1.
(Note: this is a zipped ASCII file of about 720 MB.)
Use NeuCom to visualise and analyse the data.
Instructions on how to use this application can be found here.
When you open NeuCom, the first action is to load your data file.
In the File dropdown menu, choose the option Load data file.
When the dataset is loaded, its name will appear in the list of available datasets.
View the dataset using the View&Modify function (bottom taskbar).
In the NeuCom format, each row corresponds to one sample (patient) and each column to one feature (i.e. gene).
The last column is the code for a tumour type, i.e. the class label.
Check for yourself how many samples/genes and classes you have in the set.
Because of the large range of gene expression values
(which you may have noticed when visually inspecting the original data file),
it is recommended that you use
logarithmic rather than linear normalisation. Press Normalise
(on the bottom taskbar) and answer yes to the question about output for
classification (which asks whether the last column of the data file contains
class labels). Choose the log10 normalisation method. View the
log10-normalised data file and save it
(Save button on the bottom taskbar).
A new file with the normalised values appears in the list of available datasets.
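To make the normalisation step concrete, here is a minimal Python sketch of a log10 transform applied to every gene column while leaving the class label untouched. It is an illustration only, not NeuCom's implementation: the function name, the toy data and the `floor` parameter (used to clamp zero or negative readings) are assumptions of this sketch.

```python
import math

def log10_normalise(rows, floor=1.0):
    """Apply log10 to every feature value; keep the last column (class label) as-is.

    `floor` is an assumption of this sketch: values below it are clamped so that
    zero or negative expression readings do not break the logarithm.
    """
    out = []
    for row in rows:
        *features, label = row
        out.append([math.log10(max(v, floor)) for v in features] + [label])
    return out

# Hypothetical toy data: 2 samples, 3 genes, class label in the last column.
data = [
    [10.0, 1000.0, 100000.0, 1],
    [100.0, 1.0, 0.0, 2],
]
normalised = log10_normalise(data)
for row in normalised:
    print(row)
```

After the transform, values that originally spanned several orders of magnitude fall within a range of a few units, which is why the logarithmic option is preferred here.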
Calculation of SNR. In the Data Analysis dropdown menu, choose the SNR (signal-to-noise ratio) option. In the available datasets window, choose the data file you want to analyse (i.e. the log10-normalised data). Set the SNR analysis to produce the 10 genes with the highest SNR. Save the resulting figure (click on it to save) and write down the numbers of the genes/variables for further processing.
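The ranking idea behind the SNR step can be sketched in Python as follows. This sketch assumes one common two-class definition, SNR = |mean1 - mean2| / (std1 + std2); the formula NeuCom uses, and how it handles more than two classes, may differ. The function names and toy data are hypothetical.

```python
from statistics import mean, stdev

def snr_per_gene(rows):
    """Return one SNR score per gene column (the last column is the class label).

    Assumes exactly two classes with at least two samples each.
    """
    classes = sorted(set(row[-1] for row in rows))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n_genes = len(rows[0]) - 1
    scores = []
    for g in range(n_genes):
        a = [row[g] for row in rows if row[-1] == classes[0]]
        b = [row[g] for row in rows if row[-1] == classes[1]]
        denom = stdev(a) + stdev(b)
        scores.append(abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores

def top_genes(rows, n):
    """Return the 1-based column numbers of the n genes with the highest SNR."""
    scores = snr_per_gene(rows)
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return [g + 1 for g in order[:n]]

# Hypothetical toy data: 6 samples, 3 genes, class label in the last column.
toy = [
    [1.0, 5.0, 2.0, 1],
    [1.2, 3.0, 2.1, 1],
    [0.8, 4.0, 1.9, 1],
    [3.0, 4.5, 2.0, 2],
    [3.2, 3.5, 2.2, 2],
    [2.8, 5.5, 1.8, 2],
]
print(top_genes(toy, 2))
```

A gene scores highly when its class means are far apart relative to the spread within each class, which is the property the selected genes should have.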
Gene selection. Use Extract (on the bottom taskbar) to extract the genes (variables) from the data file by specifying their column numbers separated by spaces, e.g. 1320 2496 ..., etc. Create files with N = 3, 5 and 10 selected genes, and name them 3, 5 and 10 or something similar. Always keep the class label column (i.e. the last, or output, column) in the data files. Keep track of which gene is which by creating a table that lists each new column number against the number from the original file; for instance, the new No. 1 is the original No. 1320, etc.
Removal of outliers. Outliers are data points that differ from the others by a large factor (i.e. ~10x). These are usually errors, so remove the affected variable (gene) manually by deleting its column from your datasets. Replace the removed gene with the next gene in the SNR ranking.
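The extraction and bookkeeping described in the gene selection step can be sketched in Python as below. The function name and the toy data are hypothetical; the point is only to show the selected columns being copied together with the class label, and the new-to-original numbering table being built at the same time.

```python
def extract_genes(rows, gene_numbers):
    """Keep only the listed gene columns plus the class label.

    gene_numbers are 1-based column numbers in the original file.
    Returns the reduced rows and a table {new number: original number}.
    """
    subset = [[row[g - 1] for g in gene_numbers] + [row[-1]] for row in rows]
    mapping = {new: old for new, old in enumerate(gene_numbers, start=1)}
    return subset, mapping

# Hypothetical toy data: 2 samples, 4 genes, class label in the last column.
original = [
    [10, 20, 30, 40, "A"],
    [11, 21, 31, 41, "B"],
]
subset, mapping = extract_genes(original, [3, 1])
print(subset)
print(mapping)
```

Keeping the mapping next to the reduced file makes it possible to report results in terms of the original gene numbers later.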
2D data visualisation. Visualisation is on the top taskbar/menu. Start with 2D visualisation of your selected genes. Use the single plot function and view the distribution of each variable (gene) across the samples. Manipulate the Y-Axis cursor to change the variable (i.e. gene).
Use the multiplot function. The last column (class labels) shows how the gene expression values are distributed across classes. Observe how the expression values of the selected genes vary across the classes.
Class distribution of selected genes. This function is another visualisation method to see whether selected genes clearly separate the classes.
3D visualisation. Tick the option Classification data before pressing Start. Manipulate Azimuth and Elevation to rotate the data. Determine whether any combination of 3 genes leads to a clear distinction between classes.
K-means clustering. (Data Analysis -> Clustering -> K-means.) Tick Show cluster label on the right. The default number of clusters is 3 (in the field Number of). Perform the clustering for each pair of features (X and Y values). Do the clusters correspond to classes? Unfortunately, we cannot tell from the plot alone: the sample numbers would let us check, but the graph resolution is too low to read them.
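Outside the GUI, the question of whether clusters correspond to classes can be checked by cross-tabulating cluster labels against class labels. The Python sketch below uses a plain Lloyd's-algorithm k-means with a deterministic start (the first k points), which is an arbitrary choice made for this illustration and is not NeuCom's implementation. All names and data are hypothetical.

```python
from collections import Counter

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]

    def nearest(p):
        # Index of the closest centroid by squared Euclidean distance.
        return min(range(k),
                   key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p) for p in points]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical toy data: two selected genes per sample, plus known class labels.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.9), (0.8, 1.1), (5.1, 4.9), (4.9, 5.2)]
labels = [1, 2, 1, 1, 2, 2]

clusters = kmeans(points, k=2)
# Count how many samples fall into each (cluster, class) combination.
table = Counter(zip(clusters, labels))
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster}, class {label}: {count} samples")
```

If each cluster is dominated by a single class in this table, the clusters correspond well to the classes; a cluster that mixes several classes indicates that the chosen pair of genes does not separate them.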
PCA can be found in Visualisation. Tick the options Data contains output and Classification D under parameters. Can PCA help to classify samples in this particular case? Also check the graph of % variance captured, which shows the proportion of the variance in the data captured by each principal component.
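To illustrate what "% variance captured" means, the Python sketch below estimates the fraction of total variance explained by the first principal component. It uses power iteration on the covariance matrix, a simple technique chosen to keep the example dependency-free; it is not how NeuCom computes PCA. The input rows contain feature values only (no class label), and all names and data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the feature columns."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_pc_variance_fraction(rows, iterations=200):
    """Fraction of total variance captured by the first principal component."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))  # total variance = trace
    if total == 0:
        return 0.0
    # Non-uniform start vector, to reduce the chance of starting orthogonal
    # to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the leading eigenvalue (v has unit length).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue / total

# Hypothetical toy data: the second gene is exactly twice the first,
# so a single component captures essentially all of the variance.
collinear = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(first_pc_variance_fraction(collinear))
```

When the first one or two components capture most of the variance, a 2D PCA plot is a faithful summary of the data; when the variance is spread evenly across components, the plot hides much of the structure.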
We will compare the performance of two classification methods, a support vector machine (SVM) and a multi-layer perceptron (MLP), on the three datasets of 3, 5 and 10 selected genes. The comparison will use leave-one-out cross-validation.
Note: You have to clear old and create new sessions before each experiment.
In the Modelling&Discovery dropdown menu, choose cross-validation. Select the dataset. After clicking the Create Sessions action button, choose Classification, then Support Vector Machine, then Linear with the N/A option (or polynomial with a specified degree of the polynomial boundary), then Leave-one-out cross-validation, sequential sampling, No normalisation, inductive (or transductive) training, and no feature selection. Keep all created sessions highlighted and click the Start button. When the cross-validation is finished, press Visualise results.
In the Modelling&Discovery dropdown menu, choose cross-validation. Select the dataset.
After clicking the Create Sessions action button, choose Classification,
then Multi-Layer Perceptron, and then choose the network parameters.
Keep all the default parameters except the number of hidden neurons,
which you will experiment with. In your experiments, try 5, 10 and 20
hidden neurons.
Then select Leave-one-out cross-validation, sequential sampling,
No normalisation, inductive training (only inductive training will be used
for MLP), and no feature selection.
Keep all created sessions highlighted and click the Start button.
When the cross-validation is finished, press Visualise results.
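The leave-one-out procedure used in both experiments can be sketched in Python as follows: each sample is held out in turn, a model is built from the remaining samples, and the held-out sample is classified. To keep the example short and dependency-free, a nearest-centroid classifier stands in for the SVM and MLP; this substitution is an assumption of the sketch, and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_rows, sample):
    """Predict the class whose mean feature vector is closest to the sample.

    A simple stand-in classifier, not an SVM or MLP.
    """
    by_class = {}
    for *features, label in train_rows:
        by_class.setdefault(label, []).append(features)
    best_label, best_dist = None, float("inf")
    for label, members in by_class.items():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((x - c) ** 2 for x, c in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(rows, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and return the accuracy."""
    correct = 0
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]  # every sample except the i-th
        *features, label = held_out
        if predict(train, features) == label:
            correct += 1
    return correct / len(rows)

# Hypothetical toy data: two selected genes per sample, class label last.
toy = [
    [1.0, 1.0, "A"], [1.2, 0.9, "A"], [0.8, 1.1, "A"],
    [5.0, 5.0, "B"], [5.1, 4.9, "B"], [4.9, 5.2, "B"],
]
accuracy = leave_one_out_accuracy(toy)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Because every sample is tested exactly once on a model that never saw it, leave-one-out makes good use of small datasets such as microarray studies, at the cost of training as many models as there are samples.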