COSC 348: Lab04

Overview: Pair-wise and multiple sequence alignments

IMPORTANT: Save the last 30 minutes or so to write and submit a one-page report reflecting on today's investigation. Compare the different alignment tools you have used and answer the questions posed throughout this document. Remember, today's lab work and this reflection are worth 1% of your final mark.

Part 1: Local pair-wise alignment with LALIGN -- dissimilar sequences

LALIGN uses code developed by X. Huang and W. Miller (Adv. Appl. Math., vol. 12, pp. 337-357, 1991). This program is part of the FASTA package of sequence analysis programs. While SSEARCH reports only the best alignment between the query sequence and the library sequence, LALIGN reports a specified number of alignments between two sequences, together with their scores.

First, let us retrieve the sequences for comparison.
Point the browser to http://www.uniprot.org/uniprot/P05049.fasta to retrieve the first protein sequence in FASTA format.
Open a new tab and point the browser to http://www.uniprot.org/uniprot/P08246.fasta. Notice that these sequences are of different lengths: the second one is about half the length of the first.
Can you find out what these two proteins (P05049 and P08246) are?
Open another tab and go to http://www.ch.embnet.org/software/LALIGN_form.html. Choose the local alignment option.
Choose the number of reported sub-alignments. Leave it at the default.
Select the substitution matrix. Leave it at the default.
Set the gap opening penalty. Keep the default values as they have been adjusted to the default scoring matrices.
Choose the 'plain text' input sequence format for both sequences.
Enter both sequences and their IDs into the corresponding windows.
Click the 'Run lalign' button. The program will display the result.
Notes:
Waterman-Eggert score is a specific formula for calculating the alignment score.
':' means identical match, '.' means similar substitution, ' ' means mismatch.
Inspect the result and try to find those parts that make any sense to you based on material presented in the lectures. You do not have to understand everything to answer the following questions:
Which algorithm is used by the program?
What is the sequence identity between these sequences?
How many amino acids do they share?
Are these sequences evolutionarily related?

Note: We need E < 10^-4 in order to consider the sequences evolutionarily related.
Let us return to the previous page and select the global alignment option for the same two sequences. Try to answer the previous questions by inspecting the result of the global alignment.

Differences between local and global alignment:

Local alignment is used to compare distantly related sequences that share only a few common domains. Programs only align the most similar regions of the sequences and ignore the rest.
Global alignment is used for checking minor differences between two sequences, analyzing single nucleotide polymorphisms (point mutations), and comparing sequences that overlap.
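
The contrast between the two modes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the scoring values (match +2, mismatch -1, linear gap penalty -2) and the function names are made up for this example, whereas LALIGN uses a substitution matrix and separate gap opening and extension penalties.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps are penalised
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style score: only the best-matching region counts."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    long_seq, short_seq = "XXXGATTACAYYY", "GATTACA"  # toy sequences
    print("global:", global_score(long_seq, short_seq))
    print("local: ", local_score(long_seq, short_seq))
```

With these toy sequences the local score rewards the shared GATTACA region only, while the global score is also charged for the six unmatched flanking letters.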

Question: Why and when would you choose to carry out either a global or a local alignment?

Part 2: Global pair-wise alignment with LALIGN -- similar sequences

In a new tab, point the browser to http://www.uniprot.org/uniprot/P14867.fasta to retrieve the first protein sequence in FASTA format. It is the gamma-aminobutyric acid receptor subunit alpha-1 from Homo sapiens (human).
Open a new tab and point the browser to http://www.uniprot.org/uniprot/Q4R534.fasta. It is the gamma-aminobutyric acid receptor subunit alpha-1 from Macaca fascicularis (Macaque monkey).
Run global alignment on these two sequences.
What is the sequence identity between these sequences?
How many amino acids do they share?
List the amino acids that differ between the Human and the Macaque sequences and their position index in the alignment.
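
If you want to check your manual count, a minimal Python sketch such as the one below can help. It assumes you have copied the two aligned rows (including any '-' gap characters) out of the LALIGN output as plain strings of equal length. The function name is hypothetical, the example strings are toy data rather than the real GABRA1 sequences, and identity is computed here over the full alignment length, which may differ from the convention the program itself reports.

```python
def compare_aligned(seq_a, seq_b):
    """Compare two aligned sequences of equal length ('-' marks a gap).

    Returns (identical_count, percent_identity, differences), where
    differences is a list of (position, residue_a, residue_b) with
    1-based positions in the alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    differences = []
    for pos, (x, y) in enumerate(zip(seq_a, seq_b), start=1):
        if x == y and x != "-":
            identical += 1
        elif x != y:
            differences.append((pos, x, y))
    percent = 100.0 * identical / len(seq_a) if seq_a else 0.0
    return identical, percent, differences


if __name__ == "__main__":
    # Toy aligned rows, NOT the real human and macaque sequences.
    count, percent, diffs = compare_aligned("ACDEFGHIKL", "ACDQFGHIKV")
    print(count, "identical residues,", percent, "% identity")
    for pos, x, y in diffs:
        print("position", pos, ":", x, "->", y)
```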

Part 3: Using BLAST (Basic Local Alignment Search Tool) for multiple sequence alignment

BLAST is a sequence comparison tool that quickly tells us which sequences in the databases are similar to our own sequence.

Open a new tab and point the browser to http://www.ncbi.nlm.nih.gov/BLAST/
Choose a BLAST program to run -- i.e. the protein blast, and copy/paste your protein GABRA1 sequence from http://www.uniprot.org/uniprot/P14867.fasta.
Keep the default parameter values.
Hit the BLAST button and wait for results. An overview of the database sequences aligned to the query sequence is shown. First, the score of each alignment is indicated by one of five different colours. A red bar indicates the best alignments, with a score of 200 or more; a pink bar indicates somewhat weaker alignments (score 80--200), etc.
Below the colourful figure is the list of sequences producing significant alignments in order of the most significant ones downwards. Clicking on U or G will take you to the NCBI gene databases where you can obtain more detailed information about aligned sequences.
Further below this list of matched sequences, you can find the individual pair-wise alignments of each database sequence with your query sequence.
Scroll back to the top of the page. Click on 'Distance tree of results'. It shows a phylogenetic tree built by clustering the sequences according to their mutual distance, measured as the number of differences between them. The query sequence is highlighted in yellow. On the right, there are colour codes for the different species from which the other sequences come.
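
The idea of a "distance measured as number of differences" can be sketched as follows. This is a simplified illustration only: the function names and the toy sequences are assumptions, the sequences are taken to be already aligned to the same length, and the tree shown by NCBI is built with more refined distance and tree-building methods than this raw count.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of alignment positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_table(aligned):
    """Pairwise distances for a dict {name: aligned_sequence}."""
    return {
        (n1, n2): count_differences(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


def closest_pair(aligned):
    """The pair of names with the smallest distance: the first pair a
    simple clustering procedure would join into one branch."""
    table = distance_table(aligned)
    return min(table, key=table.get)


if __name__ == "__main__":
    toy = {  # toy aligned sequences, not real BLAST hits
        "query": "ACDEFGHIKL",
        "seqA": "ACDEFGHIKV",
        "seqB": "TCDQFGHLKV",
    }
    for pair, dist in distance_table(toy).items():
        print(pair, dist)
    print("closest:", closest_pair(toy))
```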

Differences between local alignments with BLAST and LALIGN:

BLAST returns aligned sequences based on only one best local alignment. As a worst case scenario, BLAST finds only one local alignment. On the other hand, LALIGN returns as many local alignments as we specify (if they exist). The best one, the second best, etc.
BLAST is: very fast, suitable for very long sequences, and best with DNA.
LALIGN is: slower, suitable for shorter sequences, and best with proteins.

Part 4: Using BLAST to find out about the 'unknown' DNA sequence

To retrieve the query DNA/mRNA sequence in FASTA format, open a new browser tab or window and enter
http://www.ncbi.nlm.nih.gov/nuccore/31630?report=fasta&log$=seqview.
This will retrieve the human gene for the gamma-aminobutyric acid receptor subunit alpha-1.
Save the sequence into a text file and insert various "mutations", "frameshifts", "deletions", etc. in order to obtain a new "unknown" DNA sequence extracted from an alien species.
Point the browser to http://www.ncbi.nlm.nih.gov/BLAST/ and choose nucleotide BLAST.
Enter the "scrambled" DNA sequence into the search window.
Hit the BLAST button and wait for results. Is there anything similar to your unknown gene found in the database? Why? Or why not?
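
Instead of editing the sequence by hand, you could scramble it with a short script. The sketch below is one possible way to do it, not a prescribed method: the function names, the default numbers of mutations and the fixed random seed are all assumptions, and it expects a FASTA text containing a single record.

```python
import random


def read_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0][1:] if lines and lines[0].startswith(">") else ""
    seq = "".join(ln for ln in lines if not ln.startswith(">")).upper()
    return header, seq


def mutate(seq, n_sub=3, n_del=1, n_ins=1, seed=0):
    """Apply random point substitutions, single-base deletions and
    single-base insertions to a non-empty DNA sequence."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    bases = list(seq)
    for _ in range(n_sub):
        i = rng.randrange(len(bases))
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    for _ in range(n_del):
        del bases[rng.randrange(len(bases))]
    for _ in range(n_ins):
        bases.insert(rng.randrange(len(bases) + 1), rng.choice("ACGT"))
    return "".join(bases)


if __name__ == "__main__":
    header, seq = read_fasta(">toy\nACGTACGTAC\nGTACGTACGT\n")  # toy record
    print(">alien version of", header)
    print(mutate(seq))
```

Note that a single-base insertion or deletion shifts the reading frame of everything downstream of it, which is the kind of "frameshift" mentioned above.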

Part 5: Use blastx to discover proteins encoded in the query DNA sequence

Go to http://www.ncbi.nlm.nih.gov/BLAST/ and choose blastx.
Enter the "scrambled" query DNA sequence into the search window.
Hit the BLAST button and wait for results. Explore which proteins, and in which species, are predicted to be encoded by your alien DNA.
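
blastx compares a nucleotide query against a protein database by first translating the query in all six reading frames (three on each strand). The following sketch shows what such a translation looks like using the standard genetic code. The function names are made up for this illustration and it is not the code BLAST itself uses; codons containing letters other than A, C, G and T are translated as 'X' here.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translations of frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():  # toy query
        print(name, protein)
```

A single inserted or deleted base moves the correct protein from one frame to another, which is why searching all six frames can still find matches in a frameshifted query.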

The European Bioinformatics Institute (EBI) toolbox site
http://www.ebi.ac.uk/Tools/ provides a comprehensive range of bioinformatics tools for:

Similarity & Homology -- the BLAST or FASTA programs can be used to look for sequence similarity and infer homology, including implementations of Needleman-Wunsch and Smith-Waterman algorithms (under the FASTA link of programs).
Protein Functional Analysis -- InterProScan can be used to search for motifs in your protein sequence.
Sequence Analysis -- ClustalW2, a multiple sequence alignment tool that can also be used to construct phylogenetic trees.
Structural Analysis -- MSDfold can be used to query your protein structure and compare it to those in the Protein Data Bank (PDB).
Web Services -- provide programmatic access to the various databases and retrieval/analysis services EBI provides.
Tools Miscellaneous -- Expression Profiler, a set of tools for clustering, analysis and visualisation of gene expression and other genomic data.

Useful references:

List of sequence alignment software
Claverie J-M and Notredame C. (2007) Bioinformatics for Dummies, 2nd ed. Wiley, Indiana.
