COSC 348: Lab04
Overview: Pair-wise and multiple sequence alignments
IMPORTANT: Save the last 30 minutes or so to write a one-page report reflecting on today's investigation
and submit it. Compare the different alignment tools you have used and answer the questions posed throughout this document.
Remember, today's lab work and this reflection are worth 1% of your final mark.
Part 1: Local pair-wise alignment with LALIGN -- dissimilar sequences
LALIGN uses code developed by X. Huang and W. Miller
(Adv. Appl. Math., vol. 12, pp. 337-357, 1991). This program is part of the
FASTA package of sequence analysis programs. While SSEARCH reports only the
best alignment between the query sequence and the library sequence, LALIGN
reports a specified number of alignments between two sequences and their
scores.
-
First, let us retrieve the sequences for comparison.
-
Point the browser to
http://www.uniprot.org/uniprot/P05049.fasta
to retrieve the 1st protein sequence in FASTA format.
-
Open a new tab and point the browser to
http://www.uniprot.org/uniprot/P08246.fasta.
Notice that these sequences differ in length: the second is about
half the length of the first.
Can you find out what these two proteins (P05049 and P08246) are?
-
Open another tab and go to
http://www.ch.embnet.org/software/LALIGN_form.html.
Choose the local alignment option.
-
Choose the number of reported sub-alignments. Leave it at the default.
-
Select the substitution matrix. Leave it at the default.
-
Set the gap opening penalty. Keep the default values as they have been
adjusted to the default scoring matrices.
-
Choose the 'plain text' input sequence format for both sequences.
-
Enter both sequences and their IDs into corresponding windows.
-
Click the 'Run lalign' button. The program will display the result.
Notes:
The Waterman-Eggert score is the alignment score that LALIGN reports for each local alignment.
':' marks an identical match, '.' a similar (conservative) substitution, and ' ' a mismatch.
-
Inspect the result and try to find those parts that make any sense to you based on material presented in the lectures.
You do not have to understand everything to answer the following questions:
Which algorithm is used by the program?
What is the sequence identity between these sequences?
How many amino acids do they share?
Are these sequences evolutionarily related?
Note: We need E < 10^-4 in order to consider sequences to
be evolutionarily related.
-
Return to the previous page and run a global alignment on the
same two sequences. Try to answer the previous questions by inspecting the result
of the global alignment.
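The retrieval steps at the start of this part can also be checked in a script. The following is a minimal sketch, assuming the FASTA text has already been saved or downloaded; `parse_fasta` is a hypothetical helper written for this lab, and the two toy records are invented placeholders, not the real P05049 and P08246 sequences.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy records standing in for the two downloaded files (NOT the real sequences).
example = """>toy_A hypothetical longer protein
MKTAYIAKQR
QISFVKSHFS
>toy_B hypothetical shorter protein
MKTAYIAK
"""

for header, seq in parse_fasta(example):
    # Print the record ID (first word of the header) and the sequence length.
    print(header.split()[0], len(seq))
```

Pasting the contents of the two downloaded FASTA files into `example` would let you compare the real sequence lengths in the same way.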
Differences between local and global alignment:
-
Local alignment is used to compare distantly related sequences that share
only a few common domains. Programs only align the most similar regions of
the sequences and ignore the rest.
-
Global alignment is used for checking minor differences between two
sequences, analyzing single nucleotide polymorphisms (point mutations),
and comparing sequences that overlap.
Question: Why and when would you choose to carry out either a global or a local alignment?
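To make the distinction above concrete, here is a minimal sketch of the two scoring schemes as dynamic programs. It assumes a toy scoring system (match +1, mismatch -1, linear gap penalty -1) chosen only for illustration; LALIGN itself uses substitution matrices and separate gap opening and extension penalties, so the numbers will not match its output. The function names and test strings are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning ALL of a with ALL of b (end gaps are penalised)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings; cells are never allowed below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two toy sequences sharing one region ("ACGTACGT") with unrelated flanks.
a = "GGGGACGTACGT"
b = "ACGTACGTCCCC"
print("local :", local_score(a, b))   # the shared region alone scores 8
print("global:", global_score(a, b))  # flanks must be aligned or gapped too
```

The only structural differences are the zero floor and the "best cell anywhere" rule in the local version, which is what lets it ignore the dissimilar flanks.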
Part 2: Global pair-wise alignment with LALIGN -- similar sequences
-
In a new tab point the browser to
http://www.uniprot.org/uniprot/P14867.fasta
to retrieve the first protein sequence in FASTA format.
It's the gamma-aminobutyric acid receptor subunit alpha-1 from Homo sapiens
(human).
-
Open a new tab and point the browser to
http://www.uniprot.org/uniprot/Q4R534.fasta.
It's the gamma-aminobutyric acid receptor subunit alpha-1 from Macaca
fascicularis (Macaque monkey).
-
Run global alignment on these two sequences.
What is the sequence identity between these sequences?
How many amino acids do they share?
List the amino acids that differ between the Human and the Macaque sequences and their position index in the alignment.
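If you copy the two aligned rows out of the alignment output, the questions above can be cross-checked with a few lines of code. This is a sketch under stated assumptions: both rows have been joined into single strings of equal length, gaps are written as '-', and identity is taken as identical columns divided by alignment length (tools differ in the denominator they use). `compare_aligned` and the toy rows are hypothetical, not the real human and macaque sequences.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Return (identical_count, percent_identity, differences) for two aligned rows.

    differences is a list of (position, residue_a, residue_b), 1-based,
    counted along the alignment (gap columns included).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    differences = []
    for pos, (x, y) in enumerate(zip(seq_a, seq_b), start=1):
        if x == y and x != gap:
            identical += 1
        elif x != y:
            differences.append((pos, x, y))
    percent = 100.0 * identical / len(seq_a) if seq_a else 0.0
    return identical, percent, differences


# Toy aligned rows (NOT the real P14867 / Q4R534 sequences).
row_human = "MKTAYIAK-R"
row_macaque = "MKSAYIAKQR"
count, percent, diffs = compare_aligned(row_human, row_macaque)
print(count, percent)  # 8 80.0
print(diffs)           # [(3, 'T', 'S'), (9, '-', 'Q')]
```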
Part 3: Using BLAST (Basic Local Alignment Search Tool) for multiple sequence alignment
BLAST is a sequence comparison tool that quickly tells us which sequences
in the databases are similar to our own sequence.
-
Open a new tab and point the browser to
http://www.ncbi.nlm.nih.gov/BLAST/
-
Choose a BLAST program to run, here protein BLAST, and copy and paste the
GABRA1 protein sequence from
http://www.uniprot.org/uniprot/P14867.fasta.
-
Keep the default parameter values.
-
Hit the BLAST button and wait for results. An overview of the database
sequences aligned to the query sequence is shown. The score of
each alignment is indicated by one of five different colours. A red bar
indicates the highest-scoring alignments (score of 200 or more), a pink bar
indicates slightly weaker alignments (score 80--200), etc.
-
Below the colourful figure is the list of sequences producing significant
alignments in order of the most significant ones downwards. Clicking on U
or G will take you to the NCBI gene databases where you can obtain more
detailed information about aligned sequences.
-
Further below this list of matched sequences from the database, one can find
the individual pair-wise alignments of each listed sequence with your query
sequence.
-
Scroll back to the top of the page. Click on 'Distance tree of results'.
It shows a phylogenetic tree built by clustering the sequences according to their
mutual distances, measured as the number of differences between them. The query
sequence is highlighted in yellow. On the right, there are colour codes
for the different species from which the other sequences come.
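The idea of a distance measured as the number of differences can be sketched in a few lines. The example below only counts mismatching columns between already aligned, equal-length toy sequences and picks the closest pair, which is the first merge a simple agglomerative clustering would make. It is not the tree-building method NCBI uses, and the names and sequences are invented for illustration.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of alignment columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def closest_pair(aligned):
    """aligned: dict of name -> aligned sequence. Return (distance, name1, name2)."""
    return min(
        (count_differences(aligned[p], aligned[q]), p, q)
        for p, q in combinations(sorted(aligned), 2)
    )


# Toy aligned sequences (hypothetical, for illustration only).
toy = {
    "human": "MKTAYIAKQR",
    "macaque": "MKTAYIAKQK",
    "mouse": "MRTAYLAKQK",
}
for p, q in combinations(sorted(toy), 2):
    print(p, q, count_differences(toy[p], toy[q]))
print(closest_pair(toy))  # (1, 'human', 'macaque')
```

Repeating the merge step on the remaining groups would produce a tree in which the most similar sequences sit on neighbouring branches.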
Differences between local alignments with BLAST and LALIGN:
-
BLAST reports aligned sequences based on only the single best
local alignment. In the worst case, BLAST finds only one local
alignment. LALIGN, on the other hand, returns as many local alignments as we
specify (if they exist): the best one, the second best, etc.
-
BLAST is: very fast, suitable for very long sequences, and best
with DNA.
-
LALIGN is: slower, suitable for shorter sequences, and best
with proteins.
Part 4: Using BLAST to find out about the 'unknown' DNA sequence
-
To retrieve the query DNA/mRNA sequence in FASTA format, open a new
browser tab/window and enter
http://www.ncbi.nlm.nih.gov/nuccore/31630?report=fasta&log$=seqview.
This will retrieve the human gene for the gamma-aminobutyric acid receptor
subunit alpha-1.
-
Save the sequence into a text file and insert various "mutations",
"frameshifts", "deletions", etc. in order to obtain a new "unknown" DNA
sequence extracted from an alien species.
-
Point the browser to
http://www.ncbi.nlm.nih.gov/BLAST/
and choose nucleotide BLAST.
-
Enter the "scrambled" DNA sequence into the search window.
-
Hit the BLAST button and wait for results. Is there anything similar to
your unknown gene found in the database? Why? Or why not?
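The "scrambling" step can be done by hand in a text editor, or with a small script such as the sketch below. It assumes an uppercase DNA string with the FASTA header line removed; `mutate` and its parameters are hypothetical names, and the short input is a made-up placeholder rather than the real gene. Inserting or deleting a number of bases that is not a multiple of three produces a frameshift.

```python
import random


def mutate(seq, n_sub=0, n_ins=0, n_del=0, seed=0):
    """Return a copy of seq with random substitutions, insertions and deletions.

    A fixed seed makes the result reproducible, so the same "alien" sequence
    can be regenerated for the report.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    s = list(seq)
    for _ in range(n_sub):
        if not s:
            break
        i = rng.randrange(len(s))
        # Always pick a base different from the current one (point mutation).
        s[i] = rng.choice([b for b in bases if b != s[i]])
    for _ in range(n_ins):
        s.insert(rng.randrange(len(s) + 1), rng.choice(bases))
    for _ in range(n_del):
        if s:
            del s[rng.randrange(len(s))]
    return "".join(s)


# Made-up placeholder sequence; replace with the downloaded gene sequence.
original = "ATGGCCATTGTAATGGGCCGC"
alien = mutate(original, n_sub=3, n_ins=1, n_del=2, seed=42)
print(original)
print(alien)
```

Increasing the three counts step by step is a simple way to explore how much damage a sequence can take before BLAST no longer finds the original gene.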
Part 5: Use blastx to discover proteins encoded in the query DNA sequence
-
Go to
http://www.ncbi.nlm.nih.gov/BLAST/
and choose blastx.
-
Enter the "scrambled" query DNA sequence into the search window.
-
Hit the BLAST button and wait for results. Explore which proteins in
which species are predicted to be coded by your alien DNA.
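blastx compares the query's conceptual translation in all six reading frames (three per strand) against a protein database. The sketch below shows what that translation step looks like, assuming the standard genetic code and a plain DNA string; the function names are hypothetical, codons containing anything other than A, C, G or T are written as 'X', and the database search itself is not reproduced.

```python
# Standard genetic code, listed in the usual T, C, A, G order for each base.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse, to read the opposite strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

A single-base insertion or deletion in the DNA shifts the downstream protein into a different frame, which is why blastx can still find a match that a single-frame translation would miss.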
The European Bioinformatics Institute (EBI) toolbox site
http://www.ebi.ac.uk/Tools/
provides a comprehensive range of bioinformatics tools for:
-
Similarity & Homology -- the BLAST or FASTA programs can be
used to look for sequence similarity and infer homology, including
implementations of Needleman-Wunsch and Smith-Waterman algorithms
(under the FASTA link of programs).
-
Protein Functional Analysis -- InterProScan can be used to
search for motifs in your protein sequence.
-
Sequence Analysis -- ClustalW2 is a sequence alignment tool used for the
construction of phylogenetic trees.
-
Structural Analysis -- MSDfold can be used to query your protein
structure and compare it to those in the Protein Data Bank (PDB).
-
Web Services -- provide programmatic access to the various
databases and retrieval/analysis services EBI provides.
-
Tools Miscellaneous -- Expression Profiler is a set of tools for
clustering, analysis and visualisation of gene expression and
other genomic data.