COSC 348: Lab02
Overview: Finding Data
An important part of bioinformatics is processing data, so the first
thing we need to find out is "where is the data".
The purpose of this lab is to find information about a single protein called GABRA1,
and the DNA sequence that encodes it. In the second part,
we'll look into how to find information about a whole genome.
IMPORTANT: Save the last 30 minutes or so to write a one-page report reflecting on
today's investigation, and submit it. Answering the questions posed throughout this
document will help you write it. Do not hesitate to speculate; it is not an
exam. Any feedback you provide on the lab session will be very valuable as well.
Remember, today's lab work and this reflection are worth 1% of your final mark.
Part 1: Protein and DNA sequence retrieval
We will retrieve various information about the protein denoted by GABRA1.
First let us find out what the GABRA1 protein is: have a
read of this.
Make sure you understand what GABRA1 is -- talk to someone near
you, and see if you have the same idea.
ExPASy (SwissProt database) – retrieving the AA sequence of proteins
Now that we know what GABRA1 is, we want to find out, and store, the
sequence of Amino Acids (AA) that makes the protein.
-
Visit the ExPASy site at:
www.expasy.org/sprot
-
Search the list of UniProtKB/Swiss-Prot entries by typing "GABRA1" into the search text box.
-
Click on the search result GBRA1_HUMAN, accession number P14867,
and explore what kind of information is available about this protein.
-
How many AA are in the sequence of this protein?
-
Where in the cell would you expect to find this protein?
-
Based on the publication titles, can you name 3-5 disorders in which mutations of the GABRA1 gene are involved?
Make sure you are back on the
GBRA1_HUMAN page.
-
Find the section called Sequence and select "FASTA".
This opens a page
with the AA sequence presented in FASTA format.
Save this to a text file.
When reading the file, use a fixed-width font, such as Courier
or Courier New.
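About the FASTA format:
A FASTA record starts with the definition line, which begins with '>' and
carries the sequence identifier and name:
>Sequence_identifier_and_name
The definition line is followed by the sequence itself (amino acids or bases), e.g.:
ARCTGKINYD.....
FASTA is the default format for many sequence analysis programs.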
These programs are case-sensitive. Be aware:
-
always use CAPITAL letters for protein and DNA/RNA sequences
-
always use the TEXT option of the word processing program,
i.e. save nothing but plain ASCII
-
use the Courier New font for easy alignment (letters have the same width)
-
the RAW format is simply the sequence part of the FASTA format without
the definition line
-
beware of unwanted characters such as '*', '-', <Tab>,
<Space>, etc.
They can ruin the analysis.
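The rules above can be checked automatically. The following is a minimal Python sketch, not part of the original lab: the function names `parse_fasta` and `unexpected_characters` and the demo record are hypothetical, and the alphabet is assumed to be the 20 standard amino-acid one-letter codes in capitals.

```python
def parse_fasta(text):
    """Split FASTA text into (definition_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Assumption: the 20 standard amino-acid one-letter codes, capitals only.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def unexpected_characters(sequence, alphabet=AMINO_ACIDS):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)


# Hypothetical record with a stray space and '*' in its second sequence line.
example = ">demo_id hypothetical example record\nARCTGKINYD\nARC TG*\n"
records = parse_fasta(example)
header, seq = records[0]
print(header)
print(len(seq), unexpected_characters(seq))
```

Running the same check on the file you saved for GBRA1_HUMAN should report no unexpected characters; lowercase letters would be flagged as well, in line with the advice to use capitals.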
GenBank – Retrieving the coding DNA sequence for a protein
-
Return to the
GBRA1_HUMAN page
-
On the left, click the 'cross-references' link, to take you halfway
down the page.
-
To retrieve the DNA coding sequence for our protein GABRA1,
first choose GenBank from the dropdown menu on the left and then click
the link
X14766 mRNA, which will bring up the GenBank entry for the human
mRNA/gene encoding GABRA1.
-
Scroll down to the entry entitled ORIGIN, and notice that the DNA/mRNA
sequence has 1742 bases whereas the protein GABRA1 has only 456 AA.
-
How many AA would you expect to find in a protein translated from an mRNA that is 1742 nucleotides long?
-
Can you explain why? What is going on?
-
Can you figure out what the 'CDS' entry means?
This format of sequence is called the GenBank format.
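One way to check your answers to the length questions above is to do the codon arithmetic explicitly. This is only a sketch under two assumptions: each amino acid is encoded by one codon of three bases, and the coding region ends with a single stop codon. The two lengths are the ones quoted above; the function names are hypothetical.

```python
MRNA_LENGTH = 1742     # bases in the X14766 entry, as quoted in this lab
PROTEIN_LENGTH = 456   # amino acids in GBRA1_HUMAN, as quoted in this lab


def max_codons(n_bases):
    """Upper bound on codons if every base were part of the coding region."""
    return n_bases // 3


def coding_bases(n_amino_acids, include_stop=True):
    """Bases needed to encode the protein, optionally counting one stop codon."""
    return 3 * (n_amino_acids + (1 if include_stop else 0))


print(max_codons(MRNA_LENGTH))                      # 580
print(coding_bases(PROTEIN_LENGTH))                 # 1371
print(MRNA_LENGTH - coding_bases(PROTEIN_LENGTH))   # 371
```

The last number is the count of bases in the mRNA that lie outside the coding region under these assumptions; compare it with what the CDS entry tells you.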
About the GenBank format:
GenBank is also the default format for many sequence analysis programs.
It consists of 4 parts (described in detail at
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):
-
The header, starting with the LOCUS field, contains a number of different data
elements, including the identification/accession number, type of sequence
(e.g. mRNA), sequence length, molecule name and definition. It also gives the
version (to distinguish different versions of the same sequence) and the
source and organism.
-
The REFERENCE section. Publications by the authors who reported the
sequence data in the record.
-
The FEATURES section. Information about genes and gene products, as well
as regions of biological significance within the sequence. These can include
regions of the sequence that code for proteins and RNA molecules, as well
as a number of other features. CDS is the coding region of a gene's DNA or
RNA that is composed of exons (i.e. the true coding regions).
-
ORIGIN. The sequence data begin on the line immediately below ORIGIN.
-
To view/save the sequence data only, display the record in the FASTA format.
To do that, go up the page to Display Settings, choose FASTA from the
drop-down menu, and click Apply.
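To see how the FEATURES and ORIGIN parts fit together, here is a minimal Python sketch that reads the CDS range and the sequence from a GenBank-style record and cuts out the coding region. The record below is a made-up, heavily shortened example (not a real GenBank entry), the function names are hypothetical, and the sketch only handles a simple `start..end` CDS location.

```python
import re

# Hypothetical, shortened record for illustration only.
DEMO_RECORD = """\
LOCUS       DEMO0001        30 bp    mRNA    linear
DEFINITION  Hypothetical record, for illustration only.
FEATURES             Location/Qualifiers
     CDS             4..27
ORIGIN
        1 gggatggctt gtaaacgtat taactaaccc
//
"""


def origin_sequence(record):
    """Return the bases listed below ORIGIN, without numbering or spaces."""
    bases = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the position numbers and the spaces between blocks of ten.
            bases.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(bases).upper()


def cds_range(record):
    """Return (start, end), 1-based inclusive, for a simple 'CDS start..end' line."""
    match = re.search(r"^[ \t]+CDS[ \t]+(\d+)\.\.(\d+)[ \t]*$", record, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sequence = origin_sequence(DEMO_RECORD)
start, end = cds_range(DEMO_RECORD)
coding = sequence[start - 1:end]   # convert 1-based inclusive to a Python slice
print(len(sequence), start, end)
print(coding, len(coding) // 3 - 1, "amino acids (stop codon excluded)")
```

Real records can have more complex CDS locations (for example joined ranges), which this sketch deliberately does not cover.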
By now, you should be ready to retrieve the same information for any gene/protein.
Try it with human haemoglobin subunit alpha, and see if you can answer some of the
following questions.
-
How long is the subunit?
-
What is its main function?
-
Is its malfunction associated with any known disease?
-
Can you retrieve a high-resolution X-ray structure? (Shortcut)
Part 2: Genome retrieval
Let us first see the genome in action: how viruses inject their RNA or
DNA into our cells and thus force them to make more viruses.
Life cycle of
viruses in video.
GenBank – finding and retrieving genomes of viruses
GenBank is the leading nucleotide sequence repository/database, maintained
by the NCBI (U.S. National Center for Biotechnology Information) in
collaboration with EMBL (European Molecular Biology Laboratory) and DDBJ
(DNA Data Bank of Japan).
-
Go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
-
You can choose the link "Viruses" to reveal the categories of known
viral genomes. Find "Influenza Virus" and then go to
"Sequences from the human A (H7N9) 2013 outbreak" link.
-
A list of all submitted H7N9 genome segments is shown. Clicking on an
individual identifier reveals the sequence of the corresponding gene
of the H7N9 virus.
The RNA genome of influenza A virus has 8 segments, which encode 10 main proteins:
- PB2 (polymerase basic 2)
- PB1 (polymerase basic 1)
- PA (polymerase acidic)
- HA (hemagglutinin)
- NP (nucleoprotein)
- NA (neuraminidase)
- M1 and M2 (matrix)
- NS1 and NS2 (non-structural)
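When browsing the list of submitted sequences, it can help to keep track of which protein sits on which genome segment. The mapping below is a small Python sketch, assuming the conventional segment numbering 1 to 8 for influenza A (the numbering is not given in this handout); the names `SEGMENT_PRODUCTS` and `segment_for` are hypothetical.

```python
# Assumption: conventional influenza A segment numbering (1-8).
SEGMENT_PRODUCTS = {
    1: ["PB2"],
    2: ["PB1"],
    3: ["PA"],
    4: ["HA"],
    5: ["NP"],
    6: ["NA"],
    7: ["M1", "M2"],
    8: ["NS1", "NS2"],
}


def segment_for(product):
    """Return the segment number that encodes `product`, or None if unknown."""
    for segment, products in SEGMENT_PRODUCTS.items():
        if product in products:
            return segment
    return None


n_products = sum(len(products) for products in SEGMENT_PRODUCTS.values())
print(len(SEGMENT_PRODUCTS), "segments,", n_products, "proteins")
print(segment_for("M2"))
```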
How important are viruses in evolution?
Did DNA Come From Viruses?
Ensembl Project – Exploring the Human Genome
Ensembl is a joint project of the European Bioinformatics Institute and
the Sanger Institute, both located near Cambridge, U.K. You can spend weeks
navigating all the options here.