COSC 348: Lab02

Overview: Finding Data

An important part of bioinformatics is processing data, so the first thing we need to find out is "where is the data".

The purpose of this lab is to find information about the single protein called GABRA1, and the DNA sequence that is responsible for GABRA1 generation. In the second part, we'll look into how to find information about the whole genome.

IMPORTANT: Save the last 30 minutes or so to write a one-page report reflecting on today's investigation, and submit it. Answering the questions posed throughout this document will help you write it. Do not hesitate to speculate; this is not an exam. Any feedback you provide about the lab session will also be very valuable.
Remember, today's lab work and this reflection are worth 1% of your final mark.

Part 1: Protein and DNA sequence retrieval

We will retrieve various information about the protein denoted by GABRA1.
First let us find out what the GABRA1 protein is: have a read of this.

Make sure you understand what GABRA1 is -- talk to someone near you, and see if you have the same idea.

ExPASy (SwissProt database) – retrieving the AA sequence of proteins

Now that we know what GABRA1 is, we want to find, and store, the sequence of amino acids (AA) that makes up the protein.

Visit the ExPASy site at: www.expasy.org/sprot
Search the list of UniProtKB/Swiss-Prot entries by typing "GABRA1" in the search text box.
Click on the search result GBRA1_HUMAN, accession number P14867, and explore what kind of information is available about this protein.
- How many AA are in the sequence of this protein?
- Where in the cell would you expect to find this protein?
- Based on the publication titles, can you name 3-5 disorders in which mutations of the GABRA1 gene are involved?
Make sure you are back on the GBRA1_HUMAN page.
Find the section called Sequence and select "FASTA"
This opens a page with the AA sequence presented in FASTA format.
Save this to a text file.
When reading the file, use a fixed-width font such as Courier or Courier New.
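
The download step above could also be scripted. The sketch below is only an illustration: it assumes the UniProt REST service at rest.uniprot.org serves FASTA records by accession number (this endpoint is not mentioned in the handout and may change), and the function names are made up for this example.

```python
from urllib.request import urlopen

# Assumed endpoint: UniProt's REST service, which this handout does not name.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession):
    """Build the (assumed) download URL for one UniProtKB accession."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def save_fasta(accession, path):
    """Download one FASTA record and write it to a plain-text file."""
    with urlopen(fasta_url(accession)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Example call (needs network access), using the accession from this lab:
# save_fasta("P14867", "GBRA1_HUMAN.fasta")
```

Saving through the web page as described in the steps gives the same plain-text file; the script is only convenient when many proteins are needed.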

About the FASTA format:

>Sequence_identifier_and_name | The definition line, followed by the sequence of residues (amino acids or bases), e.g.:
ARCTGKINYD.....

FASTA is the default format for many sequence analysis programs. These programs can be case-sensitive, so be aware:

always use CAPITAL letters for protein and DNA/RNA sequences
always save with the plain TEXT option of your word processing program, i.e. nothing but ASCII
use the Courier New font for easy alignment (all letters have the same width)
the RAW format is simply the sequence part of the FASTA format without the definition line
beware of unwanted characters such as '*', '-', <Tab>, <Space>, etc. They can ruin the analysis.
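
The precautions above can be sketched as a small clean-up function. This is a minimal illustration, not a full FASTA parser: it assumes a single record per file, and the function name and the example record are made up.

```python
def read_fasta(text):
    """Return (definition_line, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' definition line")
    definition = lines[0][1:]
    raw = "".join(lines[1:])
    # Keep letters only (drops '*', '-', tabs and spaces) and force capitals.
    sequence = "".join(ch for ch in raw if ch.isalpha()).upper()
    return definition, sequence


# Made-up record containing the kinds of unwanted characters listed above.
example = ">demo_id | made-up example record\narc tgk\tiny*d-\n"
definition, sequence = read_fasta(example)
print(definition)
print(sequence)
```

The returned sequence is what the RAW format would contain. Note that silently dropping '-' is a choice made for this sketch; in aligned sequences '-' marks a gap and should not be discarded.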

GenBank – Retrieving the coding DNA sequence for a protein

Return to the GBRA1_HUMAN page
On the left, click the 'cross-references' link, to take you halfway down the page.
To retrieve the DNA coding sequence for our protein GABRA1, first choose GenBank from the dropdown menu on the left and then click the link X14766 mRNA, which brings up the GenBank entry for the human mRNA encoding GABRA1.
Scroll down to the entry entitled ORIGIN, and notice that the DNA/mRNA sequence has 1742 bases whereas the protein GABRA1 has only 456 AA.
- How many AA would you expect to find in a protein translated from a 1742-nucleotide-long mRNA?
- Can you explain the difference? What is going on?
- Can you figure out what the 'CDS' entry means?
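
The arithmetic behind these questions can be sketched as follows. This is a rough check only: it assumes three bases per codon and that the coding region ends with one stop codon. The exact coding coordinates should be read from the CDS entry in the record itself.

```python
MRNA_LENGTH = 1742     # bases listed under ORIGIN for X14766
PROTEIN_LENGTH = 456   # residues in GBRA1_HUMAN

# Upper bound if every base were translated, three bases per codon.
max_residues = MRNA_LENGTH // 3

# Bases needed to encode the protein, plus one stop codon
# (assumption: the stop codon is counted as part of the coding region).
coding_bases = (PROTEIN_LENGTH + 1) * 3

# Whatever is left over is not translated into protein.
untranslated_bases = MRNA_LENGTH - coding_bases

print(max_residues, coding_bases, untranslated_bases)
```

Comparing the upper bound with the actual protein length shows that a sizeable part of the mRNA does not code for amino acids.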

This format of sequence is called the GenBank format.

About the GenBank format:

GenBank is also the default format for many sequence analysis programs. A record consists of 4 parts (described in detail at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html ):

The LOCUS field and the header fields that follow it. Together these give the identification/accession number, the type of sequence (e.g. mRNA), the sequence length, the molecule name and definition, the version (different versions of the same sequence are distinguished here), and the source and organism.
The REFERENCE section. Publications by the authors who determined the sequence and reported the data in the record.
The FEATURES section. Information about genes and gene products, as well as regions of biological significance within the sequence. These can include regions of the sequence that code for proteins and RNA molecules, as well as a number of other features. CDS (coding sequence) marks the region of a gene's DNA or RNA that codes for the protein.
ORIGIN. The sequence data begin on the line immediately below ORIGIN.
To view or save the sequence data only, display the record in FASTA format: go up the page to Display Settings, choose FASTA from the drop-down menu and hit Apply.
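
To make the layout of a GenBank record more concrete, here is a minimal sketch that pulls the sequence out of the ORIGIN part and reads a simple CDS location. The record is a made-up toy, not X14766, and the function names are hypothetical. Real CDS locations can be more complex (for example join(...) or complement(...)), and this sketch deliberately ignores those.

```python
import re


def genbank_sequence(record_text):
    """Return the sequence under ORIGIN as one uppercase string."""
    in_origin = False
    chunks = []
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Each line is a position number followed by blocks of bases.
            chunks.extend(part for part in line.split() if part.isalpha())
    return "".join(chunks).upper()


def simple_cds_span(record_text):
    """Return (start, end), 1-based inclusive, for a plain 'start..end' CDS."""
    for line in record_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "CDS":
            match = re.fullmatch(r"(\d+)\.\.(\d+)", fields[1])
            if match:
                return int(match.group(1)), int(match.group(2))
    return None


# Made-up toy record for illustration only.
toy_record = """LOCUS       TOY0001   24 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     CDS             4..15
ORIGIN
        1 gccatggctt gctaagggcc ccaa
//
"""

full_sequence = genbank_sequence(toy_record)
start, end = simple_cds_span(toy_record)
coding_sequence = full_sequence[start - 1:end]
print(len(full_sequence), coding_sequence)
```

In the toy record the coding region is shorter than the full sequence, which is the same situation as with the GABRA1 mRNA above.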

By now, you should be ready to retrieve the same information for any gene/protein. Try it with human haemoglobin subunit alpha, and see if you can answer the following questions.

How long is the subunit?
What is its main function?
Is its malfunction associated with any known disease?
Can you retrieve a high-resolution X-ray structure? (Shortcut)

Part 2: Genome retrieval

Let us first see the genome in action: how viruses inject their RNA or DNA into our cells and thus force them to make more viruses. Watch the video of the life cycle of viruses.

GenBank – finding and retrieving genomes of viruses

GenBank is the leading nucleotide sequence repository/database. It is maintained by the NCBI (U.S. National Center for Biotechnology Information), which exchanges data with EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan).

Go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
Choose the link "Viruses" to reveal the categories of known viral genomes. Find "Influenza Virus" and then follow the "Sequences from the human A (H7N9) 2013 outbreak" link.
A list of all submitted segments of the H7N9 genome is revealed. Clicking on an individual identifier reveals the sequence of the corresponding gene of the H7N9 virus.

The RNA genome of the influenza A virus has 8 segments, which encode 10 genes:

PB2 (polymerase basic 2)
PB1 (polymerase basic 1)
PA (polymerase acidic)
HA (hemagglutinin)
NP (nucleoprotein)
NA (neuraminidase)
M1 and M2 (matrix)
NS1 and NS2 (non-structural)

How important are viruses in evolution? Did DNA Come From Viruses?

Ensembl Project – Exploring the Human Genome

Ensembl is a joint project of the European Bioinformatics Institute and the Sanger Institute, both located near Cambridge, U.K. You can spend weeks navigating all the options here.

Go to http://www.ensembl.org, and click on the human icon under "Popular genomes". (Notice there is a list of other genomes too).
Go to "View karyotype" and explore human chromosomes (we have 2 copies of each chromosome, one from mum and the other one from dad).
The Ensembl database can be mined in more "efficient" ways. If you are interested, browse their tutorials or go straight to the Ensembl public API.

More genomic sites

Many useful links: http://bioweb2.pasteur.fr/intro-en.html
J. Craig Venter Institute: http://www.jcvi.org/
U.S. Department of Energy Joint Genome Institute: http://www.jgi.doe.gov

Useful references:

Claverie J-M, Notredame C. (2007) Bioinformatics for Dummies, 2nd ed. Wiley, Indiana.
Wikipedia: http://en.wikipedia.org/wiki/GABAA_receptor
