Questions are in bold
Please use the Firefox browser. Many exercises do not work in Internet Explorer.
Answers to the questions are HERE.
Exercise 1: Query a public database
Examination of a GenBank entry
- Go to the GenBank website
- Search for “DMD AND Homo sapiens”
- Select the RefSeq database (menu on the left)
- Select transcript variant Dp427m, (accession code NM_004006)
- Explore the information, e.g. gene, coding sequence (CDS), PubMed
- Find the variation at position 3713. What effect does this variant have?
Retrieve a sequence from the database and store it on your own computer
The “fasta” format is a format for sequences, which can be read by many programs.
- Search in the Nucleotide database for ‘human mRNA HBA2’ (NM_000517). Tip: you can search by accession code
- Set Display (top left of the page) to Fasta
- Send to File, save as NM_000517.txt (This is only to show how you can save information from GenBank, we do not need this sequence anymore)
- What are the characteristics of the FASTA format?
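To make the structure of a saved file such as NM_000517.txt more concrete, here is a minimal sketch of how a program might read FASTA text. The function name `parse_fasta` and the two example records are hypothetical and only serve as illustration; real parsers (for example in Biopython) handle more edge cases.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical example data, not a real database entry.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2 another hypothetical record
TTGACA
"""
records = list(parse_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```

Each record is returned as one description and one unwrapped sequence string, regardless of how the sequence was split over lines in the file.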
Exercise 2: Data conversion/translation
One inconvenience of having a number of different DNA sequence analysis packages available is that they use different formats for storing more-or-less the same information. Further, most packages refuse to accept even the simplest files from one another. Until a few years ago sequences were stored in a dozen different formats.
Old school formats
(You can skip this exercise if you are only interested in NGS data)
Seqret is a program for sequence conversion. It was originally written around 1989 as a component of a sequence analysis program and then took on a life of its own as a conversion program for bioinformatics. Seqret is particularly useful because it automatically detects many sequence formats and converts among them. Most sequence file formats allow more than one sequence per file, but some do not. Sequence formats that allow one or more sequences:
- GenBank/GB, genbank flatfile format
- EMBL, EMBL flatfile format
- DNAStrider, format of a common Mac program
- Pearson/Fasta, a common format used by Fasta programs and others
- Phylip3.2, sequential format for Phylip programs
Sequence formats that only allow one sequence:
- GCG, single sequence format of GCG software
- Plain/Raw, sequence data only (no name, document, numbering)
- Have a look at Sequence formats
- Now open the Seqret website
- The file NM_004014.txt (Right-click > open in new window) contains a sequence in GCG format (Dystrophin transcript variant Dp116).
- Copy and paste the sequence, choose the appropriate input (DNA), select “Unknown format” as input format and select “Fasta format” as the output format
- Change the output format and look at the different file formats
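As a rough illustration of what a format conversion involves, the sketch below turns a Plain/Raw sequence (sequence data only) into FASTA text. It is not how Seqret itself works; the function name `raw_to_fasta`, the record name and the line width of 60 are assumptions chosen for the example.

```python
def raw_to_fasta(name: str, raw_sequence: str, width: int = 60) -> str:
    """Wrap a plain sequence into FASTA text with lines of at most `width` letters."""
    # Remove spaces and line breaks, and normalise to upper case.
    cleaned = "".join(raw_sequence.split()).upper()
    # Cut the sequence into fixed-width lines.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([">" + name] + lines) + "\n"


# Hypothetical raw input; a narrow width is used to make the wrapping visible.
fasta_text = raw_to_fasta("demo", "acgt acgt\nacgt", width=8)
print(fasta_text)
```

Going in this direction only requires adding a name; converting from a richer format such as GenBank to FASTA loses the annotation, which is one reason conversions are not always reversible.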
Convert between recent sequence formats
After quite some years you would expect the formats to have been standardized, but that is not yet completely the case. Nowadays the most frequently used formats are FASTA and FASTQ, although high-throughput sequencers still have their own formats. The first step in almost any analysis pipeline is to convert the sequences to a format that your analysis program can use. In this exercise you will retrieve sequences from a public high-throughput sequencing experiment in FASTQ format and convert them to the FASTA format, which can be used in a BLAST search. The Galaxy web portal will be used for retrieval and conversion. Many programs that are usually only available for Linux systems have been integrated in this website so that they can be used from a web browser.
Download sequences of a public dataset from the NCBI Sequence Read Archive (SRA) via Galaxy
- Go to Galaxy
- Select from the menu on the left > Get data > EBI SRA and search for “DRR000019”. More information about the experiment can be found HERE
- You can transfer the fastq file immediately to the Galaxy server from this page. Do that by selecting the “File 1” link from the appropriate column
- When the file has been transferred it will appear in green in the History panel on the right. If it is still yellow, your action is scheduled. The server can be in use by many people at the same time, and in that case your job will be placed in the queue.
Convert FASTQ to FASTA
- The FASTQ file needs to be converted before you can use it in Galaxy. This is done with the tool “Fastq Groomer” in the “NGS: QC and manipulation” menu. Note that you can also search for a tool in the search box above the menu. Keep the default parameters and convert the DRR000019 dataset.
- After this you can select the tool “FASTQ to FASTA” from the “Convert formats” menu
- The FASTQ file should appear automatically in the dropdown menu in the center panel. If it doesn’t, select the second “FASTQ to FASTA” converter. Click on “Execute” to start the data conversion
- Examine the converted file to check if it really looks like a fasta file. How many sequences does the FASTA file contain?
- Tip: if the right converter for your sequence data is not present in Galaxy you can try to use this BioPython sequence converter
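The conversion step above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Galaxy tool: the function name `fastq_to_fasta` and the two reads are hypothetical, and the sketch assumes the common layout in which every FASTQ record occupies exactly four lines (name, sequence, separator, quality).

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert four-line FASTQ records to two-line FASTA records."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        name, sequence, separator, _quality = lines[i:i + 4]
        if not name.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        out.append(">" + name[1:])  # '@name' becomes '>name'
        out.append(sequence)        # quality values are dropped
    return "\n".join(out) + "\n" if out else ""


# Two hypothetical reads, not taken from DRR000019.
example_fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n"
example_fasta = fastq_to_fasta(example_fastq)
print(example_fasta)
```

Because the quality line is discarded, the conversion only goes one way: a FASTA file cannot be turned back into the original FASTQ file.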
Identify the first 5 sequences with BLAST
Just for demonstration purposes, you will use BLAST to identify the first 5 sequences.
- Use the tool “Select first lines from a dataset” from the “Text manipulation” menu
- Select the first 10 lines (one sequence entry consists of a description line and a line with the sequence) of the last dataset
- Now go to BLAST
- Select “nucleotide blast” and copy/paste the 5 sequences into the search box. Select the “Nucleotide collection (nr/nt)” database for the search.
- Which organism(s) are in the dataset?
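Selecting the first 10 lines gives 5 sequences because each entry in the converted file takes two lines. A minimal sketch of the same selection outside Galaxy is shown below; the helper name `first_fasta_records` and the example lines are hypothetical, and the sketch assumes unwrapped (two-line) FASTA records.

```python
from itertools import islice
from typing import Iterable, List


def first_fasta_records(lines: Iterable[str], n_records: int) -> List[str]:
    """Return the lines of the first n two-line FASTA records."""
    # Two lines per record: one description line and one sequence line.
    return [line.rstrip("\n") for line in islice(lines, 2 * n_records)]


# Hypothetical example lines; an open file object can be passed in the same way.
example_lines = [">r1", "ACGT", ">r2", "GGCC", ">r3", "TTAA"]
subset = first_fasta_records(example_lines, 2)
print("\n".join(subset))
```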
Exercise 3: Compare two sequences to identify mutations
- Read more about the Sickle cell anemia gene on this website
- Which mutation causes Sickle cell disease?
The bl2seq program (BLAST) can be used to align two sequences against each other instead of against the entire database. The sequences can be pasted into the boxes, or the accession codes (unique identifiers for sequences) can be entered as input.
- Go to the BLAST website. Determine the position of the HBS mutation with “Nucleotide BLAST”. Select the “Align two or more sequences” checkbox on the BLAST form
- Type “NM_000518” in the box for sequence1, this is the mRNA sequence of normal HBB.
- Type “M25113” (mRNA of HBS) in the second box.
- Click on “Blast” and examine the output.
- At the top of the page you can alter the ‘formatting options’. Tick the “CDS feature” box and click on “Reformat” to see the protein translations of the sequences.
The sequences are retrieved from the nucleotide database (GenBank) and are placed above each other. This is called an alignment. Matching nucleotides are indicated by a pipe-sign: |. Mismatching nucleotides will not have a pipe-sign. Insertions/deletions are annotated as a dash (-) in one of the sequences. Below the alignment the protein sequence is placed when the annotation of one of the sequences is known. In this case the identifiers were given as input, so the bl2seq program could retrieve this information from GenBank.
- Identify the mutation which causes the change in amino acid.
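The way the alignment is displayed (a pipe for a match, nothing for a mismatch, a dash for an insertion/deletion) can be reproduced with a short sketch. The function names and the two aligned sequences below are hypothetical toy data, not the HBB and HBS sequences; the sketch assumes both sequences have already been aligned to the same length.

```python
from typing import List, Tuple


def match_line(aligned_a: str, aligned_b: str) -> str:
    """Build the line of pipe signs shown between two aligned sequences."""
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(aligned_a, aligned_b)
    )


def find_differences(aligned_a: str, aligned_b: str) -> List[Tuple[int, str, str, str]]:
    """List (column, letter_a, letter_b, kind) for every differing alignment column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for column, (a, b) in enumerate(zip(aligned_a, aligned_b), start=1):
        if a != b:
            kind = "indel" if "-" in (a, b) else "mismatch"
            differences.append((column, a, b, kind))
    return differences


# Hypothetical aligned toy sequences.
seq_a = "ACGTTGCA"
seq_b = "ACGATG-A"
print(seq_a)
print(match_line(seq_a, seq_b))
print(seq_b)
print(find_differences(seq_a, seq_b))
```

The column numbers refer to positions in the alignment; when gaps are present they differ from positions in the individual sequences.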
Exercise 4: Pick primers to screen patients for the HBS mutation
Pick primers for the first exon of HBB / HBS
- Go to the UCSC genome browser and search for HBB in the “Genome browser”. Select HBB from the dropdown menu.
- Click on the HBB name at the left side of the genome browser to get more information about this gene (the HBB gene is highlighted in the image).
- Retrieve the DNA sequence by selecting “Genomic Sequence (chr11:5,246,696-5,248,301)” from the “Sequence and Links to Tools and Databases” box.
- Uncheck the “Introns” box and select “One fasta record per region”, including 100 bp up- and downstream. Set “Mask repeats” to N. This option does the same as the RepeatMasker program.
- Copy the sequence of the first exon into Primer3. Does the resulting primer set include the region with the mutation?
- Check if the primer set is unique. Go to the UCSC website and start the “In-silico PCR” (Tools > In-silico PCR). Copy/paste the forward and reverse primer from the previous step. Follow the link to the graphical result and check whether the primer set covers the correct region.
- Inspect the OMIM Alleles track. If you don’t see the OMIM Alleles track, activate it in the menu below the graph (Phenotype and literature). What is the rs-id of the variant corresponding to the HBS phenotype? Click on the rs-id to go to the dbSNP database and get more information about the variant.
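The idea behind an in-silico PCR check can be sketched as follows: the forward primer must occur in the template, the reverse complement of the reverse primer must occur downstream of it, and a unique primer set gives exactly one product. This is a strongly simplified illustration, not the UCSC implementation: it only looks for exact matches on one strand of one template, and the function names, template and primers are hypothetical.

```python
from typing import List, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_all(text: str, pattern: str) -> List[int]:
    """Return all 0-based start positions of pattern in text."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def in_silico_pcr(template: str, forward: str, reverse: str) -> List[Tuple[int, int]]:
    """Return (start, end) of each predicted product, 1-based and inclusive."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer binds the opposite strand, so on the template
    # strand we look for its reverse complement.
    reverse_site = reverse_complement(reverse)
    products = []
    for start in find_all(template, forward):
        for site in find_all(template, reverse_site):
            if site >= start + len(forward):
                products.append((start + 1, site + len(reverse_site)))
    return products


# Hypothetical template and primers (far shorter than real primers).
template = "TTTACGTACCCGGGAAATTTGGCATT"
products = in_silico_pcr(template, "ACGTAC", "TGCCAA")
print(products, "unique" if len(products) == 1 else "not unique")
```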
Exercise 5: Gene finding
The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user’s sequence or in a sequence already in the database.
- Open ORF finder
- Enter the accession number “NM_004006” (Dystrophin transcript variant Dp427m) into the appropriate box.
The full cDNA sequence is translated in all 6 reading frames. Remember that DNA is double stranded, so there are 3 reading frames for the top strand and 3 more for the bottom strand. All the ORFs are indicated in the figure at the top. What you are generally looking for is the largest ORF. The software has already done this for you by ranking the ORFs from largest to smallest. It also tells you which frame each ORF came from; for example, +3 means the third frame (3) on the top strand (+). You can retrieve the automatically translated protein sequence by selecting the longest ORF.
- Copy/paste the NM_004006_point.txt sequence in ORF finder and compare the output with the previous one
- What is wrong with this transcript?
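The six-frame search described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the algorithm used by ORF Finder: an ORF is taken to run from an ATG to the first in-frame stop codon, coordinates on the bottom strand are given relative to the reverse-complemented sequence, and the function names and test sequence are hypothetical.

```python
from typing import Iterator, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def orfs_in_frame(seq: str, offset: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based (start, end) of ATG..stop ORFs in one reading frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3  # end is exclusive and includes the stop codon
            start = None


def find_orfs(seq: str, min_length: int = 0) -> List[Tuple[str, int, int, int]]:
    """Return (frame, start, end, length) for all ORFs, longest first."""
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset):
                if end - start >= min_length:
                    # Report 1-based, inclusive coordinates on that strand.
                    results.append((f"{strand}{offset + 1}", start + 1, end, end - start))
    results.sort(key=lambda orf: orf[3], reverse=True)
    return results


# Hypothetical toy sequence with a single ORF in the third frame of the top strand.
orfs = find_orfs("GGATGAAACCCTAGTT")
print(orfs)
```

Running the same function on two versions of a transcript and comparing the longest ORF of each is one way to see the effect of a sequence change on the predicted protein.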
Links to databases and software
General Overview of databases and online software