Calculating and Visualizing Sequence Statistics

Open Script

This example shows how to use basic sequence manipulation techniques and computes some useful sequence statistics. It also illustrates how to look for coding regions (such as proteins) and pursue further analysis of them.

The Human Mitochondrial Genome

In this example you will explore the DNA sequence of the human mitochondria. Mitochondria are structures, called organelles, that are found in the cytoplasm of the cell in hundreds to thousands for each cell. Mitochondria are generally the major energy production center in eukaryotes, they help to degrade fats and sugars.

The consensus sequence of the human mitochondria genome has accession number NC_012920. You can getgenbank function to get the latest annotated sequence from GenBank® into the MATLAB® workspace.

mitochondria_gbk = getgenbank('NC_012920');

For your convenience, previously downloaded sequence is included in a MAT-file. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets.

load mitochondria

Copy just the DNA sequence to a new variable mitochondria. You can access parts of the DNA sequence by using regular MATLAB indexing commands.

mitochondria = mitochondria_gbk.Sequence;
mitochondria_length = length(mitochondria)
first_300_bases = seqdisp(mitochondria(1:300))

mitochondria_length =

       16569


first_300_bases =

  5×70 char array

    '  1  GATCACAGGT CTATCACCCT ATTAACCACT CACGGGAGCT CTCCATGCAT TTGGTATTTT'
    ' 61  CGTCTGGGGG GTATGCACGC GATAGCATTG CGAGACGCTG GAGCCGGAGC ACCCTATGTC'
    '121  GCAGTATCTG TCTTTGATTC CTGCCTCATC CTATTATTTA TCGCACCTAC GTTCAATATT'
    '181  ACAGGCGAAC ATACTTACTA AAGTGTGTTA ATTAATTAAT GCTTGTAGGA CATAATAATA'
    '241  ACAATTGAAT GTCTGCACAG CCACTTTCCA CACAGACATC ATAACAAAAA ATTTCCACCA'

You can look at the composition of the nucleotides with the ntdensity function.

figure
ntdensity(mitochondria)

This shows that the mitochondria genome is A-T rich. The GC-content is sometimes used to classify organisms in taxonomy, it may vary between different species from ~30% up to ~70%. Measuring GC content is also useful for identifying genes and for estimating the annealing temperature of DNA sequence.

Calculating Sequence Statistics

Now, you will use some of the sequence statistics functions in the Bioinformatics Toolbox™ to look at various properties of the human mitochondrial genome. You can count the number of bases of the whole sequence using the basecount function.

bases = basecount(mitochondria)

bases = 

  struct with fields:

    A: 5124
    C: 5181
    G: 2169
    T: 4094

These are on the 5'-3' strand. You can look at the reverse complement case using the seqrcomplement function.

compBases = basecount(seqrcomplement(mitochondria))

compBases = 

  struct with fields:

    A: 4094
    C: 2169
    G: 5181
    T: 5124

As expected, the base counts on the reverse complement strand are complementary to the counts on the 5'-3' strand.

You can use the chart option to basecount to display a pie chart of the distribution of the bases.

figure
basecount(mitochondria,'chart','pie');
title('Distribution of Nucleotide Bases for Human Mitochondrial Genome');

Now look at the dimers in the sequence and display the information in a bar chart using dimercount.

figure
dimers = dimercount(mitochondria,'chart','bar')
title('Mitochondrial Genome Dimer Histogram');

dimers = 

  struct with fields:

    AA: 1604
    AC: 1495
    AG: 795
    AT: 1230
    CA: 1534
    CC: 1771
    CG: 435
    CT: 1440
    GA: 613
    GC: 711
    GG: 425
    GT: 419
    TA: 1373
    TC: 1204
    TG: 513
    TT: 1004

Exploring the Open Reading Frames (ORFs)

In a nucleotide sequence an obvious thing to look for is if there are any open reading frames. An ORF is any sequence of DNA or RNA that can be potentially translated into a protein. The function seqshoworfs can be used to visualize ORFs in a sequence.

Note: In the HTML tutorial only the first page of the output is shown, however when running the example you will be able to inspect the complete mitochondrial genome using the scrollbar on the figure.

seqshoworfs(mitochondria);

If you compare this output to the genes shown on the NCBI page there seem to be slightly fewer ORFs, and hence fewer genes, than expected.

Vertebrate mitochondria do not use the Standard genetic code so some codons have different meaning in mitochondrial genomes. For more information about using different genetic codes in MATLAB see the help for the function geneticcode. The GeneticCode option to the seqshoworfs function allows you to look at the ORFs again but this time with the vertebrate mitochondrial genetic code.

In the human mitochondrial DNA sequence some genes are also started by alternative start codons [1]. Use the AlternativeStartCodons option to the seqshoworfs function to search also for these ORFs.

Notice that there are now two much larger ORFs on the third reading frame: One starting at position 4470 and the other starting at 5904. These correspond to the ND2 (NADH dehydrogenase subunit 2) and COX1 (cytochrome c oxidase subunit I) genes.

orfs = seqshoworfs(mitochondria,'GeneticCode','Vertebrate Mitochondrial',...
        'AlternativeStartCodons',true)

orfs = 

  1×3 struct array with fields:

    Start
    Stop

Inspecting Annotated Features

You can also look at all the features that have been annotated to the human mitochondrial genome. Explore the complete GenBank entry mitochondria_gbk with the featureparse function. Particularly, you can explore the annotated coding sequences (CDS) and compare them with the ORFs previously found. Use the Sequence option to the featureparse function to extract, when possible, the DNA sequences respective to each feature. The featureparse function will complement the pieces of the source sequence when appropriate.

features = featureparse(mitochondria_gbk,'Sequence',true)
coding_sequences = features.CDS;
coding_sequences_id = sprintf('%s ',coding_sequences.gene)

features = 

  struct with fields:

          source: [1×1 struct]
          D_loop: [1×1 struct]
            gene: [1×37 struct]
            tRNA: [1×22 struct]
            rRNA: [1×2 struct]
             STS: [1×28 struct]
    misc_feature: [1×1 struct]
             CDS: [1×13 struct]


coding_sequences_id =

    'ND1 ND2 COX1 COX2 ATP8 ATP6 COX3 ND3 ND4L ND4 ND5 ND6 CYTB '

ND2CDS = coding_sequences(2) % ND2 is in the 2nd position
COX1CDS = coding_sequences(3) % COX1 is in the 3rd position

ND2CDS = 

  struct with fields:

         Location: '4470..5511'
          Indices: [4470 5511]
             gene: 'ND2'
     gene_synonym: 'MTND2'
             note: 'TAA stop codon is completed by the addition of 3' A residues to the mRNA'
      codon_start: '1'
    transl_except: '(pos:5511,aa:TERM)'
     transl_table: '2'
          product: 'NADH dehydrogenase subunit 2'
       protein_id: 'YP_003024027.1'
          db_xref: {'GI:251831108'  'GeneID:4536'  'HGNC:7456'  'MIM:516001'}
      translation: 'MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNPRSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMMAMAMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLNVSLLLTLSILSIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNMTILNLTIYIILTTTAFLLLNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLSLGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYSTSITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL'
         Sequence: 'attaatcccctggcccaacccgtcatctactctaccatctttgcaggcacactcatcacagcgctaagctcgcactgattttttacctgagtaggcctagaaataaacatgctagcttttattccagttctaaccaaaaaaataaaccctcgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatccataatccttctaatagctatcctcttcaacaatatactctccggacaatgaaccataaccaatactaccaatcaatactcatcattaataatcataatagctatagcaataaaactaggaatagccccctttcacttctgagtcccagaggttacccaaggcacccctctgacatccggcctgcttcttctcacatgacaaaaactagcccccatctcaatcatataccaaatctctccctcactaaacgtaagccttctcctcactctctcaatcttatccatcatagcaggcagttgaggtggattaaaccaaacccagctacgcaaaatcttagcatactcctcaattacccacataggatgaataatagcagttctaccgtacaaccctaacataaccattcttaatttaactatttatattatcctaactactaccgcattcctactactcaacttaaactccagcaccacgaccctactactatctcgcacctgaaacaagctaacatgactaacacccttaattccatccaccctcctctccctaggaggcctgcccccgctaaccggctttttgcccaaatgggccattatcgaagaattcacaaaaaacaatagcctcatcatccccaccatcatagccaccatcaccctccttaacctctacttctacctacgcctaatctactccacctcaatcacactactccccatatctaacaacgtaaaaataaaatgacagtttgaacatacaaaacccaccccattcctccccacactcatcgcccttaccacgctactcctacctatctccccttttatactaataatcttat'


COX1CDS = 

  struct with fields:

         Location: '5904..7445'
          Indices: [5904 7445]
             gene: 'COX1'
     gene_synonym: 'COI; MTCO1'
             note: 'cytochrome c oxidase I'
      codon_start: '1'
    transl_except: []
     transl_table: '2'
          product: 'cytochrome c oxidase subunit I'
       protein_id: 'YP_003024028.1'
          db_xref: {'GI:251831109'  'GeneID:4512'  'HGNC:7419'  'MIM:516030'}
      translation: 'MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIGVNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSMNLEWLYGCPPPYHTFEEPVYMKS'
         Sequence: 'atgttcgccgaccgttgactattctctacaaaccacaaagacattggaacactatacctattattcggcgcatgagctggagtcctaggcacagctctaagcctccttattcgagccgagctgggccagccaggcaaccttctaggtaacgaccacatctacaacgttatcgtcacagcccatgcatttgtaataatcttcttcatagtaatacccatcataatcggaggctttggcaactgactagttcccctaataatcggtgcccccgatatggcgtttccccgcataaacaacataagcttctgactcttacctccctctctcctactcctgctcgcatctgctatagtggaggccggagcaggaacaggttgaacagtctaccctcccttagcagggaactactcccaccctggagcctccgtagacctaaccatcttctccttacacctagcaggtgtctcctctatcttaggggccatcaatttcatcacaacaattatcaatataaaaccccctgccataacccaataccaaacgcccctcttcgtctgatccgtcctaatcacagcagtcctacttctcctatctctcccagtcctagctgctggcatcactatactactaacagaccgcaacctcaacaccaccttcttcgaccccgccggaggaggagaccccattctataccaacacctattctgatttttcggtcaccctgaagtttatattcttatcctaccaggcttcggaataatctcccatattgtaacttactactccggaaaaaaagaaccatttggatacataggtatggtctgagctatgatatcaattggcttcctagggtttatcgtgtgagcacaccatatatttacagtaggaatagacgtagacacacgagcatatttcacctccgctaccataatcatcgctatccccaccggcgtcaaagtatttagctgactcgccacactccacggaagcaatatgaaatgatctgctgcagtgctctgagccctaggattcatctttcttttcaccgtaggtggcctgactggcattgtattagcaaactcatcactagacatcgtactacacgacacgtactacgttgtagcccacttccactatgtcctatcaataggagctgtatttgccatcataggaggcttcattcactgatttcccctattctcaggctacaccctagaccaaacctacgccaaaatccatttcactatcatattcatcggcgtaaatctaactttcttcccacaacactttctcggcctatccggaatgccccgacgttactcggactaccccgatgcatacaccacatgaaacatcctatcatctgtaggctcattcatttctctaacagcagtaatattaataattttcatgatttgagaagccttcgcttcgaagcgaaaagtcctaatagtagaagaaccctccataaacctggagtgactatatggatgccccccaccctaccacacattcgaagaacccgtatacataaaatctaga'

Create a map indicating all the features found in this GenBank entry using the featureview function.

[h,l] = featureview(mitochondria_gbk,{'CDS','tRNA','rRNA','D_loop'},...
                                      [2 1 2 2 2],'Fontsize',9);
legend(h,l,'interpreter','none');
title('Homo sapiens mitochondrion, complete genome')

Extracting and Analyzing the ND2 and COX1 Proteins

You can translate the DNA sequences that code for the ND2 and COX1 proteins by using the nt2aa function. Again the GeneticCode option must be used to specify the vertebrate mitochondrial genetic code.

ND2 = nt2aa(ND2CDS,'GeneticCode','Vertebrate Mitochondrial');
disp(seqdisp(ND2))

  1  MNPLAQPVIY STIFAGTLIT ALSSHWFFTW VGLEMNMLAF IPVLTKKMNP RSTEAAIKYF
 61  LTQATASMIL LMAILFNNML SGQWTMTNTT NQYSSLMIMM AMAMKLGMAP FHFWVPEVTQ
121  GTPLTSGLLL LTWQKLAPIS IMYQISPSLN VSLLLTLSIL SIMAGSWGGL NQTQLRKILA
181  YSSITHMGWM MAVLPYNPNM TILNLTIYII LTTTAFLLLN LNSSTTTLLL SRTWNKLTWL
241  TPLIPSTLLS LGGLPPLTGF LPKWAIIEEF TKNNSLIIPT IMATITLLNL YFYLRLIYST
301  SITLLPMSNN VKMKWQFEHT KPTPFLPTLI ALTTLLLPIS PFMLMIL

COX1 = nt2aa(COX1CDS,'GeneticCode','Vertebrate Mitochondrial');
disp(seqdisp(COX1))

  1  MFADRWLFST NHKDIGTLYL LFGAWAGVLG TALSLLIRAE LGQPGNLLGN DHIYNVIVTA
 61  HAFVMIFFMV MPIMIGGFGN WLVPLMIGAP DMAFPRMNNM SFWLLPPSLL LLLASAMVEA
121  GAGTGWTVYP PLAGNYSHPG ASVDLTIFSL HLAGVSSILG AINFITTIIN MKPPAMTQYQ
181  TPLFVWSVLI TAVLLLLSLP VLAAGITMLL TDRNLNTTFF DPAGGGDPIL YQHLFWFFGH
241  PEVYILILPG FGMISHIVTY YSGKKEPFGY MGMVWAMMSI GFLGFIVWAH HMFTVGMDVD
301  TRAYFTSATM IIAIPTGVKV FSWLATLHGS NMKWSAAVLW ALGFIFLFTV GGLTGIVLAN
361  SSLDIVLHDT YYVVAHFHYV LSMGAVFAIM GGFIHWFPLF SGYTLDQTYA KIHFTIMFIG
421  VNLTFFPQHF LGLSGMPRRY SDYPDAYTTW NILSSVGSFI SLTAVMLMIF MIWEAFASKR
481  KVLMVEEPSM NLEWLYGCPP PYHTFEEPVY MKS*

You can get a more complete picture of the amino acid content with aacount.

figure
subplot(2,1,1)
ND2aaCount = aacount(ND2,'chart','bar');
title('Histogram of Amino Acid Count for the ND2 Protein');

subplot(2,1,2)
COX1aaCount = aacount(COX1,'chart','bar');
title('Histogram of Amino Acid Count for the COX1 Protein');

Notice the high leucine, threonine and isoleucine content and also the lack of cysteine or aspartic acid.

You can use the atomiccomp and molweight functions to calculate more properties about the ND2 protein.

ND2AtomicComp = atomiccomp(ND2)
ND2MolWeight = molweight(ND2)

ND2AtomicComp = 

  struct with fields:

    C: 1818
    H: 2882
    N: 420
    O: 471
    S: 25


ND2MolWeight =

   3.8960e+04

For further investigation of the properties of the ND2 protein, use proteinplot. This is a graphical user interface (GUI) that allows you to easily create plots of various properties, such as hydrophobicity, of a protein sequence. Click on the "Edit" menu to create new properties, to modify existing property values, or, to adjust the smoothing parameters. Click on the "Help" menu in the GUI for more information on how to use the tool.

proteinplot(ND2)

You can also programmatically create plots of various properties of the sequence using proteinpropplot.

figure
proteinpropplot(ND2,'PropertyTitle','Parallel beta strand')

Calculating the Codon Frequency using all the Genes in the Human Mitochondrial Genome

The codoncount function counts the number of occurrences of each codon in the sequence and displays a formatted table of the result.

codoncount(ND2CDS)

AAA - 10     AAC - 14     AAG -  2     AAT -  6     
ACA - 11     ACC - 24     ACG -  3     ACT -  5     
AGA -  0     AGC -  4     AGG -  0     AGT -  1     
ATA - 23     ATC - 24     ATG -  1     ATT -  8     
CAA -  8     CAC -  3     CAG -  2     CAT -  1     
CCA -  4     CCC - 12     CCG -  2     CCT -  5     
CGA -  0     CGC -  3     CGG -  0     CGT -  1     
CTA - 26     CTC - 18     CTG -  4     CTT -  7     
GAA -  5     GAC -  0     GAG -  1     GAT -  0     
GCA -  8     GCC -  7     GCG -  1     GCT -  4     
GGA -  5     GGC -  7     GGG -  0     GGT -  1     
GTA -  3     GTC -  2     GTG -  0     GTT -  3     
TAA -  0     TAC -  8     TAG -  0     TAT -  2     
TCA -  7     TCC - 11     TCG -  1     TCT -  4     
TGA - 10     TGC -  0     TGG -  1     TGT -  0     
TTA -  8     TTC -  7     TTG -  1     TTT -  8

Notice that in the ND2 gene there are more CTA, ATC and ACC codons than others. You can check what amino acids these codons get translated into using the nt2aa and aminolookup functions.

CTA_aa = aminolookup('code',nt2aa('CTA'))
ATC_aa = aminolookup('code',nt2aa('ATC'))
ACC_aa = aminolookup('code',nt2aa('ACC'))

CTA_aa =

    'Leu	Leucine
     '


ATC_aa =

    'Ile	Isoleucine
     '


ACC_aa =

    'Thr	Threonine
     '

To calculate the codon frequency for all the genes you can concatenate them into a single sequence before using the function codoncount. You need to ensure that the codons are complete (three nucleotides each) so the read frame of the sequence is not lost at the concatenation.

numCDS = numel(coding_sequences);
CDS = cell(numCDS,1);
for i = 1:numCDS
     seq = coding_sequences(i).Sequence;
     CDS{i} = seq(1:3*floor(length(seq)/3));
end
allCDS = [CDS{:}];
codoncount(allCDS)

AAA -  85     AAC - 132     AAG -  10     AAT -  32     
ACA - 134     ACC - 155     ACG -  10     ACT -  52     
AGA -   1     AGC -  39     AGG -   1     AGT -  14     
ATA - 167     ATC - 196     ATG -  40     ATT - 124     
CAA -  82     CAC -  79     CAG -   8     CAT -  18     
CCA -  52     CCC - 119     CCG -   7     CCT -  41     
CGA -  28     CGC -  26     CGG -   2     CGT -   7     
CTA - 276     CTC - 167     CTG -  45     CTT -  65     
GAA -  64     GAC -  51     GAG -  24     GAT -  15     
GCA -  80     GCC - 124     GCG -   8     GCT -  43     
GGA -  67     GGC -  87     GGG -  34     GGT -  24     
GTA -  70     GTC -  48     GTG -  18     GTT -  31     
TAA -   3     TAC -  89     TAG -   2     TAT -  46     
TCA -  83     TCC -  99     TCG -   7     TCT -  32     
TGA -  93     TGC -  17     TGG -  11     TGT -   5     
TTA -  73     TTC - 139     TTG -  18     TTT -  77

Use the figure option to the codoncount function to show a heat map with the codon frequency. Use the geneticcode option to overlay a grid on the figure that groups the synonymous codons according with the Vertebrate Mitochondrial genetic code. Observe the particular bias of Leucine (codons 'CTN').

figure
count = codoncount(allCDS,'figure',true,'geneticcode','Vertebrate Mitochondrial');
title('Human Mitochondrial Genome Codon Frequency')

close all

References

[1] Barrell, B.G., Bankier, A.T. and Drouin, J., "A different genetic code in human mitochondria", Nature, 282(5735):189-94, 1979.