Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data

Open Script

This example shows how to perform a genome-wide analysis of a transcription factor in the Arabidopsis Thaliana (Thale Cress) model organism.

For enhanced performance, it is recommended that you run this example on a 64-bit platform, because the memory footprint is close to 2 Gb. On a 32-bit platform, if you receive "Out of memory" errors when running this example, try increasing the virtual memory (or swap space) of your operating system or try setting the 3GB switch (32-bit Windows® XP only). For details, see Resolve “Out of Memory” Errors.

Introduction

ChIP-Seq is a technology that is used to identify transcription factors that interact with specific DNA sites. First chromatin immunoprecipitation enriches DNA-protein complexes using an antibody that binds to a particular protein of interest. Then, all the resulting fragments are processed using high-throughput sequencing. Sequencing fragments are mapped back to the reference genome. By inspecting over-represented regions it is possible to mark the genomic location of DNA-protein interactions.

In this example, short reads are produced by the paired-end Illumina® platform. Each fragment is reconstructed from two short reads successfully mapped, with this the exact length of the fragment can be computed. Using paired-end information from sequence reads maximizes the accuracy of predicting DNA-protein binding sites.

Data Set

This example explores the paired-end ChIP-Seq data generated by Wang et.al. [1] using the Illumina® platform. The data set has been courteously submitted to the Gene Expression Omnibus repository with accession number GSM424618. The unmapped paired-end reads can be obtained from the NCBI FTP site.

This example assumes that you:

(1) downloaded the data containing the unmapped short read and converted it to FASTQ formatted files using the NCBI SRA Toolkit.

(2) produced a SAM formatted file by mapping the short reads to the Thale Cress reference genome, using a mapper such as BWA [2], Bowtie, or SSAHA2 (which is the mapper used by authors of [1]), and,

(3) ordered the SAM formatted file by reference name first, then by genomic position.

For the published version of this example, 8,655,859 paired-end short reads are mapped using the BWA mapper [2]. BWA produced a SAM formatted file (aratha.sam) with 17,311,718 records (8,655,859 x 2). Repetitive hits were randomly chosen, and only one hit is reported, but with lower mapping quality. The SAM file was ordered and converted to a BAM formatted file using SAMtools [3] before being loaded into MATLAB.

The last part of the example also assumes that you downloaded the reference genome for the Thale Cress model organism (which includes five chromosomes). Uncomment the following lines of code to download the reference from the NCBI repository:

% getgenbank('NC_003070','FileFormat','fasta','tofile','ach1.fasta');
% getgenbank('NC_003071','FileFormat','fasta','tofile','ach2.fasta');
% getgenbank('NC_003074','FileFormat','fasta','tofile','ach3.fasta');
% getgenbank('NC_003075','FileFormat','fasta','tofile','ach4.fasta');
% getgenbank('NC_003076','FileFormat','fasta','tofile','ach5.fasta');

Creating a MATLAB® Interface to a BAM Formatted File

To create local alignments and look at the coverage we need to construct a BioMap. BioMap has an interface that provides direct access to the mapped short reads stored in the BAM formatted file, thus minimizing the amount of data that is actually loaded to the workspace. Create a BioMap to access all the short reads mapped in the BAM formatted file.

bm = BioMap('aratha.bam')

bm = 

  BioMap with properties:

    SequenceDictionary: {5x1 cell}
             Reference: [14637324x1 File indexed property]
             Signature: [14637324x1 File indexed property]
                 Start: [14637324x1 File indexed property]
        MappingQuality: [14637324x1 File indexed property]
                  Flag: [14637324x1 File indexed property]
          MatePosition: [14637324x1 File indexed property]
               Quality: [14637324x1 File indexed property]
              Sequence: [14637324x1 File indexed property]
                Header: [14637324x1 File indexed property]
                 NSeqs: 14637324
                  Name: ''

Use the getSummary method to obtain a list of the existing references and the actual number of short read mapped to each one.

getSummary(bm)

BioMap summary:
                                  Name: ''
                        Container_Type: 'Data is file indexed.'
             Total_Number_of_Sequences: 14637324
    Number_of_References_in_Dictionary: 5

            Number_of_Sequences    Genomic_Range 
    Chr1    3151847                   1  30427671
    Chr2    3080417                1000  19698292
    Chr3    3062917                  94  23459782
    Chr4    2218868                1029  18585050
    Chr5    3123275                  11  26975502

The remainder of this example focuses on the analysis of one of the five chromosomes, Chr1. Create a new BioMap to access the short reads mapped to the first chromosome by subsetting the first one.

bm1 = getSubset(bm,'SelectReference','Chr1')

bm1 = 

  BioMap with properties:

    SequenceDictionary: 'Chr1'
             Reference: [3151847x1 File indexed property]
             Signature: [3151847x1 File indexed property]
                 Start: [3151847x1 File indexed property]
        MappingQuality: [3151847x1 File indexed property]
                  Flag: [3151847x1 File indexed property]
          MatePosition: [3151847x1 File indexed property]
               Quality: [3151847x1 File indexed property]
              Sequence: [3151847x1 File indexed property]
                Header: [3151847x1 File indexed property]
                 NSeqs: 3151847
                  Name: ''

By accessing the Start and Stop positions of the mapped short read you can obtain the genomic range.

x1 = min(getStart(bm1))
x2 = max(getStop(bm1))

x1 =

  uint32

   1


x2 =

  uint32

   30427671

Exploring the Coverage at Different Resolutions

To explore the coverage for the whole range of the chromosome, a binning algorithm is required. The getBaseCoverage method produces a coverage signal based on effective alignments. It also allows you to specify a bin width to control the size (or resolution) of the output signal. However internal computations are still performed at the base pair (bp) resolution. This means that despite setting a large bin size, narrow peaks in the coverage signal can still be observed. Once the coverage signal is plotted you can program the figure's data cursor to display the genomic position when using the tooltip. You can zoom and pan the figure to determine the position and height of the ChIP-Seq peaks.

[cov,bin] = getBaseCoverage(bm1,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100])        % sets the axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')

It is also possible to explore the coverage signal at the bp resolution (also referred to as the pile-up profile). Explore one of the large peaks observed in the data at position 4598837.

p1 = 4598837-1000;
p2 = 4598837+1000;

figure
plot(p1:p2,getBaseCoverage(bm1,p1,p2))
xlim([p1,p2])              % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')

Identifying and Filtering Regions with Artifacts

Observe the large peak with coverage depth of 800+ between positions 4599029 and 4599145. Investigate how these reads are aligning to the reference chromosome. You can retrieve a subset of these reads enough to satisfy a coverage depth of 25, since this is sufficient to understand what is happening in this region. Use getIndex to obtain indices to this subset. Then use getCompactAlignment to display the corresponding multiple alignment of the short-reads.

i = getIndex(bm1,4599029,4599145,'depth',25);
bmx = getSubset(bm1,i,'inmemory',false)
getCompactAlignment(bmx,4599029,4599145)

bmx = 

  BioMap with properties:

    SequenceDictionary: 'Chr1'
             Reference: [62x1 File indexed property]
             Signature: [62x1 File indexed property]
                 Start: [62x1 File indexed property]
        MappingQuality: [62x1 File indexed property]
                  Flag: [62x1 File indexed property]
          MatePosition: [62x1 File indexed property]
               Quality: [62x1 File indexed property]
              Sequence: [62x1 File indexed property]
                Header: [62x1 File indexed property]
                 NSeqs: 62
                  Name: ''



ans =

  35×117 char array

    'AGTT AATCAAATAGAAAGCCCCGAGGGCGCCATATCCTAGGCGC  AAACTATGTGATTGAATAAATCCTCCTCTATCTGTTGCGG   GAGGACTCCTTCTCCTTCCCCTTTTGG'
    'AGTGC  TCAAATAGAAAGCCCCGAGGGCGCCATATTCTAGGAGCCC             GAATAAATCCTCCTCTATCTGTTGCGGGTCGAGGACTCCT CTCCTGCCCCTTTTGG'
    'AGTTCAA             CCCGAGGGCGCCATATTCTAGGAGCCCAAACTATGTGATT               TATCTGTTGCGGGTCGAGGACTCCTTCTCCTTCCCCTTCT  '
    'AGTTCAATCAAATAGAAAGC               TTCTAGGAGCCCAAACTATGTGATTGAATAAATCCTCCTC                AGGACTCCTTCTCCTTCCCCTTTTGG'
    'AGTT                                  AAGGAGCCCAAAATATGTGATTGAATAAATCCACCTCTAT              GGACTCCTTCTCCTTCCCCTTTTGG'
    'AGTACAATCAAATAGAAAGCCCCGAGGGCGCCATA   TAGGAGCCCAAACTATGTGATTGAATAAATCCTCCTCTAT                   CCTTCACCTTCCCCTTTTGG'
    'CGTACAATCAAATAGAAAGCCCCGAGGGCGCCATATTC  GGAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCT                         TTCCCCTTTTGG'
    'CGTACAATCAAATAGAAAGCCCCGAGGGCGCCATATTC  GGAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCT                                     '
    'CGTACAATCAAATAGAAAGCCCCGAGGGCGCCATATTC  GGAGCCCAAGCTATGTGATTGAATAAATCCTCCTCTATCT                                     '
    'CGTACAATCAAATAGAAAGCCCCGAGGGCGCCATATTC  GGAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCT                                     '
    'AGTTCAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTA GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    'GATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTA GAGCCCAAACTATGTGATTGAATAAATCTTCCTCTATCTG                                    '
    'GATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTA GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    'GATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTA GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    'GATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTA GAGCCCAAATTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG   CCCAAACTATGTGATTGAATAAATCCTCCTCTATCTGTTG                                 '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG   CACAAACTATGTGATTGAATAAATCCTCCTCTATCTGTTG                                 '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG    CCAAACTATGTGATTGAATAAATCCTCCTCTATCTGTTGC                                '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG                                                                            '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTCG                                                                            '
    ' ATACAATCAAATAGAAAGCCCCGGGGGCGCCATATTCTAG                                                                            '
    ' ATTGAGTCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG                                                                            '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG                                                                            '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG                                                                            '
    ' ATACAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAG                                                                            '
    '    CAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAGGAG                                                                         '
    '    CAATCAAATAGAAAGCCCCGAGGGCGCCATATTCTAGGAG                                                                         '
    '                                      TAGGAGCCCAAACTATGTGATTGAATAAATCCTCCTCTAT                                       '
    '                                      TAGGAGCCCAAACTATGCCATTGAATAAATCCTCCGCTAT                                       '
    '                                        GGAGCCCAAGCTATGTGATTGAATAAATCCTCCTCTATCT                                     '
    '                                         GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    '                                         GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    '                                         GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    '                                         GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '
    '                                         GAGCCCAAACTATGTGATTGAATAAATCCTCCTCTATCTG                                    '

In addition to visually confirming the alignment, you can also explore the mapping quality for all the short reads in this region, as this may hint to a potential problem. In this case, less than one percent of the short reads have a Phred quality of 60, indicating that the mapper most likely found multiple hits within the reference genome, hence assigning a lower mapping quality.

figure
i = getIndex(bm1,4599029,4599145);
hist(double(getMappingQuality(bm1,i)))
title('Mapping Quality of the reads between 4599029 and 4599145')
xlabel('Phred Quality Score')
ylabel('Number of Reads')

Most of the large peaks in this data set occur due to satellite repeat regions or due to its closeness to the centromere [4], and show characteristics similar to the example just explored. You may explore other regions with large peaks using the same procedure.

To prevent these problematic regions, two techniques are used. First, given that the provided data set uses paired-end sequencing, by removing the reads that are not aligned in a proper pair reduces the number of potential aligner errors or ambiguities. You can achieve this by exploring the flag field of the SAM formatted file, in which the second less significant bit is used to indicate if the short read is mapped in a proper pair.

i = find(bitget(getFlag(bm1),2));
bm1_filtered = getSubset(bm1,i)

bm1_filtered = 

  BioMap with properties:

    SequenceDictionary: 'Chr1'
             Reference: [3040724x1 File indexed property]
             Signature: [3040724x1 File indexed property]
                 Start: [3040724x1 File indexed property]
        MappingQuality: [3040724x1 File indexed property]
                  Flag: [3040724x1 File indexed property]
          MatePosition: [3040724x1 File indexed property]
               Quality: [3040724x1 File indexed property]
              Sequence: [3040724x1 File indexed property]
                Header: [3040724x1 File indexed property]
                 NSeqs: 3040724
                  Name: ''

Second, consider only uniquely mapped reads. You can detect reads that are equally mapped to different regions of the reference sequence by looking at the mapping quality, because BWA assigns a lower mapping quality (less than 60) to this type of short read.

i = find(getMappingQuality(bm1_filtered)==60);
bm1_filtered = getSubset(bm1_filtered,i)

bm1_filtered = 

  BioMap with properties:

    SequenceDictionary: 'Chr1'
             Reference: [2313252x1 File indexed property]
             Signature: [2313252x1 File indexed property]
                 Start: [2313252x1 File indexed property]
        MappingQuality: [2313252x1 File indexed property]
                  Flag: [2313252x1 File indexed property]
          MatePosition: [2313252x1 File indexed property]
               Quality: [2313252x1 File indexed property]
              Sequence: [2313252x1 File indexed property]
                Header: [2313252x1 File indexed property]
                 NSeqs: 2313252
                  Name: ''

Visualize again the filtered data set using both, a coarse resolution with 1000 bp bins for the whole chromosome, and a fine resolution for a small region of 20,000 bp. Most of the large peaks due to artifacts have been removed.

[cov,bin] = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100])        % sets the axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')

p1 = 24275801-10000;
p2 = 24275801+10000;

figure
plot(p1:p2,getBaseCoverage(bm1_filtered,p1,p2))
xlim([p1,p2])              % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')

Recovering Sequencing Fragments from the Paired-End Reads

In Wang's paper [1] it is hypothesized that paired-end sequencing data has the potential to increase the accuracy of the identification of chromosome binding sites of DNA associated proteins because the fragment length can be derived accurately, while when using single-end sequencing it is necessary to resort to a statistical approximation of the fragment length, and use it indistinctly for all putative binding sites.

Use the paired-end reads to reconstruct the sequencing fragments. First, get the indices for the forward and the reverse reads in each pair. This information is captured in the fifth bit of the flag field, according to the SAM file format.

fow_idx = find(~bitget(getFlag(bm1_filtered),5));
rev_idx = find(bitget(getFlag(bm1_filtered),5));

SAM-formatted files use the same header strings to identify pair mates. By pairing the header strings you can determine how the short reads in BioMap are paired. To pair the header strings, simply order them in ascending order and use the sorting indices (hf and hr) to link the unsorted header strings.

[~,hf] = sort(getHeader(bm1_filtered,fow_idx));
[~,hr] = sort(getHeader(bm1_filtered,rev_idx));
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);

Use the resulting fow_idx and mate_idx variables to retrieve pair mates. For example, retrieve the paired-end reads for the first 10 fragments.

for j = 1:10
  disp(getInfo(bm1_filtered, fow_idx(j)))
  disp(getInfo(bm1_filtered, mate_idx(j)))
end

SRR054715.sra.6849385	163	20	60	40M	AACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAA	BBBBBBBBBBCBCB?2?BBBBB@7;BBC?7=7?BCC4*)3
SRR054715.sra.6849385	83	229	60	40M	CCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGA	06BBBB=BBBBBBBBBBBBBBA6@@@9<*9BBA@>BBBBB
SRR054715.sra.6992346	99	20	60	40M	AACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAA	=B?BCB=2;BBBBB=B8BBCCBBBBBBBC66BBB=BC8BB
SRR054715.sra.6992346	147	239	60	40M	GTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCT	BBCBB6B?B0B8B<'.BBBBBBBB=BBBBB6BBBBB;*6@
SRR054715.sra.8438570	163	47	60	40M	CTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCC	BC=BBBBCBB?==BBB;BB;?BBB8BCB??B-BB<*<B;B
SRR054715.sra.8438570	83	274	60	40M	TATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGAT	BBBBB=;BBBBBBBBB;6?=BBBBBBBB<*9BBB;8BBB?
SRR054715.sra.1676744	163	67	60	40M	ATCCTACATCCATGAATCCCTAAATACCTAATCCCCTAAA	BBCB>4?+<BB6BB66BBC?77BBCBC@4ABB-BBBCCBB
SRR054715.sra.1676744	83	283	60	40M	TTGTTATATTGGATACAAGCTTTGCTACGATCTACATTTG	CCB6BBB93<BBBB>>@B?<<?BBBBBBBBBBBBBBBBBB
SRR054715.sra.6820328	163	73	60	40M	CATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAA	BB=08?BB?BCBBB=8BBB8?CCB-B;BBB?;;?BB8B;8
SRR054715.sra.6820328	83	267	60	40M	GTTGGTGTATCTTCATTTGTTATATTGGATACGAGCTTTG	BBBBB646;BB8@44BB=BBBB?C8BBBB=B6.9B8CCCB
SRR054715.sra.1559757	163	103	60	40M	TAAACCCGAAACCGGTTTCTCTGGTTGAAACTCATTGTGT	BBBBBCBBBBBBBBBBBCBBBB?BBBB<;?*?BBBBB7,*
SRR054715.sra.1559757	83	311	60	40M	GATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA	<?BBBBB?7=BBBBBBBBBBBBBB@;@>@BBBBBBBBBBB
SRR054715.sra.5658991	163	103	60	40M	CAAACCCGAAACCGGTTTCTCTGGTTGAAACTCATTGTGT	7?BBBBBB;=BBBB?8B;B-;BCB-B<49<6B8-BB?+?B
SRR054715.sra.5658991	83	311	60	40M	GATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA	3,<-BBCBBBBBB?=BBBBA<ABBBBBBBBB?79BBB?BB
SRR054715.sra.4625439	163	143	60	40M	ATATAATGATAATTTTAGCGTTTTTATGCAATTGCTTATT	BBBBB@,*<8BBB++2B6B;+6B8B;8+9BB0,'9B=.=B
SRR054715.sra.4625439	83	347	60	40M	CTTAGTGTTGGTTTATCTCAAGAATCTTATTAATTGTTTG	+BB8B0BBB?BBBB-BBBB22?BBB-BB6BB-BBBBBB?B
SRR054715.sra.1007474	163	210	60	40M	ATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTGCTT	BBBBBBBB;.>BB6B6',BBBCBB-08BBBBB;CB9630<
SRR054715.sra.1007474	83	408	60	40M	TATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTAT	BBB@AABBBCCCBBBBBBB=BBBCB8BBBBB=B6BCBB77
SRR054715.sra.7345693	99	213	60	40M	TGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCT	B>;>BBB9,<6?@@BBBBBBBBBBBBBB7<9BBBBBB6*'
SRR054715.sra.7345693	147	393	60	40M	TTATTTTTGGACATTTATTGTCATTCTTACTCCTTTGGGG	BB-?+?C@>9BBBBBB6.<BBB-BBB94;A4442+49';B

Use the paired-end indices to construct a new BioMap with the minimal information needed to represent the sequencing fragments. First, calculate the insert sizes.

J = getStop(bm1_filtered, fow_idx);
K = getStart(bm1_filtered, mate_idx);
L = K - J - 1;

Obtain the new signature (or CIGAR string) for each fragment by using the short read original signatures separated by the appropriate number of skip CIGAR symbols (N).

n = numel(L);
cigars = cell(n,1);
for i = 1:n
   cigars{i} = sprintf('%dN' ,L(i));
end
cigars = strcat( getSignature(bm1_filtered, fow_idx),...
                 cigars,...
                 getSignature(bm1_filtered, mate_idx));

Reconstruct the sequences for the fragments by concatenating the respective sequences of the paired-end short reads.

seqs = strcat( getSequence(bm1_filtered, fow_idx),...
               getSequence(bm1_filtered, mate_idx));

Calculate and plot the fragment size distribution.

J = getStart(bm1_filtered,fow_idx);
K = getStop(bm1_filtered,mate_idx);
L = K - J + 1;
figure
hist(double(L),100)
title(sprintf('Fragment Size Distribution\n %d Paired-end Fragments Mapped to Chromosome 1',n))
xlabel('Fragment Size')
ylabel('Count')

Construct a new BioMap to represent the sequencing fragments. With this, you will be able explore the coverage signals as well as local alignments of the fragments.

bm1_fragments = BioMap('Sequence',seqs,'Signature',cigars,'Start',J)

bm1_fragments = 

  BioMap with properties:

    SequenceDictionary: {0×1 cell}
             Reference: {0×1 cell}
             Signature: {1156626×1 cell}
                 Start: [1156626×1 uint32]
        MappingQuality: [0×1 uint8]
                  Flag: [0×1 uint16]
          MatePosition: [0×1 uint32]
               Quality: {0×1 cell}
              Sequence: {1156626×1 cell}
                Header: {0×1 cell}
                 NSeqs: 1156626
                  Name: ''

Exploring the Coverage Using Fragment Alignments

Compare the coverage signal obtained by using the reconstructed fragments with the coverage signal obtained by using individual paired-end reads. Notice that enriched binding sites, represented by peaks, can be better discriminated from the background signal.

cov_reads = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,x1,x2,'binWidth',1000,'binType','max');

figure
plot(bin,cov_reads,bin,cov_fragments)
xlim([x1,x2])              % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments')

Perform the same comparison at the bp resolution. In this dataset, Wang et.al. [1] investigated a basic helix-loop-helix (bHLH) transcription factor. bHLH proteins typically bind to a consensus sequence called an E-box (with a CANNTG motif). Use fastaread to load the reference chromosome, search for the E-box motif in the 3' and 5' directions, and then overlay the motif positions on the coverage signals. This example works over the region 1-200,000, however the figure limits are narrowed to a 3000 bp region in order to better depict the details.

p1 = 1;
p2 = 200000;

cov_reads = getBaseCoverage(bm1_filtered,p1,p2);
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,p1,p2);

chr1 = fastaread('ach1.fasta');
mp1 = regexp(chr1.Sequence(p1:p2),'CA..TG')+3+p1;
mp2 = regexp(chr1.Sequence(p1:p2),'GT..AC')+3+p1;
motifs = [mp1 mp2];

figure
plot(bin,cov_reads,bin,cov_fragments)
hold on
plot([1;1;1]*motifs,[0;max(ylim);NaN],'r')
xlim([111000 114000])      % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments','E-box motif')

Observe that it is not possible to associate each peak in the coverage signals with an E-box motif. This is because the length of the sequencing fragments is comparable to the average motif distance, blurring peaks that are close. Plot the distribution of the distances between the E-box motif sites.

motif_sep = diff(sort(motifs));
figure
hist(motif_sep(motif_sep<500),50)
title('Distance (bp) between adjacent E-box motifs')
xlabel('Distance (bp)')
ylabel('Counts')

Finding Significant Peaks in the Coverage Signal

Use the function mspeaks to perform peak detection with Wavelets denoising on the coverage signal of the fragment alignments. Filter putative ChIP peaks using a height filter to remove peaks that are not enriched by the binding process under consideration.

putative_peaks = mspeaks(bin,cov_fragments,'noiseestimator',20,...
                         'heightfilter',10,'showplot',true);
hold on
legend('off')
plot([1;1;1]*motifs(motifs>p1 & motifs<p2),[0;max(ylim);NaN],'r')
xlim([111000 114000])      % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
legend('Coverage from Fragments','Wavelet Denoised Coverage','Putative ChIP peaks','E-box Motifs')
xlabel('Base position')
ylabel('Depth')
title('ChIP-Seq Peak Detection')

Use the knnsearch function to find the closest motif to each one of the putative peaks. As expected, most of the enriched ChIP peaks are close to an E-box motif [1]. This reinforces the importance of performing peak detection at the finest resolution possible (bp resolution) when the expected density of binding sites is high, as it is in the case of the E-box motif. This example also illustrates that for this type of analysis, paired-end sequencing should be considered over single-end sequencing [1].

h = knnsearch(motifs',putative_peaks(:,1));
distance = putative_peaks(:,1)-motifs(h(:))';
figure
hist(distance(abs(distance)<200),50)
title('Distance to Closest E-box Motif for Each Detected Peak')
xlabel('Distance (bp)')
ylabel('Counts')

References

[1] Wang, Congmao, Jie Xu, Dasheng Zhang, Zoe A Wilson, and Dabing Zhang. “An Effective Approach for Identification of in Vivo Protein-DNA Binding Sites from Paired-End ChIP-Seq Data.” BMC Bioinformatics 11, no. 1 (2010): 81.

[2] Li, H., and R. Durbin. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 25, no. 14 (July 15, 2009): 1754–60.

[3] Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. “The Sequence Alignment/Map Format and SAMtools.” Bioinformatics 25, no. 16 (August 15, 2009): 2078–79.

[4] Jothi, R., S. Cuddapah, A. Barski, K. Cui, and K. Zhao. “Genome-Wide Identification of in Vivo Protein-DNA Binding Sites from ChIP-Seq Data.” Nucleic Acids Research 36, no. 16 (August 1, 2008): 5221–31.

[5] Hoffman, Brad G, and Steven J M Jones. “Genome-Wide Identification of DNA–Protein Interactions Using Chromatin Immunoprecipitation Coupled with Flow Cell Sequencing.” Journal of Endocrinology 201, no. 1 (April 2009): 1–13.

[6] Ramsey, Stephen A., Theo A. Knijnenburg, Kathleen A. Kennedy, Daniel E. Zak, Mark Gilchrist, Elizabeth S. Gold, Carrie D. Johnson, et al. “Genome-Wide Histone Acetylation Data Improve Prediction of Mammalian Transcription Factor Binding Sites.” Bioinformatics 26, no. 17 (September 1, 2010): 2071–75.