Bioinformatic course – day4

Today bioinformatic lab was GREAT! Though we insist in working with datasets of no relevance what so ever to me (and unknown from the beginning), I really enjoyed using Galaxy to find out where on Ch19 of mice, a X transcription factor binds.

The exercise that follow is original from “James”, a user that wrote a nice tutorial on Galaxy. Thank you James! Your tutorial was used by a lot of people today!

I WISH I could use Galaxy-and-that’s-it, but the server was very busy so I kept jumping from Galaxy to all it’s mirrors on the web. As I understood, mirrors are other servers that host basically the same program but with some variants. So I asked help to Galaxy/Test mirror and Galaxy/Cistrome mirror wherever I had to.

ChIP-seq exercises

For this exercise I used a ChIP-seq dataset for CTCF of the murine G1E_ER4 and G1E cell line. The cell line is the same, but the ER4 is treated to differentiate.

This dataset has been reduced to (mostly) contain only reads aligning to chr19:

Galaxy Dataset | G1E_ER4 CTCF (chr19)

TASK 1: Dataset loaded on Galaxy: difficult but I managed!

when I used Cistron instead, I load directly a .bam file of the very same ChIP-seq already prepared by Rikard (from the FASTq)



Mapping reads and peak calling

Step 1: First, for quality control, we will compute summary statistics on this dataset. Run the tool “NGS: QC and Manipulation > FASTQ Summary Statistics” on your dataset. When the job completes, inspect the results. The Fastaq data contains both the sequence and the quality of every single read itself. More precisely expressed in Phred quality score. Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).

  • How long are these reads? As the file contains 36 rows, the reads are all 36 base long. This is because the machine has shot the sequence and recorded the chip for 36 times, so in principle the lenght of all seq is the same. I can check whether every sequence shot produced a color spot (a base) in all the chip, looking at the “count” column. This is 270631 for all the 36 rows, so no base reading have ever being missed.

  • What is the median quality at the last position? 30.

I can plot the summary of the QC with some nice commands found in Galaxy. This chart show the decline of the quality of each base reading (from 1 to 36 on the X axis).

Step 2: Next we will map these reads to a reference genome. Use the “NGS: Mapping > Map with Bowtie for Illumina” tool. As the ChIP comes from mouse cells, you will need to change the reference genome build you are mapping against to “mm10” (latest version). Otherwise you can leave the default mapping options. Single-end reads comes with one single FASTq file.

This operation will try to map the sequence of the ChIP-seq agains the genome of reference. Once this is done, we can do cool stuff like getting the peaks where the sequence are more abundant!

OBS: most SAM/BAM format data is the output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. With SAM/BAM file I can visualize peaks.

Bowtie is a short read aligner designed to be ultrafast and memory-efficient. It should generate a BAM/SAM file.

Sanger FASTQ → Bowtie → output (.bam/sam file)

If use Cistron with the uploaded .bam file, I cannot make the QC (QC done only with fastq files), so jump directly to the MACS peak detection

Step 3: Once reads are mapped, we will call peaks with MACS (Done on Cistrone, cuz Galaxy is still thinking…) Use the “NGS: Peak Calling > MACS” tool. You should also change the tag size to the read length you observed in Step 1 (36). Otherwise the default values should be reasonable.

In Cistron:

  • treatment file is the BAM if the ChIP.

  • input file is the file of the genome of reference. This is the negative control, as the alignment of this guy vs the genome itself should generate a base-line, no enriched in any part of the genome.

  • Format: auto

  • Effective Genome Size: mm9

  • Tag size: 36 as this is the length of my readings

Keep all the other parameter as default.

OBS: a wiggle file (Wiggle Track Format (WIG)) is a file containing the actual information to create histogram-like peaks. The peaks file actually just make out those peaks and reture just the location of the peaks, without the “intensity of it”.

Step 4: Once MACS completes it will produce two datasets. One is a report on the peak calling process. The other contains the positions of the peaks. I actually got 6 files:

  • diagnosis report

  • job log

  • wiggle

  • xls

  • summist

  • peaks

How many peaks were found? click on “peak file”: 776

Click the link to “Display at UCSC main” and you will be able to see the positions of the peaks on the genome.

This visualization mode in UCSC with my “User Track” (in black) as my peaks input (Ch 19 only selected).

zoom in: there is a peak covering ~500bp in a intron region of Lrp5 gene. Whatever that means, it’s cool!


Calling peaks with a control sample (Galaxy only – I used already a control sample (genome) in Cistron)

Next, we will incorporate an input DNA control, import the following dataset into your history:

Galaxy Dataset | G1E_ER4 input (chr19) –>(Reduced demo dataset, chr19 only)

Step 1: Map the input DNA control to mm9 using Bowtie

Step 2: Load the MACS tool again. Select your previous CTCF dataset for ChIP-seq tag file, but now select the mapped input DNA for “ChIP-seq control file”. How many peaks are called this time? What is the effect of using the input control?

Create a workflow and reuse

Step 1: At the top of the History panel, click “Options” and select “Extract Workflow”. Here you have the chance to select which jobs will be included in the workflow. Click “Uncheck all” and the select the two “Map with Bowtie” jobs and the last “MACS” job. (As I haven’t perform any Bowtie in Cistrom, in my workflow there is only the MACS job, so I’ll save only that)

Step 2: Import the following datasets — CTCF ChIP and control for the G1E line:

Galaxy Dataset | G1E CTCF

Galaxy Dataset | G1E input

I downloaded on my computer the files and I uploaded them on the server of Cistrom

Step 3: At the bottom of the tools menu, select “Workflows > All Workflows”, this will show the workflow list. Select the workflow you just created. You will be able to select input datasets for the two Bowtie steps, select the CTCF and input datasets. Click “Run Workflow”.

I have to run the bowtie separately as I never run then before (I got the .bam file to start). Ok, apparently Galaxy/Cistrom doesn’t support bowtie to convert fastq into BAM. I gonna try Galaxy/test.

  • upload the G1E datasets

  • mapping with bowtie for illumina → this shit doesn’t work. It doen’t dysplay to me the reference genome I can choose from

  • mapping with BWA fro illumina → ok! this created two file for CTCF and input of .sam

  • run MACS with tag (treatment) and control file (input). I got two files:

    • html report

    • peaks

Visualize data peaks file on UCSC: black bars is the ChIP signal

Posted in Science and tagged . Bookmark the permalink. RSS feed for this post. Leave a trackback.

Leave a Reply

Swedish Greys - a WordPress theme from Nordic Themepark.

SciencePlug is Stephen Fry proof thanks to caching by WP Super Cache

%d bloggers like this: