Test data

Mette Bentsen edited this page Mar 25, 2020 · 3 revisions

Obtaining test data

Data for the TOBIAS test commands found in this wiki can be obtained using TOBIAS DownloadData:

$ TOBIAS DownloadData --bucket data-tobias-2020
$ mv data-tobias-2020/ test_data/

This downloads the test data (~700 MB) from the loosolab S3 storage server and moves it to the test_data/ directory.

Source of ATAC-seq data

The test data originate from the paper "Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position", Buenrostro et al. 2013, Nature Methods (link). This paper applied ATAC-seq to the GM12878 lymphoblastoid cell line (derived from B cells) and to CD4+ T cells at three time points. The raw data from the study (study accession PRJNA207663) were downloaded in FASTQ format from the following URLs:

sample_title experiment_accession fastq files
GM12878_ATACseq_50k_Rep1 SRX298000 read1,read2
GM12878_ATACseq_50k_Rep2 SRX298001 read1,read2
GM12878_ATACseq_50k_Rep3 SRX298002 read1,read2
GM12878_ATACseq_50k_Rep4 SRX298003 read1,read2
CD4+_ATACseq_Day1_Rep1 SRX298007 read1,read2
CD4+_ATACseq_Day1_Rep2 SRX298008 read1,read2
CD4+_ATACseq_Day2_Rep1 SRX298009 read1,read2
CD4+_ATACseq_Day2_Rep2 SRX298010 read1,read2
CD4+_ATACseq_Day3_Rep1 SRX298011 read1,read2
CD4+_ATACseq_Day3_Rep2 SRX298012 read1,read2

Data processing

Mapping

All samples were mapped using STAR. The individual replicates of each condition were merged using samtools merge into per-condition .bam-files, yielding Bcell.bam, Tcell_day1.bam, Tcell_day2.bam and Tcell_day3.bam. To keep file sizes minimal, a random subset of reads was chosen for each replicate using samtools view -s <fraction>. For the sake of the examples, the Tcell samples were further merged into a single .bam-file, Tcell.bam.
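
The subsampling and merging described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact commands used to build the test data: the replicate file names, the fraction (0.1) and the seed (42) are hypothetical. In samtools view -s, the integer part of the argument is the random seed and the decimal part is the fraction of reads to keep. By default the sketch only prints the commands (dry run); set DRY="" to execute them.

```shell
# Dry-run sketch: print the samtools commands rather than executing them.
# Set DRY="" to actually run them (requires samtools and the input files).
DRY="${DRY-echo}"

seed=42        # assumed seed (integer part of the -s argument)
fraction=0.1   # assumed fraction of reads to keep (decimal part)
subsample_arg="${seed}.${fraction#*.}"   # seed and fraction combined: 42.1

# Hypothetical per-replicate input files for the B-cell condition.
replicates="GM12878_Rep1 GM12878_Rep2"

subsampled=""
for rep in $replicates; do
    # Keep a random subset of reads from each replicate.
    $DRY samtools view -b -s "$subsample_arg" -o "${rep}.sub.bam" "${rep}.bam"
    subsampled="$subsampled ${rep}.sub.bam"
done

# Merge the subsampled replicates into one condition-level file.
# $subsampled is intentionally unquoted so it splits into separate file names.
$DRY samtools merge Bcell.bam $subsampled
```

Fixing the seed makes the subset reproducible: rerunning with the same seed and fraction selects the same reads.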

Peak-calling

Peak-calling was performed per replicate using MACS2 with parameters --nomodel --shift -100 --extsize 200 --broad. The file merged_peaks.bed represents peaks merged across the Bcell and Tcell conditions.
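
As a rough sketch of how these parameters fit into a complete call, the loop below runs macs2 callpeak once per replicate. The .bam file names and the peaks/ output directory are hypothetical placeholders, and the remaining MACS2 options are left at their defaults, which may differ from what was used for the test data. The commands are only printed by default; set DRY="" to execute them.

```shell
# Dry-run sketch: print the MACS2 commands rather than executing them.
DRY="${DRY-echo}"

# Hypothetical per-replicate .bam files; substitute the real ones.
for bam in Bcell_rep1.bam Bcell_rep2.bam; do
    name="${bam%.bam}"   # strip the extension to get the output prefix
    $DRY macs2 callpeak -t "$bam" -n "$name" --outdir peaks \
        --nomodel --shift -100 --extsize 200 --broad
done
```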

Annotation file

The .gtf-file used for annotation was downloaded from Ensembl (link). The prefix "chr" was added to the chromosome names, and the file was then subset to chr4.
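
One possible way to add the prefix and subset the file is sketched below with awk. The toy input is a hypothetical stand-in for the Ensembl file, and the original processing may have used different tools.

```shell
workdir=$(mktemp -d)

# Toy tab-separated GTF without the "chr" prefix (placeholder records).
printf '#!genome-build GRCh38\n4\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n5\thavana\tgene\t300\t400\t.\t-\t.\tgene_id "G2";\n' > "$workdir/toy.gtf"

# Skip header lines starting with "#", prefix the chromosome column with
# "chr", and keep only the chr4 records.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { next }
     { $1 = "chr" $1 }
     $1 == "chr4"' "$workdir/toy.gtf" > "$workdir/toy_chr4.gtf"

cat "$workdir/toy_chr4.gtf"
```

Note that this sketch drops the "#" header lines; keep them if downstream tools rely on them.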

Annotation of peaks to nearest genes

Annotation of peaks in merged_peaks.bed was performed using UROPA as shown here:

$ uropa --bed merged_peaks.bed --gtf transcripts_chr4.gtf --show_attributes gene_id gene_name --feature_anchor start --distance 20000 10000 --feature gene 

The test files are obtained with:

$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | head -n 1 > merged_peaks_annotated_header.txt
$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | tail -n +2 > merged_peaks_annotated.bed
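
To see what these two commands do, the sketch below applies them to a toy 17-column table. The column names and values are placeholders, not real UROPA output: the first command keeps the header line only, the second keeps everything from line 2 onwards, and both retain columns 1-6 and 16-17.

```shell
workdir=$(mktemp -d)
hits="$workdir/toy_finalhits.txt"

# Build a toy tab-separated table: one header line (col1..col17) and two
# data rows (a1..a17, b1..b17).
awk 'BEGIN {
    OFS = "\t"
    n = split("col a b", prefix, " ")
    for (r = 1; r <= n; r++) {
        line = ""
        for (c = 1; c <= 17; c++)
            line = line (c > 1 ? OFS : "") prefix[r] c
        print line
    }
}' > "$hits"

# Header: first line only, restricted to the selected columns.
cut -f 1-6,16-17 "$hits" | head -n 1  > "$workdir/header.txt"
# Body: all lines from line 2 onwards, same columns.
cut -f 1-6,16-17 "$hits" | tail -n +2 > "$workdir/body.bed"

cat "$workdir/header.txt"   # col1 ... col6, col16, col17 (tab-separated)
cat "$workdir/body.bed"
```

Splitting the header from the body keeps the .bed file free of a header line, which many tools that read BED input do not expect.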

Motifs

The file motifs.jaspar contains 83 motifs from the JASPAR 2020 vertebrate database (download here). The motifs found in test_data/individual_motifs/ were obtained using TOBIAS FormatMotifs --task split.

Blacklist

The file blacklist.bed is a subset of the Boyle-lab blacklist (available here) containing only chr4 regions.

TOBIAS output files

Additional files are obtained using the test commands throughout this wiki.