
BINDetect

Mette Bentsen edited this page Jan 26, 2023 · 10 revisions

Background

To make predictions on specific transcription factor binding, we need to combine footprint scores with information on transcription factor binding motifs. This enables us to estimate the binding positions of individual transcription factors across the genome. Simultaneously, if we have multiple conditions, we are interested in the differential binding of transcription factors, both globally and at single sites. TOBIAS BINDetect is a tool that integrates these different sources of information to predict transcription factor binding across multiple conditions.


Example command

To run transcription factor binding site prediction and estimate differential binding between Bcells and Tcells, use:

$ TOBIAS BINDetect --motifs test_data/motifs.jaspar --signals test_data/Bcell_footprints.bw test_data/Tcell_footprints.bw --genome test_data/genome.fa.gz --peaks test_data/merged_peaks_annotated.bed --peak_header test_data/merged_peaks_annotated_header.txt --outdir BINDetect_output --cond_names Bcell Tcell --cores 8

Expected runtime: ~1 minute (two conditions, 6000 peaks, 83 motifs)

BINDetect can also be used to predict binding for a single condition (which will turn off any estimation of differential binding):

$ TOBIAS BINDetect --motifs test_data/motifs.jaspar --signals test_data/Bcell_footprints.bw --genome test_data/genome.fa.gz --peaks test_data/merged_peaks_annotated.bed --peak_header test_data/merged_peaks_annotated_header.txt --outdir BINDetect_single_output --cond_names Bcell --cores 8

Or for a timeline of ATAC-seq experiments:

$ TOBIAS BINDetect --motifs test_data/motifs.jaspar --signals test_data/Tcell_day{1,2,3}_footprints.bw --genome test_data/genome.fa.gz --peaks test_data/merged_peaks_annotated.bed --peak_header test_data/merged_peaks_annotated_header.txt --outdir BINDetect_Tcell_output --time-series --cores 8
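
When the same call has to be repeated for many sample sets, it can help to assemble the argument list in a script. The following is a minimal Python sketch and not part of TOBIAS: `build_bindetect_command` is a hypothetical helper that uses only the flags shown in the commands above and returns the argument list without running anything.

```python
import shlex

def build_bindetect_command(motifs, signals, genome, peaks, outdir,
                            cond_names=None, peak_header=None,
                            time_series=False, cores=8):
    """Assemble a TOBIAS BINDetect call as an argument list (hypothetical helper)."""
    if cond_names is not None and len(cond_names) != len(signals):
        raise ValueError("need exactly one condition name per signal file")

    cmd = ["TOBIAS", "BINDetect", "--motifs", motifs, "--signals", *signals,
           "--genome", genome, "--peaks", peaks]
    if peak_header is not None:
        cmd += ["--peak_header", peak_header]
    cmd += ["--outdir", outdir]
    if cond_names is not None:
        cmd += ["--cond_names", *cond_names]
    if time_series:
        cmd.append("--time-series")
    cmd += ["--cores", str(cores)]
    return cmd

# The shell pattern Tcell_day{1,2,3}_footprints.bw expands to three files;
# in Python the list has to be spelled out explicitly.
signals = [f"test_data/Tcell_day{day}_footprints.bw" for day in (1, 2, 3)]
cmd = build_bindetect_command(
    motifs="test_data/motifs.jaspar",
    signals=signals,
    genome="test_data/genome.fa.gz",
    peaks="test_data/merged_peaks_annotated.bed",
    peak_header="test_data/merged_peaks_annotated_header.txt",
    outdir="BINDetect_Tcell_output",
    time_series=True,
)
print(shlex.join(cmd))
# To actually run it (requires TOBIAS on the PATH): subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids quoting problems if a path contains spaces.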

Input parameters

  • --signals
    List of signal bigwigs with scores representing protein binding for each biological condition (higher scores = more evidence of binding). This can for example be coverage tracks or footprint scores calculated with TOBIAS ScoreBigwig.

  • --motifs
    File containing motifs in either PFM, JASPAR or MEME format. These are the motifs which will be used to scan for binding sites.

  • --genome
    The fasta file containing the full genome sequence for the given organism. The chromosome names and lengths must match those in the --signals bigwigs.

  • --peaks
    The peaks representing open chromatin regions across all conditions. Since binding sites are only predicted within these regions, it is important that the --signals were calculated across the same --peaks.

Full input parameters can be found by running TOBIAS BINDetect --help.
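
The requirement that --genome matches the chromosome names and lengths of the --signals bigwigs can be checked before a run. The following is a minimal Python sketch, assuming the chromosome sizes are available as plain name-to-length mappings. For the genome these could come from a FASTA index (`.fai`, whose first two tab-separated columns are name and length), and for a bigwig from pyBigWig's `chroms()`. `parse_fai` and `chrom_mismatches` are hypothetical helpers, and the sizes shown are made up.

```python
def parse_fai(text):
    """Read chromosome lengths from the text of a FASTA index (.fai)."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            sizes[name] = int(length)
    return sizes

def chrom_mismatches(genome_sizes, signal_sizes):
    """List signal chromosomes that are missing from, or differ in length to, the genome."""
    problems = []
    for chrom, length in sorted(signal_sizes.items()):
        if chrom not in genome_sizes:
            problems.append(f"{chrom}: in signal but missing from genome")
        elif genome_sizes[chrom] != length:
            problems.append(f"{chrom}: length {length} in signal, "
                            f"{genome_sizes[chrom]} in genome")
    return problems

# Made-up example data
fai_text = "chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n"
genome_sizes = parse_fai(fai_text)
# A mapping like this could be obtained from pyBigWig's bw.chroms()
signal_sizes = {"chr1": 1000, "chr2": 750, "chrM": 16}

for problem in chrom_mismatches(genome_sizes, signal_sizes):
    print(problem)
```

An empty list means every chromosome in the signal file has a counterpart of the same length in the genome.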


Output

In the output folder given in --outdir, the following files and folder structure will be created:

  • <outdir>/<TF>/
    For each motif in --motifs, there will be a directory containing results from the scanning for this motif. The value of <TF> is given by the --naming parameter.

    • <outdir>/<TF>/<TF>_overview.{txt,xlsx}
      This is an overview of all motif occurrences (TFBS) in --peaks for <TF>. The file exists in .txt (tab-delimited) and .xlsx format for easy filtering, sorting, etc. The columns of these files are:

      | Column name(s) | Explanation |
      |---|---|
      | TFBS_chr/TFBS_start/TFBS_end/TFBS_name | The genomic location of the TFBS and the name of the TF. |
      | TFBS_score | The score of the TF motif against the genomic sequence, i.e. how well the motif fits the sequence. |
      | TFBS_strand | Whether the motif hit is on the + or - strand. |
      | peak_chr/peak_start/peak_end | The genomic location of the peak in which the TFBS lies. |
      | additional_<i> | Any additional columns given within --peaks. The names of these can be controlled by the --peak-header argument. |
      | <condition>_score | The footprint score within <condition> for this specific TFBS. |
      | <condition>_bound | An indicator of whether this binding site was predicted bound (1) or unbound (0) within <condition>. |
      | <condition1>_<condition2>_log2fc | The log2 fold change between the footprint scores of the two conditions. Tells you whether the TFBS was predicted to be more or less bound in one condition than in the other. |
    • <outdir>/<TF>/beds/
      The beds directory contains BED files for all sites as well as bound/unbound splits per condition. The _all file contains all scores from --signals, whereas the bound/unbound files contain only the score for the given condition in the last column. Values of <condition> are given by the --cond_names parameter.
      - <outdir>/<TF>/beds/<TF>_all.bed
      - <outdir>/<TF>/beds/<TF>_<condition>_bound.bed
      - <outdir>/<TF>/beds/<TF>_<condition>_unbound.bed
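
The overview table lends itself to scripted filtering. Below is a minimal Python sketch using only the standard library. It assumes the tab-delimited `.txt` layout described above with a header row, and the condition names `Bcell` and `Tcell` from the example command. The rows, the TF name `MYTF`, and the helper `bound_only_in` are made up for illustration.

```python
import csv
import io

# Made-up excerpt of <outdir>/<TF>/<TF>_overview.txt (tab-delimited, header row assumed)
overview_txt = "\n".join([
    "TFBS_chr\tTFBS_start\tTFBS_end\tTFBS_name\tTFBS_score\tTFBS_strand\t"
    "peak_chr\tpeak_start\tpeak_end\tBcell_score\tBcell_bound\t"
    "Tcell_score\tTcell_bound\tBcell_Tcell_log2fc",
    "chr1\t100\t110\tMYTF\t9.5\t+\tchr1\t50\t400\t0.80\t1\t0.10\t0\t2.1",
    "chr1\t300\t310\tMYTF\t8.1\t-\tchr1\t50\t400\t0.05\t0\t0.90\t1\t-2.7",
    "chr2\t700\t710\tMYTF\t7.7\t+\tchr2\t600\t900\t0.70\t1\t0.75\t1\t-0.1",
])

def bound_only_in(rows, condition, other):
    """Return sites predicted bound in `condition` but unbound in `other`."""
    return [row for row in rows
            if float(row[f"{condition}_bound"]) == 1
            and float(row[f"{other}_bound"]) == 0]

# For a real run, replace io.StringIO(...) with open(path, newline="")
rows = list(csv.DictReader(io.StringIO(overview_txt), delimiter="\t"))

for row in bound_only_in(rows, "Bcell", "Tcell"):
    print(row["TFBS_chr"], row["TFBS_start"], row["TFBS_end"],
          row["Bcell_Tcell_log2fc"])
```

Swapping the two condition names in the call returns the sites bound only in the other condition.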

  • <outdir>/bindetect_results.{txt,xlsx}
    This file contains the results of the complete BINDetect run. Each row is a TF, and the columns are:

    | Column name(s) | Explanation |
    |---|---|
    | output_prefix | The prefix as estimated by --naming. This column corresponds to the names of the output TF directories. |
    | name | The name of each motif. |
    | motif_id | The unique motif id of each motif. |
    | cluster | Motif clustering based on the overlap of all identified TFBS per motif. The clusters are named according to one representative TF from each cluster. |
    | total_tfbs | Number of binding sites found in input --peaks. |
    | <condition>_mean_score | The mean of footprint scores for all TFBS for this motif. This does not necessarily show which TFs are most bound, but rather which TFs have high-scoring (i.e. clear) footprints. It is therefore more helpful to compare between conditions and look at the change than to interpret the absolute footprint scores. |
    | <condition>_bound | Number of sites predicted bound in the given condition. This is estimated based on the distribution of scores, and is therefore very dependent on how well the threshold for bound/unbound was set. It can therefore happen that a transcription factor has more bound sites in condition1 than in condition2, but has a negative <condition1>_<condition2>_change score, which would support more bound sites in condition2. In this case, the _change score is the more reliable metric to use. |
    | <condition1>_<condition2>_change | The differential binding score for the TF between the two conditions. Negative values imply more bound in condition2. |
    | <condition1>_<condition2>_pvalue | The p-value of the statistical test against a background model. This can be very small due to the large number of transcription factor binding sites found, so it should always be considered in combination with the <condition1>_<condition2>_change column. |
    | <condition1>_<condition2>_highlighted* | A True/False value indicating whether the TF was highlighted in the respective BINDetect volcano plot. The top TFs are highlighted following the criteria: -log10(p-value) above the 95% quantile and/or differential binding scores smaller/larger than the 5% and 95% quantiles (top 5% in each direction). |

* new in TOBIAS 0.14.0
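
As a sketch of how this summary table can be used downstream, the following Python snippet splits the highlighted TFs by the sign of the `_change` column. It assumes the tab-delimited `bindetect_results.txt` with a header row, that the highlighted column is written as the text `True`/`False`, and the condition names `Bcell` and `Tcell`. The TF names, values, and the helper `split_highlighted` are invented for illustration.

```python
import csv
import io

# Invented excerpt of <outdir>/bindetect_results.txt (only some columns shown)
results_txt = "\n".join([
    "output_prefix\tname\tBcell_Tcell_change\tBcell_Tcell_pvalue\tBcell_Tcell_highlighted",
    "TFA_M0001\tTFA\t0.42\t1e-80\tTrue",
    "TFB_M0002\tTFB\t-0.35\t1e-65\tTrue",
    "TFC_M0003\tTFC\t0.02\t0.3\tFalse",
])

def split_highlighted(rows, cond1, cond2):
    """Group highlighted TFs by the condition in which they appear more bound."""
    prefix = f"{cond1}_{cond2}"
    groups = {cond1: [], cond2: []}
    for row in rows:
        if row[f"{prefix}_highlighted"] != "True":
            continue
        change = float(row[f"{prefix}_change"])
        # Negative change implies more bound in the second condition
        groups[cond2 if change < 0 else cond1].append(row["name"])
    return groups

# For a real run, replace io.StringIO(...) with open(path, newline="")
rows = list(csv.DictReader(io.StringIO(results_txt), delimiter="\t"))
groups = split_highlighted(rows, "Bcell", "Tcell")
print(groups)
```

Grouping on the `_change` column rather than on the `<condition>_bound` counts follows the note above that the change score is the more reliable of the two.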

  • <outdir>/bindetect_figures.pdf
    A multi-page PDF containing an overview of score-distributions for each condition as well as bindetect volcano-plots for each condition-comparison.

  • <outdir>/TF_distance_matrix.txt
    Distance matrix used to cluster the transcription factors in the bindetect_figures-dendrograms. This is based on the overlap of individual transcription factor binding sites.

  • <outdir>/bindetect_*.html (new in 0.11.0)
    Interactive differential plots (same data as shown in the bindetect_figures.pdf plots) with motif logos included on hover.