sfFinder

sub-family Finder: finds protein sequences of the same family in the UniRef database and generates HMM profiles and FASTA sequences from the identified subfamilies. These can be used to prospect new proteins of that family in metagenomic datasets, either with the HMM profiles through the SAM package or with the FASTA sequences through similarity searches with the DIAMOND aligner.

Dependencies

Installation

  • Unpack flowerpower.tar.gz:
tar -zxvf flowerpower.tar.gz
  • Copy the update patch (flowerpower.patch) to the flowerpower directory and apply the patch:
patch -s -p0 < flowerpower.patch
  • The
  • Make the Perl (*.pl) and Python (*.py) scripts in the flowerpower directory executable:
chmod a+x *.pl *.py
  • Verify that all the required executables are on the $PATH environment variable;
  • Create a symlink for BPG_utilities/bpg and BPG_utilities/pfacts003 in flowerpower directory (must be in the $PATH);
  • Create a symlink for BPG_utilities/make_nr_at_100_with_dict.py in flowerpower directory (must be in the $PATH);
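
The $PATH check in the list above can be done by hand, or with a small helper like the sketch below. The function name `check_tools` is hypothetical and not part of sfFinder; it only reports which of the names you pass to it can be found on $PATH.

```shell
#!/bin/sh
# check_tools NAME...
# Hypothetical helper (not shipped with sfFinder): reports which of the
# given executables can be found on $PATH.
# Returns 0 when all are found, 1 when at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools diamond hmmscore esearch efetch` checks the tools named in this README; the full dependency list may be longer, so treat that list as an assumption.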

Usage

sfFinder.sh

The inputs are the protein family name, the formatted FASTA file with the protein sequence(s), and the output directory:

sfFinder.sh FamilyName /path_to/FamilyProteins.fa /path_to/output/FamilyName

If you don't have a fasta file, but only the accession numbers, you can use esearch from NCBI e-utilities:

For example:

rm -f ./ALKB.fa
esearch -db protein -query BAB33284.1 | efetch -db protein -format fasta >> ./ALKB.fa
esearch -db protein -query ABM79805.1 | efetch -db protein -format fasta >> ./ALKB.fa
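
For more than a couple of accessions, the same two commands can be wrapped in a loop. The sketch below is one possible way to do it; the function name `fetch_family` is hypothetical, and it assumes `esearch` and `efetch` from NCBI EDirect are on $PATH.

```shell
#!/bin/sh
# fetch_family OUT_FASTA ACCESSION...
# Hypothetical helper: starts from an empty file, then appends one FASTA
# record per accession using the same esearch | efetch call shown above.
fetch_family() {
    out=$1
    shift
    : > "$out"    # truncate first so reruns do not duplicate records
    for acc in "$@"; do
        # stdin is redirected so esearch never tries to read piped input
        esearch -db protein -query "$acc" </dev/null \
            | efetch -db protein -format fasta >> "$out"
    done
}
```

It could be called as `fetch_family ./ALKB.fa BAB33284.1 ABM79805.1`.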

... or use the script "scripts/getProteinByID.pl" for this kind of entry:

ALKB_gi|89889739|ref|ZP_01201250.1| alkane-1-monooxygenase [Flavobacteria bacterium BBFL7]
ALKB_gi|13358852|dbj|BAB33284.1| alkane hydroxylase A [Acinetobacter sp. M-1]
ALKB_gi|13358856|dbj|BAB33287.1| alkane hydroxylase B [Acinetobacter sp. M-1]
ALKB_gi|123967456|gb|ABM79805.1| oxygenase component of xylene monooxygenase [Sphingobium yanoikuyae]
ALKB_gi|37360912|dbj|BAC98365.1| alkane hydroxylase [Alcanivorax borkumensis]

./scripts/getProteinByID.pl -i /path_to/familyentries.txt /path_to/outdir
ls /path_to/outdir/ALKB.fa
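
To make the entry format concrete, here is a minimal sketch that pulls the family name and the accession out of each entry line. It is only an illustration of the layout shown above (family prefix, then pipe-separated fields with the accession in the fourth field); it is not necessarily how getProteinByID.pl parses its input, and the function name `parse_entries` is hypothetical.

```shell
#!/bin/sh
# parse_entries: reads entry lines on stdin and prints "family<TAB>accession".
# Assumes the layout FAMILY_gi|GI|DB|ACCESSION| description, as in the
# example entries; lines with fewer than four fields are skipped.
parse_entries() {
    awk -F'|' 'NF >= 4 {
        family = $1
        sub(/_[^_]*$/, "", family)   # drop the trailing "_gi" tag
        print family "\t" $4
    }'
}
```

Running `parse_entries < /path_to/familyentries.txt` would list one family/accession pair per entry.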

mkFProfiles.sh

For multiple family entries, you can instead use the following pipeline, which automatically runs getProteinByID.pl and then generates the profiles using sfFinder.sh:

mkFProfiles.sh /path_to/familyentries.txt /path_to/output_profile_database

sfMapper.sh

This tool assigns protein family and subfamily information to a dataset of unknown proteins. You can run a protein alignment search with "diamond":

sfMapper.sh diamond /path_to/unknown_proteins.fa /path_to/output_profile_database /path_to/output_mapping

... or Hidden Markov Model (HMM) profiles search using SAM "hmmscore":

sfMapper.sh hmmscore /path_to/unknown_proteins.fa /path_to/output_profile_database /path_to/output_mapping

sfCombine.sh

This tool combines the mapping results obtained with the diamond and hmmscore modes.

sfCombine.sh /path_to/unknown_proteins.fa /path_to/output_profile_database /path_to/output_mapping /path_to/output_combination
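
The mapping and combination steps can be chained in a small wrapper. The sketch below assumes that both sfMapper.sh modes write into the same mapping directory, which the single mapping argument of sfCombine.sh suggests but this README does not state explicitly. The names `run_cmd`, `map_and_combine` and the `DRY_RUN` variable are hypothetical.

```shell
#!/bin/sh
# run_cmd: prints the command when DRY_RUN=1, otherwise executes it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# map_and_combine PROTEINS PROFILE_DB MAPPING_DIR COMBINED_DIR
# Runs sfMapper.sh in both modes, then sfCombine.sh on the results.
# Assumption: both modes share MAPPING_DIR as their output directory.
map_and_combine() {
    proteins=$1
    profiles=$2
    mapping=$3
    combined=$4
    for mode in diamond hmmscore; do
        run_cmd sfMapper.sh "$mode" "$proteins" "$profiles" "$mapping" || return 1
    done
    run_cmd sfCombine.sh "$proteins" "$profiles" "$mapping" "$combined"
}
```

Setting `DRY_RUN=1` before calling `map_and_combine` prints the three commands without running them, which is a cheap way to check the paths first.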

Example

The sfFinder "bin/" directory contains the script run_example.sh. Run it to check that the pipeline works, then compare the results with the provided output.

./run_example.sh

To make the comparison, first uncompress the two provided directories ("output/" and "mapping/") into another directory ("/tmp", for example):

tar -xzvf ../example/output.tar.gz -C /tmp/
tar -xzvf ../example/mapping.tar.gz -C /tmp/
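
Once the reference directories are unpacked, one way to compare them with your own results is a recursive diff. The helper name `same_tree` is hypothetical, and the location of your own results depends on how run_example.sh is configured, so the path in the usage line is a placeholder.

```shell
#!/bin/sh
# same_tree EXPECTED_DIR ACTUAL_DIR
# Hypothetical helper: lists files that differ or exist on one side only.
# Returns 0 when the two directory trees are identical.
same_tree() {
    diff -rq "$1" "$2"
}
```

For example, `same_tree /tmp/output /path_to/your/output`. A byte-for-byte comparison may report differences that do not matter (for instance if the UniRef release differs), so treat a mismatch as a prompt to inspect the files rather than as a failure.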