small RNAseq output format definition #10

Open
lpantano opened this issue May 3, 2017 · 13 comments

Comments

@lpantano
Contributor

lpantano commented May 3, 2017

Hi all again,

cc: @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

After taking some time to think, I realized we could approach the naming problem from a slightly different angle. I will open a separate issue (tomorrow) for miRNA+isomiR naming, since we all agree the two are closely related, and we can discuss it there. Meanwhile, I think we can give some thought to:

Do we need a standard output format for small RNAseq data? If yes, which one?

Advantages:

  • Results become directly comparable across labs, pipelines, etc. This will promote sharing.
  • An easy way to run benchmarking analyses to determine the best pipeline or find errors
  • In the case of a centralized DB, like miRGeneDB, it is pretty easy to incorporate new information after QC analysis
  • Promote tool development for specific analyses, such as visualization, de novo discovery, QC reports, meta-analysis, as in other fields (RNAseq, variant calling...)
  • Promote combining tools that focus on different parts of the analysis, e.g. analyzing with miRanalyzer and then moving to R for further analysis.

All of this holds only if we define a minimum set of mandatory information that the file format must contain.

Information that should be included:

  • Commands used for the analysis (like in BAM/VCF)
  • Position on the genome/database where the small RNA is
  • The exact small RNA sequence can be recovered
  • QC of the alignment/reads
  • Type of small RNA: miRNA, tRNA, etc
  • Name of the small RNA, miRNA name, or isomiR, or similar. Accept multiple names.
  • Abundance, raw, normalized ...
  • Which database was used for the detection
  • Whether it is predicted or known
  • Multiple samples in the same file showing information that is specific for each sample if needed
  • Secondary structure of precursor
  • It should have an associated API/toolkit to promote usage
  • PASS/FAIL filter so that all data can be kept in case of re-analysis (like BAM keeps unmapped reads or VCF keeps false variants)

(Please add other information the format should have if you agree on this)

I will develop this further in a comment on this issue. My first idea was an adapted version of the VCF format. VCF files focus on variation and can be adapted to small RNA variation using the existing fields. Besides, the INFO column allows us to create any specific field we need, and many tools already support this file format, which should encourage people to use it.

Please add any of your ideas, and remember you can use the voting system on the issue page to show support, or just reply. Feel free to edit this issue to add more information.

**Deadline: May 21**

@lpantano
Contributor Author

lpantano commented May 9, 2017

I will give it a first shot. This may be related to the other issue about isomiR naming, but I realized it is really about a format definition for small RNA data. I will focus on miRNAs, to start with something I can go into deeply, but the same idea can be applied to tRNA fragments.

I propose the VCF format as the output of miRNA/isomiR annotation for small RNAseq data:

  • More information about VCF format can be read here: https://samtools.github.io/hts-specs/VCFv4.2.pdf
  • Column 1: CHROM: it can be at the genomic or the precursor level. The file should be able to carry information on both in other columns, so coordinates can be converted
  • Column 2: POS: 5' position of the small RNA/isomiR/miRNA
  • Column 3: ID: miRNA or isomiR name (if a convention is found)
  • Column 4: REF: nothing, or maybe the canonical seed; other ideas?
  • Column 5: ALT: CIGAR-like annotation of changes with respect to the miRNA, e.g. 3A11eUU, meaning a change at position 4 and UU as a nucleotide extension.
      • If a canonical sequence is defined: show trimming events
      • If no canonical sequence is defined: show only nucleotide mutations, nucleotide additions at the 3' end, indels, and size.
  • Column 6: QUAL: some quality value showing how good the detection is
  • Column 7: FILTER: PASS or REJECT, as used in variant calling, based on information contained in the VCF file such as number of reads, number of samples where it was detected, etc.
  • Column 8: INFO: here we can add all the information that is the same for all samples, such as the name in miRBase, mirGeneDB, or any other database, the algorithm used for scoring, the annotated seed region, the miRNA family, etc.
  • Column 9: SAMPLE: one column per sample, which can hold multiple pieces of information, such as the number of reads supporting the sequence, how it compares to the canonical sequence, the canonical seed sequence, or anything else that changes among samples.

I won't go into all the possibilities for columns 7 and 9. If people like the idea, we can define the mandatory fields that should be there, which would help quality. Tools can also be developed to map genomic to miRNA coordinates or the other way around, and between different databases.

The idea is that all the information is always there, and different methods can be used to PASS or REJECT sequences as possible true isomiRs, or as relevant, or to filter or manipulate them in any way you can think of.
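To make the CIGAR-like ALT idea concrete, here is a minimal Python sketch that decodes a string such as 3A11eUU. The grammar is an assumption inferred from that single example and is not a defined standard: a number is a run of nucleotides matching the miRNA, a single uppercase letter is a substitution at the next position, and `e` followed by letters is a 3' extension. The function name `parse_isomir_alt` and the returned keys are hypothetical.

```python
import re

# Assumed grammar (inferred from the single example "3A11eUU"):
#   <number>      run of nucleotides identical to the miRNA
#   <letter>      substituted nucleotide at the next position
#   e<letters>    nucleotides added as a 3' extension
TOKEN = re.compile(r"(\d+)|e([ACGTU]+)|([ACGTU])")
VALID = re.compile(r"(?:\d+|e[ACGTU]+|[ACGTU])*")


def parse_isomir_alt(alt):
    """Decode a CIGAR-like isomiR annotation into its components."""
    if not VALID.fullmatch(alt):
        raise ValueError(f"unsupported annotation: {alt!r}")
    substitutions = []
    extension = ""
    position = 0  # 1-based position on the miRNA, counted from the 5' end
    for run, ext, sub in TOKEN.findall(alt):
        if run:
            position += int(run)
        elif sub:
            position += 1
            substitutions.append((position, sub))
        else:
            extension = ext
    return {
        "substitutions": substitutions,
        "extension_3p": extension,
        "reference_length": position,
    }


print(parse_isomir_alt("3A11eUU"))
```

For 3A11eUU this reports a substitution to A at position 4 and a UU extension, which matches the reading given in the ALT description above. Trimming events and indels would need extra symbols that are not defined yet.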

Tools already ready to manipulate VCF:

  • pyvcf
  • vcftools
  • vawk: an awk-like command that understands the VCF format
  • samtools
  • bedtools
    ...

Please comment on the idea, suggest improvements or alternatives, or just vote :)

@gurgese
Collaborator

gurgese commented May 23, 2017

Hi @lpantano @BastianFromm and all the others,

In my opinion, a tabular format is a good choice for reporting miRNA/isomiR annotation, and a customized VCF format can be a viable solution for this.

To this end, I can share with you my personal experience with this kind of problem and what I developed as a solution.

During my PhD I developed a tool called isomiR-SEA, which maps the input tags to the miRNAs reported in miRBase.
This is done with a custom alignment algorithm that classifies the tags as exact miRNAs or isoforms at run time, during the alignment procedure.
To make a long story short, at the end of the analysis isomiR-SEA provides 2 main output files:

  1. <out_result_mature_21_tag_unique.txt> file stores all tags mapped on a single miRNA sequence.
  2. <out_result_mature_21_tag_ambigue_selected.txt> collects multimapped tags selected through the analysis of the mapping score.

Both are structured as tables that store all the information needed for post-analysis filtering and classification.

Below I list the meaning of the main columns:
• tag_sequence - The original tag sequence;
• tag_quality - Average tag quality computed for each nucleotide of the tag sequence;
• #count_tags - Number of identical reads supporting the tag sequence;
• mirna_name - Information about the miRNA associated with the tag (data from the miRNA database);
• mirna_seq - The miRNA sequence;
• begin_ungapped_mirna - The miRNA nt position where the miRNA-tag alignment begins;
• begin_ungapped_tag - The tag nt position where the miRNA-tag alignment begins;
• align_score - The alignment score computed with isomiR-SEA;
• mir_tag_size_diff - Difference in size between the miRNA sequence and the tag sequence;
• All the other fields are flags (saved during the alignment phase) that provide information about the isoforms identified and the interaction sites conserved. The flags are: mirna_exact, iso_5p, iso_snp, iso_multi_snp, iso_3p, offset_site, suppl_compens_site, central_site

The file collecting the tags selected from the multi-mapped ones has three more fields:
a) mir2-mir1_align_score - The score difference between the miR-tag alignment with the second-highest score and the best-scoring miR-tag alignment;
b) num_mir_family - The number of different miRNA families to which the tag has been mapped;
c) num_of_mapped_miRNAs - The number of miRNAs attributed to the same tag.

Using this information it is possible to discriminate between 17 isomiR types ( http://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12859-016-0958-0/MediaObjects/12859_2016_958_Fig2_HTML.gif ).
Moreover, it is possible to check whether the aligned tags conserve the miRNA-mRNA interaction sites considered important (seed, offset, supplementary and central).

Here you can find more details about the isomiR classification and some figures that support what I wrote in this post -> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0958-0
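As a rough illustration of how a tabular report like this can be consumed downstream, the Python sketch below sums the supporting reads per miRNA. It assumes the file is tab-delimited with a header row that uses the column names listed above (`#count_tags`, `mirna_name`); the actual isomiR-SEA layout may differ, and the rows are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; a real report has many more columns.
SAMPLE = (
    "tag_sequence\t#count_tags\tmirna_name\n"
    "UGAGGUAGUAGGUUGUAUAGUU\t120\thsa-let-7a-5p\n"
    "UGAGGUAGUAGGUUGUAUAGU\t30\thsa-let-7a-5p\n"
    "UAGCUUAUCAGACUGAUGUUGA\t75\thsa-miR-21-5p\n"
)


def counts_per_mirna(handle):
    """Sum the number of reads supporting each miRNA across all its tags."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["mirna_name"]] += int(row["#count_tags"])
    return dict(totals)


print(counts_per_mirna(io.StringIO(SAMPLE)))
```

The same loop could filter on the isoform flags (iso_5p, iso_3p, ...) once their encoding in the file is known.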

@BastianFromm
Collaborator

BastianFromm commented May 23, 2017 via email

@ThomasDesvignes
Member

Hi Gianvito,
isomiR-SEA looks like a great tool! And your classification of isomiR variations is a very interesting way to sort them out. I'll have to try it!
Concerning the output format, I'm definitely not an expert and can't really provide insights on which format is best, but people I work with (@peterbatzel and @jasonsydes) suggested to me that "a GFF3 format would likely be a better fit than VCF format. GFF3 apparently allows any type of custom 'tag' to be added, making it ideal for obscure or very specific problems (maybe including counts, normalized counts, tissues, putative secondary structure when relevant, etc). VCF is somewhat similar to that, allowing custom entries in the INFO and SAMPLE fields. One major drawback of VCF is that it is specifically designed for single genomic positions (one position per line). On the other hand, GFF3 entries (lines) specify start and stop position, allowing any length feature to be described. GFF3 also already has built into it the notion of parenthood, thinking of annotating primary, precursor, and isomiRs all in the same file with relationship explicitly pronounced. In addition Lorena's suggestion of a CIGAR like tag for isomiRs would still work in GFF3 format and we believe is a great idea."
That's all I can provide on this end... I'll direct the discussion to my colleagues if they want to chime in.

@keilbeck

We maintain the GFF3 spec through the Sequence Ontology and would be happy to work with this group to accommodate your needs.

@lpantano
Contributor Author

Hi,

I see the point of GFF3. I had thought of GFF3 as a format for more curated data, but I am up for trying it, for sure.

I know the main problem with VCF is that it is designed for SNPs, but it is actually used even for translocations that span different chromosomes. You just adapt the position to mean something else, not just a single position.

Has anybody used GFF3 to add per-sample information? I think it would be useful to have that, like VCF has the SAMPLE columns, where you can put count data or anything else that relates each sample to that feature.

Or can we add one column at the end of the GFF for each sample, and call the format something different if this is incompatible with GFF?

Thanks for all the feedback.

@keilbeck

Column 9 of GFF3 is where you can add your own annotation. I don't think you would need to add an extra column. Parsers generally expect 9 columns, and there is some flexibility in the last one. If this is spec'd out well, we can add it to the documentation.
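A minimal Python sketch of what per-sample counts inside column 9 could look like, keeping the file at 9 columns. Everything specific here is an assumption for illustration: the lowercase `expression` tag, the `# samples:` comment line that fixes the sample order, the `example_tool` source and the coordinates are all hypothetical. `ID` and `Parent` are standard GFF3 attributes, and `pre_miRNA` / `miRNA` are Sequence Ontology terms.

```python
def gff3_feature(seqid, source, ftype, start, end, strand, feature_id,
                 parent=None, counts=None):
    """Build one 9-column GFF3 line; per-sample counts go into column 9."""
    attrs = [f"ID={feature_id}"]
    if parent is not None:
        attrs.append(f"Parent={parent}")
    if counts is not None:
        # Hypothetical custom tag: one comma-separated value per sample,
        # in the order given by the "# samples:" comment line.
        attrs.append("expression=" + ",".join(str(c) for c in counts))
    columns = [seqid, source, ftype, str(start), str(end),
               ".", strand, ".", ";".join(attrs)]
    return "\t".join(columns)


samples = ["sample1", "sample2", "sample3"]
lines = [
    "##gff-version 3",
    "# samples: " + ",".join(samples),
    gff3_feature("chr9", "example_tool", "pre_miRNA", 1000, 1079, "+", "pre-1"),
    gff3_feature("chr9", "example_tool", "miRNA", 1005, 1026, "+", "mir-1",
                 parent="pre-1", counts=[10, 0, 25]),
]
print("\n".join(lines))
```

The `Parent` attribute carries the precursor/mature relationship, and an isomiR line could point to its reference miRNA in the same way.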

@BastianFromm
Collaborator

BastianFromm commented May 29, 2017 via email

@lpantano
Contributor Author

Thanks for all the ideas.

I am happy to try the GFF3 format. I will create a new document inside this repo that describes each column in more detail, so we can reach a consensus. I think this is a very good example of a community effort that will lead to a usable concept in the future.

I'll start with a draft, so you can continue with the ideas posted here! I'll ping you when it's done tomorrow.

Thanks!

@haebhardt
Collaborator

The new format looks good. I noticed two things. First, column 9 is getting really full now. Is there room to add columns and split the information in column 9 across them, which would be easier to parse?

Second, isomiRs are not really represented so far in
https://github.com/miRTop/incubator/blob/master/format/definition.md
There are many kinds of isomiRs, e.g. additional nucleotides (not encoded in the genome), truncations of the original miR sequence, and nucleotide editing. Is there room to add a column for that?

@keilbeck

GFF3 is a 9 column format.
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Column 9: "attributes"
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. Attribute values do not need to be and should not be quoted. The quotes should be included as part of the value by parsers and not stripped.
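As a small illustration of the tag=value rules quoted above, here is a Python sketch that writes and reads a column 9 string. It escapes the characters named in the quote (`,`, `=`, `;`, tab) plus `%` itself, which is an addition made here so that the encoding stays reversible. The function names are hypothetical, and multi-value attributes (unescaped commas separating values) are not handled.

```python
from urllib.parse import unquote

# Characters escaped when writing: those named in the quoted spec text,
# plus "%" so that decoding is unambiguous (assumption of this sketch).
ESCAPES = {"%": "%25", ";": "%3B", "=": "%3D", ",": "%2C", "\t": "%09"}


def escape(text):
    """Percent-encode reserved characters; spaces are left untouched."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def format_attributes(attrs):
    """Serialize a dict into a GFF3 column 9 string."""
    return ";".join(f"{escape(k)}={escape(str(v))}" for k, v in attrs.items())


def parse_attributes(column9):
    """Parse a column 9 string back into a dict of decoded values."""
    attrs = {}
    for pair in column9.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[unquote(tag)] = unquote(value)
    return attrs


attrs = {"ID": "iso-1", "Parent": "hsa-let-7a-1", "Note": "a=b;c with space"}
text = format_attributes(attrs)
print(text)
print(parse_attributes(text) == attrs)
```

Because everything lives in one escaped key/value string, new isomiR attributes can be added without changing the number of columns.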

@lpantano
Contributor Author

lpantano commented Jun 23, 2017 via email

@haebhardt
Collaborator

The 'type' column distinguishing isomiR / ref_miR sounds reasonable.

I am happy to contribute my input for isomiR attributes.
