Revised README
CJ101192 committed Feb 3, 2018
1 parent 3ccb241 commit ba24e5c
Showing 1 changed file with 10 additions and 6 deletions.
16 changes: 10 additions & 6 deletions README.md
@@ -1,14 +1,12 @@
MashMap
========================================================================

MashMap is fast, approximate software for mapping genome assemblies or long reads (PacBio/ONT) to reference genome(s). It maps a query sequence against a reference region if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a *k*-mer based [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) using a combination of [Winnowing](http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/p76-schleimer.pdf) and [MinHash](https://en.wikipedia.org/wiki/MinHash). This is then converted to an estimate of sequence identity using the [Mash](http://mash.readthedocs.org) distance. An appropriate *k*-mer sampling rate is determined automatically from the given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
MashMap implements a fast, approximate algorithm for computing local alignment boundaries between long DNA sequences. It is useful for mapping genome assemblies or long reads (PacBio/ONT) to reference genome(s). Given a minimum alignment length and an identity threshold for the desired local alignments, MashMap computes alignment boundaries and identity estimates using *k*-mers. It does not compute the alignments explicitly, but rather estimates a *k*-mer based [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) using a combination of [Minimizers](http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/p76-schleimer.pdf) and [MinHash](https://en.wikipedia.org/wiki/MinHash). This is then converted to an estimate of sequence identity using the [Mash](http://mash.readthedocs.org) distance. An appropriate *k*-mer sampling rate is determined automatically from the given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
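
The conversion from Jaccard similarity to an identity estimate can be illustrated with a minimal Python sketch. It assumes the standard Mash distance formula, D = -(1/k) ln(2j / (1 + j)), with identity taken as 1 - D. For simplicity it computes the exact Jaccard similarity over full *k*-mer sets, whereas MashMap estimates it from sampled minimizers with MinHash. The function names are hypothetical and are not part of MashMap.

```python
import math
from typing import Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq (no sampling; illustration only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_identity(j: float, k: int) -> float:
    """Convert a Jaccard similarity j into an identity estimate.

    Uses the Mash distance D = -(1/k) * ln(2j / (1 + j)) and returns 1 - D,
    clamped at 0. A Jaccard of 0 maps to identity 0.
    """
    if j <= 0.0:
        return 0.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - d)


if __name__ == "__main__":
    k = 16  # assumed k-mer size for this example
    query = "ACGTTGCAAGCTTAGGCTAACGTTAGCATCGGATCGATTACG"
    reference = query[:20] + "T" + query[21:]  # one substitution
    j = jaccard(kmer_set(query, k), kmer_set(reference, k))
    print(f"Jaccard = {j:.3f}, estimated identity = {mash_identity(j, k):.3f}")
```

Because a single substitution changes up to *k* overlapping *k*-mers, the Jaccard similarity drops much faster than the identity, which is why the logarithmic correction is needed.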

Unlike traditional mappers, MashMap does not compute exact sequence alignments. Instead, it approximates mapping positions and identities using only *k*-mers. As a result, MashMap is both extremely fast and memory efficient, enabling rapid query mapping to large reference databases like NCBI RefSeq. We describe the full algorithm associated with MashMap (v1.0), and report on the speed, scalability, and accuracy of the software, here: ["A fast approximate algorithm for mapping long reads to large reference databases"](https://doi.org/10.1007/978-3-319-56970-3_5).

We have extended this software to approximate local alignments as well. This is useful for fast mapping of genomes, assembly contigs, or reads to reference genomes. Based on the specified minimum length requirement for local alignments, it segments the query sequence so that the requested local alignment boundaries are reported with high probability. It can now map a human genome assembly to the human reference genome in about one minute of total execution time and < 4 GB of memory using just 8 CPU threads, achieving more than an order of magnitude improvement in both runtime and memory over alternative methods. In the future, we plan to add optional alignment support to generate base-to-base alignments.
As an example, MashMap can map a human genome assembly to the human reference genome in about one minute of total execution time and < 4 GB of memory using just 8 CPU threads, achieving more than an order of magnitude improvement in both runtime and memory over alternative methods. We describe the algorithms associated with MashMap, and report on the speed, scalability, and accuracy of the software, in the publications listed [below](#publications). Unlike traditional mappers, MashMap does not compute exact sequence alignments. In the future, we plan to add optional alignment support to generate base-to-base alignments.

## Installation
Follow [`INSTALL.txt`](INSTALL.txt) to compile and install MashMap. For convenience, we also provide dependency-free Linux and OSX binaries with each release.
Follow [`INSTALL.txt`](INSTALL.txt) to compile and install MashMap. For convenience, we also provide dependency-free Linux and OSX binaries with the [latest release](https://github.com/marbl/MashMap/releases).

## Usage

@@ -34,7 +32,7 @@ For most of the use cases, default values should be appropriate. However, differ

* Minimum segment length (-s, --segLength) : Default is 5,000 bp. Sequences shorter than this length are ignored. MashMap guarantees, with high probability, that local alignments at least twice this length are reported.

* Filtering options (-f, --filter_mode) : Similar to [delta-filter](http://mummer.sourceforge.net/manual/#filter) in nucmer, different filtering options are provided that are suitable for long read or assembly mapping. `-f map` is suitable for reporting the best mappings for long reads, whereas `-f one-to-one` is suitable for reporting orthologous mappings among all computed assembly to genome mappings.
* Filtering options (-f, --filter_mode) : MashMap implements a [plane-sweep](https://en.wikipedia.org/wiki/Sweep_line_algorithm) based algorithm to filter alignments. Similar to [delta-filter](http://mummer.sourceforge.net/manual/#filter) in nucmer, different filtering options are provided that are suitable for long read or assembly mapping. Option `-f map` is suitable for reporting the best mappings for long reads, whereas `-f one-to-one` is suitable for reporting orthologous mappings among all computed assembly-to-genome mappings.
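
The relationship between the segment length and the "twice this length" guarantee can be sketched in Python. This is a simplified illustration, assuming the query is cut into non-overlapping, fixed-length segments starting at position 0; the actual segmentation in MashMap may differ in detail. Under that assumption, any alignment spanning at least twice the segment length must fully contain at least one segment, so mapping that segment alone is enough to locate the alignment. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in query coordinates


def split_into_segments(query_len: int, seg_len: int) -> List[Interval]:
    """Cut a query into non-overlapping, full-length segments.

    A trailing piece shorter than seg_len is dropped in this sketch.
    """
    return [(start, start + seg_len)
            for start in range(0, query_len - seg_len + 1, seg_len)]


def contained_segments(aln_start: int, aln_end: int,
                       segments: List[Interval]) -> List[Interval]:
    """Return the segments lying entirely inside the alignment [aln_start, aln_end)."""
    return [(s, e) for s, e in segments if aln_start <= s and e <= aln_end]


if __name__ == "__main__":
    seg_len = 5000  # default value of -s / --segLength
    segments = split_into_segments(23000, seg_len)

    # An alignment of length 2 * seg_len always covers a whole segment ...
    print(contained_segments(3200, 3200 + 2 * seg_len, segments))
    # ... whereas a slightly shorter one can fall between segment boundaries.
    print(contained_segments(1, 2 * seg_len - 1, segments))
```

This also suggests why lowering the segment length makes shorter alignments detectable, at the cost of more segments to map.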

## Visualize

@@ -47,3 +45,9 @@ We provide a perl [script](scripts) for generating dot-plots to visualize mappin
## Release
Use the [latest release](https://github.com/marbl/MashMap/releases) for a stable version.
## <a name="publications"></a>Publications
- **Chirag Jain, Sergey Koren, Alexander Dilthey, Adam M. Phillippy, and Srinivas Aluru**. "A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps." *bioRxiv*, 2018.
- **Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy**. "A fast approximate algorithm for mapping long reads to large reference databases." In *International Conference on Research in Computational Molecular Biology*, Springer, Cham, 2017.
