
Subcommand: phylogenetic kmeans

Lucas Czech edited this page Sep 26, 2022 · 11 revisions

Run Phylogenetic k-means clustering on a set of samples.

Usage: gappa analyze phylogenetic-kmeans [options]

Options

Input
--jplace-path Required. TEXT:PATH(existing)=[] ...
List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed.
Settings
--k Required. TEXT
Number of clusters to find. Can be a comma-separated list of multiple values or ranges for k, such as "1-5,8,10,12"
--write-overview-file FLAG
If provided, a table file is written that summarizes the average distance and variance of the clusters for each k. Useful for elbow plots.
--point-mass FLAG
Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
--ignore-multiplicities FLAG
Set the multiplicity of each pquery to 1.0. The multiplicity is the equivalent of abundances for placements, so this flag causes abundance information to be ignored.
--bins UINT=0
Bin the masses per branch in order to save time and memory, with only minor differences in the cluster assignments. Default is 0, that is, no binning. If set, we recommend using 50 bins or more.
Color
--color-list TEXT=BuPuBk
List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format #rrggbb using hex values, or by web color names.
--reverse-color-list FLAG
If set, the order of colors of the --color-list is reversed.
--log-scaling FLAG
If set, the sequential color list is logarithmically scaled instead of linearly.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT=pkmeans_
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Tree Output
--write-newick-tree FLAG
If set, the tree is written to a Newick file. This format cannot store color information.
--write-nexus-tree FLAG
If set, the tree is written to a Nexus file. This can for example be opened in FigTree.
--write-phyloxml-tree FLAG
If set, the tree is written to a PhyloXML file. This can for example be used in Archaeopteryx.
--write-svg-tree FLAG
If set, the tree is written to an SVG file. This gives a file for vector graphics editors.
Newick Tree Output
--newick-tree-branch-length-precision INT=6 Needs: --write-newick-tree
Number of digits to print for branch lengths in Newick format.
--newick-tree-quote-invalid-chars FLAG Needs: --write-newick-tree
If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and :;()[],{}) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
Svg Tree Output
--svg-tree-shape TEXT:{circular,rectangular}=circular Needs: --write-svg-tree
Shape of the tree.
--svg-tree-type TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree
Type of the tree, either using branch lengths (phylogram), or not (cladogram).
--svg-tree-stroke-width FLOAT=5 Needs: --write-svg-tree
Svg stroke width for the branches of the tree.
--svg-tree-ladderize FLAG Needs: --write-svg-tree
If set, the tree is ladderized.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

The command runs a Phylogenetic k-means clustering on a set of jplace files (called samples). The aim is to group samples that are similar to each other with respect to the Phylogenetic KR distance. This is useful, for example, to find structure in a set of samples from different locations or points in time.

Details

Values for k

It is often not obvious what the "natural" number of clusters in a set of samples is. It therefore makes sense to try different values for k and explore how the clustering changes. Then, techniques like the Elbow Method can be used to estimate a reasonable number of clusters. See below for more on that.

For this purpose, the option --k accepts multiple values, separated by commas, as well as ranges of numbers, specified via a dash. This is similar to how specific pages can be selected in common software before printing.

Example: --k 1-6,10,15
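
To illustrate which values such a specification covers, here is a minimal Python sketch that expands a list like the one above. The function name parse_k_values is hypothetical and not part of gappa; sorting and de-duplicating the values is an assumption made for this sketch, and gappa's own parsing may differ in detail.

```python
def parse_k_values(spec: str) -> list[int]:
    """Expand a spec such as "1-6,10,15" into a sorted list of unique ints.

    Hypothetical helper for illustration only; not gappa's parser.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as "1-6" includes both end points.
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"invalid range: {part!r}")
            values.update(range(lo, hi + 1))
        else:
            values.add(int(part))
    return sorted(values)


print(parse_k_values("1-6,10,15"))  # [1, 2, 3, 4, 5, 6, 10, 15]
```

With this input, the clustering would be run eight times, once for each k in the expanded list.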

Output Format

For each specified k, the result of the clustering is written to an assignment table, which lists for each sample the cluster number it was grouped into, as well as the distance (Phylogenetic KR distance) from the sample to the centroid of the cluster. The cluster numbers are zero-based, and thus span the range [0, k-1].

Centroid Trees

If, in addition, an output tree format is specified (via one of the --write-...-tree options), the centroids of each cluster are visualized as mass trees. That is, the average mass distribution of all samples that were assigned to a cluster is calculated and visualized on the tree. This is useful to explore what each cluster represents, that is, how the samples were clustered.
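
The averaging step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not gappa's implementation: each sample is reduced to a dictionary mapping a branch index to its total (normalized) mass, whereas real placement masses also have positions along the branches. The names centroid_masses, samples, and assignments are hypothetical.

```python
from collections import defaultdict


def centroid_masses(samples, assignments, k):
    """Average the per-branch masses of all samples assigned to each cluster.

    samples:     dict sample name -> dict branch index -> mass (assumed layout)
    assignments: dict sample name -> zero-based cluster number
    Returns a list of k dicts (branch index -> average mass).
    """
    sums = [defaultdict(float) for _ in range(k)]
    counts = [0] * k
    for name, masses in samples.items():
        cluster = assignments[name]
        counts[cluster] += 1
        for branch, mass in masses.items():
            sums[cluster][branch] += mass
    # Divide by the number of samples in the cluster; empty clusters stay empty.
    return [
        {b: m / counts[c] for b, m in sums[c].items()} if counts[c] else {}
        for c in range(k)
    ]


samples = {
    "a": {0: 1.0},
    "b": {0: 0.5, 1: 0.5},
    "c": {2: 1.0},
}
assignments = {"a": 0, "b": 0, "c": 1}
centroids = centroid_masses(samples, assignments, k=2)
```

In this toy input, cluster 0 averages samples a and b, so its centroid carries mass on branches 0 and 1, while cluster 1 consists of sample c alone.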

Multiple k and Overview File

If multiple values for k are specified (see above), the option --write-overview-file can be used to write an overview table that lists for each value of k the average distance and variance from each sample to its assigned cluster centroid. This table can directly be visualized to create elbow plots, as used in the Elbow Method.
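
As a rough sketch of how an elbow could be picked from such values, the following Python function uses one common heuristic: it selects the k whose point lies farthest from the straight line joining the first and last points of the curve. This is an assumption-laden illustration, not something gappa does itself. The column layout of the overview file is not described here, so the function takes plain lists; find_elbow and the example numbers are made up for demonstration.

```python
def find_elbow(ks, avg_distances):
    """Return the k farthest from the line through the first and last points.

    ks:            increasing list of cluster counts
    avg_distances: average sample-to-centroid distance for each k
    Heuristic only; the result depends on the relative scaling of both axes.
    """
    if len(ks) != len(avg_distances) or len(ks) < 3:
        raise ValueError("need at least three (k, distance) pairs")
    x0, y0 = ks[0], avg_distances[0]
    dx, dy = ks[-1] - x0, avg_distances[-1] - y0
    best_k, best_dist = ks[0], -1.0
    for x, y in zip(ks, avg_distances):
        # Perpendicular distance to the line, up to a constant factor.
        dist = abs(dy * (x - x0) - dx * (y - y0))
        if dist > best_dist:
            best_k, best_dist = x, dist
    return best_k


# Made-up values for illustration, not real gappa output.
ks = [1, 2, 3, 4, 5, 6]
avg_distances = [10.0, 6.0, 3.0, 2.5, 2.2, 2.0]
elbow = find_elbow(ks, avg_distances)
```

Such a heuristic only suggests a starting point. Inspecting the plot and the centroid trees for neighboring values of k remains advisable.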

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050
