Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multicolor visualizations of large dot plot matrices #14

Merged
merged 27 commits into from
Oct 11, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
60f9f4b
ENH: support multicolor viz in spy()
fedarko Sep 15, 2023
be05502
ENH/BUG/DOC: spy verbose; use nbcmap bg colr; tut
fedarko Sep 15, 2023
3d5f1e1
DOC: gray->green for RC cells in Gamer Mode
fedarko Sep 15, 2023
c87ef7c
PERF/MNT: be consistent with deleting SAs in make
fedarko Sep 16, 2023
1e49769
DOC: additional tutorial note abt kmer size
fedarko Oct 5, 2023
7822621
DOC: describe "exactness" of matrix in README
fedarko Oct 10, 2023
93ed4fd
DOC: minor wording changes
fedarko Oct 10, 2023
8c9dcc8
DOC: note time taken for E. coli dotplot in README
fedarko Oct 10, 2023
cd7fb89
DOC: tidy up tutorial re nbspy; viz perf in README
fedarko Oct 10, 2023
e6fa4bb
ENH: Add force_binary option to viz_spy
fedarko Oct 10, 2023
ab2d22f
DOC: fancy schmancy ntbk
fedarko Oct 10, 2023
c38fa74
DOC/ENH: viz docs; verbose for viz_imshow()
fedarko Oct 10, 2023
4c44c94
DOC: mention force_binary in tutorial
fedarko Oct 10, 2023
1e4a645
DOC: better wording abt force param
fedarko Oct 10, 2023
88c2122
ENH: add draw_order param to viz_spy
fedarko Oct 10, 2023
9eda99c
TST: viz_spy() with a bad draw_order
fedarko Oct 10, 2023
c03e1ad
TST: more viz_spy() testing
fedarko Oct 10, 2023
89291d9
TST/MNT: spin-off and test _get_yarr()
fedarko Oct 10, 2023
7a0c9b8
TST: viz_spy() with binary matrix
fedarko Oct 10, 2023
fa4257a
TST: more viz verbose tests
fedarko Oct 10, 2023
e73158b
TST: more viz testing, incl some pyplot stuff
fedarko Oct 10, 2023
9e3ce17
DOC: rerun tut
fedarko Oct 10, 2023
d072670
DOC/PERF: benchmarking documentation w/ binary&fb
fedarko Oct 11, 2023
f0c9044
PERF/DOC: small viz_spy perf notes
fedarko Oct 11, 2023
6bfc352
DOC/PERF: rerun benchmarking ntbk! :O
fedarko Oct 11, 2023
92ef115
DOC/PERF: more 100 Mbp notes
fedarko Oct 11, 2023
53d7c7e
DOC: update time taken for 100 Mbp thing in README
fedarko Oct 11, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 13 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,15 @@
wotplot is a small Python library for creating and visualizing
[dot plot matrices](https://en.wikipedia.org/wiki/Dot_plot_(bioinformatics)).

Notably, wotplot creates the _exact_ dot plot matrix, describing
(given some _k_ ≥ 1) every single
[_k_-mer](https://en.wikipedia.org/wiki/K-mer) match between two sequences.
Many tools for visualizing dot plots create only an approximation of this matrix
(containing only the "best" matches) in order to save time; wotplot uses a few
optimizations to make the creation and visualization of the
exact dot plot matrix feasible even for entire prokaryotic genomes. Having this
exact matrix can be useful for a variety of downstream analyses.

## Quick examples

### Small dataset
Expand Down Expand Up @@ -56,9 +65,11 @@ from matplotlib import pyplot
# (skipping the part where I loaded the genomes into memory as e1s and e2s...)

# Create the matrix (leaving binary=True by default)
# This takes about 3 minutes on a laptop with 8 GB of RAM
em = wotplot.DotPlotMatrix(e1s, e2s, 20, verbose=True)

# Visualize the matrix using matplotlib's spy() function
# This takes about 2 seconds on a laptop with 8 GB of RAM
fig, ax = pyplot.subplots()
wotplot.viz_spy(
em, markersize=0.01, title="Comparison of two $E. coli$ genomes ($k$ = 20)", ax=ax
Expand All @@ -70,7 +81,7 @@ fig.set_size_inches(8, 8)

![Output dotplot from the above example](https://github.com/fedarko/wotplot/raw/main/docs/img/ecoli_example_dotplot.png)

When visualizing a binary matrix (and/or using `viz_spy()`), the default
When visualizing a binary matrix, the default
colorscheme uses black cells (⬛) to indicate matches (forward,
reverse-complementary, or palindromic)
and white cells (⬜) to indicate no matches.
Expand Down Expand Up @@ -124,7 +135,7 @@ for some very informal benchmarking results performed on a laptop with ~8 GB of

Even on this system, the library can handle reasonably large sequences: in the biggest example,
the notebook demonstrates computing the dot plot of two random 100 Mbp sequences
(using _k_ = 20) in 54 minutes and 12.45 seconds.
(using _k_ = 20) in ~50 minutes.
Dot plots of shorter sequences (e.g. 100 kbp or less) usually take only a few seconds to
compute, at least for reasonably large values of _k_.

Expand Down
1,988 changes: 1,500 additions & 488 deletions docs/Benchmarking.ipynb

Large diffs are not rendered by default.

255 changes: 203 additions & 52 deletions docs/Tutorial.ipynb

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions wotplot/_logging.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
import time


def get_logger(verbose):
t0 = time.time()

def _mlog(s):
if verbose:
print(f"{time.time() - t0:,.2f}s: {s}", flush=True)

return _mlog
16 changes: 7 additions & 9 deletions wotplot/_make.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
import time
from pydivsufsort import divsufsort
from ._scipy_sm_constructor_getter import get_sm_constructor
from ._logging import get_logger

DNA = "ACGT"
RCDNA = "TGCA"
Expand Down Expand Up @@ -280,11 +280,7 @@ def _make(s1, s2, k, yorder="BT", binary=True, verbose=False):
ss1 and ss2 are versions of s1 and s2, respectively, converted to
strings.
"""
t0 = time.time()

def _mlog(s):
if verbose:
print(f"{time.time() - t0:,.2f}s: {s}", flush=True)
_mlog = get_logger(verbose)

# First, verify that the SciPy version installed is good
smc = get_sm_constructor()
Expand Down Expand Up @@ -318,9 +314,10 @@ def _mlog(s):
s1, s2, k, s1_sa, s2_sa, matches, yorder=yorder, binary=binary
)
_mlog(f"found {len(matches):,} forward match cell(s).")
# I'm not sure if this is needed, but we might as well be very clear that
# "hey this big chunky suffix array is now unnecessary please garbage
# collect it"
# I'm not sure if this makes a difference (is Python smart enough to
# immediately garbage-collect s2_sa at this point?), but we might as
# well be very clear that "hey this big chunky suffix array is now
# unnecessary please garbage collect it"
del s2_sa

_mlog("computing ReverseComplement(s2)...")
Expand All @@ -341,6 +338,7 @@ def _mlog(s):
s2isrc=True,
)
_mlog(f"found {len(matches):,} total match cell(s).")
del s1_sa
del rcs2_sa
density = 100 * (len(matches) / (mat_shape[0] * mat_shape[1]))
_mlog(f"density = {density:.2f}%.")
Expand Down
175 changes: 152 additions & 23 deletions wotplot/_viz.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,36 @@
import numpy as np
from matplotlib import pyplot
from ._make import FWD, REV, BOTH
from ._logging import get_logger

# Colormap based on Figure 6.20 in Chapter 6 of Bioinformatics Algorithms
# Colormaps based on Figure 6.20 in Chapter 6 of Bioinformatics Algorithms
# (Compeau & Pevzner), ed. 2
NBCMAP = {
NBCMAP_255 = {
0: [255, 255, 255],
1: [255, 0, 0],
-1: [0, 0, 255],
2: [100, 0, 100],
FWD: [255, 0, 0],
REV: [0, 0, 255],
BOTH: [100, 0, 100],
}
NBCMAP_HEX = {
0: "#ffffff",
FWD: "#ff0000",
REV: "#0000ff",
BOTH: "#640064",
}

DRAW_ORDER = (FWD, REV, BOTH)


def _get_yarr(yorder):
if yorder == "BT":
# -->, which gets turned into an up arrow when we rotate the yax label
yarr = "\u2192"
elif yorder == "TB":
# <--, which gets turned into a down arrow when we rotate the yax label
yarr = "\u2190"
else:
raise ValueError(f"Unrecognized yorder?: {yorder}")
return yarr


def style_viz_ax(ax, m, title=None):
Expand Down Expand Up @@ -38,14 +60,7 @@ def style_viz_ax(ax, m, title=None):
ax.set_yticklabels([])

ax.set_xlabel(f"$s_1$ ({len(m.s1):,} nt) \u2192", fontsize=18)
if m.yorder == "BT":
# -->, which gets turned into an up arrow when we rotate the yax label
yarr = "\u2192"
elif m.yorder == "TB":
# <--, which gets turned into a down arrow when we rotate the yax label
yarr = "\u2190"
else:
raise ValueError(f"Unrecognized yorder?: {m.yorder}")
yarr = _get_yarr(m.yorder)
ax.set_ylabel(f"$s_2$ ({len(m.s2):,} nt) {yarr}", fontsize=18)

if title is not None:
Expand All @@ -58,13 +73,23 @@ def _create_fig_and_ax_if_needed(ax=None):
return None, ax


def viz_spy(m, markersize=0.5, color="black", title=None, ax=None, **kwargs):
def viz_spy(
m,
markersize=0.5,
force_binary=False,
color="black",
nbcmap=NBCMAP_HEX,
draw_order=DRAW_ORDER,
title=None,
ax=None,
verbose=False,
**kwargs,
):
"""Visualizes a DotPlotMatrix object using matplotlib's spy().

This should be much more performant than viz_imshow(). However, there are
some limitations: match cells can only be drawn with a single color, even
for matrices that are not binary; and you may want to adjust the markersize
based on the size of your matrix and desired resolution.
This should be much more performant than viz_imshow(). However, the use of
a fixed markersize can require manual adjustment based on the size of your
matrix and desired resolution.

Parameters
----------
Expand All @@ -75,11 +100,35 @@ def viz_spy(m, markersize=0.5, color="black", title=None, ax=None, **kwargs):
Size of the markers drawn to represent each match cell. You may want to
adjust this depending on the size of your matrix.

force_binary: bool
If the input matrix is not binary, you can set this parameter to True
to force the visualization of it as a binary matrix (i.e. with each
match cell being the same color). For large matrices, this can save a
few seconds.

color: color
The color to use for each match cell. This should be in a format
accepted by matplotlib; see
The color to use for each match cell. Must be in a format accepted by
matplotlib; see
https://matplotlib.org/stable/gallery/color/color_demo.html for
details.
details. (Unlike the colors we use in imshow(), RGB triplets where the
entries range from 0 to 255 are not allowed here.) Only used if
visualizing a binary matrix and/or if force_binary is True.

nbcmap: dict
Maps 0, 1, -1, and 2 to colors (of the same possible formats as the
above "color" parameter). Only used if visualizing a matrix that is
not binary and if force_binary is False.

draw_order: iterable
If the input matrix is not binary, we will draw the matrix's match
cells as distinct colors by calling spy() multiple times (once per
match type). If your markersize is large enough that adjacent cells
in the matrix can overlap, then the order in which we call spy()
will impact which colors are drawn "on top" of others in the
visualization (with later colors ending up on top). You can change the
drawing order by adjusting this parameter (the default draws forward
matches, then reverse-complementary matches, then palindromic matches).
I don't think this should make a big difference in most cases.

title: str or None
If this is not None, then it'll be set as the title of the plot.
Expand All @@ -88,6 +137,10 @@ def viz_spy(m, markersize=0.5, color="black", title=None, ax=None, **kwargs):
If this is not None, then we'll add the visualization within this
Axes object.

verbose: bool
If True, prints information about time taken. Useful for performance
benchmarking.

**kwargs
Will be passed to spy().

Expand All @@ -99,10 +152,67 @@ def viz_spy(m, markersize=0.5, color="black", title=None, ax=None, **kwargs):
if an axes object was provided (i.e. "ax is not None"), then we won't
return anything.
"""
_mlog = get_logger(verbose)

fig, ax = _create_fig_and_ax_if_needed(ax)
ax.spy(m.mat, markersize=markersize, color=color, **kwargs)
if not m.binary and not force_binary:
if len(draw_order) != 3 or set(draw_order) != set(DRAW_ORDER):
raise ValueError(
f"draw_order must include exactly 3 elements ({FWD}, {REV}, "
f"and {BOTH} in any order)."
)
# With viz_imshow(), we manually add one RGB triplet per cell
# (for both match and non-match cells), meaning we don't have to do
# anything special to set the color of zero cells. In this function,
# however, we need to explicitly say "okay, set the background color to
# whatever nbcmap[0] is."
#
# spy() doesn't, as far as I can tell, have an easy way to set the
# background color. We can use set_facecolor(), though -- see
# https://stackoverflow.com/a/23645437.
#
# The most common scenario, I think, will be that the user has default
# matplotlib styles set (in which the default background color is
# white) and nbcmap[0] is set to its default (also of white). To save
# time, I check to see that this is the case -- if so, I don't bother
# calling set_facecolor(). For all other cases, though, I will call
# set_facecolor(). (We could try to avoid more redundant uses of this
# function by converting nbcmap[0] to an RGB triplet to simplify
# comparison with the output of get_facecolor(), but I'm not sure how
# much time that approach would even save...)
if nbcmap[0] != NBCMAP_HEX[0] or ax.get_facecolor() != (1, 1, 1, 1):
_mlog(f"Setting background color to {nbcmap[0]}...")
ax.set_facecolor(nbcmap[0])
_mlog("Done setting background color.")

# PERF: we could speed this up by only calling spy() for match cell
# types that actually exist in the matrix (e.g. if there aren't any
# palindromic matches, skip that spy() call). The most efficient way to
# do this, I think, would be assigning properties to each matrix during
# construction that describe the match cell types this matrix has --
# however, calling spy() with an empty set of values is relatively
# quick, so this isn't super important.
for val in draw_order:
_mlog(f'Visualizing "{val}" cells with spy()...')
# Filter the matrix to just the cells of a certain match type.
# https://stackoverflow.com/a/22077616
# This is somewhat inefficient -- ideally we'd do the filtering in
# one pass, or somehow make use of the information we already have
# from matrix construction about where these match cells are.
ax.spy(
m.mat.multiply(m.mat == val),
markersize=markersize,
color=nbcmap[val],
**kwargs,
)
_mlog(f'Done visualizing "{val}" cells.')
else:
_mlog("Visualizing all match cells with spy()...")
ax.spy(m.mat, markersize=markersize, color=color, **kwargs)
_mlog("Done visualizing all match cells.")
_mlog("Slightly restyling the visualization...")
style_viz_ax(ax, m, title)
_mlog("Done.")
if fig is not None:
return fig, ax

Expand All @@ -120,7 +230,15 @@ def _convert_to_colors(dm, nbcmap):
return cm.astype("uint8")


def viz_imshow(m, cmap="gray_r", nbcmap=NBCMAP, title=None, ax=None, **kwargs):
def viz_imshow(
m,
cmap="gray_r",
nbcmap=NBCMAP_255,
title=None,
ax=None,
verbose=False,
**kwargs,
):
"""Visualizes a DotPlotMatrix object using matplotlib's imshow().

IMPORTANT NOTE: This will convert the sparse matrix contained in the
Expand Down Expand Up @@ -151,6 +269,10 @@ def viz_imshow(m, cmap="gray_r", nbcmap=NBCMAP, title=None, ax=None, **kwargs):
If this is not None, then we will add the visualization within this
Axes object and not bother creating a new figure and axes object.

verbose: bool
If True, prints information about time taken. Useful for performance
benchmarking.

**kwargs
Will be passed to imshow().

Expand All @@ -167,15 +289,22 @@ def viz_imshow(m, cmap="gray_r", nbcmap=NBCMAP, title=None, ax=None, **kwargs):
The default nbcmap is based on Figure 6.20 in Chapter 6 of Bioinformatics
Algorithms (Compeau & Pevzner), edition 2.
"""
_mlog = get_logger(verbose)
fig, ax = _create_fig_and_ax_if_needed(ax)

_mlog("Converting the matrix to dense format...")
dense_mat = m.mat.toarray()
if not m.binary:
_mlog("Converting the matrix from numbers to colors...")
dense_mat = _convert_to_colors(dense_mat, nbcmap)
_mlog("Calling imshow()...")
ax.imshow(dense_mat, **kwargs)
else:
_mlog("Calling imshow()...")
ax.imshow(dense_mat, cmap=cmap, **kwargs)
_mlog("Slightly restyling the visualization...")
style_viz_ax(ax, m, title)
_mlog("Done.")

# Only return fig and ax if _create_fig_and_ax_if_needed() created them
# (i.e. the user did not provide their own "ax" object)
Expand Down
Loading
Loading