Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add cigar utility method get_alignment_offsets #188

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

yfarjoun
Copy link
Contributor

  • add cigar utility method get_alignment_offsets to find the offsets of the aligned segment in the read
  • tests included

Copy link

@TimD1 TimD1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, once the ruff/mypy CI stuff passes

Get the starting and ending offset for the alignment based on the CIGAR string.

Arguments:
reverse: bool Whether to count from the end of the read sequence.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: reverse (bool): Whether... for consistency with other docstrings

Copy link

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.43%. Comparing base (c13d3cc) to head (5709d98).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   89.35%   89.43%   +0.08%     
==========================================
  Files          18       18              
  Lines        2094     2111      +17     
  Branches      464      468       +4     
==========================================
+ Hits         1871     1888      +17     
  Misses        146      146              
  Partials       77       77              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yfarjoun
Copy link
Contributor Author

not sure what's going on the mkdocs. I had to comment out some lines to get it to compile.... if someone knows how to fix this please let me know...

Copy link
Contributor

@msto msto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a good addition to the package, thanks for adding it!

I think the documentation could be a bit clearer, and it looks like there's a bug when you have adjacent hard/soft clips

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
0-based, open-ended offsets (to the read sequence.)
If 'reverse' is True, the offsets count from the end of the read sequence.

# TODO: FIGURE OUT WHY the following three lines causes mkdocs to fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME please fix and remove the TODO before merging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it like this because I was unable to resolve...when I uncomment mkdocs barfs....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 474, in _read_returns_section
    annotation = annotation.slice.elements[index]
                 ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
yfarjoun and others added 4 commits September 18, 2024 22:24
…of the aligned segment in the read

- tests included
Co-authored-by: Matt Stone <matt@fulcrumgenomics.com>
Co-authored-by: Matt Stone <matt@fulcrumgenomics.com>
@yfarjoun
Copy link
Contributor Author

Thanks for the detailed review @msto , much appreciated!

mkdocs is giving me grief and I was unable to make it play nice.... if someone knows why it barfs when I change the indentation or remove the final "comment" line... please let me know. this is the best (passing) state I could put it into....

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
code (int): The `~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.

Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively

Screenshot 2024-09-19 at 11 23 12 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on different PR. Though I'll also argue there for maintaining "target".

Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛

@@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
code (int): The `~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

@@ -547,6 +547,48 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
"""
Get the starting and ending offsets for the alignment based on the CIGAR string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole docstring needs to be out-dented to start at the same column as the """.

Copy link
Contributor Author

@yfarjoun yfarjoun Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and mkdocs responds: 🤮

<SNIP>
^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
    {% extends "_base/children.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 73, in top-level template code
    {% include class|get_template with context %}
^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/class.html.jinja", line 1, in top-level template code
    {% extends "_base/class.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 105, in top-level template code
    {% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 179, in block 'contents'
    {% block children scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 186, in block 'children'
    {% include "children"|get_template with context %}
^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
    {% extends "_base/children.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 94, in top-level template code
    {% include function|get_template with context %}
^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/function.html.jinja", line 1, in top-level template code
    {% extends "_base/function.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 105, in top-level template code
    {% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'contents'
    {% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'docstring'
    {% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/jinja2/environment.py", line 487, in getattr
    return getattr(obj, attribute)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/functools.py", line 993, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 119, in parsed
    return self.parse()
           ^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 136, in parse
    return parse(self, parser or self.parser, **(options or self.parser_options))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/parsers.py", line 41, in parse
    return parsers[parser](docstring, **options)  # type: ignore[operator]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 809, in parse
    section, offset = reader(docstring, offset=offset + 1, **options)  # type: ignore[operator]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 474, in _read_returns_section
    annotation = annotation.slice.elements[index]
                 ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -547,6 +547,48 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is quite ambiguous given that we're dealing with an alignment between a query and a target. How about one of the following:

get_query_alignment_offsets
get_query_alignment_bounds
get_query_alignment_range

Also: consider returning a range object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean a python range ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat idea!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions prefixed with get_ are usually reserved for true property getters.

Can we drop the prefix too?

Suggested change
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
def alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:

https://google.github.io/styleguide/pyguide.html#315-getters-and-setters

@@ -547,6 +547,48 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
"""
Get the starting and ending offsets for the alignment based on the CIGAR string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Get the starting and ending offsets for the alignment based on the CIGAR string.
Get the 0-based, end-exclusive positions of the first and last aligned base in the query.

Comment on lines 555 to 556
reverse: If True, count from the end of the read sequence.
Otherwise (default behavior), count from the beginning of the read sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear to me - does it just count from the end? Or does it also swap the order of "start" vs. "end"? I think you're saying that:

cigar.get_alignment_offsets(reverse=True) == cigar.reverse.get_alignment_offsets()

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

A tuple (start, end), defining the start and end offsets of the aligned part
of the read. These offsets are 0-based and open-ended, with respect to the
beginning of the read sequence. (If 'reverse' is True, the offsets are with
respect to the end of the read sequence.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
respect to the end of the read sequence.)
respect to the end of the query sequence.)

Comment on lines 563 to 567
If the Cigar contains no alignment operators that consume sequence bases, or
only clipping operators, the start and end offsets will be the same value
(indicating an empty region). This shared value will be the offset to the first
base consumed by a non-clipping operator or the length of the read sequence if
there is no such base.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that either a) this function should return Optional[...] and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue in favor of Optional vs raising an error, but would be comfortable either way

Comment on lines +582 to +585
elif not element.operator.is_clipping:
# We are within the alignment
alignment_began = True
end_offset += element.length_on_query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ... I think you have to consider edge-case cigars here. What does this function report for:

  • 76D
  • 76I
  • 10P76S
  • 50S1000N50S

tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
tests/fgpyo/sam/test_cigar.py Outdated Show resolved Hide resolved
@yfarjoun yfarjoun changed the title - add cigar utility method get_alignment_offsets Feat: add cigar utility method get_alignment_offsets Sep 19, 2024
@@ -547,6 +547,47 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def query_alignment_offsets(self, reverse: bool = False) -> Optional[range]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne I disagree with the suggestion to use a range object.

  1. This is an unconventional way to use range
  2. The stated purpose of the function is to return the start and end positions of the alignment within the query sequence. I assume that in most contexts where this function will be used, the start and end positions will be used to slice the query sequence. range cannot be used to slice a string, nor does it support tuple unpacking.

I think it would be more ergonomic to return a tuple[int, int], though I wouldn't mind seeing a use case where range provides advantages. (Nor would I be opposed to using a NamedTuple with start and end attributes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opinionated here, but am noticing that range does have start-stop access, so in that regard it will be equivalent to the named NamedTuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(almost stop!=end)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you can't unpack the range

i.e.

# this is valid
from typing import NamedTuple

class Offsets(NamedTuple):
    start: int
    end: int
    
offsets = Offsets(start=1, end=200)

start, end = offsets

# this is invalid
offsets = range(1, 200)
start, end = offsets

So when calling the method, you can either accept the tuple or unpack it immediately, depending on which makes more sense in your context

e.g. I imagine most use cases will look like the following

start, end = cigar.alignment_offsets()
aligned_sequence = query[start:end]

Args:
reverse: If True, count from the end of the query. i.e. find the offsets
using the reversed elements of the cigar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@msto msto changed the title Feat: add cigar utility method get_alignment_offsets feat: add cigar utility method get_alignment_offsets Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants