Question about the coordinate systems in use #42

Open
ccwang002 opened this issue Aug 5, 2024 · 1 comment

Comments

@ccwang002

Thanks for developing the genomic tool stack in Python. As a user coming from the R/Bioconductor ecosystem, I would like some clarification on the coordinate system in use.

Here is the current description of the coordinate system from BiocPy's documentation:

An IRanges holds a start position and a width, and is most typically used to represent coordinates along some genomic sequence. The interpretation of the start position depends on the application; for sequences, the start is usually a 1-based position, ...

>>> from iranges import IRanges
>>> starts = [-2, 6, 9, -4, 1, 0, -6, 10]
>>> widths = [5, 0, 6, 1, 4, 3, 2, 3]
>>> ir = IRanges(starts, widths)
>>> print(ir)
IRanges object with 8 ranges and 0 metadata columns
               start              end            width
    <ndarray[int32]> <ndarray[int32]> <ndarray[int32]>
[0]               -2                3                5
[1]                6                6                0
[2]                9               15                6
[3]               -4               -3                1
[4]                1                5                4
[5]                0                3                3
[6]               -6               -4                2
[7]               10               13                3

This seems to indicate that BiocPy's coordinate system uses a 1-based, half-open interval, i.e. [start, end), where the reported end equals start + width.

However, this behavior differs from R's IRanges, which uses a 1-based closed interval, i.e. [start, end], where the reported end equals start + width - 1:

> library(IRanges)
> starts = c(-2, 6, 9, -4, 1, 0, -6, 10)
> width = c(5, 0, 6, 1, 4, 3, 2, 3)
> ir <- IRanges(start = starts, width=width)
> ir
IRanges object with 8 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        -2         2         5
  [2]         6         5         0
  [3]         9        14         6
  [4]        -4        -4         1
  [5]         1         4         4
  [6]         0         2         3
  [7]        -6        -5         2
  [8]        10        12         3

I wanted to clarify whether this design is intentional. If so, could we note the differences between BiocPy's and R's IRanges in the documentation? I was also wondering if we could add some utility functions to convert between the coordinate systems, so that it is easy to port existing R scripts that expect a 1-based closed interval.
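To make the request concrete, here is a minimal sketch of what such conversion helpers could look like. The function names are hypothetical and are not part of BiocPy's iranges package; they only shift the end coordinate by one and leave the start untouched.

```python
from typing import Tuple


def half_open_to_closed(start: int, end: int) -> Tuple[int, int]:
    """Convert [start, end) to the closed interval [start, end - 1]."""
    return start, end - 1


def closed_to_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert the closed interval [start, end] to [start, end + 1)."""
    return start, end + 1


def width_from_half_open(start: int, end: int) -> int:
    """Width of [start, end): the end is exclusive."""
    return end - start


def width_from_closed(start: int, end: int) -> int:
    """Width of [start, end]: the end is inclusive, hence the + 1."""
    return end - start + 1
```

For example, the first range above, printed as (-2, 3) by the Python package, would map to (-2, 2), which is what R reports. A zero-width range such as (6, 6) maps to (6, 5), matching R's convention of end = start - 1 for empty ranges.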

Thank you again for making the tool!

@jkanche
Member

jkanche commented Aug 5, 2024

Hi, thank you for reporting this. My best guess is that this is mostly related to how we compute the ends (start + width versus start + width - 1). We should subtract 1 base to keep the intervals closed and conform to R's IRanges. I'll also run some tests to make sure that's the case with the methods as well.
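As a rough illustration of the two end computations, the sketch below uses plain Python arithmetic on the starts and widths from the report and compares the results with the ends printed by each package. It does not call the iranges package, and the helper names are hypothetical.

```python
from typing import List, Sequence

# Inputs and printed ends copied from the report above.
STARTS = [-2, 6, 9, -4, 1, 0, -6, 10]
WIDTHS = [5, 0, 6, 1, 4, 3, 2, 3]
PYTHON_ENDS = [3, 6, 15, -3, 5, 3, -4, 13]  # ends printed by the Python package
R_ENDS = [2, 5, 14, -4, 4, 2, -5, 12]       # ends printed by R's IRanges


def half_open_ends(starts: Sequence[int], widths: Sequence[int]) -> List[int]:
    """Exclusive ends: start + width."""
    return [s + w for s, w in zip(starts, widths)]


def closed_ends(starts: Sequence[int], widths: Sequence[int]) -> List[int]:
    """Inclusive ends: start + width - 1, the convention R's IRanges uses."""
    return [s + w - 1 for s, w in zip(starts, widths)]


def matches_r(starts: Sequence[int], widths: Sequence[int]) -> bool:
    """True if the closed-interval ends agree with the R output above."""
    return closed_ends(starts, widths) == R_ENDS
```

With these inputs, `half_open_ends(STARTS, WIDTHS)` reproduces the ends currently printed by the Python package, while `closed_ends(STARTS, WIDTHS)` reproduces R's, which is consistent with the off-by-one explanation.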
