
[docs] Documentation and docstring seem incorrect for LanceFragment.to_batches #2866

Open
bnorick opened this issue Sep 11, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

bnorick commented Sep 11, 2024

I didn't dig into the cause of this, but the documentation and docstring (quoted below) for LanceFragment.to_batches show the wrong function signature and parameter details.
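
A minimal way to surface the docstring in question (a sketch; it assumes the pylance package is installed):

    import lance.fragment

    # Prints the docstring quoted below, which appears to come from PyArrow
    # rather than from the Lance implementation.
    help(lance.fragment.LanceFragment.to_batches)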

Fragment.to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)

Read the fragment as materialized record batches.

Parameters
----------
schema : Schema, optional
    Concrete schema to use for scanning.
columns : list of str, default None
    The columns to project. This can be a list of column names to
    include (order and duplicates will be preserved), or a dictionary
    with {new_column_name: expression} values for more advanced
    projections.

    The list of columns or expressions may use the special fields
    `__batch_index` (the index of the batch within the fragment),
    `__fragment_index` (the index of the fragment within the dataset),
    `__last_in_fragment` (whether the batch is last in fragment), and
    `__filename` (the name of the source file or a description of the
    source fragment).

    The columns will be passed down to Datasets and corresponding data
    fragments to avoid loading, copying, and deserializing columns
    that will not be required further down the compute chain.
    By default all of the available columns are projected. Raises
    an exception if any of the referenced column names does not exist
    in the dataset's Schema.
filter : Expression, default None
    Scan will return only the rows matching the filter.
    If possible the predicate will be pushed down to exploit the
    partition information or internal metadata found in the data
    source, e.g. Parquet statistics. Otherwise filters the loaded
    RecordBatches before yielding them.
batch_size : int, default 131_072
    The maximum row count for scanned record batches. If scanned
    record batches are overflowing memory then this method can be
    called to reduce their size.
batch_readahead : int, default 16
    The number of batches to read ahead in a file. This might not work
    for all file formats. Increasing this number will increase
    RAM usage but could also improve IO utilization.
fragment_readahead : int, default 4
    The number of files to read ahead. Increasing this number will increase
    RAM usage but could also improve IO utilization.
fragment_scan_options : FragmentScanOptions, default None
    Options specific to a particular scan and fragment type, which
    can change between different scans of the same dataset.
use_threads : bool, default True
    If enabled, then maximum parallelism will be used determined by
    the number of available CPU cores.
memory_pool : MemoryPool, default None
    For memory allocations, if required. If not specified, uses the
    default pool.

Returns
-------
record_batches : iterator of RecordBatch

The actual function is:

    def to_batches(
        self,
        *,
        columns: Optional[Union[List[str], Dict[str, str]]] = None,
        batch_size: Optional[int] = None,
        filter: Optional[Union[str, pa.compute.Expression]] = None,
        limit: Optional[int] = None,
        offset: Optional[int] = None,
        with_row_id: bool = False,
        batch_readahead: int = 16,
    ) -> Iterator[pa.RecordBatch]:
        ...

I won't enumerate all the differences, but note, for example, that the actual signature has no fragment_scan_options argument, while the documented one lacks limit, offset, and with_row_id.
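
For comparison, here is a call matching the actual signature (a sketch, not from the issue; the dataset path is hypothetical):

    import lance

    # Hypothetical path; any existing Lance dataset works.
    ds = lance.dataset("/tmp/example.lance")
    fragment = ds.get_fragments()[0]

    # All arguments are keyword-only, per the signature above. There is no
    # fragment_scan_options; instead there are limit, offset, and with_row_id.
    for batch in fragment.to_batches(
        columns=["id"],
        batch_size=1024,
        with_row_id=True,
    ):
        print(batch.num_rows)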

wjones127 added the bug label Sep 11, 2024
wjones127 (Contributor) commented

Hmm, I think it's because we are inheriting from the PyArrow dataset classes. We might want to stop doing that.
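
If that is the cause, Python's docstring lookup would explain the mismatch. A minimal sketch of the mechanism (standalone, not Lance code):

    import inspect

    class Base:
        def to_batches(self):
            """Docstring describing Base's parameters."""

    class Child(Base):
        def to_batches(self):  # overridden, but without its own docstring
            ...

    # inspect.getdoc (which help() and documentation generators build on)
    # falls back to the base class docstring when the override lacks one.
    print(inspect.getdoc(Child.to_batches))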
