
Overflow issue with Fairseq Preprocess for large datasets #5532

Open

henrycharlesworth opened this issue Aug 7, 2024 · 0 comments

henrycharlesworth commented Aug 7, 2024

🐛 Bug

I realise this project is no longer maintained, but I'm posting this for anyone who runs into the same hard-to-debug issue:

With the default binarized dataset format in fairseq-preprocess (mmap), it is possible to hit integer overflow errors when processing large datasets. The key snippet is in fairseq/data/indexed_dataset.py (inside MMapIndexedDataset.Index.writer, where dtype is in scope):

@staticmethod
def _get_pointers(sizes):
    dtype_size = dtype().itemsize
    address = 0
    pointers = []

    # byte offset of each sequence in the data file
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size

    return pointers

For some reason, when using multiple workers, some of the values in sizes can come back as np.int32 rather than Python int (I have not worked out why). For a large enough dataset this leads to integer overflow: once address is promoted to np.int32, the accumulation wraps around at 2**31 - 1 instead of growing without bound. A minimal reproduction is sketched below.
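This standalone NumPy sketch (not fairseq code; the sizes and dtype_size values here are made up purely for illustration) shows the failure mode:

import numpy as np

# Hypothetical per-sentence sizes, as if returned by workers as np.int32.
sizes = [np.int32(2**29)] * 4
dtype_size = 2  # e.g. tokens stored as 2-byte integers

address = 0  # starts out as a Python int...
pointers = []
for size in sizes:
    pointers.append(address)
    # np.int32 * int yields np.int32, so address is silently demoted to
    # np.int32 and wraps past 2**31 - 1 on the third iteration.
    address += size * dtype_size

print(pointers)
# [0, 1073741824, -2147483648, -1073741824]
# the last two pointers should be 2147483648 and 3221225472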

The fix is just to change

address += size * dtype_size

to

address += int(size * dtype_size)

so that address always stays a Python int (which has arbitrary precision), regardless of the element types in sizes.
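For reference, the patched helper in full (same function as quoted above; only the accumulation line changes):

@staticmethod
def _get_pointers(sizes):
    dtype_size = dtype().itemsize
    address = 0
    pointers = []

    for size in sizes:
        pointers.append(address)
        # cast to Python int so the running total never wraps
        address += int(size * dtype_size)

    return pointers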
