
Overflow issue with Fairseq Preprocess for large datasets #5532

Open

henrycharlesworth opened this issue Aug 7, 2024 · 0 comments

henrycharlesworth commented Aug 7, 2024

🐛 Bug

I realise this project is no longer maintained, but I'm posting this for anyone who runs into the same hard-to-debug issue:

With the default binarized dataset format in fairseq-preprocess (mmap), it is possible to hit integer overflow errors when processing large datasets. The key snippet is in fairseq/data/indexed_dataset.py (inside MMapIndexedDataset.Index.writer, where dtype is in scope):

@staticmethod
def _get_pointers(sizes):
    dtype_size = dtype().itemsize
    address = 0
    pointers = []

    # byte offset of each sequence in the data file
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size

    return pointers

For some reason, when using multiple workers, some of the values in sizes can come back as np.int32 rather than Python int (I have not worked out why). For a large enough dataset this leads to integer overflow: once address is promoted to np.int32, the accumulation wraps around at 2**31 - 1 instead of growing without bound. A minimal reproduction is sketched below.
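This standalone NumPy sketch (not fairseq code; the sizes and dtype_size values here are made up purely for illustration) shows the failure mode:

import numpy as np

# Hypothetical per-sentence sizes, as if returned by workers as np.int32.
sizes = [np.int32(2**29)] * 4
dtype_size = 2  # e.g. tokens stored as 2-byte integers

address = 0  # starts out as a Python int...
pointers = []
for size in sizes:
    pointers.append(address)
    # np.int32 * int yields np.int32, so address is silently demoted to
    # np.int32 and wraps past 2**31 - 1 on the third iteration.
    address += size * dtype_size

print(pointers)
# [0, 1073741824, -2147483648, -1073741824]
# the last two pointers should be 2147483648 and 3221225472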

The fix is just to change

address += size * dtype_size

to

address += int(size * dtype_size)

so that address always stays a Python int (which has arbitrary precision), regardless of the element types in sizes.
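For reference, the patched helper in full (same function as quoted above; only the accumulation line changes):

@staticmethod
def _get_pointers(sizes):
    dtype_size = dtype().itemsize
    address = 0
    pointers = []

    for size in sizes:
        pointers.append(address)
        # cast to Python int so the running total never wraps
        address += int(size * dtype_size)

    return pointers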
