
BUG REPORT: Add buggy test case (files created by pbzip2) #7

Merged: 3 commits into snoyberg:master on Aug 5, 2019

Conversation

luispedro (Author)

This is more of a bug report than a real PR: it adds a test case that fails, but I have no fix at the moment.

Originally reported as a bug in NGLess (see ngless-toolkit/ngless#116). After the original report, @unode provided the following analysis:

> If using pbzip2, the parallel version of bzip2, to create the files, ngless is able to consume the files only up to a certain size. In the test case I set up locally, a FASTQ file with 9724 lines (266413 bytes compressed, 900170 bytes uncompressed) causes ngless to fail with `BZ2_bzDecompress: -1`. Regular Unix bzip2 is able to decompress the file without problems.
>
> On the other hand, when using regular bzip2, I tried as many as 90000 lines and ngless is still able to consume the files without error.

There is more detail at ngless-toolkit/ngless#116

snoyberg (Owner) commented Aug 1, 2019

It sounds likely that this is an issue stemming from the underlying C library, not from the Haskell code itself. Interested in trying to update to the latest version of the underlying library and seeing if that fixes this?

unode commented Aug 1, 2019

Not much has changed in the underlying lib.
The patch at https://gist.github.com/unode/d4369136214c831bd66e628261049900 highlights the differences (split into two commits to separate code changes from boilerplate changes).

luispedro (Author)

I played a bit more with this: the cbits are only used on Windows; on Linux, the library is linked against the system's bz2, so updating them should not matter.

luispedro (Author)

I now think that this is caused by the Haskell interface not correctly handling the case where bzip2 files are concatenated together (while other tools decompress such a file as the concatenation of the inputs).

Curiously enough, I had reported this exact phenomenon for the gzip conduit in snoyberg/conduit#254.
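
For reference, a minimal reproduction sketch of the multi-stream case. It assumes the `bzip2`/`bunzip2` conduits exported by `Data.Conduit.BZlib` can be run under `runConduitRes`; the exact constraints in the released package may differ, and the payload is made up.

```haskell
import Conduit (foldC, runConduitRes, yield, (.|))
import qualified Data.ByteString.Char8 as C
import Data.Conduit.BZlib (bunzip2, bzip2)

main :: IO ()
main = do
  -- Compress a small payload once, then concatenate the compressed bytes
  -- with themselves, mimicking `cat sample1.bz2 sample1.bz2` (pbzip2
  -- similarly emits several independent streams in one file).
  one <- runConduitRes $ yield (C.pack "@read1\nACGT\n+\nIIII\n") .| bzip2 .| foldC
  let multi = one <> one
  -- Before the fix, bunzip2 stopped (or failed) at the first BZ_STREAM_END
  -- instead of decoding the second stream the way the bzip2 CLI does.
  out <- runConduitRes $ yield multi .| bunzip2 .| foldC
  print (C.length out)
```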

luispedro (Author)

A further airport hacking session confirmed that it's about multiple concatenated streams in a single file.

Depending on how delayed my flight is, I will fix this in a bit or after the weekend.

Unlike with the gzip conduit, I propose to change the API so that a new decompress1 function extracts a single stream, while decompress/bunzip2 extract all of them. This is because, unlike in the gzip case, there is no backwards-compatibility argument: currently, the code crashes on multiple streams.
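
Roughly, the proposed API shape would look like the sketch below. The names come from this comment; the constraints are assumptions (the real functions may also take parameter records), and the bodies are placeholders.

```haskell
import Control.Monad.Trans.Resource (MonadResource)
import Data.ByteString (ByteString)
import Data.Conduit (ConduitT)

-- | Decode exactly one bzip2 stream, then stop; any trailing compressed
-- input is left for the caller.
decompress1 :: MonadResource m => ConduitT ByteString ByteString m ()
decompress1 = undefined  -- placeholder for the single-stream decoder

-- | Decode every concatenated stream until the input is exhausted,
-- matching what the command-line bzip2 tools do.
decompress :: MonadResource m => ConduitT ByteString ByteString m ()
decompress = undefined   -- placeholder: loop the single-stream decoder

-- | bunzip2 keeps the decode-everything behaviour.
bunzip2 :: MonadResource m => ConduitT ByteString ByteString m ()
bunzip2 = decompress
```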

sample5.bz2 is simply "cat sample1.bz2 sample1.bz2" (and sample5.ref is "cat sample1.ref sample1.ref"). This is handled by bzip2 tools (including the official tool and wrappers such as the Python wrapper).
luispedro mentioned this pull request on Aug 2, 2019
    if ret == c'BZ_STREAM_END
        then do
            dataIn <- liftIO $ peek $ p'bz_stream'next_in ptr
            unread <- liftIO $ S.packCStringLen (dataIn, fromEnum availIn)
snoyberg (Owner):

fromIntegral is more idiomatic I think.

luispedro (Author):

I actually like fromEnum for more clearly converting to Int, but I'll take your lead as it's your project.
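
For context, both spellings convert the C-side byte count to the Int that S.packCStringLen expects; a tiny illustration (the CUInt type is assumed here purely for the sake of the example):

```haskell
import Foreign.C.Types (CUInt)

-- avail_in comes back from the bz_stream struct as a C unsigned int;
-- CUInt is an assumption for illustration only.
availInToLen :: CUInt -> Int
availInToLen = fromIntegral  -- plain numeric conversion, the usual idiom

availInToLen' :: CUInt -> Int
availInToLen' = fromEnum     -- also valid via the Enum instance,
                             -- but reads as an enumeration conversion
```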

    next <- await
    case next of
        Nothing -> return ()
        Just bs -> do
snoyberg (Owner):

Depending on the data source, it's theoretically possible that at the end of the stream we may receive an empty ByteString. This code would treat such a chunk as the start of a new stream. It may be better to integrate this logic more closely with the multi-stream detection above.

luispedro (Author):

That's a fair catch. I'll change the logic to ignore empty ByteStrings.
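
A minimal sketch of that adjustment, assuming the conduit loop shown in the diff above (the helper name is illustrative, not the actual code in the PR):

```haskell
import qualified Data.ByteString as S
import Data.Conduit (ConduitT, await)

-- Await the next chunk, skipping empty ByteStrings so a stray empty chunk
-- at the end of input is not mistaken for the start of another stream.
awaitNonEmpty :: Monad m => ConduitT S.ByteString o m (Maybe S.ByteString)
awaitNonEmpty = do
    next <- await
    case next of
        Just bs | S.null bs -> awaitNonEmpty  -- drop empty chunks and retry
        _                   -> return next    -- real data, or end of input
```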

The bzip2 utilities accept files that are concatenations of bzip2
streams. Previously, the Haskell wrapper would throw an error in this
case.

This adds the decompress1 conduit, which extracts just one stream if desired.
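
A usage sketch of the resulting behaviour (the file path under test/ and the exact constraints are assumptions; decompress1, added by this PR, may also take a parameter record like decompress does):

```haskell
import Conduit (foldC, runConduitRes, sourceFile, (.|))
import qualified Data.ByteString as S
import Data.Conduit.BZlib (bunzip2)

main :: IO ()
main = do
  -- sample5.bz2 is two bzip2 streams back to back; decompress/bunzip2 now
  -- decodes both of them, while decompress1 would stop after the first.
  out <- runConduitRes $ sourceFile "test/sample5.bz2" .| bunzip2 .| foldC
  print (S.length out)
```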
luispedro (Author)

Force-pushed the requested changes.

snoyberg merged commit 584fad9 into snoyberg:master on Aug 5, 2019
snoyberg (Owner) commented Aug 5, 2019

Thanks!
