-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug with parallel jobs? #7
Comments
Thanks for the report.
Which version/commit of libpst are you running?
What is the exact command you are running in the two cases?
I'm unable to reproduce this with the sample files I have already.
If you could attach a sample without any private data in it that lets
you reproduce the issue, that would be good.
Are you able to use `git bisect` to track down when this started?
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
It seems this bug doesn't depend on the version of libpst, but the version of debian. I built the most recent and version 0.6.71 on a debian 11 server and both versions had this behavior while building the same source on debian 10 worked fine. I can reproduce it with the minimal debian 11 container image from https://images.linuxcontainers.org/ and lxc. After:
I get:
The last one with |
I can reproduce it with the file you linked now. Thanks for the debugging, that is a very strange bug. Perhaps it is a compiler or optimsation issue. I will try to figure out what is going on. |
It definitely isn't affected by the optimisation level. Seems to happen with GCC 9-12 but not GCC 8 on Debian buster. |
I'll continue looking at this tomorrow, but if you have any more ideas for debugging, that would be great. |
Yea looks like this could be hard to debug, I "fixed" by just running with |
Good point, I'll take a look at this today.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
Looks like you are onto something, building the binary on buster and then copying it to bullseye breaks the parallel extraction even though it still runs fine. |
Upgrading |
I'm doing a debbisect run for this issue of unstable between the
release date of buster and the release date of bullseye. This relies on
downloads from the Debian wayback machine, which has rate limits, so
the run could take a while to complete, but it will pinpoint the exact
package difference that will cause the problem.
https://manpages.debian.org/man/debbisect
https://wiki.debian.org/BisectDebian
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
According to the debbisect process, upgrading libc6 from 2.28-10 to 2.29-1 broken the jobs feature. |
I checked this issue on a Fedora 36 live image with glibc 2.35 and found that it still occurs there. I'll discuss it further with the Debian and upstream glibc maintainers. |
Seems like a tricky bug, thanks for looking into it!
|
Yeah, the next step is to bisect glibc to figure out what commit caused
the issue and then decide if libpst or glibc needs to be fixed. I got
some build/bisect tips from the Debian glibc maintainers yesterday,
so I will be looking at doing this some time this week.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
After a false start I managed to redo the bisect and got this bad commit:
This is the git bisect log, which you can replay to copy/visualise the results locally:
|
I note that |
I used a simple program to check that freopen always has reset the file position, so that isn't the problem and the issue seems more complicated than I thought. |
I am now running the command with the debug log enabled, with and without -j0 in both Debian buster and bullseye, then I'll compare the logs and see if I can see what happened. |
The builds have completed but weirdly the bullseye run didn't result in
the error, maybe enabling the debug log fixes the issue somehow.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
Confirmed, dropping the debug log introduces the error again... |
After enabling some commented out printf calls, I noticed that one of the readpst child processes crashed with |
Looks like maybe my printf patch caused the SIGSEGV. |
Noticed another thing, only one folder seems affected and it seems the missing items have been skipped. @amulepeweichan can you confirm that you only ever get the issue with one folder and its the same folder as me?
|
OK, a second run got that folder affected and another one too, nevermind. The skip count increase is exactly the same as the missing item though. |
These debug messages are printed for each extra skipped item:
|
The reasons for each unparsed item seem different. I am getting the feeling that the code is reading data from an incorrect randomised location.
|
Aha, managed to reproduce it with a full debug log, so enabling that doesn't fix the issue. Good, that would have been weird :) |
Looking at the full debug log, it seems to get completely different data and if I use hexed to search for the bad data, it is at 0x6AF321C, while the good data is at 0x6600, so my hunch about the read location being wrong seems correct. |
@amulepeweichan are you able to reproduce this with any other PST files? The size of the logs with this one makes it challenging to analyse and debug. I tried with the smallest ones on that site and didn't get the error. |
Yes, I think it happens with most of the PST files on
https://files.enlacehacktivista.org/Pronico%20mail/
I noticed the bug when unpacking them all with readpst on one server
resulted in 2.7 TB of email total, and on another server just 1.7 TB.
And then on that server I ran readpst again with -j0 and got the same 2.7TB
…On 8/16/22 23:11, Paul Wise wrote:
@amulepeweichan are you able to reproduce this with any other PST files? The size of the logs with this one makes it challenging to analyse and debug. I tried with the smallest ones on that site and didn't get the error.
|
Since this bug can cause data loss and is going to be very hard to
debug and fix, I think I'm going to disable the -j option until I can
figure out what exactly is going on.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
Edited the code to delete the |
Same for the |
I can reproduce it with the file you linked now.
Thanks for the debugging, that is a very strange bug.
Perhaps it is a compiler or optimsation issue.
I will try to figure out what is going on.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
I can still reproduce this in Debian bookworm (GCC 12 glibc 2.35) and sid (GCC 12 glibc 2.36). |
I discovered that switching from |
I added some debugging to the pst_reopen function and none of the
function calls including freopen are failing in the bad case.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
This comment was marked as outdated.
This comment was marked as outdated.
I discovered that switching from freopen() to fseek(0) does not fix the issue and can cause segfaults. |
I discovered that switching from freopen() to fflush()+fclose()+fopen() also does fix the issue. |
I now tried to extract mails from .pst by readpst (from Fedora 37 libpst-0.6.76-9.fc37.x86_64 package), and it seem behave strangely, because number of from .PST extracted messages varies, both when using different -e / -m / -M switches and even when using the same switch repeatedly(!).
I got results:
Also, isn't it weird that the run time is around two seconds with -e and -M switch, but around two minutes with -m switch?
(it seems as readpst uses multiple CPU threads with -e/-M switch and only one with -m switch, but even then the differences are very big) Also, according to readpst man page it appears that number of extracted files with '-m' should be double (.EML + .MSG) the number with -e/-M - but is bigger - maybe because there are some (maybe undocumented) files with 'temp_file_attachment' name. |
Try it with parallel mode disabled `-j 0` and it should work fine.
I haven't been able to figure out how to fix the parallel mode yet.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
Hi Paul, thanks for your help!
6681 extracted files (-m switch) == 2 * 3304 (-e or -M switch) + 73 'temp_file_attachment' files There are still minor uncertainties:
Thanks, Franta |
Can you file two new issues for these two items? They are off-topic on this issue.
I'm not entirely sure, but I think the temp_file_attachment things are
a bug in the code that writes .msg files, it first creates temporary
attachment files, then saves them into the .msg files but does not
clean up the temporary attachment files. The correct fix here would be
to make it write directly to the .msg files instead.
The difference in speed of the -m option is likely because it writes
double the data (.eml files plus .msg files) and the .msg files are in
Microsoft Outlook binary format and .msg files are created by the
libgsf library, and libgsf probably isn't optimised very well.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
I revisited this bug using the strace stack trace suggestion from the
glibc folks and found that this seems to be caused by readpst child
processes doing a seek on a parent process file descriptor.
https://sourceware.org/pipermail/libc-help/2023-August/006430.html
The correct fix would appear to be to close the file and open it again.
Also, the fflush+fclose+fopen case actually fails the same as freopen.
If libc-help confirms I'll be making a commit and backporting it.
…--
bye,
pabs
https://bonedaddy.net/pabs3/
|
This fixes a data loss problem in the readpst parallel jobs support. glibc since 2.29 and musl both seek before closing the file descriptor. Since the file descriptor is still linked to the file description created and still used by the parent process, the seek changes the file position of the file description of the file descriptor of the parent process, which means the parent process reads data from the wrong location and soon encounters errors, stops processing the data and misses out on processing the messages that occur later in the PST file. Since sharing a file description between parent and child processes is inherently unsafe and there may be other C libraries that seek even before fclose(), this is only a workaround and a different design is required. Workaround-for: #7
I've pushed a workaround for this bug to git. After some other fixes I'll make a new release and also backport it to other releases. At some point the readpst parallel jobs feature will need to be redesigned, probably the new design will need to use mmap. |
…wards glibc since 2.29 and musl both change the file position indicator before reopening the file in the child process. Workaround-for: pst-format#7
This avoids leaving the temporary file behind in the directory. Suggested-in: #7 (comment)
@fhanzlik FYI, added a change to delete those temporary files. Not writing them in the first place should be possible, but will require an incremental API for handling attachments, so I have deferred that. |
…wards glibc since 2.29 and musl both change the file position indicator before reopening the file in the child process. Workaround-for: pst-format#7
…wards glibc since 2.29 and musl both change the file position indicator before reopening the file in the child process. Workaround-for: pst-format#7
…wards glibc since 2.29 and musl both change the file position indicator before reopening the file in the child process. Workaround-for: pst-format#7
…wards glibc since 2.29 and musl both change the file position indicator before reopening the file in the child process. Workaround-for: pst-format#7
When I run
readpst -M -b file.pst
it extracts less mail, and a slightly different amount each time, than when it's run with the -j 0 option. An older libpst 0.71 extracts all mail in parallel without -j 0 so maybe there's some concurrency bug that got introduced since then? I can send you pst files if you have trouble reproducing, but for me this happens on basically every .pst file.The text was updated successfully, but these errors were encountered: