Feature request: readpst should produce identical output for identical mails (?switch-controlled? boundary behavior) #9

Open
fhanzlik opened this issue Apr 21, 2023 · 4 comments

Comments

@fhanzlik

My current task is to extract mails from several (10+) .PST files (all from one account, collected over the past 15 years or so as backups), remove duplicates, and convert the mails into a Maildir structure.
My idea was to extract the individual messages from these .PST files (using libpst/readpst) into separate trees, then delete the duplicates (using e.g. fdupes), and then merge the results.

In practice (apart from the problem of getting a different number of extracted files when processing one .pst file repeatedly, which issue #7 touches on), I ran into a problem detecting identical/duplicate messages, because readpst currently generates the internal message boundaries as random strings. Thus even identical messages do not appear identical:

$ diff /home/mail/outlook-r2020/archive.pst.mdi/.Doručená\ pošta/cur/1681064600.005298:2,S /home/mail/outlook-r2023/outlook.pst.mdi/.Doručená\ pošta/cur/1681059416.005051:2,S 
38c38
<       boundary="--boundary-LibPST-iamunique-1906170776_-_-"
---
>       boundary="--boundary-LibPST-iamunique-1627685354_-_-"  
41c41
< ----boundary-LibPST-iamunique-1906170776_-_-
---
> ----boundary-LibPST-iamunique-1627685354_-_-  
112782c112782
< ----boundary-LibPST-iamunique-1906170776_-_-
---
> ----boundary-LibPST-iamunique-1627685354_-_-  

Perhaps it should be possible (via some switch controlling this behavior) to generate boundary strings that are predictable and the same across all mails, so that identical mails would also be represented by identical message files (in terms of content, not file names).
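
Until such a switch exists, a possible workaround is to compare messages after masking the random part of the boundary. The following is only a minimal sketch: the pattern is inferred from the diff output above (fixed prefix, a run of digits, fixed suffix), and the helper names `normalized_digest` and `find_duplicates` are hypothetical and not part of libpst.

```python
import hashlib
import re

# Assumption: boundaries always look like the ones in the diff above,
# i.e. "--boundary-LibPST-iamunique-<digits>_-_-".
BOUNDARY_RE = re.compile(rb"--boundary-LibPST-iamunique-\d+_-_-")
PLACEHOLDER = b"--boundary-LibPST-iamunique-0_-_-"


def normalized_digest(raw: bytes) -> str:
    """Hash a message after replacing every random boundary with a fixed one."""
    return hashlib.sha256(BOUNDARY_RE.sub(PLACEHOLDER, raw)).hexdigest()


def find_duplicates(paths):
    """Group message files by normalized digest; return groups with 2+ files."""
    groups = {}
    for path in paths:
        with open(path, "rb") as fh:
            groups.setdefault(normalized_digest(fh.read()), []).append(path)
    return [group for group in groups.values() if len(group) > 1]
```

This leaves the extracted files untouched and only changes what is compared, so it could stand in for fdupes in the deduplication step. A message whose body happens to contain the same boundary-like string would be masked as well, which seems unlikely to matter for this purpose.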

Thanks in advance, Franta Hanzlík

@pabs3
Member

pabs3 commented Apr 21, 2023 via email

@fhanzlik
Author

Hi Paul,
I can help with some testing, maybe even with scripting or creating an RPM package, or by contributing some money to support the project (I didn't find out how to do that here), but programming will probably be beyond my abilities. I apologize.
Franta Hanzlik

@pabs3
Member

pabs3 commented Apr 22, 2023 via email

@pabs3 pabs3 closed this as completed in c0189cd Aug 14, 2023
pabs3 added a commit that referenced this issue Aug 14, 2023
Fall back on the current time when the item creation date is missing.

This makes the mail file output deterministic most of the time,
which means it is easier to compare the results.

Leave a FIXME about using the attendee/owner critical change property for
DTSTAMP, since extracting those from the PST data isn't supported yet.

Partially-fixes: #9
@pabs3
Member

pabs3 commented Aug 14, 2023

@fhanzlik this feature has been implemented in git, could you test it? It works for me but I'd like a second set of eyes and data before closing it.
