Skip to content
This repository has been archived by the owner on Jun 30, 2021. It is now read-only.

smd can destroy the entire mail spool #18

Open
anarcat opened this issue Jun 29, 2021 · 2 comments
Open

smd can destroy the entire mail spool #18

anarcat opened this issue Jun 29, 2021 · 2 comments

Comments

@anarcat
Copy link
Contributor

anarcat commented Jun 29, 2021

I have seen at least three cases of smd attempting to destroy my entire mail spool. In the three cases, I have mostly recovered thanks to backups, notmuch, and the distributed nature of SMD/email. But it's nerve-wracking and, well, really, really bad.

I still don't quite understand what happens. I can't find it in the logs anymore, but I remember seeing something about smd tripping over an empty directory inside the .smd directory. I wrote about his on my blog here:

https://anarc.at/blog/2021-03-22-email-crash/
https://anarc.at/blog/2021-06-29-another-mail-crash/

I think the fundamental problem is that smd-pull and smd-push's locking mechanisms might have a bug. I was able to start both processes at once by furiously hitting "reload" in my email setup (essentially telling systemd to repeately run smd-pull.service), which eventually started at the same time as smd-push. Here's a log from systemd:

jun 28 16:30:04 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Succeeded.
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:09 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:30:22 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:22 curie smd-pull[391140]: Already running.
jun 28 16:30:22 curie smd-pull[391140]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
jun 28 16:30:22 curie smd-pull[391140]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 390135) run(rm /home/anarcat/.smd/lock))
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:22 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Main process exited, code=killed, status=15/TERM
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Failed with result 'signal'.
jun 28 16:31:14 curie systemd[2845]: Stopped push emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Consumed 6.600s CPU time.
jun 28 16:31:22 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:31:23 curie smd-push[391563]: Found lockfile of a dead instance. Ignored.
jun 28 16:31:30 curie smd-push[392598]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(216)
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Succeeded.
jun 28 16:31:30 curie systemd[2845]: Finished push emails with syncmaildir.
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Consumed 4.727s CPU time.
jun 28 16:31:38 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:32:22 curie smd-pull[393092]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(51458), bytes-received(0), xdelta-received(247)
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Main process exited, code=killed, status=15/TERM
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Failed with result 'signal'.
jun 28 16:32:22 curie systemd[2845]: Stopped pull emails with syncmaildir.
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Consumed 12.632s CPU time.

The first thing to notice is that, at 16:30:09, two jobs are started in parallel. This is a bug in the systemd configuration, which shouldn't allow that in the first place. But considering that smd has its own locking primitives, I consider it's also a bug there. At 16:30:22, you see me triggering another run, and then smd-pull notices its other smd-pull process running and aborts. But it's still running because it's actively destroying all mail on the remote server!!

At 16:31:14 you can see I notice the problem and kill the process. And then at 16:32:22 you can see the timer i forgot to disable fired and destroyed thousands of emails locally.

Out fo 300k emails, I lost 220k in this case.

I'll send a PR fixing the systemd configuration, but I can't help but think there should be safeguards in SMD preventing it from destroying everything like this.

anarcat added a commit to anarcat/syncmaildir that referenced this issue Jun 29, 2021
@gares
Copy link
Owner

gares commented Jun 30, 2021

Maybe my implementation of a lock file

check_lockfile() {

is buggy. I don't know if there is a sound lock implementation in shell script.

@anarcat
Copy link
Contributor Author

anarcat commented Jun 30, 2021

that does look "racey": if two programs run this, one could check for the lockfile existence ([ -f $LOCKFILE ]) right before the other creates it (echo $$ > $LOCKFILE) not find it, and then overwrite the other program's lockfile...

typically, you would call flock to do atomic locking, but it doesn't work in all situations either (e.g. NFS). locking is hard, particularly from shell...

https://manpages.debian.org/flock

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants