Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #666 handle SIGTERM #712

Open
wants to merge 7 commits into
base: rolling
Choose a base branch
from
Open

Conversation

ciandonovan
Copy link

Fixes #666

SIGTERM

Instead of installing signal handlers with the OS directly, ROS2 Launch uses Python's signal.set_wakeup_fd to queue and forward signals to handlers registered from the async event loop.

Although one of these handlers has been set for SIGTERM, it never gets called since when that signal gets delivered, the kernel terminates the process immediately. The process never changes the default disposition of that signal. Delivery of SIGINT however, doesn't cause the kernel to terminate the process, and instead that signal get dutifully forwarded to the appropriate handler. This is despite there being no difference in the ROS2 Launch code in setting them both up.

After running an strace on a one-line Python program and seeing that a handler was still automatically installed for SIGINT, I came across this StackOverflow post. Depending on whether a Python script is run in some combination of foreground process and in an interactive shell, a handler will be installed for SIGINT for use in the KeyboardInterrupt exception. It just so happens that it also gets forwarded by signal.set_wakeup_fd, and since the handler prevents the program from being automatically terminated, it can actually do so.

This is why the handler for SIGINT by chance works, but the one for SIGTERM does not.

The fix for this was to simply install the default interrupt signal handler on any signal the async event loop requested handled, so that they would no longer cause the program to terminate and instead be forwarded to signal.set_wakeup_fd.

SIGQUIT

SIGQUIT signals are often generated by the user hitting 'CTRL+' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group.

By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal or the purposes of forwarding it on to its children.

Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process.

The signal handler for SIGQUIT was removed.

Future work

This Pull Request allows ROS2 Launch to handle SIGTERMs and initiate shutdown, but sends SIGINTs instead of SIGTERMs to the child processes. This will usually be fine since the ROS2 Client Libraries install handlers for both SIGINT and SIGTERM by default, and is no worse than the current situation, but it would be more flexible to allow users to send whatever signals they want to their nodes.

To do this would require removing the due_to_sigint and send_sigint arguments of many functions and replacing them with an argument passing the name of the signal to be forwarded. Care would need to be taken of other signals besides SIGINT such as SIGQUIT and SIGTSTP that also tend to get sent to the entire process group to avoid duplicates. Currently if the child node doesn't shutdown within 5 seconds an "escalated" SIGTERM is sent - but if SIGTERM was the initial signal then should it progress straight to SIGKILL? Or wait 10 seconds?

If others feel this is worth developing let me know and I'll work on another PR, but I feel this is in good shape now and will already be of benefit to others upstream.

@wjwwood - mentioned in the relevant ToDo for this feature

@ciandonovan
Copy link
Author

ciandonovan commented Jun 18, 2023

I had to modify test_signal_management.py to account for the fact I'm using signal.default_int_handler by catching and ignoring KeyboardInterrupt like in the main code.

Maybe some sort of global scope noop function might be more appropriate, needs to be global otherwise the assert won't pass with separate noop functions in the main and test code.

I'd argue however that the changes to the tests would be justified even without the rest of this Pull Request. The current upstream behaviour is that signal.default_int_handler is still installed implicitly, but only for SIGINT, and a KeyboardInterrupt catch is also present to stop it from exiting the program immediately. I've just extended this behaviour to cover any arbitrary signal, and explicitly installed a handler for SIGTERM too.

launch/launch/launch_service.py Outdated Show resolved Hide resolved
launch/launch/launch_service.py Outdated Show resolved Hide resolved
@ciandonovan
Copy link
Author

@fujitatomoya has been a long time since I worked on this, some other regressions were introduced during the rebase also. Agree with your suggestions, but might rework the patches more in general over the next week, take a fresh look at it.
Glad to see interest in getting this resolved, happy to work with you if you have further suggestions or ideas on approach. maybe aligning with that's being done in ros2/ros2cli#895?

@fujitatomoya
Copy link

@ciandonovan i believe your suggestion posted on here makes sense. launch and run process does not really know how to handle the signals, so that they should forward the signals to the application logic to shutdown or finalize the application. application by here could be, storing some data using memory cache, accessing physical hardware device or sensors. in these cases, application logics need to receive the signals to shutdown or finalize it properly, otherwise it could be left unexpected state or error result.

@ciandonovan
Copy link
Author

@fujitatomoya @wjwwood @clalancette is there a case for catching and handling any signals other than SIGINT and SIGTERM?

Currently only SIGINT is handled as no signal handler on the process level is ever installed for anything else, despite the current code assuming so.

My pull request as it stands catches both SIGINT and SIGTERM, and forwards SIGINT to the child processes in both cases. I wrote it to handle any signals installed to the AsyncSafeSignalManager.handle(), but this flexibility might not be required? Also, it doesn't forward any signal if launch is in "noninteractive" mode, which shouldn't be the case for SIGTERM at least as it has different semantics to SIGINT (which is commonly sent to the entire foreground process group on Ctrl-C from the terminal).

If it is the case that only those two signals are of interest, I'll adjust to handle both accordingly - both forwarded when nonintereactive, and only SIGTERM forwarded when interactive. The ROS 2 Client Libraries already handle both signals gracefully so shouldn't introduce any unexpected issues by forwarding SIGTERM.

By default, and only in certain circumstances, SIGINT is the only signal
with a handler installed. All others will perform their function defined
by SIG_DFL, which is to ignore is some cases, and to terminate in others. In
either case, Python will never see them and be unable to pass them to
the async handler.

Signed-off-by: <[email protected]>
Signed-off-by: Cian Donovan <[email protected]>
SIGQUIT signals are often generated by the user hitting 'CTRL+\' in an
interactive shell. However, when the terminal driver receives those
characters, it doesn't just send a SIGQUIT to the running process, but
to the entire foreground process group.

By default, the ROS2 Client Libraries don't register a handler for
SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be
terminated immediately and its core dumpted on receipt. So there is
little reason for ROS2 Launch to catch this signal for the purposes of forwarding it on to its children.

Also, not catching SIGQUIT means that if there is a problem with the
Python signal handling, then the user still has available a shortcut on
their keyboards to unilaterally kill the process.

Signed-off-by: <[email protected]>
Signed-off-by: Cian Donovan <[email protected]>
Allows arbitrary signal handlers to easily be added in future without
writing new functions, DRY.

Also allows at least one signal of each type, printing a warning when
more than one is received. This is to allow in future each signal type
to be passed to the children of launch, so a user can manually escalate
the kill signals.

Signed-off-by: Cian Donovan <[email protected]>
…t_int_handler to match how signals are now handled in Launch

Signed-off-by: Cian Donovan <[email protected]>
@fujitatomoya
Copy link

@ciandonovan i am not sure what signals launch process should deal, because launch does and provides much more than ros2 run that is just a wrapper to execute the command from the arguments and parameters. that is the main reason that i just forward the signals to child process how they want to deal with ros2/ros2cli#895. in general sigint and sigterm need to be handled, but there could be specific signal that launch should handle... that is something i would like to know too.

send_sigint? True : False
+----------------+--------+---------+
|                | SIGINT | SIGTERM |
+----------------+--------+---------+
| interactive    | False  | True    |
| noninteractive | True   | True    |
+----------------+--------+---------+

Launch either detects, through checking if there's a controlling terminal,
or the value of the flag `--noninteractive`, to determine if launch was
launched interactively or not.

If it was launched interactively, then the receipt of SIGINT
is implicitly assumed to have come from the user pressing Ctrl-C
and their terminal sending a SIGINT to the entire foreground process
group. In such case, it is redundant to also send SIGINT from launch to
the child ROS nodes, however it should be noted that even if another
signal was sent there would be no ill-effects.

In all other cases, assume the child nodes were not themselves
signalled, so manually send SIGINT, and later escalate to SIGTERM after
timeout, to the nodes.

Signed-off-by: Cian Donovan <[email protected]>
@ciandonovan
Copy link
Author

    send_sigint? True : False
    +----------------+--------+---------+
    |                | SIGINT | SIGTERM |
    +----------------+--------+---------+
    | interactive    | False  | True    |
    | noninteractive | True   | True    |
    +----------------+--------+---------+

Launch either detects, through checking if there's a controlling terminal, or the value of the flag --noninteractive, to determine if launch was launched interactively or not.

If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C
and their terminal sending a SIGINT to the entire foreground process group. In such case, it is redundant to also send SIGINT from launch to the child ROS nodes, however it should be noted that even if another signal was sent there would be no ill-effects.

In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes.

@ciandonovan
Copy link
Author

@fujitatomoya from reading the Linux signal man page it's hard to make a case for handling any signals other than SIGINT and SIGTERM.

The signals SIGABRT, SIGALRM, SIGCHLD, SIGIO, SIGLOST, SIGPIPE, SIGPOLL, SIGURG, SIGVTALRM, are raised in response to actions taken by the program itself by design.
The signals SIGEMT, SIGPROF, SIGTRAP are only relevant for in-built debugging and profiling (not relevant here).
The signals SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGSYS are unlikely to be raised in a Python program.
The signals SIGCONT, SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU shouldn't be handled as is their actions are normal expected behavior.
The other signals are either unused, synonyms, irrelevant unhandleable, or would be delivered to all processes anyway.

I had considered SIGHUP, but then again if a user explicitly wants their remote processes to continue after a terminal disconnect, then that's already possible with the well-known nohup, so doesn't make sense to alter that default behavior.

Further, only SIGINT and SIGTERM are currently handled by the ROS 2 Client Libraries used by the nodes, so forwarding arbitrary other signals to them would just cause them to immediately terminate anyway, which is not desired.

I think there could be a bigger conversation around changing some of the existing semantics of Launch, in particular I don't think the current behavior of sending a SIGINT, then a SIGTERM after timeout makes much sense. The POSIX standard doesn't specify any order of precedence between them, and rcl handles them both identically. There is no escalation in practice, nor do the signals' semantics indicate so either. There's also code for handling SIGKILL which just isn't possible as that's an uncatchable signal. However, that should probably be addressed in a separate pull request.

As it stands, this pull request is essentially a bugfix. Its behavior is now as it seems it was designed for from reading the docs, the code's comments, and its structure. SIGTERM was never actually handled before at all, as no handler was installed, and now it works quite nicely both being run as a systemd service and in a Podman/Docker container. Also works nice as a Podman container running inside a systemd service ;)

Set is unordered and nonduplicative, better fits storing whether at
least one of each unique signal has been received.

Resolve `signum` into the signal's name as a string once instead of
twice. Cleanup `due_to_sigint` ternary assignment.

Signed-off-by: Cian Donovan <[email protected]>
Copy link
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ciandonovan thanks for cleaning after me :)

About signal escalation vs. signal forwarding. launch was designed to work with non-ROS aware executables too. SIGINT and SIGTERM handling isn't equivalent in general. It does make sense, however, to bypass some of the signal escalation stages based on the signal that triggered it. Essentially skip the SIGINT stage on SIGTERM induced shutdowns, as you suggest somewhere above. HTH.

@@ -235,6 +235,7 @@ def handle(
:return: previous handler if any, otherwise None
"""
signum = signal.Signals(signum)
signal.signal(signum, signal.default_int_handler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ciandonovan thanks for catching this! I dropped the ball here, my bad.

That said, I don't think this is the complete fix. We should only override default signal handlers (chaining non-default ones), and we should restore whatever was in place on AsyncSafeSignalManager exit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I'm not overriding any default handlers, as SIGINT is already handled by signal.default_int_handler, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.

Ideally both would be handled by some global _noop function that just passes, letting signal.set_wakeup_fd pick it up and do its job. Not sure the best way to implement that though. There was a _noop function in test_signal_management.py previously, but the assert fails if a signal handler was installed for SIGTERM as it's pointed to a different handler address.

Copy link
Contributor

@hidmic hidmic May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I'm not overriding any default handlers, as SIGINT is already handled by signal.default_int_handler, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.

Right, but we don't really know which signal number we will be required to handle, and even if we did (or assumed so or constrained it to be such and such), there is also the possibility that some other library, used by launch or using launch, may have set their own as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what should we do here? Is it ok to just override any existing signal handler with when this is called? How does this interact with signal.set_wakeup_fd? What if someone sets their own signal handler after initializing launch, will our handling still function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only existing signal handler at this point is the one the Python binary sets itself for SIGINT to generate the KeyboardInterrupt exception, which I don't change as I point all handlers towards that.
The only other possibility is that someone could have created a handler within their launchfile? Not sure if that's even possible though, and would be terrible practice anyway.
Signal handlers they install in their nodes are of course separate, as signals are a per-process property, and launch is but the parent of the node children.
In terms of signal.set_wakeup_fd, nothing extra needs to be done. Installing the signal handlers here simply prevents the kernel from immediately terminating the process as is the common default disposition, and allows Python to then forward the signal through the fd.

return func(*args, **kwargs)
except KeyboardInterrupt:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ciandonovan hmm, this will skip the entire test with a green check. The asyncio loop won't deliver any handlers if it is interrupted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify why, or what it's interrupting? My understanding is that the assert in the finally section will always get run no matter what exceptions occur so long as they're handled.

Copy link
Contributor

@hidmic hidmic May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, those assert will be executed (and likely succeed), but all the assert in the test_async_safe_signal_manager body after the first signal is raised won't because a KeyboardInterrupt exception will have been raised during asyncio.run_until_complete execution.

self.__logger.warning(base_msg)
ret = self._shutdown(
reason='ctrl-c (SIGINT)', due_to_sigint=True, force_sync=True
reason=base_msg, due_to_sigint=due_to_sigint, force_sync=True
Copy link
Contributor

@hidmic hidmic May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ciandonovan meta: sending a shutdown event is not the same as pulling the brakes on the event loop, but I can see how the previous implementation can cause problems when managing launch itself. Most process managers use SIGTERM. I guess you could argue that SIGKILL escalation is appropriate if a shutdown isn't enough.

@wjwwood
Copy link
Member

wjwwood commented May 16, 2024

We've been discussing this pr and the ros2run signal handling pr (ros2/ros2cli#899) in our Jazzy release channels and I think the plan is to make this a bugfix shortly after the release, and after it has some time to soak in Rolling.

So we'll try to get these merged, and then backport after the release as a bugfix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

Incorrect signal handling
5 participants