Fix #666 handle SIGTERM #712

ciandonovan · 2023-06-18T03:44:50Z

Fixes #666

SIGTERM

Instead of installing signal handlers with the OS directly, ROS2 Launch uses Python's signal.set_wakeup_fd to queue and forward signals to handlers registered from the async event loop.

Although one of these handlers has been set for SIGTERM, it never gets called since when that signal gets delivered, the kernel terminates the process immediately. The process never changes the default disposition of that signal. Delivery of SIGINT however, doesn't cause the kernel to terminate the process, and instead that signal get dutifully forwarded to the appropriate handler. This is despite there being no difference in the ROS2 Launch code in setting them both up.

After running an strace on a one-line Python program and seeing that a handler was still automatically installed for SIGINT, I came across this . Depending on whether a Python script is run in some combination of foreground process and in an interactive shell, a handler will be installed for SIGINT for use in the KeyboardInterrupt exception. It just so happens that it also gets forwarded by signal.set_wakeup_fd, and since the handler prevents the program from being automatically terminated, it can actually do so.

This is why the handler for SIGINT by chance works, but the one for SIGTERM does not.

The fix for this was to simply install the default interrupt signal handler on any signal the async event loop requested handled, so that they would no longer cause the program to terminate and instead be forwarded to signal.set_wakeup_fd.

SIGQUIT

SIGQUIT signals are often generated by the user hitting 'CTRL+' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group.

By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal or the purposes of forwarding it on to its children.

Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process.

The signal handler for SIGQUIT was removed.

Future work

This Pull Request allows ROS2 Launch to handle SIGTERMs and initiate shutdown, but sends SIGINTs instead of SIGTERMs to the child processes. This will usually be fine since the ROS2 Client Libraries install handlers for both SIGINT and SIGTERM by default, and is no worse than the current situation, but it would be more flexible to allow users to send whatever signals they want to their nodes.

To do this would require removing the due_to_sigint and send_sigint arguments of many functions and replacing them with an argument passing the name of the signal to be forwarded. Care would need to be taken of other signals besides SIGINT such as SIGQUIT and SIGTSTP that also tend to get sent to the entire process group to avoid duplicates. Currently if the child node doesn't shutdown within 5 seconds an "escalated" SIGTERM is sent - but if SIGTERM was the initial signal then should it progress straight to SIGKILL? Or wait 10 seconds?

If others feel this is worth developing let me know and I'll work on another PR, but I feel this is in good shape now and will already be of benefit to others upstream.

@wjwwood - mentioned in the relevant ToDo for this feature

ciandonovan · 2023-06-18T23:07:25Z

I had to modify test_signal_management.py to account for the fact I'm using signal.default_int_handler by catching and ignoring KeyboardInterrupt like in the main code.

Maybe some sort of global scope noop function might be more appropriate, needs to be global otherwise the assert won't pass with separate noop functions in the main and test code.

I'd argue however that the changes to the tests would be justified even without the rest of this Pull Request. The current upstream behaviour is that signal.default_int_handler is still installed implicitly, but only for SIGINT, and a KeyboardInterrupt catch is also present to stop it from exiting the program immediately. I've just extended this behaviour to cover any arbitrary signal, and explicitly installed a handler for SIGTERM too.

launch/launch/launch_service.py

ciandonovan · 2024-04-26T14:02:41Z

@fujitatomoya has been a long time since I worked on this, some other regressions were introduced during the rebase also. Agree with your suggestions, but might rework the patches more in general over the next week, take a fresh look at it.
Glad to see interest in getting this resolved, happy to work with you if you have further suggestions or ideas on approach. maybe aligning with that's being done in ros2/ros2cli#895?

fujitatomoya · 2024-04-26T15:52:28Z

@ciandonovan i believe your suggestion posted on here makes sense. launch and run process does not really know how to handle the signals, so that they should forward the signals to the application logic to shutdown or finalize the application. application by here could be, storing some data using memory cache, accessing physical hardware device or sensors. in these cases, application logics need to receive the signals to shutdown or finalize it properly, otherwise it could be left unexpected state or error result.

ciandonovan · 2024-05-01T17:12:10Z

@fujitatomoya @wjwwood @clalancette is there a case for catching and handling any signals other than SIGINT and SIGTERM?

Currently only SIGINT is handled as no signal handler on the process level is ever installed for anything else, despite the current code assuming so.

My pull request as it stands catches both SIGINT and SIGTERM, and forwards SIGINT to the child processes in both cases. I wrote it to handle any signals installed to the AsyncSafeSignalManager.handle(), but this flexibility might not be required? Also, it doesn't forward any signal if launch is in "noninteractive" mode, which shouldn't be the case for SIGTERM at least as it has different semantics to SIGINT (which is commonly sent to the entire foreground process group on Ctrl-C from the terminal).

If it is the case that only those two signals are of interest, I'll adjust to handle both accordingly - both forwarded when nonintereactive, and only SIGTERM forwarded when interactive. The ROS 2 Client Libraries already handle both signals gracefully so shouldn't introduce any unexpected issues by forwarding SIGTERM.

By default, and only in certain circumstances, SIGINT is the only signal with a handler installed. All others will perform their function defined by SIG_DFL, which is to ignore is some cases, and to terminate in others. In either case, Python will never see them and be unable to pass them to the async handler. Signed-off-by: <[email protected]> Signed-off-by: Cian Donovan <[email protected]>

SIGQUIT signals are often generated by the user hitting 'CTRL+\' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group. By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal for the purposes of forwarding it on to its children. Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process. Signed-off-by: <[email protected]> Signed-off-by: Cian Donovan <[email protected]>

Allows arbitrary signal handlers to easily be added in future without writing new functions, DRY. Also allows at least one signal of each type, printing a warning when more than one is received. This is to allow in future each signal type to be passed to the children of launch, so a user can manually escalate the kill signals. Signed-off-by: Cian Donovan <[email protected]>

Signed-off-by: Cian Donovan <[email protected]>

…t_int_handler to match how signals are now handled in Launch Signed-off-by: Cian Donovan <[email protected]>

fujitatomoya · 2024-05-01T19:10:05Z

@ciandonovan i am not sure what signals launch process should deal, because launch does and provides much more than ros2 run that is just a wrapper to execute the command from the arguments and parameters. that is the main reason that i just forward the signals to child process how they want to deal with ros2/ros2cli#895. in general sigint and sigterm need to be handled, but there could be specific signal that launch should handle... that is something i would like to know too.

send_sigint? True : False +----------------+--------+---------+ | | SIGINT | SIGTERM | +----------------+--------+---------+ | interactive | False | True | | noninteractive | True | True | +----------------+--------+---------+ Launch either detects, through checking if there's a controlling terminal, or the value of the flag `--noninteractive`, to determine if launch was launched interactively or not. If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C and their terminal sending a SIGINT to the entire foreground process group. In such case, it is redundant to also send SIGINT from launch to the child ROS nodes, however it should be noted that even if another signal was sent there would be no ill-effects. In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes. Signed-off-by: Cian Donovan <[email protected]>

ciandonovan · 2024-05-02T12:17:24Z

    send_sigint? True : False
    +----------------+--------+---------+
    |                | SIGINT | SIGTERM |
    +----------------+--------+---------+
    | interactive    | False  | True    |
    | noninteractive | True   | True    |
    +----------------+--------+---------+

Launch either detects, through checking if there's a controlling terminal, or the value of the flag --noninteractive, to determine if launch was launched interactively or not.

If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C
and their terminal sending a SIGINT to the entire foreground process group. In such case, it is redundant to also send SIGINT from launch to the child ROS nodes, however it should be noted that even if another signal was sent there would be no ill-effects.

In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes.

ciandonovan · 2024-05-02T12:48:27Z

@fujitatomoya from reading the Linux signal man page it's hard to make a case for handling any signals other than SIGINT and SIGTERM.

The signals SIGABRT, SIGALRM, SIGCHLD, SIGIO, SIGLOST, SIGPIPE, SIGPOLL, SIGURG, SIGVTALRM, are raised in response to actions taken by the program itself by design.
The signals SIGEMT, SIGPROF, SIGTRAP are only relevant for in-built debugging and profiling (not relevant here).
The signals SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGSYS are unlikely to be raised in a Python program.
The signals SIGCONT, SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU shouldn't be handled as is their actions are normal expected behavior.
The other signals are either unused, synonyms, irrelevant unhandleable, or would be delivered to all processes anyway.

I had considered SIGHUP, but then again if a user explicitly wants their remote processes to continue after a terminal disconnect, then that's already possible with the well-known nohup, so doesn't make sense to alter that default behavior.

Further, only SIGINT and SIGTERM are currently handled by the ROS 2 Client Libraries used by the nodes, so forwarding arbitrary other signals to them would just cause them to immediately terminate anyway, which is not desired.

I think there could be a bigger conversation around changing some of the existing semantics of Launch, in particular I don't think the current behavior of sending a SIGINT, then a SIGTERM after timeout makes much sense. The POSIX standard doesn't specify any order of precedence between them, and rcl handles them both identically. There is no escalation in practice, nor do the signals' semantics indicate so either. There's also code for handling SIGKILL which just isn't possible as that's an uncatchable signal. However, that should probably be addressed in a separate pull request.

As it stands, this pull request is essentially a bugfix. Its behavior is now as it seems it was designed for from reading the docs, the code's comments, and its structure. SIGTERM was never actually handled before at all, as no handler was installed, and now it works quite nicely both being run as a systemd service and in a Podman/Docker container. Also works nice as a Podman container running inside a systemd service ;)

Set is unordered and nonduplicative, better fits storing whether at least one of each unique signal has been received. Resolve `signum` into the signal's name as a string once instead of twice. Cleanup `due_to_sigint` ternary assignment. Signed-off-by: Cian Donovan <[email protected]>

hidmic

@ciandonovan thanks for cleaning after me :)

About signal escalation vs. signal forwarding. launch was designed to work with non-ROS aware executables too. SIGINT and SIGTERM handling isn't equivalent in general. It does make sense, however, to bypass some of the signal escalation stages based on the signal that triggered it. Essentially skip the SIGINT stage on SIGTERM induced shutdowns, as you suggest somewhere above. HTH.

hidmic · 2024-05-09T16:29:07Z

launch/launch/utilities/signal_management.py

@@ -235,6 +235,7 @@ def handle(
        :return: previous handler if any, otherwise None
        """
        signum = signal.Signals(signum)
+        signal.signal(signum, signal.default_int_handler)


@ciandonovan thanks for catching this! I dropped the ball here, my bad.

That said, I don't think this is the complete fix. We should only override default signal handlers (chaining non-default ones), and we should restore whatever was in place on AsyncSafeSignalManager exit.

In this case I'm not overriding any default handlers, as SIGINT is already handled by signal.default_int_handler, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.

Ideally both would be handled by some global _noop function that just passes, letting signal.set_wakeup_fd pick it up and do its job. Not sure the best way to implement that though. There was a _noop function in test_signal_management.py previously, but the assert fails if a signal handler was installed for SIGTERM as it's pointed to a different handler address.

In this case I'm not overriding any default handlers, as SIGINT is already handled by signal.default_int_handler, and SIGTERM isn't handled at all, so I'm just setting it to be handled by the same function.

Right, but we don't really know which signal number we will be required to handle, and even if we did (or assumed so or constrained it to be such and such), there is also the possibility that some other library, used by launch or using launch, may have set their own as well.

So what should we do here? Is it ok to just override any existing signal handler with when this is called? How does this interact with signal.set_wakeup_fd? What if someone sets their own signal handler after initializing launch, will our handling still function?

The only existing signal handler at this point is the one the Python binary sets itself for SIGINT to generate the KeyboardInterrupt exception, which I don't change as I point all handlers towards that.
The only other possibility is that someone could have created a handler within their launchfile? Not sure if that's even possible though, and would be terrible practice anyway.
Signal handlers they install in their nodes are of course separate, as signals are a per-process property, and launch is but the parent of the node children.
In terms of signal.set_wakeup_fd, nothing extra needs to be done. Installing the signal handlers here simply prevents the kernel from immediately terminating the process as is the common default disposition, and allows Python to then forward the signal through the fd.

hidmic · 2024-05-09T16:36:23Z

launch/test/launch/utilities/test_signal_management.py

                return func(*args, **kwargs)
+            except KeyboardInterrupt:
+                pass


@ciandonovan hmm, this will skip the entire test with a green check. The asyncio loop won't deliver any handlers if it is interrupted.

Could you clarify why, or what it's interrupting? My understanding is that the assert in the finally section will always get run no matter what exceptions occur so long as they're handled.

Right, those assert will be executed (and likely succeed), but all the assert in the test_async_safe_signal_manager body after the first signal is raised won't because a KeyboardInterrupt exception will have been raised during asyncio.run_until_complete execution.

hidmic · 2024-05-09T16:40:35Z

launch/launch/launch_service.py

                    self.__logger.warning(base_msg)
                    ret = self._shutdown(
-                        reason='ctrl-c (SIGINT)', due_to_sigint=True, force_sync=True
+                        reason=base_msg, due_to_sigint=due_to_sigint, force_sync=True


@ciandonovan meta: sending a shutdown event is not the same as pulling the brakes on the event loop, but I can see how the previous implementation can cause problems when managing launch itself. Most process managers use SIGTERM. I guess you could argue that SIGKILL escalation is appropriate if a shutdown isn't enough.

wjwwood · 2024-05-16T20:38:04Z

We've been discussing this pr and the ros2run signal handling pr (ros2/ros2cli#899) in our Jazzy release channels and I think the plan is to make this a bugfix shortly after the release, and after it has some time to soak in Rolling.

So we'll try to get these merged, and then backport after the release as a bugfix.

ciandonovan requested review from adityapande-1995, methylDragon and wjwwood as code owners June 18, 2023 03:44

ciandonovan force-pushed the rolling branch 7 times, most recently from 3ee156e to 708f098 Compare June 18, 2023 22:53

ciandonovan force-pushed the rolling branch from 6b1d541 to 708f098 Compare June 20, 2023 15:29

clalancette assigned adityapande-1995 Jun 29, 2023

ciandonovan changed the title ~~Fix #666 signal handling issue~~ Fix #666 handle SIGTERM Jul 9, 2023

ciandonovan mentioned this pull request Jul 10, 2023

ros2 launch with ComposableNodes doesn't terminate on SIGTERM ros2/launch_ros#223

Closed

ciandonovan mentioned this pull request Oct 12, 2023

Ros_launch.sh and example hsp-iit/tour-guide-robot#45

Merged

ciandonovan force-pushed the rolling branch from 708f098 to ae85f7a Compare April 24, 2024 20:57

fujitatomoya reviewed Apr 25, 2024

View reviewed changes

launch/launch/launch_service.py Outdated Show resolved Hide resolved

launch/launch/launch_service.py Outdated Show resolved Hide resolved

ciandonovan added 5 commits May 1, 2024 19:32

Remove platform to pass code style check currently failing

b1dfecc

Signed-off-by: Cian Donovan <[email protected]>

Replace test_signal_management.py __noop handler with standard defaul…

63ebbe3

…t_int_handler to match how signals are now handled in Launch Signed-off-by: Cian Donovan <[email protected]>

ciandonovan force-pushed the rolling branch from ae85f7a to 63ebbe3 Compare May 1, 2024 18:33

ciandonovan requested a review from fujitatomoya May 1, 2024 18:41

ciandonovan mentioned this pull request May 2, 2024

Stopping containers doesn't disable motor torque helix-robotics-ag/ros-helix#6

Open

ciandonovan mentioned this pull request May 9, 2024

Handle signals within the asyncio loop. #476

Merged

hidmic reviewed May 9, 2024

View reviewed changes

fujitatomoya mentioned this pull request Jun 10, 2024

context.on_shutdown is never called ros2/rclpy#1287

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #666 handle SIGTERM #712

Fix #666 handle SIGTERM #712

ciandonovan commented Jun 18, 2023

ciandonovan commented Jun 18, 2023 •

edited

Loading

ciandonovan commented Apr 26, 2024

fujitatomoya commented Apr 26, 2024

ciandonovan commented May 1, 2024

fujitatomoya commented May 1, 2024

ciandonovan commented May 2, 2024

ciandonovan commented May 2, 2024

hidmic left a comment •

edited

Loading

hidmic May 9, 2024

ciandonovan May 12, 2024

hidmic May 13, 2024 •

edited

Loading

wjwwood May 16, 2024

ciandonovan May 16, 2024

hidmic May 9, 2024

ciandonovan May 12, 2024

hidmic May 13, 2024 •

edited

Loading

hidmic May 9, 2024 •

edited

Loading

wjwwood commented May 16, 2024

Fix #666 handle SIGTERM #712

Are you sure you want to change the base?

Fix #666 handle SIGTERM #712

Conversation

ciandonovan commented Jun 18, 2023

SIGTERM

SIGQUIT

Future work

ciandonovan commented Jun 18, 2023 • edited Loading

ciandonovan commented Apr 26, 2024

fujitatomoya commented Apr 26, 2024

ciandonovan commented May 1, 2024

fujitatomoya commented May 1, 2024

ciandonovan commented May 2, 2024

ciandonovan commented May 2, 2024

hidmic left a comment • edited Loading

Choose a reason for hiding this comment

hidmic May 9, 2024

Choose a reason for hiding this comment

ciandonovan May 12, 2024

Choose a reason for hiding this comment

hidmic May 13, 2024 • edited Loading

Choose a reason for hiding this comment

wjwwood May 16, 2024

Choose a reason for hiding this comment

ciandonovan May 16, 2024

Choose a reason for hiding this comment

hidmic May 9, 2024

Choose a reason for hiding this comment

ciandonovan May 12, 2024

Choose a reason for hiding this comment

hidmic May 13, 2024 • edited Loading

Choose a reason for hiding this comment

hidmic May 9, 2024 • edited Loading

Choose a reason for hiding this comment

wjwwood commented May 16, 2024

ciandonovan commented Jun 18, 2023 •

edited

Loading

hidmic left a comment •

edited

Loading

hidmic May 13, 2024 •

edited

Loading

hidmic May 13, 2024 •

edited

Loading

hidmic May 9, 2024 •

edited

Loading