My system has two nodes, QNX and Android, using routingmanagerd and a TCP socket. In any situation where the network "goes away and comes back", such as unplugging and replugging the network cable, the routing manager leaks TCP sockets. After this happens enough times, the process reaches the OS limit on open file descriptors and fails or terminates.
This behavior seems to be present in every version I've tested, from 3.4.10 back to 3.1.20, and I've reproduced it on both QNX and Android.
Reproduction Steps
Use an `ifconfig down; sleep 10; ifconfig up` cycle to break and re-establish the network connection. This should be equivalent to many other cases where a node goes away: physically unplugging and replugging the network cable, suspend-to-RAM and resume, etc.
Then use `netstat` or any other mechanism to study which sockets and file descriptors the routingmanagerd process has open.
Expected behaviour
Except for transient observations, I expect a single TCP socket (in each direction) between node A's routing manager and node B's routing manager. If the code is going to detect outages and reconnect, it should not leak.
Logs and Screenshots
After running `ifconfig down; sleep 10; ifconfig up` enough times, `netstat` shows a pile of sockets:
```
tcp 0 0 10.6.0.10:30510 10.6.0.3:65389 ESTABLISHED
tcp 0 0 10.6.0.10:30510 10.6.0.3:65435 ESTABLISHED
tcp 0 0 10.6.0.10:30510 10.6.0.3:65419 ESTABLISHED
tcp 0 0 10.6.0.10:30510 10.6.0.3:65410 ESTABLISHED
tcp 0 0 10.6.0.10:30510 10.6.0.3:65399 ESTABLISHED
```
Based on my testing, it looks like vsomeip is vulnerable to this sequence:
1. A TCP socket is open and operational.
2. The connection is severed (the cable is cut, unplugged, etc.).
3. The remote end decides the TCP socket is closed. Any RST is lost because the connection is severed, so the local end is never informed of the break.
4. The connection is re-established and a new TCP socket is created, with no traffic on the old TCP socket that would produce an RST to signal it is closed.
5. No further attempt is made to speak on the old TCP socket, so the code never notices it is closed.
The default TCP keepalive is ~2 hours on Linux (set by /proc/sys/net/ipv4/) and unknown on QNX. We tested reducing this to 30 s, and "dead" sockets were cleaned up much faster once the OS noticed the broken connection. However, even with the shorter keepalive, we still reproduced some sockets that stick around "forever", well past the keepalive interval. Not sure why.
At this point we're thinking about two approaches:
1. Continue with TCP keepalive and passively rely on existing code paths to close and release connection objects the next time any function returns an error status, and figure out why some sockets get "stuck around forever".
2. Pursue an alternate strategy that actively closes and releases existing connection objects, perhaps in the accept_cbk when a new connection is made.
With the code changes I've applied locally to address this issue, sockets still "leak", but the TCP keepalive will detect and close them within at most 14 seconds.
vSomeip Version
v3.4.10
Boost Version
1.82
Environment
Android and QNX