Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix preliminary unlinking of socket file #23

Merged
merged 1 commit into from
Aug 16, 2023
Merged

Conversation

aszlig
Copy link
Contributor

@aszlig aszlig commented Jul 13, 2020

Currently, this only consists of a regression test but I haven't yet found a very good solution to address this problem.

If we were to mainly tackle the unlink issue (#16), we could just introduce a new flag to disable unlinking altogether.

Unfortunately, this doesn't fully address the actual issue, which is about how we handle sharing versus copying of memory and file descriptors across processes.

For most programs in the wild, this isn't an issue because most of them don't do very complex socket operations, but occasionally - and especially when interacting with other subprocesses - we do run into this issue and then we end up having a very hard time to debug what's going on.

So to describe the issue in more detail, here is a simplified version of how ip2unix currently tracks socket file descriptors:

  • We have an internal registry, which maps the file descriptors of the current process to our internal Socket state.
  • Whenever we get a new socket call, we insert the file descriptor into the registry. At this point, we can't make a final decision about how to handle this socket, because we only know the socket family (eg. AF_INET or AF_INET6) but no details about ports or IP addresses.
  • Once we get a bind or connect call, we look up the file descriptor in the registry and see whether we have a matching rule.
  • If the rule matches, we go ahead and execute the corresponding action (eg. convert to Unix socket, blackhole, reject, etc...).
  • If it doesn't match, we remove the file descriptor from the registry and invoke the real bind/connect function.

This all works fine if everything is run in order and without any multiprocessing involved, but as soon as the application invokes clone, things start to get ugly.

Essentially we have two clone flags that are problematic:

  • CLONE_VM

    If set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process.
    If not set, the child process runs in a separate copy of the memory space of the calling process at the time of the clone call. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other.

  • CLONE_FILES

    If set, the calling process and the child process share the same file descriptor table. Any file descriptor created by the calling process or by the child process is also valid in the other process.
    If not set, the child process inherits a copy of all file descriptors opened in the calling process at the time of the clone call. Subsequent operations that open or close file descriptors, or change file descriptor flags, performed by either the calling process or the child process do not affect the other process.

See the clone(2) manpage for more detailed information.

Here is how these flags are affecting us (btw. this also includes syscalls such as fork or libc functions such as daemon):

  • 😃 CLONE_VM | CLONE_FILES

    Processes share memory and also share file descriptors, which essentially means that we don't need to do anything, because the registry and the file descriptor table are still in par.

  • 😐 CLONE_VM

    Little more tricky, we have one shared registry across both processes, but the file descriptor tables are copied. This means that whenever we're closing a socket in the second process, we can not yet unlink the socket file until the file descriptor is also closed in the first process. This could be done by adding a reference counter for the socket file descriptors.

  • 😒 CLONE_FILES

    We have two registries, but one shared file descriptor table. One way to deal with this could be a mmaped file descriptor that could be used to communicate between the two registries.

    However: How would one indicate presence of the other?

  • 😱 0

    Again, two registries, but also a copy of the file descriptor table.

    At first glance this might be an obvious case of "we don't need to handle it", but consider the scenario mentioned in an earlier comment:

    Process 1 creates a socket, clones into process 2, process 2 closes the socket... now what? Process 1 still has to be able to do operations on the socket, but it's essentially blackholed. We could use memfd_create in conjunction with mmap, we could even inspect procfs, but all of that will make it a nightmare to port to other systems.

Of course, one way to get there would be to wrap the corresponding syscalls and do some kind of IPC between the main process and various subprocesses, eg. via POSIX shared memory objects. This however is a little bit to complicated and I'd like to avoid wrapping additional libc calls as much as possible.

Another less error prone way in terms of moving parts that could go south would be to store all state that we have inside a shared mapping that is bound to a file descriptor. Again memfd_create comes to mind, but is there a more portable way?

Also, is this even feasible or are there other occasions than CLONE_FILES that have an impact here?

What about (M)FD_CLOEXEC? If either none or all the sockets in the registry are using FD_CLOEXEC and we're essentially setting MFD_CLOEXEC IFF all the sockets have FD_CLOEXEC, everything is fine. But if this is not the case, how do we handle this?

@aszlig aszlig linked an issue Jul 13, 2020 that may be closed by this pull request
@aszlig aszlig added the bug Something isn't working label Jul 13, 2020
@Profpatsch
Copy link

You can detect it, right, so you could at least throw a giant multi-line warning for now if the bad cases happen (and maybe tell users to activate the --do-not-unlink flag or similar).

@aszlig
Copy link
Contributor Author

aszlig commented Jul 13, 2020

You can detect it, right, so you could at least throw a giant multi-line warning for now if the bad cases happen (and maybe tell users to activate the --do-not-unlink flag or similar).

Unfortunately, I can only reliably detect it in the first two cases since in the last two cases, all internal datatypes of ip2unix are copies.

This is something I tried to properly fix for ages, but I haven't found
a very good solution. The commit is not what I'd call a good solution or
even a real solution at all, but it should work around problematic
cases.

So here is (simplified) how ip2unix currently tracks socket file
descriptors:

  * We have an internal registry, which maps the file descriptors of the
    current process to our internal Socket state.
  * Whenever we get a new socket() call, we insert the file descriptor
    into the registry. Note that at this point, we can't make a final
    decision about how to handle this socket, because we only know the
    socket family (eg. AF_INET or AF_INET6) but no details about ports
    or IP addresses.
  * Once we get a bind() or connect() call, we look up the file
    descriptor in the registry and see whether we have a matching rule.
  * If the rule matches, we go ahead and execute the corresponding
    action (eg. convert to Unix socket, blackhole, reject, etc...), if
    it doesn't we just remove the file descriptor from the registry and
    call the real bind() or connect() function.
  * Once the socket is closed, we remove the file descriptor from the
    registry and (where applicable) unlink the socket file.

This all works fine if everything is run in order and without any
multiprocessing involved, but as soon as the application invokes
clone(), things start to get ugly.

Essentially we have two clone() flags that are problematic:

  CLONE_VM:

    If set, the calling process and the child process run in the same
    memory space. In particular, memory writes performed by the calling
    process or by the child process are also visible in the other
    process.

    If not set, the child process runs in a separate copy of the memory
    space of the calling process at the time of the clone call. Memory
    writes or file mappings/unmappings performed by one of the processes
    do not affect the other.

  CLONE_FILES:

    If set, the calling process and the child process share the same
    file descriptor table. Any file descriptor created by the calling
    process or by the child process is also valid in the other process.

    If not set, the child process inherits a copy of all file
    descriptors opened in the calling process at the time of the clone
    call. Subsequent operations that open or close file descriptors, or
    change file descriptor flags, performed by either the calling
    process or the child process do not affect the other process.

Here is how these flags are affecting us:

 Both CLONE_VM and CLONE_FILES set:

   Processes share memory and also share file descriptors, which
   essentially means that we don't need to do anything, because the
   registry and the file descriptor table is all that we need for
   properly tracking the state.

 Only CLONE_VM set:

   This is a little more tricky since we have one shared registry across
   both processes, but the file descriptor tables are copied. This means
   that whenever we're closing a socket in the second process, we can
   not yet unlink the socket file until the file descriptor is also
   closed in the first process. We could basically just implement a
   garbage collector when the used socket file descriptors reach zero.

 Only CLONE_FILES set:

   Here we have two registries, but one shared file descriptor table. In
   this scenario, we could use a dedicated file descriptor that we can
   use for sharing state between two registries.

 None of those flags set:

   Again, two registries, but also a copy of the file descriptor table.

   This is our nightmare scenario, consider this chain of events:

     * Process 1 creates a socket
     * Process 1 clone()s into process 2
     * Process 2 closes the socket and ip2unix unlinks the file
     * ... wait, what!?

   Process 1 still has to be able to operate on the socked but we
   essentially made the socket a blackhole since it's no longer
   reachable via the filesystem. There is also no way to handle this
   properly without also introducing some side channel that tracks the
   state from outside.

The non-solution I went with is basically adding a new flag called
"noremove", which prevents the socket path from being unlinked when the
socket is closed.

This doesn't fully solve the issue because ip2unix still removes the
socket file descriptor from the registry and also has no way to even
know if it is a socket it should track.

However, this allows us to deal with the issue until we have found a
real solution and the test case I added is essentially replicating most
programs I found in the wild.

Signed-off-by: aszlig <[email protected]>
Closes: #16
@aszlig aszlig merged commit 150da4d into master Aug 16, 2023
2 checks passed
@aszlig aszlig deleted the fix-preliminary-unlink branch August 16, 2023 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Socket gets unlinked when children closes the socket file descriptor
2 participants