Healthcheck creating infinite zombie processes #2311

fisher60 · 2023-07-31T19:49:17Z

Describe the problem

I am having trouble finding the root cause of this problem, but one of my containers was successfully starting and running, but indicating unhealthy and spawning infinite health checks. A new zombie process was being created almost twice per minute and it appears to be originating from the health checks of a specific docker container.

I am unsure if this is due to a specific plugin configuration or odd state achieved. I am unable to reliably reproduce the issue as I run more than 8 nearly identical minecraft servers with a similar or identical plugin stack and similar or identical settings/configurations.

The zombie processes continue unless the container is manually stopped and I only noticed the issue since I actually ran out of process IDs on the bare-metal server and it crashed my entire web server.

As of writing this, the container will not start at all, which is making debugging difficult (this is likely just a result of broken plugins and not related to the above issue).

Container definition

services:
  mc:
    image: itzg/minecraft-server
    ports:
      - ${SERVER_PORT}:${SERVER_PORT}
      - ${RCON_PORT}:${RCON_PORT}
    env_file:
      - .env
    environment:
      EULA: "TRUE"
      TYPE: "PAPER"
      VERSION: "1.19.3"
      JVM_DD_OPTS: "-Paper.bypassHostCheck true -paper.useLegacyPluginLoading true"
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ./data:/data

Container logs

gurkcraft-1-mc-1  | [init] Running as uid=1000 gid=1000 with /data as 'drwxrwxr-x 13 1000 1000 4096 Jul 31 19:20 /data'
gurkcraft-1-mc-1  | [init] Resolved version given 1.19.3 into 1.19.3 and major version 1.19
gurkcraft-1-mc-1  | [init] Resolving type given PAPER
gurkcraft-1-mc-1  | [init] Removing old PaperMC versions ...
gurkcraft-1-mc-1  | [init] Downloading PaperMC 1.19.3 (build 448) ...
gurkcraft-1-mc-1  | [init] Setting initial memory to 4G and max to 4G
gurkcraft-1-mc-1  | [init] Starting the Minecraft server...
gurkcraft-1-mc-1  | Starting org.bukkit.craftbukkit.Main
gurkcraft-1-mc-1  | *** Warning, you've not updated in a while! ***
gurkcraft-1-mc-1  | *** Please download a new build as per instructions from https://papermc.io/downloads ***
gurkcraft-1-mc-1  | System Info: Java 17 (OpenJDK 64-Bit Server VM 17.0.6+10) Host: Linux 5.15.0-78-generic (amd64)
gurkcraft-1-mc-1  | Loading libraries, please wait...
gurkcraft-1-mc-1  | 2023-07-31 19:36:07,591 ServerMain WARN Advanced terminal features are not available in this environment
gurkcraft-1-mc-1  | [19:36:09 INFO]: Building unoptimized datafixer
gurkcraft-1-mc-1  | [19:36:10 INFO]: Environment: authHost='https://authserver.mojang.com', accountsHost='https://api.mojang.com', sessionHost='https://sessionserver.mojang.com', servicesHost='https://api.minecraftservices.com', name='PROD'
gurkcraft-1-mc-1  | [19:36:11 INFO]: Loaded 7 recipes
gurkcraft-1-mc-1  | [19:36:11 INFO]: Starting minecraft server version 1.19.3
gurkcraft-1-mc-1  | [19:36:11 INFO]: Loading properties
gurkcraft-1-mc-1  | [19:36:11 INFO]: This server is running Paper version git-Paper-448 (MC: 1.19.3) (Implementing API version 1.19.3-R0.1-SNAPSHOT) (Git: 155aa36)
gurkcraft-1-mc-1  | [19:36:11 INFO]: Server Ping Player Sample Count: 12
gurkcraft-1-mc-1  | [19:36:11 INFO]: Using 4 threads for Netty based IO
gurkcraft-1-mc-1  | [19:36:11 INFO]: [ChunkTaskScheduler] Chunk system is using 1 I/O threads, 3 worker threads, and gen parallelism of 3 threads
gurkcraft-1-mc-1  | [19:36:11 INFO]: Default game type: SURVIVAL
gurkcraft-1-mc-1  | [19:36:11 INFO]: Generating keypair
gurkcraft-1-mc-1  | [19:36:11 INFO]: Starting Minecraft server on *:25566
gurkcraft-1-mc-1  | [19:36:11 INFO]: Using epoll channel type
gurkcraft-1-mc-1  | [19:36:11 INFO]: Paper: Using libdeflate (Linux x86_64) compression from Velocity.
gurkcraft-1-mc-1  | [19:36:11 INFO]: Paper: Using OpenSSL 1.1.x (Linux x86_64) cipher from Velocity.
gurkcraft-1-mc-1  | [19:36:11 INFO]: Paper: Using Proxy Protocol
gurkcraft-1-mc-1  | [19:36:12 INFO]: [FastLogin] Initialize logging service
gurkcraft-1-mc-1  | [19:36:12 INFO]: [ProtocolLib] Loading server plugin ProtocolLib v5.0.0-SNAPSHOT-b612
gurkcraft-1-mc-1  | [19:36:12 INFO]: [AuthMe] Loading server plugin AuthMe v5.6.0-beta2-b2453
gurkcraft-1-mc-1  | [19:36:12 INFO]: [GravityControl] Loading server plugin GravityControl v2.0.0
gurkcraft-1-mc-1  | [19:36:12 INFO]: [FastLogin] Loading server plugin FastLogin v1.12-SNAPSHOT-971b0e1
gurkcraft-1-mc-1  | [19:36:12 INFO]: [GriefPrevention] Loading server plugin GriefPrevention v16.18
gurkcraft-1-mc-1  | [19:36:12 INFO]: Server permissions file permissions.yml is empty, ignoring it
gurkcraft-1-mc-1  | [19:36:12 INFO]: [ProtocolLib] Enabling ProtocolLib v5.0.0-SNAPSHOT-b612
gurkcraft-1-mc-1  | [19:36:12 WARN]: **** SERVER IS RUNNING IN OFFLINE/INSECURE MODE!
gurkcraft-1-mc-1  | [19:36:12 WARN]: The server will make no attempt to authenticate usernames. Beware.
gurkcraft-1-mc-1  | [19:36:12 WARN]: While this makes the game possible to play without internet access, it also opens up the ability for hackers to connect with any username they choose.
gurkcraft-1-mc-1  | [19:36:12 WARN]: To change this, set "online-mode" to "true" in the server.properties file.
gurkcraft-1-mc-1  | [19:36:12 INFO]: Preparing level "Gurkcraft S3"
gurkcraft-1-mc-1  | [19:36:13 INFO]: Preparing start region for dimension minecraft:overworld
gurkcraft-1-mc-1  | [19:36:13 INFO]: Time elapsed: 85 ms
gurkcraft-1-mc-1  | [19:36:13 INFO]: Preparing start region for dimension minecraft:the_nether
gurkcraft-1-mc-1  | [19:36:13 INFO]: Time elapsed: 26 ms
gurkcraft-1-mc-1  | [19:36:13 INFO]: Preparing start region for dimension minecraft:the_end
gurkcraft-1-mc-1  | [19:36:13 INFO]: Time elapsed: 17 ms
gurkcraft-1-mc-1  | [19:36:13 INFO]: [AuthMe] Enabling AuthMe v5.6.0-beta2-b2453
gurkcraft-1-mc-1  | [19:36:13 INFO]: [AuthMe] SQLite Setup finished
gurkcraft-1-mc-1  | [19:36:13 INFO]: [AuthMe] No supported permissions system found! Permissions are disabled!
gurkcraft-1-mc-1  | [19:36:13 INFO]: [AuthMe] AuthMe 5.6.0-beta2 build n.2453 successfully enabled!
gurkcraft-1-mc-1  | [19:36:13 INFO]: [GravityControl] Enabling GravityControl v2.0.0
gurkcraft-1-mc-1  | [19:36:14 INFO]: [FastLogin] Enabling FastLogin v1.12-SNAPSHOT-971b0e1
gurkcraft-1-mc-1  | [19:36:14 WARN]: [FastLogin] Disabling Minecraft proxy configuration. Assuming direct connections from now on.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [fastlogin.hikari.HikariDataSource] FastLogin - Starting...
gurkcraft-1-mc-1  | [19:36:14 INFO]: [fastlogin.hikari.HikariDataSource] FastLogin - Start completed.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Enabling GriefPrevention v16.18
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Finished loading configuration.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Failed to load a claim (ID:1) because its world isn't loaded (yet?).  If this is not expected, delete this claim.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Failed to load a claim (ID:2) because its world isn't loaded (yet?).  If this is not expected, delete this claim.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Failed to load a claim (ID:0) because its world isn't loaded (yet?).  If this is not expected, delete this claim.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Failed to load a claim (ID:3) because its world isn't loaded (yet?).  If this is not expected, delete this claim.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] 17 total claims loaded.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Customizable messages loaded.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Finished loading data (File Mode).
gurkcraft-1-mc-1  | [19:36:14 INFO]: [GriefPrevention] Boot finished.
gurkcraft-1-mc-1  | [19:36:14 INFO]: Starting remote control listener
gurkcraft-1-mc-1  | [19:36:14 INFO]: Thread RCON Listener started
gurkcraft-1-mc-1  | [19:36:14 INFO]: RCON running on 0.0.0.0:25576
gurkcraft-1-mc-1  | [19:36:14 INFO]: Running delayed init tasks
gurkcraft-1-mc-1  | [19:36:14 INFO]: [AuthMe] Downloading GEO IP database, because the old database is older than 30 days or doesn't exist
gurkcraft-1-mc-1  | [19:36:14 INFO]: Done (3.010s)! For help, type "help"
gurkcraft-1-mc-1  | [19:36:14 WARN]: [AuthMe] No MaxMind credentials found in the configuration file! GeoIp protections will be disabled.
gurkcraft-1-mc-1  | [19:36:14 INFO]: [AuthMe] There is no newer GEO IP database uploaded to MaxMind. Using the old one for now.
gurkcraft-1-mc-1  | [19:36:14 WARN]: [AuthMe] Could not download GeoLiteAPI database [FileNotFoundException]: plugins/AuthMe/GeoLite2-Country.mmdb (No such file or directory)
gurkcraft-1-mc-1  | [19:36:14 INFO]: Timings Reset
gurkcraft-1-mc-1  | [19:36:14 INFO]: [FastLogin] Hooking into auth plugin: AuthMeHook

The text was updated successfully, but these errors were encountered:

itzg · 2023-07-31T21:22:28Z

Your non-default configuration of the container/internal ports is problematic:

Leave those at the following and put back the same in server.properties:

    ports:
      - ${SERVER_PORT}:25565
      - ${RCON_PORT}:25575

If you're still experiencing problems, provide the output of

docker inspect gurkcraft-1-mc-1 -f '{{json .State.Health}}'

and also add DEBUG: true to your environment and provide those output logs.

itzg · 2023-07-31T21:29:16Z

Since mc-health is simply a bash script, one more thing to adjust in order to debug is to add the following to your mc service:

    healthcheck:
      test: bash -x mc-health

With that, again grab the output of health inspection, which will now include the execution debugs of the script:

docker inspect gurkcraft-1-mc-1 -f '{{json .State.Health}}'

zincarla · 2023-08-03T21:16:16Z

I seem to have a similar/same issue. My container lags the whole server out with a bunch of ps -ax -o stat,comm processes. htop shows all leftover cpu time going to kernel calls. I seen your link to the mc-health bash script, looks like that command is not called if ENABLE_AUTOPAUSE is off. So I turned it off and I'm back up and running! Not a fix, probably not much help, but that's my workaround. Thanks for this repo btw, very helpful.

itzg · 2023-08-03T23:22:42Z

Interesting, but @fisher60 it looks like you're not using autopause, right?

fisher60 · 2023-08-07T21:18:35Z

I am not using autopause.

Here is the output of adding that healthcheck line to my docker-compose.yml and using docker inspect

{
    "Status": "starting",
    "FailingStreak": 11,
    "Log": [
        {
            "Start": "2023-08-07T20:20:47.5634729Z",
            "End": "2023-08-07T20:21:17.634731043Z",
            "ExitCode": -1,
            "Output": "Health check exceeded timeout (30s): + . /start-utils\n+ '[' -f /data/.mc-health.env ']'\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ mc-monitor status --host localhost --port 25566\n"
        },
        {
            "Start": "2023-08-07T20:21:22.658375119Z",
            "End": "2023-08-07T20:21:52.706556752Z",
            "ExitCode": -1,
            "Output": "Health check exceeded timeout (30s): + . /start-utils\n+ '[' -f /data/.mc-health.env ']'\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ mc-monitor status --host localhost --port 25566\n"
        },
        {
            "Start": "2023-08-07T20:21:57.729941643Z",
            "End": "2023-08-07T20:22:27.789959294Z",
            "ExitCode": -1,
            "Output": "Health check exceeded timeout (30s): + . /start-utils\n+ '[' -f /data/.mc-health.env ']'\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ mc-monitor status --host localhost --port 25566\n"
        },
        {
            "Start": "2023-08-07T20:22:32.809159068Z",
            "End": "2023-08-07T20:23:02.866483212Z",
            "ExitCode": -1,
            "Output": "Health check exceeded timeout (30s): + . /start-utils\n+ '[' -f /data/.mc-health.env ']'\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ mc-monitor status --host localhost --port 25566\n"
        },
        {
            "Start": "2023-08-07T20:23:07.890101611Z",
            "End": "2023-08-07T20:23:37.938705575Z",
            "ExitCode": -1,
            "Output": "Health check exceeded timeout (30s): + . /start-utils\n+ '[' -f /data/.mc-health.env ']'\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ isTrue ''\n+ case \"${1,,}\" in\n+ return 1\n+ mc-monitor status --host localhost --port 25566\n"
        }
    ]
}

itzg · 2023-08-08T16:24:50Z

Thanks for those logs. With that I noticed that the ping operation itself had a default timeout of 60 seconds, which is longer than the Docker health check timeout. Can you re-pull the latest image and see if that resolves this for you?

fisher60 · 2023-08-08T19:53:49Z

Hello, I pulled the latest image and built it. The healthcheck is failing without the zombie processes spawning. So it appears to be fixed now.

Thank you!

zincarla · 2023-08-08T23:32:58Z

I think my issue was somehow related or a heck of a coincidence. I pulled latest image and enabled auto_pause. All good. /shrug

itzg · 2023-08-09T00:46:59Z

@zincarla it was very likely related. I had originally added that timeout in mc-monitor because the paused process would lead to weird socket connectivity.

In any case, glad it's resolved for you also.

fisher60 added bug status/needs triage labels Jul 31, 2023

itzg removed the status/needs triage label Jul 31, 2023

This was referenced Aug 8, 2023

status: lower default ping timeout to 15s itzg/mc-monitor#54

Merged

Set mc-monitor timeout less than health check timeout #2326

Merged

itzg closed this as completed Aug 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Healthcheck creating infinite zombie processes #2311

Healthcheck creating infinite zombie processes #2311

fisher60 commented Jul 31, 2023

itzg commented Jul 31, 2023

itzg commented Jul 31, 2023

zincarla commented Aug 3, 2023

itzg commented Aug 3, 2023

fisher60 commented Aug 7, 2023

itzg commented Aug 8, 2023

fisher60 commented Aug 8, 2023

zincarla commented Aug 8, 2023

itzg commented Aug 9, 2023

Healthcheck creating infinite zombie processes #2311

Healthcheck creating infinite zombie processes #2311

Comments

fisher60 commented Jul 31, 2023

Describe the problem

Container definition

Container logs

itzg commented Jul 31, 2023

itzg commented Jul 31, 2023

zincarla commented Aug 3, 2023

itzg commented Aug 3, 2023

fisher60 commented Aug 7, 2023

itzg commented Aug 8, 2023

fisher60 commented Aug 8, 2023

zincarla commented Aug 8, 2023

itzg commented Aug 9, 2023