Influx not writing to disk #25296

Open
Cripyy opened this issue Sep 9, 2024 · 0 comments

Cripyy commented Sep 9, 2024

We have a Docker Swarm cluster running InfluxDB v2.7.8 on a single node. The issue always starts on a Sunday night at midnight (00:00), but not every week: it recurs at intervals of 2 to 3 weeks.

When it happens, memory usage suddenly starts to climb steadily and InfluxDB struggles to write data to disk. Some data is still written, but most of it is not. I’ve reviewed the Docker logs, syslog, and the docker.service logs, but haven’t found anything relevant. There are no error entries suggesting that InfluxDB is having a problem; according to its logs it keeps compacting and performing its normal tasks.

Our first attempt to resolve this issue was upgrading InfluxDB from version 2.7.5 to 2.7.8, as there was a changelog entry addressing an infinite write loop bug. Unfortunately, this didn’t resolve the issue, as we experienced the same problem again today.

Telegraf runs on the same server and gathers metrics from InfluxDB. In those metrics we see a spike in the number of active items in the queue:

Screenshot from 2024-09-09 10-04-34
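
For anyone who wants to watch the same kind of value without Telegraf, here is a minimal sketch that reads the Prometheus-format text InfluxDB 2.x serves at `/metrics` and keeps only the samples whose name contains a given substring. The base URL `http://localhost:8086` and the filter string `"active"` are assumptions for illustration; the exact name of the queue metric in the screenshot is not given here, so adjust the filter to whatever your instance exposes.

```python
import urllib.request


def parse_metrics(text, name_filter=""):
    """Parse Prometheus text-format samples.

    Returns {"name{labels}": value} for every sample line whose metric
    name contains name_filter (an empty filter matches everything).
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so cut at the last closing brace.
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:]
        else:
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue
            key, rest = parts
        fields = rest.split()  # value, optionally followed by a timestamp
        if not fields:
            continue
        name = key.split("{", 1)[0]
        if name_filter in name:
            try:
                samples[key] = float(fields[0])
            except ValueError:
                continue
    return samples


def fetch_metrics(base_url="http://localhost:8086"):
    """Fetch the raw metrics text. base_url is an assumed default."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=10) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "active" is only an illustrative filter, not a confirmed metric name.
    for key, value in sorted(parse_metrics(fetch_metrics(), "active").items()):
        print(key, value)
```

Running this from cron every minute and appending the output to a file would give a record of the values leading up to midnight that survives a container restart.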

Below is also the graph from the memory usage of the server:

[image: server memory usage graph]

Restarting the Docker container resolves the problem, but all data that InfluxDB was holding in memory at that point is lost.

Environment info:

uname -srm:
Linux 5.15.0-113-generic x86_64

Docker version:
Docker version 24.0.7, build afdd53b

InfluxDB Docker image version:
2.7.8

The server is a VM running in VMware.

Logs:
The only log error I can find that might be relevant is that Telegraf failed to send metrics to our off-site InfluxDB at 00:00 and 00:15, even though the off-site InfluxDB still received some data from the server.
