[nftables] remediation component shuts down after a failed response #369

Open
LaurenceJJones opened this issue May 16, 2024 · 15 comments

@LaurenceJJones
Contributor

What happened?

Currently, when the remediation component using the nftables backend fails to connect to LAPI, the whole service shuts down and flushes the nftables sets:

time="10-05-2024 11:06:07" level=info msg="Processing new and deleted decisions . . ."
time="10-05-2024 11:07:07" level=error msg="http code 504, invalid body: invalid character '<' looking for beginning of value"
time="10-05-2024 11:07:07" level=info msg="Shutting down backend"
time="10-05-2024 11:07:07" level=info msg="flushing 'crowdsec-blacklists' set in 'crowdsec' table"
time="10-05-2024 11:07:07" level=info msg="flushing 'crowdsec6-blacklists' set in 'crowdsec6' table"
time="10-05-2024 11:07:07" level=fatal msg="process terminated with error: bouncer stream halted"
time="10-05-2024 11:07:17" level=info msg="Starting crowdsec-firewall-bouncer v0.0.28-debian-pragmatic-af6e7e25822c2b1a02168b99ebbf8458bc6728e5"
time="10-05-2024 11:07:17" level=info msg="backend type : nftables"
time="10-05-2024 11:07:17" level=info msg="nftables initiated"

This is not what we want, as the IPs currently in the set are still useful to the service.

What did you expect to happen?

The remediation component should tolerate failures to connect to LAPI once the service has started. For example, if the first connection at startup fails, then yes, restart; but after that it should be resilient.
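The expected behaviour can be sketched as a polling loop that treats only the first failure as fatal. This is a minimal illustration, not the bouncer's actual code: `run_poll_loop`, `InitialConnectionError` and the `fetch` callable (assumed to return a pair of new and deleted IPs, and to raise on an HTTP error) are hypothetical names.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical: one poll returns (new IPs, deleted IPs) or raises on failure.
Fetch = Callable[[], Tuple[Iterable[str], Iterable[str]]]


class InitialConnectionError(Exception):
    """Raised when the very first poll fails (possibly a bad configuration)."""


def run_poll_loop(fetch: Fetch, polls: int) -> Set[str]:
    """Poll `fetch` up to `polls` times and return the resulting ban set."""
    banned: Set[str] = set()
    connected_once = False
    for _ in range(polls):
        try:
            new, deleted = fetch()
        except Exception as exc:
            if not connected_once:
                # Startup failure: stop so the service manager can restart us.
                raise InitialConnectionError(str(exc)) from exc
            # Later failure: log it and keep the existing set untouched.
            print(f"poll failed, keeping {len(banned)} entries: {exc}")
            continue
        connected_once = True
        banned |= set(new)
        banned -= set(deleted)
    return banned
```

With this shape, a 504 in the middle of a run only skips one update; the entries already in the set stay in place.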

How can we reproduce it (as minimally and precisely as possible)?

Bring up a LAPI and a firewall remediation component. A user has reported that the service currently comes down if the response code is > 500.

Anything else we need to know?

No response

version

remediation component version:

$ crowdsec-firewall-bouncer --version
# paste output here

crowdsec version

crowdsec version:

$ crowdsec --version
# paste output here

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

@LaurenceJJones: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

  1. Check Documentation to see if your issue can be self resolved.
  2. You can also join our Discord

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.


@LaurenceJJones: There are no 'kind' label on this issue. You need a 'kind' label to start the triage process.

  • /kind feature
  • /kind enhancement
  • /kind bug
  • /kind packaging

@dolgovas

Hello! We ran into this problem too. Is there any update on it?

@mr1jingles

UPDATE. This only happens if the bouncer is restarted. If the API does not respond while the bouncer is running, the bouncer keeps trying to fetch new decisions and continues to work.

One more question: Why does bouncer reset nftables set on restart?

@LaurenceJJones
Contributor Author

LaurenceJJones commented May 22, 2024

> UPDATE. This only happens if the bouncer is restarted. If the API does not respond while the bouncer is running, the bouncer keeps trying to fetch new decisions and continues to work.

Yes, this is the current design: if the remediation component can't establish an initial connection, the cause could be a bad configuration.

> One more question: Why does bouncer reset nftables set on restart?

We remove the set because an initial load takes ten times longer if we have to check whether each element already exists. So, to be more efficient, we remove the set and then recreate it on restart.

@mr1jingles

But if the host is under attack, clearing the nftables set can negatively affect the server.

It is also not entirely clear: if the bouncer clears the nftables set, why does it pull all decisions, including outdated ones?

@LaurenceJJones
Contributor Author

> But if the host is under attack, clearing the nftables set can negatively affect the server.

Yes, but this should only happen if you restart the service while under attack, and the service should be running for a long time unless there is a reason not to run it.

That is most likely down to the way crowdsec sends decisions: bouncers have no direct influence on what they are sent unless it's filtered. There is no impact on performance; you just see an unnecessary log line, that's all.

@mr1jingles

If the host is under attack, free memory may run out and the OOM killer can kill the bouncer. When it restarts, the bouncer clears the table, which puts even more load on the server.

I think it's reasonable to add an option that compares the data received from the API with the existing set instead of clearing the table on restart.
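A rough sketch of what such an option could do, assuming the current set contents can be read once and compared in memory. The `reconcile` function is a hypothetical name for illustration; it is not part of the bouncer, and whether this is faster than the current remove-and-reload approach is untested here.

```python
from typing import Iterable, Set, Tuple


def reconcile(current: Iterable[str], desired: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (to_add, to_remove) so that `current` ends up equal to `desired`.

    `current` stands for the IPs already in the nftables set, `desired` for
    the IPs received from the API after a restart.
    """
    current_set, desired_set = set(current), set(desired)
    to_add = desired_set - current_set        # missing from the set
    to_remove = current_set - desired_set     # no longer wanted
    return to_add, to_remove


to_add, to_remove = reconcile(
    current=["1.1.1.1", "2.2.2.2"],
    desired=["2.2.2.2", "3.3.3.3"],
)
print(sorted(to_add), sorted(to_remove))
```

Only the difference would then be applied, so entries present on both sides (here `2.2.2.2`) never leave the set during a restart.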

@mr1jingles

About decisions. When I restarted a large number of bouncers, I saw a large load on the database on the API server. This led to a memory leak and complete unavailability of the API
Screenshot at May 23 13-45-34
Screenshot at May 23 13-45-48
Screenshot at May 23 13-46-02

@LaurenceJJones
Contributor Author

> This led to a memory leak and complete unavailability of the API

A memory spike is not the same as a memory leak; it just means the API is handling the requests. Because it holds decisions in memory while it queries, memory usage will spike.

We have a feature flag for streamed decisions that may help: https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags

If you can capture the memory leak via pprof, we can look into it.

https://docs.crowdsec.net/docs/next/observability/pprof

I understand the OOM part, and we can improve this in the future, but currently, we have no resources to look at this, so contributions are welcome.

@LaurenceJJones
Contributor Author

LaurenceJJones commented May 23, 2024

/kind enhancement
/accepted

@mr1jingles

> We have a feature flag for streamed decisions it may help https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags

Should I enable this flag on the API server?

Correct me if I'm wrong. Does this feature allow you to send decisions in a batch?

@LaurenceJJones
Contributor Author

LaurenceJJones commented May 24, 2024

> We have a feature flag for streamed decisions it may help https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags
>
> Should I enable this flag on the API server?
>
> Correct me if I'm wrong. Does this feature allow you to send decisions in a batch?

Exactly. Instead of loading all decisions into memory, it fetches X decisions and writes them to the stream, then fetches the next batch and writes that, and so on. It may become the default in upcoming releases. It is currently behind a feature flag because we wanted to ensure stability, but a large enterprise has been using it in production for over 2 minor releases with no issues reported from their side.
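To illustrate the idea only (this is not taken from the crowdsec codebase), the fetch-a-batch-then-write pattern can be sketched as a generator. `stream_decisions` and the `fetch_page(offset, limit)` callable are hypothetical names, and the JSON-lines output is an assumption made for the example.

```python
import json
from typing import Callable, Dict, Iterator, List

# Hypothetical database accessor: returns at most `limit` rows from `offset`.
FetchPage = Callable[[int, int], List[Dict[str, str]]]


def stream_decisions(fetch_page: FetchPage, batch_size: int) -> Iterator[str]:
    """Yield decisions one serialized line at a time, loading one batch at a time."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            break  # no more rows to send
        for decision in page:
            # Each line can be written to the response as soon as it is produced.
            yield json.dumps(decision) + "\n"
        offset += len(page)


rows = [{"value": f"10.0.0.{i}"} for i in range(5)]
for line in stream_decisions(lambda off, lim: rows[off:off + lim], batch_size=2):
    print(line, end="")
```

At any moment only one batch of `batch_size` rows is held in memory, rather than the full result set.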

@mr1jingles

And if I use MySQL as a database server, will it work for it too?

@LaurenceJJones
Contributor Author

> And if I use MySQL as a database server, will it work for it too?

Yes, it works for all databases.
