Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rapid log file growth and repeated Waku errors when DNS encryption is not set #5679

Open
qfrank opened this issue Aug 8, 2024 · 17 comments · Fixed by #5682 or waku-org/go-waku#1238
Open
Assignees

Comments

@qfrank
Copy link
Contributor

qfrank commented Aug 8, 2024

Issue Description

When running the Status app without setting encrypted DNS (e.g., https://9.9.9.9/dns-query), the geth.log file grows rapidly within seconds of opening the app. The log is filled with repetitive error messages, primarily related to Waku and peer selection.

Observed Behavior

  1. Without encrypted DNS, opening the Status app causes geth.log to generate a large volume of repeated error logs within seconds.
  2. The log file size increases rapidly, potentially overwriting initial logs (e.g., 'status-go version details').
  3. The majority of error messages are related to peer selection and Waku.

Log Analysis

The log shows multiple recurring errors:

  • "peer selection returned err"
  • "no suitable peers found"
  • "dns discovery error"
  • "could not obtain dns discovery peers"

Example log snippet:

ERROR[08-08|08:12:56.847][github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:350] peer selection returned err
error="no suitable peers found"

Potential Causes

  1. GFW (Great Firewall) interference: The issue may be related to network restrictions in certain regions.
  2. Waku-related code issues in the Status app.
  3. DNS resolution problems: Default DNS may fail to resolve Ethereum network node addresses correctly.
  4. Network connectivity issues: Certain network configurations may be blocking the Status app from connecting to the Ethereum network.

Workaround

Setting encrypted DNS (e.g., https://9.9.9.9/dns-query) appears to resolve or mitigate the issue. This suggests that the problem might be related to DNS resolution or network restrictions.

Additional Context

  • The issue seems to create a loop-like behavior in Waku-related processes.
  • The problem may be more pronounced in regions with strict internet controls.

Questions for Investigation

  1. Is there a potential infinite loop in the Waku-related code when peer discovery fails?
  2. How does the encrypted DNS setting affect the peer discovery process?
  3. Are there any network-level checks that can be implemented to gracefully handle situations where peer discovery consistently fails?

Environment

  • Status app version: on branch develop, commit id: 3b4cbfa4bca1b76cf3e9675e2e6b99d14c09b9e1
  • ios simulator: iphone 15 pro
  • surge v5.7.x

Steps to Reproduce

  1. Ensure encrypted DNS is not set in surge.
  2. Open the Status app.
  3. Monitor the geth.log file for rapid growth and repeated error messages.

relate screenshots and comments:

This might be a serious issue. I just opened the Status app for a few seconds, and the log file size grows rapidly. The content seems to repeatedly relate to Waku, as if there is a dead loop. I can't even see the initial log like status-go version details because it might be getting overwritten.

image

after changing log level to INFO , geth.log get fullfilled with this content, so now I'm sure there's something relate to waku

image update encrypted DNS
@churik
Copy link
Member

churik commented Aug 8, 2024

cc @fryorcraken would you mind having somebody look at it?

@richard-ramos richard-ramos self-assigned this Aug 9, 2024
@chaitanyaprem
Copy link
Contributor

can i get full logs for this?

interesting that without peers filter was trying to subscribe. Without any peers node should be in offline mode and should not try to subscribe(which must be causing these errors).

@qfrank
Copy link
Contributor Author

qfrank commented Aug 9, 2024

can i get full logs for this?

interesting that without peers filter was trying to subscribe. Without any peers node should be in offline mode and should not try to subscribe(which must be causing these errors).

mass-repeat-error-geth.log.zip
^ is the log for INFO enabled I only saved, if you need a DEBUG level, pls let me know @chaitanyaprem

@chaitanyaprem
Copy link
Contributor

can i get full logs for this?
interesting that without peers filter was trying to subscribe. Without any peers node should be in offline mode and should not try to subscribe(which must be causing these errors).

mass-repeat-error-geth.log.zip ^ is the log for INFO enabled I only saved, if you need a DEBUG level, pls let me know @chaitanyaprem

yup, was about to ask for DEBUG because i want to know what went wrong with setting node state. Do provide the debug.

@qfrank
Copy link
Contributor Author

qfrank commented Aug 9, 2024

debug-geth.log.zip

@qfrank
Copy link
Contributor Author

qfrank commented Aug 9, 2024

look at the size of the log file, it's rotate already, the duration between each executing ls -lh is just 1 seconds around

image

@qfrank
Copy link
Contributor Author

qfrank commented Aug 9, 2024

increased max log file size to 10 GB, so now get the full log without rotating: debug-geth.log.zip
@chaitanyaprem

@qfrank
Copy link
Contributor Author

qfrank commented Sep 30, 2024

Reopen this as this problem is encountered again on latest branch develop. cc @chaitanyaprem

@qfrank qfrank reopened this Sep 30, 2024
@chaitanyaprem
Copy link
Contributor

Reopen this as this problem is encountered again on latest branch develop. cc @chaitanyaprem

would help if i can get logs to look into it.

@qfrank
Copy link
Contributor Author

qfrank commented Sep 30, 2024

Reopen this as this problem is encountered again on latest branch develop. cc @chaitanyaprem

would help if i can get logs to look into it.

geth.log.zip

@chaitanyaprem
Copy link
Contributor

chaitanyaprem commented Sep 30, 2024

Reopen this as this problem is encountered again on latest branch develop. cc @chaitanyaprem

would help if i can get logs to look into it.

geth.log.zip

would it be possible to share debug logs?
odd that ticker did not kick-in that would set node-state to offline if none of the peers are connected.

also what is the peer-count do you see in the UI?

@qfrank
Copy link
Contributor Author

qfrank commented Sep 30, 2024

image

There seems to be another place called the log

@qfrank
Copy link
Contributor Author

qfrank commented Sep 30, 2024

maybe we could just search the usages on handleFilterSubscribeOptions, and we can find out all the possibles.

would it be possible to share debug logs?

geth-debug.log.zip

also what is the peer-count do you see in the UI?

5

@chaitanyaprem
Copy link
Contributor

chaitanyaprem commented Sep 30, 2024

maybe we could just search the usages on handleFilterSubscribeOptions, and we can find out all the possibles.

i know where this log is coming from, but this should not get triggered at all since the node is not able to reach any peers.

But odd that you have a peer-count of 5. maybe the node was able to connect to 5 peers that are not using hostnames.
Also, do we notice this issue in all scenarios or only in case of DNS failure(which seems like a rare one)?

Will take a look at logs and see.

@chaitanyaprem
Copy link
Contributor

maybe we could just search the usages on handleFilterSubscribeOptions, and we can find out all the possibles.

would it be possible to share debug logs?

geth-debug.log.zip

also what is the peer-count do you see in the UI?

5

ah, from the logs it looks like somehow connection to storenodes is successfull eventhough they are configured with DNS name. But for some reason node is unable to connect to other peers.

i think we need to add add some backoff to filter subscribe in case of subsequent failures especially for specific errors such as "no suitable peers found" . This would prevent such situation from occuring. Will start working on a PR for this.

@chaitanyaprem
Copy link
Contributor

chaitanyaprem commented Oct 7, 2024

#5909 is being dogfooded in status-mobile which includes a simple backoff in case of certain errors.

mobile PR is status-im/status-mobile#21386

@qfrank please test this in the same env and confirm the resolution of the issue.

@qfrank
Copy link
Contributor Author

qfrank commented Oct 8, 2024

#5909 is being dogfooded in status-mobile which includes a simple backoff in case of certain errors.

mobile PR is status-im/status-mobile#21386

@qfrank please test this in the same env and confirm the resolution of the issue.

thank you for your effort, it works ❤️ @chaitanyaprem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants