
Address the telnet console death bug. #2347

Merged
1 commit merged into GNS3:2.2 on Feb 9, 2024

Conversation

@spikefishjohn (Contributor) commented Feb 3, 2024

Add a wait_for around the drain() call with a 5 second timeout. If we're stuck on drain then the buffer isn't getting emptied. 5 seconds after drain() blocks, an exception will be thrown and the client will be removed from the connection table, so it will no longer be a problem. There will be a pause in console output during this 5 second wait; after that the console returns to normal.
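
Roughly, the change looks like this (just a sketch, not the exact diff; the method name and the self._connections attribute are assumptions on my part):

    import asyncio
    import logging

    log = logging.getLogger(__name__)

    async def _write_to_client(self, writer, data):
        writer.write(data)
        try:
            # drain() stalls forever if the client never empties its buffer (dead peer);
            # wait_for turns that stall into an exception we can handle.
            # (The description says 5 seconds; per the note below, the commit actually uses 10.)
            await asyncio.wait_for(writer.drain(), timeout=10)
        except asyncio.TimeoutError:
            peer = writer.get_extra_info("peername")
            log.debug("Timeout while sending data to client: %s, closing and "
                      "removing from connection table.", peer)
            writer.close()
            self._connections.pop(writer, None)  # drop the dead client from the table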

When the exception is triggered, the following debug message will be logged.

2024-02-03 03:07:15 DEBUG telnet_server.py:309 Timeout while sending data to client: ('REMOTE_IP', REMOTE_PORT), closing and removing from connection table.

It just dawned on me that peerinfo() is a list, so if you want that reformatted it would be easy enough to do.

Fixes Issue 2344

@spikefishjohn (Contributor Author)

Oops, just noticed it's a 10 second timeout and not 5.

@spikefishjohn (Contributor Author)

One additional note about the fix. I switched to looking up items by keys instead of values, and to returning a list, so that we can safely remove items from the dictionary while traversing it.
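
In code terms, the pattern is roughly this (illustration only, not the actual change; client_is_dead() is a hypothetical check):

    # Take a snapshot of the keys so entries can be deleted while walking the
    # table without hitting "RuntimeError: dictionary changed size during iteration".
    for writer in list(self._connections.keys()):
        connection = self._connections[writer]   # look the value up by key
        if client_is_dead(writer):               # hypothetical check for a dead client
            del self._connections[writer]        # safe: we're iterating over a copied list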

@cristian-ciobanu

@spikefishjohn so merging this pull request will completely fix the console freeze bug when using telnet?

@spikefishjohn (Contributor Author) commented Feb 6, 2024

It does for me. We may want to lower the timeout, as 10 seconds feels a little long. That being said, the patch could use additional testing. Do you have a way to replicate the issue, btw?

The way I recreate the issue is to clear the firewall's connection table for the telnet session (I'm accessing GNS3 through a firewall, to be clear). This causes the firewall to silently drop the TCP session. Then open a second telnet session and do something that creates a large amount of output on the console. I use find / on a Linux node, for example.

Without the patch, the console will output data for about 10 seconds or so and then lock up. You'll be able to open new sessions, get connected, and even send data, but you won't see anything happening on the console because the echo of your data never gets sent back to you.

With the patch, there will be a pause while GNS3 tries to communicate with the dead telnet session. After 10 seconds it gives up and deletes the session, and then session 2 starts working again, as will any new sessions.

Also note the next version of GNS3 will have the TCP keepalive timers lowered so that it sends keepalives more often than every 2 hours. This will prevent a firewall from killing a session that has been idle for too long, but it won't fix the broken console issue.
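
For reference, lowering the keepalive timers on a socket in Python generally looks something like this (a sketch with made-up values; the exact settings GNS3 will ship aren't shown here, and the TCP_* options are Linux-specific):

    import socket

    def enable_keepalive(sock, idle=60, interval=10, count=5):
        # Start probing after `idle` seconds of silence instead of the kernel
        # default (typically 7200 seconds, i.e. 2 hours), probe every `interval`
        # seconds, and give up after `count` unanswered probes.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)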

@cristian-ciobanu

Do you have a way to replicate the issue btw?

I can only replicate this by leaving the telnet console session open for a while (no activity for several hours, I think) until I receive a connection refused message.

@spikefishjohn (Contributor Author)

Hmmmm... connection refused. OK, I can't say for sure this will address your issue, but only because I don't know what happens if the telnet console is left in a bad state for long enough. I personally have never seen it throw connection refused, which would mean the listening socket closed. That being said, if the GNS3 node keeps trying to push data out and can't write it, at some point I would expect the service to do something bad (like crash)... so it's possible.

If you can give me more details, I'll see if I can replicate it, or by all means give the update a try and report back.

For me to replicate it, can you tell me how you're accessing the telnet console, network-topology-wise? To give you an idea of the bullet points I'm looking for:

  • GNS3 version
  • Client OS
  • GNS3 server OS
  • GNS3 node you're connecting to (Cisco router, Linux, etc.). Hopefully it's something I have access to.
  • How does the client reach the server? (Is it accessed through a router, firewall, VPN, or the GNS3 VM running on VirtualBox/VMware/ESXi, etc.? Crossing fingers it's not the GNS3 VM.)
    Just be as detailed as possible if you want me to try to replicate; even minor details may matter. :)

@cristian-ciobanu

Sorry I got a bit confused. I have two different issues related to telnet console connections.

  1. The telnet console session freezes if I leave it open for a while (no activity for several hours, I think); no message is received.
  2. When I use the auto-start console option for devices, their consoles automatically pop up with the message connection refused (I'm using SecureCRT to access the device consoles). If I close the console and reopen it, it works fine.

I have two different setups for GNS3.

  1. GNS3 server 2.2.44 running on a bare-metal Ubuntu 22.04 machine and GNS3 client 2.2.44 running remotely on Windows.
  2. GNS3 3.0.0 beta (server and client) running on bare metal on Linux (same machine, Arch Linux).

The first issue, the telnet console freeze, happens only with the remote-server setup, not when running both server and client on the same machine.
The second issue happens randomly, so it is not easy to replicate. I think maybe it happens when I start multiple devices at once and they all open their consoles, but I need to do some more extensive testing.

@spikefishjohn (Contributor Author)

ah ok.

For sure, this will fix issue number 1. I'll bet my burrito on it. Please give the patch a try and see if I owe you one.

Issue 2 sounds like a completely different thing, like a timing issue where it hasn't waited for the device to return to running, or maybe for the console to have started. I would open a new bug report for that.

@grossmj grossmj changed the base branch from master to 2.2 February 9, 2024 06:09
@grossmj grossmj merged commit 16f72b4 into GNS3:2.2 Feb 9, 2024
10 checks passed
@cristian-ciobanu

@grossmj does the 3.0 branch still need this patch applied, or should the other pull request using the telnetlib3 rewrite in theory fix this annoying issue?

@grossmj (Member) commented Feb 9, 2024

@grossmj does the 3.0 branch still need this patch applied, or should the other pull request using the telnetlib3 rewrite in theory fix this annoying issue?

The patch will be merged into 3.0 and the other pull request using telnetlib3 should make it there as well once it has been sufficiently tested.

@spikefishjohn (Contributor Author)

Sooo... uh... loaded up 2.2.46. Opened a telnet console. Slept the laptop, went to bed. Woke up. Double-clicked the VyOS box.

... nothing...

.....

[screenshot attached]

@spikefishjohn (Contributor Author) commented Mar 1, 2024

EDIT: This is completely wrong. Doh! I need to stop the late night bug research.

I did notice something: this hits the except block if it fails, which calls close(), which then does this:

    async def close(self):
        for writer, connection in self._connections.items():
            try:
                writer.write_eof()
                await writer.drain()
                writer.close()
                # await writer.wait_closed()  # this doesn't work in Python 3.6
            except (AttributeError, ConnectionError):
                continue

So my theory is the drain call needs to have a wait_for as well. I'm going to see if I can recreate the problem then take a swing at it again.
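
For what it's worth, the change I was thinking of would look roughly like this (purely illustrative, and per the EDIT above the theory turned out to be wrong anyway):

    async def close(self):
        for writer, connection in self._connections.items():
            try:
                writer.write_eof()
                # same wait_for idea as the merged patch, so a dead client
                # can't hang close() on drain() forever
                await asyncio.wait_for(writer.drain(), timeout=10)
                writer.close()
                # await writer.wait_closed()  # this doesn't work in Python 3.6
            except (AttributeError, ConnectionError, asyncio.TimeoutError):
                continue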
