Telnet console death issue #2344

Open
spikefishjohn opened this issue Jan 31, 2024 · 7 comments · Fixed by #2347

Comments

@spikefishjohn
Contributor

@grossmj

Since there isn't really an open issue to track the original death of the telnet console, I'm opening this one. I've narrowed down where the issue is, but I haven't figured out what to do about it just yet.

Here is a link to my telnet_server.py with debug statements showing the issue:
telnet_server.py

Here are the debug logs from when the death loop happens.

2024-01-30 23:13:41 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:41 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:41 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:41 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:41 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:41 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:41 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:41 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:42 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:42 DEBUG telnet_server.py:293
2024-01-30 23:13:42 DEBUG telnet_server.py:294
2024-01-30 23:13:42 DEBUG telnet_server.py:295
2024-01-30 23:13:42 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:42 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:42 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:42 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:42 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:42 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:42 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:42 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:43 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:43 DEBUG telnet_server.py:293
2024-01-30 23:13:43 DEBUG telnet_server.py:294
2024-01-30 23:13:43 DEBUG telnet_server.py:295
2024-01-30 23:13:43 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:43 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:43 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:43 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:43 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:43 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:43 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:43 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:44 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:44 DEBUG telnet_server.py:293
2024-01-30 23:13:44 DEBUG telnet_server.py:294
2024-01-30 23:13:44 DEBUG telnet_server.py:295
etc etc etc

So... the interesting thing here is that _get_reader is always returning None, which then causes an infinite loop in _process.

As you can see from the debugs above, _get_reader is making it all the way to that final if statement, which doesn't match and then returns None. I've looked at this a few times and I don't understand the intention of that compare.
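
To make the failure mode concrete, here is a minimal sketch of the loop shape as I read it. Only the names `_get_reader` and `_process` come from telnet_server.py; the bodies are simplified and partly hypothetical:

```python
import asyncio

async def _get_reader(reader_process, network_reader):
    # Distilled from the debug output above: the process reader is only
    # handed back when it compares equal to the network reader, and a
    # process<->network pairing never does, so this always returns None.
    if reader_process == network_reader:
        return reader_process
    return None

async def _process(reader_process, network_reader):
    while True:
        reader = await _get_reader(reader_process, network_reader)
        if reader is None:
            # Nothing to await on; the logs tick once per second, so
            # something like this wait happens each pass, and the loop
            # spins forever without ever draining the wedged reader.
            await asyncio.sleep(1)
            continue
        data = await reader.read(1024)  # never reached in the wedged state
```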

BTW, ignore the "__process reader_read is still 2nd None" log; I didn't have that in the right place and didn't feel like redoing everything.

Some other interesting things to point out: self._reader_process has a 0-byte buffer, but network_reader has a buffer of 65550. I have to poke around in pdb to see what I can find, but if I recall from looking at this, the socket on self._reader_process is technically closed, so it's impossible to write data to it. I think this is why the reader's buffer isn't getting drained. I'm not sure if calling a future on a dead socket is what is causing this.

The way I'm personally triggering this is to set up a firewall rule and then delete a telnet console session while I'm running find / on 2 telnet consoles. The telnet consoles will wedge in 10 seconds or so.

I still plan on finishing up the proxy replacement, but it's not ready. I got to the point where I can start and stop a qemu VM and the console works, but if I delete the devices I'm getting exceptions. It also doesn't support all the options the original telnet_server supports (such as passing in a reader/writer).

@spikefishjohn
Contributor Author

I'm really starting to question whether this is the reason the new console breaks. I will say that once it's in this loop, there is nothing to delete the connection that is in a failed state.

@spikefishjohn
Contributor Author

spikefishjohn commented Feb 2, 2024

OK, I think I have this dialed in now. So many more debugs...

This shows that a bad connection isn't being dealt with correctly:

2024-02-02 23:09:13 DEBUG telnet_server.py:284 Start async _get_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:290 Start async _get_reader - self._reader_process is not None
2024-02-02 23:09:13 DEBUG telnet_server.py:291 self._reader_process == network_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:292 <StreamReader exception=TimeoutError(110, 'Connection timed out') transport=<_SelectorSocketTransport closed fd=16>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<idle, bufsize=0>>>
2024-02-02 23:09:13 DEBUG telnet_server.py:298 _get_reader Returning None

As you can see, self._reader_process has a StreamReader with a TimeoutError exception set and a closed transport (a dead socket), but it hasn't been removed.

I'm going to add a wait_for on a drain call. If it fails after, say, 30 seconds, then we haven't been able to get data out to the low water mark on the socket and it's most likely dead. Then yank it from the connection dictionary, and in theory this problem goes away.
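
Roughly this shape (a sketch of the idea, not the actual patch; `DRAIN_TIMEOUT`, `_write_to_client`, and a `connections` dict keyed by the network reader are all my assumptions):

```python
import asyncio

DRAIN_TIMEOUT = 30  # seconds; my guess at a sane bound, not a measured value

async def _write_to_client(data, network_reader, network_writer, connections):
    network_writer.write(data)
    try:
        # On a dead peer, drain() can block until the kernel's TCP
        # retransmission timeout fires (often around 2 hours), which
        # matches the wedge we're seeing. Bound it instead.
        await asyncio.wait_for(network_writer.drain(), timeout=DRAIN_TIMEOUT)
    except (asyncio.TimeoutError, ConnectionError):
        # Couldn't reach the low water mark in time: treat the socket
        # as dead, close it, and drop it from the connection dict so
        # _get_reader()/_process() stop iterating over a corpse.
        network_writer.close()
        connections.pop(network_reader, None)
```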

Crossing fingers...

@spikefishjohn
Contributor Author

Fix located here:

PR #2347

@grossmj
Member

grossmj commented Feb 7, 2024

Thanks, I am going to check everything soon (need to catch up after holidays) 👍

@grossmj
Member

grossmj commented Feb 9, 2024

Awesome, thanks for the fix, I hope we have finally squashed that annoying bug 👍

@grossmj grossmj closed this as completed Feb 9, 2024
@spikefishjohn
Contributor Author

It's... uh... not.

@grossmj Can you re-open this while I yell into the void in an overly dramatic manner?

@grossmj grossmj reopened this Mar 1, 2024
@spikefishjohn
Copy link
Contributor Author

spikefishjohn commented Mar 1, 2024

Here is the exception. I don't know how long it was hung on this call, but you can see it does time out after a while. My guess is 2 hours. Oh, I also feel like maybe the keepalives aren't always being set for some reason, but I have no evidence.

Knee-jerk reaction is asyncio.wait_for() again (see the sketch after the traceback below). I'll put some time into it this weekend.

2024-03-01 02:36:54 ERROR base_events.py:1758 Task exception was never retrieved
future: <Task finished name='Task-1745' coro=<AsyncioTelnetServer.run() done, defined at /usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py:188> exception=TimeoutError(110, 'Connection timed out') created at /usr/lib/python3.10/asyncio/streams.py:242>
source_traceback: Object created at (most recent call last):
  File "/usr/bin/gns3server", line 8, in <module>
    sys.exit(main())
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/main.py", line 87, in main
    run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/run.py", line 267, in run
    server.run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py", line 352, in run
    self._loop.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1901, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.10/asyncio/streams.py", line 242, in connection_made
    self._task = self._loop.create_task(res)
Traceback (most recent call last):
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 207, in run
    await self._process(network_reader, network_writer, connection)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 269, in _process
    data = coro.result()
  File "/usr/lib/python3.10/asyncio/streams.py", line 669, in read
    await self._wait_for_data('read')
  File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
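
For reference, the shape I have in mind. Note the TimeoutError in the traceback is the OS-level errno 110 (an OSError subclass) bubbling out of reader.read(), not asyncio.TimeoutError, so the read needs to handle both (a sketch with assumed names and an assumed timeout value):

```python
import asyncio

READ_TIMEOUT = 300  # seconds; assumed value -- too low and idle consoles get cut

async def safe_read(network_reader, network_writer, connections):
    try:
        # asyncio.TimeoutError fires if the peer goes silent past the bound;
        # OSError (including TimeoutError errno 110) fires when the kernel
        # finally declares the connection dead, as in the traceback above.
        return await asyncio.wait_for(network_reader.read(1024), timeout=READ_TIMEOUT)
    except (asyncio.TimeoutError, OSError):
        network_writer.close()
        connections.pop(network_reader, None)
        return b""  # empty bytes reads as EOF to the caller
```

The obvious catch is that a plain read timeout also cuts off legitimately idle consoles, which is why this is only the knee-jerk version; TCP keepalives on the socket would be the cleaner signal, if they are in fact being set.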
