Conn-retry needed before file-retry? #12

Open
varenius opened this issue May 20, 2021 · 0 comments

While doing remote-to-remote testing I found something interesting. Using three machines (gyller: runs etc; skirner_x: data source; TARGET: data target), I checked how etc handles broken connections. I tested two situations:
A) I run etc on gyller with etd running on the target but not on the source. etc cannot connect to the source, so it waits and then retries the connection. After a while I start etd on the source; etc then connects and starts the transfer. All good.
B) However, if I now kill the etd instance on the source, the transfer stops. etc retries the file transfer, but it does NOT try to reconnect! So even if I restart etd on the source, the transfer never resumes, because the connection is never re-established and every transfer retry fails.
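
To make the difference between the two retry strategies concrete, here is a minimal Python sketch of situation B. It is not etransfer code: `FakeDaemon`, `Connection`, `connect_with_retry`, `transfer` and `run_scenario` are all hypothetical names, and the "N retries means N + 1 attempts" counting is only an assumption modelled on the `attempt #1/11` lines that `--max-conn-retry 10` produces. The sketch kills the fake daemon during the first send, restarts it during the first wait, and then compares a file retry that reuses the old connection with one that reconnects first.

```python
class FakeDaemon:
    """Hypothetical stand-in for an etd instance that can be killed and restarted."""

    def __init__(self):
        self.running = True
        self.generation = 0            # incremented on every restart
        self.crash_on_next_send = False

    def kill(self):
        self.running = False

    def start(self):
        self.generation += 1           # connections from older generations are dead
        self.running = True


class Connection:
    """Hypothetical connection, valid only for the daemon generation it was opened on."""

    def __init__(self, daemon):
        if not daemon.running:
            raise ConnectionError("cannot connect")
        self.daemon = daemon
        self.generation = daemon.generation

    def send_file(self, name):
        if self.daemon.crash_on_next_send:
            self.daemon.crash_on_next_send = False
            self.daemon.kill()         # simulate killing etd mid-transfer
        if not self.daemon.running or self.generation != self.daemon.generation:
            raise ConnectionError("write on dead connection")
        return "sent " + name


def connect_with_retry(daemon, max_conn_retry, wait):
    """Try to connect up to max_conn_retry + 1 times, calling wait() between attempts."""
    for attempt in range(max_conn_retry + 1):
        try:
            return Connection(daemon)
        except ConnectionError:
            if attempt == max_conn_retry:
                raise
            wait()


def transfer(daemon, name, max_retry, max_conn_retry, wait, reconnect):
    """Send one file; after a failure, wait and retry, optionally reconnecting first."""
    conn = connect_with_retry(daemon, max_conn_retry, wait)
    for attempt in range(max_retry + 1):
        try:
            return conn.send_file(name)
        except ConnectionError:
            if attempt == max_retry:
                raise
            wait()
            if reconnect:
                # The step that appears to be missing in situation B.
                conn = connect_with_retry(daemon, max_conn_retry, wait)


def run_scenario(reconnect):
    """Kill the daemon during the first send and restart it during the first wait."""
    daemon = FakeDaemon()
    daemon.crash_on_next_send = True

    def wait():
        # Stands in for the sleep between retries; etd is restarted meanwhile.
        if not daemon.running:
            daemon.start()

    try:
        return transfer(daemon, "ev0244_ow_244-1000_3", 10, 10, wait, reconnect)
    except ConnectionError:
        return None                    # all retries exhausted
```

With `run_scenario(reconnect=False)` every retry writes to the stale connection and the transfer gives up, which matches what I see. With `run_scenario(reconnect=True)` the second attempt succeeds on a fresh connection.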

Here is very verbose output from the etc side:

oper@gyller:~/eskil/etransfer.git/Linux-x86_64-native-opt$ ./etc skirner_x:/mnt/vbsmnt/ev0244_ow_244-1000* TARGET#2620:/gpfs/cdata/incoming/onsala-test/ -v --resume --max-retry 10 --max-conn-retry 10 -m 5
2021-05-20 09:26:20.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/attempt #1/11 trying to connect to tcp:skirner_x:4004
2021-05-20 09:26:20.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/sleeping for 5s trying to connect to tcp:skirner_x:4004
2021-05-20 09:26:25.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/attempt #2/11 trying to connect to tcp:skirner_x:4004
2021-05-20 09:26:25.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/sleeping for 5s trying to connect to tcp:skirner_x:4004
2021-05-20 09:26:30.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/attempt #3/11 trying to connect to tcp:skirner_x:4004
2021-05-20 09:26:30.64: [virtual etdc::protocolversion_type etdc::ETDProxy::protocolVersion() const] ETDProxy::protocolVersion/sending message 'protocol-version
'
2021-05-20 09:26:30.64: [etdc::etdc_fdptr mk_client(const T&, const etdc::detail::client_settings&) [with T = etdc::protocol_type; etdc::etdc_fdptr = std::shared_ptr<etdc::etdc_fd>]] mk_client/attempt #1/11 trying to connect to tcp:TARGET:2620
2021-05-20 09:26:30.67: [virtual etdc::protocolversion_type etdc::ETDProxy::protocolVersion() const] ETDProxy::protocolVersion/sending message 'protocol-version
'
2021-05-20 09:26:30.70: [int main(int, const char* const*)] This client supports protocol version 1
2021-05-20 09:26:30.70: [int main(int, const char* const*)] Server protocol version: 1
2021-05-20 09:26:30.70: [int main(int, const char* const*)] Server protocol version: 1
2021-05-20 09:26:30.70: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] ETDProxy::listPath/sending message 'list /mnt/vbsmnt/ev0244_ow_244-1000*
'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_0'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_1'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_2'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_3'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_4'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_5'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_6'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK /mnt/vbsmnt/ev0244_ow_244-1000_7'
2021-05-20 09:26:30.73: [virtual etdc::filelist_type etdc::ETDProxy::listPath(const string&, bool) const] listPath/reply from server: 'OK'
2021-05-20 09:26:30.73: [virtual etdc::dataaddrlist_type etdc::ETDProxy::dataChannelAddr() const] ETDProxy::dataChannelAddr/sending message 'data-channel-addr-ext
'
2021-05-20 09:26:30.76: [virtual etdc::dataaddrlist_type etdc::ETDProxy::dataChannelAddr() const] dataChannelAddr/reply from server: 'OK <udt/TARGET:2630/mss=1500,max-bw=-1>'
2021-05-20 09:26:30.76: [etdc::sockname_type etdc::decode_data_addr(const string&)] decode_data_addr: 1='udt' 2='TARGET' 9='2630' 11='mss=1500,max-bw=-1'
2021-05-20 09:26:30.76: [virtual etdc::dataaddrlist_type etdc::ETDProxy::dataChannelAddr() const] dataChannelAddr/reply from server: 'OK'
2021-05-20 09:26:30.76: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_0 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_0
2021-05-20 09:26:30.76: [void signal_thread(const signallist_type&, pthread_t, etdc::etd_state&, std::vector<std::shared_ptr<etdc::ETDServerInterface> >&, unique_result (&)[2]) [with int KillSignal = 10; signallist_type = std::vector<int>; pthread_t = long unsigned int; unique_result = std::unique_ptr<std::tuple<etdc::uuid_type, long int> >]] sigwaiterthread: enter wait phase
2021-05-20 09:26:30.76: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_0
' sz=72
2021-05-20 09:26:30.80: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2539694560 /mnt/vbsmnt/ev0244_ow_244-1000_0
'
2021-05-20 09:26:30.80: [int main(int, const char* const*)] Destination is complete or is larger than source file
2021-05-20 09:26:30.80: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:30.83: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.83: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:30.83: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.83: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_1 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_1
2021-05-20 09:26:30.83: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_1
' sz=72
2021-05-20 09:26:30.86: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2539316256 /mnt/vbsmnt/ev0244_ow_244-1000_1
'
2021-05-20 09:26:30.87: [int main(int, const char* const*)] Destination is complete or is larger than source file
2021-05-20 09:26:30.87: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:30.90: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.90: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:30.90: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.90: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_2 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_2
2021-05-20 09:26:30.90: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_2
' sz=72
2021-05-20 09:26:30.93: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2539669888 /mnt/vbsmnt/ev0244_ow_244-1000_2
'
2021-05-20 09:26:30.93: [int main(int, const char* const*)] Destination is complete or is larger than source file
2021-05-20 09:26:30.93: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:30.96: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.96: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:30.96: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:30.96: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_3 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
2021-05-20 09:26:30.96: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
' sz=72
2021-05-20 09:26:30.99: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 1864780480 /mnt/vbsmnt/ev0244_ow_244-1000_3
'
2021-05-20 09:26:31.02: [virtual etdc::xfer_result etdc::ETDProxy::sendFile(const etdc::uuid_type&, const etdc::uuid_type&, off_t, const dataaddrlist_type&)] ETDProxy::sendFile/sending message 'send-file ISr3sOzwb0oWUugmT f3YWyd7mJTMqcjq9 674469984 <udt/TARGET:2630/mss=1500,max-bw=-1>
' fd=3
2021-05-20 09:26:35.76: [int main(int, const char* const*)] Got exception: assertion error: src/etdc_etdserver.cc:1107 [n>0] Failed to read data from remote end
2021-05-20 09:26:35.76: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:35.79: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:35.79: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:35.79: [int main(int, const char* const*)] Retry #2 (#2 for this file), go to sleep for 10s
2021-05-20 09:26:45.79: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_3 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
2021-05-20 09:26:45.79: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
' sz=72
2021-05-20 09:26:45.83: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2133215936 /mnt/vbsmnt/ev0244_ow_244-1000_3
'
2021-05-20 09:26:45.83: [int main(int, const char* const*)] Got exception: assertion error: src/etdc_etdserver.cc:901 [__m_connection->write(__m_connection->__m_fd, msg.data(), msg.size())==(ssize_t)msg.size()] fails
2021-05-20 09:26:45.83: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:45.86: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:45.86: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:45.86: [int main(int, const char* const*)] Retry #3 (#3 for this file), go to sleep for 10s
2021-05-20 09:26:55.86: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_3 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
2021-05-20 09:26:55.86: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
' sz=72
2021-05-20 09:26:55.89: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2133215936 /mnt/vbsmnt/ev0244_ow_244-1000_3
'
2021-05-20 09:26:55.89: [int main(int, const char* const*)] Got exception: assertion error: src/etdc_etdserver.cc:901 [__m_connection->write(__m_connection->__m_fd, msg.data(), msg.size())==(ssize_t)msg.size()] fails
2021-05-20 09:26:55.89: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:26:55.92: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:26:55.92: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:26:55.92: [int main(int, const char* const*)] Retry #4 (#4 for this file), go to sleep for 10s
^C2021-05-20 09:26:57.97: [void signal_thread(const signallist_type&, pthread_t, etdc::etd_state&, std::vector<std::shared_ptr<etdc::ETDServerInterface> >&, unique_result (&)[2]) [with int KillSignal = 10; signallist_type = std::vector<int>; pthread_t = long unsigned int; unique_result = std::unique_ptr<std::tuple<etdc::uuid_type, long int> >]] sigwaiterthread: got signal 2
2021-05-20 09:26:57.97: [void signal_thread(const signallist_type&, pthread_t, etdc::etd_state&, std::vector<std::shared_ptr<etdc::ETDServerInterface> >&, unique_result (&)[2]) [with int KillSignal = 10; signallist_type = std::vector<int>; pthread_t = long unsigned int; unique_result = std::unique_ptr<std::tuple<etdc::uuid_type, long int> >]] sigwaiterthread: removing DST uuid  f3YWyd7mJTMqcjq9
2021-05-20 09:26:57.97: [virtual void etdc::ETDProxy::cancel(const etdc::uuid_type&)] ETDProxy::cancel/sending message 'cancel f3YWyd7mJTMqcjq9
'
2021-05-20 09:26:57.97: [void signal_thread(const signallist_type&, pthread_t, etdc::etd_state&, std::vector<std::shared_ptr<etdc::ETDServerInterface> >&, unique_result (&)[2]) [with int KillSignal = 10; signallist_type = std::vector<int>; pthread_t = long unsigned int; unique_result = std::unique_ptr<std::tuple<etdc::uuid_type, long int> >]] sigwaiterthread: removing SRC uuid  ISr3sOzwb0oWUugmT
2021-05-20 09:26:57.97: [virtual void etdc::ETDProxy::cancel(const etdc::uuid_type&)] ETDProxy::cancel/sending message 'cancel ISr3sOzwb0oWUugmT
'
2021-05-20 09:26:57.97: [void signal_thread(const signallist_type&, pthread_t, etdc::etd_state&, std::vector<std::shared_ptr<etdc::ETDServerInterface> >&, unique_result (&)[2]) [with int KillSignal = 10; signallist_type = std::vector<int>; pthread_t = long unsigned int; unique_result = std::unique_ptr<std::tuple<etdc::uuid_type, long int> >]] sigwaiterthread: done.
2021-05-20 09:27:05.92: [int main(int, const char* const*)] PUSH Resume /mnt/vbsmnt/ev0244_ow_244-1000_3 -> /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
2021-05-20 09:27:05.92: [virtual etdc::result_type etdc::ETDProxy::requestFileWrite(const string&, etdc::openmode_type)] ETDProxy::requestFileWrite/sending message 'write-file-Resume /gpfs/cdata/incoming/onsala-test/ev0244_ow_244-1000_3
' sz=72
2021-05-20 09:27:05.95: [virtual etdc::result_type etdc::ETDProxy::requestFileRead(const string&, off_t)] ETDProxy::requestFileRead/sending message 'read-file 2133215936 /mnt/vbsmnt/ev0244_ow_244-1000_3
'
2021-05-20 09:27:05.95: [int main(int, const char* const*)] Got exception: assertion error: src/etdc_etdserver.cc:901 [__m_connection->write(__m_connection->__m_fd, msg.data(), msg.size())==(ssize_t)msg.size()] fails
2021-05-20 09:27:05.95: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid f3YWyd7mJTMqcjq9
' fd=4
2021-05-20 09:27:05.98: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/uuid removed succesfully
2021-05-20 09:27:05.98: [virtual bool etdc::ETDProxy::removeUUID(const etdc::uuid_type&)] ETDProxy::removeUUID/sending message 'remove-uuid ISr3sOzwb0oWUugmT
' fd=3
2021-05-20 09:27:05.98: [etdc::etd_state::~etd_state()] ~etd_state/need to wait for 0 threads
