[zmq] pub/sub is not reliable at all #12754

bitkevin opened this issue on March 22, 2018
  1. bitkevin commented at 11:43 am on March 22, 2018: contributor

    The client only subscribes to the hashblock topic. If no block is found on the network for more than about 30 minutes, the pub/sub connection times out, but the client can’t detect anything unusual and keeps waiting forever.

    I fixed this by reconnecting. Example code:

    #define BITCOIND_ZMQ_HASHBLOCK      "hashblock"

    const time_t KReconnectInterval = 600;  // seconds

    //
    // check if need to reconnect ZMQ
    //
    if (subscriber == nullptr || lastRecvTime + KReconnectInterval < time(nullptr)) {
      // disconnect
      if (subscriber != nullptr) {
        delete subscriber;
        subscriber = nullptr;
      }

      // connect
      subscriber = new zmq::socket_t(zmqContext_, ZMQ_SUB);
      subscriber->connect(zmqBitcoindAddr_);

      // subscribe topic
      subscriber->setsockopt(ZMQ_SUBSCRIBE,
                             BITCOIND_ZMQ_HASHBLOCK, strlen(BITCOIND_ZMQ_HASHBLOCK));
    }

    //
    // recv ZMQ messages
    //

    // ...

    // set last recv time
    lastRecvTime = time(nullptr);

    // ...
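
The idle check in the snippet above can be isolated into a small helper. This is only a sketch under assumptions: `IdleWatchdog` is a hypothetical name, not part of libzmq or Bitcoin Core, and it uses a monotonic clock instead of `time()` so that wall-clock adjustments cannot trigger or suppress a reconnect. It also assumes the timer should be reset on reconnect, so a fresh socket gets a full interval before it is judged idle.

```cpp
#include <chrono>

// Hypothetical helper (not part of libzmq or Bitcoin Core): tracks how long a
// subscriber has been silent and tells the caller when to reconnect.
class IdleWatchdog {
public:
    using Clock = std::chrono::steady_clock;  // monotonic clock

    IdleWatchdog(std::chrono::seconds interval, Clock::time_point now)
        : interval_(interval), last_activity_(now) {}

    // Call after every received message and after every (re)connect.
    void touch(Clock::time_point now) { last_activity_ = now; }

    // True when no activity has been recorded for longer than the interval.
    bool expired(Clock::time_point now) const {
        return now - last_activity_ > interval_;
    }

private:
    std::chrono::seconds interval_;
    Clock::time_point last_activity_;
};
```

In a receive loop, the caller would poll the socket with a timeout, call `touch()` on each message, and rebuild the socket (then call `touch()` again) whenever `expired()` returns true.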
    

    I am not sure whether setting these options with zmq_setsockopt() would solve this issue.

    // ZMQ_HEARTBEAT_xxxx available since zmq-4.2.0
    ZMQ_HEARTBEAT_TTL
    ZMQ_HEARTBEAT_TIMEOUT
    ZMQ_HEARTBEAT_IVL
    

    Maybe some heartbeat messages could be sent when there’s no traffic for a while? I would prefer this approach.

    Some links:

  2. fanquake added the label RPC/REST/ZMQ on Mar 22, 2018
  3. laanwj added the label Upstream on Mar 27, 2018
  4. TheBlueMatt commented at 7:32 pm on March 28, 2018: member
    ZMQ_HEARTBEAT_* look promising, can you test with those in your test setup?
  5. MarcoFalke removed the label Upstream on Mar 28, 2018
  6. luke-jr commented at 8:00 pm on March 28, 2018: member
    ZMQ is itself unreliable. I’m not sure this is a Core issue.
  7. mruddy commented at 4:26 am on July 5, 2018: contributor

    @bitkevin Do you have any more info on this problem? Is it caused by a middle box? Does it only happen when running with a particular node topology? Have you tried tuning TCP keepalive, or anything like that, on your host(s) to remedy the problem? Did you try the ZMQ_HEARTBEAT_* options https://github.com/zeromq/libzmq/blob/master/doc/zmq_setsockopt.txt ?

    I tried running a regtest node as a publisher with ./src/qt/bitcoin-qt -regtest -txindex -datadir=/tmp/node1 -zmqpubhashblock=tcp://127.0.0.1:28332 & and python3 contrib/zmq/zmq_sub.py as a subscriber so that I could easily control block generation. Then I watched my loopback interface with WireShark. I generated one block, saw the packets, and then did nothing and did not see any packets go either way around the 30 minute mark with my system config. After 45 minutes of doing nothing and seeing no packet traffic for that TCP port, I generated a block and my subscriber got the message as expected from my Bitcoin node. I am using libzmq v4.2.5, if it matters.

    Are you able to reproduce the issue using the same tools from the bitcoin repo that I used in my test?

  8. ch4ot1c commented at 8:42 pm on August 17, 2018: contributor

    Looks like they’re starting to sort it out in the ZMQ repo:

    https://github.com/zeromq/libzmq/commit/0867c380326829dc7e48e12d46b977bffad207c4 https://github.com/zeromq/libzmq/commit/79d5ac3deeb944c9fd8a7a7a20a8b973d119fa5e https://github.com/zeromq/libzmq/commit/cdf556610812c172a15c53da063ffd5684c5d995

    We should probably await a fresh release / sufficient tests, but can begin on the implementation with 4.2.3 (using zmq_setsockopt and ZMQ_HEARTBEAT_IVL/TIMEOUT/TTL). I’ll take a crack at it this week.

    http://api.zeromq.org/4-2:zmq-setsockopt#toc17

    https://github.com/zeromq/libzmq/blob/master/tests/test_heartbeats.cpp#L425

  9. mruddy commented at 11:36 am on August 18, 2018: contributor
    @ch4ot1c Are you able to reproduce the issue using the same tools from the bitcoin repo that I used in my test?
  10. ch4ot1c commented at 6:03 pm on August 20, 2018: contributor

    @mruddy Just tried it out and got the same results: everything still seems to be connected, and the block appears immediately (block 7 here, after a generate 1 after 45 minutes, port 28332):

    Maybe this is a timeout due to a middle box topology, like you mentioned?

  11. mruddy commented at 7:45 pm on August 25, 2018: contributor

    @ch4ot1c Thanks for confirming what you see!

    I’m convinced that this behavior is caused by middle boxes. The OP could confirm with a little command line inspection using tools like netstat, ss, and tcpdump. He could verify that both sides think the connection is established and that some packets are sent, but not received, when this condition occurs.

    With the middle box idea in mind, I’ve reproduced this behavior with a topology spread across multiple hosts with multiple NATs and firewalls in between.

    The thing is that the ZMQ PUB sockets are not being created with TCP keep-alive enabled (i.e., SO_KEEPALIVE, or more specifically to libzmq, setting ZMQ_TCP_KEEPALIVE to 1 in CZMQAbstractPublishNotifier::Initialize). This is easily verified with the -o option to netstat or ss. I think enabling TCP keep-alive is what we want to do here because we want to keep the connection alive and we are not concerned with whether the other side is alive (which is what app-level heartbeating is designed to detect). Plus, using the TCP keep-alive options means we don’t have to bump the min required library version (the heartbeat options are newer, see https://github.com/zeromq/libzmq/releases/tag/v4.2.0).

    So, one way we can go is to set ZMQ_TCP_KEEPALIVE to 1 and then use the operating system provided defaults for the keep-alive interval and the number of probes that may fail before the connection is killed. That works for me when the OS settings are set low enough to keep the middle boxes from forgetting the connection. The values must be set before starting the bitcoin process, i.e. before the connection is established.

    The other way we could go is to set ZMQ_TCP_KEEPALIVE to 1 and then also provide bitcoin command line args to customize the other ZMQ_TCP_KEEPALIVE_* socket options (see http://api.zeromq.org/4-2:zmq-setsockopt).
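
To make the per-socket options concrete, here is a rough sketch of the OS-level knobs that, according to the libzmq documentation, ZMQ_TCP_KEEPALIVE, ZMQ_TCP_KEEPALIVE_IDLE, ZMQ_TCP_KEEPALIVE_INTVL and ZMQ_TCP_KEEPALIVE_CNT override (SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT). It uses a plain Linux TCP socket rather than libzmq, and `KeepAliveConfig`, `enable_keepalive` and `read_int_option` are hypothetical names chosen for illustration, not Bitcoin Core code.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical settings bundle; the field names are illustrative only.
struct KeepAliveConfig {
    int idle_secs;      // idle time before the first probe (TCP_KEEPIDLE)
    int interval_secs;  // gap between unanswered probes (TCP_KEEPINTVL)
    int probe_count;    // unanswered probes before the connection is dropped (TCP_KEEPCNT)
};

// Enables keep-alive on an existing TCP socket and overrides the OS defaults
// for this socket only (Linux). Returns false if any setsockopt call fails.
bool enable_keepalive(int fd, const KeepAliveConfig& cfg) {
    const int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &cfg.idle_secs, sizeof(int)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &cfg.interval_secs, sizeof(int)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cfg.probe_count, sizeof(int)) == 0;
}

// Reads an integer socket option back; returns -1 on failure.
int read_int_option(int fd, int level, int name) {
    int value = 0;
    socklen_t len = sizeof(value);
    return getsockopt(fd, level, name, &value, &len) == 0 ? value : -1;
}
```

Reading the options back with `read_int_option` is the programmatic counterpart of checking the timer column shown by `netstat -o` or `ss -o`.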

    For more on the OS keep alive settings see /proc/sys/net/ipv4/tcp_keepalive_time, /proc/sys/net/ipv4/tcp_keepalive_intvl, and /proc/sys/net/ipv4/tcp_keepalive_probes in man 7 tcp.
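
As a worked example of why the OS defaults may be too slow here: the usual Linux defaults are tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75 and tcp_keepalive_probes = 9, so the first probe only goes out after two hours of silence. The sketch below does the arithmetic; the struct and function names are hypothetical, and the 1800-second middle box timeout is an assumption taken from the "about 30 minutes" reported at the top of this issue.

```cpp
// Hypothetical mirror of the three Linux tunables from `man 7 tcp`.
struct KeepAliveTunables {
    int time_secs;   // tcp_keepalive_time: idle seconds before the first probe
    int intvl_secs;  // tcp_keepalive_intvl: seconds between probes
    int probes;      // tcp_keepalive_probes: unanswered probes before giving up
};

// Seconds of silence before the kernel declares an unresponsive peer dead.
int seconds_until_dead_peer_detected(const KeepAliveTunables& t) {
    return t.time_secs + t.intvl_secs * t.probes;
}

// A middle box that forgets idle flows after `middlebox_idle_secs` only sees
// traffic in time if the first probe is sent before that timeout.
bool keeps_mapping_alive(const KeepAliveTunables& t, int middlebox_idle_secs) {
    return t.time_secs < middlebox_idle_secs;
}
```

With the defaults, `seconds_until_dead_peer_detected` gives 7200 + 75 * 9 = 7875 seconds and `keeps_mapping_alive(defaults, 1800)` is false. Lowering tcp_keepalive_time to, say, 600 seconds makes the first probe arrive well inside a 30-minute window.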

    So I don’t forget, something like this works (added in src/zmq/zmqpublishnotifier.cpp before zmq_bind):

    const int so_keepalive_option = 1;

    rc = zmq_setsockopt(psocket, ZMQ_TCP_KEEPALIVE, &so_keepalive_option, sizeof(so_keepalive_option));
    if (rc != 0)
    {
        zmqError("Failed to set SO_KEEPALIVE");
        zmq_close(psocket);
        return false;
    }
    
  12. bitkevin commented at 2:34 pm on September 3, 2018: contributor

    After reading the ZMQ docs, I still have no idea how to use ZMQ_HEARTBEAT_*. I copied some code snippets and tried zmq_socket_monitor(), but the heartbeat still does not work: there is no timeout event or anything similar.

    But reconnecting after no message has arrived for a while is very stable.

  13. mruddy commented at 10:31 pm on September 11, 2018: contributor
    ZMQ_HEARTBEAT_* is not what you want. I’ll make some ZMQ_TCP_KEEPALIVE changes available after some of my other zmq stuff gets merged to avoid having to rebase more than necessary.
  14. jonasschnelli referenced this in commit 3a3e21dafb on Sep 2, 2020
  15. sidhujag referenced this in commit 73c1218670 on Sep 3, 2020
  16. adamjonas commented at 11:05 pm on December 16, 2020: member
    Closed by #14687.
  17. MarcoFalke commented at 7:58 am on December 17, 2020: member
    Is this still an issue with a recent version of Bitcoin Core? If yes, what are the steps to reproduce?
  18. MarcoFalke closed this on Dec 17, 2020

  19. PastaPastaPasta referenced this in commit 9d7f30252c on Sep 17, 2021
  20. PastaPastaPasta referenced this in commit 516cfb54c5 on Sep 24, 2021
  21. kittywhiskers referenced this in commit 4bb73be509 on Oct 12, 2021
  22. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-06 01:12 UTC
