Unresponsive RPC server and RPC threads hanging #6454

kanzure opened this issue on July 17, 2015

kanzure commented at 7:38 PM on July 17, 2015: contributor
When using a python bitcoin RPC client, I sometimes encounter errors like:
```
>>> proxy._call("getblock", "45b0efe24da4ed0f984a8808e04b584a78cedaa8b49ac9097f724d368d0871df")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/bitcoin/rpc.py", line 142, in _call
    'Content-type': 'application/json'})
  File "/usr/lib/python3.4/http/client.py", line 1065, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1093, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.4/http/client.py", line 948, in putrequest
    raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
```
When this error is received, I find that:
- bitcoin-cli and other RPC clients cannot get bitcoind to respond, they hang
- only way to kill bitcoind is to SIGKILL it?
- RPC threads seem locked, syscalls were all waits or recvs
- on testnet in this state, bitcoind continues normal p2p gossip
Replication steps: https://gist.github.com/kanzure/57f1b50cf7fb82cc5c1a
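For readers unfamiliar with the `CannotSendRequest: Request-sent` error in the traceback above: Python's `http.client` raises it when a second request is issued on a connection whose previous response has not been read yet. The following is a minimal sketch that triggers the same exception without bitcoind. It assumes a local listening socket that never answers is an acceptable stand-in for the hung RPC server; the function name is made up for illustration.

```python
import http.client
import socket


def demonstrate_cannot_send_request():
    """Return the CannotSendRequest message raised by a second, unanswered request."""
    # Stand-in for a hung server: the socket listens but never replies.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    headers = {"Content-type": "application/json"}
    try:
        # First request is sent; its response is never read.
        conn.request("POST", "/", body=b"{}", headers=headers)
        try:
            # Second request on the same connection is refused by http.client.
            conn.request("POST", "/", body=b"{}", headers=headers)
        except http.client.CannotSendRequest as exc:
            return str(exc)
        return None
    finally:
        conn.close()
        listener.close()


if __name__ == "__main__":
    print(demonstrate_cannot_send_request())
```

This only shows the client-side half of the symptom. It says nothing about why the server stops answering.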

Replication steps summary:
- setgenerate true 1000
- when setgenerate is running, send another RPC request
- kill the client that sent the other RPC request
- when setgenerate is done, RPC will be unavailable
- use rpcthreads=1 to help see this behavior
Expected behavior is that the RPC thread should eventually be available again.
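The "send another RPC request, then kill the client" steps above can be approximated by a client that writes one request and drops the socket without reading the reply. This is a rough sketch and not the script from the gist. The helper name and the `extra_headers` parameter are hypothetical, and a real bitcoind would also need an `Authorization` header passed through `extra_headers`.

```python
import json
import socket


def send_and_abandon(host, port, method, params=(), extra_headers=None):
    """Send one JSON-RPC request, then close the socket without reading a reply.

    Returns the number of bytes written. Closing right after the write
    approximates a client that is killed while the server is still busy.
    """
    body = json.dumps({"method": method, "params": list(params), "id": 1}).encode("utf-8")
    lines = [
        "POST / HTTP/1.1",
        "Host: {}".format(host),
        "Content-Type: application/json",
        "Content-Length: {}".format(len(body)),
        "Connection: keep-alive",
    ]
    # Hypothetical hook for credentials, e.g. {"Authorization": "Basic ..."}.
    for name, value in (extra_headers or {}).items():
        lines.append("{}: {}".format(name, value))
    head = ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(head + body)
    # Leaving the with-block closes the socket; no response is ever read.
    return len(head) + len(body)
```

Run against a node started with rpcthreads=1 while setgenerate is in progress, this would stand in for the second and third steps. Whether it reproduces the hang on a given build is not something this sketch can confirm.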

Relevant commits: https://github.com/bitcoin/bitcoin/commit/16a5c18cea7330bd68dc9d2f768eb518af88795b https://github.com/bitcoin/bitcoin/commit/7d2cb485116636595250fce4ea4eab16a877479b

Relevant issues: #5655 (comment)

Using rpckeepalive=0 solves this problem, although the above commit messages make it sound like this is only a temporary fix. So hopefully others can replicate this problem now with the steps given above. Alternatively, #5677 should fix this too.
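For anyone who cannot change the server configuration, a client can get a similar effect by never reusing a connection. Below is a minimal sketch using only the Python standard library, assuming HTTP basic auth and a plain JSON-RPC body. The function names are illustrative and are not part of any bitcoin RPC library.

```python
import base64
import http.client
import json


def build_rpc_request(method, params, request_id=0):
    """Encode a JSON-RPC request body as UTF-8 bytes."""
    payload = {"method": method, "params": list(params), "id": request_id}
    return json.dumps(payload).encode("utf-8")


def call_with_fresh_connection(host, port, user, password, method, *params, timeout=30):
    """Open a new HTTP connection for one RPC call and close it afterwards."""
    credentials = "{}:{}".format(user, password).encode("utf-8")
    headers = {
        "Content-type": "application/json",
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        # Ask the server not to keep the connection open after this reply.
        "Connection": "close",
    }
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("POST", "/", body=build_rpc_request(method, params), headers=headers)
        response = conn.getresponse()
        return json.loads(response.read().decode("utf-8"))
    finally:
        # Always release the socket, even if the call failed or timed out.
        conn.close()
```

A call would look like `call_with_fresh_connection("127.0.0.1", 8332, "rpcuser", "rpcpassword", "getinfo")`, with placeholder credentials. The `timeout` matters here: without it a client talking to a hung server blocks forever instead of raising an error.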

FWIW, I was completely unable to determine that this was a keep-alive problem when I was receiving error messages. Only once I started reading commit history on the rpcserver file did I find that rpckeepalive was meant to help with this issue. So maybe this is a docs issue.....
laanwj commented at 7:58 PM on July 17, 2015: member

Yes, #5677 should fix these kinds of issues, it's why I started working on it. rpckeepalive=0 is the workaround until then.
laanwj added the label RPC on Jul 17, 2015
LaurentMT commented at 12:39 AM on July 22, 2015: none
I've met the same problem with python bitcoin rpc + bitcoin 0.10.2. Basically my client does this:
- 1 process calls getblock() for a given block and distributes the processing of txs to n subprocesses
- n subprocesses call getrawtransaction() in parallel.
rpckeepalive=0 fixes the problem with hanging connections but causes another issue, crashing bitcoin after a while.

Trace found in debug.log

2015-07-21 23:30:00 LevelDB read failure: IO error: [...]/chainstate/5386207.ldb: The process cannot access the file because it is being used by another process.

2015-07-21 23:30:00 IO error: [...]/chainstate/5386207.ldb: The process cannot access the file because it is being used by another process.

I'm going to decrease the number of subprocesses to check if it's more stable. The weird thing is this script has run for a while without any problem. The issues started a few days ago and have persisted since then.
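One way to cap the number of parallel getrawtransaction() calls is a fixed-size worker pool. This is a minimal sketch that swaps the subprocesses described above for threads from `concurrent.futures`, to keep it short. `fetch_one` is a hypothetical callable that would open its own RPC connection and fetch a single transaction; it is not a function from any bitcoin library.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(txids, fetch_one, max_workers=4):
    """Apply fetch_one to every txid with at most max_workers calls in flight.

    Results are returned in the same order as txids.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order and re-raises the first worker exception.
        return list(pool.map(fetch_one, txids))
```

Lowering `max_workers` is the equivalent of decreasing the number of subprocesses. Each worker should use its own connection, since a single `http.client` connection cannot carry two requests at once.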
laanwj commented at 5:23 AM on July 22, 2015: member

The leveldb issue has nothing at all to do with this. That's a separate issue, from the error message caused by another process trying to access the files while bitcoind is running.
LaurentMT commented at 10:35 AM on July 22, 2015: none

@laanwj: thanks for the feedback. Unfortunately, this is the only trace I've found in the log for the 2 crashes of my bitcoind (Edit: I mean the 2 times).

FWIW, since my previous comment I've relaunched the task after having decreased the number of subprocesses and moved the bitcoin data directory to an SSD. Everything was ok.

The 2 modifications don't help with diagnosis, but I was eager to complete the task, which was a bottleneck for my project. I'll try to run the task on another machine in order to reproduce the problem.
sipa commented at 10:38 AM on July 22, 2015: member

But do you have another process accessing the database? LevelDB is not designed for multiprocess operation...
LaurentMT commented at 11:05 AM on July 22, 2015: none

@sipa: No other process on my side except the ones calling the rpc api. My best guess for now is that it's related to something run by the OS (AV, ...). I'll have to investigate further in the other logs of the server.
dthorpe commented at 7:01 PM on August 4, 2015: none

I'm seeing similar RPC sluggishness running Bitcoin-qt 0.11 on 32 bit Windows 7.

However, setgenerate true is not required in my case. I have bfgminer 5.1.0 (managing USB asics) running on the same machine talking to bitcoin-qt node via localhost RPC. After running for some time (hours?), making any RPC call (such as "getinfo") from a remote machine takes an extraordinary length of time - 2 minutes or more. Requests are usually answered eventually, if the client doesn't time out first.

When the node is in this RPC degraded state, other operations in the Bitcoin-qt client appear unaffected. Network throughput is fine, the UI is responsive, the debug window's network traffic panel shows light network i/o, and only a handful of peers (7 incoming, 8 outgoing). CPU load is around 2%.

Stopping the bfgminer appears to return the bitcoin-qt RPC responsiveness to normal. Bitcoin-qt RPC responsiveness is fine after restarting bfgminer... for "awhile".

This RPC behavior appears to be new with the v0.11 release.
kanzure commented at 7:04 PM on August 4, 2015: contributor

@dthorpe, that sounds like a very different bug to me. Your RPC server seems to recover. Also, I should add that the system I was working on was linux at the time, thanks for the reminder by mentioning your system...
dthorpe commented at 11:42 PM on August 4, 2015: none

@kanzure Ok, I'll open a new issue then.
kanzure commented at 3:26 PM on September 4, 2015: contributor

Closing now that #5677 (libevent http server) is merged.
kanzure closed this on Sep 4, 2015
MarcoFalke locked this on Sep 8, 2021

Linked:
- #5655 Add a -rpckeepalive and disable RPC use of HTTP persistent connections.
- #5677 libevent-based http server