test: switch from curl to urllib for HTTP requests #29970

pull iw4p wants to merge 1 commit into bitcoin:master from iw4p:feature/requests-library-usage changing 1 file +17 −14
  1. iw4p commented at 12:24 pm on April 26, 2024: none
    Switched from making HTTP requests via subprocess (curl) to using Python’s built-in urllib library, improving cleanliness and maintainability.
  2. DrahtBot commented at 12:24 pm on April 26, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type          Reviewers
    ACK           AngusP
    Concept NACK  maflcko

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. DrahtBot added the label Refactoring on Apr 26, 2024
  4. in test/get_previous_releases.py:145 in 38176a7333 outdated
    -        ret = subprocess.run(cmd).returncode
    -        if ret:
    -            return ret
    +    response = requests.get(tarballUrl)
    +    with open(tarball, 'wb') as file:
    +        file.write(response.content)
    


    maflcko commented at 12:30 pm on April 26, 2024:
    This is not a refactor. It is a behavior change when an error occurs, for example a network error or a file error.

    laanwj commented at 12:40 pm on April 26, 2024:
    If you’re going to read the entire data into memory, you might as well do the hashing at the same time? Currently, it reads back the same tarball it wrote and passes it to the hasher, which seems kind of a waste.
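
    For illustration, a minimal sketch of the stream-and-hash approach laanwj suggests, assuming tarballUrl and tarball as defined in the surrounding script (the 64 KiB chunk size is an arbitrary choice):

    import hashlib
    from urllib.request import urlopen

    # tarballUrl and tarball are assumed to come from the surrounding script.
    # Stream the download to disk while feeding the same bytes to the hasher,
    # so the file never has to be read back for verification.
    hasher = hashlib.sha256()
    with urlopen(tarballUrl) as response, open(tarball, 'wb') as file:
        while True:
            chunk = response.read(65536)  # 64 KiB at a time
            if not chunk:
                break
            file.write(chunk)
            hasher.update(chunk)
    print(hasher.hexdigest())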

    iw4p commented at 12:40 pm on April 26, 2024:
    You’re right about the behavior change. I am still not sure whether to use the script or util tag.

    laanwj commented at 1:38 pm on April 26, 2024:
    It’s important to be careful about behavior changes, but i do think this is an improvement: using requests.head is better practice than calling curl and then looking for a string in the output.
  5. laanwj removed the label Refactoring on Apr 26, 2024
  6. laanwj added the label Tests on Apr 26, 2024
  7. laanwj commented at 1:43 pm on April 26, 2024: member
    This is a script used for the tests, so adding test tag.
  8. iw4p renamed this:
    refactor: switch from curl to requests for HTTP requests
    test: switch from curl to requests for HTTP requests
    on Apr 26, 2024
  9. iw4p commented at 2:00 pm on April 26, 2024: none
    The title’s tag also changed from Refactor to Test
  10. DrahtBot added the label CI failed on Apr 26, 2024
  11. iw4p force-pushed on Apr 26, 2024
  12. iw4p commented at 5:46 pm on April 26, 2024: none

    @laanwj I investigated and found that you modified import requests here. Is there a shell script where I can add pip install requests, so that the “previous releases, depends DEBUG” CI job passes?

    edit: It seems bitcoin/ci/test/00_setup_env_native_previous_releases.sh is responsible for providing everything the environment needs. Shall I add pip install requests there?

  13. laanwj commented at 10:25 pm on April 26, 2024: member

    Yes, that’s usually how it’d be done.

    But gah, requests is an external dependency? That’s unfortunate, i don’t think we should be adding dependencies unless absolutely necessary. Probably better to do this with Python’s built-in urllib or http.client functionality, or keep it like this.

  14. iw4p force-pushed on Apr 27, 2024
  15. iw4p renamed this:
    test: switch from curl to requests for HTTP requests
    test: switch from curl to urllib for HTTP requests
    on Apr 27, 2024
  16. iw4p force-pushed on Apr 27, 2024
  17. iw4p commented at 5:35 am on April 27, 2024: none
    I replaced requests with urllib and tested the script by downloading a .tar.gz file; it works fine for me.
  18. DrahtBot removed the label CI failed on Apr 27, 2024
  19. in test/get_previous_releases.py:138 in 7fe94f79c9 outdated
    -
    -    curlCmds = [
    -        ['curl', '--remote-name', tarballUrl]
    -    ]
    +    try:
    +        response = urlopen(Request(tarballUrl, method='HEAD'))
    


    maflcko commented at 7:20 am on April 27, 2024:
    Is it required to check for 404 in a separate, additional request?

    laanwj commented at 7:28 am on April 27, 2024:
    That’s how the current code works too - it first does a HEAD request to check that the file is there, then goes to downloading. i don’t know the original reasoning behind this, though.

    iw4p commented at 7:29 am on April 27, 2024:
    No, I could write it more Pythonically with a single request, but the reason I wrote it this way was to stay close to the previous code. If you agree with one request, I can change it. Should this change be in a new commit, or should I force-push it onto the previous one?

    maflcko commented at 7:30 am on May 13, 2024:

    Should this change be in a new commit or should I force it on the previous commit?

    Squashing seems fine?


    iw4p commented at 7:37 am on May 13, 2024:
    To rewrite the git history so that the latest change ends up in the last commit and the old ones are dropped, I force-pushed. Before that, I had only used squashing when merging.

    maflcko commented at 4:17 pm on May 25, 2024:
    Was this addressed?

    maflcko commented at 9:16 am on June 9, 2024:
    I tested this myself, and the reason seems to be that absent a HEAD request, a 404.html would be downloaded as the tar.gz, which then fails the shasum check.

    AngusP commented at 9:51 pm on June 12, 2024:
    With curl that makes sense, though here the urlopen will raise an HTTPError on a 404 for both a GET and a HEAD request, so the earlier HEAD does seem superfluous?
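
    As AngusP notes, a single GET would suffice; a minimal sketch under the same assumptions (tarballUrl and tarball from the script), where urlopen raises before anything is written to disk, so no 404.html can end up saved as the tarball:

    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    # tarballUrl and tarball are assumed to come from the surrounding script.
    try:
        with urlopen(tarballUrl) as response, open(tarball, 'wb') as file:
            file.write(response.read())
    except HTTPError as e:
        # Raised for 4xx/5xx responses, including 404, before the file is opened
        print(f"HTTP error {e.code} for {tarballUrl}")
    except URLError as e:
        # DNS failures, refused connections, and similar network errors
        print(f"Network error for {tarballUrl}: {e.reason}")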
  20. in test/get_previous_releases.py:147 in 7fe94f79c9 outdated
    -    for cmd in curlCmds:
    -        ret = subprocess.run(cmd).returncode
    -        if ret:
    -            return ret
    +    try:
    +        response = urlopen(tarballUrl)
    


    AngusP commented at 10:20 am on May 10, 2024:

    AFAIK there’s a minor behaviour change here, where urlopen will follow redirects whereas curl usually won’t:

    $ curl -I https://httpbin.org/absolute-redirect/3
    HTTP/2 302
    # ...
    
    >>> from urllib.request import urlopen
    >>> response = urlopen("https://httpbin.org/absolute-redirect/3")
    >>> response.code
    200  # Not 302 because redirects were followed
    

    This should be fine, but worth a mention (a sketch of preserving curl’s no-redirect behavior follows this thread).


    iw4p commented at 12:21 pm on May 12, 2024:
    Great point, Thank you.
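
    Expanding on the redirect note above: if curl’s default no-redirect behavior ever needed to be preserved, urllib can be told not to follow redirects. A hedged sketch, assuming the standard handler semantics (returning None from redirect_request leaves the 3xx unhandled, so the opener raises HTTPError):

    import urllib.request

    class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            # Don't follow the redirect; the opener then raises HTTPError for the 3xx
            return None

    opener = urllib.request.build_opener(NoRedirectHandler)
    # opener.open("https://httpbin.org/absolute-redirect/3") -> HTTPError: 302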
  21. AngusP approved
  22. AngusP commented at 10:22 am on May 10, 2024: contributor
    crACK 7fe94f79c90b689dce89b287d1016a97218021f6
  23. test: switch from curl to urllib for HTTP requests 588cad38f7
  24. in test/get_previous_releases.py:8 in 7fe94f79c9 outdated
    @@ -5,7 +5,7 @@
     # file COPYING or http://www.opensource.org/licenses/mit-license.php.
     #
     # Download or build previous releases.
    -# Needs curl and tar to download a release, or the build dependencies when
    +# Needs urllib built-in python library and tar to download a release, or the build dependencies when
    


    fanquake commented at 7:33 am on May 13, 2024:
    If this is part of the standard library, I don’t think you need to list it as a requirement.
  25. iw4p force-pushed on May 25, 2024
  26. iw4p requested review from laanwj on May 25, 2024
  27. iw4p requested review from fanquake on May 25, 2024
  28. iw4p requested review from maflcko on May 25, 2024
  29. AngusP approved
  30. AngusP commented at 9:50 pm on May 25, 2024: contributor
    ACK 588cad38f7813d88f022d0304aa20a0b96d184ed
  31. theStack commented at 2:47 pm on June 7, 2024: contributor
    One drawback of this change is that a user doesn’t see download progress per file any more (relevant if e.g. servers are slow or the connection is bad for any other reason; for me this currently takes several minutes per file :/) and hence is left in the dark about whether there is still activity, so Concept ~0.
  32. iw4p commented at 3:53 pm on June 7, 2024: none

    One drawback of this change is that a user doesn’t see download progress per file any more (relevant if e.g. servers are slow or the connection is bad for any other reason; for me this currently takes several minutes per file :/) and hence is left in the dark about whether there is still activity, so Concept ~0.

    I appreciate your feedback. While it’s true that switching from curl to urllib removes the per-file download progress visibility, this change offers significant improvements in code cleanliness and maintainability, which are crucial for long-term project sustainability.

    Moreover, the optimization to reduce two requests into one significantly lowers the traffic load on the host, enhancing overall performance. Although users might not see detailed progress, the efficiency gains and cleaner codebase justify this trade-off. For those needing more detailed progress feedback, we can consider integrating optional logging mechanisms directly into the code, ensuring we don’t rely on third-party libraries like tqdm.

  33. maflcko commented at 9:20 am on June 9, 2024: member

    Moreover, the optimization to reduce two requests into one significantly lowers the traffic load on the host,

    This isn’t true? There are still two requests before and after this change, so this isn’t changed. Also a HEAD request shouldn’t cause any meaningful traffic, compared to a full download.

  34. in test/get_previous_releases.py:149 in 588cad38f7
    -        if ret:
    -            return ret
    +    try:
    +        response = urlopen(tarballUrl)
    +        with open(tarball, 'wb') as file:
    +            file.write(response.read())
    


    maflcko commented at 9:23 am on June 9, 2024:
    Does this load the full response into memory, which curl does not do? (Probably fine, but if the approach is extended to debug symbols, which are 0.5GB, this may become an issue)

    AngusP commented at 9:49 pm on June 12, 2024:
    You can response.read(amt=<number of bytes>) so it could be chunked to avoid loading the whole thing into memory. That could also be used to add back a coarse progress indicator; see the sketch after this thread.

    maflcko commented at 7:33 am on June 13, 2024:
    Right, but at some point I wonder if it is worth it to try to re-implement curl in Python. I am not sure, but is curl not bundled with git, which is already a dependency of this project? Of all open source software projects, curl is currently among the least likely to go unmaintained, so the motivation still seems weak.
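
    A sketch of the chunked read AngusP mentions, which bounds memory use and allows a coarse progress printout; Content-Length may be absent, so the total is treated as optional (tarballUrl and tarball assumed as in the script):

    import sys
    from urllib.request import urlopen

    # tarballUrl and tarball are assumed to come from the surrounding script.
    with urlopen(tarballUrl) as response, open(tarball, 'wb') as file:
        length = response.getheader('Content-Length')
        total = int(length) if length else None
        done = 0
        while True:
            chunk = response.read(1024 * 1024)  # 1 MiB at a time, never the full body
            if not chunk:
                break
            file.write(chunk)
            done += len(chunk)
            if total:
                print(f"\r{done * 100 // total}%", end='', file=sys.stderr)
        print(file=sys.stderr)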
  35. maflcko commented at 9:26 am on June 9, 2024: member

    Also tend toward NACK here, as there are several issues from a quick glance:

    • This is more lines of code, so possibly more to maintain.
    • This worsens the UX, due to lack of progress.
    • Reviewing and testing this change may not be worth it, considering the limited benefit, if any?
  36. iw4p commented at 9:50 am on June 9, 2024: none

    Moreover, the optimization to reduce two requests into one significantly lowers the traffic load on the host,

    This isn’t true? There are still two requests before and after this change, so this isn’t changed. Also a HEAD request shouldn’t cause any meaningful traffic, compared to a full download.

    Twice I asked whether you think it’s fine to use one request to check the status code (404 or not) and to download in the same step, and I got no answer. Right now the code sends two requests, and I did not change that to one request because we weren’t aware of the reason behind it, but it’s clear that one request is enough. It seems I cannot easily change the logic and must continue with the same old logic; otherwise I’d like to write it the way below, which is more reliable and general for handling all kinds of errors, not only 404.

    try:
        response = urlopen(tarballUrl)
        with open(tarball, 'wb') as file:
            file.write(response.read())
    except Exception as e:
        print(f"Failed to download file. Error: {str(e)}")
        return 1
    
  37. maflcko commented at 10:34 am on June 9, 2024: member

    Twice I asked if you think it’s fine […] and I got no answer.

    As the author of a pull request, you are expected to spend time reviewing and testing the code you have written yourself. I try to ask questions during review to motivate pull request authors to look into interesting areas and edge-cases to explore in the change. There are currently more than 300 open pull requests and I don’t have the time to test everything I can imagine in all of them. In this case, it is not really that hard, so I did the test myself, see #29970 (review). So with the current code in master, the two requests are needed. I haven’t tested this with your code, so I can’t provide you an answer to that.

  38. DrahtBot added the label CI failed on Jun 18, 2024
  39. DrahtBot removed the label CI failed on Jun 20, 2024
  40. bitcoin deleted a comment on Jun 20, 2024
  41. maflcko commented at 7:30 am on June 21, 2024: member
    I think this can be closed for now. It seems too controversial for a simple test change?
  42. iw4p commented at 1:33 pm on June 21, 2024: none

    I think this can be closed for now. It seems too controversial for a simple test change?

    I think so. I’ll try to take a look at other issues; I’d be happy if I can help.

  43. maflcko commented at 2:03 pm on June 21, 2024: member

    Ok, closing for now.

    You can also help with review of other open pull requests. This will teach you about the project and naturally you will find leftovers, follow-ups, or (un)related changes that you can tackle yourself.

  44. maflcko closed this on Jun 21, 2024

