Travis timeouts #16148

issue MarcoFalke opened this issue on June 5, 2019
  1. MarcoFalke commented at 11:15 am on June 5, 2019: member

    Travis generally times out when ./depends changes or when a commonly included header file is changed. It currently needs human attention to run the job again and continue compiling from the last time the cache was saved.

    Solutions to this could be:

    • Run a helper script to poll travis failures and rerun them if the failure was due to a timeout (a rough sketch follows after this list)
    • Rerun about-to-time-out jobs from within travis
    • Switch to another CI solution. We currently have travis, appveyor and (not enabled) Cirrus CI, so we’d have to get rid of at least one of them before adding a new one.
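
    For the first option, a rough, untested sketch of such a helper is below. It assumes the Travis API v3, a token exported as TRAVIS_TOKEN, and that an "errored" build state corresponds to a timeout; a real helper would first fetch the job log and only restart when it contains Travis' timeout message.

      #!/usr/bin/env bash
      # Rough sketch only: list recent bitcoin/bitcoin builds on Travis and restart
      # the jobs of builds that errored. Requires curl, jq and a Travis API token
      # exported as TRAVIS_TOKEN.
      api="https://api.travis-ci.org"
      auth=(-H "Travis-API-Version: 3" -H "Authorization: token ${TRAVIS_TOKEN}")

      curl -sS "${auth[@]}" "${api}/repo/bitcoin%2Fbitcoin/builds?limit=10" |
        jq -r '.builds[] | select(.state == "errored") | .jobs[].id' |
        while read -r job_id; do
          echo "Restarting job ${job_id}"
          curl -sS -X POST "${auth[@]}" "${api}/job/${job_id}/restart"
        done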
  2. MarcoFalke added the label Brainstorming on Jun 5, 2019
  3. MarcoFalke added the label Tests on Jun 5, 2019
  4. practicalswift commented at 11:05 pm on June 5, 2019: contributor
    Would the Travis timeouts be less frequent if we paid them more?
  5. MarcoFalke commented at 3:15 pm on June 6, 2019: member
    I want to avoid that, because that would make it harder for devs to run builds in their own repos
  6. marpme commented at 6:17 pm on June 12, 2019: none

    Well, we basically switched away from using Travis for VERGE core entirely because of this very problem.

    For us this script is very basic, but it does what it should. Maybe there's even more to be added for utilizing this at Bitcoin: https://github.com/vergecurrency/VERGE/blob/master/.circleci/config.yml

    We’re now using CircleCI and it doesn’t have any hard limit on processing time as long as you have an active stdout. It runs quite smoothly with every feature that Travis supports, including pure Docker containers that you can run on their VMs :)
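
    (As an aside: on providers that only require active stdout, a common way to survive long, otherwise-silent steps is a background keep-alive loop. The sketch below is generic and illustrative, not taken from the VERGE config linked above.)

      # Illustrative only: print a heartbeat so the CI provider sees output while a
      # long, mostly silent step (such as a full depends build) is running.
      while true; do echo "still building..."; sleep 300; done &
      keepalive_pid=$!

      make -C depends -j"$(nproc)"   # example of a long-running, mostly silent step

      kill "${keepalive_pid}"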

  7. MarcoFalke commented at 7:15 pm on June 12, 2019: member

    Our requirements are pretty much:

    • Should have free parallel builds even for forked repos, so that developers can run it in their own public repo
    • Should support permissions similar to travis, so that organization contributors can rerun the build like they do for timed out travis builds today
    • Be able to support the same matrix as our current travis matrix of cross-compiles (with caching) and running the functional tests in sanitizers
  8. marpme commented at 8:28 pm on June 12, 2019: none

    CircleCI supports all of those listed points:

    1. Free plan for open-source projects with 4x parallelization; forks are also supported by default
    2. Haven’t tested it too much, but there’s definitely a permission system for contributors/members of orgs
    3. Yep, should be possible, and afaik there are no limitations on either caching or matrices

    To put it in a nutshell, I don’t see any blocker that would keep you from doing the same things that are being done on Travis CI. There are even more pretty neat features to support easy maintenance for all those tests. But to stay realistic, the only downside I can see is that you have everything inside a YAML file, which is (in my opinion) not that easily readable and becomes quite huge after some time :^(

  9. MarcoFalke commented at 11:57 am on June 13, 2019: member
    It seems there are issues with the travis caching feature (either on our side or elsewhere), which is why almost all builds time out :weary:
  10. marpme commented at 3:46 pm on June 13, 2019: none
    Well, unless you switch away or convince Travis to upgrade their systems, this won’t get better. I just wanted to share my insights here, and if you need some help just ping me :)
  11. ryanofsky commented at 5:07 pm on June 13, 2019: member

    @MarcoFalke how difficult would it be to “Run a helper script to poll travis failures and rerun them if the failure was due to a timeout” or “Rerun about-to-time-out jobs from within travis” as suggested in the issue description? I understand doing these things wouldn’t solve every problem, but it seems like they would solve most of the current problems with no downsides and not be too difficult to implement. The second option “Rerun about-to-time-out jobs from within travis” seems especially appealing and it looks like there’s an existing implementation of it with curl here: https://github.com/plume-lib/trigger-travis#use-in-travisyml

    The only potential downside I see is that logs from “Error! Initial build successful, but not enough time remains…” builds would be discarded, but since successful build logs probably aren’t very interesting, this doesn’t seem so important.

    I guess in the longer term, it does seem like we are pushing against travis limits and might need to find other solutions.
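
    (For illustration, the "rerun about-to-time-out jobs from within travis" idea boils down to a time-budget check before the expensive steps. The numbers and message below are made up and not taken from the actual travis scripts.)

      # Illustrative sketch of a time-budget guard inside the travis build script.
      # Record the start time early, then bail out before the hard timeout so the
      # job ends with a useful log (and, depending on Travis' cache behaviour, a
      # saved cache) instead of being killed mid-build.
      start_time=$(date +%s)
      budget_seconds=$((45 * 60))   # assumed budget, somewhat below the job timeout

      elapsed=$(( $(date +%s) - start_time ))
      if [ "${elapsed}" -gt "${budget_seconds}" ]; then
        echo "Not enough time remains to run the tests; exiting early."
        exit 1
      fi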

  12. MarcoFalke commented at 10:47 am on June 14, 2019: member

    @ryanofsky Thanks for the reply and links, but I think that we should first fix the underlying caching problem before starting to re-run builds (manually or automatically).

    There is a red line "depends/built" in the travis log: https://travis-ci.org/bitcoin/bitcoin/jobs/545187878#L392

    Does anyone know what that could mean? :thinking:

  13. promag commented at 11:49 am on June 14, 2019: member
    Does that happen because of https://travis-ci.org/bitcoin/bitcoin/jobs/545187878#L6977 (under "store build cache")?
  14. MarcoFalke referenced this in commit f6f924811d on Jun 14, 2019
  15. MarcoFalke commented at 10:31 pm on June 14, 2019: member

    Travis just confirmed to me it was an issue on their side:

    Hi Marco, thank you for following up and glad to hear you were able to resolve this on your end. I apologize for the inconvenience. I can confirm that relative paths were broken but should now be fixed. Considering your builds now work with absolute paths, I just wanted to let you know that if you feel inclined to try with relative paths again, I’d be curious to see if you observe any issues. Thank you for your patience again and wishing you a great weekend.

  16. MarcoFalke commented at 5:17 pm on June 17, 2019: member

    So we haven’t changed anything and the caching issue is back. https://travis-ci.org/bitcoin/bitcoin/jobs/546813573#L150

    Did anyone make progress on switching away from travis?

  17. MarcoFalke added the label Up for grabs on Jun 17, 2019
  18. jonasschnelli commented at 3:25 pm on June 18, 2019: contributor

    A few days ago I added Semaphore 2.0 CI to my personal fork. https://github.com/jonasschnelli/bitcoin/commit/5950a65c1d99cfe13f38c343bc96160fa7abe1d6 It works pretty well and builds fast… however, their lack of support for making build logs publicly available makes it pretty unusable for our use case.

    According to one of their founders, they have planned support for that later this year.

  19. MarcoFalke commented at 8:03 pm on June 18, 2019: member

    Hi Marco, thank you for your follow up and your patience here. We pushed a fix that should hopefully resolve this. If you haven’t had a chance to restart your builds, could you try doing so? Once again thank you for your patience as well as your collaboration here. Please don’t hesitate to follow up if you’re still observing issues.

  20. MarcoFalke commented at 6:30 pm on June 19, 2019: member

    Our quote for the next year is 8k USD for 15 parallel jobs, which seems in line with CircleCI and estimated Semaphore pricing.

    [attachment: 000986.pdf]

  21. jonasschnelli commented at 6:57 pm on June 19, 2019: contributor
    I’m currently working on a custom self-hosted CI solution. Something that is completely open source and can be adapted and extended easily. Interacting with GitHub’s Checks API seems doable. Maybe it’s not what we want… but I’ll try nevertheless.
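
    (For reference, reporting a result through GitHub's Checks API from a self-hosted runner is roughly a single authenticated POST. A rough sketch; APP_TOKEN stands in for a GitHub App installation token and COMMIT_SHA for the commit being reported, and obtaining them is not shown.)

      # Rough sketch: create a completed check run for a commit via the Checks API.
      # Check runs must be created by a GitHub App installation; APP_TOKEN is that
      # installation access token.
      curl -sS -X POST \
        -H "Authorization: token ${APP_TOKEN}" \
        -H "Accept: application/vnd.github.antiope-preview+json" \
        -d '{"name": "self-hosted-ci", "head_sha": "'"${COMMIT_SHA}"'", "status": "completed", "conclusion": "success"}' \
        "https://api.github.com/repos/bitcoin/bitcoin/check-runs"
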
  22. MarcoFalke commented at 6:59 pm on June 19, 2019: member
    @jonasschnelli You have until June 20, 2020, so take your time ;)
  23. MarcoFalke commented at 4:00 pm on June 20, 2019: member
    Someone suggested to me to use Azure Pipelines. They offer a lot of compute for free to open-source projects, but they seem to require a Microsoft account, so I didn’t dig deeper.
  24. jonasschnelli commented at 6:39 pm on June 21, 2019: contributor

    Semaphore just made us a nice offer: basically 16 parallel jobs on 2 vCPU 3.4 GHz 4 GB RAM machines for free.

    Since travis is back running… I don’t know what to do.

  25. MarcoFalke commented at 6:51 pm on June 21, 2019: member
    @jonasschnelli Thanks for working on this. I wouldn’t mind if someone worked on the semaphore yaml descriptor and we merged it here (without enabling it), just like we did for .cirrus.yml. If it turns out to work better than travis in personal forks of this repo, we can enable it for bitcoin/bitcoin as well and then get rid of travis.
  26. MarcoFalke removed the label Up for grabs on Aug 13, 2019
  27. laanwj closed this on Aug 14, 2019

  28. laanwj referenced this in commit b120645e8c on Aug 14, 2019
  29. MarcoFalke commented at 1:47 pm on August 14, 2019: member

    So we fixed one type of timeout and just in time travis figures out how to time out in a different way :weary:

    https://travis-ci.org/bitcoin/bitcoin/jobs/571796312#L209 https://travis-ci.org/bitcoin/bitcoin/jobs/571787127#L219

  30. MarcoFalke reopened this on Sep 10, 2019

  31. MarcoFalke commented at 1:49 pm on September 10, 2019: member
    The timeout failures in apt update have been reported to travis in tickets [10319] and [8618]. Since June 21st I haven’t heard back from them with a solution or even a workaround.
  32. practicalswift commented at 3:08 pm on September 10, 2019: contributor
    @moneyball You don’t happen to have a line of communication open with Travis like you have with GitHub? :)
  33. moneyball commented at 3:41 pm on September 10, 2019: contributor
    I don’t but if it would be helpful to the project I can try to establish a connection. If I get a few thumbs up on this comment I’ll prioritize it. I’d want to know what the project’s history is with Travis.
  34. MarcoFalke commented at 9:13 am on September 11, 2019: member
    We sent them more than 8k USD a few weeks ago and as of recently they are no longer responding on the latest ticket, so I would not put too much hope in Travis at this point… They were acquired by Idera and fired their brightest minds, so I shouldn’t be surprised that their quality of service has turned out insufficient for our project in recent months. We should finally spend our resources on moving away from Travis.
  35. practicalswift commented at 10:32 am on September 11, 2019: contributor

    @MarcoFalke

    I feel your frustration with Travis (which I share), but if @moneyball can help us escalate this and make our Travis experience somewhat more pleasant as long as we are stuck with them, then why not try that?

    The engineering problem of moving away from Travis is AFAICT totally independent of the project management problem of opening up a line of communication to one of our most important (and most troublesome!) suppliers.

    Making sure non-engineering roadblocks are taken care of swiftly seems like an excellent PM task to me. Especially problems that a.) frustrate developers greatly and b.) likely could be solved quickly by opening up alternative lines of communication :)

  36. practicalswift commented at 12:40 pm on September 30, 2019: contributor
    @moneyball Any success in establishing a connection to Travis? It really would be nice to have a line of communication open the next time we encounter a major Travis problem blocking technical progress :)
  37. moneyball commented at 3:24 pm on September 30, 2019: contributor
    I haven’t prioritized it due to only 2 thumbs up and Marco suggesting we spend our resources in a different way. If more contributors including at least one maintainer voice support for me engaging Travis, I will be happy to do so.
  38. meeDamian commented at 4:40 pm on September 30, 2019: contributor

    Perhaps GitHub Actions is worth considering too.

    While admittedly they might still lack some edge-case features, my experience with them was very positive, and I’m gradually migrating some of my projects there (from Travis).

    It’s completely free, while also quite generous with limits:

    [usage limits table] src: https://help.github.com/en/articles/workflow-syntax-for-github-actions#usage-limits

    By comparison, Travis (free) has a 60 min timeout limit, extendable by support to 90 min. The 6 hour limit in the table above is per job, and each workflow file can contain multiple parallel/serial jobs.

    Supported environments:

    [virtual environments table] src: https://help.github.com/en/articles/virtual-environments-for-github-actions#supported-virtual-environments-and-hardware-resources

  39. MarcoFalke commented at 5:14 pm on September 30, 2019: member
    GitHub Actions is lacking support for caching last time I looked, so builds take longer than on travis. I’d prefer to wait until caching is supported.
  40. MarcoFalke commented at 7:01 pm on October 2, 2019: member
    I finally got a reply from travis, saying that they cycled their NAT instances, which didn’t help, as the errors are still there.
  41. practicalswift commented at 9:01 pm on October 2, 2019: contributor

    Looks like this call for help might have helped: https://twitter.com/practicalswift/status/1178707466818863105 :)

    I still think we need PM-type help here to establish a working line of communication to Travis that can be used to escalate matters like this, where tickets go silent for weeks (!). That type of response time is really unacceptable given the amount of money we’re paying Travis. Since Steve is working on other stuff, do we have any other volunteer PMs around who want to run with this and make sure we get a proper escalation point within the Travis organisation? :)

  42. MarcoFalke commented at 10:24 pm on October 2, 2019: member
    I’d rather not stick with an org that requires a PM (or Twitter) to escalate issues for paying clients
  43. practicalswift commented at 5:15 am on October 3, 2019: contributor

    @MarcoFalke We are all frustrated with Travis. We all would like to move away from Travis. And obviously it would be preferable if we lived in an alternative universe where we didn’t have to rely on a PM or Twitter to get Travis to deliver what they’ve promised :)

    But that doesn’t change the fact that we currently rely on them and that we’ve recently paid them 8,000 USD of donor money. I think it is our moral responsibility to make sure we get the maximum value out of that donation. Given that we have a PM role, it seems like having someone in that role try to sort this out would be the prudent course of action. I fail to see why you object to that :)

  44. michaelfolkson commented at 11:53 am on October 3, 2019: contributor

    I haven’t prioritized it due to only 2 thumbs up and Marco suggesting we spend our resources in a different way. If more contributors including at least one maintainer voice support for me engaging Travis, I will be happy to do so.

    Added a thumbs up, I would encourage others to do the same. We need to transition from Travis but during that transition period we need to get by as best we can.

  45. MarcoFalke referenced this in commit 284cd3195a on Oct 3, 2019
  46. MarcoFalke commented at 7:32 pm on October 3, 2019: member

    I did get a reply from travis with ideas to explore:


    Hey Marco,

    Thank you for your reply and for sending us these links to newer instance of the issue.

    I have some suggestions for you to try. Ideally you should try them independently.

    1. Clear the content of the daemon.json config with a snippet similar to the following:

       before_install:
         - echo '{"registry-mirrors":["https://mirror.gcr.io"]}' | sudo tee /etc/docker/daemon.json
         - sudo service docker restart


    Right now this file contains the following: {"registry-mirrors": ["https://mirror.gcr.io"], "mtu": 1460} and I’m wondering if the mtu setting could be at play.

    2. Change the archive.ubuntu.com URL with the following command:

       sed -i -e 's/http:\/\/archive\.ubuntu\.com/http\:\/\/us-central1\.gce\.archive\.ubuntu\.com/' /etc/apt/sources.list


    Please ensure to call this command before calling apt-get update.

    I hope this helps and please let us know the results you will get.


  47. MarcoFalke commented at 7:33 pm on October 3, 2019: member
    Also, I made the apt-install step more verbose, to see if that is responsible for the deadlock: #17040
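
    (A hypothetical illustration of what "more verbose" could mean here; the real change is in #17040. The flags shown are assumptions, not the actual diff.)

      # Hypothetical illustration only (see #17040 for the actual change): without
      # the -qq flags, apt prints per-package progress, so a stalled mirror shows
      # up in the Travis log instead of a long silent gap that looks like a deadlock.
      apt-get update                          # instead of: apt-get update -qq
      apt-get install --yes build-essential   # instead of: apt-get install -qq --yes build-essential
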
  48. fanquake deleted a comment on Oct 23, 2019
  49. MarcoFalke commented at 9:14 pm on October 23, 2019: member
    Ok, haven’t seen an issue since making the output more verbose. Closing for now
  50. MarcoFalke closed this on Oct 23, 2019

  51. vijaydasmp referenced this in commit e2ccc9cd7f on Oct 2, 2021
  52. MarcoFalke locked this on Dec 16, 2021
