JuliaCI / julia-buildkite
Buildkite configuration files for Base Julia CI
@staticfloat If I understand correctly, the Windows autodump.jl script currently only creates the core dump files if the test job exceeds the timeout (two hours). Would it be possible for us to modify the autodump.jl script so that it creates the core dump files if either of the following occur:
I am trying to help the SuiteSparse.jl team debug https://github.com/JuliaLang/SuiteSparse.jl/issues/43, in which we are seeing nondeterministic failures in the tester_win64 Buildbot job. In that issue, what we see is that one of the Distributed worker processes (the worker process running the SuiteSparse test set) crashes. I figured that it would be helpful if I could provide the core dump files associated with the crash. However, the overall test job does not exceed the timeout, so autodump.jl does not currently create any core dump files.
Current status:
Windows - These required a lot of work because our Windows buildbot story has always been a :dumpster-fire: . Previously, we had someone build Windows VMs, used Cygwin to install a bunch of compiler versions, and then whatever was installed was just what we got. Updating the system meant potentially upgrading the compiler versions and breaking everything. It was awful. Also, the Windows VMs often get wedged and must be restarted. I have recently managed to streamline Windows VM creation via Packer, and figured out the proper KVM recipes to be able to run docker-on-windows-in-KVM on our amdci machines. This means that we can build the system toolchains as Docker images that get deployed onto the machines, providing reproducibility, and we can destroy the Docker container after each build, which should provide a lot more reliability. We will definitely need help creating the Windows Docker images and spearheading Buildkite Windows development, as I believe it’s mostly in PowerShell.
Next steps:
We need to build Windows Docker images that contain the mingw32 build tools, the Windows platform SDK headers, and whatever else is necessary to build Julia. We also need to build a set of PowerShell instructions to build and package Julia inside of that Docker container, and then integrate the whole thing into Buildkite. This will involve some PowerShell scripting and some Buildkite plugin development, I bet.
The big picture is that we’re in the middle of redesigning the CI tools that we use for Base Julia (and some associated Julia projects such as the JuliaGPU and SciML orgs, and even some non-Julia associated projects like rr!) to significantly improve reliability, reproducibility, and ease of use.
As part of this, we have a couple different layers upon which we’re doing work:
- At the lowest level, we need to set up secure ways for the buildkite runners to actually run. This means building technologies like Sandbox.jl to create fast and lightweight methods for isolating workloads from the running system. This is very important for us since we’re an open-source project, and we don’t want a malicious user to be able to open a PR that installs a nasty process on our buildbots that backdoors all Julia builds from then until it gets noticed.
- This layer involves things like writing buildkite plugins (such as the sandbox plugin for deploying tools to the agent, or the cryptic plugin for secrets management) which is basically a lot of bash scripting, or even some C Linux programming to extend the capabilities of the sandboxing executable itself.
- At the next layer, we need to adapt our Julia build recipes to run on the new system. We basically want to come to feature parity with the old buildbot system, while making the system more reproducible and reliable. Issues with the old system were things like relying on the system environment to provide compiler versions, inflexible configuration, unreproducible builds, etc….
- This layer involves writing buildkite configuration scripts, and some associated scripting to make things automatic and easy. It may also involve some buildkite plugin development, as we have a few plugins under development that would improve the workflow for Base Julia.
- At the next layer, we have hot new features that we want to build on top of the whole system that would increase productivity for all. One such feature that we’ve actually already built is rr trace upload support for failing tests; when configured correctly, Base Julia (or anyone who uses the julia-test buildkite plugin) will run its tests inside of rr, and if the test fails, it will automatically upload the rr trace so that someone can download it and reproduce the error. We want to improve on things like this (even for systems that don’t support rr, which is everything other than Linux) by building even better tools. Imagine being able to open your CI run, see that it’s failing, then click on a link to bring up a VSCode instance that is logged into a build machine with an identical environment setup and that runs the same steps, reproducing the same error.
- This layer would entail a little bit of everything; it’s more speculative than the other layers, but it’s where we’re trying to head. It also depends on the other layers being more or less finished before we can start building cool new toys.
The macOS nightlies, as linked on the home page and used by GitHub Actions like setup-julia, are quite a bit out of date:
$ wget https://julialangnightlies-s3.julialang.org/bin/mac/x64/julia-latest-mac64.dmg
# mount dmg
$ ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.439 (2022-04-26)
 _/ |\__'_|_|_|\__'_|  |  Commit 7ad0e3deae (29 days old master)
|__/                   |
This is causing some issues with packages that assume latest master, such as GPUCompiler.jl.
Looking at the logs of the upload x86_64-apple-darwin job on Buildkite, https://buildkite.com/julialang/julia-master/builds/12357#0180fb98-58d5-4c98-bc3b-9dc178cf97c9, it seems like the binaries are now being uploaded to a different endpoint: s3://julialangnightlies/bin/macos/x64/julia-latest-macos64.dmg. However, the HTTP version of that, http://julialangnightlies-s3.julialang.org/bin/macos/x64/julia-latest-macos64.dmg, is not usable. The reason is that the bucket is empty, because the upload failed:
$ s3cmd ls s3://julialangnightlies/bin/macos/x64/
DIR s3://julialangnightlies/bin/macos/x64/1.9/
2022-05-25 15:32 833 s3://julialangnightlies/bin/macos/x64/julia-latest-macos64.tar.gz.asc
> Upload primary products to S3
.buildkite/utilities/upload_julia.sh: /usr/local/bin/aws: /usr/local/Cellar/awscli/2.5.2/libexec/bin/python3.9: bad interpreter: No such file or directory
I'd argue that the exit code of the aws utility needs to be checked. That currently isn't the case, despite the set -e, because the aws invocations are backgrounded: https://github.com/JuliaCI/julia-buildkite/blob/0387123a0e8ff6d6db402c1bb96165be2c6d4e74/utilities/upload_julia.sh#L66=
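A minimal sketch of one way to address this while keeping the uploads parallel, assuming the aws calls stay backgrounded (file names and bucket paths below are illustrative, not the script's actual arguments):

```bash
#!/usr/bin/env bash
set -euo pipefail

pids=()

# Kick off the uploads in the background, remembering each PID
# (the destinations here are placeholders, not the real upload paths).
aws s3 cp "julia-example.tar.gz" "s3://example-bucket/julia-example.tar.gz" &
pids+=( "$!" )
aws s3 cp "julia-example.dmg" "s3://example-bucket/julia-example.dmg" &
pids+=( "$!" )

# `set -e` does not catch failures in backgrounded jobs, but `wait <pid>`
# returns that job's exit status, so we can fail the step explicitly.
for pid in "${pids[@]}"; do
    if ! wait "${pid}"; then
        echo "ERROR: background aws upload (pid ${pid}) failed" >&2
        exit 1
    fi
done
```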
In addition, I think it would be better if the endpoint hadn't changed. Unless there's a good reason to, this seems like a needlessly breaking change that requires updating uses of the old URL (the homepage, the setup-julia GitHub Action, etc.).
This consists of two steps:
JuliaLang/docs.julialang.org repository. This will be a signed pipeline.

We currently load the libstdc++ we ship from rpath, which is usually fine. It isn't fine when the system's libstdc++ is newer than ours and a system library needs the newer symbols, which leads to errors.
To fix that, there needs to be some way to check the system's libstdc++ and, if it's newer than ours, load it instead.
Some ideas were put forward in the ci-dev call:
- dlopen it. Check for some GLIBCXX version symbol and, if it's there, then load it. Otherwise dlclose it and load ours as usual. (Probably the best.)
- dlclose causes issues.
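The idea above is a runtime dlopen check inside the Julia loader, but as a rough illustration of the version comparison itself, here is a shell sketch that compares the highest GLIBCXX symbol version exported by each library (the paths are assumptions, and objdump is only a stand-in for the real dlopen-based check):

```bash
#!/usr/bin/env bash
set -euo pipefail

system_lib="/usr/lib/x86_64-linux-gnu/libstdc++.so.6"   # assumed system location
bundled_lib="${JULIA_DIR}/lib/julia/libstdc++.so.6"     # assumed bundled location

# Print the highest GLIBCXX_X.Y.Z symbol version a library exports.
max_glibcxx() {
    objdump -T "$1" | grep -o 'GLIBCXX_[0-9.]*' | sort -V | tail -n1
}

sys_ver="$(max_glibcxx "${system_lib}")"
our_ver="$(max_glibcxx "${bundled_lib}")"

# With `sort -V`, the newer of the two versions sorts last.
if [[ "$(printf '%s\n%s\n' "${sys_ver}" "${our_ver}" | sort -V | tail -n1)" == "${sys_ver}" ]]; then
    echo "System libstdc++ (${sys_ver}) is at least as new as ours (${our_ver}); load the system copy."
else
    echo "Our bundled libstdc++ (${our_ver}) is newer; load it as usual."
fi
```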
Building Julia with ASAN is non-trivial and takes a significant amount of time. Since we build it for CI already, it would be nice if we went one step further and packaged the build and uploaded it as an artifact. This can be helpful when debugging failures: e.g., looking at JuliaLang/julia#45608, the PkgEval logs showed some possibly memory-corruption-related segfaults, for which it would have been helpful to download the ASAN build and try a Pkg.test locally.
We need to construct a Windows packaging template. The first and largest challenge is finding a good source of GCC and friends for Windows.
This is a great step! But there seem to be a few issues still, looking at https://app.codecov.io/gh/JuliaLang/julia:
- Base source files are reported without the base/ prefix, so the coverage tools can't find them in the repo. This means that you can't view the source code files on the Codecov website.
- stdlib/-prefixed files show up under build paths such as https://codecov.io/gh/JuliaLang/julia/tree/master/cache/build/default-amdci5-7/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.9, so files outside of the JuliaLang/julia repo are getting included in the stats.
- The amend_coverage_from_src! step is being skipped, though it is hard to compare what changed currently due to previous issues.

Originally posted by @vtjnash in #136 (comment)
It would be nice if the nightlies page on julialang.org could indicate the commit hash, and perhaps the datetime, that each nightly binary build corresponds to.
I don't believe there's a way to do that currently without a client-side download, so perhaps a metadata text file could be updated when the nightlies update, or there could be an API endpoint that julialang.org could query.
Currently, when waiting for a nightly to land for stdlib CI, I don't know a smarter way than re-running the CI repeatedly.
cc @staticfloat
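One possible shape for this, sketched below with an illustrative bucket layout and file name (both are assumptions, not the current upload paths): whenever a nightly is uploaded, also upload a small metadata file that julialang.org or setup-julia can fetch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write a tiny metadata file recording which commit this nightly was built from
# and when; the JSON shape and file name are illustrative only.
cat > julia-latest-linux64.json <<EOF
{
  "commit": "$(git rev-parse HEAD)",
  "build_datetime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF

# Publish it next to the nightly tarball (placeholder bucket path).
aws s3 cp julia-latest-linux64.json \
    "s3://julialangnightlies/bin/linux/x64/julia-latest-linux64.json"
```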
With code coverage enabled and native code from the sysimage disabled, the Pkg tests take more than an hour to run. I think we can exclude the Pkg tests from the coverage job without losing too much Base code coverage.
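For illustration, skipping Pkg in a coverage sweep could look roughly like this; the actual coverage job's wiring differs, so treat the invocation as a sketch (Base.runtests and the --skip argument are part of Base's test harness, and --code-coverage=user is a real julia flag):

```bash
# Run the full test selection under coverage, but skip the Pkg test set.
./julia --code-coverage=user -e 'Base.runtests(["all", "--skip", "Pkg"])'
```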
Hey @staticfloat,
Weirdly, your personal build path has replaced user paths in debugging information of macOS binaries since at least 2019: https://discourse.julialang.org/t/developers-path-and-username-is-persisted-in-the-error-stack/30746. Any idea why this happens? @jpsamaroo mentioned this was a known issue back then but I am still seeing it today.
I sent a bug report to a different library today and weirdly saw /Users/sabae/ in the error traceback I sent them, instead of my own pathname.
Thanks!
Miles
There's no reason for us to build Julia from source yet again to run coverage; we should just download a prebuilt Julia, just like we do in the doctest runner.
https://buildkite.com/julialang/julia-master-scheduled
It has two jobs:
- USE_BINARYBUILDER=0
To allow better introspection and more isolation of the failing Sockets test. Concretely, we need to:
Note that it is possible that moving to an isolated netns will actually fix the Sockets test failure (if the issue is that another test job running on the same machine accidentally picks up the packet intended for the first test job), but it's a good thing to try regardless.
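For a sense of what the isolation mechanics could look like, here is a minimal sketch of running a test command inside a throwaway network namespace (it assumes root and the iproute2 tools; the real integration would presumably go through the sandbox plugin rather than raw ip netns calls):

```bash
#!/usr/bin/env bash
set -euo pipefail

ns="julia-test-$$"

# Create a private network namespace with only a loopback interface,
# run the Sockets tests inside it, then tear the namespace down.
sudo ip netns add "${ns}"
trap 'sudo ip netns delete "${ns}"' EXIT
sudo ip netns exec "${ns}" ip link set lo up
sudo ip netns exec "${ns}" ./julia -e 'Base.runtests(["Sockets"])'
```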
We should write a plugin that parses the output of Test and renders it nicely as an annotation.
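A rough sketch of the annotation half; buildkite-agent annotate is the real agent CLI, while the failures.md file stands in for whatever the plugin's Test-output parser would produce:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assume a previous step parsed the Test output and wrote a Markdown summary of
# failing test sets to failures.md (that parsing is the hard part, omitted here).
if [[ -s failures.md ]]; then
    # Render the failures as a red annotation at the top of the build page.
    buildkite-agent annotate --style "error" --context "test-results" < failures.md
else
    buildkite-agent annotate --style "success" --context "test-results" "All test sets passed."
fi
```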
Some of our .arches tables have a USE_RR column, so that we can distinguish between jobs that run under rr and jobs that don't. Other .arches tables (e.g. the .arches table for the i686-linux-gnu tests) have a GROUP column - for i686-linux-gnu, we separate the network tests from the non-network tests, because doing so greatly reduced the failure rate (compared to having them all in a single group).
We should just combine the USE_RR and GROUP columns into a single column.
Next steps:
We need a sandbox. Alex Arslan has supposedly been researching whether FreeBSD jails can be used to give us something like what we have on macOS. I’m not worried about this until we get the story on the three previous points solid.
Every time we have a new feature freeze, we need to remember to create the new https://buildkite.com/julialang/julia-release-1-dot-NN/ pipeline.
E.g. we recently branched 1.8, so I had to create the https://buildkite.com/julialang/julia-release-1-dot-8/ pipeline.
We need to document the steps for creating this new pipeline.
The old, one-month-old nightlies at //julialangnightlies-s3.julialang.org/bin/mac/x64/julia-latest-mac64.dmg can be successfully opened on my M1 Mac with SIP enabled after mounting the dmg. The "new" ones at http://julialangnightlies-s3.julialang.org/bin/macos/x64/julia-latest-macos64.dmg, which have been available since #122, cannot:
Control-click+open doesn't work because then the initial 'script' gets to run, but subsequent binaries or libraries are still prevented from executing.
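For anyone debugging this, the usual way to compare the two dmgs is to inspect their code signatures and Gatekeeper assessment; the commands below are standard macOS tooling, while the mounted-volume paths are illustrative:

```bash
# Verify the (deep) code signature of the app bundle inside the mounted dmg,
# and ask Gatekeeper whether it would allow it to execute.
codesign --verify --deep --verbose=2 /Volumes/Julia-*/Julia-*.app
spctl --assess --type execute --verbose /Volumes/Julia-*/Julia-*.app
```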
We should make this column more generic.
Build jobs:
Tier | Platform triplet | Job | Current Group | Desired Group | Concordance |
---|---|---|---|---|---|
1 | i686-linux-gnu | Build | Build | Build | ✅ |
1 | x86_64-apple-darwin | Build | Build | Build | ✅ |
1 | x86_64-linux-gnu | Build | Build | Build | ✅ |
1 | x86_64-w64-mingw32 | Build | [not yet migrated] | Build | ➖ |
2 | aarch64-linux-gnu | Build | Build | Build | ✅ |
2 | i686-w64-mingw32 | Build | [not yet migrated] | Build | ➖ |
2 | x86_64-unknown-freebsd | Build | [not yet migrated] | Build | ➖ |
3 | aarch64-apple-darwin | Build | Build | Build or Allow Fail | ✅ |
3 | armv7l-linux-gnueabihf | Build | [not yet migrated] | Build or Allow Fail | ➖ |
3 | powerpc64le-linux-gnu | Build | Build | Build or Allow Fail | ✅ |
3 | x86_64-linux-musl | Build | Build | Build or Allow Fail | ✅ |
Test jobs:
Tier | Platform triplet | Job | Current Group | Desired Group | Concordance |
---|---|---|---|---|---|
1 | i686-linux-gnu | Test | Allow Fail | Test | ❌ |
1 | x86_64-apple-darwin | Test | Test | Test | ✅ |
1 | x86_64-linux-gnu | Test | Test | Test | ✅ |
1 | x86_64-w64-mingw32 | Test | [not yet migrated] | Test | ➖ |
2 | aarch64-linux-gnu | Test | Allow Fail | Test or Allow Fail | ✅ |
2 | i686-w64-mingw32 | Test | [not yet migrated] | Test or Allow Fail | ➖ |
2 | x86_64-unknown-freebsd | Test | [not yet migrated] | Test or Allow Fail | ➖ |
3 | aarch64-apple-darwin | Test | Allow Fail | Test or Allow Fail | ✅ |
3 | armv7l-linux-gnueabihf | Test | [not yet migrated] | Test or Allow Fail | ➖ |
3 | powerpc64le-linux-gnu | Test | Allow Fail | Test or Allow Fail | ✅ |
3 | x86_64-linux-musl | Test | Allow Fail | Test or Allow Fail | ✅ |
This was fixed for x64 in 2b1525e by @staticfloat, and the same should probably be done for aarch64 so that we can use:
https://julialangnightlies-s3.julialang.org/bin/mac/aarch64/julia-latest-macaarch64.tar.gz
Or at least that's the path that our Buildkite plugin expects: https://github.com/JuliaCI/julia-buildkite-plugin/blob/main/hooks/pre-command#L147-L150=. We could update the plugin to match the current path instead, but seeing how that's apparently not final yet maybe it's easier to duplicate the x64 hack?
We could do something like https://perf.rust-lang.org for our regression testing. Basically, it runs tests under perf, cachegrind, dhat, etc., and stores the results in a database. This could be like PkgEval, but better suited to over-time comparisons.
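For a sense of the kind of data such a system would collect per test run (the actual infrastructure would be a proper harness plus a database, not ad-hoc commands; paths and test selection are illustrative):

```bash
# Hardware counters for one test set, written to a file for later ingestion.
perf stat -o perf-Sockets.txt ./julia -e 'Base.runtests(["Sockets"])'

# Cache-behaviour profile for the same test set via cachegrind.
valgrind --tool=cachegrind --cachegrind-out-file=cachegrind-Sockets.out \
    ./julia -e 'Base.runtests(["Sockets"])'
```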
Since we already have a doctest build, I think it would save us a nice chunk of time if we could skip the docs build that occurs during make install (which happens when we do make binary-dist). Since I think we want the docs in the uploaded tarballs, we'll want to build these when running on master (or release-*), but on PR builds we can shave off something like 7% (~1 minute) of our build time by skipping this, and this happens for every build configuration.
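A minimal sketch of how the branch gating could look in the build step, assuming a way to disable the docs build existed (BUILDKITE_BRANCH is a standard Buildkite environment variable, but the JULIA_BUILD_DOCS make variable below is purely hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Only build the docs into the binary-dist on master and release branches;
# skip them on PR builds to save roughly a minute per configuration.
if [[ "${BUILDKITE_BRANCH}" == "master" || "${BUILDKITE_BRANCH}" == release-* ]]; then
    make binary-dist
else
    make binary-dist JULIA_BUILD_DOCS=0   # hypothetical flag; Julia's Makefile would need to grow it
fi
```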
One of the most frequent mistakes that I make is forgetting to update the signatures. Obviously we'll catch that when we run the test job on our test repo. But I would much rather catch that mistake sooner, before I merge the PR here.
What I'm envisioning is a very quick GitHub Action on this repo that simply takes in the repo public key and verifies all of the signatures against the public key.
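As a very rough sketch of the check itself, heavily hedged because the actual cryptic signature layout may differ (the file naming, key name, and digest below are all assumptions):

```bash
#!/usr/bin/env bash
set -euo pipefail
shopt -s globstar nullglob

status=0
# Assume each `foo.signature` signs the adjacent `foo` file with the repository
# key, using an openssl SHA-256 digest signature; adapt to the real layout.
for sig in **/*.signature; do
    data="${sig%.signature}"
    if ! openssl dgst -sha256 -verify repo_public_key.pem -signature "${sig}" "${data}"; then
        echo "Invalid signature: ${sig}" >&2
        status=1
    fi
done
exit "${status}"
```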
I propose changing the buildbot username (or just the homedir, in particular) on Windows to "정의영", to make things harder
In #103, we added basic documentation on how a maintainer can re-generate all of the signatures in this repository if they have the agent private key file.
But we should also have the documentation for the (more common) use case where you only have the repository private key file, and you want to re-generate all signatures.
Originally, the once-daily scheduled jobs went into their own pipeline. In other words, we had:
- julia-master: the main pipeline. Contained only the non-scheduled jobs.
- julia-master->scheduled: the scheduled pipeline. Contained only the scheduled jobs.

Recently (in #102), I moved all the scheduled jobs into the main pipeline and deleted the scheduled pipeline. So, currently we have:
- julia-master: the main pipeline. Contains all jobs (both non-scheduled and scheduled).

At the time, I thought this was a good idea. It turns out that it is a bad idea. The non-scheduled builds and scheduled builds are trampling all over each other's GitHub commit statuses. If at 1AM a regular (non-scheduled) build is triggered on a push to master, and all three of our groups/GitHub commit statuses (Build, Check, Test) are green on that commit, and then at 5AM a scheduled build is triggered on that exact same commit, the scheduled build will erase those three green commit statuses.
Also, I find it quite confusing that when I am scrolling through https://buildkite.com/julialang/julia-master/builds?branch=master I see a mix of non-scheduled and scheduled builds, which makes it difficult when I am e.g. trying to figure out when a breakage first started occurring on master.
So it turns out that it will be better to have two separate pipelines: the main julia-master pipeline and the scheduled julia-master->scheduled pipeline. As I wrote in #102, yes, it is annoying to maintain two separate pipelines, but as I have now learned, it is even more annoying (and confusing, inconvenient, etc.) to have them combined into a single pipeline. So we will go with the lesser of two evils.
Right now, we expect all of our build jobs to pass (even if we allow some of our test jobs to fail), and thus we expect all of our upload jobs to pass.
This is fine for Tier 1 and Tier 2, but eventually, for Tier 3, we actually want to allow a select few of our build jobs to fail. So in those cases, if the build job passes, obviously we want to upload the binaries, but if the build job fails, we want the corresponding upload job to be allowed to fail.
This is what our Buildkite groups currently look like:
I'm not thrilled about the fact that the Linux group is red. It's red (with a warning sign) because one of the jobs in the Linux group failed but has soft-fail: true. Specifically, the asan job failed but has soft-fail: true.
Now, this particular example is not a good example, because on Julia master we can actually take off the soft-fail for asan, since asan is passing on Julia master. But, in general, we will always have jobs that we want to run but allow to fail. And I don't think those should "pollute" our green groups.
I would like us to put "allowed to fail" jobs into their own groups, and explicitly put the text allowed to fail in the name of the group, so that users can easily tell that it's okay for those groups to be red. We have two choices:
Here is what option 1 would look like:
Here is what option 2 would look like:
Our CI set-up here has its own utilities for working with rr traces: https://github.com/JuliaCI/julia-buildkite/tree/main/utilities/rr
I wonder if it wouldn't be better to re-use BugReporting.jl, and just launch Julia under --bug-report=rr. Much of the needed functionality (packing and uploading traces, handling time-outs, doing replay...) is already part of BugReporting.jl, and there are some useful features that aren't available here (packaging additional metadata, downloading and importing Julia sources on replay, ...).
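For illustration, launching the test suite that way might look roughly like the following; --bug-report=rr is the flag BugReporting.jl hooks into, while the specific test selection is just an example:

```bash
# Record a test set under rr via BugReporting.jl; the resulting trace can then
# be packed/uploaded/replayed with BugReporting.jl's own tooling.
./julia --bug-report=rr -e 'Base.runtests(["Sockets"])'
```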
Current status:
Linux - Sandboxing works well, and we have build agents running and building Julia. To get reproducibility, we build rootfs images that get mounted via the sandbox buildkite plugin which is great, but has one big downside: because we want relatively recent Linux distros for tools like cmake, python, etc… we end up getting a very recent version of glibc as well. This means that the binaries we build here are not as portable as we’d like. Ideally, we’d use the same kind of compiler toolchain that we use in BinaryBuilder, so that we can build against an ancient version of glibc (to be maximally compatible), but building that toolchain is non-trivial.
Next steps:
We need to build GCC, glibc, binutils, linux kernel headers, etc… as relocatable tarballs that can be unpacked somewhere and used to build things. Yes, this basically means creating a GCC_jll. I’ve got a good head start on this, and I wouldn’t wish this task on my worst enemy, so if I get no volunteers for this, I won’t be surprised. This will involve a lot of build system debugging and patience. This will unblock using buildkite to build the actual julia binaries.
From the CI call: we could try to catch things like the ProcessExitedException(10) and rerun that specific test on a new worker.
Another idea would be to reduce the max RSS to something smaller so the worker restarts more frequently.
Whenever we make a release, we attach the tarballs created via make full-source-dist light-source-dist USE_BINARYBUILDER=0 as artifacts for the GitHub release corresponding to the tag. These used to be built as part of the doctest job on buildbot, but that builder no longer exists. It'd be nice to have this on buildkite for releases so it doesn't have to be done manually.
We currently have some very long conditionals, and we should break them up into multiple lines. See the section on "multi-line conditionals" here: https://buildkite.com/docs/pipelines/conditionals#conditionals-in-steps
In the short-term, we will have the same Buildkite configuration files for master, release-1.6, release-1.7, etc.
However, eventually, the Buildkite configurations will diverge for release-1.6 versus the latest stable release-1.* versus master. At that point, the buildkite-* branches will diverge. However, it would still be nice to have a quick and easy way of backporting PRs to older buildkite-* branches in this repository.
One option would be to use Kristoffer's backport script. However, if I understand correctly, that script has to be run locally. I would instead prefer to have some kind of GitHub Action that lives in this repository and automatically performs backports when you add a label to a PR.
Does anyone know of a good "backporter" GitHub Action that already exists that we could use for this purpose? I would probably prefer to use a pre-existing GitHub Action, instead of having to write our own.
Suggested by @timholy on Slack:
Should we run tests on a debug build once per day? Or once per week? I'm looking at JuliaLang/julia#46064.
Current status and next steps:
Sandboxing works (barely tested) and we have build agents running, but not building Julia. The sandboxing works via “seatbelt”, the built-in app sandboxing that macOS provides, and it basically marks certain parts of the system as read-only. There was a PR to build Julia using these workers, but it hasn’t been merged yet. For reproducibility, we’d like to develop a way to stuff a macOS toolchain into a .tar.gz file and distribute it as an artifact, then deploy that onto the macOS workers on-demand, JLL style. That will require some playing around.
Rather than building on Windows with windows-hosted toolchains, let's see how difficult it is for us to build on Linux with cross-toolchains and then bootstrap with WINE.
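As a sketch of what that experiment might look like (XC_HOST is Julia's documented cross-compilation make variable; the WINE invocation and output path are illustrative):

```bash
# Cross-compile Julia for 64-bit Windows from a Linux host with the
# MinGW-w64 cross-toolchain installed, then smoke-test the result under WINE.
make -j"$(nproc)" XC_HOST=x86_64-w64-mingw32
wine ./usr/bin/julia.exe -e 'println("hello from WINE")'
```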
We need to set up buildkite CI on the Julia-buildkite external repository so that we can start experimenting with different CI setups, then once we’re happy, we’ll point base Julia to that repository. This will involve some system administration (setting the agents up on the build bots themselves) and some debugging of the buildkite webUI and whatnot.
sign_treehashes should only replace signatures that are invalid.
Originally posted by @staticfloat in #212 (comment)
Logs for the non-assert build: https://buildkite.com/julialang/julia-master/builds/6069
Logs for the assert build: https://buildkite.com/julialang/julia-master/builds/6071
The non-assert build was built with: make -j 8
The assert build was built with: make -j 8 FORCE_ASSERTIONS=1 LLVM_ASSERTS=1
Job | Non-assert time | Assert time | Ratio (assert / non-assert) |
---|---|---|---|
tester_linux32_g1 | 41m 40s | 23m 12s | 0.56 |
tester_linux32_g2 | 6m 43s | 7m 4s | 1.05 |
tester_linux64_g1_mt | 19m 26s | 29m 1s | 1.49 |
tester_linux64_g1_rrst | 31m 57s | 35m 37s | 1.11 |
tester_linux64_g1_st | 18m 49s | 23m 38s | 1.26 |
tester_linux64_g2_mt | 6m 29s | 6m 3s | 0.93 |
tester_linux64_g2_rrst | 10m 59s | 11m 8s | 1.01 |
tester_linux64_g2_st | 6m 4s | 19m 33s | 3.22 |
tester_linux64_g3_st | 6m 56s | 7m 17s | 1.05 |
tester_musl64_g2 | 6m 7s | 6m 26s | 1.05 |
Obviously, the sample size here is really small, so I'm not sure how useful these data are.
Right now, we run the Pkg tests as part of the 64-bit testers. But we skip the Pkg tests when testing 32-bit platforms, because we think the Pkg tests are causing us to go out-of-memory.
Eventually, we should start running the Pkg tests on 32-bit platforms, in addition to 64-bit platforms.
Before we do so, we should figure out how to divide the Pkg tests into multiple smaller test groups.
We need to skip the uploaddocs job if all of the following conditions are false: