oxidecomputer / cio

Rust libraries for APIs needed by our automated CIO.
License: Apache License 2.0
Add documentation diagrams detailing what happens when the RFD commit and RFD pull request webhook responders run.
If a user in `configs` has been invited to the GitHub org in the past, but did not accept their invite and instead let it expire, subsequent attempts to add the user to the organization fail with 422 errors. This is seen in src/providers.rs:242. Currently this causes the sync function to return early with a failure.
This should likely be changed, as we are currently left with a partially provisioned user. Either the remaining provisioning steps could be run and GitHub access left in its current state (or manually triaged), or we could cancel the invite and re-issue it upon checking that it has expired.
A PR for RFD 273 was never created. Check logs around 6:20-6:30pm CDT. (Maybe related to #130)
When `cio` is shut down, any long-running sagas are marked as cancelled. This means that any deployment will:
This issue may be helped by work on breaking down some of the long-running sagas, but ideally there would be a way to version these jobs such that post-deployment they can either be resumed (if they are still valid) or cancelled (if they are no longer compatible).
This worked properly until maybe 3 weeks ago. Since then, the host boot huddle never gets canceled even though the meetings are in the calendar and properly picked up by Airtable. I haven't had a chance to debug this, and probably won't for a long time, but it might be a good idea to look through changes from a few weeks ago.
Visible on https://github.com/oxidecomputer/rfd/commit/ba972be9a440d2af1329aa61e010bc48a6ef4275
PDF generation (`convert_and_upload_pdf`) failed twice when trying to find the `mmdc` binary. This was running during a webhook handler, specifically for RFD 0276.
Resources synced from `configs` are currently assumed to be `CONFERENCE_ROOM` resources. This should be generalized to support `OTHER` and `CATEGORY_UNKNOWN` so that we can better represent https://github.com/oxidecomputer/configs/issues/90
- Extend `ResourceConfig` to support a category field. Default this field to `CONFERENCE_ROOM`.
- Change `ResourceConfig` to use a generic `Resource` name instead of `ConferenceRoom`. This should additionally create migrations for a `resources` table.
- Migrate the `conference_rooms` table to the `resources` table.
- Update the `AIRTABLE_CONFERENCE_ROOMS_TABLE` constant.
- Update the `AIRTABLE_CONFERENCE_ROOMS_TABLE` value.

This is fundamentally an issue with our Google Drive client (and other clients), but tracking it here as it has direct impact.
When calling `download_by_id`, the Drive client calls `request_raw`. `request_raw` returns the response body directly to the caller without any interference, leaving it to the caller to perform any necessary error handling. `download_by_id` does not perform any checking of the response status or headers and instead immediately translates the response body into a `bytes::Bytes`. At this point, any data about the status of the request has been lost and a caller only has the raw bytes of the response to make determinations on.
In the case of `cio`, the bytes returned are treated as a String, on the assumption that if the client had failed to download the file an error would have been returned. The visible result of this is the server error message being written to the database as if it were a successful download. Specifically, we have seen this when failing to download a chat log file, which resulted in the error body for a `401` error overwriting the stored chat log data.
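A minimal sketch of the kind of check `download_by_id` could perform before handing bytes back (the function and its shape here are illustrative, not the actual client API): verify the status before treating the body as file contents.

```rust
// Sketch: treat the body as file data only on a success status.
// `status` and `body` stand in for fields of the raw response; the real
// client types will differ.
fn body_to_file_bytes(status: u16, body: Vec<u8>) -> Result<Vec<u8>, String> {
    if (200..300).contains(&status) {
        Ok(body)
    } else {
        // Surface the error body instead of storing it as file contents.
        Err(format!(
            "download failed with status {}: {}",
            status,
            String::from_utf8_lossy(&body)
        ))
    }
}
```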
To resolve this we will need to address a couple of sub-issues (currently `unwrap_or_default` is used, which will overwrite with an empty string on failure).

Each time the GoogleDrive client is constructed it reads the `GOOGLE_KEY_ENCODED` env variable, creates a `/tmp/google_key.json` file (or truncates it if it exists), and then writes the decoded key value to the file. There is a race condition where:
This can be seen in the applicants refresh job where multiple applicants are processed at the same time and the job fails during GoogleDrive authentication.
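One way to avoid the shared `/tmp/google_key.json` race is to write the decoded key to a path unique to the caller. A sketch (the path scheme is an assumption, not the current implementation):

```rust
use std::fs;
use std::io::Write;
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: write the decoded key to a per-process, per-construction path so
// concurrent client constructions do not truncate each other's key file.
static COUNTER: AtomicU64 = AtomicU64::new(0);

fn write_key_file(decoded_key: &[u8]) -> std::io::Result<PathBuf> {
    let n = COUNTER.fetch_add(1, Ordering::SeqCst);
    let path = std::env::temp_dir()
        .join(format!("google_key_{}_{}.json", std::process::id(), n));
    let mut f = fs::File::create(&path)?;
    f.write_all(decoded_key)?;
    Ok(path)
}
```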
Currently third-party API clients only authenticate upon construction, generating new access tokens if the current tokens are expired. To be able to use a client in a long-running task, or to share a long-lived instance of a client, each task needs to check every call it makes for authentication errors and potentially re-authenticate and re-issue the request.
Instead, the clients should capture the necessary data for re-authenticating upon construction and internally handle re-authentication (likely behind an option) when an authentication error occurs.
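The internal retry could take roughly this shape (a sketch; the error type and re-auth hook are hypothetical stand-ins for whatever the real clients use):

```rust
// Sketch: run a request, and on an auth error re-authenticate once and retry.
#[derive(Debug, PartialEq)]
enum ApiError {
    Auth,
    Other(String),
}

fn with_reauth<T>(
    mut reauth: impl FnMut(),
    mut call: impl FnMut() -> Result<T, ApiError>,
) -> Result<T, ApiError> {
    match call() {
        Err(ApiError::Auth) => {
            // Refresh credentials, then re-issue the request once.
            reauth();
            call()
        }
        other => other,
    }
}
```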
RFD discussion links are generated and stored during a PR "open" action webhook. Once generated, any commit that removes the discussion link from the main `.adoc` file will cause the discussion URL to be interpreted as a blank value and erase the old link. Because generation only happens during the "open" action, discussion links can never self-correct.
This was seen in https://github.com/oxidecomputer/rfd/pull/434 where the discussion link was written during https://github.com/oxidecomputer/rfd/runs/6712689245, but subsequently deleted during https://github.com/oxidecomputer/rfd/compare/2d4470fa6c57502e71537007503058e364d03a7d..c8d5d8b3559f2eace92c918074db5afc8ed53438
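One possible guard (a sketch, not the current handler logic): only overwrite the stored discussion link when the parsed document actually provides a non-empty one.

```rust
// Sketch: keep the previously stored discussion link unless the parsed
// document provides a non-empty replacement.
fn next_discussion_url(stored: Option<String>, parsed: &str) -> Option<String> {
    if parsed.trim().is_empty() {
        // A blank value in the .adoc should not erase the old link.
        stored
    } else {
        Some(parsed.trim().to_string())
    }
}
```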
Huddle reminders were sent with encoded apostrophes and quotes in the notes section.
Notes: I'd like to ... highlight "what ... making", ... week'
See: Control Plane topics 5/24
I've tried to figure out why this is happening and cannot. My GSuite calendar has recurring host boot huddle entries far into the future; they seem like they should match the fuzzy string from configs. However, the airtable calendar (https://airtable.com/tblb9ACUfEopO5Rx2/viwbDtmYCWzUF8EEO?blocks=bipLfQLiV0dJ5dXRC) has past meetings, then a meeting that I canceled via GSuite, and then the next (and only) meeting listed is July 20. From looking at the code, I would expect to see 13 weeks worth of future meetings automatically populated. Not surprisingly, this also makes it impossible to submit an agenda item for any other meeting date. This has been the same for over a week, so the regular jobs to update huddles aren't fixing it. As none of the diagnostic printfs from the crate end up in the job output and the cargo test used to do this population requires things I don't have to run (right?), I'm not sure how to debug this.
Currently a build and deploy takes roughly one hour. Determine a best path forward to reduce this. This may be in the shape of reducing complexity or in breaking apart the deployable pieces, but it needs investigation.
Focusing initially on likely RFD and HR functionality.
see #74
`sync-shorturls` (and by extension other functions like `sync-other`) has started running into Cloudflare rate limiting:
More than 1200 requests per 300 seconds reached. Please wait and consider throttling your request speed
Currently every record performs a list request, and as such can easily exhaust the limit. A few options:
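One option is to issue the list request once per run and resolve each record against a local index, instead of one list request per record. A sketch (the record shape and names are illustrative):

```rust
use std::collections::HashMap;

// Sketch: build a local index from a single list call, then look records up
// in memory rather than hitting the API once per record. With a ~1200
// requests / 300 seconds limit, one list call per run stays well inside it.
fn index_by_name(records: Vec<(String, String)>) -> HashMap<String, String> {
    records.into_iter().collect()
}
```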
During `refresh_new_applicants_and_reviews`, applicants are processed 3 at a time. If any applicant fails (i.e. arbitrary external service failures), the error is returned early and short-circuits processing. This leaves the remaining applicants unprocessed.
Ideally each applicant can be processed and succeed or fail independently.
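The desired shape, sketched with plain `Result`s (the applicant type is a stand-in): collect per-applicant outcomes instead of short-circuiting with `?`.

```rust
// Sketch: process every applicant, recording failures instead of returning
// early, so one bad external call cannot block the rest of the batch.
fn process_all<T, E>(
    applicants: Vec<T>,
    mut process: impl FnMut(&T) -> Result<(), E>,
) -> Vec<(T, Result<(), E>)> {
    applicants
        .into_iter()
        .map(|a| {
            let r = process(&a);
            (a, r)
        })
        .collect()
}
```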
This occurs a few times per day in webhooky. Looks to be correlated to M-F work (as expected). Needs investigation.
2022-05-05 13:50:35.960 CDT thread 'tokio-runtime-worker' has overflowed its stack
2022-05-05 13:50:35.960 CDT fatal runtime error: stack overflow
2022-05-05 13:50:35.961 CDT Uncaught signal: 6, pid=1, tid=3, fault_addr=0.
Quickbooks sync is failing on `let bill_payments = qb.list_bill_payments().await?;` with a `400 Bad Request`.
Existing and new rack line signups should be pushed to Zoho as Leads. They should be written once, and then marked as complete internally. They should not be synchronized.
Fetching meeting chat logs are failing with permission errors. This results in cio writing the error message in place of the chat log.
Hi! Thanks for the great crates pack :)
I'm looking for the library to abstract away interaction with Zoom API, and your crate looks like what I need!
There is one detail though: my Zoom App is OAuth 2.0 App, and it looks like currently your lib supports only JWT-apps.
Is it possible to consider adding some option to allow OAuth 2.0 as well?
It looks like to do so we just need a possibility to provide OAuth access token during the creation of Zoom
struct, instead of jwt key and secret.
I'd be glad to be of any help if you consider this feature worthy!
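A possible shape for the request above (a sketch of the suggestion, not the crate's actual API): let `Zoom` construction accept either credential kind.

```rust
// Sketch: allow either JWT credentials or a pre-obtained OAuth 2.0 access
// token when constructing the client. All names here are illustrative.
enum ZoomAuth {
    Jwt { key: String, secret: String },
    OAuth { access_token: String },
}

impl ZoomAuth {
    // Produce the bearer token for a request's Authorization header.
    fn bearer_token(&self) -> String {
        match self {
            // Placeholder: a real implementation would sign a JWT from
            // key/secret here.
            ZoomAuth::Jwt { key, .. } => format!("jwt-for-{}", key),
            ZoomAuth::OAuth { access_token } => access_token.clone(),
        }
    }
}
```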
This issue is a tracking issue for a rework of the job runner model that cio-bot uses.
CIO does a good job of providing a readable (in code) encapsulation of the business rules and processes that run many day-to-day operations that are otherwise often spread across multiple departments and people. Currently, though, it struggles to report back what work it has done and why. Parts of the execution (specifically cron jobs) collect logs, but they are unstructured and lose their level designation when globally integrated. Webhook handlers track separate data to global log storage, GitHub commit comments, and GitHub Check Runs. Additionally, all handlers send some portion of logs (warnings and errors) to Sentry.
As such in the current state, asking "What did handler X do?" or "Why did handler Y do Z?" requires tracing the cio codebase to understand how that handler reports its data. While there are some patterns that can be learned as a starting point, we can make this significantly more effective.
To summarize, we are specifically interested in improving "how" cio does the work it does, as opposed to "what" work cio does.
The key goals that we want to hit during this implementation are:
Research if the keys API is sufficient for automating the provisioning of new Tailscale auth keys when the old ones expire. https://github.com/tailscale/tailscale/blob/main/api.md#keys
If this is not possible, then determine an alerting mechanism for when keys are within a week of expiring.
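If automated provisioning isn't possible, the alert condition itself is simple to evaluate. A sketch using plain epoch seconds (how expiry timestamps are fetched from the keys API is out of scope here):

```rust
// Sketch: flag a Tailscale auth key whose expiry falls within one week of
// "now", so an alert can be raised before the key lapses.
const WEEK_SECS: u64 = 7 * 24 * 60 * 60;

fn expires_within_week(expiry_epoch_secs: u64, now_epoch_secs: u64) -> bool {
    expiry_epoch_secs <= now_epoch_secs + WEEK_SECS
}
```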
Deserialization of Airtable users during `sync-configs` started failing on 6/2:
**********@oxidecomputer.com` failed: error decoding response body: premature end of input at line 1 column 156, pids: [308], saga_id: a3a6e586-1446-4625-8bb4-f89f1d44bf1d, cmd: sync-configs
RFD discussion links are only auto-committed during the webhook handler that responds to the opening of a PR for a given RFD. If the discussion URL is then removed in a future commit, cio will not restore the URL to the document. Additionally the URL will be removed from the cio database, causing the RFD site to fail to display a link to the relevant PR. `handle_rfd_push` has an audit for testing and an initial fix that will update discussion URLs in response to commit push webhooks.
Two tasks are needed to address this:
`sync-recorded-meetings` is scheduled to run every two hours, but takes roughly 2h10m-2h30m to run.
We are sending far too many entries to Sentry and quickly exceeding limits. Trace sampling needs to be reduced (down to 10-20%), and custom sampling should be implemented for db query traces (down to 0.1%).
#185 fixed the issue with repository events failing to be handled. Syncing settings, though, failed with a repo that was recently added to the org (06/14 14:25 CDT). The handler failed without an error message at some point prior to assigning teams. This particular repo did not have any commits pushed, and it is possible that sync settings needs to be changed to account for this case.
https://remote.com/blog/introducing-remote-global-employee-api versus Gusto: this should use a trait that both the Gusto and Remote.com clients implement, since people are managed by one or the other. Something like an `HRService` trait, or something named more appropriately.
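The suggested trait might be sketched as follows (the method set is hypothetical; the point is that callers hold one abstraction regardless of provider):

```rust
// Sketch: a common trait that both the Gusto and Remote.com providers would
// implement, since each person is managed by exactly one of them.
trait HrService {
    fn provider_name(&self) -> &'static str;
}

struct Gusto;
struct Remote;

impl HrService for Gusto {
    fn provider_name(&self) -> &'static str { "gusto" }
}

impl HrService for Remote {
    fn provider_name(&self) -> &'static str { "remote.com" }
}

// Callers hold a trait object and don't care which provider backs a person.
fn describe(svc: &dyn HrService) -> String {
    format!("managed by {}", svc.provider_name())
}
```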
when not causing tokio stack overflows on endpoints
Since deploying the Google auth and CloudFlare rate limit fixes (f433244) there have been two instances (so far) of hitting the memory limit in the cron container. Overall behavior is not dramatically different than previous runs, except for spikes now reaching the 80-90% usage level instead of 65%.
If any meeting attendee in the list of users sent to Airtable can not be found, then nothing is written to attendees. Currently this is visible for recent Hardware Huddles, but is likely a general issue when trying to write users to Airtable. Need to determine which of these is true:
GitHub supports creating a set of default labels that will be pre-configured for all new repos in the organization. These labels can be edited or deleted though, and creating default labels does not apply to existing repos.
Add the ability to:
Hello, I'm attempting to use your library and get the following error when trying the Google Drive example:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: APIError: status code -> 400 Bad Request, body -> {
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Invalid Value",
"locationType": "parameter",
"location": "q"
}
],
"code": 400,
"message": "Invalid Value"
}
}
The google sheets example works fine -- any tips on how to debug this?
Add database and Airtable tables for tracking orders. This will need to store:
To repro:
$ rustup update
$ cargo build
Output:
error: could not compile `slack-chat-api` due to previous error
warning: build failed, waiting for other jobs to finish...
error: future cannot be sent between threads safely
--> slack/src/lib.rs:425:5
|
425 | #[async_recursion]
| ^^^^^^^^^^^^^^^^^^ future created by async block is not `Send`
|
= help: the trait `Sync` is not implemented for `core::fmt::Opaque`
note: future is not `Send` as this value is used across an await
--> slack/src/lib.rs:433:66
|
433 | bail!("status code: {}, body: {}", s, resp.text().await?);
| -------------------------------------------------^^^^^^--
| | |
| | await occurs here, with `$crate::__export::format_args!($($arg)*)` maybe used later
| has type `ArgumentV1<'_>` which is not `Send`
| `$crate::__export::format_args!($($arg)*)` is later dropped here
= note: required for the cast to the object type `dyn std::future::Future<Output = Result<FormattedMessageResponse, anyhow::Error>> + Send`
= note: this error originates in the attribute macro `async_recursion` (in Nightly builds, run with -Z macro-backtrace for more info)
Placeholder. Need details on location of RFD bot.
>>> !rfd 273
bot >>> RFD 158 I²C Multiplexing (discussion) [github](https://158.rfd.oxide.computer/) [rendered](https://rfd.shared.oxide.computer/rfd/0158) [discussion](https://github.com/oxidecomputer/rfd/pull/229)
Syncing users in response to a change in `users.toml` exceeds the time limit allowed for a request. This causes the request to fail mid-processing of the users map. There are a number of optimizations that can be made, but we need a fairly large improvement to get under the required time.
This may end up needing to be addressed by changing to an async job system. If we do want to defer this, then we will need to increase the frequency at which the sync runs.
`docker-image-webhooky` frequently runs out of disk space. `docker-image-cio` occasionally runs out of space. Needs investigation into what is consuming space and whether it is necessary.
WARNING: local cache import at /tmp/.buildx-cache not found due to err: could not read /tmp/.buildx-cache/index.json: open /tmp/.buildx-cache/index.json: no such file or directory
error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown
Error: buildx failed with: error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown
It's this line triggering it: https://github.com/oxidecomputer/cio/blob/master/cio/src/rfds.rs#L693
Something to do with the third-party clients being updated to use https://docs.rs/reqwest-middleware/latest/reqwest_middleware/ for tracing and retrying requests.
`cloudflare.rs:288` incorrectly indexes into an empty array on a cache miss. It should instead return an `Option` for the found record.
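The fix is to return the first matching record as an `Option` rather than indexing. A sketch (record type simplified to `String`):

```rust
// Sketch: `records[0]` panics on a cache miss (empty slice); `.first()`
// returns None instead, letting the caller handle the miss.
fn find_record(records: &[String]) -> Option<&String> {
    records.first()
}
```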
A number of these parse issues have been resolved, but infrequently requests come in with payloads that do not match the struct definitions in `webhooky/src/github_types.rs`.
For reference, parsing is currently performed via oxidecomputer/dropshot (handler argument).
Recent example:
May 06 08:00:23.194 INFO request completed, error_message_external: unable to parse body: invalid type: null, expected a string at line 1 column 2714, error_message_internal: unable to parse body: invalid type: null, expected a string at line 1 column 2714, response_code: 400, uri: /github, method: POST, req_id: cb4a404c-d8ff-4be1-b8f4-7cb09ab98e09
Writing records to the Functions table in Airtable frequently generates duplicate records with the same `saga_id`. Determine whether there are also duplicated internal records or if this is only visible in Airtable.
Currently webhook receipt and handling history are not recorded outside of temporary log messages. We want to be able to know historically:
In `cio`, specific RFDs have started failing to generate due to a failure to launch puppeteer:
Jun 30 11:53:44.356 INFO /usr/local/lib/node_modules/puppeteer/.local-chromium/linux-1002410/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory, pids: [3135], saga_id: 3b86eb1e-9e33-4274-9f26-99e055e49135, cmd: sync-rfds
In the most recent sync, `0256`, `0273`, and `0276` failed.
With regard to applicants.rs, I am curious whether specialized formatting is preserved during read_pdf()'s extraction. Would a preview page prior to submitting be worthwhile?
The company sync job currently runs every 12 hours, but the company record contains critical data that we want to have update in near real time. Implement a webhook listener for these changes so that the internal Postgres db can be updated faster.
System accounts should be able to be (optionally) provisioned in Ramp. This is needed to be able to authenticate as a system account during automation. If possible this account should not have permissions to create new cards.
Likely due to the fact that production uses a unix socket for postgres comms, since it works locally over TCP.