oxidecomputer / cio

Rust libraries for APIs needed by our automated CIO.
License: Apache License 2.0
Add documentation diagrams detailing what happens when the RFD commit and RFD pull request webhook responders run.
If a user in `configs` has been invited to the GitHub org in the past, but did not accept their invite and instead let it expire, subsequent attempts to add the user to the organization fail with 422 errors. This is seen in src/providers.rs:242. Currently this causes the sync function to return early with a failure.
This should likely be changed, as we are currently left with a partially provisioned user. Either the remaining provisioning steps could be run and GitHub access left in its current state (or manually triaged), or we could cancel the invite and re-issue it upon checking that it has expired.
A PR for RFD 273 was never created. Check logs around 6:20-6:30pm CDT. (Maybe related to #130)
When `cio` is shut down, any long-running sagas are marked as cancelled. This means that any deployment will:
This issue may be helped by work on breaking down some of the long-running sagas, but ideally there would be a way to version these jobs such that post-deployment they can either be resumed (if they are still valid) or cancelled (if they are no longer compatible).
This worked properly until maybe 3 weeks ago. Since then, the host boot huddle never gets canceled even though the meetings are in the calendar and properly picked up by Airtable. I haven't had a chance to debug this, and probably won't for a long time, but it might be a good idea to look through changes from a few weeks ago.
Visible on https://github.com/oxidecomputer/rfd/commit/ba972be9a440d2af1329aa61e010bc48a6ef4275
PDF generation (`convert_and_upload_pdf`) failed twice when trying to find the `mmdc` binary. This was running during a webhook handler, specifically for RFD 0276.
Resources synced from `configs` are currently assumed to be `CONFERENCE_ROOM` resources. This should be generalized to support `OTHER` and `CATEGORY_UNKNOWN` so that we can better represent https://github.com/oxidecomputer/configs/issues/90
- Extend `ResourceConfig` to support a category field. Default this field to `CONFERENCE_ROOM`.
- Change `ResourceConfig` to use a generic `Resource` name instead of `ConferenceRoom`. This should additionally create migrations for a `resources` table.
- Migrate the `conference_rooms` table to the `resources` table.
- Update the `AIRTABLE_CONFERENCE_ROOMS_TABLE` constant.
- Update the `AIRTABLE_CONFERENCE_ROOMS_TABLE` value.

This is fundamentally an issue with our Google Drive client (and other clients), but tracking it here as it has direct impact.
When calling `download_by_id`, the Drive client calls `request_raw`. `request_raw` returns the response body directly to the caller without any interference, leaving it to the caller to perform any necessary error handling. `download_by_id` does not perform any checking of the response status or headers and instead immediately translates the response body into a `bytes::Bytes`. At this point, any data about the status of the request has been lost and a caller only has the raw bytes of the response to make determinations on.
In the case of `cio`, the bytes returned are treated as a String, on the assumption that if the client had failed to download the file an error would have been returned. The visible result of this is the server error message being written to the database as if it were a successful download. Specifically, we have seen this when failing to download a chat log file, which resulted in the error body for a `401` error overwriting the stored chat log data.
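A minimal sketch of the kind of check `download_by_id` could perform before handing bytes back (the function and its shape here are illustrative, not the actual client API): verify the status before treating the body as file contents.

```rust
// Sketch: treat the body as file data only on a success status.
// `status` and `body` stand in for fields of the raw response; the real
// client types will differ.
fn body_to_file_bytes(status: u16, body: Vec<u8>) -> Result<Vec<u8>, String> {
    if (200..300).contains(&status) {
        Ok(body)
    } else {
        // Surface the error body instead of storing it as file contents.
        Err(format!(
            "download failed with status {}: {}",
            status,
            String::from_utf8_lossy(&body)
        ))
    }
}
```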
To resolve this we will need to address a couple of sub-issues (currently `unwrap_or_default` is used, which will overwrite with an empty string on failure).

Each time the GoogleDrive client is constructed it reads the `GOOGLE_KEY_ENCODED` env variable, creates a `/tmp/google_key.json` file (or truncates it if it exists), and then writes the decoded key value to the file. There is a race condition where:
This can be seen in the applicants refresh job where multiple applicants are processed at the same time and the job fails during GoogleDrive authentication.
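One way to avoid the shared `/tmp/google_key.json` race is to write the decoded key to a path unique to the caller. A sketch (the path scheme is an assumption, not the current implementation):

```rust
use std::fs;
use std::io::Write;
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: write the decoded key to a per-process, per-construction path so
// concurrent client constructions do not truncate each other's key file.
static COUNTER: AtomicU64 = AtomicU64::new(0);

fn write_key_file(decoded_key: &[u8]) -> std::io::Result<PathBuf> {
    let n = COUNTER.fetch_add(1, Ordering::SeqCst);
    let path = std::env::temp_dir()
        .join(format!("google_key_{}_{}.json", std::process::id(), n));
    let mut f = fs::File::create(&path)?;
    f.write_all(decoded_key)?;
    Ok(path)
}
```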
Currently third-party API clients only authenticate upon construction, generating new access tokens if the current tokens are expired. To be able to use a client in a long-running task, or to share a long-lived instance of a client, each task needs to check every call it makes for authentication errors and potentially re-authenticate and re-issue the request.
Instead, the clients should capture the necessary data for re-authenticating upon construction and internally handle re-authentication (likely behind an option) when an authentication error occurs.
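The internal retry could take roughly this shape (a sketch; the error type and re-auth hook are hypothetical stand-ins for whatever the real clients use):

```rust
// Sketch: run a request, and on an auth error re-authenticate once and retry.
#[derive(Debug, PartialEq)]
enum ApiError {
    Auth,
    Other(String),
}

fn with_reauth<T>(
    mut reauth: impl FnMut(),
    mut call: impl FnMut() -> Result<T, ApiError>,
) -> Result<T, ApiError> {
    match call() {
        Err(ApiError::Auth) => {
            // Refresh credentials, then re-issue the request once.
            reauth();
            call()
        }
        other => other,
    }
}
```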
RFD discussion links are generated and stored during a PR "open" action webhook. Once generated, any commit that removes the discussion link from the main `.adoc` file will cause the discussion URL to be interpreted as a blank value and erase the old link. Because generation only happens during the "open" action, discussion links can never self-correct.
This was seen in https://github.com/oxidecomputer/rfd/pull/434 where the discussion link was written during https://github.com/oxidecomputer/rfd/runs/6712689245, but subsequently deleted during https://github.com/oxidecomputer/rfd/compare/2d4470fa6c57502e71537007503058e364d03a7d..c8d5d8b3559f2eace92c918074db5afc8ed53438
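One possible guard (a sketch, not the current handler logic): only overwrite the stored discussion link when the parsed document actually provides a non-empty one.

```rust
// Sketch: keep the previously stored discussion link unless the parsed
// document provides a non-empty replacement.
fn next_discussion_url(stored: Option<String>, parsed: &str) -> Option<String> {
    if parsed.trim().is_empty() {
        // A blank value in the .adoc should not erase the old link.
        stored
    } else {
        Some(parsed.trim().to_string())
    }
}
```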
Huddle reminders were sent with encoded apostrophes and quotes in the notes section.
Notes: I'd like to ... highlight "what ... making", ... week'
See: Control Plane topics 5/24
I've tried to figure out why this is happening and cannot. My GSuite calendar has recurring host boot huddle entries far into the future; they seem like they should match the fuzzy string from configs. However, the airtable calendar (https://airtable.com/tblb9ACUfEopO5Rx2/viwbDtmYCWzUF8EEO?blocks=bipLfQLiV0dJ5dXRC) has past meetings, then a meeting that I canceled via GSuite, and then the next (and only) meeting listed is July 20. From looking at the code, I would expect to see 13 weeks worth of future meetings automatically populated. Not surprisingly, this also makes it impossible to submit an agenda item for any other meeting date. This has been the same for over a week, so the regular jobs to update huddles aren't fixing it. As none of the diagnostic printfs from the crate end up in the job output and the cargo test used to do this population requires things I don't have to run (right?), I'm not sure how to debug this.
Currently a build and deploy takes roughly one hour. Determine a best path forward to reduce this. This may be in the shape of reducing complexity or in breaking apart the deployable pieces, but it needs investigation.
Focusing initially on likely RFD and HR functionality.
see #74
`sync-shorturls` (and by extension other functions like `sync-other`) has started running into Cloudflare rate limiting:
More than 1200 requests per 300 seconds reached. Please wait and consider throttling your request speed
Currently every record performs a list request, and as such can easily exhaust the limit. A few options:
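One option is to issue the list request once per run and resolve each record against a local index, instead of one list request per record. A sketch (the record shape and names are illustrative):

```rust
use std::collections::HashMap;

// Sketch: build a local index from a single list call, then look records up
// in memory rather than hitting the API once per record. With a ~1200
// requests / 300 seconds limit, one list call per run stays well inside it.
fn index_by_name(records: Vec<(String, String)>) -> HashMap<String, String> {
    records.into_iter().collect()
}
```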
During `refresh_new_applicants_and_reviews`, applicants are processed 3 at a time. If any applicant fails (i.e. arbitrary external service failures), the error is returned early and short-circuits processing. This leaves the remaining applicants unprocessed.
Ideally each applicant can be processed and succeed or fail independently.
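The desired shape, sketched with plain `Result`s (the applicant type is a stand-in): collect per-applicant outcomes instead of short-circuiting with `?`.

```rust
// Sketch: process every applicant, recording failures instead of returning
// early, so one bad external call cannot block the rest of the batch.
fn process_all<T, E>(
    applicants: Vec<T>,
    mut process: impl FnMut(&T) -> Result<(), E>,
) -> Vec<(T, Result<(), E>)> {
    applicants
        .into_iter()
        .map(|a| {
            let r = process(&a);
            (a, r)
        })
        .collect()
}
```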
This occurs a few times per day in webhooky. Looks to be correlated to M-F work (as expected). Needs investigation.
2022-05-05 13:50:35.960 CDT thread 'tokio-runtime-worker' has overflowed its stack
2022-05-05 13:50:35.960 CDT fatal runtime error: stack overflow
2022-05-05 13:50:35.961 CDT Uncaught signal: 6, pid=1, tid=3, fault_addr=0.
Quickbooks sync is failing on `let bill_payments = qb.list_bill_payments().await?;` with a `400 Bad Request`.
Existing and new rack line signups should be pushed to Zoho as Leads. They should be written once, and then marked as complete internally. They should not be synchronized.
Fetching meeting chat logs are failing with permission errors. This results in cio writing the error message in place of the chat log.
Hi! Thanks for the great crates pack :)
I'm looking for the library to abstract away interaction with Zoom API, and your crate looks like what I need!
There is one detail though: my Zoom App is OAuth 2.0 App, and it looks like currently your lib supports only JWT-apps.
Is it possible to consider adding some option to allow OAuth 2.0 as well?
It looks like to do so we just need a possibility to provide OAuth access token during the creation of Zoom
struct, instead of jwt key and secret.
I'd be glad to be of any help if you consider this feature worthy!
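A possible shape for the request above (a sketch of the suggestion, not the crate's actual API): let `Zoom` construction accept either credential kind.

```rust
// Sketch: allow either JWT credentials or a pre-obtained OAuth 2.0 access
// token when constructing the client. All names here are illustrative.
enum ZoomAuth {
    Jwt { key: String, secret: String },
    OAuth { access_token: String },
}

impl ZoomAuth {
    // Produce the bearer token for a request's Authorization header.
    fn bearer_token(&self) -> String {
        match self {
            // Placeholder: a real implementation would sign a JWT from
            // key/secret here.
            ZoomAuth::Jwt { key, .. } => format!("jwt-for-{}", key),
            ZoomAuth::OAuth { access_token } => access_token.clone(),
        }
    }
}
```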
This issue is a tracking issue for a rework of the job runner model that cio-bot uses.
CIO does a good job of providing a readable (in code) encapsulation of the business rules and processes that run many day-to-day operations that are otherwise often spread across multiple departments and people. Currently, though, it struggles to report back what work it has done and why. Parts of the execution (specifically cron jobs) collect logs, but they are unstructured and lose their level designation when globally integrated. Webhook handlers track separate data to global log storage, GitHub commit comments, and GitHub Check Runs. Additionally, all handlers send some portion of logs (warnings and errors) to Sentry.
As such in the current state, asking "What did handler X do?" or "Why did handler Y do Z?" requires tracing the cio codebase to understand how that handler reports its data. While there are some patterns that can be learned as a starting point, we can make this significantly more effective.
To summarize, we are specifically interested in improving "how" cio does the work it does, as opposed to "what" work cio does.
The key goals that we want to hit during this implementation are:
Research if the keys API is sufficient for automating the provisioning of new Tailscale auth keys when the old ones expire. https://github.com/tailscale/tailscale/blob/main/api.md#keys
If this is not possible, then determine an alerting mechanism for when keys are within a week of expiring.
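If automated provisioning isn't possible, the alert condition itself is simple to evaluate. A sketch using plain epoch seconds (how expiry timestamps are fetched from the keys API is out of scope here):

```rust
// Sketch: flag a Tailscale auth key whose expiry falls within one week of
// "now", so an alert can be raised before the key lapses.
const WEEK_SECS: u64 = 7 * 24 * 60 * 60;

fn expires_within_week(expiry_epoch_secs: u64, now_epoch_secs: u64) -> bool {
    expiry_epoch_secs <= now_epoch_secs + WEEK_SECS
}
```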
Deserialization of Airtable users during `sync-configs` started failing on 6/2:
**********@oxidecomputer.com` failed: error decoding response body: premature end of input at line 1 column 156, pids: [308], saga_id: a3a6e586-1446-4625-8bb4-f89f1d44bf1d, cmd: sync-configs
RFD discussion links are only auto-committed during the webhook handler that responds to the opening of a PR for a given RFD. If the discussion URL is then removed in a future commit, cio will not restore the URL to the document. Additionally the URL will be removed from the cio database, causing the RFD site to fail to display a link to the relevant PR. `handle_rfd_push` has an audit for testing and an initial fix that will update discussion URLs in response to commit push webhooks.
Two tasks are needed to address this:
`sync-recorded-meetings` is scheduled to run every two hours, but takes roughly 2h10m-2h30m to run.
We are sending far too many entries to Sentry and quickly exceeding limits. Trace sampling needs to be reduced (down to 10-20%), and custom sampling should be implemented for db query traces (down to 0.1%).
#185 fixed the issue with repository events failing to be handled. Syncing settings, though, failed with a repo that was recently added to the org (06/14 14:25 CDT). The handler failed without an error message at some point prior to assigning teams. This particular repo did not have any commits pushed, and it is possible that sync settings needs to be changed to account for this case.
https://remote.com/blog/introducing-remote-global-employee-api versus Gusto: this should use a trait that both the Gusto and Remote.com clients implement, since people are managed by one or the other. Something like an `HRService` trait, or something named more appropriately.
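The suggested trait might be sketched as follows (the method set is hypothetical; the point is that callers hold one abstraction regardless of provider):

```rust
// Sketch: a common trait that both the Gusto and Remote.com providers would
// implement, since each person is managed by exactly one of them.
trait HrService {
    fn provider_name(&self) -> &'static str;
}

struct Gusto;
struct Remote;

impl HrService for Gusto {
    fn provider_name(&self) -> &'static str { "gusto" }
}

impl HrService for Remote {
    fn provider_name(&self) -> &'static str { "remote.com" }
}

// Callers hold a trait object and don't care which provider backs a person.
fn describe(svc: &dyn HrService) -> String {
    format!("managed by {}", svc.provider_name())
}
```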
when not causing tokio stack overflows on endpoints
Since deploying the Google auth and CloudFlare rate limit fixes (f433244) there have been two instances (so far) of hitting the memory limit in the cron container. Overall behavior is not dramatically different than previous runs, except for spikes now reaching the 80-90% usage level instead of 65%.
If any meeting attendee in the list of users sent to Airtable can not be found, then nothing is written to attendees. Currently this is visible for recent Hardware Huddles, but is likely a general issue when trying to write users to Airtable. Need to determine which of these is true:
GitHub supports creating a set of default labels that will be pre-configured for all new repos in the organization. These labels can be edited or deleted though, and creating default labels does not apply to existing repos.
Add the ability to:
Hello, I'm attempting to use your library and get the following error when trying the Google Drive example:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: APIError: status code -> 400 Bad Request, body -> {
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Invalid Value",
"locationType": "parameter",
"location": "q"
}
],
"code": 400,
"message": "Invalid Value"
}
}
The google sheets example works fine -- any tips on how to debug this?
Add database and Airtable tables for tracking orders. This will need to store:
To repro:
$ rustup update
$ cargo build
Output:
error: could not compile `slack-chat-api` due to previous error
warning: build failed, waiting for other jobs to finish...
error: future cannot be sent between threads safely
--> slack/src/lib.rs:425:5
|
425 | #[async_recursion]
| ^^^^^^^^^^^^^^^^^^ future created by async block is not `Send`
|
= help: the trait `Sync` is not implemented for `core::fmt::Opaque`
note: future is not `Send` as this value is used across an await
--> slack/src/lib.rs:433:66
|
433 | bail!("status code: {}, body: {}", s, resp.text().await?);
| -------------------------------------------------^^^^^^--
| | |
| | await occurs here, with `$crate::__export::format_args!($($arg)*)` maybe used later
| has type `ArgumentV1<'_>` which is not `Send`
| `$crate::__export::format_args!($($arg)*)` is later dropped here
= note: required for the cast to the object type `dyn std::future::Future<Output = Result<FormattedMessageResponse, anyhow::Error>> + Send`
= note: this error originates in the attribute macro `async_recursion` (in Nightly builds, run with -Z macro-backtrace for more info)
Placeholder. Need details on location of RFD bot.
>>> !rfd 273
bot >>> RFD 158 I²C Multiplexing (discussion) [github](https://158.rfd.oxide.computer/) [rendered](https://rfd.shared.oxide.computer/rfd/0158) [discussion](https://github.com/oxidecomputer/rfd/pull/229)
Syncing users in response to a change in `users.toml` exceeds the time limit allowed for a request. This causes the request to fail mid-processing of the users map. There are a number of optimizations that can be made, but we need a fairly large improvement to get under the required time.
This may end up needing to be addressed by changing to an async job system. If we do want to defer this, then we will need to increase the frequency at which the sync runs.
`docker-image-webhooky` frequently runs out of disk space. `docker-image-cio` occasionally runs out of space. Needs investigation into what is consuming space and whether it is necessary.
WARNING: local cache import at /tmp/.buildx-cache not found due to err: could not read /tmp/.buildx-cache/index.json: open /tmp/.buildx-cache/index.json: no such file or directory
error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown
Error: buildx failed with: error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown
It's this line triggering it: https://github.com/oxidecomputer/cio/blob/master/cio/src/rfds.rs#L693
Something to do with the third-party clients being updated to use https://docs.rs/reqwest-middleware/latest/reqwest_middleware/ for tracing and retrying requests.
`cloudflare.rs:288` incorrectly indexes into an empty array on a cache miss. It should instead return an `Option` for the found record.
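The fix is to return the first matching record as an `Option` rather than indexing. A sketch (record type simplified to `String`):

```rust
// Sketch: `records[0]` panics on a cache miss (empty slice); `.first()`
// returns None instead, letting the caller handle the miss.
fn find_record(records: &[String]) -> Option<&String> {
    records.first()
}
```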
A number of these parse issues have been resolved, but infrequently requests come in with payloads that do not match the struct definitions in `webhooky/src/github_types.rs`.
For reference, parsing is currently performed via oxidecomputer/dropshot (handler argument).
Recent example:
May 06 08:00:23.194 INFO request completed, error_message_external: unable to parse body: invalid type: null, expected a string at line 1 column 2714, error_message_internal: unable to parse body: invalid type: null, expected a string at line 1 column 2714, response_code: 400, uri: /github, method: POST, req_id: cb4a404c-d8ff-4be1-b8f4-7cb09ab98e09
Writing records to the Functions table in Airtable frequently generates duplicate records with the same `saga_id`. Determine whether there are also duplicated internal records or if this is only visible in Airtable.
Currently webhook receipt and handling history are not recorded outside of temporary log messages. We want to be able to know historically:
In `cio`, specific RFDs have started failing to generate due to a failure to launch puppeteer:
Jun 30 11:53:44.356 INFO /usr/local/lib/node_modules/puppeteer/.local-chromium/linux-1002410/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory, pids: [3135], saga_id: 3b86eb1e-9e33-4274-9f26-99e055e49135, cmd: sync-rfds
In the most recent sync, `0256`, `0273`, and `0276` failed.
With regard to applicants.rs, I am curious whether specialized formatting is preserved during read_pdf()'s extraction. Would a preview page prior to submitting be worthwhile?
The company sync job currently runs every 12 hours, but the company record contains critical data that we want to have update in near real time. Implement a webhook listener for these changes so that the internal Postgres db can be updated faster.
System accounts should be able to be (optionally) provisioned in Ramp. This is needed to be able to authenticate as a system account during automation. If possible this account should not have permissions to create new cards.
Likely due to the fact that production uses a unix socket for postgres comms, since it works locally over TCP.