oxidecomputer / cio

Rust libraries for APIs needed by our automated CIO.

License: Apache License 2.0

Rust 99.44% PLpgSQL 0.06% Dockerfile 0.35% Makefile 0.01% Shell 0.06% JavaScript 0.07%

cio's People

Contributors

20k-ultra, ahl, augustuswm, bcantrill, benjaminleonard, cbiffle, cmoog, david-crespo, dependabot[bot], github-actions[bot], jclulow, jessfraz, plainspace, rtsuk, smklein, tylerlafayette, wesolows, zephraph


cio's Issues

RFD Processing Documentation

Add documentation diagrams detailing what happens when the RFD commit and RFD pull request webhook responders run.

GitHub sync fails for users with expired invite

If a user in configs has been invited to the GitHub org in the past, but did not accept the invite and instead let it expire, subsequent attempts to add the user to the organization will fail with 422 errors. This is seen in src/providers.rs:242. Currently this causes the sync function to return early with a failure.

This should likely be changed, as we are currently left with a partially provisioned user. Either the remaining provisioning steps could be run and GitHub access left in its current state or manually triaged, or we could cancel the invite and re-issue it upon detecting that it has expired.
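
A minimal sketch of the cancel-and-reissue option. GitHubClient and all four methods below are hypothetical stand-ins, not the actual octorust API surface:

// Sketch only: `GitHubClient`, `is_org_member`, `find_pending_invitation`,
// `cancel_invitation`, and `invite_user` are hypothetical helpers.
async fn ensure_org_member(client: &GitHubClient, org: &str, user: &str) -> anyhow::Result<()> {
    if client.is_org_member(org, user).await? {
        return Ok(());
    }
    if let Some(invite) = client.find_pending_invitation(org, user).await? {
        if invite.is_expired() {
            // Cancel the stale invite so a fresh one can be issued,
            // instead of failing the whole sync with a 422.
            client.cancel_invitation(org, invite.id).await?;
            client.invite_user(org, user).await?;
        }
        // A still-valid invite is left alone; the user simply hasn't accepted.
        return Ok(());
    }
    client.invite_user(org, user).await?;
    Ok(())
}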

Long running sagas are cancelled on shutdown

When cio is shut down, any long-running sagas are marked as cancelled. This means that any deployment will:

  1. Interrupt currently running sagas
  2. Prevent those sagas from completing until next run

This issue may be helped by work on breaking down some of the long-running sagas, but ideally there is a way to version these jobs such that, post-deployment, they can either be resumed (if they are still valid) or cancelled (if they are no longer compatible).

huddle autocancel has stopped working again

This worked properly until maybe 3 weeks ago. Since then, the host boot huddle never gets canceled even though the meetings are in the calendar and properly picked up by Airtable. I haven't had a chance to debug this, and probably won't for a long time, but it might be a good idea to look through changes from a few weeks ago.

Add support for specifying GSuite resource category

Resources synced from configs are currently assumed to be CONFERENCE_ROOM resources. This should be generalized to support OTHER and CATEGORY_UNKNOWN so that we can better represent https://github.com/oxidecomputer/configs/issues/90 (a sketch of the config change follows the checklist below).

  • Update ResourceConfig to support a category field. Default this field to CONFERENCE_ROOM.
  • Refactor db backing for ResourceConfig to use a generic Resource name instead of ConferenceRoom. This should additionally create migrations for a resources table.
  • Copy data from conference_rooms table to resources table.
  • Update AIRTABLE_CONFERENCE_ROOMS_TABLE constant.
  • Determine plan for updating AIRTABLE_CONFERENCE_ROOMS_TABLE value.
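
A minimal sketch of the first item, assuming ResourceConfig is a serde-deserialized config struct; the field and enum names are illustrative:

use serde::{Deserialize, Serialize};

// Illustrative categories mirroring the GSuite calendar resource categories.
#[derive(Clone, Copy, Debug, Deserialize, Serialize, PartialEq)]
#[serde(rename_all = "SCREAMING_SNAKE_CASE")]
pub enum ResourceCategory {
    ConferenceRoom,
    Other,
    CategoryUnknown,
}

impl Default for ResourceCategory {
    // Existing configs omit the field, so they keep their current behavior.
    fn default() -> Self {
        ResourceCategory::ConferenceRoom
    }
}

#[derive(Clone, Debug, Deserialize, Serialize)]
pub struct ResourceConfig {
    pub name: String,
    pub description: String,
    // Defaults to CONFERENCE_ROOM when absent from the config file.
    #[serde(default)]
    pub category: ResourceCategory,
}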

Google Drive download_by_id does not have error handling

This is fundamentally an issue with our Google Drive client (and other clients), but tracking it here as it has direct impact.

When calling download_by_id, the Drive client calls request_raw. request_raw returns the response body directly to the caller without any interference, leaving it to the caller to perform any necessary error handling. download_by_id does not perform any checking of the response status or headers and instead immediately translates the response body into a bytes::Bytes. At this point, any data about the status of the request has been lost, and a caller only has the raw bytes of the response to make determinations on.

In the case of cio, the bytes returned are treated as a String, on the assumption that if the client had failed to download the file an error would have been returned. The visible result of this is server error messages being written to the database as if they were successful downloads. Specifically, we have seen this when failing to download a chat log file: the error body for a 401 response overwrote the stored chat log data.

To resolve this we will need to address a couple of sub-issues (both are sketched after the list):

  • Update Google Drive client to return errors when download files fails
  • Only write chat log data when a file is successfully downloaded (currently unwrap_or_default is used which will overwrite with an empty string on failure)
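
A minimal sketch of both fixes, assuming a reqwest-style response is available inside the client; the actual client internals differ:

use anyhow::{bail, Result};
use bytes::Bytes;

// Sketch: check the response status before handing bytes to the caller,
// so an error body can never masquerade as file contents.
async fn download_by_id(resp: reqwest::Response) -> Result<Bytes> {
    let status = resp.status();
    if !status.is_success() {
        bail!("download failed: status {}, body: {}", status, resp.text().await?);
    }
    Ok(resp.bytes().await?)
}

// Sketch of the cio side: only persist chat log data on success, instead of
// unwrap_or_default, which overwrites the stored data with an error body or
// an empty string on failure.
//
// match drive.download_by_id(&file_id).await {
//     Ok(bytes) => record.chat_log = String::from_utf8_lossy(&bytes).to_string(),
//     Err(e) => log::warn!("failed to download chat log {}: {}", file_id, e),
// }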

GoogleDrive authentication race condition

Each time the GoogleDrive client is constructed it reads the GOOGLE_KEY_ENCODED env variable, creates a /tmp/google_key.json file (or truncates if it exists), and then writes the decoded key value to the file. There is a race condition where:

  • Thread A creates the file
  • Thread A writes to the file
  • Thread B truncates the file
  • Thread A reads the value back from the file

This can be seen in the applicants refresh job, where multiple applicants are processed at the same time and the job fails during GoogleDrive authentication.
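
One possible fix, sketched below with the tempfile crate: give each client its own uniquely named key file instead of sharing /tmp/google_key.json, which removes the shared mutable path entirely:

use std::io::Write;

use anyhow::Result;
use tempfile::NamedTempFile;

// Sketch: decode GOOGLE_KEY_ENCODED into a unique temp file per client, so
// concurrent constructions can no longer truncate each other's key file.
fn write_google_key() -> Result<NamedTempFile> {
    let encoded = std::env::var("GOOGLE_KEY_ENCODED")?;
    let decoded = base64::decode(encoded)?;
    let mut file = NamedTempFile::new()?;
    file.write_all(&decoded)?;
    // The file is removed when the handle is dropped, so the client must
    // hold on to it for as long as it needs the path.
    Ok(file)
}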

Enable third party api clients to self refresh

Currently, third party api clients only authenticate upon construction, generating new access tokens if the current tokens are expired. To be able to use a client in a long running task, or to share a long lived instance of a client, each task needs to check every call it makes for authentication errors and potentially re-authenticate and re-issue the request.

Instead, the clients should capture the necessary data for re-authentication upon construction and internally handle re-authentication (likely behind an option) when an authentication error occurs. A sketch of the pattern follows the checklist below.

  • docusign
  • google/calendar
  • google/cloud-resource-manager
  • google/drive
  • google/groups-settings
  • google/sheets
  • gsuite
  • gusto
  • mailchimp
  • ramp
  • shopify
  • slack
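
A minimal sketch of the pattern; the names are illustrative and not any specific client's API, and exchange_refresh_token stands in for each provider's OAuth token call:

use std::time::{Duration, Instant};

use anyhow::Result;

// Sketch: the client captures its refresh credentials at construction so
// any call site can transparently obtain a valid access token.
pub struct Client {
    access_token: String,
    refresh_token: String,
    expires_at: Instant,
    auto_refresh: bool, // "behind an option": callers opt in to self-refresh
}

impl Client {
    async fn token(&mut self) -> Result<&str> {
        if self.auto_refresh && Instant::now() >= self.expires_at {
            self.refresh().await?;
        }
        Ok(&self.access_token)
    }

    async fn refresh(&mut self) -> Result<()> {
        let (token, ttl_secs) = exchange_refresh_token(&self.refresh_token).await?;
        self.access_token = token;
        self.expires_at = Instant::now() + Duration::from_secs(ttl_secs);
        Ok(())
    }
}

// Hypothetical stand-in for the provider-specific token endpoint exchange.
async fn exchange_refresh_token(_refresh_token: &str) -> Result<(String, u64)> {
    unimplemented!()
}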

Update RFD discussion links during refresh

RFD discussion links are generated and stored during a PR "open" action webhook. Once generated, any commit that removes the discussion link from the main .adoc file will cause the discussion url to be interpreted as a blank value and erase the old link. Because generation only happens during the "open" action, discussion links can never self-correct.

This was seen in https://github.com/oxidecomputer/rfd/pull/434 where the discussion link was written during https://github.com/oxidecomputer/rfd/runs/6712689245, but subsequently deleted during https://github.com/oxidecomputer/rfd/compare/2d4470fa6c57502e71537007503058e364d03a7d..c8d5d8b3559f2eace92c918074db5afc8ed53438

Huddle reminder encoding

Huddle reminders were sent with encoded apostrophes and quotes in the notes section.

Notes: I'd like to ... highlight "what ... making", ... week'

See: Control Plane topics 5/24

huddle calendar updates not working

I've tried to figure out why this is happening and cannot. My GSuite calendar has recurring host boot huddle entries far into the future; they seem like they should match the fuzzy string from configs. However, the airtable calendar (https://airtable.com/tblb9ACUfEopO5Rx2/viwbDtmYCWzUF8EEO?blocks=bipLfQLiV0dJ5dXRC) has past meetings, then a meeting that I canceled via GSuite, and then the next (and only) meeting listed is July 20. From looking at the code, I would expect to see 13 weeks worth of future meetings automatically populated. Not surprisingly, this also makes it impossible to submit an agenda item for any other meeting date. This has been the same for over a week, so the regular jobs to update huddles aren't fixing it. As none of the diagnostic printfs from the crate end up in the job output and the cargo test used to do this population requires things I don't have to run (right?), I'm not sure how to debug this.

Improve build and deploy times

Currently a build and deploy takes roughly one hour. Determine the best path forward to reduce this. It may take the shape of reducing complexity or of breaking apart the deployable pieces, but it needs investigation.

Likely focusing on RFD and HR functionality initially.

CloudFlare Rate Limiting

sync-shorturls (and by extension other functions like sync-other) has started running into CloudFlare rate limiting:

More than 1200 requests per 300 seconds reached. Please wait and consider throttling your request speed

Currently every record performs a list request, and as such can easily exhaust the limit. A few options (the first is sketched after the list):

  • Add forced delay to ensure we are below the limit
  • Fetch and cache all records up front during the sync process
  • Use DNS resolution to verify that a name resolves as expected instead of using CloudFlare API (this does not provide the same guarantees as what the sync currently does)
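
A minimal sketch of the forced-delay option: spacing requests at 250ms (4 req/s) keeps a full sync under the 1200-requests-per-300-seconds limit. Record and list_dns_record stand in for the existing types and per-record request:

use std::time::Duration;

// Sketch: pace the per-record list requests below CloudFlare's limit.
// 1200 requests / 300 seconds = 4 req/s, i.e. one request every 250ms.
async fn sync_shorturls(records: Vec<Record>) -> anyhow::Result<()> {
    let mut throttle = tokio::time::interval(Duration::from_millis(250));
    for record in records {
        throttle.tick().await;
        list_dns_record(&record).await?; // existing per-record list request
    }
    Ok(())
}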

Single applicant failure causes sync to fail

During refresh_new_applicants_and_reviews, applicants are processed 3 at a time. If any applicant fails (e.g., an arbitrary external service failure), the error is returned early and short-circuits processing. This leaves the remaining applicants unprocessed.

Ideally each applicant can be processed and succeed / fail independently, as in the sketch below.
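
A minimal sketch with futures::StreamExt, keeping the three-wide concurrency but collecting a per-applicant result instead of short-circuiting; Applicant, its email field, and process_applicant stand in for the existing types and work:

use futures::stream::{self, StreamExt};

// Sketch: process applicants 3 at a time, but let each one succeed or
// fail independently and report the failures at the end.
async fn refresh_applicants(applicants: Vec<Applicant>) {
    let results: Vec<(String, anyhow::Result<()>)> = stream::iter(applicants)
        .map(|applicant| async move {
            let id = applicant.email.clone();
            (id, process_applicant(applicant).await)
        })
        .buffer_unordered(3)
        .collect()
        .await;

    for (id, result) in results {
        if let Err(err) = result {
            // A single failure no longer aborts the remaining applicants.
            log::warn!("applicant {} failed to process: {:?}", id, err);
        }
    }
}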

Tokio stack overflow

This occurs a few times per day in webhooky. Looks to be correlated with M-F work (as expected). Needs investigation.

2022-05-05 13:50:35.960 CDT thread 'tokio-runtime-worker' has overflowed its stack
2022-05-05 13:50:35.960 CDT fatal runtime error: stack overflow
2022-05-05 13:50:35.961 CDT Uncaught signal: 6, pid=1, tid=3, fault_addr=0.

Quickbooks sync failing

Quickbooks sync is failing on let bill_payments = qb.list_bill_payments().await?; with a 400 Bad Request

Push lead signups to CRM

Existing and new rack line signups should be pushed to Zoho as Leads. They should be written once, and then marked as complete internally. They should not be synchronized.

OAuth 2.0 support in zoom-api

Hi! Thanks for the great crates pack :)

I'm looking for the library to abstract away interaction with Zoom API, and your crate looks like what I need!
There is one detail though: my Zoom App is an OAuth 2.0 App, and it looks like your lib currently supports only JWT apps.

Is it possible to consider adding some option to allow OAuth 2.0 as well?
It looks like, to do so, we just need the ability to provide an OAuth access token during the creation of the Zoom struct, instead of a JWT key and secret.

I'd be glad to be of any help if you consider this feature worthy!

Move to full asynchronous runner

This issue is a tracking issue for a rework of the job runner model that cio-bot uses.

Motivation

CIO does a good job of providing a readable (in code) encapsulation of the business rules and processes that run many day-to-day operations that are otherwise often spread across multiple departments and people. It currently struggles, though, in reporting back what work it has done and why. Parts of the execution (specifically cron jobs) collect logs, but they are unstructured and lose their level designation when globally integrated. Webhook handlers write separate data to global log storage, GitHub commit comments, and GitHub Check Runs. Additionally, all handlers send some portion of logs (warnings and errors) to Sentry.

As such, in the current state, asking "What did handler X do?" or "Why did handler Y do Z?" requires tracing through the cio codebase to understand how that handler reports its data. While there are some patterns that can be learned as a starting point, we can make this significantly more effective.

To summarize, we are specifically interested in improving "how" cio does the work it does, as opposed to "what" work cio does.

Goals

The key goals that we want to hit during this implementation are:

  • Retain all existing functionality of current cio handlers
  • Collect logs and tracing metadata for all handlers in a global log store
  • Store payloads of accepted incoming events
  • Track handlers that have been run against a given event
  • Update webhook handlers to immediately respond to requests, and execute work asynchronously
  • Allow redelivery of events
  • Allow re-running of handlers
  • Generalize cron and webhook handlers into a single execution model
  • Merge cron and event handlers so that their core functionality is the same

Airtable user requests failing to decode

Deserialization of Airtable users during sync-configs started failing on 6/2:

**********@oxidecomputer.com` failed: error decoding response body: premature end of input at line 1 column 156, pids: [308], saga_id: a3a6e586-1446-4625-8bb4-f89f1d44bf1d, cmd: sync-configs

Ensure discussion link during RFD sync

RFD discussion links are only auto-committed during the webhook handler that responds to the opening of a PR for a given RFD. If the discussion url is then removed in a future commit, cio will not restore the url to the document. Additionally, the url will be removed from the cio database, causing the RFD site to fail to display a link to the relevant PR. handle_rfd_push has an audit for testing and an initial fix that will update discussion urls in response to commit push webhooks.

Two tasks are needed to address this:

  • Verify that committing the discussion url in response to a commit will not result in a loop
  • Determine a way to verify / update discussion urls during a sync run (i.e. absent of the data a webhook provides)

Reduce Sentry traffic

We are sending far too many entries to Sentry and quickly exceeding limits. Trace sampling needs to be reduced (down to 10-20%), and custom sampling should be implemented for db query traces (down to 0.1%).
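
A minimal sketch using the sentry crate's traces_sampler; the "db.query" operation name is an assumption about how our db spans are tagged, not a sentry constant:

use std::sync::Arc;

// Sketch: sample most traces at 10% and db query traces at 0.1%.
fn sentry_options() -> sentry::ClientOptions {
    sentry::ClientOptions {
        traces_sampler: Some(Arc::new(|ctx: &sentry::TransactionContext| {
            if ctx.operation() == "db.query" {
                0.001 // custom low rate for db query traces
            } else {
                0.1 // global trace sampling, down from the current rate
            }
        })),
        ..Default::default()
    }
}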

Repo settings sync during webhook fails silently

#185 fixed the issue with repository events failing to be handled. Syncing settings, though, failed for a repo that was recently added to the org (06/14 14:25 CDT). The handler failed without an error message at some point prior to assigning teams. This particular repo did not have any commits pushed, and it is possible that the settings sync needs to be changed to account for this case.

Memory exhaustion

Since deploying the Google auth and CloudFlare rate limit fixes (f433244), there have been two instances (so far) of hitting the memory limit in the cron container. Overall behavior is not dramatically different from previous runs, except for spikes now reaching the 80-90% usage level instead of 65%.

User lists fail to write to Airtable

If any meeting attendee in the list of users sent to Airtable cannot be found, then nothing is written to attendees. Currently this is visible for recent Hardware Huddles, but it is likely a general issue when trying to write users to Airtable. Need to determine which of these is true:

  1. Any user in the system can be written to a cell
  2. Only users with access to a base / table can be written to a cell

Add label syncing for Repos

GitHub supports creating a set of default labels that will be pre-configured for all new repos in the organization. These labels can be edited or deleted though, and creating default labels does not apply to existing repos.

Add the ability to (a sketch follows the list):

  • Assign a set of labels that are mandatory / default for repos
  • Ensure that the label exists on each repo during repo sync
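
A sketch of the sync step; the REQUIRED_LABELS values are illustrative, and list_labels / create_label are hypothetical stand-ins for whatever our GitHub client actually exposes:

// Labels that every repo in the org must carry; values are illustrative.
const REQUIRED_LABELS: &[(&str, &str)] = &[
    ("rfd", "d4c5f9"),
    ("meta", "0e8a16"),
];

// Sketch: during repo sync, create any required label the repo is missing.
// `client.list_labels` and `client.create_label` are assumed helpers, not
// the actual octorust API surface.
async fn ensure_labels(client: &Client, org: &str, repo: &str) -> anyhow::Result<()> {
    let existing = client.list_labels(org, repo).await?;
    for (name, color) in REQUIRED_LABELS {
        if !existing.iter().any(|label| label.name == *name) {
            client.create_label(org, repo, name, color).await?;
        }
    }
    Ok(())
}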

google drive error

Hello, I'm attempting to use your library and get the following error when trying the Google Drive example:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: APIError: status code -> 400 Bad Request, body -> {
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "invalid",
    "message": "Invalid Value",
    "locationType": "parameter",
    "location": "q"
   }
  ],
  "code": 400,
  "message": "Invalid Value"
 }
}

The google sheets example works fine -- any tips on how to debug this?

Order tracking

Add database and Airtable tables for tracking orders. These will need to store the following (a record sketch follows the list):

  • Unique identifier for the recipient
  • Unique identifier for the order
  • Status of the full order
  • One or many external tracking ids
  • Status of individual external jobs
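
A minimal sketch of the record shape; field names and types are illustrative:

use std::collections::HashMap;

use serde::{Deserialize, Serialize};

// Sketch: one record per order, mirroring the fields listed above.
#[derive(Clone, Debug, Deserialize, Serialize)]
pub struct Order {
    pub recipient_id: String,      // unique identifier for the recipient
    pub order_id: String,          // unique identifier for the order
    pub status: String,            // status of the full order
    pub tracking_ids: Vec<String>, // one or many external tracking ids
    // status of each individual external job, keyed by tracking id
    pub external_statuses: HashMap<String, String>,
}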

`master` branch broken on 1.60.0-nightly

To repro:

$ rustup update
$ cargo build

Output:

error: could not compile `slack-chat-api` due to previous error
warning: build failed, waiting for other jobs to finish...
error: future cannot be sent between threads safely
   --> slack/src/lib.rs:425:5
    |
425 |     #[async_recursion]
    |     ^^^^^^^^^^^^^^^^^^ future created by async block is not `Send`
    |
    = help: the trait `Sync` is not implemented for `core::fmt::Opaque`
note: future is not `Send` as this value is used across an await
   --> slack/src/lib.rs:433:66
    |
433 |                 bail!("status code: {}, body: {}", s, resp.text().await?);
    |                 -------------------------------------------------^^^^^^--
    |                 |                                                |
    |                 |                                                await occurs here, with `$crate::__export::format_args!($($arg)*)` maybe used later
    |                 has type `ArgumentV1<'_>` which is not `Send`
    |                 `$crate::__export::format_args!($($arg)*)` is later dropped here
    = note: required for the cast to the object type `dyn std::future::Future<Output = Result<FormattedMessageResponse, anyhow::Error>> + Send`
    = note: this error originates in the attribute macro `async_recursion` (in Nightly builds, run with -Z macro-backtrace for more info)

RFD Bot posting incorrect link

Placeholder. Need details on location of RFD bot.

>>> !rfd 273 
bot >>> RFD 158 I²C Multiplexing (discussion) [github](https://158.rfd.oxide.computer/) [rendered](https://rfd.shared.oxide.computer/rfd/0158) [discussion](https://github.com/oxidecomputer/rfd/pull/229)

User configs webhook takes too long to run

Syncing users in response to a change in users.toml exceeds the time limit allowed for a request. This causes the request to fail mid-processing of the users map. There are a number of optimizations that can be made, but we need a fairly large improvement to get under the required time.

This may end up needing to be addressed by changing to an async job system. If we decide to defer that change, then we will need to increase the frequency at which the sync runs.

Disk usage on Docker image actions

  • docker-image-webhooky frequently runs out of disk space
  • docker-image-cio occasionally runs out of space

Needs investigation into what is consuming space and whether it is necessary.

WARNING: local cache import at /tmp/.buildx-cache not found due to err: could not read /tmp/.buildx-cache/index.json: open /tmp/.buildx-cache/index.json: no such file or directory
error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown
Error: buildx failed with: error: failed to solve: error writing layer blob: failed to copy: failed to send write: write /tmp/.buildx-cache/ingest/329fb5a686d86abb1d433305b851144eeece294c8ddd20f17be04691017128b1/data: no space left on device: unknown

Failure to parse incoming GitHub webhook

A number of these parse issues have been resolved, but requests occasionally come in with payloads that do not match the struct definitions in webhooky/src/github_types.rs.

For reference, parsing is currently performed via oxidecomputer/dropshot (handler argument).

Recent example:

May 06 08:00:23.194 INFO request completed, error_message_external: unable to parse body: invalid type: null, expected a string at line 1 column 2714, error_message_internal: unable to parse body: invalid type: null, expected a string at line 1 column 2714, response_code: 400, uri: /github, method: POST, req_id: cb4a404c-d8ff-4be1-b8f4-7cb09ab98e09
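
The likely fix, sketched below, is to loosen the struct definitions so fields GitHub sometimes sends as null deserialize as None instead of failing the entire request; the field shown is illustrative:

use serde::Deserialize;

// Sketch: tolerate `null` (or an absent key) for fields GitHub does not
// always populate, rather than rejecting the whole payload with a 400.
#[derive(Debug, Deserialize)]
pub struct GitHubWebhookPayload {
    pub action: String,
    // Observed as null in some real payloads; Option + default accepts
    // both `"ref": null` and a missing key.
    #[serde(default, rename = "ref")]
    pub git_ref: Option<String>,
}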

Duplicating rows in Functions table

Writing records to the Functions table in Airtable frequently generates duplicate records with the same saga_id. Determine if there are also duplicated internal records or if this is only visible in Airtable.

Webhook tracking

Currently, webhook receipt and handling history are not recorded outside of temporary log messages. We want to be able to know, historically (a record sketch follows the list):

  • The events that were accepted by cio
  • Which handlers accepted the event
  • Which saga(s) were run for the handler
  • The saga log(s) associated with the handler invocation
  • The saga result(s) associated with the handler invocation
  • Status and runtime duration of each handler
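
A minimal sketch of the records implied by this list; names are illustrative, and the saga fields would join against the existing saga tables:

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

// Sketch: one record per accepted event...
#[derive(Debug, Deserialize, Serialize)]
pub struct WebhookEvent {
    pub id: uuid::Uuid,
    pub received_at: DateTime<Utc>,
    pub payload: serde_json::Value, // raw accepted payload, enabling redelivery
}

// ...and one record per handler invocation against that event.
#[derive(Debug, Deserialize, Serialize)]
pub struct HandlerInvocation {
    pub event_id: uuid::Uuid,
    pub handler: String,           // which handler accepted the event
    pub saga_ids: Vec<uuid::Uuid>, // saga(s) run for the handler; logs and
                                   // results are looked up via these ids
    pub status: String,            // terminal status of the handler
    pub duration_ms: i64,          // runtime duration
}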

PDF generation for RFD fails to launch

Specific RFDs have started failing to generate due to a failure to launch Puppeteer:

Jun 30 11:53:44.356 INFO /usr/local/lib/node_modules/puppeteer/.local-chromium/linux-1002410/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory, pids: [3135], saga_id: 3b86eb1e-9e33-4274-9f26-99e055e49135, cmd: sync-rfds

In the most recent sync, 0256, 0273, and 0276 failed.

Airtable company config webhook

The company sync job currently runs every 12 hours, but the company record contains critical data that we want to have updated in near real time. Implement a webhook listener for these changes so that the internal Postgres db can be updated faster.
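
A minimal sketch of the listener as a dropshot endpoint, in line with how webhooky handles other hooks; the payload shape, ServerContext, and update_company are assumptions:

use std::sync::Arc;

use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext, TypedBody};
use schemars::JsonSchema;
use serde::Deserialize;

// Sketch of the incoming change notification; the real shape would need to
// be confirmed against the Airtable webhook payload.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct CompanyChange {
    pub record_id: String,
}

// Sketch: refresh the company row in Postgres as soon as Airtable reports
// a change, instead of waiting up to 12 hours for the sync job.
#[endpoint {
    method = POST,
    path = "/airtable/company/edit",
}]
async fn listen_airtable_company_edit(
    rqctx: Arc<RequestContext<ServerContext>>,
    body: TypedBody<CompanyChange>,
) -> Result<HttpResponseOk<()>, HttpError> {
    let change = body.into_inner();
    // `update_company` is a hypothetical helper wrapping the Postgres update.
    update_company(rqctx.context(), &change.record_id)
        .await
        .map_err(|e| HttpError::for_internal_error(e.to_string()))?;
    Ok(HttpResponseOk(()))
}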

Allow Ramp provisioning for system accounts

System accounts should be able to be (optionally) provisioned in Ramp. This is needed in order to authenticate as a system account during automation. If possible, this account should not have permission to create new cards.
