syncs's Introduction

@mergestat/syncs

This repository provides officially supported syncs for mergestat.

About

MergeStat syncs are programs packaged in containers that run a process or analysis on a Git repository, and typically store the results in postgres for downstream querying and analysis.

They are orchestrated and run in the context of a mergestat instance.

For example, the git-commits sync in syncs/git-commits will retrieve the full commit history of a repo and store information about each commit in postgres. This allows for subsequent querying of the commit history of a repo, across all the repos this sync has run on.

License

MIT License Copyright (c) 2023 AskGit, Inc. Refer to LICENSE for full text.

syncs's People

Contributors

amenowanna, asancar-thoughtworks, caugner, gitstart, patrickdevivo, riyaz-ali, robincombrink, simonflarup

syncs's Issues

Encoding error in MergeStat Explore and Git Files syncs

Example repo with error:
https://github.com/laravel/framework

06/05/2023 10:09:12 DEBUG: running sync 734f93a3-3fdd-4510-ba15-82c9991605e5
06/05/2023 10:09:12 INFO: pulling image docker://ghcr.io/mergestat/sync-mergestat-explore:latest
06/05/2023 10:09:17 INFO: running image docker://ghcr.io/mergestat/sync-mergestat-explore:latest
06/05/2023 10:09:17 INFO: pulling git repository: https://github.com/laravel/framework
06/05/2023 10:09:52 INFO: finished git clone successfully: https://github.com/laravel/framework
06/05/2023 10:09:52 INFO: cloned repository to /tmp/mergestat-repo-fe1910b3-498a-4d2a-82f7-6345bf7e942d-268662736 and mounting it at /mergestat/repo
06/05/2023 10:09:57 INFO: DELETE 0
06/05/2023 10:09:57 DEBUG: ERROR: invalid byte sequence for encoding "UTF8": 0xf8
06/05/2023 10:09:57 DEBUG: CONTEXT: COPY git_commits, line 31390
06/05/2023 10:09:57 ERROR: failed to run image: exit status 1
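
The 0xf8 byte reported above can never occur in well-formed UTF-8. This suggests (as an assumption, since the log does not confirm it) that some text in the repository is stored in a legacy encoding such as Latin-1. One possible mitigation is to decode leniently before rows are handed to COPY. The sketch below is illustrative only: `to_valid_utf8` is a hypothetical helper, not part of any MergeStat sync.

```python
def to_valid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    text = raw.decode("utf-8", errors="replace")
    # PostgreSQL text values cannot contain NUL characters, so drop them too.
    return text.replace("\x00", "")


if __name__ == "__main__":
    # b"\xf8" is the offending byte from the log; it becomes U+FFFD.
    print(to_valid_utf8(b"J\xf8rgen"))
```

Replacing invalid bytes loses the original characters, so a sync that needs exact contents might instead store such values in a `bytea` column or skip them.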

Latest Docker Image ghcr.io/mergestat/sync-github-pull-requests:sha-6defb5b Not Functioning

Hello,

I am facing an issue with the latest version of the Docker image provided by this repository (ghcr.io/mergestat/sync-github-pull-requests:sha-6defb5b). It does not function as expected.

Here are the error logs I encountered:

05/28/2023 19:58:28 DEBUG: running sync 56b2c8bd-78d7-4b8d-a63d-b8e0fe362c59
05/28/2023 19:58:28 INFO: pulling image docker://ghcr.io/mergestat/sync-github-pull-requests:latest
05/28/2023 19:58:38 INFO: running image docker://ghcr.io/mergestat/sync-github-pull-requests:latest
05/28/2023 19:58:38 DEBUG: [Package Error] "[email protected]" could not be built. (Imported by "@octokit/auth-app").
05/28/2023 19:58:38 DEBUG: [1/5] Verifying package is valid…
05/28/2023 19:58:38 DEBUG: [2/5] Installing dependencies from npm…
05/28/2023 19:58:38 DEBUG: [3/5] Building package using esinstall…
05/28/2023 19:58:38 DEBUG: Running esinstall...
05/28/2023 19:58:38 DEBUG: Failed to load node_modules/lru-cache/dist/mjs/index.js
05/28/2023 19:58:38 DEBUG:   Unexpected token (406:4) in lru-cache/dist/mjs/index.js
05/28/2023 19:58:38 DEBUG: Install failed.
05/28/2023 19:58:38 DEBUG: Install failed.
05/28/2023 19:58:38 DEBUG: error: Uncaught Error: [Package Error] "[email protected]" could not be built. (Imported by "@octokit/auth-app").
05/28/2023 19:58:38 DEBUG: throw new Error("[Package Error] \"[email protected]\" could not be built. (Imported by \"@octokit/auth-app\").");
05/28/2023 19:58:38 DEBUG:       ^
05/28/2023 19:58:38 DEBUG:     at https://cdn.skypack.dev/error/build:[email protected]?from=@octokit/auth-app:20:7
05/28/2023 19:58:38 ERROR: failed to run image: exit status 1

As a workaround, I was able to get things working again by reverting to the previous image version, specifically ghcr.io/mergestat/sync-github-pull-requests:sha-304df07.

Thanks.

CSV Quoting error in MergeStat Explore and Git Files syncs

Example repo with error:
https://github.com/Kong/kong

06/06/2023 12:02:09 DEBUG: running sync fc368997-c418-4040-bb02-4c2a956fc448
06/06/2023 12:02:09 INFO: pulling image docker://ghcr.io/mergestat/sync-mergestat-explore:latest
06/06/2023 12:02:09 INFO: running image docker://ghcr.io/mergestat/sync-mergestat-explore:latest
06/06/2023 12:02:09 INFO: pulling git repository: https://github.com/Kong/kong
06/06/2023 12:02:24 INFO: finished git clone successfully: https://github.com/Kong/kong
06/06/2023 12:02:24 INFO: cloned repository to /tmp/mergestat-repo-5dac4593-2237-4eca-a8ce-d7dea802bbdd-1139085957 and mounting it at /mergestat/repo
06/06/2023 12:02:29 INFO: DELETE 0
06/06/2023 12:02:29 INFO: COPY 9440
06/06/2023 12:03:49 INFO: DELETE 0
06/06/2023 12:03:49 INFO: COPY 42456
06/06/2023 12:03:54 INFO: DELETE 0
06/06/2023 12:03:54 DEBUG: ERROR: unterminated CSV quoted field
06/06/2023 12:03:54 DEBUG: CONTEXT: COPY git_files, line 314676: "5dac4593-2237-4eca-a8ce-d7dea802bbdd,spec/fixtures/perf/500services-each-4-routes.sql,0,"--
06/06/2023 12:03:54 DEBUG: -- Postg..."
06/06/2023 12:03:54 ERROR: failed to run image: exit status 1
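
The "unterminated CSV quoted field" error is consistent with file contents (here a SQL dump containing quotes and newlines) reaching COPY without proper CSV escaping. That is an assumption, since the sync's own code is not shown here. As a minimal sketch of correct quoting, Python's `csv` module doubles embedded quotes and wraps fields that contain newlines or commas, which matches the default quoting rules of PostgreSQL's CSV COPY format. The function name and the sample row are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text with embedded quotes and newlines escaped."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical row shaped like (repo_id, path, executable, contents).
    sample = [("repo-id", "dump.sql", 0, '-- Postgres "dump"\nsecond line, with comma')]
    print(rows_to_csv(sample))
```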

Refactor all syncs to use a single db tx

We should ensure that deletes and inserts are done in a single transaction, so that a crash or failure cannot leave partial results in the DB.
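
A minimal sketch of the idea, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server. The table layout and the `replace_repo_rows` helper are assumptions made for illustration, not the syncs' actual schema or code.

```python
import sqlite3


def replace_repo_rows(conn, repo_id, paths):
    """Delete a repo's old rows and insert the new ones atomically."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises, so the DELETE is undone
    # whenever any INSERT fails.
    with conn:
        conn.execute("DELETE FROM git_files WHERE repo_id = ?", (repo_id,))
        conn.executemany(
            "INSERT INTO git_files (repo_id, path) VALUES (?, ?)",
            [(repo_id, path) for path in paths],
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE git_files (repo_id TEXT NOT NULL, path TEXT NOT NULL)")
    replace_repo_rows(conn, "r1", ["README.md"])
    try:
        # None violates NOT NULL, so the whole replacement is rolled back.
        replace_repo_rows(conn, "r1", ["new.txt", None])
    except sqlite3.IntegrityError:
        pass
    print(conn.execute("SELECT path FROM git_files").fetchall())
```

For syncs driven by `psql`, a comparable effect could come from wrapping the DELETE and COPY statements in one `BEGIN` ... `COMMIT` block.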

`jq` error when running `github-pull-requests` sync

See here as well: https://gke-non-autopilot-testing.console.mergestat.com/repos/a03863ed-02fa-4a1b-b5dc-7554d55b6af6/container-syncs/aa7a55f6-b4b5-4920-bd95-da69474d8324/e183e3f5-cc45-4271-9f28-54323ab0d750

03/30/2023 12:21:43 INFO: pulling git repository: https://github.com/mergestat/mergestat
03/30/2023 12:21:43 DEBUG: running sync aa7a55f6-b4b5-4920-bd95-da69474d8324
03/30/2023 12:22:38 INFO: finished git clone successfully: https://github.com/mergestat/mergestat
03/30/2023 12:22:38 INFO: pulling image docker://mergestat/sync-github-pull-requests:0.0.1
03/30/2023 12:22:43 WARN: Copying blob sha256:f56be85fc22e46face30e2c3de3f7fe7c15f8fd7c4e5add29d7f64b87abdaa09
03/30/2023 12:22:43 WARN: Copying blob sha256:63d937bb8e9740abe216b4b577c219f7feaa521a64ec5dd364ad564cceeda217
03/30/2023 12:22:43 WARN: Copying blob sha256:bbc643ebf4223d61363b720996dfcaa39a761f09069de35dad5606cd6bdd69ec
03/30/2023 12:22:43 WARN: Copying blob sha256:adf65fff9f2126de69d6e3f1326dbfb7533d880c93cd6f512dd704b53f98172c
03/30/2023 12:22:43 WARN: Copying blob sha256:adf65fff9f2126de69d6e3f1326dbfb7533d880c93cd6f512dd704b53f98172c
03/30/2023 12:22:43 WARN: Copying blob sha256:bbc643ebf4223d61363b720996dfcaa39a761f09069de35dad5606cd6bdd69ec
03/30/2023 12:22:43 WARN: Trying to pull docker.io/mergestat/sync-github-pull-requests:0.0.1...
03/30/2023 12:22:43 WARN: Getting image source signatures
03/30/2023 12:22:43 WARN: Copying blob sha256:63d937bb8e9740abe216b4b577c219f7feaa521a64ec5dd364ad564cceeda217
03/30/2023 12:23:23 WARN: Copying config sha256:57e1f7bd6d5a4bea269439dc2481a51a26240eeb09d8836333a6d2c16295885f
03/30/2023 12:23:23 WARN: Writing manifest to image destination
03/30/2023 12:23:23 WARN: Storing signatures
03/30/2023 12:23:38 INFO: running image docker://mergestat/sync-github-pull-requests:0.0.1
03/30/2023 12:23:38 INFO: 57e1f7bd6d5a4bea269439dc2481a51a26240eeb09d8836333a6d2c16295885f
03/30/2023 12:23:44 WARN: psql:/syncer/schema.sql:47: NOTICE:  relation "github_pull_requests" already exists, skipping
03/30/2023 12:23:44 WARN: jq: 1 compile error
03/30/2023 12:23:44 WARN: [env.MERGESTAT_REPO_ID, .additions, .author.login, .authorAssociation, .author.avatarUrl, .author.name, .baseRefOid, .baseRefName, .baseRepository.name, .body, .changedFiles, .closed, .closedAt .comments.totalCount, .commits.totalCount, .createdAt, .createdViaEmail, .databaseId, deletions, .editor.login, .headRefname, .headRefOid, .headRepository.name, .isDraft, .labels.totalCount, .lastEditedAt, .locked, .maintainerCanModify, .mergeable, .merged, .mergedAt, .mergedBy.login, .number, .participants.totalCount, .publishedAt, .reviewDecision, .state, .title, .updatedAt, .url, .all_labels]
03/30/2023 12:23:44 WARN: psql:/syncer/schema.sql:92: NOTICE:  relation "github_pull_requests_pkey" already exists, skipping
03/30/2023 12:23:44 INFO: COPY 0
03/30/2023 12:23:44 INFO: DELETE 0
03/30/2023 12:23:44 WARN: psql:/syncer/schema.sql:93: NOTICE:  relation "idx_github_pull_requests_repo_id_fkey" already exists, skipping
03/30/2023 12:23:44 WARN: jq: error: deletions/0 is not defined at <top-level>, line 1:
03/30/2023 12:23:54 ERROR: failed to run image: exit status 3
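
The message `deletions/0 is not defined` arises because jq parses a bare name as a call to a zero-argument function. The field list in the log contains `deletions` where `.deletions` was presumably intended. The same list also appears to lack a comma between `.closedAt` and `.comments.totalCount`. The following is a rough heuristic lint, not part of the sync: `bare_identifiers` is a hypothetical helper, and it would also flag legitimate zero-argument jq functions such as `now`.

```python
import re


def bare_identifiers(jq_array: str):
    """Return entries of a jq array literal that are plain names without a leading '.'."""
    inner = jq_array.strip().strip("[]")
    entries = [entry.strip() for entry in inner.split(",")]
    # A plain identifier (no dots at all) is parsed by jq as a function call.
    return [entry for entry in entries if re.fullmatch(r"[A-Za-z_]\w*", entry)]


if __name__ == "__main__":
    excerpt = "[env.MERGESTAT_REPO_ID, .additions, .databaseId, deletions, .editor.login]"
    print(bare_identifiers(excerpt))
```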

Getting "ERROR: failed to pull image: exit status 1" for all syncs

For example:

02/09/2024 13:50:05 INFO: pulling image docker://ghcr.io/mergestat/sync-mergestat-explore:latest
02/09/2024 13:50:05 ERROR: failed to pull image: exit status 1

I pulled ghcr.io/mergestat/sync-mergestat-explore:latest to the host, and the sync then appeared to work, but after some time it stopped working.

Error in Git Files Sync

Getting this error when running the Git Files container sync for https://github.com/mergestat/libgit2

mergestat-postgres-1  | 2023-04-21 21:02:06.819 UTC [2770] ERROR:  invalid byte sequence for encoding "UTF8": 0xff
mergestat-postgres-1  | 2023-04-21 21:02:06.819 UTC [2770] CONTEXT:  COPY git_files, line 83671
mergestat-postgres-1  | 2023-04-21 21:02:06.819 UTC [2770] STATEMENT:  COPY  public.git_files ( repo_id, path, executable, contents ) FROM STDIN (FORMAT csv)

Add GitHub Network Dependents information

For some repositories that publish packages written in certain languages, GitHub has a Dependents feature that lets you see all the repositories that depend on that package. See for example: https://github.com/chakra-ui/chakra-ui/network/dependents

It would be great to add this information to the MergeStat SQL system. It is not currently available via the GraphQL API (although there is a preview API version for getting dependents), so people have built scrapers to get the data.

This would be a big boost for work focused on the security or community engagement of a GitHub repository or software project!

Implement `GITHUB_ORG_AUDIT` sync

Implement a new sync type for ingesting GitHub org audit logs. This actually may not be possible with our current "per repo" only sync setup, so it may need to wait.

sync-github-pull-requests:0.0.1 jq Error

This sync is failing in our test cluster with the following error:

04/04/2023 10:02:21 INFO: pulling image docker://mergestat/sync-github-pull-requests:0.0.1
04/04/2023 10:02:21 WARN: Trying to pull docker.io/mergestat/sync-github-pull-requests:0.0.1...
04/04/2023 10:02:21 WARN: Getting image source signatures
04/04/2023 10:02:21 WARN: Copying blob sha256:adf65fff9f2126de69d6e3f1326dbfb7533d880c93cd6f512dd704b53f98172c
04/04/2023 10:02:21 WARN: Copying blob sha256:f56be85fc22e46face30e2c3de3f7fe7c15f8fd7c4e5add29d7f64b87abdaa09
04/04/2023 10:02:21 WARN: Copying blob sha256:bbc643ebf4223d61363b720996dfcaa39a761f09069de35dad5606cd6bdd69ec
04/04/2023 10:02:21 WARN: Copying blob sha256:63d937bb8e9740abe216b4b577c219f7feaa521a64ec5dd364ad564cceeda217
04/04/2023 10:02:21 WARN: Copying config sha256:57e1f7bd6d5a4bea269439dc2481a51a26240eeb09d8836333a6d2c16295885f
04/04/2023 10:02:21 WARN: Writing manifest to image destination
04/04/2023 10:02:21 WARN: Storing signatures
04/04/2023 10:02:21 INFO: 57e1f7bd6d5a4bea269439dc2481a51a26240eeb09d8836333a6d2c16295885f
04/04/2023 10:02:21 INFO: pulling git repository: https://github.com/mergestat/mergestat
04/04/2023 10:03:12 INFO: finished git clone successfully: https://github.com/mergestat/mergestat
04/04/2023 10:03:12 INFO: running image docker://mergestat/sync-github-pull-requests:0.0.1
04/04/2023 10:03:12 WARN: psql:/syncer/schema.sql:47: NOTICE: relation "github_pull_requests" already exists, skipping
04/04/2023 10:03:12 WARN: psql:/syncer/schema.sql:92: NOTICE: relation "github_pull_requests_pkey" already exists, skipping
04/04/2023 10:03:12 WARN: psql:/syncer/schema.sql:93: NOTICE: relation "idx_github_pull_requests_repo_id_fkey" already exists, skipping
04/04/2023 10:03:12 WARN: jq: error: deletions/0 is not defined at <top-level>, line 1:
04/04/2023 10:03:12 WARN: [env.MERGESTAT_REPO_ID, .additions, .author.login, .authorAssociation, .author.avatarUrl, .author.name, .baseRefOid, .baseRefName, .baseRepository.name, .body, .changedFiles, .closed, .closedAt .comments.totalCount, .commits.totalCount, .createdAt, .createdViaEmail, .databaseId, deletions, .editor.login, .headRefname, .headRefOid, .headRepository.name, .isDraft, .labels.totalCount, .lastEditedAt, .locked, .maintainerCanModify, .mergeable, .merged, .mergedAt, .mergedBy.login, .number, .participants.totalCount, .publishedAt, .reviewDecision, .state, .title, .updatedAt, .url, .all_labels]
04/04/2023 10:03:12 WARN: jq: 1 compile error
04/04/2023 10:03:12 INFO: DELETE 0
04/04/2023 10:03:12 INFO: COPY 0
04/04/2023 10:03:17 ERROR: failed to run image: exit status 3

sync-git-blame:0.0.1 marked as failed when data was copied

Found an instance of this sync where the logs look clean but the job was marked as failed. Could this have been reaped?

04/04/2023 08:17:21 DEBUG: running sync 5a66e3b3-a897-40ec-9c25-01222d0fb419
04/04/2023 08:17:21 INFO: pulling image docker://mergestat/sync-git-blame:0.0.1
04/04/2023 08:17:21 WARN: Trying to pull docker.io/mergestat/sync-git-blame:0.0.1...
04/04/2023 08:17:26 WARN: Getting image source signatures
04/04/2023 08:17:26 WARN: Copying blob sha256:757892be48e61b801e42ab44ad3f53cff785269b4c7cc57323bc80123c0eb7c3
04/04/2023 08:17:26 WARN: Copying blob sha256:9c44c1393ea4452f64aa4c1604986e1e13537d6d9a5684ce9e7ab81b8041f00c
04/04/2023 08:17:26 WARN: Copying blob sha256:5e6793059f0af87e4cd5bfef878e2de016693146ff98a5b038a4736d18eee2cb
04/04/2023 08:17:26 WARN: Copying blob sha256:a24e881083d5f9b4464896480f7ea3f4f780a2a39996af8bb884deb4a6ea76ff
04/04/2023 08:17:26 WARN: Copying blob sha256:b1b93cbefb703b6d9433cc8824fd3640c159383340ca9fdd05987d50da053537
04/04/2023 08:17:26 WARN: Copying blob sha256:f56be85fc22e46face30e2c3de3f7fe7c15f8fd7c4e5add29d7f64b87abdaa09
04/04/2023 08:17:26 WARN: Copying blob sha256:402b235f44678d1e3d1225b264ba5dbcbc453072383036fee5a8521736d4d460
04/04/2023 08:17:26 WARN: Copying blob sha256:ea5757f4b3f88ed50d687e1bd40a5e2f81e4a5c3fced11b753c589d2c6381fd5
04/04/2023 08:17:26 WARN: Copying config sha256:6d9a45c6327c4730d7ae9074674bcb538ce5ef1b6e956e1ce39fd7ba78b9bc65
04/04/2023 08:17:26 WARN: Writing manifest to image destination
04/04/2023 08:17:26 WARN: Storing signatures
04/04/2023 08:17:26 INFO: 6d9a45c6327c4730d7ae9074674bcb538ce5ef1b6e956e1ce39fd7ba78b9bc65
04/04/2023 08:17:26 INFO: pulling git repository: https://github.com/mergestat/mergestat
04/04/2023 08:18:17 INFO: finished git clone successfully: https://github.com/mergestat/mergestat
04/04/2023 08:18:17 INFO: running image docker://mergestat/sync-git-blame:0.0.1
04/04/2023 08:18:17 WARN: psql:/syncer/schema.sql:14: NOTICE: relation "git_blame" already exists, skipping
04/04/2023 08:18:17 WARN: psql:/syncer/schema.sql:25: NOTICE: relation "git_blame_pkey" already exists, skipping
04/04/2023 08:18:17 WARN: psql:/syncer/schema.sql:26: NOTICE: relation "idx_git_blame_repo_id_fkey" already exists, skipping
04/04/2023 08:18:23 INFO: DELETE 339056
04/04/2023 08:18:53 WARN: skipping binary file "docs/docker-general-settings.png"
04/04/2023 08:18:58 WARN: skipping binary file "docs/docker-resources-settings.png"
04/04/2023 08:18:58 WARN: skipping binary file "docs/github-pat-local.png"
04/04/2023 08:18:58 WARN: skipping binary file "docs/logo.png"
04/04/2023 08:18:58 WARN: skipping binary file "docs/queries.gif"
04/04/2023 08:18:58 WARN: skipping binary file "examples/git/code-todos/grafana/screenshots/todos-dashboard.png"
04/04/2023 08:19:08 WARN: skipping binary file "examples/git/commits/grafana/screenshots/commits.png"
04/04/2023 08:19:23 WARN: skipping binary file "examples/git/dependencies/go/grafana/screenshots/dependencies-go.png"
04/04/2023 08:19:29 WARN: skipping binary file "examples/git/dependencies/react/grafana/screenshots/dependencies-react.png"
04/04/2023 08:19:44 WARN: skipping binary file "examples/git/vulnerabilties/trivy/grafana/screenshots/trivy.png"
04/04/2023 08:19:54 WARN: skipping binary file "examples/github/actions/grafana/screenshots/workflows.png"
04/04/2023 08:19:59 WARN: file "examples/github/actions/sql/all" not found; skipping
04/04/2023 08:20:04 WARN: skipping binary file "examples/github/issues/grafana/screenshots/issues.png"
04/04/2023 08:20:19 WARN: skipping binary file "examples/github/pull-requests/grafana/screenshots/pull-requests.png"
04/04/2023 08:20:29 WARN: skipping binary file "examples/github/pull-requests/sql/screenshots/time-to-merge.png"
04/04/2023 08:20:34 WARN: skipping binary file "examples/github/stargazers/grafana/screenshots/stargazers.png"
04/04/2023 08:20:39 WARN: skipping binary file "examples/github/tags/grafana/screenshots/tags.png"
04/04/2023 08:20:49 WARN: skipping binary file "examples/management/grafana/screenshots/mergestat-managment.png"
04/04/2023 08:21:25 WARN: skipping binary file "examples/templates/grafana/screenshots/mergestat-examples.png"
04/04/2023 08:32:44 WARN: skipping binary file "ui/public/assets/illustration-repos.png"
04/04/2023 08:32:44 WARN: skipping binary file "ui/public/favicon.ico"
04/04/2023 08:34:36 INFO: COPY 339056

More stringent success criteria of syncs

Ensure that we only mark a sync as successful when we know it completed successfully. We should be more careful with the "final" status code, so that a finished sync is recorded as a success only when it actually succeeded.
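
One way to read this is that success should be a positive finding rather than the default. The sketch below, assuming the sync runs as a child process, reports success only for an exit status of 0 and treats timeouts and launch failures as failures. `run_sync` is a hypothetical helper, not the MergeStat runner.

```python
import subprocess
import sys


def run_sync(cmd, timeout=None):
    """Run a sync command and return (succeeded, detail)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except OSError as exc:
        return False, f"could not start: {exc}"
    if proc.returncode != 0:
        return False, f"exit status {proc.returncode}"
    # Only an explicit zero exit status counts as success.
    return True, "exit status 0"


if __name__ == "__main__":
    print(run_sync([sys.executable, "-c", "raise SystemExit(3)"]))
    print(run_sync([sys.executable, "-c", "pass"]))
```

A stricter variant might also verify that the expected COPY step was reached before recording success.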
