semgrep / semgrep-action

This project is deprecated. Use https://github.com/returntocorp/semgrep instead

Home Page: https://semgrep.dev/docs/semgrep-ci/

Python 96.07% Dockerfile 2.65% Makefile 0.52% Shell 0.77%

semgrep github-actions static-analysis sast ci ci-cd

semgrep-action's Issues

Agent takes longer than running semgrep directly

It looks like semgrep-agent passes files to semgrep using --include, which causes semgrep to traverse the directory tree to check whether any files match the include pattern.

The fix is to pass the files directly to semgrep as target arguments.
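A minimal sketch of the proposed fix, assuming a hypothetical helper named `build_semgrep_command` (this is not the agent's actual code, and the real invocation carries more flags). Each file is appended as a positional target, so semgrep has no directory tree to walk:

```python
from typing import List, Sequence


def build_semgrep_command(config: str, targets: Sequence[str]) -> List[str]:
    """Build an argv list that passes every file as a positional target.

    Hypothetical helper for illustration only.
    """
    if not targets:
        raise ValueError("no target files to scan")
    # No include pattern here: the files were already selected upstream.
    return ["semgrep", "--json", "--config", config, *targets]
```

Returning a list rather than a shell string avoids quoting problems with unusual file names.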

Baseline-ref argument ignored in CI environment

The --baseline-ref is only used in GitMeta, and not in GithubMeta or GitlabMeta. This means that semgrep-action will claim that all files are new, even when the --baseline-ref argument is given.

I expect --baseline-ref to override environment variables.

Running in GitLab CI:

$ python -m semgrep_agent --config /tmp/semgrep.yml --baseline-ref reviewed
=== detecting environment
| versions          - semgrep 0.32.0 on Python 3.7.9
| environment       - running in environment gitlab-ci, triggering event is 'push'
| manage            - not logged in
=== setting up agent configuration
| using semgrep rules from /tmp/semgrep.yml
| using default path ignore rules of common test and dependency directories
| found 911 files in the paths to be scanned
| skipping 43 files based on path ignore rules
=== looking for current issues in 868 files
| 20 current issues found
| No ignored issues found
| 20 current issues found
| No ignored issues found
=== not looking at pre-existing issues since all files with current issues are newly created
...
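The expected precedence could be sketched roughly as below. `resolve_baseline_ref` is a hypothetical name, and reading the baseline from these two provider variables is an assumption of the sketch, not a description of how GithubMeta or GitlabMeta actually work:

```python
from typing import Mapping, Optional

# Predefined variables set by GitHub Actions and GitLab CI respectively.
# That the agent derives its baseline from them is an assumption here.
_ENV_BASELINE_VARS = ("GITHUB_BASE_REF", "CI_MERGE_REQUEST_TARGET_BRANCH_NAME")


def resolve_baseline_ref(
    cli_value: Optional[str], env: Mapping[str, str]
) -> Optional[str]:
    """Return the baseline ref, letting an explicit --baseline-ref win."""
    if cli_value:
        return cli_value
    for name in _ENV_BASELINE_VARS:
        value = env.get(name)
        if value:
            return value
    return None
```

With this ordering, `--baseline-ref reviewed` would be honoured in GitLab CI even when the environment suggests a different baseline.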

Should .semgrep folders be scanned by default?

I tested the action's option for passing rules via the .semgrep folder and it works great. But if you copy rules directly from semgrep-rules along with their accompanying tests, semgrep-action will also scan the .semgrep folder, since the default .semgrepignore doesn't have this exception. Should this be the default behaviour?

Anyway, if possible, I vote that the .semgrep folder be listed in .semgrepignore by default for semgrep-action. To help out, I even prepared a pull request: https://github.com/returntocorp/semgrep-action/pull/71

Thank you!

Strip tmp. from internal representation of findings in semgrep_agent

One finding is detected multiple times in the value of `--json`

I am running the semgrep_agent in a gitlab runner with the new --json option. Thanks a lot for the option. It's detecting an eval usage three times.

The sample file is modified from OWASP Juice Shop and contains an eval, which is detected. See the first comment after this for the file (I did not put it here, to keep the issue readable).

Expose more information about ignored paths

We could list just the paths that actually have hits. Maybe hide this behind a verbose flag though, not sure how noisy it'd get.

Remove privacy-sensitive findings fields from backend post

We should omit fields that are both:

sensitive
not necessary for backend operation

Specifically, the syntactic_context field falls in this category.
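One possible shape for this, assuming findings are plain dicts by the time they are posted (the agent's real Finding type may differ) and using a hypothetical helper name:

```python
from typing import Any, Dict, Iterable, List

# Field named in this issue; extend the set as more fields are identified.
SENSITIVE_FIELDS = frozenset({"syntactic_context"})


def redact_findings(findings: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the findings without the sensitive fields."""
    return [
        {key: value for key, value in finding.items() if key not in SENSITIVE_FIELDS}
        for finding in findings
    ]
```

The originals are left untouched, so local output can still use the full data while only the redacted copies go to the backend.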

clarify "not logged in" error message

Original "bug" report below, but check out the discussion for more relevant details to this ticket

Sorry for the obscure title, I can't be much more descriptive because I'm not sure what's going on.

These were run within an hour of each other:
https://github.com/returntocorp/dry-runs/pull/7/checks
https://github.com/returntocorp/dry-runs/pull/8/checks

For some reason, it appears that when someone other than me tries to create a PR, semgrep errors out in CI with:

=== detecting environment
| versions    - semgrep 0.30.0 on Python 3.7.9
| environment - running in github-actions, triggering event is 'pull_request'
| semgrep.dev - not logged in
=== setting up agent configuration
Error: OR] you didn't configure what rules semgrep should scan for.

(the first few lines from running on my PR:)

=== detecting environment
| versions    - semgrep 0.30.0 on Python 3.7.9
| environment - running in github-actions, triggering event is 'pull_request'
| semgrep.dev - logged in as deployment #1
=== setting up agent configuration
| using semgrep rules configured on the web UI

The rules to scan for are just from the default policy, so they are not configured in semgrep.yml.

Dependencies of semgrep-agent

Maybe we should add the attrs package to https://github.com/returntocorp/semgrep-action/blob/develop/pyproject.toml#L11 since semgrep-agent uses it directly?

Display policy name when running from managed policy

For improved user serviceability

Expose a verbose flag for users

The hidden SEMGREP_AGENT_DEBUG env variable is not exposed to users; we should make this a convenient public flag instead.

@msorens recommended making it toggleable on the semgrep app's web UI.

Add to logs

semgrep should get the --verbose flag too and we should pipe its output through

Update Dockerfile to work with 0.36

Seems like we'll need more changes than usual as according to semgrep/semgrep#2054 (comment) the PRECOMPILED_LOCATION var we use no longer exists.

latest semgrep-agent never completes its run

Sometime between Thursday and Saturday, semgrep-agent started exceeding the 20-minute timeout in my CI system (Buildkite). Prior to this, it used to take anywhere from 30 seconds to 2 minutes.

I am using the returntocorp/semgrep-agent:v1 docker image and the v1 tag was updated with a new docker image yesterday just prior to my first failing build:
https://hub.docker.com/layers/returntocorp/semgrep-agent/v1/images/sha256-93d7382e52[…]0d0aaf8c42f48ff9decc28950974cf038d7a9a201d405?context=explore

Here is what I know:

All runs were 2 minutes or less until this past Thursday; now all runs are timing out.
All runs used semgrep 0.35.0 on Python 3.7.9. (That is, semgrep-agent on Thursday reported this version and the newer semgrep-agent today reports the same version, leading me to believe the issue is in semgrep-agent code rather than semgrep code.)
All runs used differential mode (--baseline-ref).
All runs were scanning a very small number of files, anywhere from zero to a couple dozen files being scanned. (The zero is due to how I have done some configuration; it is an item on my TODO list but not relevant to this issue.)
All runs used your standard docker image (returntocorp/semgrep-agent:v1) though of course the reference of that tag changes frequently.
I had only recently expanded the number of rulesets in my policy. I just tried reducing the policy size by going back to just the single r2c-ci ruleset (reducing rule count from 508 to 119) but no difference--semgrep-agent still timing out.

This same behavior occurs both in Buildkite and on the command-line when I run it locally. Also, my repository is open-source, so you can use my actual data to observe the problem.

The repo is here: https://github.com/chef/automate

And here are the bits from my Makefile:

SEMGREP_CONTAINER := returntocorp/semgrep-action:v1
SEMGREP_COMMON_PARAMS := -m semgrep_agent --publish-token ${SEMGREP_TOKEN} --publish-deployment ${SEMGREP_ID}
SEMGREP_REPO := --env SEMGREP_REPO_NAME=chef/automate
DOCKER_PARAMS := --volume $(realpath .):/automate --workdir /automate

semgrep: ## runs differential semgrep, checking only changes in the current PR, just as is done in CI
	docker run -it --rm --init $(DOCKER_PARAMS) $(SEMGREP_REPO) $(SEMGREP_CONTAINER) python $(SEMGREP_COMMON_PARAMS) --baseline-ref master

Implementation details:

When run on a git branch, there will be a non-zero number of files. When run on master, though, there will be zero files. But either way, the same problem occurs--semgrep-agent gets stuck.
In Buildkite, semgrep-agent times out at about 20 minutes; on the command-line, I have been letting it run while I wrote this up, so it will not time-out. Thus far, it has run for 40 minutes and is still running. This is on a branch with 32 (rather ordinary) files.

Priority: High
I have had to completely disable semgrep-agent until this can be resolved.

Improve “sapp is configured so we should’ve gotten a scan_id” error message

Semgrep action indicates how many rules ran on a scan

"no valid configuration file found (0 configs were invalid)"

Greetings! Testing out the platform, and enjoying things so far. Got this error in my Github Actions pipeline, and followed your request to post it for analysis. Maybe related to #112?

Run returntocorp/semgrep-action@v1
  with:
    publishToken: ***
    publishDeployment: 203
  env:
    GITHUB_TOKEN: ***
/usr/bin/docker run --name returntocorpsemgrepactionv1_351056 --label 179394 --workdir /github/workspace --rm -e GITHUB_TOKEN -e INPUT_PUBLISHTOKEN -e INPUT_PUBLISHDEPLOYMENT -e INPUT_CONFIG -e INPUT_GENERATESARIF -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e RUNNER_OS -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/semgrep-test-repo/semgrep-test-repo":"/github/workspace" returntocorp/semgrep-action:v1
=== detecting environment
| versions          - semgrep 0.32.0 on Python 3.7.9
| environment       - running in environment github-actions, triggering event is 'pull_request'
| manage            - logged in as deployment #203
=== setting up agent configuration
| policy            - using Getting Started
| using semgrep rules configured on the web UI
| using default path ignore rules of common test and dependency directories
| looking at 4 changed paths
| found 4 files in the paths to be scanned
=== looking for current issues in 4 files

=== failed command's STDOUT:

{"results": [], "errors": [{"type": "SemgrepError", "code": 7, "message": "no valid configuration file found (0 configs were invalid)"}]}


=== failed command's STDERR:

A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.


Error: ROR] `/root/.local/bin/semgrep --skip-unknown-extensions --disable-nosem --json --no-rewrite-rule-ids --config /tmp/tmp3tby2xfe.yml more_fail.py other_feature.py .github/workflows/semgrep.yml should_fail.py` failed with exit code 7

This is an internal error, please file an issue at https://github.com/returntocorp/semgrep-action/issues/new/choose
and include any log output from above.

Add support for adding a number of yml rules from the .semgrep folder

The current semgrep-action can rely on either

The online rule registry
a .semgrep.yml file with a collection of rules, which can be unwieldy to manage for a large number of rules.

The "normal" semgrep supports a .semgrep folder that can contain a number of rules in the .semgrep/**/*.yml form, which makes it easy to maintain a project's semgrep rules in a single folder.

So, if possible, I would suggest a feature where semgrep-action supports the same method of passing rules. This would make it easy to manage larger rulesets for CI (where a project can contain its own rule folder) when an external registry can't be used because of regulatory requirements.

Thanks!

Ignore directories with .semgrepignore records even without the trailing slash

Adding tests to your .semgrepignore will ignore only files named tests. To ignore module/tests/test.py, you need to add tests/ to the .semgrepignore instead.

This is unexpected, and not consistent with .gitignore, where writing just tests will ignore module/tests/test.py already.
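A rough sketch of the requested behaviour, with a hypothetical `is_ignored` helper. It is deliberately simpler than real .gitignore semantics (no negation, no anchoring, and `tests/` is not restricted to directories); it only shows the two spellings being treated alike:

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Iterable


def is_ignored(path: str, patterns: Iterable[str]) -> bool:
    """Return True if any component of the path matches a pattern.

    A trailing slash is stripped, so `tests` and `tests/` behave alike.
    """
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        name = pattern.rstrip("/")
        if name and any(fnmatch.fnmatch(part, name) for part in parts):
            return True
    return False
```

Matching against every path component is what lets a bare `tests` entry cover `module/tests/test.py`.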

Make semgrep-agent fail CI only on blocking issues

semgrep.dev will soon support different actions per rule, so

Assume that Finding instances can have a "dev.semgrep.actions" key in their finding.metadata dict in the Results object that semgrep invocations return here: https://github.com/returntocorp/semgrep-action/blob/1e1c2f06307dcdda29c6f8889c2ffb5abb88eb35/src/semgrep_agent/main.py#L120

We should make the agent follow the actions recommended by semgrep-app.

Specifically, we should exit with

5 notify-only findings hidden in output
<exit code 0>

and

| [... actual blocking errors here ...]
| [... actual blocking errors here ...]
| [... actual blocking errors here ...]
1 notify-only finding hidden in output
<exit code 1>

depending on the value of metadata: dev.semgrep.actions: ["notify", "block"]

Downscoping

We can implement this before the backend supports it. Let's just default findings to have ["block"] as their actions when not otherwise specified.
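The downscoped version could look roughly like this. Findings are modelled as plain dicts with a `metadata` key, which is an assumption for illustration, and both function names are hypothetical:

```python
from typing import Any, Dict, Iterable, List, Tuple

ACTIONS_KEY = "dev.semgrep.actions"
DEFAULT_ACTIONS = ["block"]  # used when the backend sends no actions

Finding = Dict[str, Any]


def split_findings(findings: Iterable[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Partition findings into (blocking, notify_only)."""
    blocking: List[Finding] = []
    notify_only: List[Finding] = []
    for finding in findings:
        actions = finding.get("metadata", {}).get(ACTIONS_KEY, DEFAULT_ACTIONS)
        (blocking if "block" in actions else notify_only).append(finding)
    return blocking, notify_only


def exit_code(findings: Iterable[Finding]) -> int:
    """Exit 1 only when at least one finding is blocking."""
    blocking, _ = split_findings(findings)
    return 1 if blocking else 0
```

The length of the `notify_only` list gives the count for the "N notify-only findings hidden in output" line.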

Disambiguate action/agent release names

Confusing things right now:

Why is there an 'action' and a separate 'agent'?
How do you find the agent's code? How come it's in the action repo?
Why is the stable docker image tagged :v1 instead of :stable?
What is agent even supposed to mean anyway? Why not semgrep-ci?
What is the release cadence of the image?
How do you pin a semgrep-agent version?
How do you get the agent outside Docker? Why is there an old PyPI package?
How can one go from running a CI job to running the same job locally?
How is this software connected to the semgrep.dev website?

@dlukeomalley am I missing anything else?

Add the ability to configure Semgrep core via the agent

Currently, the agent can only accept the config file and prints out the results to stdout (absent other switches). There's no way to get a json output to process the results using the command line.

It would be nice to be able to configure the Semgrep core that is running inside the agent. Mainly, the ability to get the JSON output. I think the easiest way to do so is by accepting Semgrep code switches (e.g., --json). This will make the transition between running Semgrep and the agent seamless. The agent is the suggested way of running Semgrep in a CI pipeline so having access to the json output in the container is great.

It seems like this is the location where the context (that contains the config) is passed to Semgrep:

https://github.com/returntocorp/semgrep-action/blob/develop/src/semgrep_agent/main.py#L121

Diff-aware doesn't report new findings if the new finding has the same (rule_id, path, line_of_code) as the one before

Assume you have this diff:

  eval(foo)
  # some code
+ eval(foo)

The addition of a new eval security issue should be warned about on a PR. Right now we use a set to figure out what findings are new, which would make the agent unaware about the new issue. If we used a Counter, we could warn that there are more instances of the same thing now.
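To illustrate the difference, here is a small sketch that treats each finding as a `(rule_id, path, line_of_code)` tuple (a simplification of the agent's Finding objects; `new_findings` is a hypothetical name):

```python
from collections import Counter
from typing import Iterable, List, Tuple

FindingKey = Tuple[str, str, str]  # (rule_id, path, line_of_code)


def new_findings(
    baseline: Iterable[FindingKey], current: Iterable[FindingKey]
) -> List[FindingKey]:
    """Return findings that occur more often now than in the baseline."""
    # Counter subtraction keeps only positive counts, so one extra
    # occurrence of an identical finding survives as one element.
    extra = Counter(current) - Counter(baseline)
    return list(extra.elements())
```

For the diff above, the baseline holds one `eval(foo)` finding and the current scan holds two, so the Counter difference reports one new finding where a set difference would report none.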

Investigate directly uploading SARIF results to github api

Instead of the extra workflow step that we currently use, we could upload directly via https://docs.github.com/en/free-pro-team@latest/rest/reference/code-scanning#upload-a-sarif-file. Users would then only have to pass their GitHub token to get the security tab working, and they might already have passed that to get Slack notifications working anyway.
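A sketch of building the request body only, with no network call. The field names (`commit_sha`, `ref`, `sarif`) and the gzip-then-Base64 encoding reflect my reading of the linked endpoint's documentation and should be verified against it before use; `build_sarif_upload` is a hypothetical helper:

```python
import base64
import gzip
import json
from typing import Any, Dict


def build_sarif_upload(sarif: Dict[str, Any], commit_sha: str, ref: str) -> Dict[str, str]:
    """Build a JSON-serialisable body for a direct SARIF upload.

    Assumption: the endpoint expects the SARIF document gzip-compressed
    and then Base64-encoded in a `sarif` string field.
    """
    raw = json.dumps(sarif).encode("utf-8")
    encoded = base64.b64encode(gzip.compress(raw)).decode("ascii")
    return {"commit_sha": commit_sha, "ref": ref, "sarif": encoded}
```

The resulting dict would be sent as the JSON body of an authenticated POST using the user's GitHub token.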

Display CI results with short static rule ids

E.g. invoke semgrep from action with --no-rewrite-rule-ids

- [ ] Results in CI should be named with something short, which probably is just the rule ID (without the registry leader)
- [ ] Labels of findings in CI should not change with location of rule in a pack or directory

Suggested cheap solution:
Action should call --no-rewrite-rule-ids on the semgrep binary

semgrep failed

Run ./.github/actions/semgrep-action
/usr/bin/docker run --name returntocorpsemgrepactionv1_7098e9 --label 1e5c35 --workdir /github/workspace --rm -e INPUT_CONFIG -e INPUT_PUBLISHTOKEN -e INPUT_PUBLISHDEPLOYMENT -e INPUT_GENERATESARIF -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_PATH -e GITHUB_ENV -e RUNNER_OS -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/workspace" returntocorp/semgrep-action:v1
Unable to find image 'returntocorp/semgrep-action:v1' locally
v1: Pulling from returntocorp/semgrep-action
df20fa9351a1: Pulling fs layer
36b3adc4ff6f: Pulling fs layer
4db9de03f499: Pulling fs layer
cd38a04a61f4: Pulling fs layer
9a3838385f13: Pulling fs layer
09359e37df4b: Pulling fs layer
2593afa0e612: Pulling fs layer
cff1f9ba2a6e: Pulling fs layer
27800508e272: Pulling fs layer
84c0aae16fc3: Pulling fs layer
fdda4f84e7a3: Pulling fs layer
cd38a04a61f4: Waiting
9a3838385f13: Waiting
09359e37df4b: Waiting
2593afa0e612: Waiting
cff1f9ba2a6e: Waiting
27800508e272: Waiting
84c0aae16fc3: Waiting
fdda4f84e7a3: Waiting
36b3adc4ff6f: Verifying Checksum
36b3adc4ff6f: Download complete
df20fa9351a1: Verifying Checksum
df20fa9351a1: Download complete
4db9de03f499: Verifying Checksum
4db9de03f499: Download complete
cd38a04a61f4: Verifying Checksum
cd38a04a61f4: Download complete
09359e37df4b: Verifying Checksum
09359e37df4b: Download complete
2593afa0e612: Verifying Checksum
2593afa0e612: Download complete
9a3838385f13: Verifying Checksum
9a3838385f13: Download complete
cff1f9ba2a6e: Verifying Checksum
cff1f9ba2a6e: Download complete
fdda4f84e7a3: Verifying Checksum
fdda4f84e7a3: Download complete
df20fa9351a1: Pull complete
27800508e272: Verifying Checksum
27800508e272: Download complete
84c0aae16fc3: Verifying Checksum
84c0aae16fc3: Download complete
36b3adc4ff6f: Pull complete
4db9de03f499: Pull complete
cd38a04a61f4: Pull complete
9a3838385f13: Pull complete
09359e37df4b: Pull complete
2593afa0e612: Pull complete
cff1f9ba2a6e: Pull complete
27800508e272: Pull complete
84c0aae16fc3: Pull complete
fdda4f84e7a3: Pull complete
Digest: sha256:8498aff37222c4f69405b4f4db3a67fcdd12ce60eb3ded39e82b097305fb913e
Status: Downloaded newer image for returntocorp/semgrep-action:v1
=== detecting environment
| versions    - semgrep 0.27.0 on Python 3.7.9
| environment - running in github-actions, triggering event is 'pull_request'
| semgrep.dev - not logged in
=== setting up agent configuration
| using semgrep rules from the committed .semgrep.yml
| using default path ignore rules of common test and dependency directories
| looking at 3399 changed paths
| found 3387 files in the paths to be scanned
| skipping 303 files based on path ignore rules
=== looking for current issues in 3084 files
| No current issues found
| No current issues found
| 1 current issue found
| 2 current issues found
| 2 current issues found
| 2 current issues found
| 2 current issues found

=== failed command's STDOUT:



=== failed command's STDERR:

fatal: No pathspec was given. Which files should I remove?


Error: ROR] `/usr/bin/git rm -f` failed with exit code 128

Docker image is unnecessarily large

Our docker image is 388MB, but semgrep itself is <100MB and semgrep-agent is 100MB. We are adding a few dependencies, but the image shouldn't be this large. This has a user impact because CI systems pull semgrep_agent, so the smaller we are, the faster we run.

quick ideas:

we probably don't need virtualenvs in the docker image; can we just install as system deps?
we waste 139MB in the line COPY --from=semgrep /usr/local/bin/semgrep-core /tmp/semgrep-core, which we can't delete later because docker layers are append-only. Either squash the image, or maybe we can just install semgrep from pip?

Some debugging:

ine@imbp4 ~/D/r/semgrep-action (develop)> docker image list
REPOSITORY                               TAG                  IMAGE ID            CREATED             SIZE
deleteme                                 latest               3a2024614c6b        59 seconds ago      388MB

ine@imbp4 ~/D/r/semgrep-action (develop) [1]> docker image history deleteme
IMAGE               CREATED              CREATED BY                                      SIZE                COMMENT
3a2024614c6b        About a minute ago   /bin/sh -c #(nop)  ENV SEMGREP_ACTION=true S…   0B
1248a8253a31        About a minute ago   /bin/sh -c #(nop)  CMD ["python" "-m" "semgr…   0B
ea96aab92e25        About a minute ago   /bin/sh -c #(nop)  ENV PATH=/root/.local/bin…   0B
5e31234a8ae2        About a minute ago   /bin/sh -c #(nop) COPY dir:1fd9cad476546e4a5…   117kB
880a9eeb1cb8        About a minute ago   /bin/sh -c apk add --no-cache --virtual=.bui…   207MB
f9731445eead        3 minutes ago        /bin/sh -c #(nop) COPY file:242636c2950567f1…   139MB
32b1ab530668        3 minutes ago        /bin/sh -c #(nop)  ENV INSTALLED_SEMGREP_VER…   0B
d085a20dee12        3 minutes ago        /bin/sh -c #(nop) COPY file:89f9fdac4917c31a…   597B
1866ac2367b4        3 minutes ago        /bin/sh -c #(nop) COPY file:c53eceb6b503d20b…   8.47kB
c061f1cc2db7        3 minutes ago        /bin/sh -c #(nop) WORKDIR /app                  0B
6b73b71fd64e        8 days ago           /bin/sh -c #(nop)  CMD ["python3"]              0B
<missing>           8 days ago           /bin/sh -c set -ex;   wget -O get-pip.py "$P…   7.24MB
<missing>           8 days ago           /bin/sh -c #(nop)  ENV PYTHON_GET_PIP_SHA256…   0B
<missing>           8 days ago           /bin/sh -c #(nop)  ENV PYTHON_GET_PIP_URL=ht…   0B
<missing>           3 weeks ago          /bin/sh -c #(nop)  ENV PYTHON_PIP_VERSION=20…   0B
<missing>           3 weeks ago          /bin/sh -c cd /usr/local/bin  && ln -s idle3…   32B
<missing>           3 weeks ago          /bin/sh -c set -ex  && apk add --no-cache --…   27.7MB
<missing>           3 weeks ago          /bin/sh -c #(nop)  ENV PYTHON_VERSION=3.7.9     0B
<missing>           3 weeks ago          /bin/sh -c #(nop)  ENV GPG_KEY=0D96DF4D4110E…   0B
<missing>           3 weeks ago          /bin/sh -c apk add --no-cache ca-certificates   512kB
<missing>           3 weeks ago          /bin/sh -c #(nop)  ENV LANG=C.UTF-8             0B
<missing>           3 weeks ago          /bin/sh -c #(nop)  ENV PATH=/usr/local/bin:/…   0B
<missing>           3 weeks ago          /bin/sh -c #(nop)  CMD ["/bin/sh"]              0B
<missing>           3 weeks ago          /bin/sh -c #(nop) ADD file:f17f65714f703db90…   5.57MB

semgrep-action silently succeeds if semgrep fails due to an internal error

I have a repo and set of rules.

Running semgrep-action on the repo, the action succeeds with "no errors":

semgrep-agent --baseline-ref ... --config semgrep.yml
=== detecting environment
| versions    - semgrep 0.31.1 on Python 3.9.0
| environment - running in git, triggering event is 'unknown'
| semgrep.dev - not logged in
=== setting up agent configuration
| using semgrep rules from semgrep.yml
| using path ignore rules from .semgrepignore
| looking at 4 changed paths
| found 4 files in the paths to be scanned
=== looking for current issues in 4 files
| No current issues found
=== not looking at pre-existing issues since there are no current issues
=== exiting with success status

However, running semgrep directly on the code fails:

docker run -v ${PWD}:/src returntocorp/semgrep:0.31.0 --config /src/semgrep.yml ...
running 435 rules...
an internal error occured while invoking semgrep-core:
	unknown exception: Parse_info.NoTokenLocation("Match returned an empty list with no token location information; this may be fixed by adding enclosing token information (e.g. bracket or parend tokens) to the list's enclosing node type.")
An error occurred while invoking the semgrep engine; please help us fix this by creating an issue at https://github.com/returntocorp/semgrep

The consequence of this is that security issues are silently making it through my CI pipeline (see https://github.com/returntocorp/semgrep-app/pull/1123#discussion_r526341734) (!)

As a user, I expect that, if Semgrep fails, my CI job should fail.
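A sketch of the stricter handling. The `{"results": [...], "errors": [...]}` shape is taken from the JSON output quoted in other issues here; treating exit code 1 as "findings only" is an assumption, and the function and exception names are hypothetical:

```python
import json
from typing import Any, Dict


class SemgrepFailed(RuntimeError):
    """Raised when the semgrep invocation did not complete cleanly."""


def parse_semgrep_output(returncode: int, stdout: str) -> Dict[str, Any]:
    """Parse semgrep's --json output, refusing to treat failures as clean."""
    try:
        output = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise SemgrepFailed(f"unparseable output (exit code {returncode})") from exc
    errors = output.get("errors", [])
    # Assumption: exit codes 0 and 1 mean the scan itself completed.
    if errors or returncode not in (0, 1):
        messages = "; ".join(str(error.get("message", error)) for error in errors)
        raise SemgrepFailed(f"semgrep exited with code {returncode}: {messages}")
    return output
```

Letting the exception propagate makes the CI job fail instead of reporting "No current issues found".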

Scan for baseline issues only in paths with current issues

This would improve total run time by probably around 40% in the case when you have 1 out of 5 changed files introducing new issues.

A continuation of https://github.com/returntocorp/semgrep-action/pull/25
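The narrowing step could be as small as the sketch below, assuming findings are dicts with a `path` key (hypothetical shape and function name):

```python
from typing import Any, Dict, Iterable, List


def baseline_targets(
    changed_paths: Iterable[str], current_findings: Iterable[Dict[str, Any]]
) -> List[str]:
    """Keep only the changed paths that have at least one current finding."""
    paths_with_issues = {finding["path"] for finding in current_findings}
    return [path for path in changed_paths if path in paths_with_issues]
```

An empty result means the baseline scan can be skipped entirely.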

semgrep-agent from the command-line connects to incorrect dashboard project

Describe the bug

When I attempt to run semgrep-agent on the command-line in the same fashion that I run it in CI, it is not connecting to the right project on the web UI dashboard (and therefore not using the correct policy).
In the figure, my two real projects are highlighted: chef/automate (which exists at https://github.com/chef/automate) and chef/chef-cloud (https://github.com/chef/chef-cloud).

In CI, I use the block of code below (for Buildkite). The relevant portions are highlighted.

The equivalent from the command-line, as I understand it, is this:

$ cd ~/code/go/src/github.com/chef/automate
$ docker run  --rm  \
    --volume $(realpath .):/chef/automate --workdir /chef/automate \
    returntocorp/semgrep-action:v1 \
    python -m semgrep_agent --publish-token $SEMGREP_TOKEN --publish-deployment $SEMGREP_ID --baseline-ref master

Notable:

When executed, that generates a new project in the dashboard at the top -- item (1) -- "automate" compared to the real "/chef/automate".
If I change the docker command above to use just "automate" (without the /chef parent) so the partial line is this: --volume $(realpath .):/automate --workdir /automate, it still results in updating "automate" in the dashboard -- item (1) again.
If I change to --volume $(realpath .):/foo --workdir /foo or --volume $(realpath .):/src --workdir /src, then it creates (or updates) items (2) and (3) respectively.
In all cases, running from the command-line is using the default policy, "Getting Started"; note that the real project chef/automate is using policy "Chef-01".

To Reproduce
As above.

Expected behavior
Should be able to connect to the "chef/automate" project and use "Chef-01" policy.

Screenshots
As above.

What is the priority of the bug to you?
Is this a P0 (blocking your adoption of Semgrep or workflow), P1 (important to fix or quite annoying), P2 (regular bug that should get fixed)?
P2 (a bit frustrating, but I can get by without it for a time)

Environment
docker

Ensure no code content is collected

Consider returning information about rules that are being run.

Currently semgrep-action will return the number of files that are lined up for scanning, and the number of ignored files, but won't print out the number of rules that are loaded or used (and what is being used to load those rules, registry link, .semgrep or .semgrep.yml):

=== detecting environment
| versions    - semgrep 0.25.0 on Python 3.7.9
| environment - running in gitlab-ci, triggering event is 'push'
| semgrep.dev - not logged in
=== setting up agent configuration
| using semgrep rules from the committed .semgrep/ directory
| using path ignore rules from .semgrepignore
| found 100 files in the paths to be scanned
| skipping 5 files based on path ignore rules
=== looking for current issues in 100 files
| 0 current issues found

Can we consider if we want to print out the number of rules used in scanning and where the rules have been fetched from?

Gracefully handle broken rules

Options:

add a configuration option for "Failing Open" and "Failing Closed"
continue Semgrep run in spite of broken rules and intelligently and clearly report the issue with a given rule

Action reports issues on trunk branch

Action appears to be reporting issues that occur in PR as well as any new issues on the merge target (e.g. master).

It should only report issues in the PR itself.

Suggestion:
Calculate --diff-against using git merge-base of PR commit and target branch.
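A minimal sketch of the suggestion, shelling out to `git merge-base` (a real git subcommand). The `merge_base` wrapper is hypothetical, and it assumes enough history has been fetched for both commits to be reachable:

```python
import subprocess


def merge_base(head: str, target: str, cwd: str = ".") -> str:
    """Return the best common ancestor of two commits via `git merge-base`."""
    result = subprocess.run(
        ["git", "merge-base", head, target],
        cwd=cwd,
        check=True,  # raise CalledProcessError if git fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Diffing against the returned commit, rather than the tip of the target branch, would leave out changes that landed on the target after the PR branched off.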

Provide more release tags for CI consumption stability

I mentioned this some time ago as a casual comment in slack, but surfacing here just to give a bit more exposure.

Since the v1 tag on semgrep-agent is continually bumped with new releases, that means that consumers of the v1 docker image are always at risk of having their CI build break due to a new release. My release engineering folks rather frown on that, which means I cannot make use of semgrep's failing a build iff there is a new problem in our code.

It would be nice to have the option to pin to an unchanging version of your docker image so I could eliminate the risk in my CI pipeline. Not saying you have to immobilize "v1" --I understand the desire to keep that at the head for your own needs--but perhaps have additional tags corresponding to the encapsulated semgrep (since I imagine that changes more frequently).

Ambiguous behavior when both `config` and backend set up

Right now, it's not clear whether the action will use the hard-coded config or use the backend-configured config.

Suggestion is to fail hard with a descriptive error message in this case.

Discovered via dog-fooding on returntocorp/semgrep.
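Failing hard might look like the sketch below. The names are hypothetical, and "backend configured" is reduced to a single boolean for illustration:

```python
from typing import Optional


class AmbiguousConfigError(ValueError):
    """Raised when rules are configured both locally and on the backend."""


def choose_rule_source(config: Optional[str], logged_in: bool) -> str:
    """Return "config" or "backend", refusing to guess when both are set."""
    if config and logged_in:
        raise AmbiguousConfigError(
            "both a local config and a backend deployment are set; "
            "remove one so it is clear which rules will run"
        )
    if config:
        return "config"
    if logged_in:
        return "backend"
    raise ValueError("no rules configured: pass a config or log in")
```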

Make it possible to unignore .gitignore'd files

Even if we override :include .gitignore in .semgrepignore, semgrep itself ignores those files by default. We could run it with semgrep --no-git-ignore to fix this.

Add ability to configure file globs

User should be able to configure file globs to define run locations when using action without the SaaS backend.

Possible solutions:

--glob à la ripgrep
directly pass --include and --exclude to semgrep

Fix PR merge result being compared to base branch in GHA

The issue is a surprising behavior of GitHub. Let's assume you have this git history:

main branch  0--1--2--3--4--5
                 \
your branch       A--B

When GitHub starts this job, it actually merges commit B into 5! That's the codebase semgrep sees, and our diff-aware scanning will compare "B merged into 5" against 1 to find new issues.

So while you expect that only A and B's changes will be scanned, right now it's actually A, B, and 2 through 5 all being scanned, hence the additional changed file count.

This checkout option might be the right one to use: https://github.com/actions/checkout#checkout-pull-request-head-commit-instead-of-merge-commit. But we would probably still need to add Python logic to fetch more commits for the baseline checkout to work.

GithubAction Script for Release

We should create a script that automatically pulls the SHA256 hash from Docker hub and changes the necessary lines in the Dockerfile.

Ease bumping semgrep version

Changing the semgrep version is error-prone for maintainers.

Proposal:

Write a script to bump the version
Clearly document in the README + link to this in https://returntocorp.quip.com/zrklAwLqfrm7/Using-the-Semgrep-Repo#DfcACAoWAYY

Update README to include latest CI instructions

Documentation has moved to semgrep.dev/docs and now includes details on using the GitHub Security Dashboard, but neither is discussed in the README. This ticket is to update that per #97

Publish semgrep-agent to pypi

We can rework it to use poetry for this.

github action failure

Got this error; not sure what's happening, but sharing as requested in the message.

=== failed command's STDOUT:
{"results": [], "errors": [{"type": "SemgrepError", "code": 2, "message": "an internal error occured while invoking semgrep-core:\n\tunknown exception: Parse_info.NoTokenLocation(\"Match returned an empty list with no token location information; this may be fixed by adding enclosing token information (e.g. bracket or parend tokens) to the list's enclosing node type.\")\nAn error occurred while invoking the semgrep engine; please help us fix this by creating an issue at https://github.com/returntocorp/semgrep"}]}
=== failed command's STDERR:
running 481 rules...
Error: ROR] `/root/.local/bin/semgrep --skip-unknown-extensions --disable-nosem --json --no-rewrite-rule-ids --config /tmp/tmpstr5uk_i.yml webhooks/json_map.go webhooks/notification_formatter.go webhooks/events.go webhooks/events_test.go` failed with exit code 2

This is an internal error, please file an issue at https://github.com/returntocorp/semgrep-action/issues/new/choose
and include any log output from above.

[discussion] Should semgrepdep go in this repo?

Background

@ievans made a fork of semgrep-action that can also scan changes of yarn.lock etc. and post about how the dependency security hotspots have changed.

Reasons for adding this feature here

From a technical standpoint this feature is pretty well separated, so I'm not worried about unclean code. A wider feature set would also mean the same project would be useful for more people. Some might discover the Semgrep action by looking for a cool dependency change analysis tool.

Reasons against

I think it would lead to a branding nightmare though. Semgrepdep users would expect more support than we'd give it, semgrep-action users would be confused by a weird, unnatural option in the action's config. Users of both would still need to add separate workflows (like we did internally), which is also confusing to maintain since looking at a GHA overview you'd just see 'semgrep-action' running twice.

Add CI integration for Jenkins

Is your feature request related to a problem? Please describe.
Not related to a problem

Describe the solution you'd like
Be able to use Semgrep in Jenkins CI, with something like a plugin.

Describe alternatives you've considered
Call Semgrep from CLI in Jenkins. This would need the Jenkins server to have Semgrep installed, which in some setups is a lot harder to configure. For instance, if Jenkins runs in distributed on-demand cloud / containers / runtime sandboxes, you would need to configure the install for Semgrep each time a container / runtime sandbox is created and used.

Additional context

Don't print output twice

Make PRIVACY.md file/contents more discoverable

Upload SARIF log file

Since Semgrep already supports SARIF output, it would be nice if the action could upload the result so that it is shown under the Security tab on GitHub.

Support Passing --severity Flag to semgrep

Description

Upstream semgrep in version 0.33.0 recently added a --severity flag that allows filtering rules to WARNING, etc. It would be fantastic if this project could support the ability to optionally configure that. As an example use case, we would love the ability to add rules within our codebase such as:

Create rules with WARNING severity
Incrementally fix existing reports over time due to time/effort
Switch rule to ERROR severity when all existing reports are fixed

With the following logic in our CI configuration:

Fail and report only ERROR severity rules on main branch pushes
Fail and report WARNING and ERROR severity rules during pull request submission so new code can be fixed before going in

Happy to provide more details or submit an implementation if pointed in the right direction! Thanks!
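The CI logic described above could be sketched as follows. The event names come from the agent's own log output ('push', 'pull_request'); that `--severity` may be repeated on one command line is an assumption to check against the semgrep CLI help, and `severity_args` is a hypothetical helper:

```python
from typing import List


def severity_args(event: str) -> List[str]:
    """Return the --severity arguments to pass for a triggering event."""
    # Pushes to the main branch only fail on ERROR; pull requests also
    # surface WARNING so new code can be fixed before it is merged.
    severities = ["ERROR"] if event == "push" else ["WARNING", "ERROR"]
    args: List[str] = []
    for severity in severities:
        args += ["--severity", severity]
    return args
```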

References

Expose name (and perhaps rules) of policy being used

When connecting to semgrep.dev, we should log info about the policy being executed.