alexpovel / srgn


426.0 3.0 3.0 14.51 MB

A code surgeon for precise text and code transplantation. A marriage of `tr`/`sed`, `rg` and `tree-sitter`.

Home Page: https://crates.io/crates/srgn/

License: MIT License

Rust 92.16% Just 1.17% Python 1.87% C# 1.86% TypeScript 1.28% Go 1.05% Shell 0.60%

csharp go python regex rust rust-lang tree-sitter typescript abstract-syntax-tree grep

srgn's People

Forkers

rvolgers ysndr neutric

srgn's Issues

Test `--files`

A bitch to test, but there's currently no coverage at all. Perhaps ignore what's written out, but at least test that files are found, globbing works, processing works, ...

--files currently modifies in-place, so testing before vs. after would be pretty nasty. Perhaps add an option like --files-ext, so users can optionally specify an additional file extension for the produced files?

Look into stemming

Idea from this repository. A stemmer could further shift the burden to compute (for which we still have breathing room to use more), away from memory (which we're trying to save), as the word list could shrink further. The list currently contains a lot of word derivatives, which could all be removed in favor of a single stem.

The build.rs script could prune the original (which won't be touched, as having raw data available is always good) word list further, using the same approach as used with the compound words (ingest word list, and only write out entries that can neither be constructed as compound words from other entries, nor reproduced via stemming).

Ultimately, the more elaborate the compute/algorithm side, the lower quality a word list we can get away with.

Make binary testing in CI depend on publishing to crates.io

binstall fetches metadata from crates.io so if the desired version isn’t on there yet it fails:

https://github.com/alexpovel/srgn/actions/runs/6676593449/job/18145819987#step:3:15

Unexpected panic while rearranging macro arguments

Invocation:

$ srgn --version
srgn 0.11.0

$ srgn --files '**/*.rs' --rust-query '((macro_invocation macro: (identifier) @name) @macro (#any-of? @name "error" "warn" "info" "debug" "trace" ))' '(?P<msg>.+);(?P<structure>.+)' '$structure,$msg'

Panic:

thread '<unnamed>' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/srgn-0.11.0/src/scoping/scope.rs:60:43:
begin <= end (23 <= 0) when slicing `error!( "server error"; "uuid" => %uuid, "error.user_msg" => %user_msg, src)`

Please let me know if more information is needed.

Allow library users to supply their own word lists

Have a feature like standalone, which would ship the word lists even in the library, or use a user-provided one if that's desired. The binary crate would always be standalone.

Set up binary compilation for releases

Let's try amd64 for Linux and Windows, and arm64 for Linux and macOS.

Provide codecov with token

Not sure what this is about yet, but pipelines keep failing:

https://github.com/alexpovel/srgn/actions/runs/6724256557/job/18276098536

They advise to use a special token, so set that up. Don't forget dependabot secrets.

Feature: `import` handling

So far, all languages srgn offers come with comments and string queries. Those are kind of the common set all languages have.

A great third one is imports: a very valid use case is rewriting all imports in a code base, which sometimes cannot be automated using IDE tooling.

Assert scoped view correctness

On building a ScopedViewBuilder, assert that its constituents equal the original input.

Probably make it a hard assert, not a debug assert, as any bug there is a showstopper.

To be a hard assert, it needs to be cheap. For that, a cheap equality method is required (ScopedViewBuilder == &str).
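
A minimal sketch of such a cheap comparison, assuming the view is a sequence of borrowed in-scope/out-of-scope pieces. The `Scope` enum and `pieces_equal_input` are hypothetical stand-ins, not srgn's actual types. The check walks the input once and never allocates:

```rust
/// Hypothetical stand-in for one constituent of a scoped view.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope<'a> {
    In(&'a str),
    Out(&'a str),
}

impl<'a> Scope<'a> {
    fn as_str(&self) -> &'a str {
        match *self {
            Scope::In(s) | Scope::Out(s) => s,
        }
    }
}

/// Returns true if concatenating all pieces reproduces `input` exactly.
/// Each piece is stripped off the front of the remaining input, so the
/// comparison is linear in the input length and allocation-free.
pub fn pieces_equal_input(pieces: &[Scope<'_>], input: &str) -> bool {
    let mut rest = input;
    for piece in pieces {
        match rest.strip_prefix(piece.as_str()) {
            Some(tail) => rest = tail,
            None => return false,
        }
    }
    rest.is_empty()
}
```

Because this is a single pass over the bytes, it should be cheap enough for a plain `assert!` rather than a `debug_assert!`.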

Add context/documentation to `instrament`

The macro:

https://github.com/alexpovel/betterletters/blob/b6c314ebdffdf252f02604cf29ac3b85aed44ba6/common/src/instrament.rs

originated from here, giving much more context on what it does and why it's of utility:

https://web.archive.org/web/20230526110628/https://github.com/la10736/rstest/issues/183#issuecomment-1564021215

Implement multi-threading

Add profiling support

Via https://github.com/flamegraph-rs/flamegraph/ .

Implement `symbols` module

To replace -> with → etc.
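
A rough sketch of what such a module could do. Only the `->` to `→` pair comes from the text above; the other table entries and the function name `replace_symbols` are assumptions for illustration:

```rust
/// Hypothetical substitution table. Only `->` => `→` is taken from the
/// issue text; the remaining pairs are illustrative assumptions.
const SYMBOLS: &[(&str, &str)] = &[("->", "→"), ("<-", "←"), ("!=", "≠")];

/// Scans the input left to right, replacing the first matching ASCII
/// sequence at each position and copying everything else verbatim.
pub fn replace_symbols(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    'outer: while !rest.is_empty() {
        for (from, to) in SYMBOLS {
            if let Some(tail) = rest.strip_prefix(*from) {
                out.push_str(to);
                rest = tail;
                continue 'outer;
            }
        }
        // No symbol matched here: copy one character and move on.
        let ch = rest.chars().next().expect("rest is non-empty");
        out.push(ch);
        rest = &rest[ch.len_utf8()..];
    }
    out
}
```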

Document files aka glob option

This exists (--files option) and should be generally useful, but is hard to find as it's not in the README.

Remove dead code

Remove dead code that's part of the public API but no longer required. Sadly, clippy/ra cannot warn us of such code.

Implement language grammars

A solid set of some of the most popular ones:

And then, for each or at least most, implement all, or most of (if relevant for the language):

Write documentation

README
public API (rustdoc, what will end up on crates.io)
private items

Fix carriage return issue on Windows

CI failed: https://github.com/alexpovel/srgn/actions/runs/6605779360/job/17941307575

On Windows, with tree-sitter-python, comment parsing eats into \r\n: the \r is treated as part of the "comment". If, e.g., --delete is used, the \r gets deleted along with the comment, leaving a naked \n. That's an error for files with CRLF-style line endings.

Idea: when building a ScopedView, check whether \n are generally in or out of scope. If all \n are detected as out of scope, shuffle all \r out as well, as some might have been put in scope. This will require copying.

This might also qualify as a bug upstream, but fixing every single tree-sitter parser/grammar, if at all possible, is much harder than fixing it in our application here, for every single use case.
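
The idea above could be sketched roughly as follows. `Piece` and `shuffle_carriage_returns_out` are hypothetical names, and the pieces are owned `String`s because, as noted, moving the `\r` requires copying:

```rust
/// Hypothetical owned counterpart of an in-scope/out-of-scope piece.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Piece {
    In(String),
    Out(String),
}

/// If no `\n` is in scope anywhere, moves a trailing `\r` of an in-scope
/// piece to the front of a directly following out-of-scope piece that
/// starts with `\n`, so CRLF pairs stay together.
pub fn shuffle_carriage_returns_out(mut pieces: Vec<Piece>) -> Vec<Piece> {
    let newline_in_scope = pieces
        .iter()
        .any(|p| matches!(p, Piece::In(s) if s.contains('\n')));
    if newline_in_scope {
        // Newlines are deliberately in scope: leave everything untouched.
        return pieces;
    }

    for i in 0..pieces.len().saturating_sub(1) {
        let (head, tail) = pieces.split_at_mut(i + 1);
        if let (Piece::In(cur), Piece::Out(next)) = (&mut head[i], &mut tail[0]) {
            if cur.ends_with('\r') && next.starts_with('\n') {
                cur.pop();
                next.insert(0, '\r');
            }
        }
    }
    pieces
}
```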

Implement word list performance improvements

Current issues are:

we still include compound words in the word list, even though we now have an algorithm in place to check for compound words at runtime (which is reasonably cheap)
the Linux (unknown, x86-64) binary is 120 meg (woops...), the Windows one c. 70 meg
compilation takes a minute on Windows and 21 minutes in WSL (???)

Solutions are:

filter word list to no longer contain compound words (use existing logic in Rust or a Python script)
do not use a &[&str]: it means storing a (usize, usize) (address, length) pointer c. 2.15 million (current word list length as of 37aff4d ) times, roughly doubling the binary size (( 32_600_000 + 2_152_639 * (2 * 64 / 8) ) / 1_000_000 == 67.042224 aka the c. 70 meg observed on Windows; why Linux is much higher still on the same arch, no idea).

Instead, find the longest word in the list and hope it's reasonably short. Let's call that length $x$, and hope it's in the ballpark of, say, 20 bytes. Pad all words shorter than $x$ with trailing \0s (or whatever...) until they're all of length $x$ as well. Store the result in a single &str, whose pointer/length info now has trivial overhead compared to the multi-megabyte single string. Store $x$ in a const (or static...), and implement a simple binary search over that manually. We get the same important characteristics:
- performance (same binary search, which is easy and possible as all element lengths are known at compile time, like a regular &[T: Sized])
- compile-time data structure with zero runtime cost (unlike a hashset, or any form of de/ser... although it sounds an awful lot like badly reimplementing a part of Capnproto)
plus the core advantage of no longer wasting a (usize, usize) for each word. On 64-bit, that tuple is 16 bytes, whereas the average word length (in char, not bytes) is 14.6, which in bytes should come out to about 16 as well. Hence, there's pretty much 100% overhead. With a single string, that's reduced to a single tuple. Compilation times are then down to utterly reasonable levels as well (15s on both platforms), and binary sizes are down to a tad over the string length, i.e. there's no overhead anymore (also confirmed on both platforms).

Storing the single string in a &str already gives us UTF-8 safety, but the binary search could still go awry. Definitely unit-test the shit out of that.
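
A minimal sketch of the padded layout and its binary search, under the assumptions that words never contain `\0` and are sorted bytewise. `pad_words` and `contains_padded` are hypothetical names. In practice the padded string would be produced at build time rather than at runtime as here:

```rust
use std::cmp::Ordering;

/// Padding byte; assumed never to occur inside a word.
pub const PAD: u8 = 0;

/// Concatenates bytewise-sorted words, padding each to `width` bytes.
pub fn pad_words(sorted_words: &[&str], width: usize) -> String {
    let mut out = String::with_capacity(sorted_words.len() * width);
    for word in sorted_words {
        assert!(word.len() <= width, "word longer than entry width");
        out.push_str(word);
        for _ in word.len()..width {
            out.push('\0');
        }
    }
    out
}

/// Binary search over fixed-width entries. Works on bytes, so an entry
/// boundary can never split a UTF-8 character incorrectly.
pub fn contains_padded(haystack: &str, width: usize, needle: &str) -> bool {
    let bytes = haystack.as_bytes();
    if width == 0 || bytes.len() % width != 0 {
        return false;
    }
    let (mut lo, mut hi) = (0, bytes.len() / width);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let entry = &bytes[mid * width..(mid + 1) * width];
        // Strip the padding before comparing.
        let len = entry.iter().position(|&b| b == PAD).unwrap_or(width);
        match entry[..len].cmp(needle.as_bytes()) {
            Ordering::Equal => return true,
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    false
}
```

Padding with `\0` keeps the padded entries in the same order as the unpadded words, since `\0` sorts before every other byte.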

Using uneven search instead of the padding approach, see below.
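
For comparison, a sketch of what a search over unevenly sized entries might look like: words are joined by a single-byte separator, and each probe expands from an arbitrary byte offset to the enclosing word. This is an illustration under those assumptions, not the crate's actual `binary_search_uneven`. The name `contains_uneven` is hypothetical:

```rust
use std::cmp::Ordering;

/// Binary search over a bytewise-sorted, separator-joined word list.
/// `sep` is assumed to be an ASCII byte that never occurs inside a word.
pub fn contains_uneven(needle: &str, haystack: &str, sep: u8) -> bool {
    let bytes = haystack.as_bytes();
    let (mut lo, mut hi) = (0usize, bytes.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // Expand from `mid` to the boundaries of the word around it.
        let start = bytes[..mid]
            .iter()
            .rposition(|&b| b == sep)
            .map_or(0, |i| i + 1);
        let end = bytes[mid..]
            .iter()
            .position(|&b| b == sep)
            .map_or(bytes.len(), |i| mid + i);
        match bytes[start..end].cmp(needle.as_bytes()) {
            Ordering::Equal => return true,
            // Continue right of this word (skipping the separator).
            Ordering::Less => lo = end + 1,
            // Continue left of this word.
            Ordering::Greater => hi = start,
        }
    }
    false
}
```

Compared to padding, this stores no filler bytes at all, at the price of a short linear scan per probe to find the word boundaries.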

Get rid of `common` crate

It's always tempting to have it, but it's also a smell. A first step was taken in 5664118 , using itertools' powerset.

Remaining items are:

instrament: can be moved back into core, as it's not used anywhere else currently
strings.titlecase: looked around, as it seems very easy for there to be a crate for it, but no dice (funny that this is so "hard"):
- ~~https://crates.io/crates/capitalize: does ascii_lowercase, which we cannot use~~
- ~~https://crates.io/crates/roe: looks good and professional, but it's work in progress, and as of 0.0.4 only has lower- and uppercasing, no titlecasing~~
- ~~https://crates.io/crates/titlecase: specifically ignores words, which we cannot do~~
- https://crates.io/crates/unicode_titlecase: looks promising, similar to my implementation anyway
binary_search_uneven: currently only lives externally because of benchmarks, as Criterion benchmarks can only use the public API
is_compound_word: small function but unlikely to find a suitable crate for that. Lives externally because build.rs prepares the word list using that same algorithm (so that the processed word list doesn't contain compound words, as that would be wasted space)
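
To illustrate the kind of check and pruning meant here, a simplified sketch follows. It is not the crate's `is_compound_word`: the names `is_compound` and `prune_compounds` are hypothetical, and casing (e.g. matching a lowercase constituent against a capitalized noun in the list) is ignored entirely:

```rust
/// Returns true if `word` can be split into two or more known words.
/// Simplified: case-sensitive, and exponential in the worst case.
pub fn is_compound(word: &str, is_known: &impl Fn(&str) -> bool) -> bool {
    // Try every split point that lies on a character boundary.
    for (idx, _) in word.char_indices().skip(1) {
        let (prefix, suffix) = word.split_at(idx);
        if is_known(prefix) && (is_known(suffix) || is_compound(suffix, is_known)) {
            return true;
        }
    }
    false
}

/// Keeps only entries that cannot be built from other entries, which is
/// what a build script could write out as the processed word list.
pub fn prune_compounds<'a>(words: &[&'a str]) -> Vec<&'a str> {
    let known = |candidate: &str| words.iter().any(|w| *w == candidate);
    words
        .iter()
        .copied()
        .filter(|w| !is_compound(w, &known))
        .collect()
}
```

The linear lookup in `prune_compounds` is only for brevity. A real word list would want a set or a binary search.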

Fix code scanning alert - matching over () is more explicit

Tracking issue for:

https://github.com/alexpovel/srgn/security/code-scanning/49

Implement pipelines + DevOps tooling

set up main pipeline
- unit and integration tests
- code coverage reports (codecov has worked well in the Python version)
- publish to crates.io
  - controlled by release-please
  - split current bin-only package to bin + lib
set up pre-commit hooks

Add missing words to German dictionary

Found randomly through testing.

Missing

aufwändig
Lötkugeln

Contentious

Massenmarkt (replaced by Maßenmarkt)

tests fail when building from tarball due to calling git restore

---- tests::test_cli_files::case_1 stdout ----
Running: "git" "restore" "tests/files-option/basic-python/in"
thread 'tests::test_cli_files::case_1' panicked at tests/cli.rs:106:24:
Head restoration to not fail: Os { code: 2, kind: NotFound, message: "No such file or directory" }

i'm working on packaging this for nixpkgs and currently i have to disable the check phase due to this.

maybe check for the existence of .git and skip these unit tests if it's not found? or just refactor it so it copies files to a temporary directory first before mutating them in-place?
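
A sketch of the second suggestion, using only the standard library. The helper names and the `srgn-test-…` directory naming are made up for illustration. A real test suite might prefer the `tempfile` crate for automatic cleanup:

```rust
use std::path::{Path, PathBuf};
use std::{env, fs, io, process};

/// Recursively copies `src` into `dst`, creating directories as needed.
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Copies a fixture directory into a fresh directory below the system
/// temp dir, so a test can mutate the copy in-place without needing git
/// to restore the originals afterwards.
pub fn stage_fixture(fixture: &Path, label: &str) -> io::Result<PathBuf> {
    let staged = env::temp_dir().join(format!("srgn-test-{}-{}", label, process::id()));
    if staged.exists() {
        fs::remove_dir_all(&staged)?;
    }
    copy_dir_all(fixture, &staged)?;
    Ok(staged)
}
```

A test would then run the binary against the returned path and compare the result with an expected-output directory, leaving the checked-in fixtures untouched.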

Fix verbs-in-uppercase support

In German, verbs are potentially capitalized at the beginning of sentences. Currently, for example Uebel won't work, whereas uebel will correctly turn into übel.
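
One possible way to handle this, sketched under the assumption that validity is decided by a word list lookup: if a capitalized candidate isn't found as-is, retry with only its first character lowercased. `is_valid_word` and `lowercase_first` are hypothetical names:

```rust
/// Lowercases only the first character of `word`.
fn lowercase_first(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first.to_lowercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Accepts a word if it's known as-is, or if it's capitalized and known
/// in lowercase (as for a verb or adjective at the start of a sentence).
/// The reverse is deliberately not accepted: a lowercased noun stays
/// invalid.
pub fn is_valid_word(word: &str, is_known: impl Fn(&str) -> bool) -> bool {
    if is_known(word) {
        return true;
    }
    let starts_upper = word.chars().next().map_or(false, char::is_uppercase);
    starts_upper && is_known(&lowercase_first(word))
}
```

With this, the candidate Übel (from Uebel) would be accepted based on the entry übel.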

The Python version already takes care of this.

Replace NFD with NFKD normalization

Look into rayon

I briefly looked into multi-threading in #3 and found it not worth it. However, having used rayon and being able to very quickly benefit from it (69f5f23) was impressive. Perhaps it's worth it. Would probably require reading all of stdin at once, then handing it to par_iter, and not just iterating over its lines one by one. We'd get multi-threading for free, but it might be slower for small inputs (which for my use case represent basically all inputs).
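
To make the trade-off concrete, here is a sketch of the "read everything, then process in parallel" shape. It deliberately uses `std::thread::scope` from the standard library instead of rayon's `par_iter`, so it stays self-contained. `process_parallel` is a hypothetical name:

```rust
use std::thread;

/// Splits `input` into lines, processes chunks of lines on separate
/// threads, and concatenates the results in the original order.
pub fn process_parallel<F>(input: &str, workers: usize, process: F) -> String
where
    F: Fn(&str) -> String + Sync,
{
    // The whole input must be available up front, unlike line-by-line
    // streaming from stdin.
    let lines: Vec<&str> = input.split_inclusive('\n').collect();
    if lines.is_empty() {
        return String::new();
    }
    let workers = workers.max(1);
    let chunk_size = (lines.len() + workers - 1) / workers;
    let process = &process;

    thread::scope(|scope| {
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|&line| process(line)).collect::<String>()
                })
            })
            .collect();
        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect::<String>()
    })
}
```

The thread spawning and the up-front collection are exactly the overhead that may make this slower than a plain loop for small inputs.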

Investigate file size limit

There is a 10 MB file size limit for .crate. Check if this impacts this crate.

Use memoization

When running on large inputs, certain words will be highly common. Memoize those, like @cache in Python.
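
A minimal sketch of such a cache, assuming the expensive step is a pure function from word to replacement. `Memo` is a hypothetical helper, loosely analogous to Python's `@cache`:

```rust
use std::collections::HashMap;

/// Wraps a pure `&str -> String` function and remembers its results.
pub struct Memo<F> {
    func: F,
    cache: HashMap<String, String>,
}

impl<F: Fn(&str) -> String> Memo<F> {
    pub fn new(func: F) -> Self {
        Self {
            func,
            cache: HashMap::new(),
        }
    }

    /// Returns the cached result for `word`, computing it on first use.
    pub fn get(&mut self, word: &str) -> &str {
        if !self.cache.contains_key(word) {
            let value = (self.func)(word);
            self.cache.insert(word.to_owned(), value);
        }
        &self.cache[word]
    }
}
```

The cache grows without bound here. For very large inputs, capping its size would trade some hit rate for memory.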
