alexpovel / srgn

A code surgeon for precise text and code transplantation. A marriage of `tr`/`sed`, `rg` and `tree-sitter`.

Home Page: https://crates.io/crates/srgn/
License: MIT License
A pain to test, but there's currently no coverage at all. Perhaps ignore what's written out, but at least test that files are found, globbing works, processing works, and so on.

`--files` currently modifies in-place, so testing before vs. after would be pretty nasty. Perhaps extend it with functionality like `--files-ext`, so users can optionally specify an additional file extension for produced files?
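Such a hypothetical `--files-ext` could be as simple as appending the extra extension to each input path, leaving the original untouched. A minimal sketch; the flag and the function name are assumptions, not srgn's actual API:

```rust
use std::path::{Path, PathBuf};

/// Derive the output path for a processed file when an additional
/// extension is given (hypothetical `--files-ext` behavior).
fn output_path(input: &Path, extra_ext: &str) -> PathBuf {
    let mut name = input.file_name().unwrap_or_default().to_os_string();
    name.push(".");
    name.push(extra_ext);
    input.with_file_name(name)
}

fn main() {
    let out = output_path(Path::new("src/lib.rs"), "out");
    assert_eq!(out, PathBuf::from("src/lib.rs.out"));
    println!("{}", out.display());
}
```

Tests could then diff `foo.rs` against `foo.rs.out` without git trickery.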
Idea from this repository. A stemmer could further shift the burden to compute (for which we still have breathing room to use more), away from memory (which we're trying to save), as the word list could shrink further. The list currently contains a lot of word derivatives, which could all be removed in favor of a single stem.
The `build.rs` script could prune the original word list further (which itself won't be touched, as having raw data available is always good), using the same approach as used with the compound words: ingest the word list, and only write out entries that can neither be constructed as compound words from other entries, nor reproduced via stemming.
Ultimately, the more elaborate the compute/algorithm side, the lower quality a word list we can get away with.
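A sketch of what such a pruning pass could look like, with a toy suffix-stripping stemmer and a naive compound check standing in for the real algorithms (the function names, the suffix list, and the sample words are all illustrative assumptions):

```rust
use std::collections::HashSet;

/// Toy stand-in for a real stemmer: strips a few common suffixes.
fn stem(word: &str) -> &str {
    for suffix in ["ern", "en", "er", "e", "n", "s"] {
        if let Some(s) = word.strip_suffix(suffix) {
            if s.chars().count() >= 3 {
                return s;
            }
        }
    }
    word
}

/// Can `word` be split into two parts that are both list entries?
fn is_compound(word: &str, words: &HashSet<&str>) -> bool {
    word.char_indices()
        .skip(1)
        .any(|(i, _)| words.contains(&word[..i]) && words.contains(&word[i..]))
}

/// Keep only entries that are neither compounds of other entries
/// nor derivatives whose stem is already listed.
fn prune<'a>(words: &[&'a str]) -> Vec<&'a str> {
    let set: HashSet<&str> = words.iter().copied().collect();
    words
        .iter()
        .copied()
        .filter(|w| !is_compound(w, &set))
        .filter(|w| {
            let s = stem(w);
            s == *w || !set.contains(s) // drop derivatives of listed stems
        })
        .collect()
}

fn main() {
    let words = ["boot", "haus", "hausboot", "lauf", "laufen"];
    // "hausboot" is a compound, "laufen" stems to the listed "lauf":
    assert_eq!(prune(&words), vec!["boot", "haus", "lauf"]);
}
```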
`binstall` fetches metadata from crates.io, so if the desired version isn't on there yet, it fails:
https://github.com/alexpovel/srgn/actions/runs/6676593449/job/18145819987#step:3:15
Invocation:

```console
$ srgn --version
srgn 0.11.0
$ srgn --files '**/*.rs' --rust-query '((macro_invocation macro: (identifier) @name) @macro (#any-of? @name "error" "warn" "info" "debug" "trace" ))' '(?P<msg>.+);(?P<structure>.+)' '$structure,$msg'
```

Panic:

```text
thread '<unnamed>' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/srgn-0.11.0/src/scoping/scope.rs:60:43:
begin <= end (23 <= 0) when slicing `error!( "server error"; "uuid" => %uuid, "error.user_msg" => %user_msg, src)`
```
Please let me know if more information is needed.
Have a feature like `standalone`, which would ship the word lists even in the library, or use a user-provided one if that's desired. The binary crate would always be `standalone`.
Probably similar to https://github.com/cargo-bins/cargo-binstall/blob/af04e45b5a516b2944f41a2d2db409c1d8e0f15d/.github/workflows/release-packages.yml , just much simpler. Let's try `amd64` for Linux and Windows, and `arm64` for Linux and macOS.
Not sure what this is about yet, but pipelines keep failing:
https://github.com/alexpovel/srgn/actions/runs/6724256557/job/18276098536
They advise using a special token, so set that up. Don't forget the dependabot secrets.
So far, all languages `srgn` offers come with `comments` and `strings` queries. Those are kind of the common set all languages have. A great third one is `imports`: a very valid use case for those is rewriting all imports in a code base, which sometimes cannot be automated using IDE tooling.
On building a `ScopedViewBuilder`, assert that its constituents equal the original input. Probably make it a hard assert, not a debug assert, as any bug there is a showstopper. To be a hard assert, it needs to be cheap. For that, a cheap equality method is required (`ScopedViewBuilder == &str`).
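Such a cheap equality check could walk the scopes and the original in lockstep, slice by slice, with no allocation. A sketch using a hypothetical stand-in for `ScopedViewBuilder` (not srgn's actual type):

```rust
/// Hypothetical stand-in: a view is a sequence of slices that,
/// concatenated, must reproduce the original input exactly.
struct ViewBuilder<'a> {
    scopes: Vec<&'a str>,
}

impl<'a> ViewBuilder<'a> {
    /// Cheap equality against the original: strip each scope off the
    /// front of the remaining input, without building a joined string.
    fn equals(&self, original: &str) -> bool {
        let mut rest = original;
        for scope in &self.scopes {
            match rest.strip_prefix(*scope) {
                Some(remainder) => rest = remainder,
                None => return false,
            }
        }
        rest.is_empty()
    }
}

fn main() {
    let input = "hello, world";
    assert!(ViewBuilder { scopes: vec!["hello", ", ", "world"] }.equals(input));
    // A view that dropped a separator must fail the assert:
    assert!(!ViewBuilder { scopes: vec!["hello", "world"] }.equals(input));
}
```

This is O(n) in the input length with no allocation, so it could plausibly run as a hard assert.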
The macro originated from here, giving much more context on what it does and why it's of utility. See also alexpovel/betterletter#33 .

To replace `->` with `→` etc.
This exists (the `--files` option) and should be generally useful, but is hard to find as it's not in the README.
Remove dead code that's part of the public API but no longer required. Sadly, clippy/ra cannot warn us of such code.
A solid set of some of the most popular ones. And then, for each, or at least most, implement all or most of (if relevant for the language):

- `class`/`struct`/`enum` names (at definition site)

CI failed: https://github.com/alexpovel/srgn/actions/runs/6605779360/job/17941307575
On Windows, using `tree-sitter-python`, its comments parsing eats into `\r\n` and, e.g. if `--delete` is used, will mistake `\r` as part of a "comment" and delete it, leaving a naked `\n`. That's an error for files with CRLF-style line endings.

Idea: for `build` in `ScopedView`, check if `\n` are generally in or out of scope. If all `\n` are detected as out of scope, shuffle all `\r` out as well, as some might have been put in scope. This will require copying.
This might also qualify as a bug upstream, but fixing every single tree-sitter parser/grammar, if at all possible, is much harder than fixing it in our application here, for every single use case.
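The proposed shuffling could look roughly like this sketch, using a simplified `(text, in_scope)` segment representation rather than srgn's actual view types (an illustrative assumption):

```rust
/// If no `\n` is in scope, any `\r` a grammar pulled into scope belongs
/// to a CRLF line ending and is moved back out of scope.
fn shuffle_cr_out(segments: Vec<(String, bool)>) -> Vec<(String, bool)> {
    let newline_in_scope = segments
        .iter()
        .any(|(text, in_scope)| *in_scope && text.contains('\n'));
    if newline_in_scope {
        return segments; // newlines are legitimately scoped; nothing to fix
    }
    let mut result = Vec::new();
    for (mut text, in_scope) in segments {
        if in_scope && text.ends_with('\r') {
            text.pop(); // strip the \r from the scoped segment...
            if !text.is_empty() {
                result.push((text, true));
            }
            result.push(("\r".to_string(), false)); // ...and move it out of scope
        } else {
            result.push((text, in_scope));
        }
    }
    result
}

fn main() {
    // A Python comment parsed on Windows: the grammar ate the \r.
    let segments = vec![("# comment\r".to_string(), true), ("\n".to_string(), false)];
    let fixed = shuffle_cr_out(segments);
    assert_eq!(
        fixed,
        vec![
            ("# comment".to_string(), true),
            ("\r".to_string(), false),
            ("\n".to_string(), false),
        ]
    );
}
```

With this, `--delete` on comments would leave the full `\r\n` intact.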
Current issues are:
Solutions are:
- filter the word list to no longer contain compound words (use existing logic in Rust or a Python script)
- do not use a `&[&str]`: it means storing a `(usize, usize)` (address, length) pointer 2.5 million (current word list length as of 37aff4d) times, roughly doubling the binary size (`(32_600_000 + 2_152_639 * (2 * 64 / 8)) / 1_000_000 == 67.042224`, aka the c. 70 MB observed on Windows; why Linux is much higher still on the same arch, no idea). Instead, see and hope the longest word in the list is reasonably short; call that length `n`, and pad all words with `\0`s (or whatever...) until they're all of length `n`. Concatenate them into a single `&str`, whose pointer/length info now has trivial overhead compared to the multi-megabyte single string. Store it as a `const` (or `static`...), and implement a simple binary search over it manually. We get the same important characteristics (sorted, equally sized elements, binary-searchable much like a `&[T: Sized]`), but the core advantage of no longer wasting a `(usize, usize)` for each word. On 64-bit, that tuple is 16 bytes, whereas the average word length (in `char`s, not bytes) is 14.6, which in bytes should come out to about 16 as well. Hence, there's pretty much 100% overhead. With a single string, that's reduced to a single tuple. Compilation times are then down to utterly reasonable levels as well (15 s on both platforms), and binary sizes are down to a tad over the string length, aka there's no overhead anymore (also confirmed on both platforms).

Storing the single string in a `&str` already gives us UTF-8 safety, but the binary search could still go awry. Definitely unit-test the shit out of that.
Using `uneven` search instead of the `padding` approach, see below.
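The `uneven` idea can be sketched as a binary search directly over a sorted, `\n`-separated single string: probe a byte position, expand it to the boundaries of the enclosing word, compare, and recurse into the half that can still contain the needle. This is a re-derivation for illustration, not srgn's actual `binary_search_uneven`:

```rust
/// Binary search for `needle` in a sorted, `\n`-separated string,
/// without padding words to equal length.
fn contains_uneven(haystack: &str, needle: &str) -> bool {
    use std::cmp::Ordering;
    let bytes = haystack.as_bytes();
    let (mut lo, mut hi) = (0usize, bytes.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // Expand `mid` to the boundaries of the word it falls into.
        let start = bytes[..mid]
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |i| i + 1);
        let end = bytes[mid..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(bytes.len(), |i| mid + i);
        match haystack[start..end].cmp(needle) {
            Ordering::Equal => return true,
            Ordering::Less => lo = end + 1,            // continue right of this word
            Ordering::Greater => hi = start.saturating_sub(1), // continue left
        }
    }
    false
}

fn main() {
    let words = "apple\nbanana\ncherry"; // sorted, newline-separated
    assert!(contains_uneven(words, "apple"));
    assert!(contains_uneven(words, "banana"));
    assert!(contains_uneven(words, "cherry"));
    assert!(!contains_uneven(words, "avocado"));
}
```

The boundary scan costs at most one word length per probe, so lookups stay O(log n) word comparisons while the storage has zero padding overhead. Exactly the kind of function to unit-test exhaustively, including first/last words and near-misses.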
It's always tempting to have it, but it's also a smell. A first step was taken in 5664118, using `itertools`' `powerset`.
Remaining items are:

- `instrament`: can be moved back into `core`; not used anywhere else currently anyway
- `strings.titlecase`: looked around, as it seems very easy for there to be a crate for it, but no dice (funny that this is so "hard"): one candidate only does `ascii_lowercase`, which we cannot use; another, at `0.0.4`, only has lower- and uppercasing, no titlecasing
- `binary_search_uneven`: currently only lives externally because of benchmarks, as Criterion benchmarks can only use the public API
- `is_compound_word`: a small function, but unlikely to find a suitable crate for that. Lives externally because `build.rs` prepares the word list using that same algorithm (so that the processed word list doesn't contain compound words, as that would be wasted space)

Tracking issue for:

- `bin`-only package to `bin` + `lib`
Found randomly through testing.

```text
---- tests::test_cli_files::case_1 stdout ----
Running: "git" "restore" "tests/files-option/basic-python/in"
thread 'tests::test_cli_files::case_1' panicked at tests/cli.rs:106:24:
Head restoration to not fail: Os { code: 2, kind: NotFound, message: "No such file or directory" }
```
I'm working on packaging this for nixpkgs, and currently I have to disable the check phase due to this. Maybe check for the existence of `.git` and skip these tests if it's not found? Or refactor it so it copies files to a temporary directory first before mutating them in place?
In German, verbs are potentially capitalized at the beginning of sentences. Currently, for example, `Uebel` won't work, whereas `uebel` will correctly turn into `übel`. The Python version already takes care of this.
See also https://tonsky.me/blog/unicode/
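One way to handle the capitalized variant, sketched with a hypothetical `replace` lookup standing in for the real substitution table: retry with the first character lowercased, then restore the capitalization on the result.

```rust
/// Try `replace` on the word as-is; if that fails and the word starts
/// with an uppercase letter, retry on the lowercased form and
/// re-capitalize the replacement.
fn process_word(word: &str, replace: impl Fn(&str) -> Option<String>) -> Option<String> {
    if let Some(result) = replace(word) {
        return Some(result);
    }
    let mut chars = word.chars();
    let first = chars.next()?;
    if !first.is_uppercase() {
        return None;
    }
    let lowered: String = first.to_lowercase().chain(chars).collect();
    let replaced = replace(&lowered)?;
    let mut out = replaced.chars();
    let head = out.next()?;
    Some(head.to_uppercase().chain(out).collect())
}

fn main() {
    // Hypothetical one-entry lookup standing in for the real table:
    let replace = |w: &str| (w == "uebel").then(|| "übel".to_string());
    assert_eq!(process_word("uebel", &replace), Some("übel".to_string()));
    assert_eq!(process_word("Uebel", &replace), Some("Übel".to_string()));
    assert_eq!(process_word("Quark", &replace), None);
}
```

Using `char::to_uppercase`/`to_lowercase` (iterator-based) rather than ASCII methods keeps this correct for umlauts, per the Unicode caveats in the linked article.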
I briefly looked into multi-threading in #3 and found it not worth it. However, having used `rayon` and being able to very quickly benefit from it (69f5f23) was impressive. Perhaps it's worth it. Would probably require reading all of `stdin` at once, then handing it to `par_iter`, and not just iterating over its lines one by one. We'd get multi-threading for free, but it might be slower for small inputs (which for my use case represent basically all inputs).
There is a 10 MB file size limit for `.crate` files. Check if this impacts this crate.
When running on large inputs, certain words will be highly common. Memoize those, like `@cache` in Python.
See also alexpovel/betterletter#33 .
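A minimal sketch of such memoization in Rust, analogous in spirit to Python's `functools.cache` (the types and names here are illustrative, not srgn's):

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Memoizing wrapper: the per-word work runs once per distinct word
/// and is reused for all repeats.
struct Memo<F: Fn(&str) -> String> {
    f: F,
    cache: HashMap<String, String>,
}

impl<F: Fn(&str) -> String> Memo<F> {
    fn new(f: F) -> Self {
        Self { f, cache: HashMap::new() }
    }

    fn get(&mut self, word: &str) -> &str {
        if !self.cache.contains_key(word) {
            let value = (self.f)(word);
            self.cache.insert(word.to_string(), value);
        }
        &self.cache[word]
    }
}

fn main() {
    let calls = Cell::new(0);
    let mut memo = Memo::new(|w: &str| {
        calls.set(calls.get() + 1); // count invocations of the expensive work
        w.to_uppercase()
    });
    assert_eq!(memo.get("uebel"), "UEBEL");
    assert_eq!(memo.get("uebel"), "UEBEL");
    assert_eq!(calls.get(), 1); // second lookup hit the cache
}
```

The trade-off is memory, which is exactly what other items here try to save, so this probably only pays off above some input size.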