
alexpovel / srgn


A code surgeon for precise text and code transplantation. A marriage of `tr`/`sed`, `rg` and `tree-sitter`.

Home Page: https://crates.io/crates/srgn/

License: MIT License

Rust 92.16% Just 1.17% Python 1.87% C# 1.86% TypeScript 1.28% Go 1.05% Shell 0.60%
csharp go python regex rust rust-lang tree-sitter typescript abstract-syntax-tree grep

srgn's People

Contributors

alexpovel · alexpovel-ci-machine[bot] · dependabot[bot] · github-actions[bot] · neutric · rvolgers


srgn's Issues

Test `--files`

A pain to test, but there's currently no coverage at all. Perhaps ignore what's written out, but at least test that files are found, globbing works, processing works, ...

--files currently modifies in place, so testing before vs. after would be pretty nasty. Perhaps extend with something like a --files-ext flag, so users can optionally specify an additional file extension for produced files?
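The idea could look something like the following minimal sketch (the flag name and `output_path` helper are hypothetical, not existing `srgn` API): derive a distinct output path, so the original file survives and tests can diff input against output.

```rust
use std::path::PathBuf;

/// Hypothetical helper for a `--files-ext`-style flag: append an extra
/// extension to the input path, so e.g. `src/main.rs` produces
/// `src/main.rs.new` instead of being overwritten in place.
pub fn output_path(input: &str, extra_ext: &str) -> PathBuf {
    let mut path = input.to_owned();
    path.push('.');
    path.push_str(extra_ext);
    PathBuf::from(path)
}
```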

Look into stemming

Idea from this repository. A stemmer could further shift the burden toward compute (where we still have breathing room), and away from memory (which we're trying to save), as the word list could shrink further. The list currently contains a lot of word derivatives, which could all be removed in favor of a single stem.

The build.rs script could prune the original (which won't be touched, as having raw data available is always good) word list further, using the same approach as used with the compound words (ingest word list, and only write out entries that can neither be constructed as compound words from other entries, nor reproduced via stemming).

Ultimately, the more elaborate the compute/algorithm side, the lower quality a word list we can get away with.
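The pruning pass described above could be sketched as follows. The stemmer here is a toy suffix-stripper purely for illustration; a real build.rs would use an established stemming algorithm.

```rust
use std::collections::HashSet;

/// Toy stemmer for illustration only: strips a few common suffixes.
/// A real implementation would use a proper stemming library.
fn stem(word: &str) -> &str {
    for suffix in ["ungen", "ung", "en", "e", "s"] {
        if let Some(stripped) = word.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped;
            }
        }
    }
    word
}

/// Keep only entries that cannot be reproduced via stemming from another
/// entry: derivatives whose stem is itself in the list are dropped.
pub fn prune_by_stemming(words: &[&str]) -> Vec<String> {
    let set: HashSet<&str> = words.iter().copied().collect();
    words
        .iter()
        .filter(|word| {
            let s = stem(word);
            s == **word || !set.contains(s)
        })
        .map(|word| (*word).to_string())
        .collect()
}
```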

Unexpected panic while rearranging macro arguments

Invocation:

$ srgn --version
srgn 0.11.0

$ srgn --files '**/*.rs' --rust-query '((macro_invocation macro: (identifier) @name) @macro (#any-of? @name "error" "warn" "info" "debug" "trace" ))' '(?P<msg>.+);(?P<structure>.+)' '$structure,$msg'

Panic:

thread '<unnamed>' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/srgn-0.11.0/src/scoping/scope.rs:60:43:
begin <= end (23 <= 0) when slicing `error!( "server error"; "uuid" => %uuid, "error.user_msg" => %user_msg, src)`

Please let me know if more information is needed.

Feature: `import` handling

So far, all languages srgn offers come with comments and string queries. Those are more or less the common set all languages share.

A great third one is imports: a very common use case is rewriting all imports in a code base, which sometimes cannot be automated using IDE tooling.

Assert scoped view correctness

On building a ScopedViewBuilder, assert that its constituents equal the original input.

Probably make it a hard assert, not a debug assert, as any bug there is a showstopper.

To be a hard assert, it needs to be cheap. For that, a cheap equality method is required (ScopedViewBuilder == &str).
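Such a cheap equality check could walk the builder's fragments against the original input in lockstep, without allocating a concatenation. A sketch as a standalone function (the real check would live on ScopedViewBuilder itself):

```rust
/// Check that a sequence of string fragments, concatenated, equals the
/// original input -- without actually concatenating. Walks both in
/// lockstep, so it is O(n) with no allocation.
pub fn fragments_equal_input(fragments: &[&str], input: &str) -> bool {
    let mut rest = input;
    for &fragment in fragments {
        match rest.strip_prefix(fragment) {
            Some(remainder) => rest = remainder,
            None => return false,
        }
    }
    // All fragments consumed; input must be fully covered too.
    rest.is_empty()
}
```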

Remove dead code

Remove dead code that's part of the public API but no longer required. Sadly, clippy/rust-analyzer cannot warn us of such code.

Implement language grammars

A solid set of some of the most popular ones:

  • Python
  • TypeScript
  • Go
  • Rust
  • C#

And then, for each or at least most, implement all, or most of (if relevant for the language):

  • comments
  • "documentation strings"
  • function names (at definition site)
  • function calls
  • class/struct/enum names (at definition site)
  • strings
  • variable names
  • type annotations

Fix carriage return issue on Windows

CI failed: https://github.com/alexpovel/srgn/actions/runs/6605779360/job/17941307575

On Windows, using tree-sitter-python, comment parsing eats into \r\n: if, e.g., --delete is used, it will mistake the \r as part of the "comment" and delete it, leaving a naked \n. That's wrong for files with CRLF-style line endings.

Idea: during build in ScopedView, check whether \n characters are generally in or out of scope. If all \n are detected as out of scope, shuffle all \r out as well, as some might have been put in scope. This will require copying.

This might also qualify as a bug upstream, but fixing every single tree-sitter parser/grammar, if at all possible, is much harder than fixing it in our application here, for every single use case.
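The proposed heuristic could be sketched like so (operating on owned scope strings for simplicity; the real scope types differ):

```rust
/// Sketch of the CRLF fix: if no scope contains a `\n` at all, any trailing
/// `\r` in a scope was likely swallowed from a CRLF line ending, and is
/// shuffled back out of scope. Requires copying, as noted.
pub fn shuffle_cr_out_of_scope(scopes: Vec<String>) -> Vec<String> {
    let newline_in_scope = scopes.iter().any(|scope| scope.contains('\n'));
    if newline_in_scope {
        return scopes; // Newlines legitimately in scope; leave `\r` alone.
    }
    scopes
        .into_iter()
        .map(|scope| scope.trim_end_matches('\r').to_string())
        .collect()
}
```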

Implement word list performance improvements

Current issues are:

  • we still include compound words in the word list, even though we now have an algorithm in place to check for compound words at runtime (which is reasonably cheap)
  • the Linux (unknown, x86-64) binary is 120 MB (whoops...), the Windows one c. 70 MB
  • compilation takes a minute on Windows and 21 minutes in WSL (???)

Solutions are:

  • filter word list to no longer contain compound words (use existing logic in Rust or a Python script)

  • do not use a &[&str]: it means storing a (usize, usize) (address, length) pointer 2.5 million times (current word list length as of 37aff4d), roughly doubling the binary size ((32_600_000 + 2_152_639 * (2 * 64 / 8)) / 1_000_000 == 67.042224, aka the c. 70 MB observed on Windows; why Linux is much higher still on the same arch, no idea).

    Instead, check and hope that the longest word in the list is reasonably short. Let's call that length $x$, and hope it's in the ballpark of, say, 20 bytes. Pad all other words with lengths smaller than $x$ with trailing \0s (or whatever...) until they're all of length $x$ as well. Store the result in a single &str, whose pointer/length info now has trivial overhead compared to the multi-megabyte single string. Store $x$ in a const (or static...), and implement a simple binary search over that manually. We get the same important characteristics:

    • performance (same binary search, which is easy and possible as all element lengths are known at compile time, like a regular &[T: Sized])
    • compile-time data structure with zero runtime cost (unlike a hashset, or any form of de/ser... although it sounds an awful lot like badly reimplementing a part of Capnproto)

    but gain the core advantage of no longer wasting a (usize, usize) per word. On 64-bit, that tuple is 16 bytes, whereas the average word length (in chars, not bytes) is 14.6, which in bytes should come out to about 16 as well. Hence, there's pretty much 100% overhead. With a single string, that's reduced to a single tuple. Compilation times are then down to utterly reasonable levels as well (15 s on both platforms), and binary sizes are down to a tad over the string length, aka there's no overhead anymore (also confirmed on both platforms).

    Storing the single string in a &str already gives us UTF-8 safety, but the binary search could still go awry. Definitely unit-test that thoroughly.

    Update: now using uneven search instead of the padding approach; see below.
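For reference, the padded single-&str scheme described above can be sketched like this. The word list here is a tiny illustrative stand-in (the real one is generated at build time), and the shipped implementation moved to uneven search instead:

```rust
/// All words padded with '\0' to a fixed WIDTH, sorted, and concatenated
/// into one string. Slicing at multiples of WIDTH recovers each entry, so
/// a manual binary search works just like over a regular sorted slice.
const WIDTH: usize = 8;
const WORDS: &str = "apfel\0\0\0banane\0\0kirsche\0"; // illustrative only

pub fn word_list_contains(word: &str) -> bool {
    let count = WORDS.len() / WIDTH;
    let (mut lo, mut hi) = (0, count);
    while lo < hi {
        let mid = (lo + hi) / 2;
        // ASCII + '\0' padding here, so slicing at byte offsets is safe;
        // real data would need care around UTF-8 boundaries.
        let entry = WORDS[mid * WIDTH..(mid + 1) * WIDTH].trim_end_matches('\0');
        match entry.cmp(word) {
            std::cmp::Ordering::Equal => return true,
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
        }
    }
    false
}
```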

Get rid of `common` crate

It's always tempting to have it, but it's also a smell. A first step was taken in 5664118 , using itertools' powerset.

Remaining items are:

  • instrament: can be moved back into core; it's not used anywhere else currently anyway
  • strings.titlecase: looked around, as it seems very likely there's a crate for it, but no dice (funny that this is so "hard").
  • binary_search_uneven: currently only lives externally because of benchmarks, as Criterion benchmarks can only use the public API
  • is_compound_word: small function but unlikely to find a suitable crate for that. Lives externally because build.rs prepares the word list using that same algorithm (so that the processed word list doesn't contain compound words, as that would be wasted space)
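A minimal sketch of the compound-word check's shape (recursive split against the word set; the actual implementation in the repository may differ):

```rust
use std::collections::HashSet;

/// A word is a compound if it splits into a known head plus a tail that is
/// either itself known or again a compound. Byte indices are guarded so
/// multi-byte UTF-8 characters are never split.
pub fn is_compound_word(word: &str, words: &HashSet<&str>) -> bool {
    for i in 1..word.len() {
        if !word.is_char_boundary(i) {
            continue;
        }
        let (head, tail) = word.split_at(i);
        if words.contains(head) && (words.contains(tail) || is_compound_word(tail, words)) {
            return true;
        }
    }
    false
}
```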

tests fail when building from tarball due to calling git restore

---- tests::test_cli_files::case_1 stdout ----
Running: "git" "restore" "tests/files-option/basic-python/in"
thread 'tests::test_cli_files::case_1' panicked at tests/cli.rs:106:24:
Head restoration to not fail: Os { code: 2, kind: NotFound, message: "No such file or directory" }

I'm working on packaging this for nixpkgs, and currently I have to disable the check phase due to this.

Maybe check for the existence of .git and skip these tests if it's not found? Or refactor so that files are copied to a temporary directory first, before mutating them in place?
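The temporary-directory approach could look like this (helper name hypothetical): tests mutate the copy, so neither .git nor git restore is needed.

```rust
use std::{env, fs, path::PathBuf};

/// Write a fixture into a scratch directory under the system temp dir and
/// return its path. Tests mutate this copy instead of the checked-in file,
/// so no `git restore` is needed afterwards.
pub fn fixture_in_scratch_dir(name: &str, contents: &str) -> PathBuf {
    let dir = env::temp_dir().join("srgn-test-scratch");
    fs::create_dir_all(&dir).expect("failed to create scratch dir");
    let dest = dir.join(name);
    fs::write(&dest, contents).expect("failed to write fixture copy");
    dest
}
```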

Look into rayon

I briefly looked into multi-threading in #3 and found it not worth it. However, having used rayon and being able to very quickly benefit from it (69f5f23) was impressive. Perhaps it's worth it. Would probably require reading all of stdin at once, then handing it to par_iter, and not just iterating over its lines one by one. We'd get multi-threading for free, but it might be slower for small inputs (which for my use case represent basically all inputs).
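The read-everything-then-parallelize shape can be sketched with scoped threads standing in for rayon's par_iter (std-only, so the sketch carries no extra dependency):

```rust
use std::thread;

/// Read the whole input up front, then process lines in parallel chunks
/// instead of streaming one line at a time. Output order is preserved
/// because chunks are joined in submission order.
pub fn process_lines_in_parallel(input: &str, f: fn(&str) -> String) -> Vec<String> {
    let lines: Vec<&str> = input.lines().collect();
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_size = ((lines.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|line| f(line)).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For small inputs the thread overhead likely dominates, matching the concern above.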
