alexpovel / srgn

A code surgeon for precise text and code transplantation. A marriage of `tr`/`sed`, `rg` and `tree-sitter`.

Home Page: https://crates.io/crates/srgn/

License: MIT License

Rust 92.08% Just 1.18% Python 1.89% C# 1.88% TypeScript 1.29% Go 1.06% Shell 0.61%
csharp go python regex rust rust-lang tree-sitter typescript abstract-syntax-tree grep

srgn - a code surgeon

A code surgeon for precise text and code transplantation.

Born a Unicode-capable descendant of tr, srgn adds useful actions, acting within precise, optionally language grammar-aware scopes. It suits use cases where...

  • regex doesn't cut it anymore,
  • editor tools such as Rename all are too specific, and not automatable,
  • precise manipulation, not just matching, is required, and lastly and optionally,
  • Unicode-specific trickery is desired.

Usage

For an "end-to-end" example, consider this Python snippet (more languages are supported):

"""GNU module."""

def GNU_says_moo():
    """The GNU -> say moo -> ✅"""

    GNU = """
      GNU
    """  # the GNU...

    print(GNU + " says moo")  # ...says moo

which with an invocation of

cat gnu.py | srgn --python 'doc-strings' '(?<!The )GNU' 'GNU 🐂 is not Unix' | srgn --symbols

can be manipulated to read

"""GNU 🐂 is not Unix module."""

def GNU_says_moo():
    """The GNU → say moo → ✅"""

    GNU = """
      GNU
    """  # the GNU...

    print(GNU + " says moo")  # ...says moo

where the changes are limited to:

- """GNU module."""
+ """GNU 🐂 is not Unix module."""

  def GNU_says_moo():
-     """The GNU -> say moo -> ✅"""
+     """The GNU → say moo → ✅"""

      GNU = """
        GNU
      """  # the GNU...

      print(GNU + " says moo")  # ...says moo

which demonstrates:

  • language grammar-aware operation: only Python docstrings were manipulated; virtually impossible to replicate in just regex

    Skip ahead to more such showcases below.

  • advanced regex features such as, in this case, negative lookbehind are supported

  • Unicode is natively handled

  • features such as ASCII symbol replacement are provided

Hence the concept of surgical operation: srgn allows you to be quite precise about the scope of your actions, combining the power of both regular expressions and parsers.

Note

Without exception, all bash and console code snippets in this README are automatically tested using the actual program binary, facilitated by a tiny bash interpreter. What is showcased here is guaranteed to work.

Installation

Prebuilt binaries

Download a prebuilt binary from the releases.

cargo-binstall

This crate provides its binaries in a format compatible with cargo-binstall:

  1. Install the Rust toolchain
  2. Run cargo install cargo-binstall (might take a while)
  3. Run cargo binstall srgn (couple seconds, as it downloads prebuilt binaries from GitHub)

These steps are guaranteed to work™, as they are tested in CI. They also work if no prebuilt binaries are available for your platform, as the tool will fall back to compiling from source.

CI (GitHub Actions)

All GitHub Actions runner images come with cargo preinstalled, and cargo-binstall provides a convenient GitHub Action:

jobs:
  srgn:
    name: Install srgn in CI
    # All three major OSes work
    runs-on: ubuntu-latest
    steps:
      - uses: cargo-bins/cargo-binstall@main
      - name: Install binary
        run: >
          cargo binstall
          --no-confirm
          srgn
      - name: Use binary
        run: srgn --version

The above concludes in just 5 seconds total, as no compilation is required. For more context, see cargo-binstall's advice on CI.

Cargo (compile from source)

  1. Install the Rust toolchain
  2. A C compiler is required:
    1. On Linux, gcc works (tested).

    2. On macOS, try clang (untested).

    3. On Windows, MSVC works (tested).

      Select "Desktop development with C++" on installation.

  3. Run cargo install srgn

Cargo (as a Rust library)

cargo add srgn

See here for more.

Shell completions

Various shells are supported for shell completion scripts. For example, append eval "$(srgn --completions zsh)" to ~/.zshrc for completions in ZSH.

Walkthrough

The tool is designed around scopes and actions. Scopes narrow down the parts of the input to process. Actions then perform the processing. Generally, both scopes and actions are composable, so more than one of each may be passed. Both are optional (but taking no action is pointless); specifying no scope implies the entire input is in scope.

At the same time, there is considerable overlap with plain tr: the tool is designed to have close correspondence in the most common use cases, and only go beyond when needed.
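The split between scopes and actions can be pictured in a few lines of Python. This is only a conceptual sketch, not how srgn is implemented: `apply_action` is a hypothetical helper, and Python's `re` module stands in for the regex engine srgn actually uses.

```python
import re
from typing import Callable


def apply_action(scope: str, action: Callable[[str], str], text: str) -> str:
    """Run `action` on every part of `text` matched by `scope`.

    Everything outside the scope is passed through untouched.
    """
    return re.sub(scope, lambda m: action(m.group()), text)


# Scope: words of at most 8 characters. Action: uppercasing.
print(apply_action(r"\b\w{1,8}\b", str.upper, "Koeffizienten != Bruecken..."))
```

Only the parts selected by the scope ever reach the action, which is why narrowing the scope is the main lever for precision.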

Actions

The simplest action is replacement. It is specially accessed (as an argument, not an option) for compatibility with tr, and general ergonomics. All other actions are given as flags, or options should they take a value.

Replacement

For example, simple, single-character replacements work as in tr:

$ echo 'Hello, World!' | srgn 'H' 'J'
Jello, World!

The first argument is the scope (literal H in this case). Anything matched by it is subject to processing (replacement by J, the second argument, in this case). However, there is no direct concept of character classes as in tr. Instead, by default, the scope is a regular expression pattern, so its classes can be used to similar effect:

$ echo 'Hello, World!' | srgn '[a-z]' '_'
H____, W____!

The replacement occurs greedily across the entire match by default (note the UTS character class, reminiscent of tr's [:alnum:]):

$ echo 'ghp_oHn0As3cr3T!!' | srgn 'ghp_[[:alnum:]]+' '*' # A GitHub token
*!!

However, in the presence of capture groups, the individual characters comprising a capture group match are treated individually for processing, allowing a replacement to be repeated:

$ echo 'Hide ghp_th15 and ghp_th4t' | srgn '(ghp_[[:alnum:]]+)' '*'
Hide ******** and ********
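The repeated replacement above can be mimicked in Python by multiplying the replacement by the length of each match. This is a rough analogue under stated assumptions, not srgn's implementation: `mask` is a hypothetical name, and the ASCII range `[0-9A-Za-z]` stands in for `[[:alnum:]]`.

```python
import re


def mask(scope: str, replacement: str, text: str) -> str:
    """Replace every character of each match with `replacement`."""
    return re.sub(scope, lambda m: replacement * len(m.group()), text)


print(mask(r"ghp_[0-9A-Za-z]+", "*", "Hide ghp_th15 and ghp_th4t"))
```

Because the length of each secret is preserved, the output still hints at how long the masked values were; replace the whole match with a single string if that matters.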

Advanced regex features are supported, for example lookarounds:

$ echo 'ghp_oHn0As3cr3T' | srgn '(?<=ghp_)([[:alnum:]]+)' '*'
ghp_***********

Take care in using these safely, as advanced patterns come without certain safety and performance guarantees. If they aren't used, performance is not impacted.

The replacement is not limited to a single character. It can be any string, for example to fix this quote:

$ echo '"Using regex, I now have no issues."' | srgn 'no issues' '2 problems'
"Using regex, I now have 2 problems."

The tool is fully Unicode-aware, with useful support for certain advanced character classes:

$ echo 'Mood: 🙂' | srgn '🙂' '😀'
Mood: 😀
$ echo 'Mood: 🤮🤒🤧🦠 :(' | srgn '\p{Emoji_Presentation}' '😷'
Mood: 😷😷😷😷 :(

Beyond replacement

Seeing how the replacement is merely a static string, its usefulness is limited. This is where tr's secret sauce ordinarily comes into play: using its character classes, which are valid in the second position as well, neatly translating from members of the first to the second. Here, those classes are instead regexes, and only valid in first position (the scope). A regular expression being a state machine, it is impossible to match onto a 'list of characters', which in tr is the second (optional) argument. That concept is out the window, and its flexibility lost.

Instead, the offered actions, all of them fixed, are used. A peek at the most common use cases for tr reveals that the provided set of actions covers virtually all of them! Feel free to file an issue if your use case is not covered.

Onto the next action.

Deletion

Removes whatever is found from the input. Same flag name as in tr.

$ echo 'Hello, World!' | srgn -d '(H|W|!)'
ello, orld

Note

As the default scope is to match the entire input, it is an error to specify deletion without a scope.

Squeezing

Squeezes repeats of characters matching the scope into single occurrences. Same flag name as in tr.

$ echo 'Helloooo Woooorld!!!' | srgn -s '(o|!)'
Hello World!

If a character class is passed, all members of that class are squeezed into whatever class member was encountered first:

$ echo 'The number is: 3490834' | srgn -s '\d'
The number is: 3

Greediness in matching is not modified, so take care:

$ echo 'Winter is coming... 🌞🌞🌞' | srgn -s '🌞+'
Winter is coming... 🌞🌞🌞

Note

The pattern matched the entire run of suns, so there's nothing to squeeze. Summer prevails.

Invert greediness if the use case calls for it:

$ echo 'Winter is coming... 🌞🌞🌞' | srgn -s '🌞+?' '☃️'
Winter is coming... ☃️

Note

Again, as with deletion, specifying squeezing without an explicit scope is an error. Otherwise, the entire input is squeezed.
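The squeezing behavior shown in this section can be approximated in Python as "drop every match that starts exactly where the previous match ended". This is a sketch of the observable behavior only; `squeeze` is a hypothetical helper and srgn's internals may work differently.

```python
import re


def squeeze(scope: str, text: str) -> str:
    """Collapse runs of directly adjacent matches into the first match of the run."""
    out = []
    pos = 0          # end of the last consumed match
    prev_end = None  # end of the previous match, to detect adjacency
    for m in re.finditer(scope, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        out.append(text[pos:m.start()])
        if m.start() != prev_end:
            out.append(m.group())  # first match of a run is kept
        prev_end = pos = m.end()
    out.append(text[pos:])
    return "".join(out)


print(squeeze(r"(o|!)", "Helloooo Woooorld!!!"))
print(squeeze(r"\d", "The number is: 3490834"))
```

A greedy pattern such as `🌞+` yields a single match for the whole run, so there are no adjacent matches left to collapse, which matches the "Summer prevails" example.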

Character casing

A good chunk of tr usage falls into this category. It's very straightforward.

$ echo 'Hello, World!' | srgn --lower
hello, world!
$ echo 'Hello, World!' | srgn --upper
HELLO, WORLD!
$ echo 'hello, world!' | srgn --titlecase
Hello, World!

Normalization

Decomposes input according to Normalization Form D, and then discards code points of the Mark category (see examples). That roughly means: take fancy character, rip off dangly bits, throw those away.

$ echo 'Naïve jalapeño ärgert mgła' | srgn -d '\P{ASCII}' # Naive approach
Nave jalapeo rgert mga
$ echo 'Naïve jalapeño ärgert mgła' | srgn --normalize # Normalize is smarter
Naive jalapeno argert mgła

Notice how mgła is out of scope for NFD, as it is "atomic" and thus not decomposable (at least that's what ChatGPT whispers in my ear).
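The two steps described above (decompose to NFD, then discard code points of the Mark category) can be reproduced with Python's standard `unicodedata` module. This is an illustration of the technique rather than srgn's own code; `strip_marks` is a hypothetical name.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop all code points in a Mark category (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("Naïve jalapeño ärgert mgła"))
```

The ł in mgła survives because it has no canonical decomposition into a base letter plus a combining mark.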

Symbols

This action replaces multi-character, ASCII symbols with appropriate single-code point, native Unicode counterparts.

$ echo '(A --> B) != C --- obviously' | srgn --symbols
(A ⟶ B) ≠ C — obviously

Alternatively, if you're only interested in math, make use of scoping:

$ echo 'A <= B --- More is--obviously--possible' | srgn --symbols '<='
A ≤ B --- More is--obviously--possible

As there is a 1:1 correspondence between an ASCII symbol and its replacement, the effect is reversible (see footnote 1):

$ echo 'A ⇒ B' | srgn --symbols --invert
A => B

There is only a limited set of symbols supported as of right now, but more can be added.

German

This action replaces alternative spellings of German special characters (ae, oe, ue, ss) with their native versions (ä, ö, ü, ß) (see footnote 2).

$ echo 'Gruess Gott, Neueroeffnungen, Poeten und Abenteuergruetze!' | srgn --german
Grüß Gott, Neueröffnungen, Poeten und Abenteuergrütze!

This action is based on a word list (compile without german feature if this bloats your binary too much). Note the following features about the above example:

  • empty scope and replacement: the entire input will be processed, and no replacement is performed
  • Poeten remained as-is, instead of being naively and mistakenly converted to Pöten
  • as a (compound) word, Abenteuergrütze is not going to be found in any reasonable word list, but was handled properly nonetheless
  • while part of a compound word, Abenteuer remained as-is as well, instead of being incorrectly converted to Abenteür
  • lastly, Neueroeffnungen sneakily forms a ue element neither constituent word (neu, Eröffnungen) possesses, but is still processed correctly (despite the mismatched casings as well)

On request, replacements may be forced, as is potentially useful for names:

$ echo 'Frau Loetter steht ueber der Mauer.' | srgn --german-naive '(?<=Frau )\w+'
Frau Lötter steht ueber der Mauer.

Through positive lookbehind, only the name following the salutation was scoped and therefore changed. Mauer correctly remained as-is, but ueber was not processed. A second pass fixes this:

$ echo 'Frau Loetter steht ueber der Mauer.' | srgn --german-naive '(?<=Frau )\w+' | srgn --german
Frau Lötter steht über der Mauer.

Note

Options and flags pertaining to some "parent" are prefixed with their parent's name, and will imply their parent when given, such that the latter does not need to be passed explicitly. That's why --german-naive is named as it is, and --german needn't be passed.

This behavior might change once clap supports subcommand chaining.

Some branches are undecidable for this modest tool, as it operates without language context. For example, both Busse (busses) and Buße (penance) are legal words. By default, replacements are greedily performed if legal (that's the whole point of srgn, after all), but there's a flag for toggling this behavior:

$ echo 'Busse und Geluebte 🙏' | srgn --german
Buße und Gelübte 🙏
$ echo 'Busse 🚌 und Fussgaenger 🚶‍♀️' | srgn --german-prefer-original
Busse 🚌 und Fußgänger 🚶‍♀️

Combining Actions

Most actions are composable, unless doing so were nonsensical (like for deletion). Their order of application is fixed, so the order of the flags given has no influence (piping multiple runs is an alternative, if needed). Replacements always occur first. Generally, the CLI is designed to prevent misuse and surprises: it prefers crashing to doing something unexpected (which is subjective, of course). Note that lots of combinations are technically possible, but might yield nonsensical results.

Combining actions might look like:

$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu
KOEFFIZIENTEN ≠ BRÜCKEN...

A more narrow scope can be specified, and will apply to all actions equally:

$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu '\b\w{1,8}\b'
Koeffizienten != BRÜCKEN...

The word boundaries are required as otherwise Koeffizienten is matched as Koeffizi and enten. Note how the trailing periods cannot be, for example, squeezed. The required scope of \. would interfere with the given one. Regular piping solves this:

$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu '\b\w{1,8}\b' | srgn -s '\.'
Koeffizienten != BRÜCKEN.

Note: regex escaping (\.) can be circumvented using literal scoping. The specially treated replacement action is also composable:

$ echo 'Mooood: 🤮🤒🤧🦠!!!' | srgn -s '\p{Emoji}' '😷'
Mooood: 😷!!!

Emojis are first all replaced, then squeezed. Notice how nothing else is squeezed.

Scopes

Scopes are the second driving concept to srgn. In the default case, the main scope is a regular expression. The actions section showcased this use case in some detail, so it's not repeated here. It is given as a first positional argument.

Language grammar-aware scopes

srgn extends this through premade, language grammar-aware scopes, made possible through the excellent tree-sitter library. It offers a queries feature, which works much like pattern matching against a tree data structure.

srgn comes bundled with a handful of the most useful of these queries. Through its discoverable API (either as a library or via CLI, srgn --help), one can learn of the supported languages and available, premade queries. Each supported language comes with an escape hatch, allowing you to run your own, custom ad-hoc queries. The hatch comes in the form of --lang-query <S EXPRESSION>, where lang is a language such as python. See below for more on this advanced topic.

Note

Language scopes are applied first, so whatever regex aka main scope you pass, it operates on each matched language construct individually.
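The "language scope first, regex second" ordering can be illustrated with Python's standard `tokenize` module, which is used here in place of tree-sitter. This is a simplified sketch, assuming Python source as input; `sub_in_comments` is a hypothetical helper and only mimics a 'comments' scope.

```python
import io
import re
import tokenize


def sub_in_comments(source: str, pattern: str, replacement: str) -> str:
    """Apply a regex replacement inside Python comments only."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    comments = [tok for tok in tokens if tok.type == tokenize.COMMENT]
    for tok in comments:
        # A Python comment never spans lines, and a line holds at most one,
        # so editing one comment cannot shift the offsets of another.
        row, start = tok.start
        _, end = tok.end
        line = lines[row - 1]
        scoped = re.sub(pattern, replacement, tok.string)  # regex sees the comment only
        lines[row - 1] = line[:start] + scoped + line[end:]
    return "".join(lines)


source = 'TODO = "TODO: not a comment"  # TODO: fix this\nx = 1  # TODO list\n'
print(sub_in_comments(source, r"TODO(?=:)", "TODO(@poorguy)"), end="")
```

The string literal contains the same text as the comment but is left alone, because the regex is never run against anything outside the grammar-level scope.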

Premade queries (sample showcases)

This section shows examples for some of the premade queries.

Mass import (module) renaming (Python, Rust)

As part of a large refactor (say, after an acquisition), imagine all imports of a specific package needed renaming:

import math
from pathlib import Path

import good_company.infra
import good_company.aws.auth as aws_auth
from good_company.util.iter import dedupe
from good_company.shopping.cart import *  # Ok but don't do this at home!

good_company = "good_company"  # good_company

At the same time, a move to src/ layout is desired. Achieve this move with:

cat imports.py | srgn --python 'imports' '^good_company' 'src.better_company'

which will yield

import math
from pathlib import Path

import src.better_company.infra
import src.better_company.aws.auth as aws_auth
from src.better_company.util.iter import dedupe
from src.better_company.shopping.cart import *  # Ok but don't do this at home!

good_company = "good_company"  # good_company

Note how the last line remains untouched by this particular operation. To run across many files, see the files option.

Similar import-related edits are supported for other languages as well, for example Rust:

use std::collections::HashMap;

use good_company::infra;
use good_company::aws::auth as aws_auth;
use good_company::util::iter::dedupe;
use good_company::shopping::cart::*;

good_company = "good_company";  // good_company

which, using

cat imports.rs | srgn --rust 'uses' '^good_company' 'better_company'

becomes

use std::collections::HashMap;

use better_company::infra;
use better_company::aws::auth as aws_auth;
use better_company::util::iter::dedupe;
use better_company::shopping::cart::*;

good_company = "good_company";  // good_company

Assigning TODOs (TypeScript)

Perhaps you're using a system of TODO notes in comments:

class TODOApp {
    // TODO app for writing TODO lists
    addTodo(todo: TODO): void {
        // TODO: everything, actually 🤷‍♀️
    }
}

and usually assign people to each note. It's possible to automate assigning yourself to every unassigned note (lucky you!) using

cat todo.ts | srgn --typescript 'comments' 'TODO(?=:)' 'TODO(@poorguy)'

which in this case gives

class TODOApp {
    // TODO app for writing TODO lists
    addTodo(todo: TODO): void {
        // TODO(@poorguy): everything, actually 🤷‍♀️
    }
}

Notice the positive lookahead of (?=:), ensuring an actual TODO note is hit (TODO:). Otherwise, the other TODOs mentioned around the comments would be matched as well.

Converting print calls to proper logging (Python)

Say there's code making liberal use of print:

def print_money():
    """Let's print money 💸."""

    amount = 32
    print("Got here.")

    print_more = lambda s: print(f"Printed {s}")
    print_more(23)  # print the stuff

print_money()
print("Done.")

and a move to logging is desired. That's fully automated by a call of

cat money.py | srgn --python 'function-calls' '^print$' 'logging.info'

yielding

def print_money():
    """Let's print money 💸."""

    amount = 32
    logging.info("Got here.")

    print_more = lambda s: logging.info(f"Printed {s}")
    print_more(23)  # print the stuff

print_money()
logging.info("Done.")

Note

Note the anchors: print_more is a function call as well, but ^print$ ensures it's not matched.

The regular expression applies after grammar scoping, so operates entirely within the already-scoped context.

Remove all comments (C#)

Overdone, comments can turn into smells. If not tended to, they might very well start lying:

using System.Linq;

public class UserService
{
    private readonly AppDbContext _dbContext;

    /// <summary>
    /// Initializes a new instance of the <see cref="FileService"/> class.
    /// </summary>
    /// <param name="dbContext">The configuration for manipulating text.</param>
    public UserService(AppDbContext dbContext)
    {
        _dbContext /* the logging context */ = dbContext;
    }

    /// <summary>
    /// Uploads a file to the server.
    /// </summary>
    // Method to log users out of the system
    public void DoWork()
    {
        _dbContext.Database.EnsureCreated(); // Ensure the database schema is deleted

        _dbContext.Users.Add(new User /* the car */ { Name = "Alice" });

        /* Begin reading file */
        _dbContext.SaveChanges();

        var user = _dbContext.Users.Where(/* fetch products */ u => u.Name == "Alice").FirstOrDefault();

        /// Delete all records before proceeding
        if (user /* the product */ != null)
        {
            System.Console.WriteLine($"Found user with ID: {user.Id}");
        }
    }
}

So, should you count purging comments among your fetishes, more power to you:

cat UserService.cs | srgn --csharp 'comments' -d '.*' | srgn -d '[[:blank:]]+\n'

The result is a tidy, yet taciturn:

using System.Linq;

public class UserService
{
    private readonly AppDbContext _dbContext;

    public UserService(AppDbContext dbContext)
    {
        _dbContext  = dbContext;
    }

    public void DoWork()
    {
        _dbContext.Database.EnsureCreated();
        _dbContext.Users.Add(new User  { Name = "Alice" });

        _dbContext.SaveChanges();

        var user = _dbContext.Users.Where( u => u.Name == "Alice").FirstOrDefault();

        if (user  != null)
        {
            System.Console.WriteLine($"Found user with ID: {user.Id}");
        }
    }
}

Note how all different sorts of comments were identified and removed. The second pass removes all leftover dangling lines ([:blank:] is tabs and spaces).

Note

When deleting (-d), for reasons of safety and sanity, a scope is required.

Custom queries

Custom queries allow you to create ad-hoc scopes. These can be useful for building small, tailor-made linters, for example to catch code such as:

if x:
    return left
else:
    return right

with an invocation of

cat cond.py | srgn --python-query '(if_statement consequence: (block (return_statement (identifier))) alternative: (else_clause body: (block (return_statement (identifier))))) @cond' --fail-any # will fail

to hint that the code can be more idiomatically rewritten as return left if x else right. Another example, this one in Go, is ensuring sensitive fields are not serialized:

package main

type User struct {
    Name  string `json:"name"`
    Token string `json:"token"`
}

which can be caught as:

cat sensitive.go | srgn --go-query '(field_declaration name: (field_identifier) @name tag: (raw_string_literal) @tag (#match? @name "[tT]oken") (#not-eq? @tag "`json:\"-\"`"))' --fail-any # will fail
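For comparison, the Python check from the start of this section (an if/else where both branches return a bare identifier) can be approximated with Python's built-in `ast` module instead of a tree-sitter query. This is a sketch and not equivalent to the query: `has_if_else_return` is a hypothetical name, and it is stricter in that each branch must consist of exactly one return statement.

```python
import ast


def has_if_else_return(source: str) -> bool:
    """Detect `if ...: return a` / `else: return b` where both branches return a name."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.If):
            continue
        if len(node.body) != 1 or len(node.orelse) != 1:
            continue
        branches = (node.body[0], node.orelse[0])
        if all(
            isinstance(stmt, ast.Return) and isinstance(stmt.value, ast.Name)
            for stmt in branches
        ):
            return True
    return False


code = "def pick(x, left, right):\n    if x:\n        return left\n    else:\n        return right\n"
print(has_if_else_return(code))
```

A tree-sitter query has the advantage of working the same way across all supported languages, whereas this approach is tied to Python.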

Ignoring parts of matches

Occasionally, parts of a match need to be ignored, for example when no suitable tree-sitter node type is available. Say we'd like to replace the error with wrong inside the string of the macro body:

fn wrong() {
    let wrong = "wrong";
    error!("This went error");
}

Let's assume there's a node type for matching entire macros (macro_invocation) and one to match macro names (((macro_invocation macro: (identifier) @name))), but none to match macro contents (this is wrong, tree-sitter offers this in the form of token_tree, but let's imagine...). To match just "This went error", the entire macro would need to be matched, with the name part ignored. Any capture name containing IGNORE will provide just that:

cat wrong.rs | srgn --rust-query '((macro_invocation macro: (identifier) @IGNORE_name) @macro)' 'error' 'wrong'
fn wrong() {
    let wrong = "wrong";
    error!("This went wrong");
}

If it weren't ignored, the result would read wrong!("This went wrong");.

Further reading

These matching expressions are a mouthful. A couple resources exist for getting started with your own queries:

Run against multiple files

Use the --files option to run against multiple files, in-place. This option accepts a glob pattern. The glob is processed within srgn: it must be quoted to prevent premature shell interpretation.

srgn processes files fully in parallel, using all available threads. For example, 450k lines of Python are processed in about a second, altering over 1000 lines across a couple hundred files:

[Figure: hyperfine benchmarks for the files option]

Run the benchmarks to see performance for your own system.
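The general shape of "expand a glob inside the program, then rewrite matching files in place, in parallel" can be sketched in Python as follows. Everything here is an assumption made for illustration: `process_files` is a hypothetical helper, Python's `glob` syntax may differ from the glob syntax srgn accepts, and a thread pool merely stands in for srgn's parallelism.

```python
import glob
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def process_files(pattern: str, transform: Callable[[str], str]) -> List[str]:
    """Apply `transform` to each file matching `pattern`; return paths that changed."""

    def process(path: str) -> bool:
        file = Path(path)
        before = file.read_text(encoding="utf-8")
        after = transform(before)
        if after != before:
            file.write_text(after, encoding="utf-8")  # in-place rewrite
        return after != before

    # The glob is expanded here, not by the shell, hence the need for quoting.
    paths = sorted(p for p in glob.glob(pattern, recursive=True) if Path(p).is_file())
    with ThreadPoolExecutor() as pool:
        changed = list(pool.map(process, paths))
    return [path for path, did_change in zip(paths, changed) if did_change]
```

A call such as `process_files("src/**/*.py", lambda s: s.replace("good_company", "better_company"))` would then touch only the files whose contents actually change.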

Explicit failure for (mis)matches

After all scopes are applied, it might turn out no matches were found. The default behavior is to silently succeed:

$ echo 'Some input...' | srgn --delete '\d'
Some input...

The output matches the specification: all digits are removed. There just happened to be none. No matter how many actions are applied, the input is returned unprocessed once this situation is detected. Hence, no unnecessary work is done.

One might prefer receiving explicit feedback (exit code other than zero) on failure:

echo 'Some input...' | srgn --delete --fail-none '\d'  # will fail

The inverse scenario is also supported: failing if anything matched. This is useful for checks (for example, in CI) against "undesirable" content. This works much like a custom, ad-hoc linter.

Take for example "old-style" Python code, where type hints are not yet surfaced to the syntax-level:

def square(a):
    """Squares a number.

    :param a: The number (type: int or float)
    """

    return a**2

This style can be checked against and "forbidden" using:

cat oldtyping.py | srgn --python 'doc-strings' --fail-any 'param.+type'  # will fail
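The two failure modes can be summarized as a small decision function. This is a sketch of the logic as described above, not of srgn itself: `exit_code` is a hypothetical helper, and the concrete non-zero value 1 is an assumption (the text only promises an exit code other than zero).

```python
import re


def exit_code(pattern: str, text: str, fail_any: bool = False, fail_none: bool = False) -> int:
    """Return 0 on success and 1 on failure (the value 1 is an assumption)."""
    matched = re.search(pattern, text) is not None
    if fail_any and matched:
        return 1  # something "undesirable" was found
    if fail_none and not matched:
        return 1  # nothing was found although a match was expected
    return 0


print(exit_code(r"\d", "Some input..."))                  # default: silent success
print(exit_code(r"\d", "Some input...", fail_none=True))  # explicit failure
```

In a CI job, the fail-any variant is the one that behaves like a linter: the step fails as soon as forbidden content shows up.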

Literal scope

This causes whatever was passed as the regex scope to be interpreted literally. Useful for scopes containing lots of special characters that otherwise would need to be escaped:

$ echo 'stuff...' | srgn -d --literal-string '.'
stuff
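The difference between a regex scope and a literal scope is the same as the difference between a raw pattern and an escaped one. A minimal Python illustration, with `delete_literal` as a hypothetical helper and `re.escape` standing in for literal interpretation:

```python
import re


def delete_literal(literal: str, text: str) -> str:
    """Delete all occurrences of `literal`, treating it as plain text, not as a regex."""
    return re.sub(re.escape(literal), "", text)


print(delete_literal(".", "stuff..."))  # only the dots are removed
print(repr(re.sub(".", "", "stuff...")))  # as a regex, '.' matches every character
```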

Rust library

While this tool is CLI-first, it is library-very-close-second, and library usage is treated as a first-class citizen just the same. See the library documentation for more, library-specific details.

Note that the binary takes precedence, though. As the crate is currently both a library and a binary, this creates problems, which might be fixed in the future.

Status and stats

[Badges: docs.rs, codecov, crates, dependency status, Lines of Code, Hits-of-Code]

Note: these apply to the entire repository, including the binary.

Code coverage icicle graph

The code is currently structured as (color indicates coverage):

[Figure: code coverage icicle graph]

Hover over the rectangles for file names.

Contributing

To see how to build, refer to compiling from source. Otherwise, refer to the guidelines.

Similar tools

An unordered list of similar tools you might be interested in.

Comparison with tr

srgn is inspired by tr, and in its simplest form behaves similarly, but not identically. In theory, tr is quite flexible. In practice, it is commonly used mainly across a couple specific tasks. Next to its two positional arguments ('arrays of characters'), one finds four flags:

  1. -c, -C, --complement: complement the first array
  2. -d, --delete: delete characters in the first array
  3. -s, --squeeze-repeats: squeeze repeats of characters in the first array
  4. -t, --truncate-set1: truncate the first array to the length of the second

In srgn, these are implemented as follows:

  1. is not available directly as an option; instead, negation of regular expression classes can be used (e.g., [^a-z]), to much more potent, flexible and well-known effect
  2. available (via regex)
  3. available (via regex)
  4. not available: it's inapplicable to regular expressions, not commonly used and, if used, often misused

To show how uses of tr found in the wild can translate to srgn, consider the following section.

Use cases and equivalences

The following sections are the approximate categories much of tr usage falls into. They were found using GitHub's code search. The corresponding queries are given. Results are from the first page of results at the time. The code samples are links to their respective sources.

As the stdin isn't known (usually dynamic), some representative samples are used and the tool is exercised on those.

Identifier Safety

Making inputs safe for use as identifiers, for example as variable names.

Query

  1. tr -C '[:alnum:]_\n' '_'

    Translates to:

    $ echo 'some-variable? 🤔' | srgn '[^[:alnum:]_\n]' '_'
    some_variable___

    Similar examples are:

  2. tr -c '[:alnum:]' _

    Translates to:

    $ echo 'some  variablê' | srgn '[^[:alnum:]]' '_'
    some__variabl_

  3. tr -c -s '[:alnum:]' '-'

    Translates to:

    $ echo '🙂 hellö???' | srgn -s '[^[:alnum:]]' '-'
    -hell-
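The examples in this category show that `[[:alnum:]]` covers ASCII letters and digits only (ê and ö are replaced). A comparable Python version therefore needs explicit ASCII ranges, since Python's `\w` is Unicode-aware. Both function names below are hypothetical, and the sketch only mirrors the outputs shown above.

```python
import re


def to_identifier(text: str, fill: str = "_") -> str:
    """Replace every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", fill, text)


def to_slug(text: str, fill: str = "-") -> str:
    """Like `to_identifier`, but collapse each run of replaced characters into one."""
    return re.sub(r"[^0-9A-Za-z]+", fill, text)


print(to_identifier("some  variablê"))
print(to_slug("🙂 hellö???"))
```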

Literal-to-literal translation

Translates a single, literal character to another, for example to clean newlines.

Query

  1. tr " " ";"

    Translates to:

    $ echo 'x86_64 arm64 i386' | srgn ' ' ';'
    x86_64;arm64;i386

    Similar examples are:

  2. tr '.' "\n":

    Translates to:

    $ echo '3.12.1' | srgn --literal-string '.' '\n'  # Escape sequence works
    3
    12
    1
    $ echo '3.12.1' | srgn '\.' '\n'  # Escape regex otherwise
    3
    12
    1

  3. tr '\n' ','

    Translates to:

    $ echo -ne 'Some\nMulti\nLine\nText' | srgn --literal-string '\n' ','
    Some,Multi,Line,Text

    If escape sequences remain uninterpreted (echo -E, the default), the scope's escape sequence will need to be turned into a literal \ and n as well, as it is otherwise interpreted by the tool as a newline:

    $ echo -nE 'Some\nMulti\nLine\nText' | srgn --literal-string '\\n' ','
    Some,Multi,Line,Text

    Similar examples are:

Removing a character class

Very useful to remove whole categories in one fell swoop.

Query

  1. tr -d '[:punct:]' which they describe as:

    Omit all punctuation characters

    translates to:

    $ echo 'Lots... of... punctuation, man.' | srgn -d '[[:punct:]]'
    Lots of punctuation man

Lots of use cases also call for inverting, then removing a character class.

Query

  1. tr -cd a-z

    Translates to:

    $ echo 'i RLY love LOWERCASING everything!' | srgn -d '[^[:lower:]]'
    iloveeverything

  2. tr -cd 'a-zA-Z0-9'

    Translates to:

    $ echo 'All0wed ??? 💥' | srgn -d '[^[:alnum:]]'
    All0wed

  3. tr -cd '[[:digit:]]'

    Translates to:

    $ echo '{"id": 34987, "name": "Harold"}' | srgn -d '[^[:digit:]]'
    34987

Remove literal character(s)

Identical to replacing them with the empty string.

Query

  1. tr -d "."

    Translates to:

    $ echo '1632485561.123456' | srgn -d '\.'  # Unix timestamp
    1632485561123456

    Similar examples are:

  2. tr -d '\r\n'

    Translates to:

    $ echo -e 'DOS-Style\r\n\r\nLines' | srgn -d '\r\n'
    DOS-StyleLines

    Similar examples are:

Squeeze whitespace

Remove repeated whitespace, as it often occurs when slicing and dicing text.

Query

  1. tr -s '[:space:]'

    Translates to:

    $ echo 'Lots   of  space !' | srgn -s '[[:space:]]'  # Single space stays
    Lots of space !

    Similar examples are:

  2. tr -s ' ' '\n' (squeeze, then replace)

    Translates to:

    $ echo '1969-12-28    13:37:45Z' | srgn -s ' ' 'T'  # ISO8601
    1969-12-28T13:37:45Z

  3. tr -s '[:blank:]' ':'

    Translates to:

    $ echo -e '/usr/local/sbin \t /usr/local/bin' | srgn -s '[[:blank:]]' ':'
    /usr/local/sbin:/usr/local/bin

Changing character casing

A straightforward use case. Upper- and lowercase are often used.

Query

  1. tr A-Z a-z (lowercasing)

    Translates to:

    $ echo 'WHY ARE WE YELLING?' | srgn --lower
    why are we yelling?

    Notice the default scope. It can be refined to lowercase only short words, for example:

    $ echo 'WHY ARE WE YELLING?' | srgn --lower '\b\w{,3}\b'
    why are we YELLING?

    Similar examples are:

  2. tr '[a-z]' '[A-Z]' (uppercasing)

    Translates to:

    $ echo 'why are we not yelling?' | srgn --upper
    WHY ARE WE NOT YELLING?

    Similar examples are:

Footnotes

  1. Currently, reversibility is not possible for any other action. For example, lowercasing is not the inverse of uppercasing. Information is lost, so it cannot be undone. Structure (imagine mixed case) was lost. Something something entropy...

  2. Why is such a bizarre, unrelated feature included? As usual, historical reasons. The original, core version of srgn was merely a Rust rewrite of a previous, existing tool, which was only concerned with the German feature. srgn then grew from there.

srgn's Issues

Test `--files`

A bitch to test, but there's currently no coverage at all. Perhaps ignore what's written out, but at least test files are found, globbing works, processing works, ...

--files currently modifies in-place, so testing before vs. after would be pretty nasty. Perhaps extend with a functionality like --files-ext so users can optionally specify an additional file extension of produced files?

Fix carriage return issue on Windows

CI failed: https://github.com/alexpovel/srgn/actions/runs/6605779360/job/17941307575

On Windows, using tree-sitter-python, comment parsing eats into \r\n: the \r is treated as part of the "comment". If, for example, --delete is used, the \r is deleted along with the comment, leaving a naked \n. That's an error for files with CRLF-style line endings.

Idea: in ScopedView's build, check whether \n are generally in or out of scope. If all \n are detected as out of scope, shuffle all \r out of scope as well, as some might have been put in scope. This will require copying.

This might also qualify as a bug upstream, but fixing every single tree-sitter parser/grammar, if at all possible, is much harder than fixing it in our application here, for every single use case.
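The idea above can be sketched as follows. This is a minimal illustration, not srgn's actual implementation: the `Scope` enum and `shuffle_carriage_returns_out` are hypothetical names, and the sketch only handles a `\r` sitting directly before an out-of-scope `\n`. Because it works on borrowed slices, it happens to avoid copying; the real data structures may not allow that.

```rust
/// A piece of the input, tagged by whether an action may touch it.
/// (Hypothetical stand-in for srgn's real scope types.)
#[derive(Debug, Clone, PartialEq, Eq)]
enum Scope<'a> {
    In(&'a str),
    Out(&'a str),
}

/// If no `\n` is in scope at all, moves every in-scope `\r` that directly
/// precedes an out-of-scope `\n` out of scope as well.
fn shuffle_carriage_returns_out<'a>(scopes: Vec<Scope<'a>>) -> Vec<Scope<'a>> {
    let newline_in_scope = scopes
        .iter()
        .any(|s| matches!(s, Scope::In(text) if text.contains('\n')));
    if newline_in_scope {
        // Newlines are deliberately in scope; leave everything untouched.
        return scopes;
    }

    let mut result = Vec::with_capacity(scopes.len());
    let mut iter = scopes.into_iter().peekable();
    while let Some(scope) = iter.next() {
        let next_starts_with_newline =
            matches!(iter.peek(), Some(Scope::Out(text)) if text.starts_with('\n'));
        match scope {
            Scope::In(text) if next_starts_with_newline && text.ends_with('\r') => {
                // `\r` is one byte, so splitting one byte before the end is safe.
                let (head, cr) = text.split_at(text.len() - 1);
                if !head.is_empty() {
                    result.push(Scope::In(head));
                }
                result.push(Scope::Out(cr));
            }
            other => result.push(other),
        }
    }
    result
}
```

Concatenating the returned pieces still yields the original input; only the in/out tagging of the `\r` changes.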

Get rid of `common` crate

It's always tempting to have it, but it's also a smell. A first step was taken in 5664118, using itertools' powerset.

Remaining items are:

  • instrament: can be moved back into core, as it is not used anywhere else currently anyway
  • strings.titlecase: looked around, as it seems very easy for there to be a crate for it, but no dice (funny that this is so "hard")
  • binary_search_uneven: currently only lives externally because of benchmarks, as Criterion benchmarks can only use the public API
  • is_compound_word: small function but unlikely to find a suitable crate for that. Lives externally because build.rs prepares the word list using that same algorithm (so that the processed word list doesn't contain compound words, as that would be wasted space)
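For a sense of how small the `strings.titlecase` helper from the list above can be, here is a sketch using only the standard library. The signature is an assumption; the actual function in the `common` crate may differ in name and behavior.

```rust
/// Uppercases the first character of `word` and lowercases the rest.
/// Unicode-aware via `char::to_uppercase` / `char::to_lowercase`.
/// (Hypothetical signature, not necessarily the one srgn uses.)
fn titlecase(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first
            .to_uppercase()
            .chain(chars.flat_map(char::to_lowercase))
            .collect(),
        None => String::new(),
    }
}
```

Note that case mappings can change the length of a string, which is one reason this is less trivial than it looks.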

Look into stemming

Idea from this repository. A stemmer could further shift the burden to compute (for which we still have breathing room to use more), away from memory (which we're trying to save), as the word list could shrink further. The list currently contains a lot of word derivatives, which could all be removed in favor of a single stem.

The build.rs script could prune the original (which won't be touched, as having raw data available is always good) word list further, using the same approach as used with the compound words (ingest word list, and only write out entries that can neither be constructed as compound words from other entries, nor reproduced via stemming).

Ultimately, the more elaborate the compute/algorithm side, the lower quality a word list we can get away with.
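The compound-word half of the pruning pass described above might look roughly like this. It is a simplified sketch: `is_compound_word` and `prune_compound_words` here are illustrative reimplementations, not the functions from the repository, and they ignore casing rules. A stemming check would be a second predicate in the same filter.

```rust
use std::collections::BTreeSet;

/// Returns true if `word` can be split into two or more entries of `words`.
/// (Illustrative only; exponential in the worst case.)
fn is_compound_word(word: &str, words: &BTreeSet<&str>) -> bool {
    // Try every split point that falls on a character boundary.
    word.char_indices().skip(1).any(|(i, _)| {
        let (head, tail) = word.split_at(i);
        words.contains(head) && (words.contains(tail) || is_compound_word(tail, words))
    })
}

/// Keeps only entries that cannot be constructed from other entries.
fn prune_compound_words<'a>(words: &BTreeSet<&'a str>) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|word| !is_compound_word(word, words))
        .collect()
}
```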

Assert scoped view correctness

On building a ScopedViewBuilder, assert that its constituents equal the original input.

Probably make it a hard assert, not a debug assert, as any bug there is a showstopper.

To be a hard assert, it needs to be cheap. For that, a cheap equality method is required (ScopedViewBuilder == &str).
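One way such a cheap comparison could work is to walk the input once, stripping each constituent as a prefix, without building an intermediate `String`. The sketch below assumes the builder's constituents can be viewed as a slice of `&str`; `parts_equal_input` is a hypothetical helper name.

```rust
/// Returns true if concatenating `parts` yields exactly `input`.
/// Runs in time linear in `input.len()` and does not allocate.
fn parts_equal_input(parts: &[&str], input: &str) -> bool {
    let mut rest = input;
    for part in parts {
        match rest.strip_prefix(*part) {
            Some(remaining) => rest = remaining,
            None => return false,
        }
    }
    // Anything left over means the parts did not cover the whole input.
    rest.is_empty()
}
```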

Feature: `import` handling

So far, all languages srgn offers come with comments and string queries. Those are kind of the common set all languages have.

A great, third one is imports: a very valid use case for those is rewriting all imports in a code base, which sometimes cannot be automated using IDE tooling.

tests fail when building from tarball due to calling git restore

---- tests::test_cli_files::case_1 stdout ----
Running: "git" "restore" "tests/files-option/basic-python/in"
thread 'tests::test_cli_files::case_1' panicked at tests/cli.rs:106:24:
Head restoration to not fail: Os { code: 2, kind: NotFound, message: "No such file or directory" }

I'm working on packaging this for nixpkgs, and currently I have to disable the check phase due to this.

Maybe check for the existence of .git and skip these unit tests if it's not found? Or just refactor it so it copies files to a temporary directory first before mutating them in-place?
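The second suggestion (copy first, then mutate) could be sketched with the standard library alone. The helper names `copy_dir_all` and `scratch_copy` are made up for this example, and the scratch directory naming is an assumption; a real test suite might prefer a dedicated temporary-directory crate.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively copies the directory `src` into `dst`, creating `dst` if needed.
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Copies a fixture directory to a scratch location under the system temp
/// directory, so tests can mutate the copy and leave the original alone.
fn scratch_copy(fixture: &Path, label: &str) -> io::Result<PathBuf> {
    let scratch = std::env::temp_dir().join(format!("{label}-{}", std::process::id()));
    copy_dir_all(fixture, &scratch)?;
    Ok(scratch)
}
```

With this, no `git restore` is needed after a test run, so the tests no longer depend on a `.git` directory being present.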

Unexpected panic while rearranging macro arguments

Invocation:

$ srgn --version
srgn 0.11.0

$ srgn --files '**/*.rs' --rust-query '((macro_invocation macro: (identifier) @name) @macro (#any-of? @name "error" "warn" "info" "debug" "trace" ))' '(?P<msg>.+);(?P<structure>.+)' '$structure,$msg'

Panic:

thread '<unnamed>' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/srgn-0.11.0/src/scoping/scope.rs:60:43:
begin <= end (23 <= 0) when slicing `error!( "server error"; "uuid" => %uuid, "error.user_msg" => %user_msg, src)`

Please let me know if more information is needed.

Look into rayon

I briefly looked into multi-threading in #3 and found it not worth it. However, having used rayon and being able to very quickly benefit from it (69f5f23) was impressive. Perhaps it's worth it. Would probably require reading all of stdin at once, then handing it to par_iter, and not just iterating over its lines one by one. We'd get multi-threading for free, but it might be slower for small inputs (which for my use case represent basically all inputs).
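To illustrate the "read everything, then fan out" shape without pulling in a dependency, the sketch below uses scoped threads from the standard library as a stand-in for rayon's `par_iter`. It is not rayon and not srgn code; `process_lines_parallel` and its parameters are hypothetical.

```rust
use std::thread;

/// Applies `action` to every line of `input` on up to `workers` threads,
/// returning the results in the original line order.
fn process_lines_parallel(
    input: &str,
    workers: usize,
    action: fn(&str) -> String,
) -> Vec<String> {
    // All input must be available up front, unlike line-by-line streaming.
    let lines: Vec<&str> = input.lines().collect();
    let workers = workers.max(1);
    let chunk_size = ((lines.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|line| action(*line)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For small inputs, the cost of spawning threads can easily outweigh the work itself, which matches the concern raised above.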

Implement language grammars

A solid set of some of the most popular ones:

  • Python
  • TypeScript
  • Go
  • Rust
  • C#

And then, for each or at least most, implement all, or most of (if relevant for the language):

  • comments
  • "documentation strings"
  • function names (at definition site)
  • function calls
  • class/struct/enum names (at definition site)
  • strings
  • variable names
  • type annotations

Implement word list performance improvements

Current issues are:

  • we still include compound words in the word list, even though we now have an algorithm in place to check for compound words at runtime (which is reasonably cheap)
  • the Linux (unknown, x86-64) binary is 120 meg (woops...), the Windows one c. 70 meg
  • compilation takes a minute on Windows and 21 minutes in WSL (???)

Solutions are:

  • filter word list to no longer contain compound words (use existing logic in Rust or a Python script)

  • do not use a &[&str]: it means storing a (usize, usize) (address, length) pointer 2.5 million (current word list length as of 37aff4d) times, roughly doubling the binary size ((32_600_000 + 2_152_639 * (2 * 64 / 8)) / 1_000_000 == 67.042224, aka the c. 70 meg observed on Windows; why Linux is much higher still on the same arch, no idea).

    Instead, see and hope the longest word in the list is reasonably short. Let's call that length $x$, and hope it's in the ballpark of, say, 20 bytes. Pad all other words with lengths smaller than $x$ with trailing \0s (or whatever...) until they're all of length $x$ as well. Store the result in a single &str, whose pointer/length info now has trivial overhead compared to the multi-megabyte single string. Store $x$ in a const (or static...), implement a simple binary search over that manually. We get the same important characteristics:

    • performance (same binary search, which is easy and possible as all element lengths are known at compile time, like a regular &[T: Sized])
    • compile-time data structure with zero runtime cost (unlike a hashset, or any form of de/ser... although it sounds an awful lot like badly reimplementing a part of Capnproto)

    but with the core advantage of no longer wasting a (usize, usize) for each word. On 64-bit, that tuple is 16 bytes, whereas the average word length (in char, not bytes) is 14.6, which in bytes should come out to about 16 as well. Hence, there's pretty much 100% overhead. With a single string, that's reduced to a single tuple. Compilation times are then down to utterly reasonable levels as well (15s on both platforms), and binary sizes are down to a tad over the string length, aka there's no overhead anymore (also confirmed on both platforms).

    Storing the single string in a &str already gives us UTF-8 safety, but the binary search could still go awry. Definitely unit-test the shit out of that.

    Using uneven search instead of the padding approach, see below.
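The padding approach described in this issue can be sketched as below. This is a toy illustration with a four-word sample list and a width of 8 bytes, both made up for the example; `contains_word` is a hypothetical name, and this is not the uneven search the project ended up using. The search works on bytes, which sidesteps slicing a `str` off a character boundary; byte-wise order matches code point order for UTF-8.

```rust
use std::cmp::Ordering;

/// Width in bytes of every padded entry (the `x` from the text). Sample value.
const WIDTH: usize = 8;

/// Sorted words, each right-padded with `\0` to `WIDTH` bytes. Sample data.
const WORDS: &str = "Abend\0\0\0Haus\0\0\0\0Tür\0\0\0\0Zug\0\0\0\0\0";

/// Binary search over fixed-width, `\0`-padded entries stored in one string.
fn contains_word(haystack: &str, width: usize, needle: &str) -> bool {
    debug_assert!(width > 0 && haystack.len() % width == 0);
    let bytes = haystack.as_bytes();
    let (mut low, mut high) = (0, bytes.len() / width);
    while low < high {
        let mid = low + (high - low) / 2;
        let entry = &bytes[mid * width..(mid + 1) * width];
        // Strip the trailing padding before comparing.
        let end = entry.iter().position(|&b| b == 0).unwrap_or(width);
        match entry[..end].cmp(needle.as_bytes()) {
            Ordering::Equal => return true,
            Ordering::Less => low = mid + 1,
            Ordering::Greater => high = mid,
        }
    }
    false
}
```

The only per-list overhead is the single pointer/length pair of `WORDS`; the cost is the padding itself, which grows with the gap between the average and the longest word.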

Remove dead code

Remove dead code that's part of the public API but no longer required. Sadly, clippy/ra cannot warn us of such code.
