spenserblack / gengo

A bit like tokei, a lot like linguist

License: Apache License 2.0

Shell 1.00% Rust 78.22% YAML 15.77% Ruby 4.67% Dockerfile 0.35%
language-statistics rust

gengo's Introduction

gengo (言語)

A bit like tokei, a lot like linguist.

Comparison

Feature/Behavior | linguist | tokei | gengo
Analyze Git Revision | Yes | No | Yes
Analyze Directory | No | Yes | Yes
Requires Git Repository | Yes | No | No
Detect Language by Extension | Yes | Yes | Yes
Detect Language by Filename | Yes | Yes | Yes
Detect by Filepath Pattern | No | No | Yes
Detect Language with Heuristics | Yes | No | Yes
Detect Language with Classifier | Yes | No | Not Yet ;)

Installation

View the installation documentation.

Usage

This tool has multiple file sources. Each file source can have unique usage to take advantage of its strengths and work around its weaknesses.

Directory File Source

This is a very generic file source that tries not to make many assumptions about your environment and workspace.

Ignoring Files

You can use a .gitignore file and/or an .ignore file to prevent files from being scanned. See the ignore crate for more details.

Git File Source

The git file source is highly opinionated -- it tries to act like a git utility, and uses git tools. Its goal is to behave similarly to linguist.

Overrides

Like linguist, you can override behavior using a .gitattributes file. Basically, just replace linguist-FOO with gengo-FOO. Unlike linguist, gengo-detectable always includes a file in the statistics (linguist still excludes a detectable file if it is generated or vendored).

# .gitattributes

# boolean attributes:

# These can be *negated* by prefixing with `-` (`-gengo-documentation`).
# Mark a file as documentation
*.html gengo-documentation
# Mark a file as generated
my-built-files/* gengo-generated
# Mark a file as vendored
deps/* gengo-vendored

# string attributes:
# Override the detected language for a file
# Use the Language enum's variant name (see docs.rs for more details)
templates/*.js gengo-language=PlainText

You will need to commit your .gitattributes file for it to take effect.

gengo's Issues

Use `.gitattributes` from rev?

Right now, gengo uses the .gitattributes file from the filesystem. E.g. changes to .gitattributes that haven't been committed can affect the file attributes (vendored, detectable, etc.).

github-linguist uses the .gitattributes of the analyzed rev. Until you commit your changes to .gitattributes, it will not affect github-linguist.

Right now I don't know which is preferable.

Add benchmarks

This will help analyze the performance impact of adding a language, and also help with performance fixes (#17).

Language: Arduino

Name

Arduino

Popularity

  • This language is reasonably popular

Interpreters / shebangs

  • gcc
  • arduino-cli

Filenames

No response

Extensions

  • ino

File Patterns

No response

Conflicts

No response

Add heuristics for C/C++

#63 and #64 add support for C and C++, which have some conflicts. Heuristics should be added to help decide which is which: something that is unique to either C or C++, or at least very unlikely to appear in both.
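
A minimal sketch of what such a heuristic could look like, using plain substring checks from the standard library. The `Guess` enum, the hint lists, and `guess_c_or_cpp` are hypothetical names chosen for illustration, not gengo's actual implementation, and the hints themselves are assumptions about what is unlikely to appear in both languages.

```rust
/// Hypothetical outcome of the heuristic (not a gengo type).
#[derive(Debug, PartialEq, Eq)]
pub enum Guess {
    C,
    Cpp,
    Unknown,
}

/// Substrings that are common in C++ and very unlikely in C (assumed list).
const CPP_HINTS: &[&str] = &["template<", "template <", "namespace ", "std::", "#include <iostream>"];

/// Standard C headers that idiomatic C++ rarely includes under these names (assumed list).
const C_HINTS: &[&str] = &["#include <stdio.h>", "#include <stdlib.h>", "#include <stdbool.h>"];

/// Guesses the language only when exactly one side's hints are present.
pub fn guess_c_or_cpp(contents: &str) -> Guess {
    let has_c = C_HINTS.iter().any(|hint| contents.contains(hint));
    let has_cpp = CPP_HINTS.iter().any(|hint| contents.contains(hint));
    match (has_c, has_cpp) {
        (true, false) => Guess::C,
        (false, true) => Guess::Cpp,
        // Neither or both matched: the heuristic cannot decide.
        _ => Guess::Unknown,
    }
}
```

Returning `Unknown` when both sides match keeps the heuristic conservative, since a C++ file may legitimately include a C header.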

TypeScript recognized as XML

vite-env.d.ts with the following contents is recognized as XML:

/// <reference types="svelte" />
/// <reference types="vite/client" />

Language: Jinja-like(?)

Django, Jinja, and Tera are all very similar, and can probably be combined under one language.

Extensions that I currently know of:

  • html (Django and Jinja)
  • tera

Heuristics:

  • ^{% extends (this must be the first line in Django if it will be used IIRC)
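
The `{% extends` heuristic above could be sketched roughly as follows. It assumes the tag must open the very first line, as described for Django; `starts_with_extends` is a hypothetical helper name.

```rust
/// Returns true if the first line of `contents` begins with `{% extends`.
/// Assumption: only the first line is checked, mirroring the Django rule.
pub fn starts_with_extends(contents: &str) -> bool {
    contents
        .lines()
        .next()
        .map_or(false, |first_line| first_line.starts_with("{% extends"))
}
```

A template that places the tag on a later line would not match, so this is only one possible signal among several.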

Would a new category, template, be useful, or too confusing? Jinja/Django/Tera is often HTML, but it can really be any language. .txt templates are common for plain-text email templates in Django. Onefetch and tokei use Tera to generate Rust code.

New category for ABNF, Regex, etc.

See #72

Regex and ABNF both fall under "data," but this isn't necessarily accurate. Perhaps "grammar" or "pattern" makes the most sense as a category?

Category overrides?

It's pretty common for a .js file to just be data, for example. And, on the other hand, some data files might be closer to programming than data, like .github/workflows/foo.yml.

Avoid expensive clones in loop

The worktree stack and the outcome of attr matches are cloned per path/entry, which, as @Byron pointed out in #200, is not the intended usage. Also, a thread-local repository shouldn't be created per path.

This will almost unavoidably change some immutable values to mutable refs, which will be reflected upstream in the implemented trait. This will count as a breaking change due to the public trait's required method signatures changing.


Edit 2023/10/09:

Perhaps a FileSource could implement a method that returns a mutable thread state.

Add public summary method

This will possibly change the return type of Gengo::analyze from a Result<IndexMap> to Result<Analysis>. Analysis::summary could take a non-exhaustive SummaryOpts struct for things like including/excluding generated files.
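
One way the proposed shape could look, as a rough sketch only. The `Entry` fields and the `include_generated` option are assumptions made for illustration, and `BTreeMap` from the standard library stands in for `IndexMap` to keep the example dependency-free.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-file result; field names are assumptions.
#[derive(Debug, Clone)]
pub struct Entry {
    pub language: String,
    pub size: usize,
    pub generated: bool,
}

/// Options for `Analysis::summary`. `#[non_exhaustive]` allows adding
/// fields later without breaking downstream crates.
#[non_exhaustive]
#[derive(Debug, Clone, Default)]
pub struct SummaryOpts {
    pub include_generated: bool,
}

/// Hypothetical return type of `Gengo::analyze`.
pub struct Analysis {
    entries: Vec<Entry>,
}

impl Analysis {
    pub fn new(entries: Vec<Entry>) -> Self {
        Self { entries }
    }

    /// Sums sizes per language, skipping generated files unless requested.
    pub fn summary(&self, opts: &SummaryOpts) -> BTreeMap<String, usize> {
        let mut totals = BTreeMap::new();
        for entry in &self.entries {
            if entry.generated && !opts.include_generated {
                continue;
            }
            *totals.entry(entry.language.clone()).or_insert(0) += entry.size;
        }
        totals
    }
}
```

Because the struct is non-exhaustive, callers outside the crate would start from `SummaryOpts::default()` and set individual fields.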

Subcategories?

For example, Jupyter Notebooks are markup, but maybe they can also be considered programming.

Prioritize filenames and patterns over extensions

In #50, JSONC files like devcontainer.json are getting detected as standard JSON, since JSON has a higher overall priority.

Because filenames and patterns are more specific than extensions, they should take higher priority. For example, only .json files that do not match devcontainer.json or .vscode/*.json should be considered JSON.
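
The intended ordering can be sketched as below. This is an illustration with hard-coded rules, not gengo's detection code; `detect` and `is_vscode_json` are hypothetical helpers, and the pattern check is a hand-written stand-in for `.vscode/*.json`.

```rust
/// Checks the most specific matchers first: filename, then path pattern,
/// then extension.
pub fn detect(path: &str) -> Option<&'static str> {
    let filename = path.rsplit('/').next().unwrap_or(path);

    // 1. Exact filenames are the most specific.
    if filename == "devcontainer.json" {
        return Some("JSON with Comments");
    }
    // 2. Path patterns come next.
    if is_vscode_json(path) {
        return Some("JSON with Comments");
    }
    // 3. Extensions are the least specific and act as the fallback.
    match filename.rsplit_once('.') {
        Some((_, "json")) => Some("JSON"),
        Some((_, "jsonc")) => Some("JSON with Comments"),
        _ => None,
    }
}

/// Stand-in for the `.vscode/*.json` pattern: a `.json` file directly
/// inside a `.vscode` directory.
fn is_vscode_json(path: &str) -> bool {
    match path.rsplit_once('/') {
        Some((dir, file)) => {
            (dir == ".vscode" || dir.ends_with("/.vscode")) && file.ends_with(".json")
        }
        None => false,
    }
}
```

With this ordering, `package.json` still resolves to JSON, while `devcontainer.json` and `.vscode/settings.json` never reach the extension step.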

Potential performance gain

Right now, extensions, shebangs, etc. are assigned to languages, so that the internals roughly mirror the languages file. However, instead of a structure like this:

JSON => ["json"]
JSON with Comments => ["json", "jsonc"]

there might be potential for much better performance with a structure like this:

"json" => [JSON, JSON with Comments]

In other words, make the matcher a key to the language, not the other way around.

Filename patterns will need to be a special case, and continue to be iterated over, but the performance hit shouldn't be too bad since interpreters, extensions, and filenames should be much more commonly used.
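
A small sketch of the inversion described above, using only the standard library. The two-variant `Language` enum and the function names are placeholders for illustration.

```rust
use std::collections::HashMap;

/// Placeholder language enum with just the two languages from the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Json,
    JsonWithComments,
}

/// Language-keyed table, mirroring the layout of the languages file.
pub fn by_language() -> Vec<(Language, Vec<&'static str>)> {
    vec![
        (Language::Json, vec!["json"]),
        (Language::JsonWithComments, vec!["json", "jsonc"]),
    ]
}

/// Inverts the table so that each extension maps to its candidate languages.
pub fn by_extension(
    table: &[(Language, Vec<&'static str>)],
) -> HashMap<&'static str, Vec<Language>> {
    let mut index: HashMap<&'static str, Vec<Language>> = HashMap::new();
    for (language, extensions) in table {
        for extension in extensions {
            index.entry(*extension).or_default().push(*language);
        }
    }
    index
}
```

A lookup such as `index.get("json")` then replaces a scan over every language with a single hash lookup; the inversion only needs to run once.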

Multiple file sources

Right now, the file source is a git repository via git2. But the analyzer is pretty flexible, and basically just needs to receive a filepath and bytes. So it might be cool to have a trait like FileSource, which would require an iterator method that returns a path and bytes and a static open method.

Besides a git repository, another source could be the filesystem.
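
A rough sketch of such a trait with a filesystem-backed implementation. The method names `open` and `files`, the associated `Iter` type, and the `Filesystem` struct are assumptions for illustration and may differ from what gengo ends up with.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical trait: a source that can be opened from a path and then
/// yields (path, bytes) pairs for the analyzer.
pub trait FileSource: Sized {
    type Iter: Iterator<Item = (PathBuf, Vec<u8>)>;

    fn open<P: AsRef<Path>>(path: P) -> io::Result<Self>;
    fn files(self) -> Self::Iter;
}

/// Hypothetical source that reads every regular file under a directory.
pub struct Filesystem {
    root: PathBuf,
}

impl FileSource for Filesystem {
    type Iter = std::vec::IntoIter<(PathBuf, Vec<u8>)>;

    fn open<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let root = path.as_ref().to_path_buf();
        if !root.is_dir() {
            return Err(io::Error::new(io::ErrorKind::NotFound, "not a directory"));
        }
        Ok(Self { root })
    }

    fn files(self) -> Self::Iter {
        let mut found = Vec::new();
        let mut pending = vec![self.root];
        // Walk the tree iteratively; unreadable entries are skipped.
        while let Some(dir) = pending.pop() {
            if let Ok(entries) = fs::read_dir(&dir) {
                for entry in entries.flatten() {
                    let path = entry.path();
                    // `file_type` does not follow symlinks, so links are ignored.
                    match entry.file_type() {
                        Ok(kind) if kind.is_dir() => pending.push(path),
                        Ok(kind) if kind.is_file() => {
                            if let Ok(bytes) = fs::read(&path) {
                                found.push((path, bytes));
                            }
                        }
                        _ => {}
                    }
                }
            }
        }
        found.into_iter()
    }
}
```

This version reads everything eagerly into memory for simplicity; a git-backed source would implement the same trait by walking a tree and reading blobs.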


This could also perhaps lead to multiple binaries -- git-gengo would analyze a git repository, and gengo would analyze a directory.

Better error messages

Some common error messages, like failing to open a repository, can be a bit confusing. It would be nice to have more readable error messages.

Create `samples/` folder

This will contain samples of code used for testing and also to possibly support a classifier in the future.

Samples must be sourced from projects with either the Apache-2.0 or MIT license, or written by the contributor.

Help add language support!

This tool needs to support a lot more languages! Whether it's a programming language, a data language, prose or markup, your help is appreciated! Even if the language is already supported, if you know it you can review and see if we've forgotten any information or if we got anything wrong.

Also, see #34

New language category: Query

I've never really liked how SQL and GraphQL were considered data languages in Linguist. They don't contain the actual data, more like the "shape" of the data.

Generic `Directory` `FileSource`

This would be the most generic file source possible, one that simply reads from a directory.

The question is how should overrides be implemented? Probably some form of

let source = Directory::new("path/to/dir").with_overrides(/* ... */);

Should it take some sort of Overrides struct? Be more granular with .with_vendored_overrides etc.?
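
To make the first option concrete, here is a sketch of a single `Overrides` struct passed through a builder method. The struct fields, the prefix-based matching, and `is_vendored` are all assumptions for illustration; the question of which design to pick remains open.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical bundle of overrides, each a list of path prefixes.
#[derive(Debug, Clone, Default)]
pub struct Overrides {
    pub vendored: Vec<PathBuf>,
    pub generated: Vec<PathBuf>,
}

/// Hypothetical directory file source.
#[derive(Debug, Clone)]
pub struct Directory {
    root: PathBuf,
    overrides: Overrides,
}

impl Directory {
    pub fn new<P: AsRef<Path>>(root: P) -> Self {
        Self {
            root: root.as_ref().to_path_buf(),
            overrides: Overrides::default(),
        }
    }

    /// Builder-style setter that replaces all overrides at once.
    pub fn with_overrides(mut self, overrides: Overrides) -> Self {
        self.overrides = overrides;
        self
    }

    pub fn root(&self) -> &Path {
        &self.root
    }

    /// True if `path` (relative to the root) lies under a vendored prefix.
    pub fn is_vendored(&self, path: &Path) -> bool {
        self.overrides
            .vendored
            .iter()
            .any(|prefix| path.starts_with(prefix))
    }
}
```

A more granular API such as `.with_vendored_overrides` could be layered on top by mutating one field of the same struct.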


@o2sh if onefetch goes with reading from the files in the directory, this is probably the file source that would be used. If so, I think that onefetch should provide the overrides. These would come from either the CLI or a config file.

Fix CI failing on Windows

Last I checked, on a Windows runner the snapshot test for detecting by shebang fails, returning an empty vec of matches 🤷

Significant performance degradation

Turns out gengo is once again much slower. Given the changes in #191, the most likely culprit is the Git source. Originally blobs were read and analyzed in parallel. Now they're only analyzed in parallel.

Edit: The biggest time-consumer seems to be calling file_source.overrides 🤔

Ideas

  • Right now a FileSource returns an iterator over tuples of filenames and file contents. Perhaps the iterator should only yield filenames, and the source should provide a method get_contents that takes that filename. That could possibly be easier to parallelize and boost performance.
  • Similar to above, but filename() and contents() methods that take Self::Iter::Item and return the filename and contents.
  • Go with the other idea for implementing multiple sources, where they would receive the analyze_blob function and return results.
    • Implement an overrides method
    • analyze_blob calls self.file_source.overrides
    • analyze passes analyze_blob to self.file_sources.handle
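
The first idea in the list could be sketched as follows: the source only lists filenames, and contents are fetched on worker threads. `LazyFileSource`, `InMemory`, and `sizes_in_parallel` are hypothetical names, and spawning one scoped thread per file is a simplification; a real implementation would more likely use a thread pool.

```rust
use std::collections::BTreeMap;
use std::thread;

/// Hypothetical trait: listing is cheap, reading happens on demand.
pub trait LazyFileSource: Sync {
    fn filenames(&self) -> Vec<String>;
    fn get_contents(&self, filename: &str) -> Option<Vec<u8>>;
}

/// In-memory source used only to keep the example self-contained.
pub struct InMemory {
    pub files: BTreeMap<String, Vec<u8>>,
}

impl LazyFileSource for InMemory {
    fn filenames(&self) -> Vec<String> {
        self.files.keys().cloned().collect()
    }

    fn get_contents(&self, filename: &str) -> Option<Vec<u8>> {
        self.files.get(filename).cloned()
    }
}

/// Reads each file on its own scoped thread and returns (filename, byte
/// length), sorted by filename. The length stands in for real analysis.
pub fn sizes_in_parallel<S: LazyFileSource>(source: &S) -> Vec<(String, usize)> {
    let names = source.filenames();
    let mut sizes: Vec<(String, usize)> = thread::scope(|scope| {
        let handles: Vec<_> = names
            .iter()
            .map(|name| {
                scope.spawn(move || {
                    let len = source.get_contents(name).map_or(0, |bytes| bytes.len());
                    (name.clone(), len)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });
    sizes.sort();
    sizes
}
```

Because reading happens inside the worker, both the read and the analysis run in parallel, which is the property the earlier implementation had.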

Use colors in CLI output

Not sure the best way to do this, but it would be fun. Language bar? Colorize the text?

Maybe write each line with the language's color for the background, setting the foreground to black or white depending on brightness.
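
A sketch of the brightness-based foreground choice, assuming the common (299, 587, 114) / 1000 weighting for perceived brightness and 24-bit ANSI escape codes. The threshold of 128 and the function names are assumptions chosen for illustration.

```rust
/// Foreground color picked for a given background.
#[derive(Debug, PartialEq, Eq)]
pub enum Foreground {
    Black,
    White,
}

/// Perceived brightness in the range 0..=255, weighting green most heavily.
pub fn brightness(r: u8, g: u8, b: u8) -> u32 {
    (299 * r as u32 + 587 * g as u32 + 114 * b as u32) / 1000
}

/// Bright backgrounds get black text, dark backgrounds get white text.
/// The cutoff of 128 is an assumed midpoint, not a standard.
pub fn foreground_for(r: u8, g: u8, b: u8) -> Foreground {
    if brightness(r, g, b) >= 128 {
        Foreground::Black
    } else {
        Foreground::White
    }
}

/// Wraps `text` in ANSI codes: SGR 30/97 for the foreground and
/// `48;2;r;g;b` for a true-color background, followed by a reset.
pub fn colorize(text: &str, (r, g, b): (u8, u8, u8)) -> String {
    let fg = match foreground_for(r, g, b) {
        Foreground::Black => "30",
        Foreground::White => "97",
    };
    format!("\x1b[{fg};48;2;{r};{g};{b}m{text}\x1b[0m")
}
```

Terminals without true-color support would need a fallback, which this sketch does not attempt.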

Change types

When I started this, I made assumptions about which types to use that turned out to not be true. For example, I expected that git libraries would return OsStr, but it seems like &str is a lot more common.

Argument types, return types, field types, etc. should be changed to simplify the code and require fewer type conversions.

Handle submodules

With #157, submodules are now traversed. As mentioned in #157 (comment), I'm of the opinion that submodules should be vendored by default, defining "vendored" as "files that are not from this repository that are distributed with it."

cc @Byron

Onefetch languages checklist

Note

I've been pinging a lot of onefetch contributors for help with languages I'm unfamiliar with. If anyone finds this annoying, please let me know. 🙂

This is the list of languages supported by onefetch, as of when this issue was created. This tool should support most of these in order to replace tokei.

Made with:

#!/usr/bin/env ruby
require 'yaml'

filepath = ARGV[0]

langs = YAML.load_file(filepath)

langs.keys.each do |lang|
  puts "- [ ] #{lang}"
end

script.rb path/to/onefetch/languages.yaml

Languages added later

Support ABNF

Linguist and onefetch mark this as data. However, some quick research shows that ABNF is a metalanguage. Perhaps a new language category should be added? Meta, Grammar, Expression, Pattern, or something similar? This will probably need the opinion of someone familiar with ABNF to decide.

Related: #34

🎃 Hacktoberfest: Help add (programming/data/etc.) language support!

First things first: No Rust knowledge required! All you need is some basic knowledge on a language that this project doesn't support. And not just programming languages! Data languages (like JSON or CSV), markup (HTML), and others are also welcome.

Basically, you need to modify gengo/languages.yaml and provide the following:

  • What's the name of the language
  • Provide a color to associate with the language
  • Choose a category that best fits the language
  • Are there any file extensions associated with the language? E.g. js for JavaScript.
  • Are there any filenames associated with the language? E.g. Dockerfile for Docker.

There are plenty of pull requests that you can use as an example.

View docs/CONTRIBUTING.md for more details.

Update heuristics

This currently filters by heuristics. It shouldn't do that. Instead, it should pick a single language if exactly one matches, and return all candidate languages if there are no matches (or multiple matches).
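
The desired behavior could be sketched like this. `narrow` is a hypothetical helper, and candidates are plain strings here rather than gengo's language type.

```rust
/// Narrows `candidates` to a single language only when exactly one of them
/// matches the heuristic; otherwise returns all candidates unchanged.
pub fn narrow<'a>(
    candidates: &[&'a str],
    matches_heuristic: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    let matched: Vec<&'a str> = candidates
        .iter()
        .copied()
        .filter(|candidate| matches_heuristic(*candidate))
        .collect();
    if matched.len() == 1 {
        matched
    } else {
        // Zero or several matches: the heuristic is inconclusive.
        candidates.to_vec()
    }
}
```

Compared with plain filtering, an inconclusive heuristic never shrinks the candidate list to empty, so later steps still have something to choose from.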

Drop submodule support

Submodule support TBH was kind of rushed in without proper discussion or planning, being an unrelated change wrapped up in a PR to improve performance.

Right now submodule support is not great IMO.

  • Overrides can be confusing and obscure.
  • Most commonly users won't even see the submodule stats, as all submodules are vendored.
  • Detecting submodules as vendored is hacky, as it's not organized with Vendored, but sits in the main implementation as its own boolean value is_submodule.
  • Submodules can't even be analyzed unless they are initialized, potentially confusing for users.
  • Because of the above, repositories with clean histories checked out at the same commit can have different results.

For these reasons, I think that submodules should be skipped. I'd rather drop submodule support completely and potentially add it back in with proper support, than try and patch in proper support to the current implementation.

If a user is determined to analyze submodules it's fairly easy to do without support.

  • As a library, discover each submodule and run Gengo::analyze on it.
  • As a binary, iterate over submodules (this should be trivial with git submodule foreach).
