
Add a `WordIndices` struct (rust-unic, open issue)

bbqsrc commented on June 2, 2024
Add a `WordIndices` struct


Comments (2)

projektir commented on June 2, 2024

There also seems to be a mismatch between the documentation for the Words iterator and what it actually does. The documentation says that Words returns only alphanumeric substrings, but Words actually returns every substring; the alphanumeric restriction comes from a filter that happens to be applied in all the tests and examples.

It would perhaps be beneficial for performance reasons to have a separate iterator that filters for alphanumeric characters from the beginning? To summarize the interfaces:

These use the current WordBounds iterator:

  • WordBoundIndices, to emit all the tokens including whitespace, along with their indices
  • WordBounds, just the tokens

These would require a new iterator (that I'm interested in contributing):

  • WordIndices, as @bbqsrc suggested, to emit only alphanumeric tokens and their indices
  • Words, just the alphanumeric tokens

Words would also drop its filter argument. Words is already an iterator and it seems trivial for users to add a .filter() on top.
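To make the proposed split concrete, here is a minimal sketch using only the standard library. `SimpleBoundIndices` is a hypothetical stand-in for the existing word-boundary iterator: it splits on alphanumeric versus non-alphanumeric runs and does not implement the real Unicode segmentation rules. `WordIndices` and `words` follow the names suggested above, but their signatures here are assumptions, not the crate's actual API.

```rust
/// Hypothetical stand-in for a word-boundary iterator. Yields
/// `(byte_offset, token)` for every maximal run of alphanumeric or
/// non-alphanumeric characters. This is NOT UAX #29 segmentation; it only
/// mimics the shape of an "all tokens with indices" iterator.
struct SimpleBoundIndices<'a> {
    text: &'a str,
    pos: usize,
}

impl<'a> Iterator for SimpleBoundIndices<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let text: &'a str = self.text;
        let rest = &text[self.pos..];
        let first = rest.chars().next()?;
        let kind = first.is_alphanumeric();
        // The run ends at the first char of the other class, or at end of input.
        let len = rest
            .char_indices()
            .find(|&(_, c)| c.is_alphanumeric() != kind)
            .map_or(rest.len(), |(i, _)| i);
        let start = self.pos;
        self.pos += len;
        Some((start, &rest[..len]))
    }
}

/// Sketch of the proposed `WordIndices`: only tokens that contain at least
/// one alphanumeric character, paired with their byte offsets.
struct WordIndices<'a> {
    inner: SimpleBoundIndices<'a>,
}

impl<'a> WordIndices<'a> {
    fn new(text: &'a str) -> Self {
        WordIndices {
            inner: SimpleBoundIndices { text, pos: 0 },
        }
    }
}

impl<'a> Iterator for WordIndices<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        // Skip whitespace-only and punctuation-only tokens.
        self.inner
            .find(|(_, tok)| tok.chars().any(char::is_alphanumeric))
    }
}

/// Sketch of a filter-free `Words`: the same tokens without their indices.
fn words(text: &str) -> impl Iterator<Item = &str> {
    WordIndices::new(text).map(|(_, tok)| tok)
}
```

With this shape, callers who want extra filtering can chain `.filter()` on the result, for example `words(text).filter(|w| w.len() > 3)`, so the iterator itself does not need a filter argument.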


behnam commented on June 2, 2024

Thanks for filing this, @bbqsrc.

We haven't spent much time on the string-level API yet, which is why it isn't extensive. No objections to adding WordIndices; as always, PRs are welcome!


Also, IMHO we should try to come up with better naming for these as a higher-level API. A WordIterator in this case may actually emit whitespace-only or punctuation-only tokens, which are not words, per se.

Any ideas/suggestions are welcome! :)

