Comments (2)
There also seems to be a mismatch between the documentation for the `Words` iterator and what it actually does. The docs say that `Words` should return only alphanumeric substrings, but it actually returns all the substrings; the alphanumeric restriction comes from a filter that happens to be applied in all the tests and examples.
It would perhaps be beneficial for performance reasons to have a separate iterator that filters for alphanumeric characters from the beginning? To summarize the interfaces:
These use the current `WordBounds` iterator:
- `WordBoundsIndices`, to emit all the tokens including whitespace, along with their indices
- `WordBounds`, just the tokens

These would require a new iterator (that I'm interested in contributing):
- `WordIndices`, as @bbqsrc suggested, to emit only alphanumeric tokens and their indices
- `Words`, just the alphanumeric tokens

`Words` would also drop its filter argument. `Words` is already an iterator and it seems trivial for users to add a `.filter()` on top.
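
To make the two groups of interfaces concrete, here is a rough, std-only sketch. It is not unic-segment code: `bound_indices` and `word_indices` are hypothetical names, and the splitting rule (maximal runs of alphanumeric vs. non-alphanumeric characters) is a deliberate simplification standing in for real UAX #29 word boundaries. It only illustrates the relationship being discussed, namely that the "words with indices" view can be derived from the "all tokens with indices" view by filtering.

```rust
/// Simplified stand-in for a word-bounds-with-indices iterator.
/// Splits `text` into maximal runs of alphanumeric and non-alphanumeric
/// characters and pairs each run with its starting byte offset.
/// Assumption: this is NOT UAX #29 segmentation, just an illustration.
fn bound_indices(text: &str) -> Vec<(usize, &str)> {
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, ch) in text.char_indices() {
        let kind = ch.is_alphanumeric();
        if let Some(p) = prev {
            // A change of character class closes the current token.
            if p != kind {
                tokens.push((start, &text[start..i]));
                start = i;
            }
        }
        prev = Some(kind);
    }
    // Emit the final run, if there was any input at all.
    if !text.is_empty() {
        tokens.push((start, &text[start..]));
    }
    tokens
}

/// Hypothetical `WordIndices`-style view: keep only the tokens that contain
/// at least one alphanumeric character, together with their byte offsets.
fn word_indices(text: &str) -> Vec<(usize, &str)> {
    bound_indices(text)
        .into_iter()
        .filter(|(_, tok)| tok.chars().any(char::is_alphanumeric))
        .collect()
}
```

For the input `"Hi, you 2!"`, `bound_indices` yields every token, including `", "`, `" "` and `"!"`, while `word_indices` keeps only `(0, "Hi")`, `(4, "you")` and `(8, "2")`. A dedicated iterator could in principle skip non-word tokens without materializing them first, which is the performance point raised above; this sketch does not attempt that.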
from rust-unic.
Thanks for filing this, @bbqsrc.
We haven't spent much time on the string-level API yet, which is why it isn't extensive. No objections to adding `WordIndices`: as always, PRs are welcome!
Also, IMHO we should try to come up with better naming for these as a higher-level API. A `WordIterator` in this case may actually emit whitespace-only or punctuation tokens, which are not words, per se.
Any ideas/suggestions are welcome! :)