Comments (11)
Just copy-pasting from the OP shows what's happening:
[I] ➜ ~ fd gung.html
finds Bestätigung.html Downloads/Bestätigung.html
as expected.
tavianator@graphene $ echo 'Bestätigung' | xxd
00000000: 4265 7374 61cc 8874 6967 756e 670a Besta..tigung.
But
[I] ➜ ~ fd bestätigung
doesn't find anything, even if run with --unrestricted.
tavianator@graphene $ echo 'bestätigung' | xxd
00000000: 6265 7374 c3a4 7469 6775 6e67 0a best..tigung.
The difference (apart from the case of B) is that fd outputs 61 cc 88 for ä, which is UTF-8 for U+0061 U+0308, while the OP typed c3 a4 for ä, AKA U+00E4.
I suspect if you manually search for the decomposed form, something like
$ fd $'besta\xcc\x88tigung'
it will find it.
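The two byte dumps above can also be reproduced outside the shell. Here is a minimal Python sketch of the mismatch (Python is used purely for illustration since fd itself is written in Rust; the helper name `contains` is hypothetical and the plain substring test only stands in for fd's regex search):

```python
import unicodedata

# Precomposed form, as typed by the OP (U+00E4).
composed = "best\u00e4tigung"
# Decomposed form, as it appears in fd's output (U+0061 U+0308).
decomposed = unicodedata.normalize("NFD", composed)

print(composed.encode("utf-8").hex(" "))    # 62 65 73 74 c3 a4 74 ...
print(decomposed.encode("utf-8").hex(" "))  # 62 65 73 74 61 cc 88 74 ...


def contains(haystack, needle, form=None):
    """Case-insensitive substring test, optionally normalizing both sides."""
    if form is not None:
        haystack = unicodedata.normalize(form, haystack)
        needle = unicodedata.normalize(form, needle)
    return needle.casefold() in haystack.casefold()


filename = "Besta\u0308tigung.html"  # decomposed, like the file on disk

print(contains(filename, composed))         # False: the code points differ
print(contains(filename, composed, "NFD"))  # True: both sides decomposed
```

Normalizing both sides makes the match succeed, which is why searching for the decomposed form by hand is expected to work.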
I agree that this is not a good user experience.
Unfortunately, it is also a very difficult problem to solve.
The library we use for regex doesn't support normalization, and probably won't anytime soon. See rust-lang/regex#404 (comment). The workaround there of normalizing the regex and input is much easier said than done. Normalizing all the filenames significantly hurts performance. And normalizing the regex isn't as straightforward as normalizing the string of the regex.
For example, "ä?" would need to be converted to "(a\u0308)?".
Perhaps the best path would be to have an option to transform the regex to accept either equivalent form. So for example ä would be transformed into "(ä|a\u0308)".
I'm not familiar enough with Unicode to know how feasible that would be in general, or how to create those transformation tables.
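To make the idea concrete, here is a rough Python sketch of such a transformation. It is an illustration only, not how fd or the Rust regex crate works: the function name `accept_both_forms` is made up, it only rewrites precomposed characters that appear as plain literals, and it would produce wrong results inside character classes or escape sequences.

```python
import re
import unicodedata


def accept_both_forms(pattern):
    """Rewrite each precomposed literal so it also matches its NFD form.

    Naive on purpose: every character is treated as a literal, so character
    classes and escapes in the pattern are not handled.
    """
    out = []
    for ch in pattern:
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            # Group the alternatives so a following quantifier such as "?"
            # applies to the whole character, not just the combining mark.
            out.append("(?:" + ch + "|" + nfd + ")")
        else:
            out.append(ch)
    return "".join(out)


pattern = accept_both_forms("best\u00e4tigung")
for name in ("Besta\u0308tigung.html", "Best\u00e4tigung.html"):
    print(bool(re.search(pattern, name, re.IGNORECASE)))
```

Python's `unicodedata` module supplies the decomposition data here; a Rust implementation would need an equivalent source for those tables.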
Perhaps the best path would be to have an option to transform the regex to accept either equivalent form. So for example ä would be transformed into "(ä|a\u0308)".
I think the worst case here is character classes like [ä-ë]. We'd have to iterate over every code point in the range, apply NFD, and construct a new alternation. It could blow up the regex gigantically.
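A small Python sketch of that expansion, just to show the growth (the function name `expand_class_range` is hypothetical, and it only handles a single simple range, not a full character class parser):

```python
import re
import unicodedata


def expand_class_range(first, last):
    """Expand the range first-last into an alternation that also accepts
    the NFD form of every code point in the range."""
    alternatives = []
    for cp in range(ord(first), ord(last) + 1):
        ch = chr(cp)
        alternatives.append(re.escape(ch))
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            alternatives.append(re.escape(nfd))
    return "(?:" + "|".join(alternatives) + ")"


expanded = expand_class_range("\u00e4", "\u00eb")  # the class [ä-ë]
print(expanded)
print(len(expanded.split("|")), "alternatives")
```

Even this 8 code point range turns into 15 alternatives (æ has no decomposition, the other 7 do), so a class spanning a large block could grow by thousands of entries.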
This is likely a duplicate of #638.
Is the search using U+0075 and U+0308 (a "u" followed by a combining diaeresis) while the filename uses U+00FC (a single ü character), or vice versa?
@tmccombs Yeah it would be vice versa. macOS stores filenames in normalization form NFD (D for decomposed), so the actual filenames will have combining characters while most everything else uses the precomposed characters.
Oh I guess my info is out of date. That's true for HFS+, but APFS is normalization-insensitive rather than actually normalizing. So file paths will use whatever normalization you used to create the file, but you can access it by other normalizations too (kinda like how touch foo; cat Foo would work on a case-insensitive FS).
Finder still uses NFD though.
@tavianator I'm on a case-sensitive APFS
@tmccombs [I] ➜ ~ printf %x\n "'ä'" outputs e4, and [I] ➜ ~ printf \ue4\n outputs ä again.
However it seems I can't pipe to fd to be able to test the individual characters (#1346).
What does printf "%x\n" $(ls) in the folder that contains the bestätigung file give?
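If the shell one-liner is awkward to interpret, another way to inspect the names is a short script that prints the code points of every entry in a directory. This is only a sketch (the helper name `codepoints` is made up), and it assumes the file system hands the names back exactly as stored:

```python
import os


def codepoints(name):
    """Return the code points of a name, e.g. 'U+0061 U+0308'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in name)


if __name__ == "__main__":
    # List the current directory and show how each name is encoded.
    for name in sorted(os.listdir(".")):
        print(name, "->", codepoints(name))
```

A decomposed ä shows up as U+0061 U+0308, a precomposed one as U+00E4.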
Yep, that's right! Cheers!
What makes fd outstanding apart from its efficiency is its ease of use IMHO. Though this is quite a workaround, don't you think? Similar characters like that can be found in many European languages.
Here is a quick proof of concept for NFD-izing a regex: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=2ed2dfc074864bbffa5b85f685349d71