Comments (6)
That's very clear! Thank you.
I also found:
- https://www.unicode.org/faq/utf_bom.html#BOM
- https://learn.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks suggests Windows apps will ignore a BOM in the middle of a file
- https://googlesamples.github.io/android-custom-lint-rules/checks/ByteOrderMark.md.html suggests Android apps consider it an error
We do want hledger to just work on real-world data where possible, so we should be permissive where that doesn't add complications. But I'm not sure we need to go as far as ignoring BOMs appearing anywhere in the input. It seems like an unusual niche case, and one that's easy to solve with preprocessing. Is it really valid for files to change encoding in the middle? I can't imagine many tools that would handle that properly.
from hledger.
Our BOM handling should be mentioned at https://hledger.org/dev/hledger.html#text-encoding .
from hledger.
Related, https://www.unicode.org/faq/utf_bom.html#BOM says:
Q: What should I do with U+FEFF in the middle of a file?
- In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur.
- For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string.
- When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
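The distinction the FAQ draws (a leading U+FEFF is a BOM, an interior one is content or an unsupported character) can be sketched in a few lines. This is only an illustration in Python, not hledger's actual code; the function names `strip_leading_bom` and `find_interior_feff` are hypothetical.

```python
BOM = "\ufeff"  # U+FEFF: byte order mark at the start, ZWNBSP elsewhere


def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF only when it is the first character."""
    return text[1:] if text.startswith(BOM) else text


def find_interior_feff(text: str) -> list[int]:
    """Return positions of U+FEFF that remain after the leading BOM is stripped.

    A caller can then choose a policy: keep them as content,
    drop them, or report them as unsupported characters.
    """
    body = strip_leading_bom(text)
    return [i for i, ch in enumerate(body) if ch == BOM]
```

Keeping detection separate from the policy leaves open whether interior U+FEFF characters are ignored, preserved, or rejected.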
from hledger.
The BOM is a troublemaker... ;-) We use extended ASCII, and in the past banks produced CSV files in CP-1250. Some of them upgraded their software and moved to UTF-8, and I believe that is why they produce UTF-8 files with a BOM: to clearly signal that the CSV file is in UTF-8, not CP-1250.
It is possible to create a file that starts with a UTF-8 BOM and has a UTF-16LE BOM in the middle: just concatenate a UTF-8 file with a UTF-16LE file. But the result is invalid, because the BOM is a single code point (U+FEFF) that is encoded differently in each UTF. I thought it might be possible to start with UTF-8 and use a BOM in the middle of the file to switch the encoding to UTF-16LE, but that does not work because the UTF-16LE BOM is an invalid byte sequence in UTF-8... Well, it could be made to work, but the software would have to examine each decoding error and test whether the offending bytes could be a BOM for another UTF variant... The good news is that UTF-16LE files are rare; UTF-8 is used in most cases.
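To make the point about the byte sequences concrete, here is a small Python check (an illustration only; the sample CSV content and the helper `is_valid_utf8` are made up for this sketch). It shows that U+FEFF encodes to different bytes in UTF-8 and UTF-16LE, and that a UTF-8 file concatenated with a UTF-16LE file no longer decodes as UTF-8.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same code point, U+FEFF, in two encodings.
utf8_bom = "\ufeff".encode("utf-8")         # bytes EF BB BF
utf16le_bom = "\ufeff".encode("utf-16-le")  # bytes FF FE

# Hypothetical sample data: a UTF-8 file followed by a UTF-16LE file.
utf8_part = codecs.BOM_UTF8 + "date,amount\n".encode("utf-8")
utf16_part = codecs.BOM_UTF16_LE + "2024-01-01,10\n".encode("utf-16-le")
joined = utf8_part + utf16_part
```

Here `is_valid_utf8(utf8_part)` is true while `is_valid_utf8(joined)` is false, because the byte 0xFF can never appear in well-formed UTF-8.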
from hledger.
What about ignoring ZWNBSP characters during CSV import? I do not see how these invisible troublemakers could be useful in an hledger journal... Another way of handling them is to treat them as EOL, which would help when the CSV file does not end with an EOL... An exception would be ZWNBSP used as the field separator. I do not know whether there is a way to define the invisible ZWNBSP as a field separator, maybe `separator \uFEFF` or `separator ZWNBSP`. I do not know of any such CSV file...
Or maybe address this by adding a new command that maps one character to another, like the UNIX command `tr`. I could use it to translate a CSV file from CP-1250 to UTF-8 by defining the translation table in the hledger import rules. Several such mapping commands could appear in the rules file, each on its own line. The problem here is that hledger reads the input file as UTF-8, and extended ASCII characters are invalid when the file is read as a UTF-8 stream (hledger reports the error `invalid byte sequence`). To address this, a new command to disable UTF-8 parsing would be needed too, maybe `encoding utf-8` (the default) and `encoding binary` (to parse the CSV in 8-bit mode).
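Until something like that exists, the same effect can be had by preprocessing the file before hledger reads it. The following Python sketch is one possible approach, not an hledger feature: the function name `normalize_csv_bytes` and the choice of CP-1250 as the fallback are assumptions for this example. It tries UTF-8 first (dropping a leading BOM), falls back to the legacy code page when the bytes are not valid UTF-8, and removes any remaining U+FEFF.

```python
def normalize_csv_bytes(raw: bytes, fallback: str = "cp1250") -> str:
    """Decode bank CSV bytes to text, tolerating a BOM and a legacy code page.

    Assumption: anything that is not valid UTF-8 is in `fallback`.
    This is a heuristic; the fallback decode can itself raise
    UnicodeDecodeError for bytes the code page leaves undefined.
    """
    try:
        # "utf-8-sig" is Python's UTF-8 codec that skips one leading BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    # Drop any U+FEFF left in the middle of the text (the "ignore" policy).
    return text.replace("\ufeff", "")
```

The returned text can then be written out as plain UTF-8, for example with `open(path, "w", encoding="utf-8")`, and passed to hledger as usual.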
from hledger.