iscc / iscc-specs
ISCC: International Standard Content Code
Home Page: http://iscc.codes
License: Other
Use-Case:
A user has a small chunk of text and wants to find longer texts that contain this chunk or a similar one.
Proposed solution draft:
Apply shift-invariant text chunking (for example, chunks of roughly 1000 characters). Create a separate Content-ID for each chunk. Supply the chunk IDs as metadata to the full ISCC.
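As an illustration, shift-invariant chunking can be done with content-defined boundaries: cut wherever a hash of the preceding few characters matches a bit mask, so boundaries depend only on local content and survive insertions earlier in the text. This is only a toy sketch; the window size, mask, and hash function are assumptions, not part of the specification.

```python
import hashlib

WINDOW = 64            # sliding-window size in characters (assumed)
MASK = (1 << 10) - 1   # ~1024-character average chunk size (assumed)

def chunk_text(text: str) -> list:
    # Cut whenever a hash of the preceding WINDOW characters matches MASK.
    chunks, start = [], 0
    for pos in range(WINDOW, len(text)):
        window = text[pos - WINDOW:pos].encode("utf-8")
        digest = hashlib.sha1(window).digest()
        if (int.from_bytes(digest[:4], "big") & MASK) == MASK:
            chunks.append(text[start:pos])
            start = pos
    chunks.append(text[start:])  # trailing remainder
    return chunks
```

Each chunk could then be fed to the normal Content-ID pipeline; a match on any chunk ID indicates the longer text likely contains the queried fragment.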
The word "Identifier", as used in the specification, creates some confusion, as the ISCC is not a "Standard Identifier" in the traditional sense. It is a "Content Code" with content identification features.
Text tokenization should be simple and generic while also supporting CJK languages. It must yield appropriate results for similarity encoding independent of language and character set. Tokenization should not assume that text can be extracted without word boundary and separation issues.
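One language-agnostic approach that sidesteps word segmentation entirely is to slide a fixed-width character window over normalized text, so CJK text without spaces is handled the same way as English. A hedged sketch; the window width and NFKC normalization are illustrative choices, not the specified algorithm:

```python
import unicodedata

def tokenize(text: str, width: int = 4) -> list:
    # Normalize, lowercase, drop all whitespace, then slide a fixed-width
    # character window over what remains; no word segmentation required.
    norm = "".join(unicodedata.normalize("NFKC", text).lower().split())
    return [norm[i:i + width] for i in range(max(len(norm) - width + 1, 1))]
```

For example, `tokenize("Hello World")` yields overlapping 4-grams starting with `"hell"`, and a CJK string without any spaces produces character n-grams the same way.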
This would promote experimentation and support internal use cases that don't require interoperability but have special requirements regarding algorithm sensitivity. Default values for interoperability should be fixed when the specification becomes stable.
There is currently an outdated web-based demo of the ISCC at https://isccdemo.content-blockchain.org/. We should update it to the current version of the ISCC, with support for content and metadata extraction from media files, and move the demo to an ISCC-hosted domain.
The reference code is currently implemented with minimal dependencies and no performance optimizations. For real-world and large-scale testing of the ISCC we need faster ISCC generation.
Create an optional parallel iscc_opt.py module with the same interface, optimized with packages like numpy, numba, gmpy, or cython (to be researched).
Without making these hard dependencies, we could automatically use the optimized version if the required packages are available in the user's environment. Performance gains of 100x or more can be expected from such optimizations.
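The soft-dependency dispatch could look like the sketch below. The function name `popcount` is just an illustrative stand-in for any hot code path, and gmpy2 serves as an example accelerator; neither is prescribed by the issue.

```python
def popcount_py(value: int) -> int:
    """Pure-Python reference: count set bits (as used for hamming distance)."""
    return bin(value).count("1")

# Soft dependency: prefer the C-backed implementation when it is installed,
# otherwise silently fall back to the pure-Python reference code.
try:
    import gmpy2
    popcount = gmpy2.popcount
except ImportError:
    popcount = popcount_py
```

Callers always use `popcount(...)` and get identical results either way; only the speed differs.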
Currently text_normalize("Hello\nWorld") yields HelloWorld.
Line feed (LF) and carriage return (CR) characters are filtered out because they fall into the Unicode "Other, Control" (Cc) category. Text normalization should instead preserve word boundaries by replacing such characters with spaces.
See also: http://www.unicode.org/reports/tr29/tr29-29.html#Word_Boundaries
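A minimal sketch of the suggested fix, replacing control characters with a space before collapsing whitespace (illustrative only, not the normative normalization):

```python
import unicodedata

def text_normalize(text: str) -> str:
    # Replace control characters (category Cc) with a space instead of
    # dropping them, then collapse runs of whitespace to a single space.
    replaced = "".join(
        " " if unicodedata.category(char) == "Cc" else char for char in text
    )
    return " ".join(replaced.split())

text_normalize("Hello\nWorld")  # → "Hello World"
```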
To make it easier to get started and play around with the reference implementation, we should add a minimal command line interface. This would enable non-developers with basic technical skills to test the ISCC.
Create some scripts that build PDF versions from the Markdown source of the specification.
DCT calculations may yield different results depending on platform (32/64-bit) due to different floating-point implementations. Investigate possible MPFR-based solutions:
mpmath may be a good option, as it internally uses Python's built-in long integers by default but automatically switches to GMP/MPIR for much faster high-precision arithmetic if gmpy is installed.
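To illustrate the fixed-precision principle with standard-library tools only (mpmath or GMP-backed arithmetic would be the faster choice in practice), here is a DCT-II sketch using decimal. The precision, series cutoff, and naive O(n²) form are illustrative assumptions, not a proposed implementation:

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # fixed precision: identical on 32- and 64-bit hosts
PI = Decimal("3.141592653589793238462643383279502884197")

def dcos(x: Decimal) -> Decimal:
    # Cosine via Taylor series, evaluated at fixed Decimal precision.
    term, total, n = Decimal(1), Decimal(1), 0
    while abs(term) > Decimal(10) ** -35:
        n += 1
        term *= -x * x / (2 * n * (2 * n - 1))
        total += term
    return total

def dct_ii(values) -> list:
    # Naive O(n^2) DCT-II; every intermediate is a Decimal, so the result
    # is bit-for-bit reproducible across platforms.
    n = len(values)
    return [
        sum(Decimal(v) * dcos(PI * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]
```

For a constant input like `[1, 1, 1, 1]` the first coefficient is exactly 4 and all higher coefficients cancel to (near) zero, regardless of platform.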
If cross-sector clustering of the Meta-ID is not required, then users may add existing identifiers to the bound “extra”-field for Meta-ID generation. Encoding existing identifiers into the Meta-ID at random is discouraged and does not provide any practical value besides disambiguation from other similar Meta-IDs. If anything, specific industries should first agree on such conventions. Ideally, such conventions should also be documented by the ISCC standard.
See related discussion with tensorflow: tensorflow/tensorflow#11623
Here is a pure python jpeg decoder: https://github.com/xxyxyz/flat/blob/master/flat/jpeg.py
ImageMagick may be able to do it: https://stackoverflow.com/a/32257778/51627
Others have the same issues: https://stackoverflow.com/questions/45195880/why-does-tensorflow-decode-jpeg-images-differently-from-scipy-imread
libjpeg-turbo supports machine-independent, deterministic integer decoding with dct_method set to JDCT_ISLOW or JDCT_IFAST:
The FLOAT method may also give different results on different machines due to varying roundoff behavior, whereas the integer methods should give the same results on all machines.
Source: https://raw.githubusercontent.com/libjpeg-turbo/libjpeg-turbo/master/libjpeg.txt
Currently:
"While text-extraction is out of scope for this specification ..."
Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."
For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated with some future version of the ISCC (ideally only after compatibility tests). Due to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text extraction tools/versions should be minimal. Even if two different implementations of the ISCC were to generate slightly different Content-IDs, this would not be regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.
Add some JSON- or YAML-based test data and documentation for conformance tests in different language implementations.
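A conformance file could, for example, take a shape like the following; the field names, version string, and sample values are hypothetical, not an agreed format:

```json
{
  "required_version": "1.1",
  "tests": [
    {
      "function": "text_normalize",
      "inputs": ["Hello\nWorld"],
      "expected": "Hello World"
    }
  ]
}
```

Each language implementation would load the same file and assert that its output for every listed function call matches `expected`.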
At Meta-ID level, users might want global uniqueness and be in control of the semantics by “owning” a Meta-ID as an ISCC prefix. This turns out to be a registration-related concern.
We propose to introduce two new variations of the Meta-ID together with the planned blockchain registry.
The first variation would add an “owned”-flag to the Meta-ID-header, indicating that the last byte of the Meta-ID is a variable-length uniqueness counter. The counter would be interactively incremented by the client software during the blockchain registration procedure to guarantee uniqueness and fix ownership semantics for the given ID to the signatory of the registering transaction. This would retain global clustering and de-duplication features while at the same time offering “owned”, authenticated, and globally unique Meta-IDs.
The second variation would not depend on any metadata at all, to better support automated identifier creation. For example, many digital media assets (like photos or granular content) might not have a “title” at all. This variation would be a random or time-based surrogate key, again with a uniqueness counter.
Both variations should include protocol specifications for blockchain registration, ownership-transfer and multi-party-ownership.
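The first variation could be laid out roughly as in the sketch below. The flag position, the one-byte counter width, and the function itself are purely hypothetical illustrations of the idea, not a proposed encoding:

```python
OWNED_FLAG = 0b0000_1000  # assumed position of the "owned" flag in the header

def make_owned_meta_id(header: int, body: bytes, counter: int) -> bytes:
    # Set the "owned" flag in the header byte and append a one-byte
    # uniqueness counter; the client would bump the counter and retry
    # during registration until the resulting ID is unique on-chain.
    assert 0 <= counter <= 255
    return bytes([header | OWNED_FLAG]) + body + bytes([counter])
```

Because the body bytes are unchanged, clustering and de-duplication against non-owned Meta-IDs still works on the shared prefix.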
It would be possible to re-use (re-encoded) existing identifiers as separate or replacement-components in the ISCC Scheme. For example an ISBN-13 could be encoded to the ISCC component format. Of course such a component would not be similarity-preserving, but that is not an absolute requirement. The Instance-ID for example is also not similarity-preserving. I see at least the following requirements for such an integration:
Interested standard bodies are invited to add to this discussion or propose integration of their identifiers into the ISCC Scheme.
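For example, re-encoding an ISBN-13 into a fixed-size component body could look like this sketch; the 64-bit layout and the function name are assumptions for illustration, not part of the scheme:

```python
def isbn13_to_component(isbn: str) -> bytes:
    # Pack the 13 digits into an 8-byte big-endian body. 10**13 < 2**44,
    # so a 64-bit body has plenty of room; layout is hypothetical.
    digits = isbn.replace("-", "")
    assert len(digits) == 13 and digits.isdigit()
    return int(digits).to_bytes(8, "big")
```

Such a component is trivially reversible and collision-free, but, as noted above, not similarity-preserving.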
Investigate content extraction, normalization, canonical reproducibility, custom markup processing, etc.
The text trim function for metadata must ensure that results stored as bytes on-chain do not exceed a given number of bytes. We must therefore make sure that the UTF-8 encoded byte representation of the trimmed Unicode text does not exceed the target byte length.
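A minimal sketch of such a byte-aware trim, assuming the requirement above (not the normative algorithm): cut the UTF-8 encoding at the byte limit, then drop any truncated trailing multi-byte sequence so the result is still valid text.

```python
def trim_text(text: str, nbytes: int) -> str:
    # Slice the UTF-8 bytes at the limit; errors="ignore" silently drops a
    # partially cut multi-byte character at the end instead of raising.
    return text.encode("utf-8")[:nbytes].decode("utf-8", errors="ignore")
```

Note that trimming by character count alone would not work, since a single character can occupy up to four UTF-8 bytes.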
This would encourage mapping ISCC codes to existing standard identifiers.
Show roadmap on specification site https://iscc.codes
Convention is a package-level __version__ variable.