iscc / iscc-specs
ISCC: International Standard Content Code
Home Page: http://iscc.codes
License: Other
Use-Case:
A user has a small chunk of text and wants to find longer texts that contain this chunk or a similar one.
Proposed solution draft:
Apply shift-invariant text chunking (for example, chunks of roughly 1000 characters). Create a separate Content-ID for each chunk. Supply the chunk IDs as metadata to the full ISCC.
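As an illustration, shift-invariant chunking can be done with content-defined boundaries: cut wherever a hash of the preceding few characters matches a bit mask, so boundaries depend only on local content and survive insertions earlier in the text. This is only a toy sketch; the window size, mask, and hash function are assumptions, not part of the specification.

```python
import hashlib

WINDOW = 64            # sliding-window size in characters (assumed)
MASK = (1 << 10) - 1   # ~1024-character average chunk size (assumed)

def chunk_text(text: str) -> list:
    # Cut whenever a hash of the preceding WINDOW characters matches MASK.
    chunks, start = [], 0
    for pos in range(WINDOW, len(text)):
        window = text[pos - WINDOW:pos].encode("utf-8")
        digest = hashlib.sha1(window).digest()
        if (int.from_bytes(digest[:4], "big") & MASK) == MASK:
            chunks.append(text[start:pos])
            start = pos
    chunks.append(text[start:])  # trailing remainder
    return chunks
```

Each chunk could then be fed to the normal Content-ID pipeline; a match on any chunk ID indicates the longer text likely contains the queried fragment.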
The word "Identifier", as used in the specification, creates some confusion, as the ISCC is not a "Standard Identifier" in the traditional sense. It is a "Content Code" with content identification features.
Text tokenization should be simple and generic while also supporting CJK languages. It must yield appropriate results for similarity encoding independent of language and character set. Tokenization should not assume that text can be extracted without word boundary and separation issues.
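One language-agnostic approach that sidesteps word segmentation entirely is to slide a fixed-width character window over normalized text, so CJK text without spaces is handled the same way as English. A hedged sketch; the window width and NFKC normalization are illustrative choices, not the specified algorithm:

```python
import unicodedata

def tokenize(text: str, width: int = 4) -> list:
    # Normalize, lowercase, drop all whitespace, then slide a fixed-width
    # character window over what remains; no word segmentation required.
    norm = "".join(unicodedata.normalize("NFKC", text).lower().split())
    return [norm[i:i + width] for i in range(max(len(norm) - width + 1, 1))]
```

For example, `tokenize("Hello World")` yields overlapping 4-grams starting with `"hell"`, and a CJK string without any spaces produces character n-grams the same way.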
This would promote experimentation and support internal use cases that don't require interoperability but have special requirements regarding algorithm sensitivity. Default values for interoperability should be fixed when the specification becomes stable.
There is currently an outdated web-based demo of the ISCC at https://isccdemo.content-blockchain.org/. We should update it to the current version of the ISCC, with support for content and metadata extraction from media files, and move the demo to an ISCC-hosted domain.
The reference code is currently implemented with minimal dependencies and no performance optimizations. For real-world and large-scale testing of the ISCC we need faster ISCC generation.
Create an optional parallel iscc_opt.py module with the same interface, optimized with packages like numpy, numba, gmpy, or cython (to be researched).
Without making these hard dependencies, we could automatically use the optimized version if the required packages are available in the user's environment. Performance gains of 100x or more can be expected from such optimizations.
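The soft-dependency dispatch could look like the sketch below. The function name `popcount` is just an illustrative stand-in for any hot code path, and gmpy2 serves as an example accelerator; neither is prescribed by the issue.

```python
def popcount_py(value: int) -> int:
    """Pure-Python reference: count set bits (as used for hamming distance)."""
    return bin(value).count("1")

# Soft dependency: prefer the C-backed implementation when it is installed,
# otherwise silently fall back to the pure-Python reference code.
try:
    import gmpy2
    popcount = gmpy2.popcount
except ImportError:
    popcount = popcount_py
```

Callers always use `popcount(...)` and get identical results either way; only the speed differs.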
Currently text_normalize("Hello\nWorld") yields HelloWorld.
Line feed (LF) and carriage return (CR) characters are filtered out because they fall into the Unicode "Other, Control" (Cc) category. Text normalization should instead preserve word boundaries by replacing such characters with spaces.
See also: http://www.unicode.org/reports/tr29/tr29-29.html#Word_Boundaries
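A minimal sketch of the suggested fix, replacing control characters with a space before collapsing whitespace (illustrative only, not the normative normalization):

```python
import unicodedata

def text_normalize(text: str) -> str:
    # Replace control characters (category Cc) with a space instead of
    # dropping them, then collapse runs of whitespace to a single space.
    replaced = "".join(
        " " if unicodedata.category(char) == "Cc" else char for char in text
    )
    return " ".join(replaced.split())

text_normalize("Hello\nWorld")  # → "Hello World"
```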
To make it easier to get started and play around with the reference implementation, we should add a minimal command line interface. This would enable non-developers with basic technical skills to test the ISCC.
Create some scripts that build PDF versions from the Markdown source of the specification.
DCT calculations may yield different results depending on platform (32/64-bit) due to different floating-point implementations. Investigate possible MPFR-based solutions:
mpmath may be a good option, as it internally uses Python's built-in long integers by default but automatically switches to GMP/MPIR for much faster high-precision arithmetic if gmpy is installed.
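To illustrate the fixed-precision principle with standard-library tools only (mpmath or GMP-backed arithmetic would be the faster choice in practice), here is a DCT-II sketch using decimal. The precision, series cutoff, and naive O(n²) form are illustrative assumptions, not a proposed implementation:

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # fixed precision: identical on 32- and 64-bit hosts
PI = Decimal("3.141592653589793238462643383279502884197")

def dcos(x: Decimal) -> Decimal:
    # Cosine via Taylor series, evaluated at fixed Decimal precision.
    term, total, n = Decimal(1), Decimal(1), 0
    while abs(term) > Decimal(10) ** -35:
        n += 1
        term *= -x * x / (2 * n * (2 * n - 1))
        total += term
    return total

def dct_ii(values) -> list:
    # Naive O(n^2) DCT-II; every intermediate is a Decimal, so the result
    # is bit-for-bit reproducible across platforms.
    n = len(values)
    return [
        sum(Decimal(v) * dcos(PI * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]
```

For a constant input like `[1, 1, 1, 1]` the first coefficient is exactly 4 and all higher coefficients cancel to (near) zero, regardless of platform.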
If cross-sector clustering of the Meta-ID is not required, then users may add existing identifiers to the bound “extra”-field for Meta-ID generation. Encoding existing identifiers into the Meta-ID at random is discouraged and does not provide any practical value besides disambiguation from other similar Meta-IDs. If anything, specific industries should first agree on such conventions. Ideally, such conventions should also be documented by the ISCC standard.
See related discussion with tensorflow: tensorflow/tensorflow#11623
Here is a pure python jpeg decoder: https://github.com/xxyxyz/flat/blob/master/flat/jpeg.py
ImageMagick may be able to do it: https://stackoverflow.com/a/32257778/51627
Others have the same issues: https://stackoverflow.com/questions/45195880/why-does-tensorflow-decode-jpeg-images-differently-from-scipy-imread
libjpeg-turbo supports machine-independent, deterministic integer decoding with dct_method set to JDCT_ISLOW or JDCT_IFAST:
The FLOAT method may also give different results on different machines due to varying roundoff behavior, whereas the integer methods should give the same results on all machines.
Source: https://raw.githubusercontent.com/libjpeg-turbo/libjpeg-turbo/master/libjpeg.txt
Currently:
"While text-extraction is out of scope for this specification ..."
Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."
For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated with some future version of the ISCC (ideally only after compatibility tests). Due to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text extraction tools/versions should be minimal. Even if two different implementations of the ISCC were to generate slightly different Content-IDs, this would not be regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.
Add some JSON- or YAML-based test data and documentation for conformance tests in different language implementations.
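A conformance file could, for example, take a shape like the following; the field names, version string, and sample values are hypothetical, not an agreed format:

```json
{
  "required_version": "1.1",
  "tests": [
    {
      "function": "text_normalize",
      "inputs": ["Hello\nWorld"],
      "expected": "Hello World"
    }
  ]
}
```

Each language implementation would load the same file and assert that its output for every listed function call matches `expected`.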
At Meta-ID level, users might want global uniqueness and be in control of the semantics by “owning” a Meta-ID as an ISCC prefix. This turns out to be a registration-related concern.
We propose to introduce two new variations of the Meta-ID together with the planned blockchain registry.
The first variation would add an “owned”-flag to the Meta-ID-header, indicating that the last byte of the Meta-ID is a variable-length uniqueness counter. The counter would be interactively incremented by the client software during the blockchain registration procedure to guarantee uniqueness and fix ownership semantics for the given ID to the signatory of the registering transaction. This would retain global clustering and de-duplication features while at the same time offering “owned”, authenticated, and globally unique Meta-IDs.
The second variation would not depend on any metadata at all, to better support automated identifier creation. For example, many digital media assets (like photos or granular content) might not have a “title” at all. This variation would be a random or time-based surrogate key, again with a uniqueness counter.
Both variations should include protocol specifications for blockchain registration, ownership-transfer and multi-party-ownership.
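The first variation could be laid out roughly as in the sketch below. The flag position, the one-byte counter width, and the function itself are purely hypothetical illustrations of the idea, not a proposed encoding:

```python
OWNED_FLAG = 0b0000_1000  # assumed position of the "owned" flag in the header

def make_owned_meta_id(header: int, body: bytes, counter: int) -> bytes:
    # Set the "owned" flag in the header byte and append a one-byte
    # uniqueness counter; the client would bump the counter and retry
    # during registration until the resulting ID is unique on-chain.
    assert 0 <= counter <= 255
    return bytes([header | OWNED_FLAG]) + body + bytes([counter])
```

Because the body bytes are unchanged, clustering and de-duplication against non-owned Meta-IDs still works on the shared prefix.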
It would be possible to re-use (re-encoded) existing identifiers as separate or replacement-components in the ISCC Scheme. For example an ISBN-13 could be encoded to the ISCC component format. Of course such a component would not be similarity-preserving, but that is not an absolute requirement. The Instance-ID for example is also not similarity-preserving. I see at least the following requirements for such an integration:
Interested standard bodies are invited to add to this discussion or propose integration of their identifiers into the ISCC Scheme.
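For example, re-encoding an ISBN-13 into a fixed-size component body could look like this sketch; the 64-bit layout and the function name are assumptions for illustration, not part of the scheme:

```python
def isbn13_to_component(isbn: str) -> bytes:
    # Pack the 13 digits into an 8-byte big-endian body. 10**13 < 2**44,
    # so a 64-bit body has plenty of room; layout is hypothetical.
    digits = isbn.replace("-", "")
    assert len(digits) == 13 and digits.isdigit()
    return int(digits).to_bytes(8, "big")
```

Such a component is trivially reversible and collision-free, but, as noted above, not similarity-preserving.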
Investigate content extraction, normalization, canonical reproducibility, custom markup processing, etc.
The text trim function for metadata must ensure that results stored as bytes on-chain do not exceed a given number of bytes. We must therefore make sure that the UTF-8 encoded byte representation of the trimmed Unicode text does not exceed the target byte length.
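A minimal sketch of such a byte-aware trim, assuming the requirement above (not the normative algorithm): cut the UTF-8 encoding at the byte limit, then drop any truncated trailing multi-byte sequence so the result is still valid text.

```python
def trim_text(text: str, nbytes: int) -> str:
    # Slice the UTF-8 bytes at the limit; errors="ignore" silently drops a
    # partially cut multi-byte character at the end instead of raising.
    return text.encode("utf-8")[:nbytes].decode("utf-8", errors="ignore")
```

Note that trimming by character count alone would not work, since a single character can occupy up to four UTF-8 bytes.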
This would encourage mapping ISCC codes to existing standard identifiers.
Show roadmap on specification site https://iscc.codes
Convention is a package-level __version__ variable.