readium / go-toolkit
A toolkit for ebooks, audiobooks and comics written in Go
Home Page: https://readium.org/web
License: BSD 3-Clause "New" or "Revised" License
It could be useful to offer a Docker image on Docker Hub so that people can use rwp without installing it on their system.
This could be used as a base image: https://github.com/GoogleContainerTools/distroless/blob/main/base/README.md
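A minimal sketch of what such an image could look like, assuming a multi-stage build; the `./cmd/rwp` path and Go version are assumptions, not taken from the repository:

```dockerfile
# Build stage: compile a static rwp binary (hypothetical package path).
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /rwp ./cmd/rwp

# Runtime stage: the distroless base image linked above.
FROM gcr.io/distroless/base-debian12
COPY --from=build /rwp /rwp
ENTRYPOINT ["/rwp"]
```

With `CGO_ENABLED=0`, the even smaller `gcr.io/distroless/static` variant would also work.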
The following code fails on the final line:
epubData, err := parser.Parse("book.epub")
Expect(err).To(BeNil())
jsonBytes, err := json.Marshal(epubData) // Marshal works fine.
Expect(err).To(BeNil())
foo := &models.Publication{}
err = json.Unmarshal(jsonBytes, foo)
Expect(err).To(BeNil()) // json: cannot unmarshal string into Go struct field Contributor.metadata.author.name of type models.MultiLanguage
I think this happens because of how models.MultiLanguage overrides json.Marshal: it turns a MultiLanguage into either a string or an object, but models.Metadata expects neither when unmarshaling - it expects a MultiLanguage.
I can't understand how this is supposed to work. I'm trying to get the metadata for an epub, store it, and get it again later, but when I get it again later I don't just want to pass it around as bytes, I want to interact with the metadata, so I need to Unmarshal it. It seems unlikely that this wouldn't be supported, but I couldn't find another way after reading the docs and looking through the code. Sorry in advance if I missed something.
Please update your GitHub Action workflow YAML to include the permissions
key and explicitly specify the read/write access rules your jobs actually require:
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions
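For illustration, a workflow could default every job to read-only access and widen permissions only where a job actually needs it; the job name below is a made-up example:

```yaml
# Default all jobs in this workflow to read-only access.
permissions:
  contents: read

jobs:
  release:
    # Hypothetical job that needs to push release assets.
    permissions:
      contents: write
```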
The streamer could also output a manifest in a format compatible with Readium-1, which means that Shared JS could be re-used without any modification as a navigator.
We're a little fuzzy on the details, so far we have:
Is there anything else that we need to know? Are there well-known URIs that we need to know about? Specific expectations regarding content or HTTP requests?
Could you provide us with these details @danielweck?
https://github.com/readium/r2-streamer-go/blob/master/models/metadata.go#L126
func (link *Link) AddRel(rel string) {
...
}
...should be moved to:
https://github.com/readium/r2-streamer-go/blob/master/models/publication.go#L38
type Link struct {
...
}
In fillMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L695 ), mo.Text = smil.Body.TextRef can actually be undefined (such as Moby Dick's first two MO chapters https://github.com/IDPF/epub3-samples/blob/master/30/moby-dick-mo/OPS/chapter_001_overlay.smil#L2 ). The parsing algorithm in the recursive function addSeqToMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L737 ) attempts to "split" the Media Overlays based on the nested seqs' epub:textref values (see baseHref == baseHrefParent in the linked Go code), but this may actually result in skipped SMIL fragments and inconsistent JSON output.
I appreciate the optimisation effort (as we discussed during the Readium-2 conference calls, i.e. the edge case where a single SMIL file references multiple HTML documents), but in the NodeJS implementation I personally decided to replicate the Readium-1 parsing behaviour, which consists in parsing each individual SMIL file in full, and then attaching each resulting root Media Overlay to every HTML spine item that references said SMIL (i.e. the OPF manifest items' media-overlay IDREF). In essence, I removed the special treatment of mo.Text (which is duplicated in both fillMediaOverlay and addSeqToMediaOverlay), and I added a single preliminary parsing step to construct the mapping: https://github.com/edrlab/r2-streamer-js/blob/develop/src/parser/epub.ts#L378-L421
This way, at worst there will be redundant Media Overlays timing data for a given HTML document (in the edge case where the SMIL file references multiple spine items), but in most cases the attached MO will contain exactly what an HTML document needs.
While the toolkit can stream files from local file systems, in many cases users of the toolkit will want to stream publications from an object storage provider, such as Amazon S3 or GCP Cloud Storage. This may seam trivial to implement at first glance ("just hook up S3 to the streamer!") but doing so efficiently is difficult due to the nature of reading ZIP files (EPUB, CBZ etc.).
There is already an optional "minimized read" utility in the ZIP archive reader in the toolkit, but this only works well when paired with lower-level optimizations in the reading of the ZIP itself. The following diagram shows how many reads are needed just to generate a WebPub manifest. For local filesystems, this is perfectly fine and efficient, but when performing the reads on a file located across the web, each additional request adds additional latency. If no optimizations are performed, the latency has a big impact on the performance of whatever software a user of the go-toolkit is writing, not to mention the additional costs of the requests (many object storage providers charge by # of requests). Below is an example of the reads that occur for opening the Moby Dick EPUB file:
I plan on porting my cloud storage reading logic to the go-toolkit to address this issue.
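The cost of those reads can be made visible with a small counting wrapper around io.ReaderAt, which archive/zip consumes; every ReadAt call below would be a separate ranged HTTP request against object storage. This is an illustrative sketch, not the toolkit's actual archive reader:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// countingReaderAt wraps an io.ReaderAt and counts ReadAt calls; against
// object storage, each of these calls would be a separate ranged request.
type countingReaderAt struct {
	r     io.ReaderAt
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

// makeZip builds a tiny in-memory ZIP as a stand-in for an EPUB.
func makeZip() []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	f, _ := w.Create("mimetype")
	f.Write([]byte("application/epub+zip"))
	w.Close()
	return buf.Bytes()
}

// countZipReads opens the ZIP's central directory and reports how many
// discrete reads that alone required.
func countZipReads(data []byte) (entries, reads int) {
	cr := &countingReaderAt{r: bytes.NewReader(data)}
	zr, err := zip.NewReader(cr, int64(len(data)))
	if err != nil {
		panic(err)
	}
	return len(zr.File), cr.calls
}

func main() {
	entries, reads := countZipReads(makeZip())
	fmt.Printf("%d entries, %d reads just to list the archive\n", entries, reads)
}
```

Batching or caching at the ReaderAt layer (e.g. fetching the whole central directory in one ranged request) is what turns this into an efficient remote reader.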
We need to handle both the Adobe and the IDPF way.
https://github.com/readium/r2-streamer-go/blob/master/models/metadata.go#L34
Right string `json:"rights,omitempty"`
Missing 's' in the struct's "Right" field name?
In EPUB 2.0 this is handled by dc:date as referenced in http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2.7
Will have to be careful in 2.x about event, since this attribute is largely undefined.
In 3.x this is more clearly established (http://www.idpf.org/epub/301/spec/epub-publications.html#sec-opf-dcmes-optional):
- the dc:date element
- dcterms:date
Main issue is that ISO 8601 is only a recommendation, not a requirement.
When streaming an EPUB's content to web browsers using the go-toolkit, it is possible for the end user's device to be the only one that must decompress the file's contents. This would further reduce the CPU usage of any software using the go-toolkit, since decompression would no longer happen in its code, and additional resource usage could be avoided because the resource no longer needs to be recompressed further up the chain, e.g. in nginx or apache.
How does this work in practice? Here's an example of the Accept-Encoding header of a modern browser (Chrome): Accept-Encoding: gzip, deflate, br, zstd. The deflate encoding is the same compression scheme used inside ZIP files. This means that if you were to provide a browser that supports this encoding with the contents of a compressed zip.File using the OpenRaw() method, and return Content-Encoding: deflate, the browser would be able to decode it directly.
Currently, the properties of either a spine or a manifest item are ignored during the parsing.
Instead of this behavior we'd like to:
- map some of them to rel values, for instance for the cover or the ToC
- express properties such as mathml or svg as "contains": ["mathml", "svg"]
Additional work on the Web Publication Manifest will have to be done in parallel in order to improve how properties are expressed in our publication model.
From the README:
https://github.com/readium/r2-streamer-go#live-demo
https://proto.myopds.com (dead?)
This one works (from the GitHub repository description):
https://readium2.feedbooks.net
We have support for CBZ files; CBR support is possible using this package: github.com/mholt/archiver/v4
Calling Get when a manifest.Link's Href contains an anchor returns resource: error 404: file does not exist.
An example Manifest.TableOfContents:
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml",
Type: "",
Templated: false,
Title: "APPENDICE A ANNALI DEI RE E DEI GOVERNATORI",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml#sec1",
Type: "",
Templated: false,
Title: "I. I re Númenóreani",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{},
},
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml#sec2",
Type: "",
Templated: false,
Title: "II. La casa di Eorl",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{},
},
},
}
Could a check on anchors be added to Get?
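Such a check could strip the fragment before resolving the path against the archive, since "file.xhtml#sec1" names an anchor inside "file.xhtml", not a separate resource. hrefPath is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// hrefPath strips any URL fragment from an Href before it is looked up
// in the publication's archive.
func hrefPath(href string) string {
	if i := strings.IndexByte(href, '#'); i >= 0 {
		return href[:i]
	}
	return href
}

func main() {
	fmt.Println(hrefPath("/OEBPS/Text/appendice1.xhtml#sec1"))
	// → /OEBPS/Text/appendice1.xhtml
}
```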
In EPUB 3.0.x it's possible to indicate that a title is a subtitle:
<dc:title id="title_2">All About EPUB 3.1</dc:title>
<meta refines="#title_2" property="title-type">subtitle</meta>
<meta refines="#title_2" property="display-seq">2</meta>
A recent revision to the Readium Web Publication Manifest also added a subtitle element to play the same role.
We need to add support for subtitles in the streamer, using the same multi-lingual model as the title element.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/lcp.go#L33
Rights struct {
Print int `json:"print"`
Copy int `json:"copy"`
Start *time.Time `json:"start"`
End *time.Time `json:"end"`
}
User struct {
ID string `json:"id"`
Email string `json:"email"`
Name string `json:"name"`
Encrypted []string `json:"encrypted"`
}
Signature struct {
Algorithm string `json:"algorithm"`
Certificate string `json:"certificate"`
Value string `json:"value"`
}
Missing: `json:"rights"`, `json:"user"` and `json:"signature"` tags.
While the current version of the Go streamer is focused on parsing and serving a single publication, the use case for both the Go and node.js/Typescript versions of the streamer might be primarily on the server side.
To better adapt to such use cases, we should do the following:
- publications/
- /publications.json
OPDS 2.0 is not truly a thing yet, aside from a few experiments on Gist.
But to reach a point where we're comfortable writing a specification for OPDS 2.0, we need to experiment and this is the perfect opportunity to do it.
Here are a few ground rules:
- the media type will be application/opds+json
- a feed can contain publications (equivalent of an acquisition feed), navigation and groups (to replace rel="collection" and aggregate publications together in a single feed)
- each item in the publications collection is a Web Publication Manifest (metadata, links and a new images collection that contains one or more different covers)
Here's a very basic example of what the output will look like:
{
"@context": "http://opds-spec.org/opds.jsonld",
"metadata": {
"@type": "http://schema.org/DataFeed",
"title": "All Publications",
"numberOfItems": 1
},
"links": [
{"rel": "self", "href": "http://example.org/publications.json", "type": "application/opds+json"}
],
"publications": [
{
"metadata": {
"@type": "http://schema.org/Book",
"title": "Moby-Dick",
"author": "Herman Melville",
"identifier": "urn:isbn:978031600000X",
"language": "en",
"modified": "2015-09-29T17:00:00Z"
},
"links": [
{"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
],
"images": [
{"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 600, "width": 400}
]
}
]
}
https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json
If I am not mistaken, cover.xhtml, nav.xhtml and s04.xhtml should only appear in the spine collection, not resources:
...
,
"spine": [
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
"type": "application/xhtml+xml",
"rel": [
"contents"
],
"properties": {
"contains": [
"js"
]
}
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
"type": "application/xhtml+xml"
}
],
"resources": [
...
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
"type": "application/xhtml+xml",
"rel": [
"contents"
],
"properties": {
"contains": [
"js"
]
}
}
...
],
...
As it's been pointed out in the EPUB 3 maintenance group, EPUB 3.1 doesn't allow content producers to indicate more than a single role for a contributor.
In EPUB 3.0.x it was possible to indicate as many roles as you wanted:
<dc:contributor id="Olaf">Dr. Olaf Hoffmann</dc:contributor>
<meta refines="#Olaf" property="file-as">Dr. Hoffmann, Olaf</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">mrk</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">art</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">ill</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">aui</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">pfr</meta>
The Readium Web Publication Manifest is a direct descendant of the BFF project and was designed with 3.1 round-trippability in mind.
But in the context of Readium-2, we want to maximize compatibility with EPUB 2.0.1 and any 3.x revision, which means that instead of allowing a single string for the role in a contributor element, we need to move to an array of strings.
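For illustration, the contributor above might then serialize along these lines (a sketch of the proposed shape, not a normative example):

```json
{
  "name": "Dr. Olaf Hoffmann",
  "sortAs": "Dr. Hoffmann, Olaf",
  "role": ["mrk", "art", "ill", "aui", "pfr"]
}
```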
To align with the new v3 HREF strategy on the mobile toolkits, leading slashes should not be added in relative HREFs.
See this comment: #82 (comment)
These xml tags are missing the ,attr option:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/ncx.go
type NavPoint struct {
PlayerOrder int `xml:"playOrder"`
}
...
type PageList struct {
Class string `xml:"class"`
ID string `xml:"id"`
}
...
type PageTarget struct {
Value string `xml:"value"`
Type string `xml:"type"`
PlayOrder int `xml:"playOrder"`
}
Oh, and PageTarget is missing ID string `xml:"id,attr"`
There are various popular standards for comic book metadata, especially in the case of CBZs, that are not yet parsed and used to provide comic metadata. These include ComicRack's ComicInfo.xml, ComicBookInfo in the ZIP comment, the Advanced Comic Book Format as an .acbf file, and more.
Useful links:
Useful links:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/encryption.go#L28
type KeyInfo struct {
Resource string `xml:",chardata"`
}
...seems incorrect; it is missing the ds:RetrievalMethod content model:
http://www.idpf.org/epub/31/spec/epub-ocf.html#sec-container-metainf-encryption.xml
<ds:KeyInfo>
<ds:RetrievalMethod URI="#EK"
Type="http://www.w3.org/2001/04/xmlenc#EncryptedKey"/>
</ds:KeyInfo>
Description in epic readium/architecture#38
I would like to propose the addition of a new parameter, -v or --version, to the tool's command-line interface (CLI). This parameter would allow users to quickly retrieve the version of the tool they are using.
Currently, finding the version of the tool from the CLI is cumbersome.
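A minimal sketch with the standard flag package; the variable name, flag wiring and build command are assumptions, not the rwp CLI's current implementation:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// version would typically be injected at build time, e.g.
//   go build -ldflags "-X main.version=1.2.3" ./cmd/rwp
var version = "dev"

func versionString() string {
	return "rwp version " + version
}

func main() {
	// Register both spellings on the same underlying value.
	showVersion := flag.Bool("version", false, "print the version and exit")
	flag.BoolVar(showVersion, "v", false, "shorthand for --version")
	flag.Parse()
	if *showVersion {
		fmt.Println(versionString())
		os.Exit(0)
	}
}
```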
There are cases where a landmarks link references an anchor in the navigation document, like #toc.
See details here:
readium/webpub-manifest#24
The current model does not support alt-rep/alt-script to provide a representation of a title or a contributor in another language/script.
This is mostly tied to the fact that we're directly de-serializing the structure to JSON:
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L6
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L44
The Web Publication Manifest has extensive support for alternative representations of a string, more so than EPUB 3.1:
"title": {
"fr": "Vingt mille lieues sous les mers",
"en": "Twenty Thousand Leagues Under the Sea",
"ja": "海底二万里"
}
"author": {
"name": {
"ru": "Михаил Афанасьевич Булгаков",
"en": "Mikhail Bulgakov",
"fr": "Mikhaïl Boulgakov"
}
}
This is one area where streamers in other languages SHOULD NOT copy the current Go project; they should make sure that their model supports the full extent of what the Web Publication Manifest can do.
Is there an easy way to deal with the serialization issue in Go without making everything else far more complex? cc @jpbougie @banux
@HadrienGardeur commented on Mon Mar 13 2017
With the current code, media overlays are not parsed when they're encrypted.
We need to add support for this feature by handling the following behavior:
- make sure that the Media Overlay references (links or properties) are present, even if we can't decrypt the SMIL files yet
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L202
if epubVersion == "3.0" {
}
should be:
if isEpub3OrMore(book) {
}
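A sketch of what such a helper could check (hypothetical signature; the real helper would presumably take the parsed book): compare the major version number instead of one exact version string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isEpub3OrMore reports whether an EPUB version string ("2.0", "3.0",
// "3.2", ...) denotes EPUB 3 or later, by comparing the major version.
func isEpub3OrMore(version string) bool {
	major, err := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	return err == nil && major >= 3
}

func main() {
	fmt.Println(isEpub3OrMore("3.0"), isEpub3OrMore("3.2"), isEpub3OrMore("2.0"))
	// → true true false
}
```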
"Children's Literature"
Go streamer:
https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json
NodeJS streamer:
https://readium2.herokuapp.com/pub/L2FwcC9taXNjL2VwdWJzL2NoaWxkcmVucy1saXRlcmF0dXJlLmVwdWI%3D/manifest.json
Using JSON comparison tools (links below) to help pin-point discrepancies in the "webpub manifest" streamer output, I found the following errors (ignoring issues related to the "base URL" used to explicitly resolve absolute links, and to the lack of consistent normalization for union-type values such as string vs. array-of-strings, e.g. @context and rel):
- entries are skipped when there is no href even when there is a valid title, resulting in missing data in the generated toc JSON: "Abram S. Isaacs", "Samuel Taylor Coleridge", "Hans Christian Andersen", "Frances Browne", "Oscar Wilde", "Raymond MacDonald Alden", "Jean Ingelow", "Frank R. Stockton", "John Ruskin".
- title text is not normalized for whitespace: \n\t\t\t\t\t\t\t\t\t\t\t\tI. The Rabbi and the Diadem\n\t\t\t\t\t\t\t\t\t\t\t vs. I. The Rabbi and the Diadem.
JSON comparison tools:
In order to improve performance, a number of optimizations can also be handled at the HTTP level:
- a Cache-Control header and a long expiration date
header and a long expiration dateJust a heads-up: although your test server does not emit HTTP CORS headers (which would make sense, as a reading system app would most likely be hosted on a different domain, distinct from the content server's origin), you can use a proxy such as https://crossorigin.me , for example:
https://proto.myopds.com/manifest/mobydick.epub/manifest.json
content-type → text/plain; charset=utf-8
date → Wed, 12 Oct 2016 14:37:50 GMT
server → Caddy
status → 200
vary → Origin
vs.
https://crossorigin.me/https://proto.myopds.com/manifest/mobydick.epub/manifest.json
access-control-allow-credentials → false
access-control-allow-headers → Content-Type, X-Requested-With
access-control-allow-origin → *
cf-ray → 2f0b508a2d2d360e-LHR
content-encoding → gzip
content-type → text/plain; charset=utf-8
date → Wed, 12 Oct 2016 14:41:47 GMT
expires → Thu, 13 Oct 2016 14:41:46 GMT
server → cloudflare-nginx
status → 200
In order to support LCP, in addition to handling decryption, the streamer also needs to add two new interactions:
For the LCP license, the fetcher will serve directly META-INF/license.lcpl
through a link at the publication level:
{
"href": "license.lcpl",
"rel": "license",
"type": "application/vnd.readium.lcp.license-1.0+json"
}
The LCP license handler will have the following interactions possible:
The following link is added to the publication's manifest:
{
"href": "license-handler.json",
"rel": "http://readium.org/lcp/handler",
"type": "application/json"
}
The two interactions possible respond with the following documents:
GET license-handler.json
{
"identifier": "62b2dfcb-48f0-4e1b-b2b0-8e3444960f13",
"profile": "http://readium.org/lcp/basic-profile",
"key": {
"ready": false,
"check": "jJEjUDipHK3OjGt6kFq7dcOLZuicQFUYwQ+TYkAIWKm6Xv6kpHFhF7LOkUK/Owww"
},
"hint": {
"text": "Enter your library card PIN",
"url": "http://www.example.com/passphraseHint?user_id=1234"
},
"support": {
"mail": "[email protected]",
"url": "http://www.example.com/support",
"tel": "1800836482"
}
}
POST license-handler.json
{
"key": {
"hash": "9728be1c6737759dcba331ebe78276d8c83999b02d410aa2662c763915229a79"
}
}
The POST request returns the full handler document, with the following HTTP status codes:
printPageNumbers has been gradually deprecated in favor of two different values:
- pageBreakMarkers, which is meant to indicate that the text of the publication contains page break markers using ARIA
- pageNavigation, which is meant to indicate that the publication contains a list of pages, usually based on HTML IDs
The current inference technique is based on the presence of pageList in the RWPM output, which can either come from:
Given the nature of what is inferred here, the toolkit should return pageNavigation instead of printPageNumbers.
package -> compression
size -> original-length
In EPUB, various spine items can be declared as non-linear.
As a concept, linearity is very vague and can be handled in a number of ways by reading systems, but most of them simply skip non-linear items.
To avoid this pitfall, the Web Publication Manifest and the in-memory model will not consider non-linear resources to be part of the spine.
This means that the parser should verify each <itemref>
in the spine:
<itemref idref="c1-answerkey" linear="no"/>
If an <itemref> includes a linear attribute set to "no", the resource should be added to resources. Otherwise, the resource should be added to spine.
cc @danielweck
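The rule above can be sketched as a small partitioning step; itemref and partitionSpine are hypothetical names, not the parser's actual types:

```go
package main

import "fmt"

// itemref is a minimal stand-in for a parsed OPF <itemref>.
type itemref struct {
	IDRef  string
	Linear string // "", "yes" or "no"; absent means "yes"
}

// partitionSpine applies the rule described above: linear="no" items go
// to resources, everything else stays in the spine.
func partitionSpine(refs []itemref) (spine, resources []string) {
	for _, ref := range refs {
		if ref.Linear == "no" {
			resources = append(resources, ref.IDRef)
		} else {
			spine = append(spine, ref.IDRef)
		}
	}
	return spine, resources
}

func main() {
	refs := []itemref{{IDRef: "c1"}, {IDRef: "c1-answerkey", Linear: "no"}}
	spine, resources := partitionSpine(refs)
	fmt.Println(spine, resources)
	// → [c1] [c1-answerkey]
}
```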
The current prototype has limited support for metadata:
This should be extended to support all the metadata currently available in models, both in EPUB 2.x and EPUB 3.x.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L214
contributor.Name.MultiString = make(map[string]string)
contributor.Name.MultiString[publication.Metadata.Language[0]] = cont.Data
for _, m := range metaAlt {
contributor.Name.MultiString[m.Lang] = m.Data
}
...should lower-case the language code, to be consistent with:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L280
publication.Metadata.Title.MultiString = make(map[string]string)
publication.Metadata.Title.MultiString[strings.ToLower(mainTitle.Lang)] = mainTitle.Data
for _, m := range metaAlt {
publication.Metadata.Title.MultiString[strings.ToLower(m.Lang)] = m.Data
}
In EPUB 3.0.1, dc:type
is tied to an EPUB controlled vocabulary that we should attempt to store in our in memory model and in the Web Publication Manifest.
Since prior to EPUB 3.0.1 other values could be used, it's probably best to filter this and only support controlled vocabularies.
Subjects prior to EPUB 3.1 are basically a list of tags that may or may not be concatenated in a single field.
The parser should not attempt to separate concatenated subjects.
In EPUB 3.1, subjects behave much more like Web Publications and the format should be very close to what we have in memory.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/opf.go#L54
Dir string `xml:"dir"`
should be:
Dir string `xml:"dir,attr"`
Both the NCX and the Navigation Document are currently ignored. This should be modified to extract:
- toc and/or page-list for the NCX
- toc, page-list, landmarks, loi, loa, lov and lot for the Navigation Document
In addition to these two documents, we might also treat the EPUB 2.x guide element as the equivalent of landmarks.
The Navigation Document should always take precedence over the NCX when both of them are available.
The Navigation Document itself should also be marked as such in the spine/resources collection of a publication (in our in-memory model), using rel="contents".