readium / go-toolkit
A toolkit for ebooks, audiobooks and comics written in Go
Home Page: https://readium.org/web
License: BSD 3-Clause "New" or "Revised" License
It could be useful to offer a Docker image on Docker Hub so that people can use rwp without installing it on their system.
This could be used as a base image: https://github.com/GoogleContainerTools/distroless/blob/main/base/README.md
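A minimal sketch of what such an image could look like, assuming a multi-stage build; the `./cmd/rwp` path and Go version are assumptions, not taken from the repository:

```dockerfile
# Build stage: compile a static rwp binary (hypothetical package path).
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /rwp ./cmd/rwp

# Runtime stage: the distroless base image linked above.
FROM gcr.io/distroless/base-debian12
COPY --from=build /rwp /rwp
ENTRYPOINT ["/rwp"]
```

With `CGO_ENABLED=0`, the even smaller `gcr.io/distroless/static` variant would also work.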
The following code fails on the final line:
epubData, err := parser.Parse("book.epub")
Expect(err).To(BeNil())
jsonBytes, err := json.Marshal(epubData) // Marshal works fine.
Expect(err).To(BeNil())
foo := &models.Publication{}
err = json.Unmarshal(jsonBytes, foo)
Expect(err).To(BeNil()) // json: cannot unmarshal string into Go struct field Contributor.metadata.author.name of type models.MultiLanguage
I think this happens because of how models.MultiLanguage overrides json.Marshal: it turns a MultiLanguage into either a string or an object, but models.Metadata expects neither when unmarshaling - it expects a MultiLanguage.
I can't understand how this is supposed to work. I'm trying to get the metadata for an epub, store it, and get it again later, but when I get it again later I don't just want to pass it around as bytes, I want to interact with the metadata, so I need to Unmarshal it. It seems unlikely that this wouldn't be supported, but I couldn't find another way after reading the docs and looking through the code. Sorry in advance if I missed something.
Please update your GitHub Action workflow YAML to include the permissions
key and explicitly specify the read/write access rules your jobs actually require:
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions
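For illustration, a workflow could default every job to read-only access and widen permissions only where a job actually needs it; the job name below is a made-up example:

```yaml
# Default all jobs in this workflow to read-only access.
permissions:
  contents: read

jobs:
  release:
    # Hypothetical job that needs to push release assets.
    permissions:
      contents: write
```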
The streamer could also output a manifest in a format compatible with Readium-1, which means that Shared JS could be re-used without any modification as a navigator.
We're a little fuzzy on the details, so far we have:
Is there anything else that we need to know? Are there well-known URIs that we need to know about? Specific expectations regarding content or HTTP requests?
Could you provide us with these details @danielweck?
https://github.com/readium/r2-streamer-go/blob/master/models/metadata.go#L126
func (link *Link) AddRel(rel string) {
...
}
...should be moved to:
https://github.com/readium/r2-streamer-go/blob/master/models/publication.go#L38
type Link struct {
...
}
In fillMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L695 ), mo.Text = smil.Body.TextRef can actually be undefined (such as Moby Dick's first two MO chapters https://github.com/IDPF/epub3-samples/blob/master/30/moby-dick-mo/OPS/chapter_001_overlay.smil#L2 ). The parsing algorithm in the recursive function addSeqToMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L737 ) attempts to "split" the Media Overlays based on the nested seqs' epub:textref values (see baseHref == baseHrefParent in the linked Go code), but this may actually result in skipped SMIL fragments and inconsistent JSON output.
I appreciate the optimisation effort (as we discussed during the Readium-2 conference calls, i.e. the edge case where a single SMIL file references multiple HTML documents), but in the NodeJS implementation I personally decided to replicate the Readium-1 parsing behaviour, which consists in parsing each individual SMIL file in full, and then attaching each resulting root Media Overlay to every HTML spine item that references said SMIL (i.e. the OPF manifest items' media-overlay IDREF). In essence, I removed the special treatment of mo.Text (which is duplicated in both fillMediaOverlay and addSeqToMediaOverlay), and I added a single preliminary parsing step to construct the mapping: https://github.com/edrlab/r2-streamer-js/blob/develop/src/parser/epub.ts#L378-L421
This way, at worst there will be redundant Media Overlays timing data for a given HTML document (in the edge case where the SMIL file references multiple spine items), but in most cases the attached MO will contain exactly what an HTML document needs.
While the toolkit can stream files from local file systems, in many cases users of the toolkit will want to stream publications from an object storage provider, such as Amazon S3 or GCP Cloud Storage. This may seam trivial to implement at first glance ("just hook up S3 to the streamer!") but doing so efficiently is difficult due to the nature of reading ZIP files (EPUB, CBZ etc.).
There is already an optional "minimized read" utility in the ZIP archive reader in the toolkit, but this only works well when paired with lower-level optimizations in the reading of the ZIP itself. The following diagram shows how many reads are needed just to generate a WebPub manifest. For local filesystems, this is perfectly fine and efficient, but when performing the reads on a file located across the web, each additional request adds additional latency. If no optimizations are performed, the latency has a big impact on the performance of whatever software a user of the go-toolkit is writing, not to mention the additional costs of the requests (many object storage providers charge by # of requests). Below is an example of the reads that occur for opening the Moby Dick EPUB file:
I plan on porting my cloud storage reading logic to the go-toolkit to address this issue.
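The cost of those reads can be made visible with a small counting wrapper around io.ReaderAt, which archive/zip consumes; every ReadAt call below would be a separate ranged HTTP request against object storage. This is an illustrative sketch, not the toolkit's actual archive reader:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// countingReaderAt wraps an io.ReaderAt and counts ReadAt calls; against
// object storage, each of these calls would be a separate ranged request.
type countingReaderAt struct {
	r     io.ReaderAt
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

// makeZip builds a tiny in-memory ZIP as a stand-in for an EPUB.
func makeZip() []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	f, _ := w.Create("mimetype")
	f.Write([]byte("application/epub+zip"))
	w.Close()
	return buf.Bytes()
}

// countZipReads opens the ZIP's central directory and reports how many
// discrete reads that alone required.
func countZipReads(data []byte) (entries, reads int) {
	cr := &countingReaderAt{r: bytes.NewReader(data)}
	zr, err := zip.NewReader(cr, int64(len(data)))
	if err != nil {
		panic(err)
	}
	return len(zr.File), cr.calls
}

func main() {
	entries, reads := countZipReads(makeZip())
	fmt.Printf("%d entries, %d reads just to list the archive\n", entries, reads)
}
```

Batching or caching at the ReaderAt layer (e.g. fetching the whole central directory in one ranged request) is what turns this into an efficient remote reader.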
We need to handle both the Adobe and the IDPF way.
https://github.com/readium/r2-streamer-go/blob/master/models/metadata.go#L34
Right string `json:"rights,omitempty"`
Missing 's' in the struct's "Right" field name?
In EPUB 2.0 this is handled by dc:date as referenced in http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2.7
Will have to be careful in 2.x about event, since this attribute is largely undefined.
In 3.x this is more clearly established (http://www.idpf.org/epub/301/spec/epub-publications.html#sec-opf-dcmes-optional):
- the dc:date element
- dcterms:date
Main issue is that ISO 8601 is only a recommendation, not a requirement.
When streaming an EPUB's content to web browsers using the go-toolkit, it is possible for the end user's device to be the only one that must decompress the file's contents. This would further reduce the CPU usage of any software using the go-toolkit, since decompression would no longer happen in its code, and additional resource usage could be avoided because the resource no longer needs to be recompressed further up the chain, e.g. in nginx or apache.
How does this work in practice? Here's an example of the Accept-Encoding header of a modern browser (Chrome): Accept-Encoding: gzip, deflate, br, zstd. The deflate encoding is the same compression scheme used inside ZIP files. This means that if you were to provide a browser that supports this encoding with the contents of a compressed zip.File using the OpenRaw() method, and return Content-Encoding: deflate, the browser would be able to decode it directly.
Currently, the properties of either a spine or a manifest item are ignored during the parsing.
Instead of this behavior we'd like to:
- map some of them to rel values, for instance for the cover or the ToC
- express properties such as mathml or svg as "contains": ["mathml", "svg"]
Additional work on the Web Publication Manifest will have to be done in parallel in order to improve how properties are expressed in our publication model.
From the README:
https://github.com/readium/r2-streamer-go#live-demo
https://proto.myopds.com (dead?)
This one works (from the GitHub repository description):
https://readium2.feedbooks.net
We have support for CBZ files; CBR support is possible using this package: github.com/mholt/archiver/v4
Calling Get when a manifest.Link's Href contains an anchor returns resource: error 404: file does not exist.
An example Manifest.TableOfContents:
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml",
Type: "",
Templated: false,
Title: "APPENDICE A ANNALI DEI RE E DEI GOVERNATORI",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml#sec1",
Type: "",
Templated: false,
Title: "I. I re Númenóreani",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{},
},
manifest.Link{
Href: "/OEBPS/Text/appendice1.xhtml#sec2",
Type: "",
Templated: false,
Title: "II. La casa di Eorl",
Rels: manifest.Strings{},
Properties: manifest.Properties{},
Height: 0x0,
Width: 0x0,
Bitrate: 0.000000,
Duration: 0.000000,
Languages: manifest.Strings{},
Alternates: manifest.LinkList{},
Children: manifest.LinkList{},
},
},
}
Could a check on anchors be added to Get?
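Such a check could strip the fragment before resolving the path against the archive, since "file.xhtml#sec1" names an anchor inside "file.xhtml", not a separate resource. hrefPath is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// hrefPath strips any URL fragment from an Href before it is looked up
// in the publication's archive.
func hrefPath(href string) string {
	if i := strings.IndexByte(href, '#'); i >= 0 {
		return href[:i]
	}
	return href
}

func main() {
	fmt.Println(hrefPath("/OEBPS/Text/appendice1.xhtml#sec1"))
	// → /OEBPS/Text/appendice1.xhtml
}
```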
In EPUB 3.0.x it's possible to indicate that a title is a subtitle:
<dc:title id="title_2">All About EPUB 3.1</dc:title>
<meta refines="#title_2" property="title-type">subtitle</meta>
<meta refines="#title_2" property="display-seq">2</meta>
A recent revision to the Readium Web Publication Manifest also added a subtitle element to play the same role.
We need to add support for subtitles in the streamer, using the same multi-lingual model as the title element.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/lcp.go#L33
Rights struct {
Print int `json:"print"`
Copy int `json:"copy"`
Start *time.Time `json:"start"`
End *time.Time `json:"end"`
}
User struct {
ID string `json:"id"`
Email string `json:"email"`
Name string `json:"name"`
Encrypted []string `json:"encrypted"`
}
Signature struct {
Algorithm string `json:"algorithm"`
Certificate string `json:"certificate"`
Value string `json:"value"`
}
Missing: `json:"rights"`, `json:"user"` and `json:"signature"` tags.
While the current version of the Go streamer is focused on parsing and serving a single publication, the use case for both the Go and node.js/Typescript versions of the streamer might be primarily on the server side.
To better adapt to such use cases, we should do the following:
- publications/
- /publications.json
OPDS 2.0 is not truly a thing yet, aside from a few experiments on Gist.
But to reach a point where we're comfortable writing a specification for OPDS 2.0, we need to experiment and this is the perfect opportunity to do it.
Here are a few ground rules:
- the media type will be application/opds+json
- a feed can contain publications (equivalent of an acquisition feed), navigation and groups (to replace rel="collection" and aggregate publications together in a single feed)
- each item in the publications collection is a Web Publication Manifest (metadata, links and a new images collection that contains one or more different covers)
Here's a very basic example of what the output will look like:
{
"@context": "http://opds-spec.org/opds.jsonld",
"metadata": {
"@type": "http://schema.org/DataFeed",
"title": "All Publications",
"numberOfItems": 1
},
"links": [
{"rel": "self", "href": "http://example.org/publications.json", "type": "application/opds+json"}
],
"publications": [
{
"metadata": {
"@type": "http://schema.org/Book",
"title": "Moby-Dick",
"author": "Herman Melville",
"identifier": "urn:isbn:978031600000X",
"language": "en",
"modified": "2015-09-29T17:00:00Z"
},
"links": [
{"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
],
"images": [
{"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 600, "width": 400}
]
}
]
}
https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json
If I am not mistaken, cover.xhtml, nav.xhtml and s04.xhtml should only appear in the spine collection, not resources:
...
,
"spine": [
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
"type": "application/xhtml+xml",
"rel": [
"contents"
],
"properties": {
"contains": [
"js"
]
}
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
"type": "application/xhtml+xml"
}
],
"resources": [
...
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
"type": "application/xhtml+xml"
},
{
"href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
"type": "application/xhtml+xml",
"rel": [
"contents"
],
"properties": {
"contains": [
"js"
]
}
}
...
],
...
As it's been pointed out in the EPUB 3 maintenance group, EPUB 3.1 doesn't allow content producers to indicate more than a single role for a contributor.
In EPUB 3.0.x it was possible to indicate as many roles as you wanted:
<dc:contributor id="Olaf">Dr. Olaf Hoffmann</dc:contributor>
<meta refines="#Olaf" property="file-as">Dr. Hoffmann, Olaf</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">mrk</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">art</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">ill</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">aui</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">pfr</meta>
The Readium Web Publication Manifest is a direct descendant of the BFF project and was designed with 3.1 round-trippability in mind.
But in the context of Readium-2, we want to maximize compatibility with EPUB 2.0.1 and any 3.x revision, which means that instead of allowing a single string for the role in a contributor element, we need to move to an array of strings.
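For illustration, the contributor above might then serialize along these lines (a sketch of the proposed shape, not a normative example):

```json
{
  "name": "Dr. Olaf Hoffmann",
  "sortAs": "Dr. Hoffmann, Olaf",
  "role": ["mrk", "art", "ill", "aui", "pfr"]
}
```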
To align with the new v3 HREF strategy on the mobile toolkits, leading slashes should not be added in relative HREFs.
See this comment: #82 (comment)
These xml tags are missing the ,attr option:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/ncx.go
type NavPoint struct {
PlayerOrder int `xml:"playOrder"`
}
...
type PageList struct {
Class string `xml:"class"`
ID string `xml:"id"`
}
...
type PageTarget struct {
Value string `xml:"value"`
Type string `xml:"type"`
PlayOrder int `xml:"playOrder"`
}
Oh, and PageTarget is missing ID string `xml:"id,attr"`
There are various popular standards for comic book metadata, especially in the case of CBZs, that are not yet parsed and used to provide comic metadata. These include ComicRack's ComicInfo.xml, ComicBookInfo in the ZIP comment, the Advanced Comic Book Format as an .acbf file, and more.
Useful links:
Useful links:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/encryption.go#L28
type KeyInfo struct {
Resource string `xml:",chardata"`
}
...seems incorrect; it is missing the ds:RetrievalMethod content model:
http://www.idpf.org/epub/31/spec/epub-ocf.html#sec-container-metainf-encryption.xml
<ds:KeyInfo>
<ds:RetrievalMethod URI="#EK"
Type="http://www.w3.org/2001/04/xmlenc#EncryptedKey"/>
</ds:KeyInfo>
Description in epic readium/architecture#38
I would like to propose the addition of a new parameter, -v or --version, to the tool's command-line interface (CLI). This parameter would allow users to quickly retrieve the version of the tool they are using.
Currently, finding the version of the tool from the CLI is cumbersome.
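A minimal sketch with the standard flag package; the variable name, flag wiring and build command are assumptions, not the rwp CLI's current implementation:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// version would typically be injected at build time, e.g.
//   go build -ldflags "-X main.version=1.2.3" ./cmd/rwp
var version = "dev"

func versionString() string {
	return "rwp version " + version
}

func main() {
	// Register both spellings on the same underlying value.
	showVersion := flag.Bool("version", false, "print the version and exit")
	flag.BoolVar(showVersion, "v", false, "shorthand for --version")
	flag.Parse()
	if *showVersion {
		fmt.Println(versionString())
		os.Exit(0)
	}
}
```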
There are cases where a landmarks link references an anchor in the navigation document, like #toc.
See details here:
readium/webpub-manifest#24
The current model does not support alt-rep/alt-script to provide a representation of a title or a contributor in another language/script.
This is mostly tied to the fact that we're directly de-serializing the structure to JSON:
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L6
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L44
The Web Publication Manifest has extensive support for alternative representations of a string, more so than EPUB 3.1:
"title": {
"fr": "Vingt mille lieues sous les mers",
"en": "Twenty Thousand Leagues Under the Sea",
"ja": "海底二万里"
}
"author": {
"name": {
"ru": "Михаил Афанасьевич Булгаков",
"en": "Mikhail Bulgakov",
"fr": "Mikhaïl Boulgakov"
}
}
This is one area where streamers in other languages SHOULD NOT copy the current Go project; they should make sure that their model supports the full extent of what the Web Publication Manifest can do.
Is there an easy way to deal with the serialization issue in Go without making everything else far more complex? cc @jpbougie @banux
@HadrienGardeur commented on Mon Mar 13 2017
With the current code, media overlays are not parsed when they're encrypted.
We need to add support for this feature by handling the following behavior:
- make sure that the Media Overlay references (links or properties) are present, even if we can't decrypt the SMIL files yet
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L202
if epubVersion == "3.0" {
}
should be:
if isEpub3OrMore(book) {
}
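A sketch of what such a helper could check (hypothetical signature; the real helper would presumably take the parsed book): compare the major version number instead of one exact version string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isEpub3OrMore reports whether an EPUB version string ("2.0", "3.0",
// "3.2", ...) denotes EPUB 3 or later, by comparing the major version.
func isEpub3OrMore(version string) bool {
	major, err := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	return err == nil && major >= 3
}

func main() {
	fmt.Println(isEpub3OrMore("3.0"), isEpub3OrMore("3.2"), isEpub3OrMore("2.0"))
	// → true true false
}
```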
"Children's Literature"
Go streamer:
https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json
NodeJS streamer:
https://readium2.herokuapp.com/pub/L2FwcC9taXNjL2VwdWJzL2NoaWxkcmVucy1saXRlcmF0dXJlLmVwdWI%3D/manifest.json
Using JSON comparison tools (links below) to help pin-point discrepancies in the "webpub manifest" streamer output, I found the following errors (ignoring issues related to the "base URL" used to explicitly resolve absolute links, and to the lack of consistent normalization for union-type values such as string vs. array-of-strings, e.g. @context and rel):
- entries are skipped when there is no href even when there is a valid title, resulting in missing data in the generated toc JSON: "Abram S. Isaacs", "Samuel Taylor Coleridge", "Hans Christian Andersen", "Frances Browne", "Oscar Wilde", "Raymond MacDonald Alden", "Jean Ingelow", "Frank R. Stockton", "John Ruskin".
- title text is not normalized for whitespace: \n\t\t\t\t\t\t\t\t\t\t\t\tI. The Rabbi and the Diadem\n\t\t\t\t\t\t\t\t\t\t\t vs. I. The Rabbi and the Diadem.
JSON comparison tools:
In order to improve performance, a number of optimizations can also be handled at the HTTP level:
- a Cache-Control header and a long expiration date
header and a long expiration dateJust a heads-up: although your test server does not emit HTTP CORS headers (which would make sense, as a reading system app would most likely be hosted on a different domain, distinct from the content server's origin), you can use a proxy such as https://crossorigin.me , for example:
https://proto.myopds.com/manifest/mobydick.epub/manifest.json
content-type → text/plain; charset=utf-8
date → Wed, 12 Oct 2016 14:37:50 GMT
server → Caddy
status → 200
vary → Origin
vs.
https://crossorigin.me/https://proto.myopds.com/manifest/mobydick.epub/manifest.json
access-control-allow-credentials → false
access-control-allow-headers → Content-Type, X-Requested-With
access-control-allow-origin → *
cf-ray → 2f0b508a2d2d360e-LHR
content-encoding → gzip
content-type → text/plain; charset=utf-8
date → Wed, 12 Oct 2016 14:41:47 GMT
expires → Thu, 13 Oct 2016 14:41:46 GMT
server → cloudflare-nginx
status → 200
In order to support LCP, in addition to handling decryption, the streamer also needs to add two new interactions:
For the LCP license, the fetcher will serve directly META-INF/license.lcpl
through a link at the publication level:
{
"href": "license.lcpl",
"rel": "license",
"type": "application/vnd.readium.lcp.license-1.0+json"
}
The LCP license handler will have the following interactions possible:
The following link is added to the publication's manifest:
{
"href": "license-handler.json",
"rel": "http://readium.org/lcp/handler",
"type": "application/json"
}
The two interactions possible respond with the following documents:
GET license-handler.json
{
"identifier": "62b2dfcb-48f0-4e1b-b2b0-8e3444960f13",
"profile": "http://readium.org/lcp/basic-profile",
"key": {
"ready": false,
"check": "jJEjUDipHK3OjGt6kFq7dcOLZuicQFUYwQ+TYkAIWKm6Xv6kpHFhF7LOkUK/Owww"
},
"hint": {
"text": "Enter your library card PIN",
"url": "http://www.example.com/passphraseHint?user_id=1234"
},
"support": {
"mail": "[email protected]",
"url": "http://www.example.com/support",
"tel": "1800836482"
}
}
POST license-handler.json
{
"key": {
"hash": "9728be1c6737759dcba331ebe78276d8c83999b02d410aa2662c763915229a79"
}
}
The POST request returns the full handler document, with the following HTTP status codes:
printPageNumbers has been gradually deprecated in favor of two different values:
- pageBreakMarkers, which is meant to indicate that the text of the publication contains page break markers using ARIA
- pageNavigation, which is meant to indicate that the publication contains a list of pages, usually based on HTML IDs
The current inference technique is based on the presence of pageList in the RWPM output, which can either come from:
Given the nature of what is inferred here, the toolkit should return pageNavigation instead of printPageNumbers.
package -> compression
size -> original-length
In EPUB, various spine items can be declared as non-linear.
As a concept, linearity is very vague and can be handled in a number of ways by reading systems, but most of them simply skip non-linear items.
To avoid this pitfall, the Web Publication Manifest and the in-memory model will not consider non-linear resources to be part of the spine.
This means that the parser should verify each <itemref>
in the spine:
<itemref idref="c1-answerkey" linear="no"/>
If an <itemref> includes a linear attribute set to "no", the resource should be added to resources. Otherwise, the resource should be added to spine.
cc @danielweck
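The rule above can be sketched as a small partitioning step; itemref and partitionSpine are hypothetical names, not the parser's actual types:

```go
package main

import "fmt"

// itemref is a minimal stand-in for a parsed OPF <itemref>.
type itemref struct {
	IDRef  string
	Linear string // "", "yes" or "no"; absent means "yes"
}

// partitionSpine applies the rule described above: linear="no" items go
// to resources, everything else stays in the spine.
func partitionSpine(refs []itemref) (spine, resources []string) {
	for _, ref := range refs {
		if ref.Linear == "no" {
			resources = append(resources, ref.IDRef)
		} else {
			spine = append(spine, ref.IDRef)
		}
	}
	return spine, resources
}

func main() {
	refs := []itemref{{IDRef: "c1"}, {IDRef: "c1-answerkey", Linear: "no"}}
	spine, resources := partitionSpine(refs)
	fmt.Println(spine, resources)
	// → [c1] [c1-answerkey]
}
```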
The current prototype has limited support for metadata:
This should be extended to support all the metadata currently available in models, both in EPUB 2.x and EPUB 3.x.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L214
contributor.Name.MultiString = make(map[string]string)
contributor.Name.MultiString[publication.Metadata.Language[0]] = cont.Data
for _, m := range metaAlt {
contributor.Name.MultiString[m.Lang] = m.Data
}
...should lower-case the language code, to be consistent with:
https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L280
publication.Metadata.Title.MultiString = make(map[string]string)
publication.Metadata.Title.MultiString[strings.ToLower(mainTitle.Lang)] = mainTitle.Data
for _, m := range metaAlt {
publication.Metadata.Title.MultiString[strings.ToLower(m.Lang)] = m.Data
}
In EPUB 3.0.1, dc:type
is tied to an EPUB controlled vocabulary that we should attempt to store in our in memory model and in the Web Publication Manifest.
Since prior to EPUB 3.0.1 other values could be used, it's probably best to filter this and only support controlled vocabularies.
Subjects prior to EPUB 3.1 are basically a list of tags that may or may not be concatenated in a single field.
The parser should not attempt to separate concatenated subjects.
In EPUB 3.1, subjects behave much more like Web Publications and the format should be very close to what we have in memory.
https://github.com/readium/r2-streamer-go/blob/master/parser/epub/opf.go#L54
Dir string `xml:"dir"`
should be:
Dir string `xml:"dir,attr"`
Both the NCX and the Navigation Document are currently ignored. This should be modified to extract:
- toc and/or page-list for the NCX
- toc, page-list, landmarks, loi, loa, lov and lot for the Navigation Document
In addition to these two documents, we might also treat the EPUB 2.x guide element as the equivalent of landmarks.
The Navigation Document should always take precedence over the NCX when both of them are available.
The Navigation Document itself should also be marked as such in the spine/resources collection of a publication (in our in-memory model), using rel="contents".