
A universal tool written in Kotlin that converts an EPUB publication into a data model for later use by a reader. It also provides validation and a system that reports inconsistencies in the format. The project was made by Miquido. https://www.miquido.com/

License: Apache License 2.0



parsepub


Overview

parsepub is a universal tool written in Kotlin that converts an EPUB publication into a data model for later use by a reader. It also provides validation and a system that reports inconsistencies in the format.


Features

  • converting the publication to a model containing all resources and necessary information
  • providing EPUB format support in versions 2.0 and 3.0 for all major tags
  • handling inconsistency errors or lack of necessary elements in the publication structure
  • support for displaying information when element structure attributes are missing

Restrictions

For the program to work properly, the EPUB file must conform to the format requirements.
Spec for EPUB 3.0
Spec for EPUB 2.1


Base model - description

The EpubBook class contains all information from an uncompressed EPUB publication. Each of the parameters corresponds to a set of information parsed from the elements of the publication structure.

data class EpubBook (
    val epubOpfFilePath: String? = null,
    val epubTocFilePath: String? = null,
    val epubCoverImage: EpubResourceModel? = null,
    val epubMetadataModel: EpubMetadataModel? = null,
    val epubManifestModel: EpubManifestModel? = null,
    val epubSpineModel: EpubSpineModel? = null,
    val epubTableOfContentsModel: EpubTableOfContentsModel? = null
)

epubOpfFilePath - Contains the absolute path to the .opf file.
epubTocFilePath - Contains the absolute path to the table of contents (TOC) file.
epubCoverImage - Contains all information about the publication cover image.
epubMetadataModel - Contains all basic information about the publication.
epubManifestModel - Contains all publication resources.
epubSpineModel - Contains the list of references in reading order.
epubTableOfContentsModel - Contains the table of contents of the publication.
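
Because every property of EpubBook is nullable, it can be useful to check which parts were actually parsed before using them. The sketch below shows one possible way to do that. The model classes are empty stand-ins declared only so the example compiles on its own (the real ones come from the library), and missingParts is a hypothetical helper, not part of parsepub.

```kotlin
// Empty stand-ins so the sketch compiles without the library on the classpath.
// The real model classes are provided by parsepub and carry actual data.
class EpubResourceModel
class EpubMetadataModel
class EpubManifestModel
class EpubSpineModel
class EpubTableOfContentsModel

// Same shape as the EpubBook class shown above.
data class EpubBook(
    val epubOpfFilePath: String? = null,
    val epubTocFilePath: String? = null,
    val epubCoverImage: EpubResourceModel? = null,
    val epubMetadataModel: EpubMetadataModel? = null,
    val epubManifestModel: EpubManifestModel? = null,
    val epubSpineModel: EpubSpineModel? = null,
    val epubTableOfContentsModel: EpubTableOfContentsModel? = null
)

// Hypothetical helper: returns the names of the properties that are still null,
// in declaration order.
fun missingParts(book: EpubBook): List<String> =
    listOf(
        "epubOpfFilePath" to book.epubOpfFilePath,
        "epubTocFilePath" to book.epubTocFilePath,
        "epubCoverImage" to book.epubCoverImage,
        "epubMetadataModel" to book.epubMetadataModel,
        "epubManifestModel" to book.epubManifestModel,
        "epubSpineModel" to book.epubSpineModel,
        "epubTableOfContentsModel" to book.epubTableOfContentsModel
    ).filter { (_, value) -> value == null }
        .map { (name, _) -> name }
```

Calling missingParts(EpubBook()) returns all seven property names, since every property defaults to null.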

More information about the elements of the publication can be found in the
"Information about epub format for non-developers" section.

Quick start

To convert the selected EPUB publication, create an instance of the EpubParser class:

val epubParser = EpubParser()

Next, call the parse method on it:

epubParser.parse(inputPath, decompressPath)

This method returns an EpubBook object and takes two parameters:
inputPath - the path to the EPUB file
decompressPath - the path to the directory where the file should be unpacked

Error handling in the structure of the publication

The structure of the converted file may be incorrect for one main reason: required elements of the publication, such as Metadata, Manifest, Spine, or Table of Contents, are missing.

Solution - ValidationListeners
To limit the unexpected effects of an incorrect structure, we can implement listeners that alert us when the format is wrong.
On the previously created EpubParser instance, we call the setValidationListeners method and implement our listeners in its body.
Each listener is assigned to a specific element.

epubParser.setValidationListeners {
   setOnMetadataMissing { Log.e(ERROR_TAG, "Metadata missing") }
   setOnManifestMissing { Log.e(ERROR_TAG, "Manifest missing") }
   setOnSpineMissing { Log.e(ERROR_TAG, "Spine missing") }
   setOnTableOfContentsMissing { Log.e(ERROR_TAG, "Table of contents missing") }
} 
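
The Log.e calls above look like the Android logger; the same listeners could just as well collect messages in a list, for example for use in tests or on the plain JVM. The sketch below illustrates that idea with a small stand-in class. ValidationListeners, notifyMissing, and collectProblems here are hypothetical and written only for this example; they reuse the setter names from the snippet above but are not parsepub's implementation.

```kotlin
// Hypothetical stand-in that reuses the listener names shown above.
// It is NOT parsepub's implementation; it only illustrates the pattern.
class ValidationListeners {
    private var onMetadataMissing: (() -> Unit)? = null
    private var onManifestMissing: (() -> Unit)? = null
    private var onSpineMissing: (() -> Unit)? = null
    private var onTableOfContentsMissing: (() -> Unit)? = null

    fun setOnMetadataMissing(listener: () -> Unit) { onMetadataMissing = listener }
    fun setOnManifestMissing(listener: () -> Unit) { onManifestMissing = listener }
    fun setOnSpineMissing(listener: () -> Unit) { onSpineMissing = listener }
    fun setOnTableOfContentsMissing(listener: () -> Unit) { onTableOfContentsMissing = listener }

    // Invoked by the (stand-in) parser when an element is not found.
    fun notifyMissing(element: String) {
        when (element) {
            "metadata" -> onMetadataMissing?.invoke()
            "manifest" -> onManifestMissing?.invoke()
            "spine" -> onSpineMissing?.invoke()
            "toc" -> onTableOfContentsMissing?.invoke()
        }
    }
}

// Registers listeners that record a message instead of logging it,
// then simulates the parser reporting the given missing elements.
fun collectProblems(missingElements: List<String>): List<String> {
    val problems = mutableListOf<String>()
    val listeners = ValidationListeners().apply {
        setOnMetadataMissing { problems.add("Metadata missing") }
        setOnManifestMissing { problems.add("Manifest missing") }
        setOnSpineMissing { problems.add("Spine missing") }
        setOnTableOfContentsMissing { problems.add("Table of contents missing") }
    }
    missingElements.forEach { listeners.notifyMissing(it) }
    return problems
}
```

With the real library, the lambdas passed to setValidationListeners could append to a list in the same way instead of calling Log.e.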

Displaying information about missing attributes

Our parsing method can also return unexpected results when the set of attributes in a file structure element is incomplete,
e.g. a missing language attribute in the Metadata element.

Solution - onAttributeMissing
The mechanism we created addresses the problem described above and is part of ValidationListeners.
When a required attribute is incorrect or missing, the listener reports the name of the attribute and the name of its parent element.
It receives two parameters:
parentElement - the name of the main element in which the error occurs
attributeName - the name of the missing attribute

setOnAttributeMissing { parentElement, attributeName ->
    Log.e("$parentElement warn", "missing $attributeName attribute")
}
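
To make the two callback parameters concrete, here is a minimal sketch of how such a check could work. reportMissingAttributes is a hypothetical function written for this example and is not taken from parsepub; it assumes the attributes of an element have already been read into a map.

```kotlin
// Hypothetical check, not parsepub code: calls onAttributeMissing once for
// every required attribute that is absent or blank in the given element.
fun reportMissingAttributes(
    parentElement: String,
    attributes: Map<String, String>,
    required: List<String>,
    onAttributeMissing: (parentElement: String, attributeName: String) -> Unit
) {
    for (name in required) {
        if (attributes[name].isNullOrBlank()) {
            onAttributeMissing(parentElement, name)
        }
    }
}
```

For a Metadata element that has a title and an identifier but no language, the callback would be invoked once, with "Metadata" as parentElement and "language" as attributeName.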

Information about epub format for non-developers

EPUB is an e-book file format that uses the ".epub" file extension. Its structure is based on a few main elements: Metadata, Manifest, Spine, and Table of Contents.

Metadata - contains all metadata information for a specific EPUB file. Three metadata attributes are required (many more are available):
title - contains the title of the book
language - contains the language of the book
identifier - contains the unique identifier of the book

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title id="title">Title of the book</dc:title>
   <dc:language>en</dc:language>
   <dc:identifier id="pub-id">id-identifier</dc:identifier>
</metadata>
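
As an illustration of the three required entries, the sketch below checks a metadata fragment for them. It uses the DOM parser that ships with the JDK rather than parsepub's own parser, and missingRequiredMetadata is a hypothetical name chosen for this example.

```kotlin
import javax.xml.parsers.DocumentBuilderFactory

// Dublin Core namespace, as declared on the metadata element above.
const val DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

// Returns the names of the required Dublin Core entries that are absent
// or have empty text in the given metadata XML fragment.
fun missingRequiredMetadata(metadataXml: String): List<String> {
    val factory = DocumentBuilderFactory.newInstance().apply { isNamespaceAware = true }
    val document = factory.newDocumentBuilder().parse(metadataXml.byteInputStream())
    return listOf("title", "language", "identifier").filter { name ->
        val nodes = document.getElementsByTagNameNS(DC_NAMESPACE, name)
        nodes.length == 0 || nodes.item(0).textContent.isBlank()
    }
}
```

For a well-formed metadata element that contains all three entries, the function returns an empty list.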

Manifest - lists all the files of the publication. Each file is represented by an item element with the following required attributes:
id - id of the resource
href - location of the resource
media-type - type and format of the resource

Spine - lists all the XHTML content documents in their linear reading order.
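
In the EPUB format, each spine entry points to a manifest item through its idref attribute, so the reading order of files is obtained by looking up each idref among the manifest ids. The sketch below shows that lookup. ManifestItem and readingOrder are simplified, hypothetical names for this example and are not parsepub's model classes.

```kotlin
// Simplified stand-in for illustration; not parsepub's model class.
data class ManifestItem(val id: String, val href: String, val mediaType: String)

// Resolves spine idrefs to manifest hrefs, keeping the spine order.
// An idref with no matching manifest item is skipped.
fun readingOrder(manifest: List<ManifestItem>, spineIdRefs: List<String>): List<String> {
    val itemsById = manifest.associateBy { it.id }
    return spineIdRefs.mapNotNull { idRef -> itemsById[idRef]?.href }
}
```

The order of the manifest does not matter here; only the order of the spine decides the result.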

Table of contents - contains the hierarchical table of contents for the EPUB file.
A description of the full TOC specification can be found here:
TOC spec for EPUB 2.0
TOC spec for EPUB 3.0
