setanta / ebookmaker


45.0 45.0 14.0 441 KB

Python script that builds epub files from common HTML and a JSON book description.

Python 84.47% CSS 1.58% Shell 0.26% HTML 13.69%

ebookmaker's People

shlomif genericmoniker hplgit katkamrachana arissupriy fpcmotif rockonedege vijo patrick-hogan doconce r3t0x

ebookmaker's Issues

There should be an explicit LICENSE file

Hi!

This project should be explicitly licensed, preferably under a FLOSS licence; see https://www.mail-archive.com/[email protected]/msg04541.html .

Update: see my comment below about rebookmaker.

Wild cards are sorted as strings

This issue affects users with more than 9 text files.

# Expand wild cards.
if item['type'] == 'text' and '*' in item['source']:
    files = sorted(glob(item['source']))

The plain sorted function sorts the filenames matched by the wildcard as strings, so when the filenames are 'file1, file2, ..., file10' the sorted list ends up as 'file1, file10, file2, ..., file9'.
To fix this I used the natsort package, but I leave it up to the author to choose their preferred method for naturally sorting filenames.

from natsort import natsorted

# Expand wild cards.
if item['type'] == 'text' and '*' in item['source']:
    files = natsorted(glob(item['source']))
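If adding a dependency is undesirable, a similar ordering can probably be had from the standard library alone. The following is only a minimal sketch, not the project's code: `natural_key` is a hypothetical helper that splits each name into digit and non-digit runs so that the numeric parts compare as integers.

```python
import re

def natural_key(name):
    """Build a sort key in which runs of digits compare as integers."""
    # re.split with a capturing group keeps the digit runs in the result,
    # so text chunks sit at even indices and digit chunks at odd indices.
    parts = re.split(r'(\d+)', name)
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]

names = ['file10.html', 'file2.html', 'file1.html', 'file9.html']
ordered = sorted(names, key=natural_key)
print(ordered)  # ['file1.html', 'file2.html', 'file9.html', 'file10.html']
```

With such a helper, the wildcard expansion would become `files = sorted(glob(item['source']), key=natural_key)`.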

iliada example depends on nonexistent blog

I think I'd like to use ebookmaker for something like the iliada recipe, except I'm not sure what it actually does because iliadaemportugues.blogspot.com.br/2009/01/ is now 404. I'm having trouble figuring out what the JSON file needs to look like, and I was hoping that recipe might help me figure it out.

Are there any other recipes that still work that might be substituted? ebookmaker looks pretty great, if only I knew what to put in the JSON.

ebookmaker parses the HTML using regular expressions

Hi,

ebookmaker parses the HTML using regular expressions, which is error-prone and unreliable.

An example of this is:

            contents = f.read()
            regex = re.compile('<h(\d+)(?:\s+id=\"(.*)\")?>(.*)</h\d>',
                               re.IGNORECASE|re.MULTILINE|re.UNICODE)
            results = regex.findall(contents)

Can you use a parser such as BeautifulSoup instead?

Regards,

-- Shlomi Fish
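To illustrate the parser-based approach, here is a minimal sketch that uses Python's standard-library `html.parser` rather than BeautifulSoup (a swapped-in choice, to keep the example dependency-free). The class name `HeadingCollector` and the `(level, id, text)` tuple layout are assumptions for illustration, chosen to mirror the three groups captured by the regular expression above.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every h1..h6 element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lower-cased.
        if len(tag) == 2 and tag[0] == 'h' and tag[1] in '123456':
            self._current = [int(tag[1]), dict(attrs).get('id'), []]

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) also arrives here.
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == 'h%d' % self._current[0]:
            level, ident, parts = self._current
            self.headings.append((level, ident, ''.join(parts).strip()))
            self._current = None

collector = HeadingCollector()
collector.feed('<H1 class="t" id="intro">Intro</H1><p>x</p>'
               '<h2>Part <em>one</em></h2>')
collector.close()
print(collector.headings)  # [(1, 'intro', 'Intro'), (2, None, 'Part one')]
```

The first heading in the sample carries an extra attribute before `id`, a case that the quoted regular expression would not match.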

collectImagesFromEBookContents only checks for lower case html tags

Since both lower and upper case characters are accepted in HTML tags, I think a more general solution such as

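    def collectImagesFromEBookContents(self, htmlFile):
        with open(htmlFile, encoding='utf-8', mode='r') as f:
            # Name the parser explicitly; html.parser lower-cases tag and attribute names.
            soup = BeautifulSoup(f.read(), 'html.parser')
            # Read the attribute with img['src']; img.src would look up a child <src> tag.
            return [img['src'] for img in soup.body.find_all('img') if img.has_attr('src')]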

would be better. In my case, I hardcoded the upper-case tags to get my ebook created, but I think this can be easily fixed ;-)
Kudos for all your work!
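For comparison, a dependency-free variant is sketched below using the standard-library `html.parser`, which reports tag and attribute names in lower case regardless of how they were written. This is only an illustration: `ImageCollector` is a hypothetical class name and the sample markup is made up.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src attribute of every img element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased, so <IMG SRC=...>
        # and <img src=...> are handled by the same branch.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

collector = ImageCollector()
collector.feed('<BODY><IMG SRC="Cover.PNG"><img src="fig1.png" alt="">'
               '<img alt="no source"></BODY>')
collector.close()
print(collector.sources)  # ['Cover.PNG', 'fig1.png']
```

Attribute values keep their original case, so file names such as `Cover.PNG` are returned unchanged.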
