
domenic / worm-scraper


Scrapes the web serial Worm and its sequel Ward into an eBook format

License: Other

JavaScript 98.11% HTML 1.89%
worm epub ward ebook-downloader ebook

worm-scraper's Introduction

Worm Scraper

Scrapes the web serial Worm and its sequel Ward into an eBook format.

How to use

First you'll need a modern version of Node.js. At least v16.13.2 is necessary.

Then, open a terminal (Mac documentation, Windows documentation) and install the program by typing

npm install -g worm-scraper

This will take a while as it downloads this program and its dependencies from the internet. Once it's done, try to run it, by typing:

worm-scraper --help

If this outputs some help documentation, then the installation process went smoothly. You can move on to assemble the eBook by typing

worm-scraper

This will take a while, but will eventually produce a Worm.epub file!

If you'd like to get Ward instead of Worm, use --book=ward, e.g.

worm-scraper --book=ward

EPUB vs. other formats

EPUB is one of the primary eBook formats, but not every reader supports it; most Amazon Kindle devices, for example, do not. You can use an online converter or other tool to convert EPUB to Kindle MOBI, or any other format.

Alternately, if you are a developer, a pull request adding support for MOBI output would be appreciated; please open an issue to discuss how you plan to proceed.

Text fixups

This project makes a lot of fixups to the original text, mostly around typos, punctuation, capitalization, and consistency. You can get a more specific idea of what these are via the code; there's convert-worker.js, where some things are handled generally, and substitutions.json, for one-off fixes.
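
As a rough illustration of the one-off fixes mentioned above, here is a minimal sketch of applying a table of substitutions to a chapter's text. The data shape (a chapter identifier mapped to `before`/`after` pairs) and the `applySubstitutions` helper are assumptions made for this example; they are not the actual format of substitutions.json or the code in convert-worker.js. The sample fix is the "woul" typo reported for Extermination 8.4.

```javascript
// Hypothetical substitution table: chapter identifier -> list of one-off fixes.
// This shape is an assumption for illustration, not the real substitutions.json.
const substitutions = {
  "Extermination 8.4": [
    { before: "it woul be anyone's guess", after: "it would be anyone's guess" }
  ]
};

// Applies every fix registered for a chapter. Throws when a `before` string is
// missing, so that stale fixes are noticed instead of being silently skipped.
function applySubstitutions(chapterId, text, table = substitutions) {
  let result = text;
  for (const { before, after } of table[chapterId] ?? []) {
    if (!result.includes(before)) {
      throw new Error(`Substitution not found in ${chapterId}: ${before}`);
    }
    // A function replacer avoids "$" in `after` being treated as a pattern.
    result = result.replace(before, () => after);
  }
  return result;
}

console.log(applySubstitutions("Extermination 8.4", "Then it woul be anyone's guess."));
// prints "Then it would be anyone's guess."
```

Failing loudly on a missing `before` string is a design choice of this sketch: if the original text is corrected upstream, the obsolete fix surfaces immediately.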

This process is designed to be extensible, so if you notice any problems with the original text that you think should be fixed, file an issue to let me know, and we can update the fixup code so that the resulting eBook is improved. (Or better yet, send a pull request!)

worm-scraper's People

Contributors

atbenedict, crackedp0t, domenic, eyoung8, lenwhite, s-arambillete, tgnyc


worm-scraper's Issues

Ward build hanging at chapter 16-10

Hello—thank you for making this tool.

After finishing Worm a while ago, I recently wanted to pick up Ward. Unfortunately, my book build is failing at chapter 16-10 with the message posted below.

I installed using the current Node.js, 15.6.0, in case that is relevant.

Downloading https://www.parahumans.net/2019/09/15/from-within-16-10/... TypeError: Cannot read property 'textContent' of null
at getChapterTitle (C:\Users\aes\AppData\Roaming\npm\node_modules\worm-scraper\lib\download.js:93:55)
at downloadAllChapters (C:\Users\aes\AppData\Roaming\npm\node_modules\worm-scraper\lib\download.js:49:26)
at processTicksAndRejections (node:internal/process/task_queues:94:5)
at async C:\Users\aes\AppData\Roaming\npm\node_modules\worm-scraper\lib\worm-scraper.js:111:7

'Not enough non-option arguments: got 0. need at least 1'

Sorry for the newbie question, but whenever I try to use one of the single letter option commands (-s, -c, -b, -o) I get the message 'Not enough non-option arguments: got 0. need at least 1.'

Could you tell me exactly what I need to type to use these commands? Thanks.

Cannot find module 'mz/fs'

I ran "worm-scraper --help" and got this error:

module.js:472
throw err;
^

Error: Cannot find module 'mz/fs'
at Function.Module._resolveFilename (module.js:470:15)
at Function.Module._load (module.js:418:25)
at Module.require (module.js:498:17)
at require (internal/module.js:20:19)
at Object.<anonymous> (C:\Program Files\Node\node_modules\worm-scraper\lib\download.js:3:12)
at Module._compile (module.js:571:32)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:488:32)
at tryModuleLoad (module.js:447:12)
at Function.Module._load (module.js:439:3)

Twig

In attempting to use worm-scraper with Twig, I encountered the following issue:

When improperly providing the --start-url parameter, it began to download Worm by default. After that, it would always download Worm. I thought that I was continuing to improperly pass the start url.

Clearing the cache folder resolved the problem.

This appears to be related to starting from the latest position in the existing manifest.

Staging folder not created

I ran -o Downloads, as I downloaded worm-scraper in users/myusername. The program then said, after downloading all files, Error: ENOENT: no such file or directory, rmdir 'C:\Users\myusername\staging\worm\OEBPS\chapters'. I ran the program again, the last interlude was downloaded, and the same error occurred. I then ran npm -g worm-scraper, and the issue persisted. I then created the directory, and everything worked fine. I would suggest using fs to see if the file exists, and if it does not, create it.
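
The suggestion above (check for the directory and create it if it is missing) could look roughly like the following sketch. The `ensureDirectory` helper and the temporary demo location are hypothetical; this is not taken from worm-scraper's code. It relies on Node's `fs.mkdirSync` with `recursive: true`, which creates missing parent directories and does not throw when the directory already exists, so no separate existence check is needed.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Creates the directory (and any missing parents) if it does not exist yet.
// With `recursive: true`, mkdirSync is a no-op when the directory is already there.
function ensureDirectory(dirPath) {
  fs.mkdirSync(dirPath, { recursive: true });
  return dirPath;
}

// Demo in a throwaway location; the nested layout mirrors the path in the error message.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "staging-demo-"));
const chaptersPath = ensureDirectory(
  path.join(demoRoot, "staging", "worm", "OEBPS", "chapters")
);
console.log(fs.existsSync(chaptersPath)); // prints true
```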

Error: ENOENT: no such file or directory, rmdir

I tried running worm-scraper download convert scaffold zip but after downloading the last chapter, it throws the error Error: ENOENT: no such file or directory, rmdir '/Users/vigneshwar/Desktop/staging/worm/OEBPS/chapters'

Issue with running convert.js when no pre-existing "staging" path

When running worm-scraper convert... , if there is no pre-existing "staging" path (aka upon fresh run), it gives this error:
(node:6172) [DEP0147] DeprecationWarning: In future versions of Node.js, fs.rmdir(path, { recursive: true }) will be removed. Use fs.rm(path, { recursive: true }) instead
In other words, fs.rmdir with the recursive: true option no longer tolerates a path that does not exist.

The fix is to simply change line 90 in worm-scraper.js to the following:
return fs.rm(chaptersPath, { force: true, recursive: true, maxRetries: 3 })
And that made it work for me.
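
To show why the `force: true` option matters here, the sketch below clears a chapters directory with `fs.rm` and then recreates it, and it works whether or not the directory existed beforehand. The `resetChaptersDirectory` helper, the `demo` function, and the temporary paths are assumptions made for illustration; only the `fs.rm` call and its options come from the fix quoted above.

```javascript
const fsp = require("fs/promises");
const os = require("os");
const path = require("path");

// Removes the chapters directory if present, then recreates it empty.
// `force: true` makes fs.rm ignore a path that does not exist.
async function resetChaptersDirectory(chaptersPath) {
  await fsp.rm(chaptersPath, { force: true, recursive: true, maxRetries: 3 });
  await fsp.mkdir(chaptersPath, { recursive: true });
  return chaptersPath;
}

// Hypothetical demo: the first call simulates a fresh run, the second a rerun.
async function demo() {
  const root = await fsp.mkdtemp(path.join(os.tmpdir(), "worm-demo-"));
  const target = path.join(root, "staging", "worm", "OEBPS", "chapters");
  await resetChaptersDirectory(target); // nothing to remove yet, no ENOENT
  await fsp.writeFile(path.join(target, "chapter001.xhtml"), "<p>old</p>");
  await resetChaptersDirectory(target); // old output is cleared
  return fsp.readdir(target);
}

demo().then((entries) => console.log(entries.length)); // prints 0
```

Note that `fs.rm` requires Node.js 14.14.0 or later, which is below the minimum version stated in the README.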

Invalid Syntax

As the title says. I installed with v7.4.0 of Node.js (not the LTS version) on Windows 10 x64 and ran in an elevated command prompt.

Running worm-scraper --help prompts a syntax error (Code: 800A03EA).

The same occurs with worm-scraper download convert scaffold zip

Note that this issue persisted after a reinstallation of worm-scraper. I also tried cding to the directory where worm-scraper was installed, to no avail.

Editing the script to scrape other EPUBs

Hi, I encountered an issue when I tried to edit the scripts.

TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of type 'Node'. I know this comes from the converting portion, but I am not sure why it failed, as it worked for another EPUB.

Suggest making this less visible to respect the author's wishes?

This repo is the number one result for "worm ebook". The author has explicitly said he doesn't mind people making their own ebooks but that publicising such methods damages his ability to pursue traditional ebook publishing routes.

Not saying this should be taken down (it's super useful!) but can we at least make it less conspicuous? This guy spends 50+ hours per week writing and has a following of millions, I just wish there was a way he could profit more from his awesome work.

Option to download Teaser (Glow-worm)

Hi, first let me thank you for this great project!

It would be nice to have the option to download the teaser for "Ward" as well.
If you're interested, it could be implemented like this, using worm-scraper with other parameters:

  • Just download it as a standalone book, like the other two (Option example: --book=glow-worm-teaser)
  • Add it as chapter to "Ward" at the beginning (Option examples: --book=ward-include-teaser or --book=ward --include-teaser)

Let me know what you think about it :)

path error with -book=ward (but worm worked)

~/Downloads ⌚ 10:19:31
$ worm-scraper -book=ward
Downloading https://parahumans.wordpress.com/2013/11/19/interlude-end/... done
Converting raw downloaded HTML to EPUB chapters
All chapters converted in 23.4 seconds
EPUB contents assembled into /usr/lib/node_modules/worm-scraper/scaffolding
TypeError [ERR_INVALID_ARG_TYPE]: The "path" argument must be of type string. Received an instance of Array
    at validateString (internal/validators.js:124:11)
    at Object.resolve (path.js:980:7)
    at /usr/lib/node_modules/worm-scraper/lib/worm-scraper.js:105:58
    at /usr/lib/node_modules/worm-scraper/lib/worm-scraper.js:111:13

Typo: "woul" -> "would"

In Worm Extermination 8.4, "it woul be anyone's guess" should be "it would be anyone's guess" (emph mine)

Adapt to other books

Not really an issue, since this does exactly what it says on the tin, but how would I go about making this work for the author's other books, like the Worm sequel?

Generated epub not compatible with Play Books

The generated epub is not compatible with Google Play Books.

An online epub validator (https://www.ebookit.com/tools/bp/Bo/eBookIt/epub-validator) points out possible errors. There are several on chapter 79, some on 211, 249 and 275.

Also it seems to have some problem on the cover and the img tag.

The only two fatal errors are on chapter 79, probably due to some tag that was not closed properly:

FATAL(RSC-016): ./books/Bo/databases/eBookIt/temp_uploads/1609380837.epub/OEBPS/chapters/chapter079.xhtml(210,6): Fatal Error while parsing file: The element type "p" must be terminated by the matching end-tag "</p>".

ERROR(RSC-005): ./books/Bo/databases/eBookIt/temp_uploads/1609380837.epub/OEBPS/chapters/chapter079.xhtml(-1,-1): Error while parsing file: The element type "p" must be terminated by the matching end-tag "</p>".

Capitalization of "wretch"

"Wretch" is inconsistently capitalized in Ward.

It's mostly un-capitalized through Shadow 5.8. (However, there are exceptions, such as Shadow 5.4.)

Starting in Shadow 5.9, it's mostly capitalized. However, there are many exceptions; eyeballing the search results, I would guess 30 or so.

There doesn't appear to be a pattern. For example, you could imagine that when used as a proper name, it's capitalized, and otherwise it's not. But both "the wretch" and "the Wretch" are often seen, even in later chapters. There is one instance of "Wretch" (no "the") in Pitch 6.6, but it appears to be the exception, and probably should be fixed.

I'm unsure whether this shift in capitalization represents a narratively-significant change in how Victoria thinks, or an author style update that wasn't back-applied to earlier chapters, or what.

A few options are:

  • Always un-capitalize. This generally fits better with English grammar, since it's not a proper name, and is often prefixed with "the".

  • Always capitalize. I.e., treat "the Wretch" as an entity whose proper name somehow includes a lowercase "the". This might best preserve authorial intent if we assume that "the Wretch" is the intended name throughout, and the author just never went back and corrected earlier instances.

  • Enforce capitalization after 5.8, fixing the ~30 un-capitalized instances. The idea here would be that the capitalization represents a narratively-significant shift in how Victoria thinks of the wretch/Wretch, and we assume that instances where it got left as lowercase later in the book were accidents. It seems weird to use capitalization this way (i.e., it seems like it's just going to cause the reader's eyes to stumble each time, instead of making them see the Wretch as more of a named entity), but it's possible this best preserves authorial intent.
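
For reference, the "always capitalize" option could be sketched as a single regular-expression fixup. The `capitalizeTheWretch` function below is hypothetical and not part of worm-scraper. It only touches "wretch" when it directly follows "the", which is an assumption meant to leave ordinary uses of the word (and words such as "wretched") alone. Restricting it to chapters after 5.8, as in the third option, would need chapter information this sketch does not model.

```javascript
// Capitalizes "wretch" only when it directly follows "the" or "The".
// The word boundaries keep "wretched" and words ending in "the" untouched.
function capitalizeTheWretch(text) {
  return text.replace(/\b([Tt]he) wretch\b/g, "$1 Wretch");
}

console.log(capitalizeTheWretch("I felt the wretch stir."));
// prints "I felt the Wretch stir."
```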
