Node library for scraping manga sites

License: MIT License

poketo-node

Node library for scraping manga sites.

Provides a consistent API for scraping metadata and chapter images from 16+ manga sites. Makes it easy to build manga readers, downloaders, archival tools, and more.

For working examples, check out the Poketo manga reader, or the Poketo CLI!

People should be able to read content on the web in the way that works for them. Manga sites are often a special brand of bad. Each manga page loads a new web page, ads everywhere, yuck! Poketo opens up access to that content to make better tools on top.

🚧 This project is still v0.x.x and the API may change as more sites are added.

Install

npm install poketo --save

You can also use api.poketo.app, a hosted micro-service for this library.

Usage

import poketo from 'poketo';

poketo.getSeries('http://merakiscans.com/senryu-girl/').then(series => {
  console.log(series);
  //=> { id: 'meraki-scans:senryu-girl', title: 'Senryu Girl', chapters: [...], ... }
});

poketo.getChapter('http://merakiscans.com/senryu-girl/5/').then(chapter => {
  console.log(chapter);
  //=> { id: 'meraki-scans:senryu-girl:5', pages: [...], ... }
});

Full documentation of the API can be found below.

Supported Sites

Site	Series Info	Chapter Images
Helvetica Scans	✓	✓
Hot Chocolate Scans	✓	✓
Jaimini’s Box	✓	✓
Kirei Cake	✓	✓
MangaHere	✓	✓ (slow)
MangaUpdates	✓
Mangadex	✓	✓
MangaFox	✓	✓
Mangakakalot	✓	✓
Manganelo	✓	✓
MangaRock	✓	✓
MangaStream	✓	✓
Meraki Scans	✓	✓
Phoenix Serenade	✓	✓
Sense Scans	✓	✓
Sen Manga	✓	✓
Silent Sky Scans	✓	✓
Tuki Scans	✓	✓

If there's a site or group you'd like to see supported, make an issue!

API

Poketo exposes four methods:

poketo.getSeries(idOrUrl: string): Promise<Series>
poketo.getChapter(idOrUrl: string): Promise<Chapter>
poketo.getType(input: string): 'series' | 'chapter'
poketo.constructUrl(id: string): string

To understand what is returned for a Series or Chapter, check out the examples below.

Get series information

Use poketo.getSeries to get the ID, title, cover image, and chapter listing for a manga series.

poketo.getSeries('http://merakiscans.com/senryu-girl').then(series => {
  console.log(series);
});

// {
//   id: 'meraki-scans:senryu-girl',
//   slug: 'senryu-girl',
//   url: 'http://merakiscans.com/senryu-girl',
//   title: 'Senryu Girl',
//   coverImageUrl: 'http://merakiscans.com/.../senryu_200x0.jpg',
//   chapters: [
//     {
//       id: 'meraki-scans:senryu-girl:1',
//       slug: '1',
//       chapterNumber: '1',
//       volumeNumber: '1',
//       title: '5-7-5 Girl',
//       createdAt: 1522811950
//     },
//     ...
//   ],
//   updatedAt: 1522811950,
// }
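
As a small illustration of working with the returned object, the sketch below picks the most recently added chapter from a series. latestChapter is a hypothetical helper, not part of Poketo, and the sample data is made up to match the shape shown above.

```javascript
// Hypothetical helper (not part of Poketo): returns the chapter with the
// highest createdAt value, or null if the series has no chapters.
function latestChapter(series) {
  const chapters = series.chapters || [];
  if (chapters.length === 0) return null;
  return chapters.reduce((newest, chapter) =>
    chapter.createdAt > newest.createdAt ? chapter : newest
  );
}

// Sample data shaped like the getSeries output above (values are made up).
const sampleSeries = {
  id: 'meraki-scans:senryu-girl',
  chapters: [
    { id: 'meraki-scans:senryu-girl:1', chapterNumber: '1', createdAt: 1522811950 },
    { id: 'meraki-scans:senryu-girl:2', chapterNumber: '2', createdAt: 1523416750 },
  ],
};

console.log(latestChapter(sampleSeries).id);
//=> 'meraki-scans:senryu-girl:2'
```

Comparing createdAt values directly avoids relying on the order in which a site happens to list its chapters.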

Get pages for a chapter

Use poketo.getChapter to get a page listing for a chapter. This doesn't include any information about the chapter or series, just the page list.

Depending on the site, the page list will also include image dimensions.

poketo.getChapter('http://merakiscans.com/senryu-girl/5').then(chapter => {
  console.log(chapter);
});

// {
//  id: 'meraki-scans:senryu-girl:5',
//  slug: '5',
//  url: 'http://merakiscans.com/senryu-girl/5',
//  pages: [
//    { id: '01', url: 'http://merakiscans.com/image01.png', width: 800, height: 1200 },
//    { id: '02', url: 'http://merakiscans.com/image02.png', width: 800, height: 1200 },
//    ...
//  ]
// }
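
For a downloader, the page list usually needs to be turned into local filenames. The following is a minimal sketch of that step; pageFilenames is a hypothetical helper, and the .jpg fallback for URLs without an extension is an assumption rather than anything Poketo prescribes.

```javascript
// Hypothetical helper (not part of Poketo): derives a local filename for
// each page from its id and the file extension found in its URL.
function pageFilenames(chapter) {
  return chapter.pages.map(page => {
    // Look only at the last path segment so dots in directories are ignored.
    const file = new URL(page.url).pathname.split('/').pop();
    const dot = file.lastIndexOf('.');
    // Assumption: fall back to .jpg when the URL has no extension.
    const extension = dot === -1 ? '.jpg' : file.slice(dot);
    return `${page.id}${extension}`;
  });
}

// Sample data shaped like the getChapter output above.
const sampleChapter = {
  id: 'meraki-scans:senryu-girl:5',
  pages: [
    { id: '01', url: 'http://merakiscans.com/image01.png', width: 800, height: 1200 },
    { id: '02', url: 'http://merakiscans.com/image02.png' }, // dimensions may be absent
  ],
};

console.log(pageFilenames(sampleChapter));
//=> [ '01.png', '02.png' ]
```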

Validate a URL or ID

If you have an arbitrary input, you can see if Poketo recognizes it by calling the poketo.getType method. It will return 'series' or 'chapter' if it's supported, or will throw a Poketo error if not.

poketo.getType('http://merakiscans.com/senryu-girl');
//=> 'series'
poketo.getType('http://merakiscans.com/senryu-girl/5');
//=> 'chapter'
poketo.getType('http://merakiscans.com/i/am/a/banana/yo');
//=> throws a `poketo.InvalidUrlError`
poketo.getType('http://google.com');
//=> throws a `poketo.UnsupportedSiteError`
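
Because getType throws instead of returning a falsy value, validating user input tends to need a try/catch. The sketch below wraps that in a hypothetical checkInput helper; fakeClient is a stand-in for the poketo import so the example runs without the library.

```javascript
// Hypothetical wrapper: turns getType's throw-on-failure behaviour into a
// plain result object. `client` is anything exposing getType(input).
function checkInput(client, input) {
  try {
    return { supported: true, type: client.getType(input) };
  } catch (error) {
    return { supported: false, error };
  }
}

// Stand-in for poketo (an assumption for illustration only).
const fakeClient = {
  getType(input) {
    if (!input.includes('merakiscans.com')) throw new Error('unsupported site');
    return input.endsWith('/5') ? 'chapter' : 'series';
  },
};

console.log(checkInput(fakeClient, 'http://merakiscans.com/senryu-girl').type);
//=> 'series'
console.log(checkInput(fakeClient, 'http://google.com').supported);
//=> false
```

With the real library, passing poketo as the client should work the same way, and the caught error can be inspected to tell an invalid URL from an unsupported site.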

Get a URL from a Poketo ID

If you've stored a Poketo ID, you can get a URL back out by using the poketo.constructUrl method. You can learn more about the difference between IDs and URLs in the section below.

poketo.constructUrl('meraki-scans:senryu-girl:5');
//=> http://merakiscans.com/senryu-girl/5

poketo.constructUrl('manga-stream:haikyuu:314/5286');
//=> https://readms.net/r/haikyuu/314/5286/1

What's the difference between an ID and a URL?

Poketo scrapes information from many sites. To identify which site, series (aka. manga), and chapter you're talking about, Poketo lets you provide information in two ways: a Poketo ID or a URL.

These IDs below are equivalent to the URLs on their right:

ID                                URL
mangadex:13127:311433          →  https://mangadex.org/chapter/311433/1
meraki-scans:senryu-girl:5     →  http://merakiscans.com/senryu-girl/5
manga-stream:haikyuu:314/5286  →  https://readms.net/r/haikyuu/314/5286/1

For the getSeries and getChapter methods, you can provide either a URL, or an ID, like so:

// Both lines return the same series
poketo.getSeries('http://merakiscans.com/senryu-girl/');
poketo.getSeries('meraki-scans:senryu-girl');

// Both lines return the same chapter
poketo.getChapter('https://mangadex.org/chapter/311433/1');
poketo.getChapter('mangadex:13127:311433');
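
The examples above suggest that IDs are colon-separated: site, then series, then an optional chapter part. Assuming that shape holds, a stored ID can be split for display or grouping. parsePoketoId below is a hypothetical helper, not a Poketo method.

```javascript
// Hypothetical helper (not part of Poketo): splits an ID into its parts,
// assuming the 'site:series' or 'site:series:chapter' shape seen above.
function parsePoketoId(id) {
  const parts = id.split(':');
  const looksLikeUrl = id.includes('://');
  if (looksLikeUrl || parts.length < 2 || parts.length > 3 || parts.includes('')) {
    throw new TypeError(`Not a recognizable Poketo ID: ${id}`);
  }
  const [site, series, chapter = null] = parts;
  return { site, series, chapter, type: chapter === null ? 'series' : 'chapter' };
}

console.log(parsePoketoId('meraki-scans:senryu-girl'));
//=> { site: 'meraki-scans', series: 'senryu-girl', chapter: null, type: 'series' }
console.log(parsePoketoId('manga-stream:haikyuu:314/5286').chapter);
//=> '314/5286'
```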

Why use an ID?

Poketo IDs have stronger guarantees they won't change.

It's not uncommon for a site to change their domain name or URL structure. For example, MangaDex once changed URLs for manga series from https://mangadex.org/manga/1234 to https://mangadex.org/title/1234. If that happens, your URL might break. But by using an ID, Poketo will know to do the right thing. This makes IDs a more robust way to store information about a series.

Of course, there are no true guarantees with scraping. Even an ID that works one day might break the next — but it's a slightly better guarantee.

Error Handling

Scraping isn't perfect. When using Poketo you'll inevitably run into an error, so we try to make what happened as clear as possible.

RequestError - unable to make a request to scrape the site
TimeoutError - tried to make a request, but the source site didn't respond in a reasonable time. Defaults to 5 seconds.
HTTPError - tried to scrape the site, but the site returned an error (eg. 404, 500)
LicenseError - tried to scrape the site, but the series/chapter is currently blocked, licensed, or was subjected to a DMCA takedown.
UnsupportedSiteError - the site you're trying to scrape from isn't supported. If you'd like to see it supported, make an issue!
UnsupportedOperationError - some sites don't support reading chapters. This error is thrown if you call poketo.getChapter for these sites.
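
Some of these errors, such as TimeoutError, are worth retrying, while others are not. A minimal sketch of that distinction follows. The retry helper is hypothetical, and matching on error.name is an assumption; checking against error classes exposed by the library may be the more reliable test if they are available.

```javascript
// Hypothetical helper: runs an async function and retries it while
// shouldRetry(error) returns true, up to `attempts` tries in total.
async function retry(fn, shouldRetry, attempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (!shouldRetry(error)) break;
    }
  }
  throw lastError;
}

// Assumption: errors can be told apart by their `name` property.
const isTimeout = error => error.name === 'TimeoutError';

// Usage sketch (requires the poketo package and network access):
// retry(() => poketo.getSeries('meraki-scans:senryu-girl'), isTimeout)
//   .then(series => console.log(series.title))
//   .catch(error => console.error(error.name, error.message));
```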

Contributing

Contributions are welcome! Poketo is meant to be built on top of. Feel free to propose ideas or changes that would make it work for your situation — whether it's a bug report, site request, or contributed code. Read more at CONTRIBUTING.md

poketo's Issues

Revise API

This issue is a collection of larger breaking changes to the API that could improve the clarity/ergonomics of the library. These would be batched into the 1.0 release.

These are based on what I've found extending support to a bunch of different sites:

Separate getSeries and getChapterList. For long-running series on sites like Mangadex, fetching the chapter list can take several seconds as it pages through results. It's also unlikely that any of the series metadata has changed since the last fetch, so we can just rely on cached versions for that.
Rename getChapter to getChapterPages (or similar). We're really only fetching page URLs here, so the name should reflect that. Chapter metadata is fetched as a part of the series call.
Move dates to ISO 8601. I've found unix timestamps much easier to deal with, but ISO 8601 is undoubtedly the standard date format of the web. It'd be nice to not introduce a learning curve for users of the library. (Here's the essay that convinced me)
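
Until such a change lands, clients can convert between the two formats themselves. This is a small sketch, assuming the current timestamps are unix seconds (as the 10-digit values in the README examples suggest); unixToIso and isoToUnix are hypothetical helper names.

```javascript
// Hypothetical helpers, assuming timestamps are unix seconds.
function unixToIso(seconds) {
  // JavaScript Dates count milliseconds, so scale up first.
  return new Date(seconds * 1000).toISOString();
}

function isoToUnix(iso) {
  return Math.floor(Date.parse(iso) / 1000);
}

console.log(unixToIso(1522811950));
//=> '2018-04-04T03:19:10.000Z'
console.log(isoToUnix('2018-04-04T03:19:10.000Z'));
//=> 1522811950
```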

Add configurable timeout

Right now Poketo hard-codes timeouts to be 5s long (source). This makes sense for the poketo service, but might not make as much sense for someone running poketo locally in a downloader.

It'd be nice to expose the timeout as a configurable option, to support these different use-cases.

poketo.getChapter('meraki-scans:senryu-girl:5', { timeout: 30000 }).then(chapter => {
  // ...
});
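
On the library side, supporting this could be as simple as merging caller options over the defaults before making the request. The sketch below shows one way to do that. resolveOptions and DEFAULT_OPTIONS are hypothetical names, and this is not how Poketo currently works.

```javascript
// Hypothetical defaults; 5000ms mirrors the hard-coded timeout described above.
const DEFAULT_OPTIONS = Object.freeze({ timeout: 5000 });

// Hypothetical helper: merges caller options over the defaults and
// rejects timeouts that are not positive finite numbers.
function resolveOptions(options = {}) {
  const merged = { ...DEFAULT_OPTIONS, ...options };
  if (!Number.isFinite(merged.timeout) || merged.timeout <= 0) {
    throw new RangeError(`timeout must be a positive number of milliseconds, got ${merged.timeout}`);
  }
  return merged;
}

console.log(resolveOptions().timeout);
//=> 5000
console.log(resolveOptions({ timeout: 30000 }).timeout);
//=> 30000
```

The resolved timeout could then be handed to the request library when the scrape is made.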

Line WebToons

https://www.webtoons.com/

Looks like a number of popular manhwa series are hosted here (eg. Tower of God).

Fix Mangadex HTML-encoded output

MangaDex's API returns HTML-encoded responses. This means that series with accents in the title or description come back with those characters still encoded, as in Passé. Their description blocks also include custom formatting such as bold, italics, and hyperlinks. For example, an API call to Flying Witch returns the following:

Kowata Makoto is an airhead with a bad sense of direction who just moved into her relative's house... but is that all?\r\n\r\nThe Oneshot was originally published under a different pen name (Ishioka Chikai).\r\n\r\n[url=https://myanimelist.net/manga/22589]Flying☆Witch Oneshot MAL[/url]
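
One possible cleanup pass is sketched below: decode HTML entities, then strip the BBCode-style tags. Both helpers are hypothetical, the entity table is a deliberately small subset rather than a full decoder, and the [b], [i], and [u] tag names are assumptions (only [url=...] appears in the sample above).

```javascript
// Small subset of named entities; a real decoder would need the full table.
const NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", eacute: 'é', times: '×',
};

// Hypothetical helper: decodes numeric entities and the named ones above.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    const known = Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body);
    return known ? NAMED_ENTITIES[body] : match;
  });
}

// Hypothetical helper: keeps link text, drops the link and simple style tags.
function stripBbcode(text) {
  return text
    .replace(/\[url=[^\]]*\]([\s\S]*?)\[\/url\]/gi, '$1')
    .replace(/\[\/?(?:b|i|u)\]/gi, '');
}

console.log(decodeEntities('Pass&eacute;'));
//=> 'Passé'
console.log(stripBbcode('[url=https://myanimelist.net/manga/22589]Flying☆Witch Oneshot MAL[/url]'));
//=> 'Flying☆Witch Oneshot MAL'
```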

Mangago

Got a request for Mangago since it has a better selection of manhwas than other sites out there.

Looks like they use CloudFlare's DDoS protection, and also obfuscate images pretty aggressively, so here are a few resources to help with writing the site support:

Add information about series and chapter pages to docs

The service repository includes a few lines about chapter index and chapter pages. This repo should include the same information, since it's important context in knowing what to pass to Poketo.

Bulk download?

Hey! Really fantastic work on this. I was wondering if there's a clean way to do a bulk download of a site, bulk meaning download everything.

If anyone has any ideas I'd love to hear!

Thanks
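
Poketo has no site-wide listing method among the four described in the README, so a whole-site download is not covered here. For a single series, though, the existing calls can be chained. The following is a rough sketch under that narrower scope: collectSeriesPages is a hypothetical helper, and `client` stands for the poketo import or anything with the same getSeries and getChapter methods.

```javascript
// Hypothetical helper: walks every chapter of one series, one request at a
// time, and collects the page lists.
async function collectSeriesPages(client, seriesIdOrUrl) {
  const series = await client.getSeries(seriesIdOrUrl);
  const results = [];
  for (const chapterInfo of series.chapters) {
    try {
      const chapter = await client.getChapter(chapterInfo.id);
      results.push({ id: chapterInfo.id, pages: chapter.pages });
    } catch (error) {
      // Keep going; one failed chapter should not abort the whole run.
      results.push({ id: chapterInfo.id, pages: [], error });
    }
  }
  return results;
}

// Usage sketch (requires the poketo package and network access):
// collectSeriesPages(poketo, 'meraki-scans:senryu-girl').then(chapters => {
//   chapters.forEach(c => console.log(c.id, c.pages.length));
// });
```

Requests run sequentially on purpose, to avoid hammering the source site.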

MangaDex HTML Entities are not decoded

Related to #31, it looks like MangaDex returns several fields with HTML-encoded entities that result in some weird displays.

For example, https://mangadex.org/title/21117 returns the title "Robot × Laserbeam" with the × still HTML-encoded.

Support searching for series

This is a big one. Right now clients need to pass series URLs to poketo, which is a pretty inhuman input mechanism. It'd be a nicer experience to search for a series by name or author and get a list of results back.

There's a number of big questions for this:

Should this even be in the poketo library? Or does it belong in a separate service/library?
What sites do we search? Do we check every supported site? That'll be pretty slow.
What do we need to return? Full information about a series?
How do we sort the search results? Maybe we should leave organizing/filtering search results to the client.

That said, I think we can exclude some stuff for now:

Searching by genre — I don't think Poketo should be a general series discovery platform. Other sites do a better job at that. Poketo is meant for scraping / tracking updates / reading. Not more.

MangaHasu

http://mangahasu.se/

Seems like they have a pretty solid collection without much trouble. At least one person has mentioned using them.

Client-side version

Right now clients using this library (like poketo-reader) manually re-write certain behaviours and information. For example, which sites are supported or which sites allow reading.

It'd be nice for the poketo library to expose this information in a browser-friendly way (ie. not loading an entire http polyfill).

Maybe it could even stub out calls to api.poketo.app or something?

MangaHere & Meraki Scans not working

I tried getting series information from MangaHere and Meraki Scans with

poketo.getSeries().

For MangaHere, I tried this URL: http://www.mangahere.cc/manga/dice_the_cube_that_changes_everything/ but got this error:

TypeError: Cannot read property 'trim' of undefined

For Meraki Scans, I tried this URL: https://merakiscans.com/details/their-color/ but got this error:

NotFoundError: Not Found

Since I don't know how to contribute or look into those errors, I hope you'll find out why they occur.
PS: I love this package!!!

Fallen Angels Scans

Link: http://manga.fascans.com/

食戟のソーマ (Shokugeki no Soma) is a popular series that's translated and hosted by Fallen Angels Scans.

MangaStream

https://readms.net/

MangaStream is an alternative host for a number of popular series (Shingeki no Kyojin, One Piece, One Punch Man, etc.) plus a few others I couldn't find elsewhere.

Wrap HTTP Errors

HTTP errors, such as a timed-out request or a 404 on a manga site, pass through to the client. This is fine, except these errors come straight from got, our request library. They use statusCode, which is a different format from Poketo errors, which use code.

We should consolidate how errors are returned so they can be handled without needing to write two separate kinds of error handling code.
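
As a rough sketch of what consolidation could look like, the function below maps both kinds of error onto a single shape with a string code. normalizeError and the code values ('NOT_FOUND', 'HTTP_ERROR', 'UNKNOWN') are hypothetical and not taken from Poketo.

```javascript
// Hypothetical normalizer: gives every error a string `code`, whether it
// carries a numeric statusCode or already has a code of its own.
function normalizeError(error) {
  if (typeof error.code === 'string') {
    return { code: error.code, message: error.message };
  }
  if (typeof error.statusCode === 'number') {
    // Assumed mapping; a real one would likely cover more statuses.
    const code = error.statusCode === 404 ? 'NOT_FOUND' : 'HTTP_ERROR';
    return { code, message: error.message, statusCode: error.statusCode };
  }
  return { code: 'UNKNOWN', message: error.message };
}

const httpError = Object.assign(new Error('Response code 404'), { statusCode: 404 });
console.log(normalizeError(httpError).code);
//=> 'NOT_FOUND'
```

Callers can then switch on code alone, regardless of where the error originated.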

Add caching

Got does something interesting to support caching; it provides a cache adapter as a configurable option. This lets users cache results in-memory or with external services like Redis.

Since so much of Poketo's performance is network-constrained, it might be interesting to add a similar option. The poketo service may want to use Redis, whereas a small implementation of Poketo may not want caching at all, or may opt for an in-memory cache.
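
A minimal sketch of the idea, assuming a Map-like store with get and set (a plain Map by default, or an adapter with the same methods for an external service). withCache and its option names are hypothetical.

```javascript
// Hypothetical wrapper: caches the result of an async lookup by key, with a
// time-to-live in milliseconds. `now` is injectable to make testing easier.
function withCache(lookup, { store = new Map(), ttl = 60000, now = Date.now } = {}) {
  return async function cachedLookup(key) {
    // await lets the same code work with sync (Map) and async stores.
    const hit = await store.get(key);
    if (hit && hit.expires > now()) return hit.value;
    const value = await lookup(key);
    await store.set(key, { value, expires: now() + ttl });
    return value;
  };
}

// Usage sketch (requires the poketo package and network access):
// const getSeriesCached = withCache(id => poketo.getSeries(id), { ttl: 5 * 60 * 1000 });
// getSeriesCached('meraki-scans:senryu-girl').then(series => console.log(series.title));
```

Keeping the store pluggable leaves the choice between no cache, an in-memory cache, and an external one to whoever runs the library.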