poketo's Introduction

poketo-node

Node library for scraping manga sites.

Provides a consistent API for scraping metadata and chapter images from 16+ manga sites. Makes it easy to build manga readers, downloaders, archival tools, and more.

For working examples, check out the Poketo manga reader, or the Poketo CLI!

People should be able to read content on the web in the way that works for them. Manga sites are often a special brand of bad. Each manga page loads a new web page, ads everywhere, yuck! Poketo opens up access to that content to make better tools on top.

🚧 This project is still v0.x.x and the API may change as more sites are added.

Install

npm install poketo --save

You can also use api.poketo.app, a hosted micro-service for this library.

Usage

import poketo from 'poketo';

poketo.getSeries('http://merakiscans.com/senryu-girl/').then(series => {
  console.log(series);
  //=> { id: 'meraki-scans:senryu-girl', title: 'Senryu Girl', chapters: [...], ... }
});

poketo.getChapter('http://merakiscans.com/senryu-girl/5/').then(chapter => {
  console.log(chapter);
  //=> { id: 'meraki-scans:senryu-girl:5', pages: [...], ... }
});

Full documentation of the API can be found below.

Supported Sites

Site                  Series Info  Chapter Images
Helvetica Scans       ✓            ✓
Hot Chocolate Scans   ✓            ✓
Jaimini’s Box         ✓            ✓
Kirei Cake            ✓            ✓
MangaHere             ✓            ✓ (slow)
MangaUpdates          ✓
Mangadex              ✓            ✓
MangaFox              ✓            ✓
Mangakakalot          ✓            ✓
Manganelo             ✓            ✓
MangaRock             ✓            ✓
MangaStream           ✓            ✓
Meraki Scans          ✓            ✓
Phoenix Serenade      ✓            ✓
Sense Scans           ✓            ✓
Sen Manga             ✓            ✓
Silent Sky Scans      ✓            ✓
Tuki Scans            ✓            ✓

If there's a site or group you'd like to see supported, make an issue!

API

Poketo exposes four methods:

poketo.getSeries(idOrUrl: string): Promise<Series>
poketo.getChapter(idOrUrl: string): Promise<Chapter>
poketo.getType(input: string): 'series' | 'chapter'
poketo.constructUrl(id: string): string

To understand what is returned for a Series or Chapter, check out the examples below.

Docs

Get series information

Use poketo.getSeries to get the ID, title, cover image, and chapter listing for a manga series.

poketo.getSeries('http://merakiscans.com/senryu-girl').then(series => {
  console.log(series);
});

// {
//   id: 'meraki-scans:senryu-girl',
//   slug: 'senryu-girl',
//   url: 'http://merakiscans.com/senryu-girl',
//   title: 'Senryu Girl',
//   coverImageUrl: 'http://merakiscans.com/.../senryu_200x0.jpg',
//   chapters: [
//     {
//       id: 'meraki-scans:senryu-girl:1',
//       slug: '1',
//       chapterNumber: '1',
//       volumeNumber: '1',
//       title: '5-7-5 Girl',
//       createdAt: 1522811950
//     },
//     ...
//   ],
//   updatedAt: 1522811950,
// }
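
As a small sketch of working with this shape, the newest chapter can be picked out of the chapter listing. `latestChapter` is a hypothetical helper, not part of Poketo, and it assumes `createdAt` is a Unix timestamp in seconds, as the sample values above suggest. The second chapter in the sample data is made up for illustration.

```javascript
// Hypothetical helper (not part of Poketo): returns the most recently
// created chapter of a series object, or null if there are no chapters.
// Assumes `createdAt` is a Unix timestamp in seconds.
function latestChapter(series) {
  const chapters = series.chapters || [];
  if (chapters.length === 0) return null;
  return chapters.reduce((newest, chapter) =>
    chapter.createdAt > newest.createdAt ? chapter : newest
  );
}

// Sample data in the shape shown above. The second chapter is invented
// purely to give the helper something to compare.
const sampleSeries = {
  id: 'meraki-scans:senryu-girl',
  chapters: [
    { id: 'meraki-scans:senryu-girl:1', chapterNumber: '1', createdAt: 1522811950 },
    { id: 'meraki-scans:senryu-girl:2', chapterNumber: '2', createdAt: 1523416750 },
  ],
};

console.log(latestChapter(sampleSeries).id);
```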

Get pages for a chapter

Use poketo.getChapter to get a page listing for a chapter. This doesn't include any information about the chapter or series, just the page list.

Depending on the site, the page list will also include image dimensions.

poketo.getChapter('http://merakiscans.com/senryu-girl/5').then(chapter => {
  console.log(chapter);
});

// {
//  id: 'meraki-scans:senryu-girl:5',
//  slug: '5',
//  url: 'http://merakiscans.com/senryu-girl/5',
//  pages: [
//    { id: '01', url: 'http://merakiscans.com/image01.png', width: 800, height: 1200 },
//    { id: '02', url: 'http://merakiscans.com/image02.png', width: 800, height: 1200 },
//    ...
//  ]
// }
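
For a downloader, the page list usually needs to be turned into ordered filenames. The sketch below is one way to do that. `pageFilenames` is a hypothetical helper, not part of Poketo, and the `jpg` fallback for URLs without an extension is an assumption.

```javascript
// Hypothetical helper (not part of Poketo): builds zero-padded, ordered
// filenames for a chapter's pages, keeping each image's file extension.
function pageFilenames(chapter) {
  // Pad to at least two digits, or more for chapters with 100+ pages.
  const width = Math.max(String(chapter.pages.length).length, 2);
  return chapter.pages.map((page, index) => {
    const pathname = new URL(page.url).pathname;
    const match = pathname.match(/\.([a-z0-9]+)$/i);
    // Falling back to 'jpg' is an assumption for URLs with no extension.
    const extension = match ? match[1].toLowerCase() : 'jpg';
    const number = String(index + 1).padStart(width, '0');
    return { url: page.url, filename: `${number}.${extension}` };
  });
}

// Usage with the response shape shown above:
const sampleChapter = {
  id: 'meraki-scans:senryu-girl:5',
  pages: [
    { id: '01', url: 'http://merakiscans.com/image01.png', width: 800, height: 1200 },
    { id: '02', url: 'http://merakiscans.com/image02.png', width: 800, height: 1200 },
  ],
};

console.log(pageFilenames(sampleChapter));
```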

Validate a URL or ID

If you have an arbitrary input, you can see if Poketo recognizes it by calling the poketo.getType method. It will return 'series' or 'chapter' if it's supported, or will throw a Poketo error if not.

poketo.getType('http://merakiscans.com/senryu-girl');
//=> 'series'
poketo.getType('http://merakiscans.com/senryu-girl/5');
//=> 'chapter'
poketo.getType('http://merakiscans.com/i/am/a/banana/yo');
//=> throws a `poketo.InvalidUrlError`
poketo.getType('http://google.com');
//=> throws a `poketo.UnsupportedSiteError`
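
If you'd rather not deal with exceptions when validating user input, a thin wrapper can turn the thrown error into a `null`. This is only a sketch: `safeGetType` is a hypothetical helper, and `client` stands for the imported `poketo` module (or any object with a compatible `getType` method).

```javascript
// Hypothetical wrapper (not part of Poketo): returns 'series', 'chapter',
// or null instead of throwing when the input isn't recognized.
function safeGetType(client, input) {
  try {
    return client.getType(input);
  } catch (err) {
    // Any error from getType is treated as "not supported" here.
    return null;
  }
}

// Usage, assuming `poketo` has been imported as in the Usage section:
//   safeGetType(poketo, 'http://merakiscans.com/senryu-girl'); // 'series'
//   safeGetType(poketo, 'http://google.com');                  // null
```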

Get a URL from a Poketo ID

If you've stored a Poketo ID, you can get a URL back out by using the poketo.constructUrl method. The next section explains the difference between IDs and URLs.

poketo.constructUrl('meraki-scans:senryu-girl:5');
//=> http://merakiscans.com/senryu-girl/5

poketo.constructUrl('manga-stream:haikyuu:314/5286');
//=> https://readms.net/r/haikyuu/314/5286/1

What's the difference between an ID and a URL?

Poketo scrapes information from many sites. To identify which site, series (aka. manga), and chapter you're talking about, Poketo lets you provide information in two ways: a Poketo ID or a URL.

These IDs below are equivalent to the URLs on their right:

ID                                URL
mangadex:13127:311433          →  https://mangadex.org/chapter/311433/1
meraki-scans:senryu-girl:5     →  http://merakiscans.com/senryu-girl/5
manga-stream:haikyuu:314/5286  →  https://readms.net/r/haikyuu/314/5286/1

For the getSeries and getChapter methods, you can provide either a URL, or an ID, like so:

// Both lines return the same series
poketo.getSeries('http://merakiscans.com/senryu-girl/');
poketo.getSeries('meraki-scans:senryu-girl');

// Both lines return the same chapter
poketo.getChapter('https://mangadex.org/chapter/311433/1');
poketo.getChapter('mangadex:13127:311433');
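
Judging from the examples above, an ID looks like `site:series` or `site:series:chapter`. The sketch below splits an ID into those parts. `parsePoketoId` is a hypothetical helper, not part of Poketo, and the format is inferred from the examples rather than from a documented spec.

```javascript
// Hypothetical helper (not part of Poketo): splits an ID into its parts.
// The `site:series[:chapter]` layout is inferred from the examples above.
function parsePoketoId(id) {
  const [site, series, ...rest] = String(id).split(':');
  if (!site || !series) {
    throw new TypeError(`Not a recognizable Poketo ID: ${id}`);
  }
  // The chapter part may itself contain slashes, e.g. '314/5286'.
  const chapter = rest.length > 0 ? rest.join(':') : null;
  return { site, series, chapter };
}

console.log(parsePoketoId('meraki-scans:senryu-girl'));
//=> { site: 'meraki-scans', series: 'senryu-girl', chapter: null }
console.log(parsePoketoId('manga-stream:haikyuu:314/5286'));
//=> { site: 'manga-stream', series: 'haikyuu', chapter: '314/5286' }
```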

Why use an ID?

Poketo IDs have stronger guarantees they won't change.

It's not uncommon for a site to change their domain name or URL structure. For example, MangaDex once changed URLs for manga series from https://mangadex.org/manga/1234 to https://mangadex.org/title/1234. If that happens, your URL might break. But by using an ID, Poketo will know to do the right thing. This makes IDs a more robust way to store information about a series.

Of course, there are no true guarantees with scraping. Even an ID that works one day might break the next, but it's a slightly better guarantee.

Error Handling

Scraping isn't perfect. When using Poketo you'll inevitably run into an error, so we try to make what happened as clear as possible.

  • RequestError - unable to make a request to scrape the site
  • TimeoutError - tried to make a request, but the source site didn't respond in a reasonable time. Defaults to 5 seconds.
  • HTTPError - tried to scrape the site, but the site returned an error (e.g. 404, 500)
  • LicenseError - tried to scrape the site, but the series/chapter is currently blocked, licensed, or was subject to a DMCA takedown.
  • UnsupportedSiteError - the site you're trying to scrape from isn't supported. If you'd like to see it supported, make an issue!
  • UnsupportedOperationError - some sites don't support reading chapters. This error is thrown if you call poketo.getChapter for these sites.
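
Some of these errors (timeouts, failed requests) are often transient, so retrying can help. The sketch below is a generic retry helper, not something Poketo provides. `withRetry` and its options are hypothetical, and the usage comment assumes the error classes are exposed on the `poketo` export the same way `poketo.InvalidUrlError` is in the getType example.

```javascript
// Hypothetical helper (not part of Poketo): retries an async operation when
// `isRetryable(err)` returns true, waiting a little longer before each retry.
// With `retries: 2` the operation runs at most three times.
async function withRetry(operation, options = {}) {
  const { retries = 2, delayMs = 500, isRetryable = () => true } = options;
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      await new Promise(resolve => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
}

// Usage, assuming `poketo.TimeoutError` and `poketo.RequestError` exist as
// exported classes (an assumption based on the getType example):
//   withRetry(() => poketo.getSeries('meraki-scans:senryu-girl'), {
//     isRetryable: err =>
//       err instanceof poketo.TimeoutError || err instanceof poketo.RequestError,
//   }).then(series => console.log(series.title));
```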

Contributing

Contributions are welcome! Poketo is meant to be built on top of. Feel free to propose ideas or changes that would make it work for your situation, whether it's a bug report, site request, or contributed code. Read more at CONTRIBUTING.md.

poketo's People

Contributors

dependabot[bot], rosszurowski, zsoulweaver

poketo's Issues

Revise API

This issue is a collection of larger breaking changes to the API that could improve the clarity/ergonomics of the library. These would be batched into the 1.0 release.

These are based on what I've found extending support to a bunch of different sites:

  • Separate getSeries and getChapterList. For long-running series on sites like Mangadex, fetching the chapter list can take several seconds as it pages through results. It's also unlikely that any of the series metadata has changed since the last fetch, so we can just rely on cached versions for that.
  • Rename getChapter to getChapterPages (or similar). We're really only fetching page URLs here, so the name should reflect that. Chapter metadata is fetched as a part of the series call.
  • Move dates to ISO 8601. I've found unix timestamps much easier to deal with, but ISO 8601 is undoubtedly the standard date format of the web. It'd be nice to not introduce a learning curve for users of the library. (Here's the essay that convinced me)
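
Until such a change lands, converting between the two date formats on the client side is short. The sketch below assumes the current timestamps are Unix seconds (as in the README examples). The helper names are made up for illustration.

```javascript
// Convert a Unix timestamp in seconds to an ISO 8601 string.
function unixToIso(unixSeconds) {
  return new Date(unixSeconds * 1000).toISOString();
}

// Convert an ISO 8601 string back to a Unix timestamp in seconds.
function isoToUnix(isoString) {
  return Math.floor(Date.parse(isoString) / 1000);
}

console.log(unixToIso(1522811950));
//=> '2018-04-04T03:19:10.000Z'
console.log(isoToUnix('2018-04-04T03:19:10.000Z'));
//=> 1522811950
```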

Add configurable timeout

Right now Poketo hard-codes timeouts to be 5s long (source). This makes sense for the poketo service, but might not make as much sense for someone running poketo locally in a downloader.

It'd be nice to expose the timeout as a configurable option, to support these different use-cases.

poketo.getChapter('meraki-scans:senryu-girl:5', { timeout: 30000 }).then(chapter => {
  // ...
});

Fix Mangadex HTML-encoded output

MangaDex's API returns HTML-encoded responses. This means that series with accents in the title and descriptions return things like Pass&eacute;. Their description blocks also include custom formatting like bold, italics, and hyperlinks. For example, an API call to Flying Witch returns the following:

Kowata Makoto is an airhead with a bad sense of direction who just moved into her relative's house... but is that all?\r\n\r\nThe Oneshot was originally published under a different pen name (Ishioka Chikai).\r\n\r\n[url=https://myanimelist.net/manga/22589]Flying☆Witch Oneshot MAL[/url]
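
One possible cleanup is sketched below: decode the HTML entities, then strip the BBCode-style tags. These helpers are hypothetical and not Poketo's actual fix. Only a handful of named entities and simple tags are covered, so a dedicated library would be a better fit for full coverage.

```javascript
// Hypothetical cleanup helpers (not part of Poketo). The entity table is
// deliberately tiny; unknown named entities are left untouched.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", eacute: 'é' };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as-is rather than throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function stripBBCode(text) {
  return text
    .replace(/\[url=[^\]]*\]([\s\S]*?)\[\/url\]/gi, '$1') // keep only the link text
    .replace(/\[\/?(?:b|i|u|url)\]/gi, ''); // drop simple formatting tags
}

console.log(decodeEntities('Pass&eacute;'));
//=> 'Passé'
console.log(stripBBCode('[url=https://myanimelist.net/manga/22589]Flying☆Witch Oneshot MAL[/url]'));
//=> 'Flying☆Witch Oneshot MAL'
```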

Bulk download?

Hey! Really fantastic work on this. I was wondering if there's a clean way to do a bulk download of a site, bulk meaning download everything.

If anyone has any ideas I'd love to hear!

Thanks
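
Nothing in the README describes a built-in bulk download, but the documented getSeries and getChapter calls can be combined for a single series. The sketch below is one way to do that. `collectSeriesPages` is a hypothetical helper, `client` stands for the imported `poketo` module, and the one-second pause between requests is an arbitrary, polite default.

```javascript
// Hypothetical sketch (not part of Poketo): walks every chapter of one series
// and collects its page URLs, one request at a time with a pause in between
// to avoid hammering the source site.
async function collectSeriesPages(client, seriesIdOrUrl, options = {}) {
  const { delayMs = 1000 } = options;
  const series = await client.getSeries(seriesIdOrUrl);
  const result = [];
  for (const chapterInfo of series.chapters) {
    const chapter = await client.getChapter(chapterInfo.id);
    result.push({
      chapterId: chapterInfo.id,
      pageUrls: chapter.pages.map(page => page.url),
    });
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return result;
}

// Usage, assuming `poketo` has been imported as in the Usage section:
//   collectSeriesPages(poketo, 'meraki-scans:senryu-girl').then(chapters => {
//     console.log(chapters.length, 'chapters collected');
//   });
```

Downloading the images themselves is left out on purpose. Sites that don't support reading chapters would reject the getChapter call with an UnsupportedOperationError, so real code would want error handling around the loop.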

Support searching for series

This is a big one. Right now clients need to pass series URLs to poketo, which is a pretty inhuman input mechanism. It'd be a nicer experience to search for a series by name or author and get a list of results back.

There's a number of big questions for this:

  • Should this even be in the poketo library? Or does it belong in a separate service/library.
  • What sites do we search? Do we check every supported site? That'll be pretty slow.
  • What do we need to return? Full information about a series?
  • How do we sort the search results? Maybe we should leave organizing/filtering search results to the client.

That said, I think we can exclude some stuff for now:

  • Searching by genre: I don't think Poketo should be a general series discovery platform. Other sites do a better job at that. Poketo is meant for scraping / tracking updates / reading. Not more.

MangaHasu

http://mangahasu.se/

Seems like they have a pretty solid collection without much trouble. At least one person has mentioned using them.

Client-side version

Right now clients using this library (like poketo-reader) manually re-write certain behaviours and information. For example, which sites are supported or which sites allow reading.

It'd be nice for the poketo library to expose this information in a browser-friendly way (i.e. not loading an entire http polyfill).

Maybe it could even stub out calls to api.poketo.app or something?

MangaHere & Meraki Scans not working

I tried getting series information from MangaHere and Meraki Scans with poketo.getSeries().

For MangaHere, I tried this URL: http://www.mangahere.cc/manga/dice_the_cube_that_changes_everything/ but got this error:

TypeError: Cannot read property 'trim' of undefined

For Meraki Scans, I tried this URL: https://merakiscans.com/details/their-color/ but got this error:

NotFoundError: Not Found

I don't know the codebase well enough to dig into these errors myself, so I hope you can find out why they occur.
PS: I love this package!!!

MangaStream

https://readms.net/

MangaStream is an alternative host for a number of popular series (Shingeki no Kyojin, One Piece, One Punch Man, etc.) plus a few others I couldn't find elsewhere.

Wrap HTTP Errors

HTTP errors, such as a timed-out request or a 404 on a manga site, pass through to the client. This is fine, except these errors are straight from got, our request library. They use statusCode, which is a different format than Poketo errors, which use code.

We should consolidate how errors are returned so they can be handled without needing to write two separate kinds of error handling code.
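
Until errors are consolidated in the library, a client could normalize them itself. The sketch below is not Poketo's implementation. `normalizeError` is a hypothetical helper, and the `HTTP_<status>` naming for the generated `code` is an assumption, since the issue doesn't say what format Poketo's codes take.

```javascript
// Hypothetical sketch (not Poketo's implementation): makes sure an error has
// a `code` property so callers only need one style of error handling.
// Assumes HTTP errors from the request library carry a numeric `statusCode`.
function normalizeError(err) {
  if (err.code === undefined && typeof err.statusCode === 'number') {
    // The 'HTTP_<status>' format is made up for this example.
    err.code = `HTTP_${err.statusCode}`;
  }
  return err;
}

// Usage, assuming `poketo` has been imported as in the Usage section:
//   poketo.getSeries(url).catch(err => {
//     console.error(normalizeError(err).code);
//   });
```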

Add caching

Got does something interesting to support caching; it provides a cache adapter as a configurable option. This lets users cache results in-memory or with external services like Redis.

Since so much of Poketo's performance is network-constrained, it might be interesting to add a similar option. The poketo service may want to use Redis, whereas a small implementation of Poketo may not want caching at all, or may opt for an in-memory cache.
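
As a rough idea of what the in-memory variant could look like from the outside, the sketch below wraps any async single-argument function with a cache that expires entries after a while. `withCache` and its options are hypothetical, not an existing Poketo option. The `store` parameter accepts anything with Map-like `get` and `set` methods, loosely mirroring the adapter idea.

```javascript
// Hypothetical sketch (not part of Poketo): wraps an async, single-argument
// function with a cache whose entries expire after `ttlMs` milliseconds.
// Failed calls are not cached, and concurrent calls are not de-duplicated.
function withCache(fn, options = {}) {
  const { ttlMs = 60000, store = new Map(), now = Date.now } = options;
  return async function cached(key) {
    const hit = store.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = await fn(key);
    store.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage, assuming `poketo` has been imported as in the Usage section:
//   const getSeries = withCache(id => poketo.getSeries(id), { ttlMs: 5 * 60 * 1000 });
//   getSeries('meraki-scans:senryu-girl').then(series => console.log(series.title));
```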

Critical dependency errors

Every time I try to import Poketo I get this error and my site won't work anymore. How should I go about fixing it?

error

And if I go onto an actual page it will throw this error

error
