dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.

Home Page: https://pypi.org/project/lightnovel-crawler/

License: GNU General Public License v3.0

Python 99.10% Shell 0.28% CSS 0.25% Batchfile 0.08% HTML 0.04% Dockerfile 0.09% Procfile 0.01% JavaScript 0.16%
lightnovel termux web-scraper console-app python lightnovel-crawler discord telegram kindle-books

lightnovel-crawler's Issues

output_path referenced before assignment

Installed using pip on Windows. Created the folders manually. See #10.

.\ebook_crawler.exe webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = LI1UErwQVWiDGnvCmmpgBamfOpavmDGUCZJqglkP
Getting book name and chapter list...
1646 chapters found
Traceback (most recent call last):
  File "c:\program files\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\program files\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Karl\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
  File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 34, in main
    end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
  File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 44, in start
    novel_to_kindle(self.output_path)
  File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 83, in novel_to_kindle
    for file_name in sorted(os.listdir(output_path)):
UnboundLocalError: local variable 'output_path' referenced before assignment
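
Python raises this error when a function reads a local name on a code path where it was never assigned. The source of binding.py is not shown in this report, so the following is only a minimal sketch of how that can happen and one way to guard against it. The function names and parameters are hypothetical, not the project's code.

```python
import os
import tempfile


def list_chapter_files_buggy(input_path, use_default=False):
    # 'output_path' is assigned in only one branch, so the other branch
    # reaches os.listdir() while the local name is still unbound.
    if use_default:
        output_path = input_path
    return sorted(os.listdir(output_path))


def list_chapter_files(input_path, output_path=None):
    # Assign on every path before use, and create the folder if it is missing.
    if output_path is None:
        output_path = input_path
    os.makedirs(output_path, exist_ok=True)
    return sorted(os.listdir(output_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        try:
            list_chapter_files_buggy(tmp)
        except UnboundLocalError as err:
            print("buggy version:", err)
        print("fixed version:", list_chapter_files(tmp))
```

Creating the folder inside the function would also remove the need to create the folders manually, assuming that is acceptable for the tool.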

readlightnovel.org spider duplicate text

The readlightnovel.org crawler is duplicating the text. The HTML source contains a hidden div, which does not contain ads, and a visible div. The crawler gets both, and it does not filter out the ads sub-div either.

Need New Source

Having 4 sources available is really amazing, but more sources would be even better. I think wuxiaworld.co and novelplanet are perfect candidates for new sources to be packed into EPUB.

Readlightnovel.com problems

I think there is a problem with readlightnovel.com;
the rest of the sites are okay.
I tried: book_crawler readln full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830 https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835

And this is what I get from the terminal:
Visiting https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband
Getting book name and chapter list...
[Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband] 1842 chapters found
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1831
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1832
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1833
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1834
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01836.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01835.json
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1841
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01834.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01837.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01833.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01838.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01839.json
complete
Processing: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19
Creating: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/epub/Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband_v19.epub
Traceback (most recent call last):
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 128, in novel_to_mobi
    generator(KINDLEGEN_PATH_MAC)
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in <lambda>
    generator = lambda kindlegen: call([kindlegen, epub_file])
  File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/anaconda3/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
    raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 136, in novel_to_mobi
    generator('kindlegen')
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in <lambda>
    generator = lambda kindlegen: call([kindlegen, epub_file])
  File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/anaconda3/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'kindlegen'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/bin/ebook_crawler", line 11, in <module>
    sys.exit(main())
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/__init__.py", line 51, in main
    volume=volume,
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/readln.py", line 49, in start
    novel_to_mobi(self.output_path)
  File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 138, in novel_to_mobi
    if err[1].errno == errno.ENOENT:
TypeError: 'FileNotFoundError' object is not subscriptable

Afterwards, the only result I get is an EPUB with all the chapter titles but no content.
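
The final TypeError comes from indexing the exception object: in Python 3 exceptions are not subscriptable, and the error code is available as the `errno` attribute instead. Below is a minimal sketch of a fallback chain that tries several kindlegen locations in turn. `run_first_available` is a hypothetical helper, not the code in binding.py, and which paths to try is left to the caller.

```python
import errno
import subprocess


def run_first_available(candidates, args):
    """Try each executable in turn; return the one that ran, or None."""
    for exe in candidates:
        try:
            subprocess.call([exe] + list(args))
            return exe
        except OSError as err:
            # PermissionError and FileNotFoundError are both OSError
            # subclasses. Read the code from err.errno, not err[1].errno.
            if err.errno in (errno.ENOENT, errno.EACCES):
                continue  # missing or not executable: try the next one
            raise
    return None
```

A caller could pass the bundled binary first and the plain `kindlegen` command second, and report a readable message when the result is `None` instead of crashing. The "Permission denied" on the bundled binary suggests it may lack the executable bit, but that is a guess from the log.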

LNMTL chapter numbers are wrong

When I compile chapters from LNMTL, the chapter numbers don't match what they actually are on the site. For instance, I downloaded the last 10 chapters of Martial God Asura and it had them as 3123-3132 when they should be 3620-3629, which is what they are on the site. The chapters themselves match, just not the chapter numbers.

Implement novel searching feature

Make it more intelligent by following this workflow:

  • Take any string as input.
  • If it is a supported URL, start the crawler immediately.
  • Otherwise, display a list of websites that support searching. The user has to choose one.
  • Display search results from the selected website. The user selects one from the result list.
  • Start the crawler using the lightnovel URL.
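
The workflow above can be sketched roughly as follows. Everything here is hypothetical and for illustration only: the `SUPPORTED_HOSTS` set, the shape of the `searchers` mapping and the `choose` callback are assumptions, not the project's actual API.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts that have a crawler.
SUPPORTED_HOSTS = {"lnmtl.com", "www.webnovel.com", "www.readlightnovel.org"}


def is_supported_url(text):
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and parsed.netloc in SUPPORTED_HOSTS


def resolve_novel_url(query, searchers, choose):
    """Turn any input string into a novel URL, or None if nothing is found.

    searchers: dict mapping a site name to a function(query) -> [(title, url)]
    choose:    function(options) -> index picked by the user
    """
    if is_supported_url(query):
        return query.strip()  # supported URL: start crawling immediately
    sites = sorted(searchers)
    if not sites:
        return None
    site = sites[choose(sites)]  # user picks a site that supports search
    results = searchers[site](query)
    if not results:
        return None
    picked = choose([title for title, _ in results])  # user picks a result
    return results[picked][1]
```

Passing `choose` in as a callback keeps the same logic usable from the console prompt and from a chat bot, which is a design choice of this sketch.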

Implemented it for:

X represents the sites that do not support searching, or for which searching could not be implemented.

Add Intro Page to generated book

Many of the source sites have a synopsis and other info about the book. Maybe we can crawl it and add intro pages to the generated book, perhaps also noting that the book was generated using this script, etc.

Windows Binary Does nothing after inputting the novel link

The provided Windows binary simply exits after entering the login information with your GitHub link reference.
Checked on a fresh install of Windows with no Python or any other possible dependency installed, just pure Windows.

UPDATE:
Calling the exe with all the proper parameters instead of using the interactive menu seems to make it work absolutely fine.

LNMTL login issues

I'm trying to download from this URL: https://lnmtl.com/novel/forty-millenniums-of-cultivation
This is one of the novels on the site that requires logging in to read.
The exact command I'm using:
lncrawl --login <username> <password> -s https://lnmtl.com/novel/forty-millenniums-of-cultivation
Username and password redacted for obvious reasons.
I followed the on-screen prompts for the output directory and selected the option for the first 10 chapters.
I get "body is empty" for every chapter; the generated files are indeed empty.
lncrawl --version returns 2.7.6

I also tried downloading the above novel by just going through the prompts instead of using option flags, and I was never prompted to log in despite the novel requiring it.

Discord bot closed when processing

When I use the Discord bot and ask it to generate a book for a novel with more than 200 chapters, it usually produces an error like this:
[ERROR] (asyncio) Task was destroyed but it is pending! task: <Task pending coro=<Client._run_event() running at /home/yudi/.local/lib/python3.6/site-packages/discord/client.py:307> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_connect_done(15)(), <TaskWakeupMethWrapper object at 0x7f52f3c029a8>()]>>
and then the bot is destroyed (closed). But I think this does not happen with the lightnovel-crawler bot linked in the README. Is there something I missed when deploying the Discord bot?

Chapter Title from TOC and from chapter page is different

While converting the scrapers to the new style, I found that in the new-style template the chapter title is scraped from the TOC page, not from the chapter page. In the old template the crawler got it from the chapter page. Some sources, like boxnovel and novel planet, only write the title as a number in the TOC. I think we need to get the chapter title from the chapter page rather than from the TOC page.

lnmtl.com terminates when chapter_no not purely numeric

Visiting: https://lnmtl.com/chapter/the-amber-sword-book-3-chapter-531-1
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./__main__.py", line 74, in <module>
    main()
  File "./__main__.py", line 28, in main
    end_url=sys.argv[4] if len(sys.argv) > 4 else ''
  File "./EbookCrawler/lnmtl.py", line 57, in start
    self.crawl_chapters(browser)
  File "./EbookCrawler/lnmtl.py", line 89, in crawl_chapters
    self.parse_chapter(browser)
  File "./EbookCrawler/lnmtl.py", line 112, in parse_chapter
    chapter_no = re.search(r'chapter-\d+$', url).group().strip('chapter-')
AttributeError: 'NoneType' object has no attribute 'group'

There are two chapters labeled #531 on the website.
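
The pattern in the traceback requires the URL to end right after the digits, so a trailing part suffix such as `-1` makes `re.search` return `None`. Note also that `str.strip('chapter-')` removes a set of characters, not a prefix. A minimal sketch of a more tolerant parser follows. `parse_chapter_no` is a hypothetical helper, and treating the suffix as a "part" number is an assumption about how the site labels duplicate chapters.

```python
import re

# chapter-<number>, optionally followed by -<part>, at the end of the URL
CHAPTER_RE = re.compile(r'chapter-(\d+)(?:-(\d+))?/?$')


def parse_chapter_no(url):
    """Return (chapter, part) for a chapter URL, or None if it does not match."""
    match = CHAPTER_RE.search(url)
    if match is None:
        return None  # let the caller decide instead of crashing on .group()
    chapter = int(match.group(1))
    part = int(match.group(2)) if match.group(2) else 0
    return chapter, part


if __name__ == "__main__":
    print(parse_chapter_no(
        "https://lnmtl.com/chapter/the-amber-sword-book-3-chapter-531-1"))
```

Returning a tuple keeps the two chapters labeled #531 distinct and still sortable.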

Discord bot cannot get novel url in channel chat

The Discord bot works great in 1-on-1 chat but not in channel chat, because it cannot read the novel URL or search term in the chat. Unlike Telegram, Discord channels don't have a reply feature. Maybe we need a shorter command feature for channel chat, while retaining the conversation-like requests for 1-on-1 chat. A shorter command would be similar to the arguments of the console app, for example:
For searching we can do:
!lncrawl search novel_title novel_source --> to generate the novel URL
For generating a book:
!lncrawl format_book novel_url pack_by_volume all --> to generate all chapters in format_book format
etc.

Request: novelfull.com

I've noticed it isn't supported, and I was wondering if you could add it. It has a very similar layout to some of the other supported sites like BoxNovel, Readinglightnovel and Novelplanet.

code from sekindo

I've noticed it before but thought it was a one-off. Sometimes I get this kind of error/message in my books.

This is a sentence in the book. "Hello, I'm a example." said Me.
code from sekindo - Readlightnovel.org In-article - outstream
code from sekindo
/339474670/ReadLightNovel/InStory_1
This is a sentence in the book. "Hello, I'm a example." said Me.

As you can see, I get this weird message and duplicate sentences. There are also others throughout at random points; I've listed them below.

/339474670/ReadLightNovel/InStory_3
/339474670/ReadLightNovel/InStory_2
/339474670/ReadLightNovel/BottomStory

I don't know what causes it. I was downloading this novel: https://www.readlightnovel.org/mo-tian-ji
If you need any more info, let me know and I will try to provide it.
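
Until the crawler skips the ad containers at the HTML level, a text-level cleanup could be sketched like this. The patterns are derived only from the lines quoted in this report, and dropping a line that exactly repeats the previous kept line is an assumption that could also remove a legitimately repeated sentence. `clean_chapter_text` is a hypothetical helper.

```python
import re

# Ad residue seen in this report: "code from sekindo ..." lines and
# ad-slot paths such as /339474670/ReadLightNovel/InStory_1
AD_LINE_RE = re.compile(
    r'^\s*(code from sekindo\b.*|/\d+/ReadLightNovel/\w+)\s*$',
    re.IGNORECASE,
)


def clean_chapter_text(text):
    cleaned = []
    for line in text.splitlines():
        if AD_LINE_RE.match(line):
            continue  # drop ad residue
        # Drop a non-empty line that exactly repeats the previous kept line.
        if cleaned and line.strip() and line.strip() == cleaned[-1].strip():
            continue
        cleaned.append(line)
    return "\n".join(cleaned)
```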

Add new sources

Mac user

Hi, a beginner here.
Can you please provide instructions for Mac users? Thank you.

Windows cygwin: FileNotFoundError: [WinError 3] The system cannot find the path specified

Laptop@Lenovo ~/novel
$ ebook_crawler webnovel 7931338406001705 1 10 false
Traceback (most recent call last):
  File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\laptop\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
  File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 44, in main
    volume=volume,
  File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 49, in start
    novel_to_mobi(self.output_path)
  File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 110, in novel_to_mobi
    for file_name in sorted(os.listdir(epub_path)):
FileNotFoundError: [WinError 3] The system cannot find the path specified: '7931338406001705\\epub'
Getting CSRF Token from  https://www.webnovel.com/book/7931338406001705
CSRF Token = mNpxvSI9mDTA9EixtmCelAFwmC3Ifgv4TzZOQRqM
Getting book name and chapter list...
1159 chapters found
7931338406001705 does not exists

I get this error. I am using Cygwin 64-bit on Windows 10. Installed using pip install ebook-crawler.
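
The log suggests the output folder was never created ("does not exists"), so the later `os.listdir` call on its `epub` subfolder fails. A minimal sketch of a guard, assuming the `<output>/epub` layout visible in the error message; `list_epub_files` is a hypothetical helper, not the function in binding.py.

```python
import os


def list_epub_files(output_path):
    """Return the sorted .epub file names under <output_path>/epub."""
    epub_path = os.path.join(output_path, "epub")
    if not os.path.isdir(epub_path):
        # Nothing was generated: return an empty list instead of raising.
        return []
    return sorted(
        name for name in os.listdir(epub_path)
        if name.lower().endswith(".epub")
    )
```

The caller can then print a clear "no epub files found" message and skip the mobi conversion.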

UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 48: illegal multibyte sequence

Created: I’m in Hollywood.epub
Failed to generate mobi for I’m in Hollywood.epub
Traceback (most recent call last):
  File "c:\program files (x86)\python36-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\program files (x86)\python36-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Program Files (x86)\Python36-32\Scripts\lightnovel-crawler.exe\__main__.py", line 9, in <module>
  File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\__init__.py", line 65, in main
    start_app(crawler_list)
  File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\__init__.py", line 34, in start_app
    Program().run(crawler())
  File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\program.py", line 38, in run
    bind_books(self)
  File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\bind_books.py", line 56, in bind_books
    file.write(text)
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 48: illegal multibyte sequence
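
When `open()` is called in text mode without an `encoding` argument, Python uses the locale's preferred encoding, which is cp932 on a Japanese Windows setup and cannot represent the non-breaking space `'\xa0'`. A minimal sketch of writing with an explicit encoding follows. `write_text_file` is a hypothetical helper, and whether bind_books.py opens its file this way is an assumption based on the traceback.

```python
def write_text_file(path, text):
    # An explicit encoding makes the output independent of the system locale.
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
```

Reading the file back should use the same `encoding="utf-8"` argument.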

LNMTL not working

LNMTL doesn't seem to be working. For example, I tried the last 10 chapters of Martial God Asura and it came back with this:

"? Enter an url or novel name to find: https://lnmtl.com/novel/martial-god-asura
Retrieving novel info...
NOVEL: Martial God Asura
? Which chapters to download? Last 10 chapters
Getting cover image...
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3623
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3621
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3624
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3620
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3622
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3625
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3628
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3627
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3626
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3629
Downloading chapters |████████████████████████████████| 10/10
Created: 10 text files
Created: 10 html files
Created: Martial God Asura.epub
Created: Martial God Asura.mobi"

And the files contain no text. I have been able to get other sites to work fine, just not LNMTL.
I have been trying with and without the login option. If it works, then can you please give an example of the commands to get it to work? Thanks.

Add uploader for google drive

For issue #52, we can add an uploader for Google Drive and share the Google Drive link by sending a message. I have already created a function to do that in my forked repository. Should I open a pull request against master or against another branch of the repository?

include function/parameter to bypass "Press Enter to exit"

I'm trying to integrate and automate using the Windows version of the crawler. When I try to batch-run the executable, it stops after each execution because the "Enter" key must be pressed. I tried the -f and --suppress flags, but execution still requires the "Enter" key press. Adding such a flag/parameter would help automation.

Webnovel error invalid literal for int() with base 10:

When I try to create an EPUB from webnovel, I get the error "invalid literal for int() with base 10":

python3 main.py webnovel 10377938706023605 https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad https://www.webnovel.com/book/10377938706023605/30154170363336851/Last-Wish-System/Crossing-the-Border
Getting CSRF Token from https://www.webnovel.com/book/10377938706023605
CSRF Token = 9eJJFX5txT0r9s3004p1rDY61DZrTfvslGGHmp61
Getting book name and chapter list...
148 chapters found
Traceback (most recent call last):
  File "main.py", line 2, in <module>
    main()
  File "/home/yudi/book/Web Scrapper/ebook_crawler/__init__.py", line 34, in main
    end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
  File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 43, in start
    self.get_chapter_bodies()
  File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 83, in get_chapter_bodies
    start = int(self.start_chapter)
ValueError: invalid literal for int() with base 10: 'https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad'
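
The traceback shows the webnovel crawler passing the start argument straight to `int()`, so a chapter URL fails there. One possible way to accept both forms is sketched below. `resolve_chapter_index` is a hypothetical helper, and matching a URL against the crawled chapter list by exact string is an assumption about how the list is stored.

```python
def resolve_chapter_index(arg, chapter_urls, default):
    """Map a command-line argument to a 0-based index into chapter_urls."""
    arg = (arg or "").strip()
    if not arg:
        return default
    if arg.isdecimal():
        # 1-based chapter number, clamped to the available range
        return max(0, min(int(arg), len(chapter_urls)) - 1)
    try:
        return chapter_urls.index(arg.rstrip("/"))
    except ValueError:
        raise ValueError(
            "not a chapter number or a known chapter URL: %r" % arg)
```

With this, an unknown URL produces a readable message instead of the bare `int()` error.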

Thanks for helping

Telegram Bot when upload file finished not destroy crawler instance

Hi Sudipto,

After trying and testing the Telegram bot, I found that once the zip file upload finishes, the session is not closed and the crawler instance is not destroyed. So when I call /start again, I first have to call /cancel so the bot can accept a new job. Is that the right flow?

The other problem I found is that even after calling /cancel and then /start again, the volume and chapter numbers are cumulative (past session + this session).

Is the crawler instance passed between sessions?

LNMTL partially missing output when splitting by volume

All chapters download fine, but some chapters are missing from the output formats when opting to generate separate files for each volume.

on version 2.7.7
was downloading https://lnmtl.com/novel/forty-millenniums-of-cultivation
opted to generate separate file for each volume
should have 34 volumes
only 33 volumes (for all output formats except json)

with the json format
all the volume_title fields show 1 volume higher than they are supposed to be (except the very last volume)
ex. the json file for the very first chapter:
{"id": 1, "url": "https://lnmtl.com/chapter/forty-millenniums-of-cultivation-chapter-1", "volume": 1, "title": "Chapter #1 - Magical Artifact Graveyard", "volume_title": "Volume 2",...

Discord bot file size limit

Discord won't allow uploading files larger than 8MB, but most of the compressed files are larger than 8MB. We need another way to send files to Discord.

Novel binding fails

Processing: _novel\8093990805004205
!! Failed to bind: _novel\8093990805004205

This is when providing the chapter numbers.

python3 main.py webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = BEwNDFH7yoADt2uvgWH9Y2ZdxMHKanalcugCh9WI
Getting book name and chapter list...
1646 chapters found
Processing: _novel\8093990805004205
Creating: NA\NA_v.epub


Amazon kindlegen(Windows) V2.9 build 1029-0897292
A command line e-book compiler
Copyright Amazon.com and its Affiliates 2014


Info:I9007:option: -c2: Kindle Huffdic compression
Error(opfparser):E20004: the id in the spine does not match any item in the manifest: cover

That is without giving chapter numbers.
