niqdev / packtpub-crawler
Download your daily free Packt Publishing eBook https://www.packtpub.com/packt/offers/free-learning
License: MIT License
I came across this repo. I'm not sure whether it works (sadly it won't work for what I want to do, since it uses Selenium), but maybe it's worth looking into?
https://github.com/eastee/rebreakcaptcha
My thinking is to somehow use this to get around the reCAPTCHA on the PacktPub site.
Hello,
I'm having the issue below and hoping you could give me a little help.
Cheers
Marcus
[*] 2017-06-20 15:05 - fetching today's eBooks
[*] configuration file: /opt/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@97
Traceback (most recent call last):
File "/opt/packtpub-crawler/script/spider.py", line 97, in main
packtpub.runDaily()
File "/opt/packtpub-crawler/script/packtpub.py", line 161, in runDaily
self.__parseDailyBookInfo(soup)
File "/opt/packtpub-crawler/script/packtpub.py", line 93, in __parseDailyBookInfo
self.info['url_claim'] = self.__url_base + div_target.select('a.twelve-days-claim')[0]['href']
IndexError: list index out of range
[*] no free eBook from newsletter right now
[*] done
pip install -r requirements.txt succeeded fine, using Ubuntu 16.04. :-)
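The crash above comes from indexing the result of select() without checking it. A minimal defensive sketch (the function name is hypothetical; div_target stands in for the parsed node in packtpub.py):

```python
def parse_claim_url(div_target, url_base):
    # select() returns a list; it is empty when the page layout changes
    # or when there is no daily claim link, which is what raised the
    # IndexError in the traceback above.
    links = div_target.select('a.twelve-days-claim')
    if not links:
        return None  # caller can log "no free eBook" instead of crashing
    return url_base + links[0]['href']
```

Returning None would let runDaily() print a friendly message instead of raising.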
Hello, I love your script and I would like to ask for an extra feature!
Putting files into a folder on Google Drive would be much better and tidier.
Here is the reference of the Python code:
https://developers.google.com/drive/v3/web/folder#creating_a_folder
If you have no time to do it and don't mind me making a pull request, I'm happy to help.
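For reference, a minimal sketch of what the linked Drive v3 docs describe, assuming `service` is an already-authorized googleapiclient Drive service (the function name is hypothetical):

```python
def create_drive_folder(service, name):
    # Folders in Drive are just files with a special MIME type.
    metadata = {
        'name': name,
        'mimeType': 'application/vnd.google-apps.folder',
    }
    return service.files().create(body=metadata, fields='id').execute()['id']
```

Each uploaded ebook would then include 'parents': [folder_id] in its own file metadata to land inside the folder.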
Hi, can you please add support for downloading PacktPub video courses that are only available to watch online? Thanks.
I have cloned the repository and installed all of the dependencies, but I am getting several errors upon running the following in command line:
>>C:\Python27\python.exe C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\spider.py --config C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\config\prod_example.cfg
The Errors I'm getting are:
Traceback (most recent call last):
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\spider.py", line 9, in <module>
from upload import Upload, SERVICE_GOOGLE_DRIVE, SERVICE_ONEDRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\upload.py", line 1, in <module>
from googledrive import GoogleDrive
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\googledrive.py", line 6, in <module>
import magic
File "build\bdist.win-amd64\egg\magic.py", line 176, in <module>
ImportError: failed to find libmagic. Check your installation
I'm using the prod_example.cfg file as my config file, and I have changed the email and password values in it.
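If installing libmagic on Windows is a hassle, one hedged workaround is to guess the MIME type from the filename with the stdlib instead of the file contents (a sketch, not the project's actual code):

```python
import mimetypes

def guess_mime_type(file_path):
    # Extension-based guess; less accurate than libmagic, but has no
    # native dependency, which is what fails to load on Windows here.
    mime, _encoding = mimetypes.guess_type(file_path)
    return mime or 'application/octet-stream'
```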
I'm getting ImportError: No module named onedrivesdk even though I have it installed.
Traceback (most recent call last):
File "script/spider.py", line 9, in <module>
from upload import Upload, SERVICE_GOOGLE_DRIVE, SERVICE_ONEDRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "/home/abc/packtpub-crawler/script/upload.py", line 2, in <module>
from onedrive import OneDrive
File "/home/abc/packtpub-crawler/script/onedrive.py", line 4, in <module>
import onedrivesdk
ImportError: No module named onedrivesdk
abc@xyz:~/packtpub-crawler$ sudo pip install onedrivesdk
Requirement already satisfied: onedrivesdk in /usr/local/lib/python3.5/site-packages
Requirement already satisfied: requests>=2.6.1 in /usr/local/lib/python3.5/site-packages (from onedrivesdk)
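The pip output above points at a python3.5 site-packages directory, while the crawler is a Python 2 script, so the module is likely installed for a different interpreter. A quick diagnostic sketch; run it with the same command used for spider.py to see which interpreter and search path are actually in effect (installing with `python -m pip install onedrivesdk` under that same interpreter usually resolves this):

```python
import sys

# Which interpreter is running this script, and where it looks for modules.
print(sys.executable)
print(sys.version.split()[0])
site_dirs = [p for p in sys.path if 'site-packages' in p or 'dist-packages' in p]
print(site_dirs)
```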
It would be very nice to have an automatic upload to Nextcloud (/ownCloud) instances.
Disclaimer: I might take this issue and work on a PR, depending on the amount of spare time.
I have a subscription account on PacktPub, so I can read all the books online. Can I download books this way?
The parameter -c config/prod.cfg should be the default unless otherwise specified, to reduce verbosity.
:~/packtpub-crawler/script# python spider.py --config config/prod.cfg
Traceback (most recent call last):
File "spider.py", line 9, in <module>
from upload import Upload, SERVICE_DRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "/root/packtpub-crawler/script/upload.py", line 2, in <module>
from scpUpload import ScpUpload
File "/root/packtpub-crawler/script/scpUpload.py", line 5, in <module>
import paramiko
ImportError: No module named paramiko
I don't think adding a cron feature to the script is a good idea.
However, a guide in README.md for adding a crontab line to run it every day would be better, so the script doesn't become heavyweight.
What I have used in crontab is:
0 9 * * * python /home/<username>/packtpub_crawler/script/spider.py --config /home/<username>/packtpub_crawler/config/prod.cfg --all --extras --upload drive
Looks like the login page has changed. My pull request added the form id and related fields, but it is still not working. Do you have any idea?
I use an Ubuntu 16.04 server with no GUI; how can I do this?
Only the first time will you be prompted to log in, in a browser that has JavaScript enabled.
I don't have any browser on my server.
Use this library
Hi!
I'm trying to use this crawler, but I'm getting only errors.
I use Python 2 and 3 (via pylauncher), and even after installing the requirements file:
py -2 -m pip install -r requirements.txt
I get this error when running the script (py -2 script/spider.py --config config/prod.cfg -t pdf --extras):
https://gist.github.com/vpontin/cfa2e42556624a3a5b9252351bf02f72
Using Win10 Creators update x64
Hi, I want to download video courses from PacktPub. Can you please add this option as well? Thanks.
I just cloned the repo, corrected prod.cfg with my data, and tried to run it... it fails as in the lines below.
[Tue 31/May 10:52] alexgv@PROC-PE0ZW ~/workspace/packtpub-crawler $ python script/spider.py --config config/prod.cfg
(garbled packtpub-crawler ASCII-art banner)
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[+] configuration file: config/prod.cfg
[-] <class 'requests.exceptions.ConnectionError'> HTTPSConnectionPool(host='www.packtpub.com', port=443): Max retries exceeded with url: /packt/offers/free-learning (Caused by <class 'httplib.BadStatusLine'>: '') | spider.py@41
Traceback (most recent call last):
File "script/spider.py", line 41, in main
packpub.run()
File "/home/alexgv/workspace/packtpub-crawler/script/packtpub.py", line 110, in run
self.__GET_login()
File "/home/alexgv/workspace/packtpub-crawler/script/packtpub.py", line 46, in __GET_login
response = self.__session.get(url, headers=self.__headers)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 467, in get
return self.request('GET', url, *_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
resp = self.send(prep, *_send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
raise ConnectionError(e)
ConnectionError: HTTPSConnectionPool(host='www.packtpub.com', port=443): Max retries exceeded with url: /packt/offers/free-learning (Caused by <class 'httplib.BadStatusLine'>: '')
[-] something weird occurred, exiting...
[Tue 31/May 10:52] alexgv@PROC-PE0ZW ~/workspace/packtpub-crawler $
The script has been running from cron for 17 days now. The Unity books have large zip files and they are filling up my drive. Looks like we do need an alternative upload method; SCP would be great, I think? Or should we add a feature to run a script every time a download finishes?
I just saw that #27 changed the behavior of the path.extras value.
Before, it was used on its own, just like the download path:
directory = self.__config.get('path', 'path.extras')
now it's appended
if self.__config.has_option('path', 'path.group'):
    folder_name = self.info['title'].encode('ascii', 'ignore').replace(' ', '_') + \
        self.info['author'].encode('ascii', 'ignore').replace(' ', '_')
    directory = base_path + join(self.__config.get('path', 'path.ebooks'), folder_name, self.__config.get('path', 'path.extras'))
else:
    directory = base_path + self.__config.get('path', 'path.extras')
@lszeremeta could you have a look? Since the extras can be quite large, it should be possible to move them somewhere else.
Can you also add your changes to the example prod file and the README? It took me quite some time to figure out what was happening.
Otherwise, if we decide which behavior we want to keep, I can refactor it while working on #56
Not sure why, but I see that some of the pictures are missing. Maybe it's an upload issue with the Google Drive API, or a download issue.
Should we add some logging?
self.__parseDailyBookInfo(soup)
File "/home/developer/packtpub-crawler/script/packtpub.py", line 93, in __parseDailyBookInfo
self.info['url_claim'] = self.__url_base + div_target.select('a.twelve-days-claim')[0]['href']
It makes sense to use the Heroku Scheduler add-on, as it requires far fewer dyno hours since it doesn't run continuously like the clock process. The README is not clear on how to use the add-on, however. Some questions:
Sometimes there is no content-length header (I don't know Python well).
It would be nice if the download completed even without a progress bar.
Traceback (most recent call last):
File "script/spider.py", line 44, in main
packpub.download_ebooks(types)
File "/Users/Shared/bin/packtpub-crawler/script/packtpub.py", line 129, in download_ebooks
download_file(self.__session, download['url'], directory, download['filename'], self.__headers))
File "/Users/Shared/bin/packtpub-crawler/script/utils.py", line 63, in download_file
total_length = int(response.headers.get('content-length'))
TypeError: int() argument must be a string or a number, not 'NoneType'
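A hedged sketch of the suggested guard in download_file: treat a missing content-length header as unknown length and skip the progress bar (the helper name is hypothetical):

```python
def content_length_or_none(headers):
    # 'content-length' is absent on e.g. chunked responses; int(None)
    # is exactly what raised the TypeError above.
    value = headers.get('content-length')
    return int(value) if value is not None else None
```

When this returns None, the caller can still write chunks to disk, just without rendering a progress bar.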
I would like to schedule the crawler every day on heroku
An improvement would be to add an argument to download and backup all subscribed ebooks at once
Hi, the crawler runs on my headless Raspberry Pi via SSH, so there is no X session or anything. How am I supposed to "open a browser" and "enter a verification code" in the process? Any hints?
From packtpub.py and spider.py
Else you get:
Traceback (most recent call last):
File "script/spider.py", line 8, in
from packtpub import Packtpub
File "/home/pclarity/packt-crawler2/script/packtpub.py", line 6, in
from noBookException import NoBookException
ImportError: No module named noBookException
As I mentioned in #47 (comment), the newsletter parser gets the book title from the URL behind the cover image.
packtpub-crawler/script/packtpub.py
Line 101 in e604cc1
<a href="/networking-and-servers/mastering-aws-development">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering AWS Development.jpg" class="bookimage" />
</a>
but will yield some unexpected results when this href points to, for example, a cover image - like here: https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>
The latter will result in
packtpub-crawler/script/packtpub.py
Line 102 in e604cc1
An alternative would be to use the string inside the h1 tag of the title-bar-title div, like here: mkarpiarz@c583d37.
But this doesn't always seem reliable either, e.g.:
<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>
Since this week's newsletter, it has started throwing the following exception. My guess is that the structure of the landing page has changed somewhat.
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
File "script/spider.py", line 125, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/home/mira/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/home/mira/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range
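A hedged sketch combining both ideas from this thread: use the cover-anchor URL when it looks like a book page, and fall back to the title-bar heading otherwise (the function name is hypothetical; the selectors are the ones quoted above):

```python
def parse_newsletter_title(div_target, soup):
    links = div_target.select('div.promo-landing-book-picture a')
    if links:
        href = links[0]['href']
        # Skip hrefs that point straight at a cover image, as on the
        # Angular 2 landing page above.
        if not href.lower().endswith(('.jpg', '.jpeg', '.png')):
            return href.rstrip('/').split('/')[-1].replace('-', ' ').title()
    # Fall back to the <h1> inside the title-bar-title div.
    headings = soup.select('#title-bar-title h1')
    return headings[0].get_text() if headings else None
```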
Updated to the latest version.
I have the following in a crontab and have been using it for a few weeks, working correctly. I noticed that I wasn't getting any new books, so I ran the command manually:
python /home/david/packtpub-crawler/script/spider.py --config /home/david/packtpub-crawler/config/prod.cfg -t pdf --extras
But I'm getting the following error.
[*] 2017-01-31 08:34 - fetching today's eBooks
[-] <type 'exceptions.IOError'> file not found! | spider.py@89
Traceback (most recent call last):
File "/home/david/packtpub-crawler/script/spider.py", line 89, in main
config = config_file(dir_path + args.config)
File "/home/david/packtpub-crawler/script/utils.py", line 24, in config_file
raise IOError('file not found!')
IOError: file not found!
[*] done
But if I move to the packtpub-crawler directory it works:
david@server:~/packtpub-crawler$ python script/spider.py --config config/prod.cfg -t pdf --extras
[*] 2017-01-31 08:39 - fetching today's eBooks
[*] configuration file: /home/david/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
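The difference between the two runs is the working directory: spider.py prepends its own path to --config, which breaks the absolute path passed from cron. A hedged sketch of a fix (the helper name is hypothetical):

```python
import os

def resolve_config_path(script_dir, config_arg):
    # Leave absolute --config paths untouched; only resolve relative
    # ones against the script directory, so both invocations work.
    if os.path.isabs(config_arg):
        return config_arg
    return os.path.join(script_dir, config_arg)
```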
Hi, I have a Mapt subscription account, which includes many Packt books to read online. Is there any solution to download these books?
Today packtpub has a different free offer and no ebook.
We should check if the correct heading/button exists on https://www.packtpub.com/packt/offers/free-learning to prevent confusing error messages.
I'm not sure if we should still send out error notifications or just log and skip it.
I have this setup with IFTTT to send a Slack notification.
It would be nice to include the book's thumbnail as part of the notification.
@juzim The ebook downloaded from the newsletter can't be uploaded to Drive or stored on Firebase, and it breaks the script. I would suggest at least making the newsletter feature optional, i.e. adding a --newsletter or -n parameter.
[*] getting free ebook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/javascript-high-performance
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/20590/pdf
[################################] 12594/12594 - 00:00:00
[+] new download: XXX/packtpub-crawler/ebooks/Mastering_Javascript_High_Performance.pdf
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
[path] XXX/packtpub-crawler/ebooks/Mastering_Javascript_High_Performance.pdf
[download_url] https://drive.google.com/uc?id=XXX&export=download
[name] Mastering_Javascript_High_Performance.pdf
[mime_type] application/pdf
[id] XXX
[-] skip store info: missing upload info
[-] <type 'exceptions.TypeError'> can't multiply sequence by non-int of type 'str' | spider.py@119
Traceback (most recent call last):
File "script/spider.py", line 119, in main
handleClaim(packpub, args, config, dir_path)
File "script/spider.py", line 55, in handleClaim
Notify(config, packpub.info, upload_info, args.notify).run()
File "XXX/github/packtpub-crawler/script/notify.py", line 30, in run
self.service.send()
File "XXX/packtpub-crawler/script/notification/gmail.py", line 98, in send
message = self.__prepare_message()
File "XXX/packtpub-crawler/script/notification/gmail.py", line 41, in __prepare_message
html *= "</ul>"
TypeError: can't multiply sequence by non-int of type 'str'
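The traceback points at `html *= "</ul>"` in gmail.py, which multiplies a string by a string. A minimal sketch of the likely one-character fix, concatenation instead of multiplication:

```python
# Build the notification body by string concatenation.
html = '<ul>'
html += '<li>Mastering_Javascript_High_Performance.pdf</li>'
html += '</ul>'  # was: html *= '</ul>', which raises the TypeError above
print(html)
```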
From @juzim:
"Not daily but regularly. Lots of indie books, or the $1 base tier on HumbleBundle.
Amazon also has free books (but they are horrible).
Storybundle.com often includes books by authors like Neil Gaiman. I think even O'Reilly has some free offers."
Just to keep track of your suggestions
Hi,
I cannot run the script because of an import error:
python script/spider.py -c config/prod.cfg -t pdf --notify None
Traceback (most recent call last):
File "script/spider.py", line 12, in
from notify import Notify, SERVICE_GMAIL, SERVICE_IFTTT, SERVICE_JOIN, SERVICE_PUSHOVER
File "/home/boria/repos/test/packtpub-crawler/script/notify.py", line 5, in
from notification.mypushover import Pushover
File "/home/boria/repos/test/packtpub-crawler/script/notification/mypushover.py", line 3, in
from notification.mypushover import Client
ImportError: cannot import name Client
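The failing line shows mypushover.py importing Client from itself (`from notification.mypushover import Client`); presumably the fix is to import Client from the external pushover package instead. A minimal reproduction of why a self-import raises exactly this error:

```python
import sys
import types

# A module that imports a name from itself fails with "cannot import
# name" because the name does not exist while the module is still
# being initialized. Simulate that with an empty placeholder module.
demo = types.ModuleType('mypushover_demo')
sys.modules['mypushover_demo'] = demo
try:
    exec('from mypushover_demo import Client', vars(demo))
    error = None
except ImportError as exc:
    error = str(exc)
print(error)
```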
Hello,
Firstly, thank you for writing this script! I got spider.py working to the point where it attempts to upload to my Google Drive. However, I don't believe the script ever prompted me via web browser to generate an auth_token.json. Here's the backtrace (it occurs after a successful download of the .pdf):
[-] <type 'exceptions.AttributeError'> 'module' object has no attribute 'from_file' | spider.py@54
Traceback (most recent call last):
File "script/spider.py", line 54, in main
upload.run(packpub.info['paths'])
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/upload.py", line 26, in run
self.service.upload(path)
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/drive.py", line 125, in upload
self.__guess_info(file_path)
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/drive.py", line 28, in __guess_info
'mime_type': magic.from_file(file_path, mime=True),
AttributeError: 'module' object has no attribute 'from_file'
[-] something weird occurred, exiting...
I'm a Python newbie, so any assistance is appreciated.
Thanks,
Michael
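This AttributeError usually means a PyPI package other than python-magic installed the `magic` module; several distributions share that module name with incompatible APIs, and only python-magic exposes magic.from_file. A small hedged check:

```python
import importlib

def magic_module_ok():
    # True only when the installed `magic` module has the python-magic
    # API that drive.py relies on.
    try:
        magic = importlib.import_module('magic')
    except ImportError:
        return False
    return hasattr(magic, 'from_file')
```

If this returns False, uninstalling the conflicting package and reinstalling python-magic is the usual remedy.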
Has anyone downloaded some of these books? I would really like to read some of them! Thanks in advance, and sorry for asking this way.
~ $ python script/spider.py --config config/prod.cfg --notify ifttt --claimOnly
__ __ __ __
____ ____ ______/ /__/ /_____ __ __/ /_ ______________ __ __/ /__ _____
/ __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
/ /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ / / /_/ /| |/ |/ / / __/ /
/ .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/ \___/_/ \__,_/ |__/|__/_/\___/_/
/_/ /_/
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[*] 2017-01-31 10:30 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[+] notification sent to IFTTT
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
File "script/spider.py", line 123, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/app/script/packtpub.py", line 160, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[+] error notification sent to IFTTT
[*] done
~ $
It has successfully claimed the book from the newsletter already, but on subsequent days I'm getting the above error.
And it sends an IFTTT notification for the second one :(
Hi, why do you need to get the current IP?
thanks for your work!
Excuse me, niqdev and other developers,
I have an issue like this after executing node dev/server.js.
This is my prod.cfg:
[url]
url.base=https://www.packtpub.com
url.login=/packt/offers/free-learning
# params: 0=id, 1=format
url.download=/ebook_download/{0}/{1}
#time in seconds
[delay]
delay.requests=2
[credential]
credential.email=mygoogleemail
credential.password=mygooglepassword
[path]
path.ebooks=ebooks
path.extras=ebooks/extras
[drive]
drive.oauth2_scope=https://www.googleapis.com/auth/drive
drive.client_secrets=config/client_secrets.json
drive.auth_token=config/auth_token.json
drive.gmail=mygoogleemail
[notify]
notify.host=smtp.gmail.com
notify.port=587
notify.username=mygoogleemail
notify.password=mygooglepassword
[email protected]
notify.to=mygoogleemail, mysecondgoogleemail
Also, I want to ask: where should I put my PacktPub username and password?
Thank you very much for sharing this crawler;
I'm sure it will be great for everyone.
I've placed my credentials in prod.cfg.
When I run python script/spider.py --config config/prod.cfg --claimOnly I get this error:
Traceback (most recent call last):
File "script/spider.py", line 7, in
from utils import ip_address, config_file
File "/home/csllc4/packtpub-crawler/script/utils.py", line 16
print '[-] GET {0} | {1}'.format(response.status_code, response.url)
^
SyntaxError: invalid syntax
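That SyntaxError is the Python 2 print statement being run under Python 3; the crawler targets Python 2. Either run it with python2, or make the logging line version-agnostic, e.g.:

```python
from __future__ import print_function  # no-op on Python 3, enables the function form on Python 2

def log_response(status_code, url):
    # Same log line as utils.py, written as a print function call.
    print('[-] GET {0} | {1}'.format(status_code, url))
```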
Hi,
recently PacktPub changed the daily free eBook flow a bit: you now need to pass a reCAPTCHA to claim the free eBook, which causes the daily job to fail. You have to pass the reCAPTCHA manually and claim the eBook in your account, then use download-all (-dall) to fetch it. Is it possible to improve this? Thanks!
Create a Docker image based on busybox.
I get the following error on Windows.
I ran pip install -r requirements.txt beforehand, but it seems it doesn't resolve all the requirements.
File "C:\Python27\lib\site-packages\magic.py", line 171, in
raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation
PacktPub has apparently been working on their account page, and it broke a few things (sometimes it works, so it might be an A/B test). I'm not sure I can find the time to fix everything right now.