niqdev / packtpub-crawler
Download your daily free Packt Publishing eBook https://www.packtpub.com/packt/offers/free-learning
License: MIT License
I came across this repo. I'm not sure whether it works (sadly it won't work for what I want to do, since it uses Selenium), but maybe it's worth looking into?
https://github.com/eastee/rebreakcaptcha
My thinking is to somehow use this to get around the reCAPTCHA on the PacktPub site.
Hello,
I'm having the issue below and hoping you could give me a little help.
Cheers
Marcus
[*] 2017-06-20 15:05 - fetching today's eBooks
[*] configuration file: /opt/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@97
Traceback (most recent call last):
File "/opt/packtpub-crawler/script/spider.py", line 97, in main
packtpub.runDaily()
File "/opt/packtpub-crawler/script/packtpub.py", line 161, in runDaily
self.__parseDailyBookInfo(soup)
File "/opt/packtpub-crawler/script/packtpub.py", line 93, in __parseDailyBookInfo
self.info['url_claim'] = self.__url_base + div_target.select('a.twelve-days-claim')[0]['href']
IndexError: list index out of range
[*] no free eBook from newsletter right now
[*] done
pip install -r requirements.txt succeeded fine, using Ubuntu 16.04. :-)
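The crash above comes from indexing the result of select() without checking it. A minimal defensive sketch (the function name is hypothetical; div_target stands in for the parsed node in packtpub.py):

```python
def parse_claim_url(div_target, url_base):
    # select() returns a list; it is empty when the page layout changes
    # or when there is no daily claim link, which is what raised the
    # IndexError in the traceback above.
    links = div_target.select('a.twelve-days-claim')
    if not links:
        return None  # caller can log "no free eBook" instead of crashing
    return url_base + links[0]['href']
```

Returning None would let runDaily() print a friendly message instead of raising.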
Hello, I love your script and I would like to ask for an extra feature!
Putting files into a folder on Google Drive would be much better and tidier.
Here is the reference of the Python code:
https://developers.google.com/drive/v3/web/folder#creating_a_folder
If you have no time to do it and don't mind me making a pull request, I'm happy to help.
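For reference, a minimal sketch of what the linked Drive v3 docs describe, assuming `service` is an already-authorized googleapiclient Drive service (the function name is hypothetical):

```python
def create_drive_folder(service, name):
    # Folders in Drive are just files with a special MIME type.
    metadata = {
        'name': name,
        'mimeType': 'application/vnd.google-apps.folder',
    }
    return service.files().create(body=metadata, fields='id').execute()['id']
```

Each uploaded ebook would then include 'parents': [folder_id] in its own file metadata to land inside the folder.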
Hi, can you please add support for downloading PacktPub video courses that are only available to watch online? Thanks.
I have cloned the repository and installed all of the dependencies, but I am getting several errors upon running the following in command line:
>>C:\Python27\python.exe C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\spider.py --config C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\config\prod_example.cfg
The Errors I'm getting are:
Traceback (most recent call last):
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\spider.py", line 9, in <module>
from upload import Upload, SERVICE_GOOGLE_DRIVE, SERVICE_ONEDRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\upload.py", line 1, in <module>
from googledrive import GoogleDrive
File "C:\Users\****\*******\my_ebooks\PACKTPUB_CRAWLER\packtpub-crawler\script\googledrive.py", line 6, in <module>
import magic
File "build\bdist.win-amd64\egg\magic.py", line 176, in <module>
ImportError: failed to find libmagic. Check your installation
I'm using the prod_example.cfg file as my config file, and I have changed the email and password values in it.
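If installing libmagic on Windows is a hassle, one hedged workaround is to guess the MIME type from the filename with the stdlib instead of the file contents (a sketch, not the project's actual code):

```python
import mimetypes

def guess_mime_type(file_path):
    # Extension-based guess; less accurate than libmagic, but has no
    # native dependency, which is what fails to load on Windows here.
    mime, _encoding = mimetypes.guess_type(file_path)
    return mime or 'application/octet-stream'
```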
I'm getting ImportError: No module named onedrivesdk even though I have it installed.
Traceback (most recent call last):
File "script/spider.py", line 9, in <module>
from upload import Upload, SERVICE_GOOGLE_DRIVE, SERVICE_ONEDRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "/home/abc/packtpub-crawler/script/upload.py", line 2, in <module>
from onedrive import OneDrive
File "/home/abc/packtpub-crawler/script/onedrive.py", line 4, in <module>
import onedrivesdk
ImportError: No module named onedrivesdk
abc@xyz:~/packtpub-crawler$ sudo pip install onedrivesdk
Requirement already satisfied: onedrivesdk in /usr/local/lib/python3.5/site-packages
Requirement already satisfied: requests>=2.6.1 in /usr/local/lib/python3.5/site-packages (from onedrivesdk)
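The pip output above points at a python3.5 site-packages directory, while the crawler is a Python 2 script, so the module is likely installed for a different interpreter. A quick diagnostic sketch; run it with the same command used for spider.py to see which interpreter and search path are actually in effect (installing with `python -m pip install onedrivesdk` under that same interpreter usually resolves this):

```python
import sys

# Which interpreter is running this script, and where it looks for modules.
print(sys.executable)
print(sys.version.split()[0])
site_dirs = [p for p in sys.path if 'site-packages' in p or 'dist-packages' in p]
print(site_dirs)
```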
It would be very nice to have an automatic upload to Nextcloud (/ownCloud) instances.
Disclaimer: I might take this issue and work on a PR, depending on the amount of spare time.
I have a subscription account on PacktPub, so I can read all the books online. Can I download books this way?
The parameter -c config/prod.cfg should be the default unless otherwise specified, to reduce verbosity.
:~/packtpub-crawler/script# python spider.py --config config/prod.cfg
Traceback (most recent call last):
File "spider.py", line 9, in <module>
from upload import Upload, SERVICE_DRIVE, SERVICE_DROPBOX, SERVICE_SCP
File "/root/packtpub-crawler/script/upload.py", line 2, in <module>
from scpUpload import ScpUpload
File "/root/packtpub-crawler/script/scpUpload.py", line 5, in <module>
import paramiko
ImportError: No module named paramiko
I don't think adding a cron feature to the script is a good idea.
However, a guide in README.md for adding a crontab line to run it every day would be better, so the script doesn't become heavyweight.
What I have used in crontab is:
0 9 * * * python /home/<username>/packtpub_crawler/script/spider.py --config /home/<username>/packtpub_crawler/config/prod.cfg --all --extras --upload drive
Looks like the login page has changed. My pull request added the form id and related fields, but it is still not working. Do you have any idea?
I use an Ubuntu 16.04 server with no GUI; how can I do this?
Only the first time will you be prompted to log in, in a browser that has JavaScript enabled.
I don't have any browser on my server.
Use this library
Hi!
I'm trying to use this crawler, but I'm getting only errors.
I use Python 2 and 3 (via pylauncher), and even after installing the requirements file:
py -2 -m pip install -r requirements.txt
I get this error when running the script (py -2 script/spider.py --config config/prod.cfg -t pdf --extras):
https://gist.github.com/vpontin/cfa2e42556624a3a5b9252351bf02f72
Using Win10 Creators update x64
Hi, I want to download video courses from PacktPub. Can you please add this option as well? Thanks.
I just cloned the repo, corrected prod.cfg with my data, and tried to run it... it fails as in the lines below.
[Tue 31/May 10:52] alexgv@PROC-PE0ZW ~/workspace/packtpub-crawler $ python script/spider.py --config config/prod.cfg
(garbled packtpub-crawler ASCII-art banner)
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[+] configuration file: config/prod.cfg
[-] <class 'requests.exceptions.ConnectionError'> HTTPSConnectionPool(host='www.packtpub.com', port=443): Max retries exceeded with url: /packt/offers/free-learning (Caused by <class 'httplib.BadStatusLine'>: '') | spider.py@41
Traceback (most recent call last):
File "script/spider.py", line 41, in main
packpub.run()
File "/home/alexgv/workspace/packtpub-crawler/script/packtpub.py", line 110, in run
self.__GET_login()
File "/home/alexgv/workspace/packtpub-crawler/script/packtpub.py", line 46, in __GET_login
response = self.__session.get(url, headers=self.__headers)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 467, in get
return self.request('GET', url, *_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
resp = self.send(prep, *_send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
raise ConnectionError(e)
ConnectionError: HTTPSConnectionPool(host='www.packtpub.com', port=443): Max retries exceeded with url: /packt/offers/free-learning (Caused by <class 'httplib.BadStatusLine'>: '')
[-] something weird occurred, exiting...
[Tue 31/May 10:52] alexgv@PROC-PE0ZW ~/workspace/packtpub-crawler $
The script has been running from cron for 17 days now. The Unity books have large zip files and they are filling up my drive. Looks like we do need an alternative upload method; SCP would be great, I think? Or should we add a feature to run a script every time a download finishes?
I just saw that #27 changed the behavior of the path.extras value.
Before, it was used on its own, just like the download path:
directory = self.__config.get('path', 'path.extras')
now it's appended
if self.__config.has_option('path', 'path.group'):
    folder_name = self.info['title'].encode('ascii', 'ignore').replace(' ', '_') + \
        self.info['author'].encode('ascii', 'ignore').replace(' ', '_')
    directory = base_path + join(self.__config.get('path', 'path.ebooks'), folder_name, self.__config.get('path', 'path.extras'))
else:
    directory = base_path + self.__config.get('path', 'path.extras')
@lszeremeta could you have a look? Since the extras can be quite large, it should be possible to move them somewhere else.
Can you also add your changes to the example prod file and the README? It took me quite some time to figure out what was happening.
Otherwise, if we decide which behavior we want to keep, I can refactor it while working on #56
Not sure why, but I see that some of the pictures are missing. Maybe it's an upload issue with the Google Drive API, or a download issue.
Should we add some logging?
self.__parseDailyBookInfo(soup)
File "/home/developer/packtpub-crawler/script/packtpub.py", line 93, in __parseDailyBookInfo
self.info['url_claim'] = self.__url_base + div_target.select('a.twelve-days-claim')[0]['href']
It makes sense to use the Heroku Scheduler add-on, as it requires far fewer dyno hours since it doesn't run continuously like the clock process. The README is not clear on how to use the add-on, however. Some questions:
Sometimes there is no content-length header (I don't know Python well).
It would be nice if the download completed even without a progress bar.
Traceback (most recent call last):
File "script/spider.py", line 44, in main
packpub.download_ebooks(types)
File "/Users/Shared/bin/packtpub-crawler/script/packtpub.py", line 129, in download_ebooks
download_file(self.__session, download['url'], directory, download['filename'], self.__headers))
File "/Users/Shared/bin/packtpub-crawler/script/utils.py", line 63, in download_file
total_length = int(response.headers.get('content-length'))
TypeError: int() argument must be a string or a number, not 'NoneType'
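A hedged sketch of the suggested guard in download_file: treat a missing content-length header as unknown length and skip the progress bar (the helper name is hypothetical):

```python
def content_length_or_none(headers):
    # 'content-length' is absent on e.g. chunked responses; int(None)
    # is exactly what raised the TypeError above.
    value = headers.get('content-length')
    return int(value) if value is not None else None
```

When this returns None, the caller can still write chunks to disk, just without rendering a progress bar.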
I would like to schedule the crawler every day on heroku
An improvement would be to add an argument to download and backup all subscribed ebooks at once
Hi, the crawler runs on my headless Raspberry Pi via SSH, so there is no X session or anything. How am I supposed to "open a browser" and "enter a verification code" in the process? Any hints?
From packtpub.py and spider.py
Else you get:
Traceback (most recent call last):
File "script/spider.py", line 8, in
from packtpub import Packtpub
File "/home/pclarity/packt-crawler2/script/packtpub.py", line 6, in
from noBookException import NoBookException
ImportError: No module named noBookException
As I mentioned in #47 (comment), the newsletter parser gets the book title from the URL behind the cover image.
packtpub-crawler/script/packtpub.py
Line 101 in e604cc1
<a href="/networking-and-servers/mastering-aws-development">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering AWS Development.jpg" class="bookimage" />
</a>
but will yield some unexpected results when this href points to, for example, a cover image - like here: https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>
The latter will result in
packtpub-crawler/script/packtpub.py
Line 102 in e604cc1
An alternative would be to use the string inside the h1 tag of the title-bar-title div, like here: mkarpiarz@c583d37.
But this doesn't always seem reliable either, e.g.:
<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>
Since this week's newsletter, it has started throwing the following exception. My guess is that the structure of the landing page has changed somewhat.
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
File "script/spider.py", line 125, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/home/mira/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/home/mira/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range
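A hedged sketch combining both ideas from this thread: use the cover-anchor URL when it looks like a book page, and fall back to the title-bar heading otherwise (the function name is hypothetical; the selectors are the ones quoted above):

```python
def parse_newsletter_title(div_target, soup):
    links = div_target.select('div.promo-landing-book-picture a')
    if links:
        href = links[0]['href']
        # Skip hrefs that point straight at a cover image, as on the
        # Angular 2 landing page above.
        if not href.lower().endswith(('.jpg', '.jpeg', '.png')):
            return href.rstrip('/').split('/')[-1].replace('-', ' ').title()
    # Fall back to the <h1> inside the title-bar-title div.
    headings = soup.select('#title-bar-title h1')
    return headings[0].get_text() if headings else None
```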
Updated to the latest version.
I have the following in a crontab and have been using it for a few weeks, working correctly. I noticed that I wasn't getting any new books, so I ran the command manually:
python /home/david/packtpub-crawler/script/spider.py --config /home/david/packtpub-crawler/config/prod.cfg -t pdf --extras
But I'm getting the following error.
[*] 2017-01-31 08:34 - fetching today's eBooks
[-] <type 'exceptions.IOError'> file not found! | spider.py@89
Traceback (most recent call last):
File "/home/david/packtpub-crawler/script/spider.py", line 89, in main
config = config_file(dir_path + args.config)
File "/home/david/packtpub-crawler/script/utils.py", line 24, in config_file
raise IOError('file not found!')
IOError: file not found!
[*] done
But if I move to the packtpub-crawler directory it works:
david@server:~/packtpub-crawler$ python script/spider.py --config config/prod.cfg -t pdf --extras
[*] 2017-01-31 08:39 - fetching today's eBooks
[*] configuration file: /home/david/packtpub-crawler/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
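The difference between the two runs is the working directory: spider.py prepends its own path to --config, which breaks the absolute path passed from cron. A hedged sketch of a fix (the helper name is hypothetical):

```python
import os

def resolve_config_path(script_dir, config_arg):
    # Leave absolute --config paths untouched; only resolve relative
    # ones against the script directory, so both invocations work.
    if os.path.isabs(config_arg):
        return config_arg
    return os.path.join(script_dir, config_arg)
```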
Hi, I have a Mapt subscription account, which includes many Packt books to read online. Is there any solution to download these books?
Today packtpub has a different free offer and no ebook.
We should check if the correct heading/button exists on https://www.packtpub.com/packt/offers/free-learning to prevent confusing error messages.
I'm not sure if we should still send out error notifications or just log and skip it.
I have this setup with IFTTT to send a Slack notification.
It would be nice to include the book's thumbnail as part of the notification.
@juzim The ebook downloaded from the newsletter can't be uploaded to Drive or stored on Firebase, and it breaks the script. I would suggest at least making the newsletter feature optional, i.e. adding a --newsletter or -n parameter.
[*] getting free ebook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/javascript-high-performance
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[-] downloading file from url: https://www.packtpub.com/ebook_download/20590/pdf
[################################] 12594/12594 - 00:00:00
[+] new download: XXX/packtpub-crawler/ebooks/Mastering_Javascript_High_Performance.pdf
[+] new file upload on Drive:
[+] uploading file...
[+] updating file permissions...
[path] XXX/packtpub-crawler/ebooks/Mastering_Javascript_High_Performance.pdf
[download_url] https://drive.google.com/uc?id=XXX&export=download
[name] Mastering_Javascript_High_Performance.pdf
[mime_type] application/pdf
[id] XXX
[-] skip store info: missing upload info
[-] <type 'exceptions.TypeError'> can't multiply sequence by non-int of type 'str' | spider.py@119
Traceback (most recent call last):
File "script/spider.py", line 119, in main
handleClaim(packpub, args, config, dir_path)
File "script/spider.py", line 55, in handleClaim
Notify(config, packpub.info, upload_info, args.notify).run()
File "XXX/github/packtpub-crawler/script/notify.py", line 30, in run
self.service.send()
File "XXX/packtpub-crawler/script/notification/gmail.py", line 98, in send
message = self.__prepare_message()
File "XXX/packtpub-crawler/script/notification/gmail.py", line 41, in __prepare_message
html *= "</ul>"
TypeError: can't multiply sequence by non-int of type 'str'
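The traceback points at `html *= "</ul>"` in gmail.py, which multiplies a string by a string. A minimal sketch of the likely one-character fix, concatenation instead of multiplication:

```python
# Build the notification body by string concatenation.
html = '<ul>'
html += '<li>Mastering_Javascript_High_Performance.pdf</li>'
html += '</ul>'  # was: html *= '</ul>', which raises the TypeError above
print(html)
```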
From @juzim:
"Not daily but regularly. Lots of indie books, or the $1 base tier on HumbleBundle.
Amazon also has free books (but they are horrible).
Storybundle.com often includes books by authors like Neil Gaiman. I think even O'Reilly has some free offers."
Just to keep track of your suggestions
Hi,
I cannot run the script because of an import error:
python script/spider.py -c config/prod.cfg -t pdf --notify None
Traceback (most recent call last):
File "script/spider.py", line 12, in
from notify import Notify, SERVICE_GMAIL, SERVICE_IFTTT, SERVICE_JOIN, SERVICE_PUSHOVER
File "/home/boria/repos/test/packtpub-crawler/script/notify.py", line 5, in
from notification.mypushover import Pushover
File "/home/boria/repos/test/packtpub-crawler/script/notification/mypushover.py", line 3, in
from notification.mypushover import Client
ImportError: cannot import name Client
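The failing line shows mypushover.py importing Client from itself (`from notification.mypushover import Client`); presumably the fix is to import Client from the external pushover package instead. A minimal reproduction of why a self-import raises exactly this error:

```python
import sys
import types

# A module that imports a name from itself fails with "cannot import
# name" because the name does not exist while the module is still
# being initialized. Simulate that with an empty placeholder module.
demo = types.ModuleType('mypushover_demo')
sys.modules['mypushover_demo'] = demo
try:
    exec('from mypushover_demo import Client', vars(demo))
    error = None
except ImportError as exc:
    error = str(exc)
print(error)
```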
Hello,
Firstly, thank you for writing this script! I got spider.py working to the point where it attempts to upload to my Google Drive. However, I don't believe the script ever prompted me via web browser to generate an auth_token.json. Here's the backtrace (it occurs after a successful download of the .pdf):
[-] <type 'exceptions.AttributeError'> 'module' object has no attribute 'from_file' | spider.py@54
Traceback (most recent call last):
File "script/spider.py", line 54, in main
upload.run(packpub.info['paths'])
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/upload.py", line 26, in run
self.service.upload(path)
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/drive.py", line 125, in upload
self.__guess_info(file_path)
File "/Users/mmcconnell/_HerokuProjects/packtpub-crawler/script/drive.py", line 28, in __guess_info
'mime_type': magic.from_file(file_path, mime=True),
AttributeError: 'module' object has no attribute 'from_file'
[-] something weird occurred, exiting...
I'm a Python newbie, so any assistance is appreciated.
Thanks,
Michael
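This AttributeError usually means a PyPI package other than python-magic installed the `magic` module; several distributions share that module name with incompatible APIs, and only python-magic exposes magic.from_file. A small hedged check:

```python
import importlib

def magic_module_ok():
    # True only when the installed `magic` module has the python-magic
    # API that drive.py relies on.
    try:
        magic = importlib.import_module('magic')
    except ImportError:
        return False
    return hasattr(magic, 'from_file')
```

If this returns False, uninstalling the conflicting package and reinstalling python-magic is the usual remedy.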
Has anyone downloaded some of these books? I would really like to read some of them! Thanks in advance, and sorry for asking this way.
~ $ python script/spider.py --config config/prod.cfg --notify ifttt --claimOnly
__ __ __ __
____ ____ ______/ /__/ /_____ __ __/ /_ ______________ __ __/ /__ _____
/ __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
/ /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ / / /_/ /| |/ |/ / / __/ /
/ .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/ \___/_/ \__,_/ |__/|__/_/\___/_/
/_/ /_/
Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler
[*] 2017-01-31 10:30 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[+] notification sent to IFTTT
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
File "script/spider.py", line 123, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "/app/script/packtpub.py", line 160, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[+] error notification sent to IFTTT
[*] done
~ $
It has successfully claimed the book from the newsletter already, but on subsequent days I'm getting the above error.
And it sends an IFTTT notification for the second one :(
Hi, why do you need to get the current IP?
thanks for your work!
Excuse me, niqdev and other developers,
I have an issue like this after executing node dev/server.js.
This is my prod.cfg:
[url]
url.base=https://www.packtpub.com
url.login=/packt/offers/free-learning
# params: 0=id, 1=format
url.download=/ebook_download/{0}/{1}
#time in seconds
[delay]
delay.requests=2
[credential]
credential.email=mygoogleemail
credential.password=mygooglepassword
[path]
path.ebooks=ebooks
path.extras=ebooks/extras
[drive]
drive.oauth2_scope=https://www.googleapis.com/auth/drive
drive.client_secrets=config/client_secrets.json
drive.auth_token=config/auth_token.json
drive.gmail=mygoogleemail
[notify]
notify.host=smtp.gmail.com
notify.port=587
notify.username=mygoogleemail
notify.password=mygooglepassword
[email protected]
notify.to=mygoogleemail, mysecondgoogleemail
Also, I want to ask: where should I put my PacktPub username and password?
Thank you very much for sharing this crawler;
I'm sure it will be great for everyone.
I've placed my credentials in prod.cfg.
When I run python script/spider.py --config config/prod.cfg --claimOnly I get this error:
Traceback (most recent call last):
File "script/spider.py", line 7, in
from utils import ip_address, config_file
File "/home/csllc4/packtpub-crawler/script/utils.py", line 16
print '[-] GET {0} | {1}'.format(response.status_code, response.url)
^
SyntaxError: invalid syntax
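That SyntaxError is the Python 2 print statement being run under Python 3; the crawler targets Python 2. Either run it with python2, or make the logging line version-agnostic, e.g.:

```python
from __future__ import print_function  # no-op on Python 3, enables the function form on Python 2

def log_response(status_code, url):
    # Same log line as utils.py, written as a print function call.
    print('[-] GET {0} | {1}'.format(status_code, url))
```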
Hi,
recently PacktPub changed the daily free eBook flow a bit: you now need to pass a reCAPTCHA to claim the free eBook, which causes the daily job to fail. You have to pass the reCAPTCHA manually and claim the eBook in your account, then use download-all (-dall) to fetch it. Is it possible to improve this? Thanks!
Create a Docker image based on busybox.
I get the following error on Windows.
I ran pip install -r requirements.txt beforehand, but it seems it doesn't resolve all the requirements.
File "C:\Python27\lib\site-packages\magic.py", line 171, in
raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation
PacktPub has apparently been working on their account page, and it broke a few things (sometimes it works, so it might be an A/B test). I'm not sure I can find the time to fix everything right now.