junekihong / linkedinscraper

Scrapes public information off of LinkedIn

Python 100.00%

linkedinscraper's Issues

Issue on scrapy and six.

Dear Juneki Hong,

Thanks for making this program. I am having a problem running it on my standalone computer. My system is Mac OS X 10.11.5 with Anaconda 2.3.0 and Python 2.7.11. I installed Scrapy using pip.

However, when I run the program with the command "scrapy crawl linkedin.com", the following error occurs:

Traceback (most recent call last):
  File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/cmdline.py", line 108, in execute
    settings = get_project_settings()
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/utils/project.py", line 60, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 285, in setmodule
    self.set(key, getattr(module, key), priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 260, in set
    self.attributes[name].set(value, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 55, in set
    value = BaseSettings(value, priority=priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 91, in __init__
    self.update(values, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 317, in update
    for name, value in six.iteritems(values):
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'list' object has no attribute 'iteritems'

I know this issue may stem from Scrapy or six, but it would help me to know the system environment in which the code runs without any problem.
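One plausible cause, offered as an assumption because the project's settings.py is not shown in this issue: the traceback fails while Scrapy wraps a setting value in BaseSettings, which happens for settings whose default is a dict. Newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping each pipeline path to an integer order, so a list-valued ITEM_PIPELINES would reach six.iteritems and fail in this way. The sketch below uses a hypothetical helper, pipelines_as_dict, to convert the list form to the dict form; the order values 300 and 100 are arbitrary choices, and the pipeline path is the one that appears in the log of a later issue on this page.

```python
def pipelines_as_dict(pipelines, first_order=300, step=100):
    """Hypothetical helper: turn a list of pipeline paths into {path: order}.

    A value that is already a dict is copied and returned unchanged.
    """
    if isinstance(pipelines, dict):
        return dict(pipelines)
    return {path: first_order + i * step for i, path in enumerate(pipelines)}


# Assumed old list-style value in settings.py (not confirmed by the issue).
ITEM_PIPELINES = ['linkedIn.pipelines.LinkedinPipeline']

# Dict form expected by newer Scrapy versions.
ITEM_PIPELINES = pipelines_as_dict(ITEM_PIPELINES)
print(ITEM_PIPELINES)
```

In practice one would simply write the dict literal in settings.py; the helper only makes the shape of the change explicit.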

Response of the webpage

When you scrape using this method, the response contains JavaScript that loads the content dynamically (provided an internet connection is available), so this scraper basically does not work any more.

scrapy.spidermiddlewares.httperror INFO: Ignoring response 999

Hi,

I tried the Scrapy code and got the following response from the server:

c:\python27\lib\site-packages\scrapy\settings\deprecated.py:27: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:1: ScrapyDeprecationWarning: Module scrapy.spider is deprecated, use scrapy.spiders instead
  from scrapy.spider import BaseSpider
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:20: ScrapyDeprecationWarning: linkedIn.spiders.linkedIn_spider.linkedInSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class linkedInSpider(BaseSpider):
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: linkedIn)
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'linkedIn.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['linkedIn.spiders'], 'BOT_NAME': 'linkedIn', 'DEFAULT_ITEM_CLASS': 'linkedIn.items.LinkedinItem', 'FEED_FORMAT': 'json'}
2018-03-15 16:34:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled item pipelines:
['linkedIn.pipelines.LinkedinPipeline']
2018-03-15 16:34:44 [scrapy.core.engine] INFO: Spider opened
2018-03-15 16:34:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 16:34:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-a> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-c> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-d> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-f> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-e> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-a>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-h> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-c>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-d>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-f>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-e>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-h>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-b> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-i> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-k> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-b>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-j> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-l> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-n> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-i>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-m> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-k>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-j>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-o> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-l>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-g> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-n>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-m>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-p> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-q> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-o>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-g>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-s> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-r> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-t> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-p>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-u> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-q>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-w> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-v> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-s>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-r>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-t>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-u>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-y> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-w>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-x> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-v>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-z> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-y>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-x>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-z>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 16:34:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8770,
 'downloader/request_count': 26,
 'downloader/request_method_count/GET': 26,
 'downloader/response_bytes': 53336,
 'downloader/response_count': 26,
 'downloader/response_status_count/999': 26,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 11, 4, 46, 403000),
 'httperror/response_ignored_count': 26,
 'httperror/response_ignored_status_count/999': 26,
 'log_count/DEBUG': 27,
 'log_count/INFO': 33,
 'response_received_count': 26,
 'scheduler/dequeued': 26,
 'scheduler/dequeued/memory': 26,
 'scheduler/enqueued': 26,
 'scheduler/enqueued/memory': 26,
 'start_time': datetime.datetime(2018, 3, 15, 11, 4, 44, 414000)}
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Spider closed (finished)

I am getting the message "scrapy.spidermiddlewares.httperror INFO: Ignoring response 999". Could you please explain how to handle this status code from the server?
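The "Ignoring response" lines come from Scrapy's HttpErrorMiddleware, which by default drops any response whose status is outside the 2xx range before it reaches the spider. A minimal sketch of letting the 999 responses through is shown below as a settings.py fragment. HTTPERROR_ALLOWED_CODES is a Scrapy setting (a per-spider handle_httpstatus_list attribute serves the same purpose); the helper is_denied is hypothetical. Note that 999 is not a standard HTTP status and seems to be what the server returns when it rejects the request, so allowing it only lets the callback see the rejection; it would not make profile data appear.

```python
# settings.py fragment (a sketch, not the project's actual file).
# Responses with these status codes are passed to spider callbacks
# instead of being dropped by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [999]


def is_denied(status):
    """Hypothetical helper: True when the server answered with status 999."""
    return status == 999


# Inside a spider callback one could then write, for example:
#     if is_denied(response.status):
#         self.logger.warning("Request denied: %s", response.url)
#         return
```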

Thanks

scrapy crawl linkedin.com > items.txt

When I use this command, I get an IOError: Permission denied: 'items.txt'.
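A likely explanation, stated as an assumption because the issue gives no further detail: the current directory is not writable, or items.txt already exists and is read-only or open in another program. The sketch below uses a hypothetical helper, writable_output_path, to pick an output location that the current user can write to, falling back to the system temporary directory.

```python
import os
import tempfile


def writable_output_path(filename, directory=None):
    """Hypothetical helper: return a path for filename in a writable directory.

    Uses the given directory (default: current working directory) when it is
    writable, otherwise the system temporary directory.
    """
    directory = directory or os.getcwd()
    if os.access(directory, os.W_OK):
        return os.path.join(directory, filename)
    return os.path.join(tempfile.gettempdir(), filename)


print(writable_output_path('items.json'))
```

The resulting path can be passed to Scrapy's -o option, for example "scrapy crawl linkedin.com -o <path>", instead of redirecting standard output to a file.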
