junekihong / linkedinscraper
Scrapes public information off of LinkedIn
Dear Juneki Hong,
Thanks for making this program. I am having a problem running it on my own computer. My system is Mac OS X 10.11.5 with Anaconda 2.3.0 and Python 2.7.11, and I installed Scrapy using pip.
However, when I run the program with the command "scrapy crawl linkedin.com", the following error occurs:
Traceback (most recent call last):
  File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/cmdline.py", line 108, in execute
    settings = get_project_settings()
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/utils/project.py", line 60, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 285, in setmodule
    self.set(key, getattr(module, key), priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 260, in set
    self.attributes[name].set(value, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 55, in set
    value = BaseSettings(value, priority=priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 91, in __init__
    self.update(values, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 317, in update
    for name, value in six.iteritems(values):
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'list' object has no attribute 'iteritems'
I know this issue may stem from Scrapy or six, but it would help me to know the system environment in which the code runs without any problems.
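The traceback above ends with six.iteritems being called on a list while Scrapy wraps a setting value in BaseSettings. One plausible cause, and this is an assumption because the project's settings.py is not shown in this thread, is that a dict-type setting such as ITEM_PIPELINES is written as a plain list. A minimal sketch of converting such a list to the dict form follows; the helper name pipelines_to_dict and the order values 300 and 100 are hypothetical choices for illustration.

```python
def pipelines_to_dict(pipelines, start=300, step=100):
    """Convert a list of pipeline class paths into a {path: order} dict.

    Hypothetical helper: Scrapy itself does not ship this function.
    A value that is already a dict is copied unchanged.
    """
    if isinstance(pipelines, dict):
        return dict(pipelines)
    # Assign increasing order values so the list order is preserved.
    return {path: start + i * step for i, path in enumerate(pipelines)}


# Assumed legacy list form; a list has no dict-style iteration,
# which matches the AttributeError in the traceback.
legacy = ['linkedIn.pipelines.LinkedinPipeline']

# Dict form that could go into settings.py instead.
ITEM_PIPELINES = pipelines_to_dict(legacy)
print(ITEM_PIPELINES)
```

If the assumption holds, writing the resulting dict directly in settings.py would be the simpler fix; the helper only makes the mapping explicit.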
When you scrape using this method, the response contains JavaScript that loads the content dynamically (provided an internet connection is available), so this scraper basically does not work anymore.
Hi,
I tried the Scrapy code and got the following response from the server:
c:\python27\lib\site-packages\scrapy\settings\deprecated.py:27: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:20: ScrapyDeprecationWarning: linkedIn.spiders.linkedIn_spider.linkedInSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class linkedInSpider(BaseSpider):
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: linkedIn)
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'linkedIn.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['linkedIn.spiders'], 'BOT_NAME': 'linkedIn', 'DEFAULT_ITEM_CLASS': 'linkedIn.items.LinkedinItem', 'FEED_FORMAT': 'json'}
2018-03-15 16:34:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled item pipelines:
['linkedIn.pipelines.LinkedinPipeline']
2018-03-15 16:34:44 [scrapy.core.engine] INFO: Spider opened
2018-03-15 16:34:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 16:34:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-a> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-c> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-d> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-f> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-e> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-a>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-h> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-c>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-d>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-f>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-e>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-h>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-b> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-i> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-k> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-b>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-j> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-l> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-n> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-i>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-m> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-k>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-j>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-o> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-l>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-g> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-n>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-m>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-p> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-q> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-o>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-g>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-s> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-r> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-t> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-p>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-u> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-q>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-w> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-v> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-s>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-r>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-t>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-u>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-y> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-w>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-x> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-v>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-z> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-y>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-x>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-z>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 16:34:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8770,
 'downloader/request_count': 26,
 'downloader/request_method_count/GET': 26,
 'downloader/response_bytes': 53336,
 'downloader/response_count': 26,
 'downloader/response_status_count/999': 26,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 11, 4, 46, 403000),
 'httperror/response_ignored_count': 26,
 'httperror/response_ignored_status_count/999': 26,
 'log_count/DEBUG': 27,
 'log_count/INFO': 33,
 'response_received_count': 26,
 'scheduler/dequeued': 26,
 'scheduler/dequeued/memory': 26,
 'scheduler/enqueued': 26,
 'scheduler/enqueued/memory': 26,
 'start_time': datetime.datetime(2018, 3, 15, 11, 4, 44, 414000)}
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Spider closed (finished)
I am getting "[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 ...>" for every request. Could you please explain how to handle this status code from the server?
Thanks
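For reference, the "Ignoring response" lines come from Scrapy's HttpErrorMiddleware, which drops non-2xx responses before they reach the spider callback. The sketch below shows a settings.py fragment using the standard HTTPERROR_ALLOWED_CODES setting so that 999 responses are passed through; the helper is_request_denied is a hypothetical name added for illustration. This only lets the callback see the response. It does not change the fact that the server answered every directory request with 999.

```python
# settings.py fragment (a sketch, not the project's actual settings).
# HTTPERROR_ALLOWED_CODES is a standard Scrapy setting: responses with
# these status codes are handed to the spider instead of being ignored.
HTTPERROR_ALLOWED_CODES = [999]


def is_request_denied(status):
    """Hypothetical helper for a spider callback.

    Returns True when the status equals the non-standard 999 code
    seen in the log above, so the callback can skip parsing.
    """
    return status == 999


# Example use inside a callback would be:
#     if is_request_denied(response.status): return
print(is_request_denied(999), is_request_denied(200))
```

The same effect can be scoped to a single spider with the handle_httpstatus_list attribute instead of the project-wide setting.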
When I run the command, I get an IOError: Permission denied: 'items.txt'