divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

License: Other
We will extend/merge this in the future depending on the sources we crawl; maybe we should create a separate markdown file, dataset_stats.md.
Currently, our source counts stand at:
Let's try to add more sources as and when we find them. From my experience, 40 sources should give us at least 200M tokens. So for ta, hi and en we should try to get at least 50 big sources each.
There are websites that update their sitemap.xml regularly, every X months, removing the articles of the previous months in the process.
Example-1:
https://www.dailythanthi.com/Sitemap/article-listing/News.xml
https://www.dailythanthi.com/Sitemap/article-listing/Devotional.xml
The above two sitemaps cover two different categories of news from the same website.
This website only maintains the latest 1300 news articles in each list.
Though 1300 initially seems like a big number, here's something to note:
As of now, link-1 contains 1300 articles from 02-Oct-2019 to 24-Oct-2019.
Link-2 contains 1300 articles from Jan-2018 to Oct-2019 (since devotional articles are published much less frequently than main news).
This shows that by crawling just the current sitemap we're missing a lot of data. In the case of News.xml, imagine if we had kept the links of articles from 2017 to 2019: at roughly 1300 per month over 36 months, that is 1300*36 = 46,800 articles, just for 1 website!
Example-2:
Of course, there are a few sitemaps that do store articles for a good long time:
https://hindi.news18.com/articles-sitemap-index.xml
This one has 3 lakh articles from 2018 till 2019, but such sitemaps are rare as far as I can see.
How do we solve this challenge and capture this potentially large amount of data?
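One possible approach: re-crawl the rotating sitemap frequently and keep a cumulative set of article URLs, so links survive even after the site drops them from its list. Below is a minimal sketch of that idea; the sitemap URL is one of the examples above, while the seen-file path and function names are illustrative, not part of the tool.

```python
# Sketch: poll a rotating sitemap and accumulate article URLs over time,
# so that links are retained after the site rotates them out of the list.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://www.dailythanthi.com/Sitemap/article-listing/News.xml"
SEEN_FILE = "seen_urls.txt"  # hypothetical persistence file


def extract_locs(xml_text):
    """Return every <loc> entry from a sitemap document, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]


def poll_sitemap(sitemap_url=SITEMAP_URL):
    """Fetch the current sitemap and return the URLs it lists right now."""
    with urlopen(sitemap_url) as resp:
        return extract_locs(resp.read().decode("utf-8", "replace"))


def remember_new_urls(urls, seen_file=SEEN_FILE):
    """Append previously unseen URLs to the seen-file; return only the new ones."""
    try:
        with open(seen_file) as f:
            seen = {line.strip() for line in f}
    except FileNotFoundError:
        seen = set()
    new = [u for u in urls if u not in seen]
    with open(seen_file, "a") as f:
        for u in new:
            f.write(u + "\n")
    return new
```

Running `remember_new_urls(poll_sitemap())` on a schedule (e.g. daily via cron) would keep accumulating URLs even as the site drops old ones; how often to poll depends on how fast each source rotates.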
Ensure that the tool works on both Python 2 and 3. As of now, it works only on Python 3.
We should record all the crawler events in a file. This can be useful for displaying crawl statistics later in a dashboard.
Is it recommended? If so, do we record the time of article publication or the time of crawl?
The former might be obtained using SitemapNewsStory.publish_date.
The same API also provides fields like genres, keywords, etc. Will those be useful to us?
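We could also record both times and let the dashboard decide. A minimal sketch of an append-only JSON-lines event log; the field names and event types here are assumptions, not a fixed schema:

```python
# Sketch: record crawler events as JSON lines so a dashboard can compute
# crawl statistics later. We keep the crawl time and, when known, the
# article's publication time (e.g. from SitemapNewsStory.publish_date).
import json
import time


def record_event(log_path, event_type, publish_date=None, **fields):
    """Append one event to the log file as a single JSON line."""
    event = {
        "type": event_type,           # e.g. "article_crawled", "source_done"
        "crawl_time": time.time(),    # when we crawled it
        "publish_date": publish_date, # when the article was published, if known
    }
    event.update(fields)              # free-form extras: url, source, genres, ...
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

JSON lines keep the log append-only and trivially parseable, so the dashboard can replay it at any time without a database.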
There are sites with the sitemap at abc.com/site-map.xml or abc.com/sitemaps.xml.
To handle these cases, we also need to check the website's robots.txt first, which may contain the list of sitemaps. (Example robots.txt)
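A sketch of that discovery order: read robots.txt for Sitemap: directives first, and only then guess common paths. The fallback path list is an assumption based on the variants mentioned above.

```python
# Sketch: discover a site's sitemaps via robots.txt, falling back to
# common sitemap locations when robots.txt lists none.
from urllib.parse import urljoin
from urllib.request import urlopen

FALLBACK_PATHS = ["/sitemap.xml", "/site-map.xml", "/sitemaps.xml"]  # guesses


def sitemaps_from_robots(robots_text):
    """Extract the Sitemap: directives from robots.txt content."""
    urls = []
    for line in robots_text.splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def discover_sitemaps(base_url):
    """Return sitemap URLs from robots.txt, or the common fallback locations."""
    try:
        with urlopen(urljoin(base_url, "/robots.txt")) as resp:
            found = sitemaps_from_robots(resp.read().decode("utf-8", "replace"))
        if found:
            return found
    except OSError:
        pass  # no robots.txt reachable; fall through to guessing
    return [urljoin(base_url, p) for p in FALLBACK_PATHS]
```

If we end up requiring Python 3.8+, the stdlib's urllib.robotparser.RobotFileParser.site_maps() already does the robots.txt part for us.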
Here is a repo called news-please that is similar to what we're creating:
https://github.com/fhamborg/news-please
I guess it works for almost all languages out of the box, and it also has support for CommonCrawl.
We can get some inspiration and ideas from repos like this.
Feel free to add more links like this; based on them we can think of new features/improvements for our code and create relevant tasks.
As we have been discussing, we need to monitor and administer crawls while they are running. We want to be able to do things like dropping sources, monitoring the bottlenecks (network or CPU), adding/dropping nodes, and managing data.
This will all take a lot of time, but we can move towards it slowly. FYI, I think it is something similar to scrapinghub.
For now, here's what we can plan to do before we close this issue:
We can create an issue for each of the tasks and assign it to ourselves.
Regarding the distributed setup, this is what I propose. For this setup, we will need scrapyd, rabbitmq, and a distributed file system (HDFS/seaweedfs):
(1) Adding nodes: on whatever node we want to add, we will have to run scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.
(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.
(3) Rabbitmq will be our event messenger. The running crawlers will push their events here.
(4) Then we can run the dashboard on any machine. The dashboard will show the crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls by using the scrapyd HTTP API.
More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir, then initiate the crawl.
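The scheduling step above can be sketched against scrapyd's HTTP JSON API (its schedule.json endpoint). The node address, project name, spider name and the source argument below are all placeholders for illustration:

```python
# Sketch of the start-a-crawl step: schedule a spider on a chosen node
# through scrapyd's schedule.json endpoint (scrapyd listens on port 6800
# by default). All concrete names here are hypothetical.
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_schedule_request(node, project, spider, **spider_args):
    """Build the (url, form-body) pair for scrapyd's schedule.json endpoint."""
    url = "http://%s:6800/schedule.json" % node
    body = urlencode({"project": project, "spider": spider, **spider_args})
    return url, body


def schedule_crawl(node, project, spider, **spider_args):
    """POST the request to the chosen node and return scrapyd's JSON reply."""
    url, body = build_schedule_request(node, project, spider, **spider_args)
    with urlopen(url, body.encode()) as resp:
        return json.loads(resp.read())

# e.g. schedule_crawl("node-1", "webcorpus", "news-spider", source="dailythanthi")
```

The dashboard would call schedule_crawl once per chosen (node, source) pair, and use scrapyd's listjobs.json and cancel.json endpoints for monitoring and stopping.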
Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.
Collecting webcorpus
Downloading webcorpus-0.2-py3-none-any.whl (55 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.1/55.1 KB 280.7 kB/s eta 0:00:00
Collecting htmldate
Downloading htmldate-1.4.1-py3-none-any.whl (33 kB)
Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 850.5 kB/s eta 0:00:00
Collecting click
Downloading click-8.1.3-py3-none-any.whl (96 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.6/96.6 KB 1.2 MB/s eta 0:00:00
Collecting pandas
Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 3.4 MB/s eta 0:00:00
Collecting scrapyd-client
Downloading scrapyd_client-1.2.3-py3-none-any.whl (15 kB)
Collecting boilerpipe3
Downloading boilerpipe3-1.3.tar.gz (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 4.5 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [57 lines of output]
Traceback (most recent call last):
File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/usr/lib/python3.10/http/client.py", line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
self.send(msg)
File "/usr/lib/python3.10/http/client.py", line 975, in send
self.connect()
File "/usr/lib/python3.10/http/client.py", line 1447, in connect
super().connect()
File "/usr/lib/python3.10/http/client.py", line 941, in connect
self.sock = self._create_connection(
File "/usr/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/usr/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
BlockingIOError: [Errno 11] Resource temporarily unavailable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 36, in <module>
download_jars(datapath=DATAPATH)
File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 27, in download_jars
downloaded = urlretrieve(tgz_url, tgz_name)
File "/usr/lib/python3.10/urllib/request.py", line 241, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 557, in error
result = self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 749, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11] Resource temporarily unavailable>
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Please add an example to the README page showing how to run the crawler with all the sources for a particular language.
@gowtham1997, @divkakwani
There are some sitemaps which recursively contain other sitemaps. For instance:
https://www.dailythanthi.com/Sitemap/Sitemap.xml
But these recursive sitemaps may or may not comply with the sitemap format.
An example of a recursive sitemap that does comply with the format:
https://hindi.news18.com/sitemap.xml
Todo:
We should extract the article URLs (http) from these nested sitemaps.
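For the format-compliant case, a sitemapindex lists child sitemaps while a urlset lists articles, so the traversal can recurse on the root tag. A sketch; the fetch callable is injected so the logic can be tested without the network:

```python
# Sketch: collect article URLs from a sitemap that may recursively point
# to further sitemaps. `fetch` maps a sitemap URL to its XML text.
import xml.etree.ElementTree as ET


def collect_article_urls(xml_text, fetch):
    """Return article URLs, recursing whenever the document is a sitemapindex."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]
    if root.tag.endswith("sitemapindex"):   # <loc>s are nested sitemaps: recurse
        urls = []
        for loc in locs:
            urls.extend(collect_article_urls(fetch(loc), fetch))
        return urls
    return locs                             # plain <urlset>: these are articles
```

Non-compliant recursive sitemaps would still need a per-site fallback (e.g. scraping hrefs from whatever the nested page actually is), which this sketch does not cover.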