divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

License: Other
We will extend/merge this in the future depending on the sources we crawl; maybe we should create a separate markdown file, dataset_stats.md.
Currently, our source counts stand at:
Let's try to add more sources as and when we find them. From my experience, 40 sources should give us at least 200M tokens. So for ta, hi and en we should try to get at least 50 big sources each.
There are websites that update their sitemap.xml regularly, every X months, removing the articles of the previous months in the process.
Example-1:
https://www.dailythanthi.com/Sitemap/article-listing/News.xml
https://www.dailythanthi.com/Sitemap/article-listing/Devotional.xml
The above two sitemaps cover two different categories of news from the same website.
This website only maintains the latest 1300 news articles in each list.
Though 1300 initially seems like a big number, here's something to note:
As of now, link-1 contains 1300 articles from 02-Oct-2019 to 24-Oct-2019.
Link-2 contains 1300 articles from Jan-2018 to Oct-2019 (since devotional articles are published much less frequently than main news).
This shows that by crawling just the current sitemap we're missing a lot of data. In the case of News.xml, imagine if we had kept the links of articles from 2017 to 2019: at roughly 1300 per month over 36 months, that is 1300*36 = 46,800 articles, just for 1 website!
Example-2:
Of course, there are a few sitemaps that do store articles for a good long time:
https://hindi.news18.com/articles-sitemap-index.xml
This one has 3 lakh articles from 2018 till 2019, but such sitemaps are rare as far as I can see.
How do we solve this challenge and capture this potentially large amount of data?
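One possible approach: re-crawl the rotating sitemap frequently and keep a cumulative set of article URLs, so links survive even after the site drops them from its list. Below is a minimal sketch of that idea; the sitemap URL is one of the examples above, while the seen-file path and function names are illustrative, not part of the tool.

```python
# Sketch: poll a rotating sitemap and accumulate article URLs over time,
# so that links are retained after the site rotates them out of the list.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://www.dailythanthi.com/Sitemap/article-listing/News.xml"
SEEN_FILE = "seen_urls.txt"  # hypothetical persistence file


def extract_locs(xml_text):
    """Return every <loc> entry from a sitemap document, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]


def poll_sitemap(sitemap_url=SITEMAP_URL):
    """Fetch the current sitemap and return the URLs it lists right now."""
    with urlopen(sitemap_url) as resp:
        return extract_locs(resp.read().decode("utf-8", "replace"))


def remember_new_urls(urls, seen_file=SEEN_FILE):
    """Append previously unseen URLs to the seen-file; return only the new ones."""
    try:
        with open(seen_file) as f:
            seen = {line.strip() for line in f}
    except FileNotFoundError:
        seen = set()
    new = [u for u in urls if u not in seen]
    with open(seen_file, "a") as f:
        for u in new:
            f.write(u + "\n")
    return new
```

Running `remember_new_urls(poll_sitemap())` on a schedule (e.g. daily via cron) would keep accumulating URLs even as the site drops old ones; how often to poll depends on how fast each source rotates.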
Ensure that the tool works on both Python 2 and 3. As of now, it works only on Python 3.
We should record all the crawler events in a file. This can be useful for displaying crawl statistics later in a dashboard.
Is it recommended? If so, do we record the time of article publication or the time of crawl?
The former might be obtained using SitemapNewsStory.publish_date.
The same API also provides fields like genres, keywords, etc. Will those be useful to us?
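We could also record both times and let the dashboard decide. A minimal sketch of an append-only JSON-lines event log; the field names and event types here are assumptions, not a fixed schema:

```python
# Sketch: record crawler events as JSON lines so a dashboard can compute
# crawl statistics later. We keep the crawl time and, when known, the
# article's publication time (e.g. from SitemapNewsStory.publish_date).
import json
import time


def record_event(log_path, event_type, publish_date=None, **fields):
    """Append one event to the log file as a single JSON line."""
    event = {
        "type": event_type,           # e.g. "article_crawled", "source_done"
        "crawl_time": time.time(),    # when we crawled it
        "publish_date": publish_date, # when the article was published, if known
    }
    event.update(fields)              # free-form extras: url, source, genres, ...
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

JSON lines keep the log append-only and trivially parseable, so the dashboard can replay it at any time without a database.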
There are sites with the sitemap at abc.com/site-map.xml or abc.com/sitemaps.xml.
To handle these cases, we also need to check the website's robots.txt first, which may contain the list of sitemaps. (Example robots.txt)
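A sketch of that discovery order: read robots.txt for Sitemap: directives first, and only then guess common paths. The fallback path list is an assumption based on the variants mentioned above.

```python
# Sketch: discover a site's sitemaps via robots.txt, falling back to
# common sitemap locations when robots.txt lists none.
from urllib.parse import urljoin
from urllib.request import urlopen

FALLBACK_PATHS = ["/sitemap.xml", "/site-map.xml", "/sitemaps.xml"]  # guesses


def sitemaps_from_robots(robots_text):
    """Extract the Sitemap: directives from robots.txt content."""
    urls = []
    for line in robots_text.splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def discover_sitemaps(base_url):
    """Return sitemap URLs from robots.txt, or the common fallback locations."""
    try:
        with urlopen(urljoin(base_url, "/robots.txt")) as resp:
            found = sitemaps_from_robots(resp.read().decode("utf-8", "replace"))
        if found:
            return found
    except OSError:
        pass  # no robots.txt reachable; fall through to guessing
    return [urljoin(base_url, p) for p in FALLBACK_PATHS]
```

If we end up requiring Python 3.8+, the stdlib's urllib.robotparser.RobotFileParser.site_maps() already does the robots.txt part for us.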
Here is a repo called news-please that is similar to what we're creating:
https://github.com/fhamborg/news-please
I guess it works for almost all languages out of the box, and it also has support for CommonCrawl.
We can get some inspiration and ideas from repos like this.
Feel free to add more links like this; based on them we can think of new features/improvements for our code and create relevant tasks.
As we have been discussing, we need to monitor and administer crawls while they are running. We want to be able to do things like dropping sources, monitoring the bottlenecks (network or CPU), adding/dropping nodes, and managing data.
This will all take a lot of time, but we can move towards it slowly. FYI, I think it is something similar to scrapinghub.
For now, here's what we can plan to do before we close this issue:
We can create an issue for each of the tasks and assign it to ourselves.
Regarding the distributed setup, this is what I propose. For this setup, we will need scrapyd, rabbitmq, and a distributed file system (HDFS/seaweedfs):
(1) Adding nodes: on whatever node we want to add, we will have to run scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.
(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.
(3) Rabbitmq will be our event messenger. The running crawlers will push their events here.
(4) Then we can run the dashboard on any machine. The dashboard will show the crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls by using the scrapyd HTTP API.
More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir, then initiate the crawl.
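The scheduling step above can be sketched against scrapyd's HTTP JSON API (its schedule.json endpoint). The node address, project name, spider name and the source argument below are all placeholders for illustration:

```python
# Sketch of the start-a-crawl step: schedule a spider on a chosen node
# through scrapyd's schedule.json endpoint (scrapyd listens on port 6800
# by default). All concrete names here are hypothetical.
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_schedule_request(node, project, spider, **spider_args):
    """Build the (url, form-body) pair for scrapyd's schedule.json endpoint."""
    url = "http://%s:6800/schedule.json" % node
    body = urlencode({"project": project, "spider": spider, **spider_args})
    return url, body


def schedule_crawl(node, project, spider, **spider_args):
    """POST the request to the chosen node and return scrapyd's JSON reply."""
    url, body = build_schedule_request(node, project, spider, **spider_args)
    with urlopen(url, body.encode()) as resp:
        return json.loads(resp.read())

# e.g. schedule_crawl("node-1", "webcorpus", "news-spider", source="dailythanthi")
```

The dashboard would call schedule_crawl once per chosen (node, source) pair, and use scrapyd's listjobs.json and cancel.json endpoints for monitoring and stopping.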
Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.
Collecting webcorpus
Downloading webcorpus-0.2-py3-none-any.whl (55 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.1/55.1 KB 280.7 kB/s eta 0:00:00
Collecting htmldate
Downloading htmldate-1.4.1-py3-none-any.whl (33 kB)
Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 850.5 kB/s eta 0:00:00
Collecting click
Downloading click-8.1.3-py3-none-any.whl (96 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.6/96.6 KB 1.2 MB/s eta 0:00:00
Collecting pandas
Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 3.4 MB/s eta 0:00:00
Collecting scrapyd-client
Downloading scrapyd_client-1.2.3-py3-none-any.whl (15 kB)
Collecting boilerpipe3
Downloading boilerpipe3-1.3.tar.gz (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 4.5 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [57 lines of output]
Traceback (most recent call last):
File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/usr/lib/python3.10/http/client.py", line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
self.send(msg)
File "/usr/lib/python3.10/http/client.py", line 975, in send
self.connect()
File "/usr/lib/python3.10/http/client.py", line 1447, in connect
super().connect()
File "/usr/lib/python3.10/http/client.py", line 941, in connect
self.sock = self._create_connection(
File "/usr/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/usr/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
BlockingIOError: [Errno 11] Resource temporarily unavailable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 36, in <module>
download_jars(datapath=DATAPATH)
File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 27, in download_jars
downloaded = urlretrieve(tgz_url, tgz_name)
File "/usr/lib/python3.10/urllib/request.py", line 241, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 557, in error
result = self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 749, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11] Resource temporarily unavailable>
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Please add an example to the README page showing how to run the crawler with all the sources for a particular language.
@gowtham1997, @divkakwani
There are some sitemaps which recursively contain other sitemaps. For instance:
https://www.dailythanthi.com/Sitemap/Sitemap.xml
But these recursive sitemaps may or may not comply with the sitemap format.
An example of a recursive sitemap that does comply with the format:
https://hindi.news18.com/sitemap.xml
Todo:
We should extract the article URLs (http) from these nested sitemaps.
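For the format-compliant case, a sitemapindex lists child sitemaps while a urlset lists articles, so the traversal can recurse on the root tag. A sketch; the fetch callable is injected so the logic can be tested without the network:

```python
# Sketch: collect article URLs from a sitemap that may recursively point
# to further sitemaps. `fetch` maps a sitemap URL to its XML text.
import xml.etree.ElementTree as ET


def collect_article_urls(xml_text, fetch):
    """Return article URLs, recursing whenever the document is a sitemapindex."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]
    if root.tag.endswith("sitemapindex"):   # <loc>s are nested sitemaps: recurse
        urls = []
        for loc in locs:
            urls.extend(collect_article_urls(fetch(loc), fetch))
        return urls
    return locs                             # plain <urlset>: these are articles
```

Non-compliant recursive sitemaps would still need a per-site fallback (e.g. scraping hrefs from whatever the nested page actually is), which this sketch does not cover.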