divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

License: Other

Python 99.80% Shell 0.20%
datasets indic-languages multilingual news-crawler nlp nlp-datasets

webcorpus's People

Contributors

divkakwani, gokulnc, gowtham1997, niksarrow, react117, soumendrak


webcorpus's Issues

Split the datasets table in README into 2

  1. For data that we already have (either links or crawled data from Anoop)
  2. The dataset we're creating right now from news articles.

We will extend/merge it in the future depending on the sources we crawl; maybe create a separate markdown file, dataset_stats.md.

Consolidate Data Sources

Currently, our source counts stand at:

  • as: 16
  • bn: 34
  • en: 59
  • gu: 42
  • hi: 130
  • kn: 39
  • ml: 49
  • mr: 28
  • or: 26
  • pa: 30
  • ta: 98
  • te: 36

Let's try to add more sources as and when we find them. From my experience, 40 sources should give us at least 200M tokens. So for ta, hi, and en we should try to get at least 50 big sources.

Sitemap.xml might not cover all the articles

There are websites that regenerate their sitemap.xml every few months, thereby removing the articles of the previous months.

Example-1:
https://www.dailythanthi.com/Sitemap/article-listing/News.xml
https://www.dailythanthi.com/Sitemap/article-listing/Devotional.xml

The above two sitemaps cover two different categories of news from the same website.
This website maintains only the latest 1300 news articles in each list.
Though initially this seems like a big number, here's something to note:
As of now, link-1 contains 1300 articles from 02-Oct-2019 to 24-Oct-2019.
Link-2 contains 1300 articles from Jan-2018 to Oct-2019 (since devotional articles are much less frequent than main news).

This shows that, by just crawling the sitemap, we're missing a lot of data. In the case of News.xml, imagine if we had the links of articles from 2017 to 2019: at roughly 1300 articles a month, that is 1300 * 36 ≈ 47,000 articles, just for one website!

Example-2:
Of course, there are a few sitemaps that retain articles for a long time.
https://hindi.news18.com/articles-sitemap-index.xml
This one has 3 lakh (300,000) articles from 2018 to 2019, but such sitemaps are rare as far as I can see.

How do we solve the challenge of finding this potentially large amount of data?

  • Should we skip sitemaps entirely and recursively crawl the websites instead?
  • Is there a way to find previous versions of sitemaps? (see the sketch below)
  • Is there any other way to list all the existing articles of a given website?
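
On the second question, one option worth trying is to query the Internet Archive's CDX API for archived snapshots of a sitemap URL and re-fetch the older versions. This is a rough sketch, not webcorpus code: the function name is ours, and whether a particular sitemap was ever archived is not guaranteed.

```python
# Sketch: list archived snapshots of a sitemap URL via the Internet Archive's
# CDX API, so that older versions of the sitemap can be re-fetched.
import requests


def archived_sitemap_snapshots(sitemap_url):
    """Return (timestamp, archived_url) pairs for snapshots of sitemap_url."""
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": sitemap_url, "output": "json", "filter": "statuscode:200"},
    )
    resp.raise_for_status()
    if not resp.text.strip():
        return []
    rows = resp.json()[1:]  # the first row of the JSON output is the header
    return [
        (row[1], f"http://web.archive.org/web/{row[1]}/{row[2]}")
        for row in rows
    ]


# Example:
# archived_sitemap_snapshots("https://www.dailythanthi.com/Sitemap/article-listing/News.xml")
```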

Python2 compatibility

Ensure that the tool works on both Python 2 and 3.
As of now, it works only on Python 3.

Record Crawl Events in a File

We should record all the crawler events in a file. This can be useful for displaying crawl statistics later in a dashboard.
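
A minimal sketch of one way to do this, assuming the crawlers remain Scrapy spiders (the class name, the EVENT_LOG_PATH setting, and the JSON-lines event format below are assumptions, not existing webcorpus code):

```python
# Sketch: a Scrapy extension that appends one JSON line per crawl event to a
# file. Class name, setting name and event schema are placeholders.
import json
import time

from scrapy import signals


class CrawlEventLogger:
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.get("EVENT_LOG_PATH", "crawl-events.jsonl"))
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def _write(self, event, spider, **extra):
        record = {"event": event, "spider": spider.name, "ts": time.time(), **extra}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def spider_opened(self, spider):
        self._write("spider_opened", spider)

    def spider_closed(self, spider, reason):
        self._write("spider_closed", spider, reason=reason)

    def item_scraped(self, item, spider):
        self._write("item_scraped", spider)
```

Enabling it would just mean adding the class to EXTENSIONS in the Scrapy settings; the dashboard can then tail the file to compute statistics.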

Handle robots.txt if sitemap.xml fails

There are also sites whose sitemap lives at abc.com/site-map.xml or abc.com/sitemaps.xml instead.

To handle these cases, we also need to check the robots.txt of the website first, which may contain the list of sitemaps. (Example robots.txt)
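
A minimal sketch of that fallback, using only the standard library (Python 3.8+ is needed for site_maps(); the helper name is ours):

```python
# Sketch: discover sitemap URLs declared in a site's robots.txt.
# Requires Python 3.8+ for RobotFileParser.site_maps().
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(base_url):
    """Return the sitemap URLs listed in <base_url>/robots.txt, or []."""
    rp = RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    return rp.site_maps() or []


# Example: sitemaps_from_robots("https://www.dailythanthi.com")
```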

Other possible sources of code & data

Here is a repo called news-please that is similar to what we're creating:
https://github.com/fhamborg/news-please

I guess it works for almost all languages out of the box, and it also has support for CommonCrawl.
We can get some inspiration and ideas from repos like this.

Feel free to add more links like this, based on which we can think of new features / improvements in our code and create relevant tasks.

Create a Dashboard

Why We Need It

As we have been discussing, we need to monitor and administer crawls while they are running. We want to be able to do things like drop sources, monitor bottlenecks (network or CPU), add/drop nodes, and manage data.

All of this will take a lot of time, but we can move towards it slowly. FYI, I think it is something similar to Scrapinghub.

For now, here's what we can plan to do before we close this issue:

  • Discuss architecture a bit
  • Record crawl events in a file
  • Expose signalling mechanism of the crawlers
  • Create a simple web dashboard.

We can create an issue for each of the tasks and assign it to ourselves.

Distributed Setup

Regarding the distributed setup, this is what I propose. For this setup, we will need scrapyd, RabbitMQ, and a distributed file system (HDFS/SeaweedFS).

(1) Adding nodes: on whatever node we want to add, we will have to run scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.

(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.

(3) RabbitMQ will be our event messenger. The running crawlers will push their events here (see the sketch after this list).

(4) Then we can run the dashboard on any machine. The dashboard will show the crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls using the scrapyd HTTP API.
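
To make point (3) concrete, pushing an event could look roughly like this (a sketch only; pika is just one possible client, and the queue name and event fields are assumptions):

```python
# Sketch: publish one crawl event to RabbitMQ. Queue name and event schema
# are assumptions; pika is one possible Python client.
import json

import pika


def publish_event(event):
    """Push one crawl event onto the shared crawl_events queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="crawl_events", durable=True)
    channel.basic_publish(exchange="", routing_key="crawl_events", body=json.dumps(event))
    connection.close()


# Example: publish_event({"event": "spider_opened", "spider": "news"})
```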

More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir, then initiate the crawl.
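
For example, the dashboard could trigger that operation through scrapyd's schedule.json endpoint, roughly like this (project/spider names and the source argument are placeholders; scrapyd passes extra parameters through as spider arguments):

```python
# Sketch: start a crawl on a chosen scrapyd node via its HTTP API.
# Project name, spider name and the "source" argument are placeholders.
import requests


def start_crawl(node, source):
    """Schedule a crawl of one news source on the given scrapyd node."""
    resp = requests.post(
        f"http://{node}:6800/schedule.json",
        data={"project": "webcorpus", "spider": "news", "source": source},
    )
    resp.raise_for_status()
    return resp.json()["jobid"]


# Example: start_crawl("crawler-node-1", "dailythanthi.com")
```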

Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.

boilerpipe3 is not being installed

Collecting webcorpus
Downloading webcorpus-0.2-py3-none-any.whl (55 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.1/55.1 KB 280.7 kB/s eta 0:00:00
Collecting htmldate
Downloading htmldate-1.4.1-py3-none-any.whl (33 kB)
Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 850.5 kB/s eta 0:00:00
Collecting click
Downloading click-8.1.3-py3-none-any.whl (96 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.6/96.6 KB 1.2 MB/s eta 0:00:00
Collecting pandas
Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 3.4 MB/s eta 0:00:00
Collecting scrapyd-client
Downloading scrapyd_client-1.2.3-py3-none-any.whl (15 kB)
Collecting boilerpipe3
Downloading boilerpipe3-1.3.tar.gz (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 4.5 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [57 lines of output]
  Traceback (most recent call last):
    File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
      h.request(req.get_method(), req.selector, req.data, headers,
    File "/usr/lib/python3.10/http/client.py", line 1282, in request
      self._send_request(method, url, body, headers, encode_chunked)
    File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
      self.endheaders(body, encode_chunked=encode_chunked)
    File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
      self._send_output(message_body, encode_chunked=encode_chunked)
    File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
      self.send(msg)
    File "/usr/lib/python3.10/http/client.py", line 975, in send
      self.connect()
    File "/usr/lib/python3.10/http/client.py", line 1447, in connect
      super().connect()
    File "/usr/lib/python3.10/http/client.py", line 941, in connect
      self.sock = self._create_connection(
    File "/usr/lib/python3.10/socket.py", line 845, in create_connection
      raise err
    File "/usr/lib/python3.10/socket.py", line 833, in create_connection
      sock.connect(sa)
  BlockingIOError: [Errno 11] Resource temporarily unavailable

  During handling of the above exception, another exception occurred:
 
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 36, in <module>
      download_jars(datapath=DATAPATH)
    File "/tmp/pip-install-5ia44xtr/boilerpipe3_5c191d2aa6a2495e94a46a67d0ef3fbe/setup.py", line 27, in download_jars
      downloaded = urlretrieve(tgz_url, tgz_name)
    File "/usr/lib/python3.10/urllib/request.py", line 241, in urlretrieve
      with contextlib.closing(urlopen(url, data)) as fp:
    File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
      return opener.open(url, data, timeout)
    File "/usr/lib/python3.10/urllib/request.py", line 525, in open
      response = meth(req, response)
    File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
      response = self.parent.error(
    File "/usr/lib/python3.10/urllib/request.py", line 557, in error
      result = self._call_chain(*args)
    File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
      result = func(*args)
    File "/usr/lib/python3.10/urllib/request.py", line 749, in http_error_302
      return self.parent.open(new, timeout=req.timeout)
    File "/usr/lib/python3.10/urllib/request.py", line 519, in open
      response = self._open(req, data)
    File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
      result = self._call_chain(self.handle_open, protocol, protocol +
    File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
      result = func(*args)
    File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
      return self.do_open(http.client.HTTPSConnection, req,
    File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
      raise URLError(err)
  urllib.error.URLError: <urlopen error [Errno 11] Resource temporarily unavailable>
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
