mwscrape's People

Contributors

itkach, korhoj, mhbraun, sklart

mwscrape's Issues

mwscrape --changes-since and --recent no feedback

Using
time mwscrape de.m.wikipedia.org --delete-not-found --changes-since 20150802
or
time mwscrape de.m.wikipedia.org --delete-not-found --recent --recent-days 14
does not provide any feedback (a screen listing) of the collected articles.

The only output is
Starting session de-m-wikipedia-org-1439950680-485
No document is created in the mwscrape database, and the scrape stops after approximately 2 hours.

I therefore assume that no data is being collected.
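
One way to verify this from the outside is to watch the target database's document count while the session runs. A minimal sketch, assuming CouchDB listens on localhost:5984 and the database is named de-m-wikipedia-org:

import time
import requests

DB_URL = "http://localhost:5984/de-m-wikipedia-org"  # assumed location

before = requests.get(DB_URL).json()["doc_count"]
time.sleep(600)  # let the scrape run for ten minutes
after = requests.get(DB_URL).json()["doc_count"]
print("documents added:", after - before)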

does --recent collect all changes?

A user came up with the following findings in dewiki:

Article Guido Westerwelle:
His death was entered in Wikipedia on the same day, March 18 2016. However, it showed up in neither the April nor the May compilation of dewiki data. The log files I create do not show any change to Westerwelle.
Article Hans-Dietrich Genscher:
He died on March 31 2016, and his data was in the April and May dewiki compilations. The log files do not show any change to Genscher either.

The scrape of dewiki runs as a cron job once a day:

mwscrape $couchdb --delete-not-found --recent --recent-days 21 2>&1 | tee /home/guest/dewi-$Datum.log

where $couchdb is de-m-wikipedia-org
and $Datum is the current date.
The 2>&1 | tee redirects the scrape output both to the screen and into the log file.
This way I have a 21-day overlap to catch items I did not capture the first time, e.g. because of broken connections during scraping.

Nevertheless, Genscher was updated correctly, but Westerwelle was not.

It is virtually impossible to check whether the changes made have been completely scraped.
Any thoughts?
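
One way to spot-check a single article is to ask the wiki itself for the revisions in the window the --recent scrape should have covered and compare them against the scrape logs. A sketch using the standard MediaWiki revisions API; the title and rough dates come from the report above, everything else is illustrative:

import requests

API = "https://de.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Guido Westerwelle",
    "rvprop": "ids|timestamp",
    "rvstart": "2016-04-08T00:00:00Z",  # newer bound of the window
    "rvend": "2016-03-18T00:00:00Z",    # older bound of the window
    "rvlimit": 50,
    "format": "json",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    for rev in page.get("revisions", []):
        print(rev["revid"], rev["timestamp"])

If revisions show up here but never in the logs, the --recent window logic is the likely culprit; if they appear in the logs but not in CouchDB, the write path is.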

Add support for SQLite database

Since CouchDB is a rather big dependency that requires installation as a system-wide service, would it be possible to add support for an SQLite backend as well? I would imagine that SQLite could cope with all MediaWikis except Wikipedia, and it would make mwscrape much easier to set up for new users, since it would simply write to a single local file.
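
To make the request concrete, here is a hypothetical sketch of what such a backend's storage could look like. mwscrape does not provide this; the table layout is an assumption modeled on what it keeps per page (title, revision, parsed content):

import sqlite3

conn = sqlite3.connect("de-m-wikipedia-org.db")  # one local file per wiki

conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        title    TEXT PRIMARY KEY,
        revision INTEGER NOT NULL,
        parse    TEXT NOT NULL  -- parsed article content
    )
""")

def store(title, revision, parse):
    # One row per title; the stored revision would let the scraper
    # skip pages that are already up to date, as it does with CouchDB.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
            (title, revision, parse),
        )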

mwscrape.db in couch gets big

I updated mwscrape to the version that includes the --speed parameter.
Using
mwscrape de.m.wikipedia.org --speed 5 --delete-not-found
to replace ten individual scrapes resulted in mwscrape.db growing to 72.5 GB (!) after a couple of days of running. Compacting manually resolved the size issue once.
I suspect that compaction of mwscrape.db does not work properly with --speed, whereas it worked correctly without --speed (if sometimes quite aggressively).
Perhaps mwscrape.db could be allowed to grow to a specific (adjustable?) value such as 256 MB; that would compact mwscrape.db after around 5000 scrapes, and the compaction load on CouchDB would not be excessive. With an adjustable value the hard disk usage could be tuned.

This would give the option to tune the system according to its space and speed capabilities.

There is an option in the CouchDB settings for automatic compaction, but it seems to run system-wide rather than for a specific database.
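
In the meantime, compaction can be triggered per database through CouchDB's HTTP API (POST /{db}/_compact), for example from a cron job. A sketch, assuming CouchDB on localhost:5984 and the default mwscrape database name:

import requests

resp = requests.post(
    "http://localhost:5984/mwscrape/_compact",
    headers={"Content-Type": "application/json"},  # CouchDB requires this header
)
print(resp.json())  # {"ok": true} once compaction has been scheduled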

article count enwiktionary

Scraping enwiktionary to CouchDB with

lang=en
mwscrape -c http://admin:password@localhost:5984 https://$lang.wiktionary.org --db $lang-wiktionary-org --speed 5

I am getting 2,404,262 articles in CouchDB after a week of scraping.
According to https://en.wikipedia.org/wiki/Wiktionary there should be about 7.5 million articles. I know a lot of that content is Chinese and so on, but the discrepancy is significant.

dewiktionary is pretty close, with 1.09 million both in CouchDB and online;
elwiktionary has 1,226,405 in CouchDB and 1,318,825 online;
frwiktionary has 4,709,048 in CouchDB and 4,798,530 online.
That seems fine to me: there are always deviations depending on the timestamp and on Wikimedia's own differing counting methods.

Where does the difference for enwiktionary come from?
Is there any filter selecting English-language entries only?
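
One way to compare like with like is to pull the wiki's own statistics from the API and set them against the CouchDB document count: "articles" counts content pages only, while "pages" also includes redirects and other namespaces, which may account for part of the gap. A sketch; the CouchDB URL and database name are assumptions:

import requests

stats = requests.get(
    "https://en.wiktionary.org/w/api.php",
    params={"action": "query", "meta": "siteinfo",
            "siprop": "statistics", "format": "json"},
).json()["query"]["statistics"]

couch = requests.get("http://localhost:5984/en-wiktionary-org").json()

print("wiki articles:", stats["articles"])  # content pages only
print("wiki pages:   ", stats["pages"])     # includes redirects etc.
print("couch docs:   ", couch["doc_count"])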

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-12: ordinal not in range(128)

Hello, please help me solve this problem:
(env-mwscrape)root@nikitozzz:~# mwscrape http://sportwiki.to/ --site-path=/
Starting session sportwiki-to-1439500632-631
Starting at None
0 5-HTP
5-HTP is up to date (rev. 60706), skipping
Traceback (most recent call last):
File "/root/env-mwscrape/bin/mwscrape", line 9, in
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 489, in main
for page in ipages(pages):
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 472, in ipages
print('%7s %s' % (index, title))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-12: ordinal not in range(128)
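
This is Python 2 encoding a non-ASCII page title for a stdout whose encoding has fallen back to ASCII (typical with a C locale or piped output). Setting PYTHONIOENCODING=utf-8 in the environment before running mwscrape is a quick workaround. A sketch of the usual in-code fix for a print call like the one in the traceback; illustrative, not the project's actual patch:

# -*- coding: utf-8 -*-
import sys

def print_title(index, title):
    # Encode the unicode title explicitly instead of letting print
    # fall back to the ASCII codec.
    line = u'%7s %s' % (index, title)
    sys.stdout.write(line.encode('utf-8') + '\n')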

Not possible to login to CouchDB

The CouchDB documentation says it is "wise" to create an admin to restrict full access to the database, and a fresh installation (Futon on Apache CouchDB 1.6.1) shows at the bottom right:
Welcome to Admin Party!
Everyone is admin. Fix this

But in mwscrape it seems to be impossible to pass login parameters for a secured CouchDB.
Without logging in I get an error:
couchdb.http.Unauthorized: (u'unauthorized', u'You are not a server admin.')
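
For what it is worth, the -c option takes the CouchDB location as a URL, and another report in this thread embeds credentials directly in that URL, so the standard user:password@host syntax may already work against a secured server:

mwscrape -c http://admin:password@localhost:5984 de.m.wikipedia.org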

Parameter --end missing

Would it not be useful to have a parameter --end, like --endkey in mwscrape2slob, to give better control over the scrape process?
Especially when I scrape not with one process but with several scrapes at the same time, this ends up in unnecessary requests once one scrape reaches the starting point of the next.
With a parameter --end I could start one scrape at, e.g., AAA and end it at BZZZ, and so on.

Python 3 support

Is Python 3 support planned? AFAIK mwclient already supports Python 3 in its recent versions, so all mwscrape dependencies already support Python 3.

This issue is very important for environments such as Buildroot that support only one Python instance (either Python 2 or Python 3) at a time.

The 2to3 tool found several needed changes (see the sketch after this list):

  • Python 3-incompatible print statements
  • the urllib package was renamed in Python 3
  • the thread package was also renamed (to _thread)
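
For reference, a sketch of what those three changes look like on the Python 3 side; the code is illustrative, not mwscrape's own:

import urllib.request  # Python 2: urllib / urllib2
import _thread         # Python 2: thread

title = "Example"
print('scraping', title)  # Python 2 print statement: print 'scraping', title

req = urllib.request.Request("https://example.com")  # urllib2.Request in Python 2
ident = _thread.start_new_thread(lambda: None, ())   # thread.start_new_thread in Python 2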

I'm not a user of mwscrape but rather a co-maintainer of various Python packages in Buildroot, so it would be great if you could fix this issue.

Thanks.
