medialab / hyphe Goto Github PK

Websites crawler with built-in exploration and control web interface

Home Page: http://hyphe.medialab.sciences-po.fr/demo/

License: GNU Affero General Public License v3.0

Python 31.42% JavaScript 38.13% Shell 1.76% HTML 25.06% CSS 3.47% Dockerfile 0.16%

hyphe's Issues

Hyphe does not manage different corpora or users at the moment.

Any vague idea if or when this will be implemented? Has it been scheduled or is still something which is going to happen sometime in the far future?

. in webentity

When you make a crawl from a url list (i haven't tested the normal crawl) and there is a dot in the end of a url (e.g. www.domain.com . ) the backend server crashes and the entire import has to be repeated.

It is an important bug because DMI link harvester from time to time leaves you with dots in the end of the url.

Handle www in prefixes

If http://test.fr is created before http://www.test.fr, the latter falls naturally into the first's prefix and is therefore never created as a prefix, whereas it is if they are created in the reverse order.

Possible options :

handle from webentity creation rule ?
always strip www ?
always create www extra when missing ?

hyphe_backend.lib missing

Hi Medialab,

Running sudo bin/deploy_scrapy_spider.sh to install the latest edition of hypher gives me the following error:

Copying config.json from root directory to hyphe_backend/crawler for scrapy deployment...
Traceback (most recent call last):
File "deploy.py", line 18, in
from hyphe_backend.lib import config_hci
ImportError: No module named hyphe_backend.lib

It appears that a file most be missing.

Best regards
Tobias

Crawl list features

Useful things to add :

display details of a crawl within the screen when selecting a crawl down in the list so we can see them without scrolling back to the top
checkbox to show all "suspect" crawls meaning those "finished" but with small pages/links figures (i.e less than 4 for both?)
while stille in debug, having the crawljob id on the right is useful, but the webentity_id is more often needed so should be there as well
add link to recrawl the same webentity
add a link to the webentity page (edit) corresponding to a crawl

Handle websites giving wrong http statuses

Some websites act in weird manners, we need to establish policies for these.
For instance:

http://www.sallenougaro.com/ is a correct webpage but says 404
http://www.ilri.org/ILRIAudio is correct but indicates an unlimited redirection (302 Found!) (example to test redirects : curl "http://www.cgiar.org/" -L -o dumpfile -w 'Last URL was: %{url_effective}' )
http://www.meteofrance.com and http://community.eldis.org refuses to work because of some cookies reasons (see http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-COOKIES_ENABLED )

store.set_webentity_homepage adds http:// on null home page

When sending the null chain as a parameter for store.set_webentity_homepage, the homepage is actually set to "http://". The expected behavior would be to get a null homepage. The feature is to allow this way to remove the homepage.

Bug in store.get_webentity_by_url possibly linked to HTTPS

I have a case where the right web entity is not fetched. This is my exact case:

I have a generic web entity "Twitter HTTPS" prefixed by https://twitter.com:443
I have a more precise web entity "XXX on Twitter" prefixed by https://twitter.com/xxx
I have a this URL: 'https://twitter.com/xxx'
Note that this URL is exactly the prefix of "XXX on Twitter". But when I ask to fetch the right web entity, it is the other one that is returned: "Twitter HTTPS".

The right web entity is not returned, a more generic one is instead.

Actualisation de page

Chaque fois que je tente d'actualiser, j'obtiens ça : (je suis alors obligé de rouvrir le site dans une nouvelle fenetre (chrome)

Handle past crawls on removed/merged webentities

When a webentity has been crawled and is merged into another one, its ID no longer corresponds to a webentity in the whole list so it crashes the display of its name in the crawl list (crawl.php).

Either remove those from the crawl list or handle them differently, maybe we need to keep a record of merged ids to their merged_into ones?

Thrift 0.8.0 compilation

Add --without-erlang to the thrift ./configure. Otherwise, the compilation won't succeed on Ubuntu and such.

CSS error in Remove page to crawl

Using Google chrome Mac it is nearly imposible to remove pages you have added (see attached image). It appears to be a simple css error but I haven't been able to track it down.

Sort discovered entities by in-degree

I've given a look a the interface and to me the n°1 priority for the user interface is the possibility to sort the discovered web entities for decreasing in-degree.
Of course it is possible to export the graph, sort in Gephi and then get back to Hyphe to crawl the most cited neighbors...
but how nicer would it be to do it without leaving the interface.

Loading the interface when the server is off

Since the interface (served by Apache) can be loaded when the server is off, it could be nice to have a specific interface reaction instead of an error message (especially since it is not really an error, at least not an unknown one).

Need a page to make a real search

The list of webentities is slow and it comes from that fact that it loads all webentities. It would be useful to have a page where to search web entities from Lucence. It would be more efficient. It would be the occasion to attach features to groups of web entities (group tag, group classify, group merge...)

Scrapyd configuration documentation is not complete

The part about the edition of the configuration is inexistant.

Improve speed of linking algorithm

Test replace Java collections by Trove http://trove.starlight-systems.com/
Renew links only for webentities newly modified or created (add a timestamp for last time of update?)

Handle multiple corpora in one single instance

create/close/export corpus with specific settings (WE Creation Rule, crawl strategy(?), max_depth, precision limit)
Adapt core/crawler code to specific corpora
Run/Stop one MemoryStructure JAVA instance per corpus from core on demand and close on inactive
limit number of corpus running simultanuously

Add option to merge a webentity's parent into itself like for subWebEntities

In the "edit a webentity"page, it would make sense within the Content part when clicking on a parent webentity to be offered the same options as when clicking on a subwebentity, therefore adding the option to merge into current webentity would be nice.

[Crawl] Follow link issues on urls like "fing.org/?key=val"

It looks like the scrapy spider does not follow such links (example: crawling fing.org returns nothing)

Collect and display number of pages actually crawled

Download button in Network

Pu the button just below the selected radio button, so that it is clear that we download what is selected

Alert boxes on some pages when no data yet

After resetting, if one clicks on the "network of WEs" or "explore discovered WEs" links, the webpages complain with a popup alert box because there is no data yet, this should probably just be a less invasive simple message on the pages.

Paste a list of URLs does not trigger the UI

Fix size of Tag value input field in webentity_edit

Also clicking on the validate blue button without having pressed enter in the input field should validate and not restart the operation

Header menu

It would be nice if the "Hyphe" on the left of the header could link to the home page.

Also the "Webentities" menu could point also to the "Explore discovered entities" page, and the "Crawl" menu to the "crawl list" page.

Add backwards fonctions to cancel done crawls ?

Need to mark in memory structure elements coming from a specific crawl

. in webentity

It is an important bug because DMI link harvester from time to time leaves you with dots in the end of the url.

Change URL/LRU rule to put t: (port) after h: (host) instead of before

Possibility to specify in configuration an Analytics ID

It might be nice to monitor who accesses the crawler interface through a Google Analytics account.

To do that, the best would be to have behavioral tracking in the interface, but with an ID specifiable in the global instance configuration.

Exclude button in the list

I know that it is possible to enter in the editing window of a web entity to exclude it. Still, since this is an operation that is done frequently and often just by looking at the URL of the site, it'd be much easier to it directly in the list.

Web entities and crawl limits

This is less of a bug report and more of an attempt to open the discussion.
Currently the limits of a web entities and the limits of its crawl coincide. This is probably a good idea in most cases, but not necessarily in all cases.

Example:
In our cartography of the climate adaptation debate, we have to deal with the website of the Food and Agriculture Organisation. Of course, we don't want crawl this entire website because it is too big and only a portion of it directly concerns climate adaptation. In fact, we are lucky, because they have a sub-directory that is dedicated to climate change (http://www.fao.org/climatechange/). Great! so we only want to crawl this directory.
Still, this does not necessarily imply that we only want to limit this entity to this folder. In fact, the FAO is a relatively unitary institution. Someone who want to cite a FAO study for example may as well site the homepage of the FAO website and not necessarily the pages in the sub-directory.

What this example tries to illustrate is that sometime we might want to define a larger web-entity, but only crawl a smaller portion of it (without necessarily reduce the size of the web-entity).
Could we think of a way to do this?

[Crawl list] warning message on start pages sometimes irrelevant ?

In the Startpages column of the crawl by list interface, the circled exclamation points icons seem to appear in some cases even when the result is a success.

This happens for instance with the simple sample list below :
http://www.medialab.sciences-po.fr/
http://www.sciencespo.fr/
http://www.regardscitoyens.org/

Also, the description of this page is still gibberish and should probably be updated ;)

Fix Lucene queries with too many clauses [Was: Get Pages fails for web entities with many sub-web-entities]

Example: get pages for webentity with many sub-web-entities like WordPress.com when there are a lot of blogs
Cases possible in linking algorithm

Pressing Enter instead of clicking "Declare" returns an error on Crawl_new

When declaring a new crawl in crawl_new.php by inputting a new url, if we do Enter with the keyboard instead of clicking on the "Declare" button, it does not work and the Chrome console displays :
Uncaught TypeError: Object # has no method 'get' _page_crawl_new.js:618
(anonymous function) _page_crawl_new.js:618
p.event.dispatch jquery.min.js:2
g.handle.h

Downloadable CSV

Possibility to download a CSV, in the list of web entities, the crawl jobs, and the classification of discovered web entities

Add list of backend technos to frontend interface footer

Remove button to remove LRU Prefix from WebEntity when only one prefix present

Fix encoding issue in WebEntities fields setter through API

Redirection in iframe

Framebreakers get rid of the iframe preview. We want to avoid that.

hint: use onbeforeunload ?

Migrate from JsonRPC 1.0 to 2.0

Reuse hefee's updated version of txjsonrpc found by @jrault for biblib/reference manager (https://github.com/medialab/reference_manager) :
pip install git+https://github.com/hefee/txjsonrpc.git

cf https://github.com/hefee/txjsonrpc/commit/9fb8fdf45f3b8fa827a5d2548d283f178a412bf5

extra resources:
http://www.simple-is-better.org/rpc/jsonrpc.py
https://launchpad.net/txjsonrpc

API function to get all webentities by status/tag/...

could be useful to get mutliple statuses at once, all WEs with a namespace tag, etc

could also apply to webentitylinks

Remove Abort Crawl button when crawl finished

It does not make much sense to be still able to cancel a crawl even when the crawling and indexing are already over.

Changing the button into another one to "recrawl" could be a nice feature

Installations script

Hi Medialab,

Congratulations with your new release - hypher just seems to become better and better.

Today I experimented for the first time with your bin/install.sh script in which I encountered two minor problems:

Since my first attempt of installation failed (quite a common problem) i had to run the installation script again. This time I however got some new errors whenever I reached these two lines:

sudo ln -s pwd/config/scrapyd.config /etc/scrapyd/conf.d/100-hyphe || exit 1
sudo ln -s pwd/hyphe_www_client/_config/apache2.conf /etc/apache2/sites-available/hyphe || exit 1

The script had already been run once the files already existed and the script died with an error. One should in other words add some if-check to see if the symbolic link already exists.

The scripts currently ends with the following text: "You can now run bash bin/start.sh and access Hyphe at http://localhost/hyphe". I however kept encountering an error until i finally realised that bash bin/build_thrift.sh is not included in the install script and had to be run manually.

Maybe this could be added as well or the end text of the install script could simply be changed.

Best regards
Tobias

For instance, if we try to modify the values of the tags of the category A or C here, it impacts the tags in category B without affecting the others : http://jiminy.medialab.sciences-po.fr/hyphe-demo/webentity_edit.php#we_id=ed90cc14-4dc1-422d-b388-c1bbfaa38e76

Looking at the code, it looks like the category variable is only grabbed when trying to update the category name, so the latest category is being used

Fix recognition of url in webentity with trailing slashes

Examples:
./hyphe_backend/test_client.py store.declare_webentity_by_lru_prefix_as_url http://www.test.fr/test/
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test #FAIL ?
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test/ #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test/a #FAIL

./hyphe_backend/test_client.py store.declare_webentity_by_lru_prefix_as_url http://www.test2.fr/test2
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2 #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2/ #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2/a #OK

medialab / hyphe Goto Github PK

hyphe's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs