medialab / hyphe Goto Github PK

Websites crawler with built-in exploration and control web interface

Home Page: http://hyphe.medialab.sciences-po.fr/demo/

License: GNU Affero General Public License v3.0

Python 31.42% JavaScript 38.13% Shell 1.76% HTML 25.06% CSS 3.47% Dockerfile 0.16%

hyphe's Introduction

Hyphe: web corpus curation tool & links crawler

Welcome to Hyphe, a research-driven web crawler developped at the Sciences Po médialab for the DIME-SHS Web project (ANR-10-EQPX-19-01).

Hyphe aims at providing a tool to build web corpus by crawling data from the web and generating networks between what we call "web entities", which can be single pages as well as a website, subdomains or parts of it, or even a combination of those.

Demo & Tutos

You can try a limited version of Hyphe at the following url: http://hyphe.medialab.sciences-po.fr/demo/

You can find extensive tutorials on Hyphe's Wiki. See also these videos on how to grow a Hyphe corpus and what is a web entity.

How to install?

Before running Hyphe, you may want to adjust the settings first. The default config will work but you may want to tune it for your own needs. There is a procedure to change the configuration after the installation. However we recommend to take a look at the Configuration documentation for detailed explanation of each available option.

Warning: Hyphe can be quite disk-consuming, a big corpus with a few hundred crawls with a depth 2 can easily take up to 50GB, so if you plan on allowing multiple users, you should ensure at least a few hundreds gigabytes are available on your machine. You can reduce disk-space by setting to false the option store_crawled_html_content and limiting the max_depth allowed.

Migrating older versions

Hyphe has changed a lot in the past few years. Migrating from an older version by pulling the code from git is not guaranteed anymore, it is highly recommended to reinstall from scratch. Older corpora can be rebuilt by exporting the list of web entities from the old version and recrawl from that list of urls in the new Hyphe.

Easy install: using Docker

For an easy install either on Linux, Mac OS X or Windows, the best solution is to rely on Docker.

Docker enables isolated install and execution of software stacks, which helps installing easily a whole set of dependencies.

Docker's containers are sizeable: you should ensure at least 4GB of empty space is available before installing. In any case, as expressed above, for a regular and complete use of Hyphe, you should better ensure at least 100GB are available.

Note for Mac OS: ~~you need Apple's XCode installed to allow Docker to run on Mac OS.~~ (XCode is not required anymore for Docker, although it's always preferable to have it for other reasons, such as git etc.)

1. Install Docker

First, you should deploy Docker on your machine following its official installation instructions.

Once you've got Docker installed and running, you will need Docker Compose to set up and orchestrate Hyphe services in a single line. Docker Compose is already installed along with Docker on Windows and Mac OS X, but you may need to install it for Linux.

2. Download Hyphe

Collect Hyphe's sourcecode from this git repository (recommended way to benefit from future updates) or download and uncompress a zipped release, then enter the resulting directory:

git clone https://github.com/medialab/hyphe.git hyphe
cd hyphe

Or, if you do not have git (for instance on a Mac without XCode), you can also download and uncompress the files from Hyphe's latest release by clicking the link to "Source code (zip)" or "Source code (tar.gz)" from the following page: https://github.com/medialab/hyphe/releases

3. Configure

Then, copy the default configuration files and edit them to adjust the settings to your needs: WARNING: do not enclose the values with any kind of quotes

# use "copy" instead of "cp" under Windows powershell
cp .env.example .env
cp config-backend.env.example config-backend.env
cp config-frontend.env.example config-frontend.env

The .env file lets you configure:

TAG: the reference Docker image you want to work with among
- prod: for the latest stable release
- preprod: for intermediate unstable developments
- A specific version, for instance 1.3.0. You will find the list on Hyphe's Docker Hub page and descriptions for each version on GitHub's releases page.
PUBLIC_PORT: the web port on which Hyphe will be served (usually 80 for a single service server, or for a shared host any other port you like which will need to be redirected)
DATA_PATH: using Hyphe can quickly consume several gigabytes of hard drive. By default, volumes will be stored within Docker's default directories but you can define your own path here.

WARNING: DATA_PATH MUST be either empty, or a full absolute path including leading and trailing slashes (for instance /var/opt/hyphe/).

It is not currently supported under Windows, and should always remain empty in this case (so you should install Hyphe from a drive with enough available space).
RESTART_POLICY: the choice of autorestart policy you want Hyphe containers to apply
- no: (default) containers will not be restarted automatically under any circumstance
- always: containers will always restart when stopped
- on-failure: containers will restart only if the exit code indicates an on-failure error
- unless-stopped: containers will always restart unless when explicitly stopped
If you want Hyphe to start automatically at boot, you should use the always policy and make sure the Docker daemon is started at boot time with your service manager.

Hyphe's internal settings are adjustable within config-backend.env and config-frontend.env. Adjust the settings values to your needs following recommendations from the config documentation.

If you want to restrict Hyphe's access to a selected few, you should leave HYPHE_OPEN_CORS_API false in config-backend.env, and setup HYPHE_HTPASSWORD_USER & HYPHE_HTPASSWORD_PASS in config-frontend.env (use openssl passwd -apr1 to generate your password's encrypted value).

4. Prepare the Docker containers

You have two options: either collect, or build Hyphe's Docker containers.

Recommended: Pull our official preassembled images from the Docker Store
```
docker-compose pull
```
Alternative: Build your own images from the source code (mostly for development or if you intend to edit the code, and for some very specific configuration settings):
```
docker-compose build
```

Pulling should be faster, but it will still take a few minutes to download or build everything either way.

5. Start Hyphe

Finally, start Hyphe containers with the following command, which will run Hyphe and display all of its logs in the console until stopped by pressing Ctrl+C.

docker-compose up

Or run the containers as a background daemon (for instance for production on a server):

docker-compose up -d

Once the logs say "All tests passed. Ready!", you can access your Hyphe install at http://localhost:80/ (or http://localhost:<PUBLIC_PORT>/ if you changed the port value in the .env configuration file).

6. Stop and monitor Hyphe

To stop containers running in background, use docker-compose stop (or docker-compose down to also clean relying data).

You can inspect the logs of the various Docker containers using docker-compose logs, or with option -f to track latest entries like with tail.

Whenever you change any configuration file, restart the Docker container to take the changes into account:

docker-compose stop
docker-compose up -d

Run docker-compose help to get more explanations on any extra advanced use of Docker.

If you encounter issues with the Docker builds, please report an issue including the "Image ID" of the Docker images you used from the output of docker images or, if you installed from source, the last commit ID (read from git log).

7. Update to future versions

WARNING: Do not do this if you're not sure of what you're doing, upgrading to major new versions can potentially break your existing corpuses making it really complex to get your data back.

If you installed from git by pulling our builds from DockerHub, you should be able to update Hyphe to future minor releases by simply doing the following:

docker-compose down
git pull
docker-compose pull
# eventually edit your configuration files to use new options
docker-compose up -d

Manual install (complex and only for Linux)

If your computer or server relies on an old Linux distribution unable to run Docker, if you want to contribute to Hyphe's backend development, or for any other personal reason, you might want to rather install Hyphe manually by following the manual install instructions.

Please note there are many dependencies which are not always trivial to install and that you might run in quite a bit of issues. You can ask for some help by opening an issue and describing your problem, hopefully someone will find some time to try and help you.

Hyphe relies on a web interface with a server daemon which must be running at all times. When manually installed, one must start, stop or restart the daemon using the following command (without sudo):

bin/hyphe <start|restart|stop> [--nologs]

By default the starter will display Hyphe's log in the console using tail. You can use Ctrl+C whenever you like to stop displaying logs without shutting Hyphe down. Use the --nologs option to disable logs display on start. Logs are always accessible from the log directory.

All settings can be configured directly from the global configuration file config/config.json. Restart Hyphe afterwards to take changes into account: bin/hyphe restart.

Serve Hyphe on the web

As soon as the Docker containers or the manual daemon start, you can use Hyphe's web interface on your local machine at the following url:

Docker install: http://localhost/
manual install: http://localhost/hyphe.

For personal uses, you can already work with Hyphe as such. Although, if you want to let others use it as well (typically if you installed on a distant server), you need to serve it on a webserver and make a few adjustments.

Read the dedicated documentation to do so.

Advanced developers features & contributing

Please read the dedicated Developers documentation and the API description.

What's next?

See our roadmap!

Papers & references

Tutorials / examples

(EN) The following videos present different aspects of Hyphe's methodology:
- What is the purpose of Hyphe? (22 min)
- The process of growing a web corpus (13 min)
- What is a web entity? (10 min)
- The web as layers, or how to get oriented during exploration. (10 min)
You can also find here the slides corresponding to these videos.
(FR) An older explanatory video (FR) is also available online: Explorer les internets avec Hyphe (September 2017). Mathieu Jacomy presents Hyphe at Sciences Po CEVIPOF during METSEM, a seminar on digital methods in social sciences.
(FR) A blogpost detailing an example of corpus building with Hyphe: Le web du secteur de l'hydrogène par Mathieu Boyer (2018)

Publications about Hyphe

OOGHE-TABANOU, Benjamin, JACOMY, Mathieu, GIRARD, Paul & PLIQUE, Guillaume, "Hyperlink is not dead!" (Proceeding / Slides), In Proceedings of the 2nd International Conference on Web Studies (WS.2 2018), Everardo Reyes, Mark Bernstein, Giancarlo Ruffo, and Imad Saleh (Eds.). ACM, New York, NY, USA, 12-18. DOI: https://doi.org/10.1145/3240431.3240434
PLIQUE, Guillaume, JACOMY, Mathieu, OOGHE-TABANOU, Benjamin & GIRARD, Paul, "It's a Tree... It's a Graph... It's a Traph!!!! Designing an on-file multi-level graph index for the Hyphe web crawler". (Video / Slides) Presentation at the FOSDEM, Brussels, BELGIUM, February 3rd, 2018.
JACOMY, Mathieu, GIRARD, Paul, OOGHE-TABANOU, Benjamin, et al, "Hyphe, a curation-oriented approach to web crawling for the social sciences.", in International AAAI Conference on Web and Social Media. Association for the Advancement of Artificial Intelligence, 2016.

Publications using Hyphe

Carlotta Capurro & Marta Severo (2023) Mapping European Digital Heritage Politics: An Empirical Study of Europeana as a Web-based Network, Heritage & Society, DOI: https://www.tandfonline.com/doi/full/10.1080/2159032X.2023.2266801 DOI: 10.1080/2159032X.2023.2266801
GRUSON-DANIEL, Célya. Mapping contemporary “research on research” and “science studies”: how new methods change the traditional academic landscape and inform public open science policies. 2023.
YOUNG, Lisa, ALFREY, Laura, et O’CONNOR, Justen. When teachers Google physical literacy: A cartography of controversies analysis. European Physical Education Review, 2022, vol. 28, no 3, p. 613-632.
Hongyi Shi. Exploratory analysis of the hypertext structure linked to diabetes. Human health and pathology. Sorbonne Université, 2020. English. ⟨NNT : 2020SORUS391⟩. ⟨tel-03609140⟩
BLANKE, Tobias et VENTURINI, Tommaso. A network view on reliability: using machine learning to understand how we assess news websites. Journal of Computational Social Science, 2022, vol. 5, no 1, p. 69-88.
DA SILVA, Jaércio. Um conceito na rede: interseccionalidade e sua tradução micromidiática na web francesa. Revista Fronteiras, estudos midiáticos, 2022, vol. 24, no 1, p. 22-36.
BROWNE, Danielle. The ‘more-than-food’geographies of omega-3s. 2022. Thèse de doctorat. Memorial University of Newfoundland.
HEGADE, Prakash, CHITRAGAR, Ruturaj, KULKARNI, Raghavendra, et al. InSciC—Knowledge-Aware Crawler for Indian Sciences. In : Proceedings of International Conference on Communication and Computational Technologies: ICCCT 2022. Singapore : Springer Nature Singapore, 2022. p. 913-924.
BAKER, Catherine R. Infrastructures of male supremacism: a mixed-methods analysis of the incel wiki. 2022. Thèse de doctorat. Loughborough University.
MORALES, Pedro Ramaciotti, COINTET, Jean-Philippe, BENBOUZID, Bilel, et al. Atlas multi-plateforme d’un mouvement social: le cas des Gilets jaunes. Statistique et société, 2022, vol. 9, no 1-2, p. 39-77.
PAPAZU, Irina et VENG, Adam. Controversy Mapping and the Care for Climate Commons: Re-assembling the Danish Climate Movement by Counter-Mapping Digital Network Maps. 2022.
SHI, Hongyi, PFAENDER, Fabien, et JAULENT, Marie-Christine. Mapping the hyperlink structure of diabetes online communities. In : MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. p. 467-471.
DESBONNET, Johan. Réinterroger les notions d'accès à l'information géographique numérique au prisme de ses supports et de sa circulation sur le web. Applications aux territoires littoraux français et québécois. In : ASRDLF-57ème colloque-Territoire (s) et numérique Innovations, mutations et décision. 2021.
SHI, Hongyi, JAULENT, Marie-Christine, et PFAENDER, Fabien. Semantic interpretation of the map with diabetes-related websites. Procedia Computer Science, 2019, vol. 160, p. 330-337.
LECLERC, Éric. Ville intelligente et e-gouvernance en Inde, cartographier un nouveau paysage urbain. Mappemonde. Revue trimestrielle sur l’image géographique et les formes du territoire, 2020, no 128.
DOUZET, Frédérick, LIMONIER, Kévin, MIHOUBI, Selma, et al. Cartographier la propagation des contenus russes et chinois sur le Web africain francophone. Hérodote, 2020, no 2-3, p. 77-99.
DESBONNET, Johan, GOURMELON, Françoise, et CLARAMUNT, Christophe. Analyser la structure de dispositifs de mise à disposition de données géographiques-Application aux enjeux de suivi et de gestion du trait de côte en France. Revue Internationale de Géomatique, 2019, vol. 29, no 1, p. 57-80.
LEPAWSKY, Josh, DAVIS, John-Michael, AKESE, Grace, et al. Cooking with controversies: How geographers might use controversy mapping as a research tool. The Professional Geographer, 2019, vol. 71, no 3, p. 437-448.
DOUZET, Frédérick, LIMONIER, Kévin, MIHOUBI, Selma, et al. Mapping the spread of Russian and Chinese contents on the French-speaking African web. Journal of Cyber Policy, 2021, vol. 6, no 1, p. 50-67.
CIUCCARELLI, Giulia De Rossi Paolo. Mapping Communication Design through the Web. Design and Digital Interaction: Re: Research, Volume 7, 2019, vol. 7, p. 89.
LE DEUFF, Olivier, DAVID, Jean, BOREL, Clément, et al. Paul Otlet sur le web: une étude cartographique. In : H2PTM’19. De l’hypertexte aux humanités numériques. 2019.
CAYLA, Nathalie. Dinosaur Geotourism: A World-Wide Growing Tourism Niche. In : The Geotourism Industry in the 21st Century. Apple Academic Press, 2020. p. 449-472.
VERDI, UGO et DESFRICHES-DORIA, ORÉLIE. Information Literacy: une exploration hypertextuelle. H2PTM’19: De l’hypertexte aux humanités numériques, 2019, p. 22.
FROIO, Caterina. Nosotros y los Otros. La alteridad en los sitios web de las extremas derechas en Francia. We and the Others. The alterity in the websites of the. deSignis Serie Intersecciones deSigniS Intersection’s Series, p. 241.
HERASIMOVICH, Volha et ALZUA-SORZABAL, Aurkene. Communication network analysis to advance mapping ‘sport for development and peace’complexity: Cohesion and leadership. International Review for the Sociology of Sport, 2021, vol. 56, no 2, p. 170-193.
HUMALISTO, Niko, VALVE, Helena, ÅKERMAN, Maria, "Making the circular economy online: a hyperlink analysis of the articulation of nutrient recycling in Finland", (2021), Environmental Politics, 30:5, 833-853, DOI: 10.1080/09644016.2020.1817291
CAFIERO, Florian, GUILLE-ESCURET Paul, WARD Jeremy K, "“I’m not an antivaxxer, but…”: Spurious and authentic diversity among vaccine critical activists" (2021), in Social Networks, Volume 65, 2021, Pages 63-70, ISSN 0378-8733. DOI: 10.1016/j.socnet.2020.11.004.
BLANKE, Tobias, VENTURINI, Tommaso, "A network view on reliability: using machine learning to understand how we assess news websites" (2021), in Journal of Computational Social Science. DOI: 10.1007/s42001-021-00116-w
HERASIMOVICH, Volha, ALZUA-SORZABAL, Aurkene. "Communication network analysis to advance mapping ‘sport for development and peace’ complexity: Cohesion and leadership" (2020), in International Review for the Sociology of Sport. DOI: 10.1177/1012690220909748
WARD Jeremy K, GUILLE-ESCURET Paul, ALAPETITE Clément, "Les « antivaccins », figure de l’anti-Science" (2019), in Déviance et Société, 2019/2 (Vol. 43), p. 221-251. DOI: 10.3917/ds.432.0221
TOURNAY Virginie, JACOMY Mathieu, NECULA Andra, LEIBING Annette & BLASIMME Alessandro, 2019, "A New Web-Based Big Data Analytics for Dynamic Public Opinion Mapping in Digital Networks on Contested Biotechnology Fields", in OMICS: A Journal of Integrative Biology. DOI: 10.1089/omi.2019.0130
CARDON, Dominique, Jean-Philippe COINTET, Benjamin OOGHE, and Guillaume PLIQUE. "Unfolding the Multi-Layered Structure of the French Mediascape" (2019), report for Institut Montaigne.
LAROUSSERIE David, 2019, "Les premiers pas d’un Internet quantique", in Le Monde.
ÁLVARO SÁNCHEZ, Sandra, 2019, "A Topological Space for Design, Participation and Production. Tracking Spaces of Transformation", in Journal of Peer Production, Issue 13: OPEN.
VENTURINI, Tommaso, JACOMY, Mathieu, BOUNEGRU, Liliana & GRAY, Jonathan, (2018), "Visual Network Exploration for Data Journalists", In The Routledge Handbook of Developments in Digital Journalism Studies (pp. 265-283). Routledge.
DESFRICHES-DORIA, Orélie, SERGENT, Henri, TRAN, Félicia, HAETTICH, Yoann & BOREL, Justine (2018), "What is Digital Humanities' identity in interdisciplinary practices?: An experiment with digital tools for visualizing the francophone DH network, In Proceedings of the 2nd International Conference on Web Studies (WS.2 2018), Everardo Reyes, Mark Bernstein, Giancarlo Ruffo, and Imad Saleh (Eds.). ACM, New York, NY, USA, 39-47. DOI: https://doi.org/10.1145/3240431.3240439
WARD, Jeremy, CAFIERO, Florian, FRETIGNY, Raphael, COLGROVE, James & SEROR, Valérie. France's citizen consultation on vaccination and the challenges of participatory democracy in health (2018), in Social Science & Medicine. 220. DOI: https://doi.org/10.1016/j.socscimed.2018.10.032
COUSIN Grégoire & PONTRANDOLFO Stefania (2018), Sull’uso politico dei rom in campagna elettorale. Una ricerca sul web. In Pontrandolfo, S. (ed.), Politiche locali per Rom e Sinti in Italia. Roma, CISU, pp. 243-265.
OJALA, Mace "Mining with Hyphe", 2018 (a blog post on Ethos Lab's blog)
FROIO, Caterina, "Race, Religion, or Culture? Framing Islam between Racism and Neo-Racism in the Online Network of the French Far Right", in Perspectives on Politics, 16(3) (n° 202-203), p. 696-709, 2018. DOI: 10.1017/S1537592718001573.
FROIO, Caterina, "Nous et les autres. L’altérité sur les sites web des extrêmes droites en France", in Réseaux, 2017/2 (n° 202-203), p. 39-78. DOI: 10.3917/res.202.0039.
CREPEL, Maxime, BOULLIER, Dominique, JACOMY, Mathieu, OOGHE-TABANOU, Benjamin, ANTOLINOS-BASSO, Diego, MONSALLIER, Paul "Privacy web corpus", 2017.
DE CARVALHO PEREIRA, Débora "Les réseaux d'influence sur le web dans le domaine des produits laitiers" (2017), report for REPASTOL.
CAYLA, Nathalie, PEYRACHE-GADEAU, Véronique, La singularité territoriale générée par le déploiement de la marque Parc naturel régional, 2017
ROGERS, Richard. Digital Methods for Cross-platform Analysis. J. Burgess, A. Marwick e T. Poell (a cura di) The Sage Handbook of Social Media. London: Sage, 2017.
SERRES, Alexandre et STALDER, Angèle. L'EMI sur le web: cartographie d'un domaine en émergence. In: Journée d'étude GRCDI-ESPE de Caen et Rouen,«L’EMI en questions: enjeux, prescriptions, contenus, apprentissages». 2016.
ROMELE, Alberto, SEVERO Marta. From Philosopher to Network. Using Digital Traces for Understanding Paul Ricoeur’s Legacy. Azimuth. Philosophical Coordinates in Modern and Contemporary Age, Edizioni Storia e Letteratura, 2016, Philosophy and Digital Traces, VI (6), <azimuthjournal.com>
BERTHELOT, Marie-Aimée, SEVERO, Marta, et KERGOSIEN, Eric. Cartographier les acteurs d'un territoire: une approche appliquée au patrimoine industriel textile du Nord-Pas-de-Calais. In: CIST2016-En quête de territoire (s)?. 2016. p. 66-72.
PEDROJA, Cynthia, et al. "Dépasser la liste : quand la bibliothèque entre dans la danse des corpus web." Digital Humanities 2016 (DH2016). Jagiellonian University & Pedagogical University, 2016.
ARZHEIMER, Kai, "The AfD’s Facebook Wall: A new Hub for Right-Wing Mobilisation in Germany?", in American Political Science Association Conference, 2015.
SCHNEIDER, Élisabeth, SERRES, Alexandre & STALDER, Angèle. L’EMI en partage: essai de cartographie des acteurs. In: 10e Congrès des Enseignants Documentalistes de l'Education Nationale" Enseigner-apprendre l'information-documentation". 2015.
KERGOSIEN, Eric, JACQUEMIN, Bernard, SEVERO, Marta, et al. Vers l'interopérabilité des données hétérogènes liées au patrimoine industriel textile. In: CiDE. 18. 18e Colloque international sur le Document Électronique. Europia, 2015. p. 145-158.
MUNK, Anders Kristian. Mapping Wind Energy Controversies Online: Introduction to Methods and Datasets. 2014.
COUSIN, Grégoire & PONTRANDOLFO, Stefania (2014). Emergence and Diffusion of «Roma issues» in the political web arena. At the Conference "Construyendo puentes entre acción política e investigación social. Logros, desafíos y perspectivas en las políticas acerca de la población gitana en el Estado español. Un encuentro entre investigadores, responsables políticos y tercer sector", Barcelona, 16-17 October 2014

Credits & License

Mathieu Jacomy, Benjamin Ooghe-Tabanou & Guillaume Plique @ Sciences Po médialab

Discover more of our projects at médialab tools.

This work is supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Hyphe is a free open source software released under AGPL 3.0 license.

Thanks to https://www.useragents.me for maintaining a great updated list of common user agents which are reused within Hyphe!

[...] I hear _kainos_ [(greek: "now")] in the sense of thick, ongoing presence, with __hyphae__ infusing all sorts of temporalities and materialities."
Donna J. Haraway, Staying with the Trouble, Making kin with the Chthlucene p.2

hyphe's People

Contributors

Stargazers

Watchers

hyphe's Issues

. in webentity

When you make a crawl from a url list (i haven't tested the normal crawl) and there is a dot in the end of a url (e.g. www.domain.com . ) the backend server crashes and the entire import has to be repeated.

It is an important bug because DMI link harvester from time to time leaves you with dots in the end of the url.

Crawl list features

Useful things to add :

display details of a crawl within the screen when selecting a crawl down in the list so we can see them without scrolling back to the top
checkbox to show all "suspect" crawls meaning those "finished" but with small pages/links figures (i.e less than 4 for both?)
while stille in debug, having the crawljob id on the right is useful, but the webentity_id is more often needed so should be there as well
add link to recrawl the same webentity
add a link to the webentity page (edit) corresponding to a crawl

[Crawl list] warning message on start pages sometimes irrelevant ?

In the Startpages column of the crawl by list interface, the circled exclamation points icons seem to appear in some cases even when the result is a success.

This happens for instance with the simple sample list below :
http://www.medialab.sciences-po.fr/
http://www.sciencespo.fr/
http://www.regardscitoyens.org/

Also, the description of this page is still gibberish and should probably be updated ;)

hyphe_backend.lib missing

Hi Medialab,

Running sudo bin/deploy_scrapy_spider.sh to install the latest edition of hypher gives me the following error:

Copying config.json from root directory to hyphe_backend/crawler for scrapy deployment...
Traceback (most recent call last):
File "deploy.py", line 18, in
from hyphe_backend.lib import config_hci
ImportError: No module named hyphe_backend.lib

It appears that a file most be missing.

Best regards
Tobias

Bug in store.get_webentity_by_url possibly linked to HTTPS

I have a case where the right web entity is not fetched. This is my exact case:

I have a generic web entity "Twitter HTTPS" prefixed by https://twitter.com:443
I have a more precise web entity "XXX on Twitter" prefixed by https://twitter.com/xxx
I have a this URL: 'https://twitter.com/xxx'
Note that this URL is exactly the prefix of "XXX on Twitter". But when I ask to fetch the right web entity, it is the other one that is returned: "Twitter HTTPS".

The right web entity is not returned, a more generic one is instead.

Fix recognition of url in webentity with trailing slashes

Examples:
./hyphe_backend/test_client.py store.declare_webentity_by_lru_prefix_as_url http://www.test.fr/test/
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test #FAIL ?
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test/ #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test.fr/test/a #FAIL

./hyphe_backend/test_client.py store.declare_webentity_by_lru_prefix_as_url http://www.test2.fr/test2
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2 #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2/ #OK
./hyphe_backend/test_client.py store.get_webentity_by_url http://www.test2.fr/test2/a #OK

Possibility to specify in configuration an Analytics ID

It might be nice to monitor who accesses the crawler interface through a Google Analytics account.

To do that, the best would be to have behavioral tracking in the interface, but with an ID specifiable in the global instance configuration.

Fix Lucene queries with too many clauses [Was: Get Pages fails for web entities with many sub-web-entities]

Example: get pages for webentity with many sub-web-entities like WordPress.com when there are a lot of blogs
Cases possible in linking algorithm

. in webentity

It is an important bug because DMI link harvester from time to time leaves you with dots in the end of the url.

Remove button to remove LRU Prefix from WebEntity when only one prefix present

Improve speed of linking algorithm

Test replace Java collections by Trove http://trove.starlight-systems.com/
Renew links only for webentities newly modified or created (add a timestamp for last time of update?)

Pressing Enter instead of clicking "Declare" returns an error on Crawl_new

When declaring a new crawl in crawl_new.php by inputting a new url, if we do Enter with the keyboard instead of clicking on the "Declare" button, it does not work and the Chrome console displays :
Uncaught TypeError: Object # has no method 'get' _page_crawl_new.js:618
(anonymous function) _page_crawl_new.js:618
p.event.dispatch jquery.min.js:2
g.handle.h

Fix size of Tag value input field in webentity_edit

Also clicking on the validate blue button without having pressed enter in the input field should validate and not restart the operation

CSS error in Remove page to crawl

Using Google chrome Mac it is nearly imposible to remove pages you have added (see attached image). It appears to be a simple css error but I haven't been able to track it down.

Loading the interface when the server is off

Since the interface (served by Apache) can be loaded when the server is off, it could be nice to have a specific interface reaction instead of an error message (especially since it is not really an error, at least not an unknown one).

Remove Abort Crawl button when crawl finished

It does not make much sense to be still able to cancel a crawl even when the crawling and indexing are already over.

Changing the button into another one to "recrawl" could be a nice feature

Web entities and crawl limits

This is less of a bug report and more of an attempt to open the discussion.
Currently the limits of a web entities and the limits of its crawl coincide. This is probably a good idea in most cases, but not necessarily in all cases.

Example:
In our cartography of the climate adaptation debate, we have to deal with the website of the Food and Agriculture Organisation. Of course, we don't want crawl this entire website because it is too big and only a portion of it directly concerns climate adaptation. In fact, we are lucky, because they have a sub-directory that is dedicated to climate change (http://www.fao.org/climatechange/). Great! so we only want to crawl this directory.
Still, this does not necessarily imply that we only want to limit this entity to this folder. In fact, the FAO is a relatively unitary institution. Someone who want to cite a FAO study for example may as well site the homepage of the FAO website and not necessarily the pages in the sub-directory.

What this example tries to illustrate is that sometime we might want to define a larger web-entity, but only crawl a smaller portion of it (without necessarily reduce the size of the web-entity).
Could we think of a way to do this?

Exclude button in the list

I know that it is possible to enter in the editing window of a web entity to exclude it. Still, since this is an operation that is done frequently and often just by looking at the URL of the site, it'd be much easier to it directly in the list.

Add list of backend technos to frontend interface footer

Hyphe does not manage different corpora or users at the moment.

Any vague idea if or when this will be implemented? Has it been scheduled or is still something which is going to happen sometime in the far future?

Actualisation de page

Chaque fois que je tente d'actualiser, j'obtiens ça : (je suis alors obligé de rouvrir le site dans une nouvelle fenetre (chrome)

Header menu

It would be nice if the "Hyphe" on the left of the header could link to the home page.

Also the "Webentities" menu could point also to the "Explore discovered entities" page, and the "Crawl" menu to the "crawl list" page.

Sort discovered entities by in-degree

I've given a look a the interface and to me the n°1 priority for the user interface is the possibility to sort the discovered web entities for decreasing in-degree.
Of course it is possible to export the graph, sort in Gephi and then get back to Hyphe to crawl the most cited neighbors...
but how nicer would it be to do it without leaving the interface.

Thrift 0.8.0 compilation

Add --without-erlang to the thrift ./configure. Otherwise, the compilation won't succeed on Ubuntu and such.

store.set_webentity_homepage adds http:// on null home page

When sending the null chain as a parameter for store.set_webentity_homepage, the homepage is actually set to "http://". The expected behavior would be to get a null homepage. The feature is to allow this way to remove the homepage.

Need a page to make a real search

The list of webentities is slow and it comes from that fact that it loads all webentities. It would be useful to have a page where to search web entities from Lucence. It would be more efficient. It would be the occasion to attach features to groups of web entities (group tag, group classify, group merge...)

Redirection in iframe

Framebreakers get rid of the iframe preview. We want to avoid that.

hint: use onbeforeunload ?

Handle past crawls on removed/merged webentities

When a webentity has been crawled and is merged into another one, its ID no longer corresponds to a webentity in the whole list so it crashes the display of its name in the crawl list (crawl.php).

Either remove those from the crawl list or handle them differently, maybe we need to keep a record of merged ids to their merged_into ones?

API function to list all namespaces/keys/values for tags

Add credentials to JsonRPC API

Using SSL and user/pass for the API could be better to securize the API server

Add backwards fonctions to cancel done crawls ?

Need to mark in memory structure elements coming from a specific crawl

Add option to merge a webentity's parent into itself like for subWebEntities

In the "edit a webentity"page, it would make sense within the Content part when clicking on a parent webentity to be offered the same options as when clicking on a subwebentity, therefore adding the option to merge into current webentity would be nice.

Migrate from JsonRPC 1.0 to 2.0

Reuse hefee's updated version of txjsonrpc found by @jrault for biblib/reference manager (https://github.com/medialab/reference_manager) :
pip install git+https://github.com/hefee/txjsonrpc.git

cf https://github.com/hefee/txjsonrpc/commit/9fb8fdf45f3b8fa827a5d2548d283f178a412bf5

extra resources:
http://www.simple-is-better.org/rpc/jsonrpc.py
https://launchpad.net/txjsonrpc

Scrapyd configuration documentation is not complete

The part about the edition of the configuration is inexistant.

Modifying list of tag values for a category always updates the tags of the last category

For instance, if we try to modify the values of the tags of the category A or C here, it impacts the tags in category B without affecting the others : http://jiminy.medialab.sciences-po.fr/hyphe-demo/webentity_edit.php#we_id=ed90cc14-4dc1-422d-b388-c1bbfaa38e76

Looking at the code, it looks like the category variable is only grabbed when trying to update the category name, so the latest category is being used

Collect and display number of pages actually crawled

Download button in Network

Pu the button just below the selected radio button, so that it is clear that we download what is selected

[Crawl] Follow link issues on urls like "fing.org/?key=val"

It looks like the scrapy spider does not follow such links (example: crawling fing.org returns nothing)

Alert boxes on some pages when no data yet

After resetting, if one clicks on the "network of WEs" or "explore discovered WEs" links, the webpages complain with a popup alert box because there is no data yet, this should probably just be a less invasive simple message on the pages.

Installations script

Hi Medialab,

Congratulations with your new release - hypher just seems to become better and better.

Today I experimented for the first time with your bin/install.sh script in which I encountered two minor problems:

Since my first attempt of installation failed (quite a common problem) i had to run the installation script again. This time I however got some new errors whenever I reached these two lines:

sudo ln -s pwd/config/scrapyd.config /etc/scrapyd/conf.d/100-hyphe || exit 1
sudo ln -s pwd/hyphe_www_client/_config/apache2.conf /etc/apache2/sites-available/hyphe || exit 1

The script had already been run once the files already existed and the script died with an error. One should in other words add some if-check to see if the symbolic link already exists.

The scripts currently ends with the following text: "You can now run bash bin/start.sh and access Hyphe at http://localhost/hyphe". I however kept encountering an error until i finally realised that bash bin/build_thrift.sh is not included in the install script and had to be run manually.

Maybe this could be added as well or the end text of the install script could simply be changed.

Best regards
Tobias

Change URL/LRU rule to put t: (port) after h: (host) instead of before

Fix encoding issue in WebEntities fields setter through API

Don't refresh canceled crawl jobs in crawls list

Handle websites giving wrong http statuses

Some websites act in weird manners, we need to establish policies for these.
For instance:

http://www.sallenougaro.com/ is a correct webpage but says 404
http://www.ilri.org/ILRIAudio is correct but indicates an unlimited redirection (302 Found!) (example to test redirects : curl "http://www.cgiar.org/" -L -o dumpfile -w 'Last URL was: %{url_effective}' )
http://www.meteofrance.com and http://community.eldis.org refuses to work because of some cookies reasons (see http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-COOKIES_ENABLED )

Radio buttons with explanations for crawl depths

API function to get all webentities by status/tag/...

could be useful to get mutliple statuses at once, all WEs with a namespace tag, etc

could also apply to webentitylinks

Downloadable CSV

Possibility to download a CSV, in the list of web entities, the crawl jobs, and the classification of discovered web entities

Paste a list of URLs does not trigger the UI

Handle multiple corpora in one single instance

create/close/export corpus with specific settings (WE Creation Rule, crawl strategy(?), max_depth, precision limit)
Adapt core/crawler code to specific corpora
Run/Stop one MemoryStructure JAVA instance per corpus from core on demand and close on inactive
limit number of corpus running simultanuously

Handle www in prefixes

If http://test.fr is created before http://www.test.fr, the latter falls naturally into the first's prefix and is therefore never created as a prefix, whereas it is if they are created in the reverse order.

Possible options :

handle from webentity creation rule ?
always strip www ?
always create www extra when missing ?

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble