Comments (2)
This is kind of a tricky case. It is actually already possible, in a roundabout way: first define the sub-website as its own web entity and crawl it, which will generate a second, so-called "parent" web entity that won't be crawled; then merge the parent into the sub-entity so that they become a single web entity (of which only the sub part will have been crawled). The features to redefine and merge web entities aren't fully exposed in the web interface yet, but the functionality exists.
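To make the workflow above concrete, here is a minimal sketch of the two API calls it implies, expressed as JSON-RPC payloads in the style Hyphe's API uses. The method names (`crawl_webentity`, `store.merge_webentity_into_another`), parameter order, and the corpus/entity identifiers are assumptions for illustration and should be checked against your Hyphe server's actual API before use.

```python
import json

def jsonrpc(method, params):
    """Build a JSON-RPC-style payload of the shape Hyphe's API accepts."""
    return {"method": method, "params": params}

def merge_parent_into_sub(sub_id, parent_id, corpus="my-corpus"):
    """Workflow from the comment above, as a list of payloads:
    1. crawl the sub-entity (this is what generates the uncrawled parent);
    2. merge the parent into the sub-entity so they become one.
    All identifiers here are hypothetical placeholders."""
    return [
        jsonrpc("crawl_webentity", [sub_id, corpus]),
        jsonrpc("store.merge_webentity_into_another", [parent_id, sub_id, corpus]),
    ]

calls = merge_parent_into_sub("WE_SUB", "WE_PARENT")
print(json.dumps(calls, indent=2))
```

Each payload would then be POSTed to the server's JSON-RPC endpoint; building them as plain dicts first makes the two-step sequence explicit.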
from hyphe.
I think there is a non-technical discussion to be had here. I will reopen this issue so that we can have that discussion if needed.
The coincidence of a web entity and the limits of a crawl is intentional. We want to crawl the web, and we need to define the limits of a crawl. We tried to stick to users' needs, and users think in terms of websites (most of the time). To fit that need, we implemented web entities. They are what you have crawled. Of course you can edit a web entity and reach a state where it is only partially crawled, but this is a side effect, and we want users to fix that situation so that every web entity is fully crawled. In other terms, web entities serve the purpose of helping users manage their crawl.
Web entities are good because they are a simple way to cope with a difficult problem: defining the limits of a crawl so that we obtain meaningful entities even though the web is large, heterogeneous, and full of singularities (such as redirects). Web entities are the incarnation of a design strategy. We aim to present features in terms of results for the user. The user comes with a need: "I want to have a website in my data." We would rather say "Let's define this website (we call that a web entity) and then harvest it, knowing that it requires several steps" than "We have a harvesting feature requiring several steps, starting with the definition of what you call a website". The user searches for a way to achieve goals, and features must appear as answers to those goals. Web entities are our concept for leading the user to cope with the issues of crawling. As a design trick, I find it quite efficient, since users seem to grasp the concept quickly, while we are able to use it as a solution to several hard-to-design features. We just ask users to keep believing that web entities are the result of the crawl, and then we lead them through the different methodological questions of crawling.
How is it that some users like you want to separate the crawl from the web entity? Maybe the concept of web entity is so transparent that people see different things in it. This is in some ways a design success, since you accept the concept of web entity while discussing the issues of crawling. But you probably understand now that separating crawl settings from web entities leads us to a bigger question: how to explain crawling to users. We can nevertheless explore this design space if you have ideas. Feel free to detail the system you would like to use!