Web entities and crawl limits about hyphe HOT 2 CLOSED

medialab commented on June 3, 2024
Web entities and crawl limits

from hyphe.

Comments (2)

boogheta commented on June 3, 2024

This is a somewhat tricky case. It is already possible in a roundabout way: first define the sub-website as a web entity and crawl it. Crawling it generates a second, so-called "parent" web entity, which is not crawled. Afterwards, merge the parent into the sub-entity so that the two become a single web entity, of which only the sub-website part has been crawled. The features for redefining and merging web entities are not completely offered in the web interface yet, but these operations are possible.
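The workaround above can be pictured with a toy model. This is only a sketch under assumptions: `WebEntity`, `merge_into`, and `owner` are hypothetical names invented for illustration, web entities are reduced to sets of URL prefixes, and none of this is Hyphe's actual API or data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class WebEntity:
    # Hypothetical toy model: a web entity is a named set of URL prefixes.
    name: str
    prefixes: Set[str] = field(default_factory=set)
    crawled: bool = False


def merge_into(target: WebEntity, source: WebEntity) -> WebEntity:
    """Fold source's prefixes into target; target keeps its name and crawl status."""
    return WebEntity(target.name, target.prefixes | source.prefixes, target.crawled)


def owner(url: str, entities: List[WebEntity]) -> Optional[WebEntity]:
    """Return the entity whose longest prefix matches url, or None."""
    best, best_len = None, -1
    for entity in entities:
        for prefix in entity.prefixes:
            if url.startswith(prefix) and len(prefix) > best_len:
                best, best_len = entity, len(prefix)
    return best


# 1. Define the sub-website and crawl it (example.com is a placeholder).
sub = WebEntity("blog", {"http://example.com/blog"}, crawled=True)
# 2. A "parent" entity appears as a side effect and is not crawled.
parent = WebEntity("example.com", {"http://example.com"})
# 3. Merge the parent into the sub-entity: one entity, only the blog part crawled.
merged = merge_into(sub, parent)
```

In this sketch the merged entity answers for every URL under the parent prefix, while the pages actually crawled remain those of the sub-website only.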


jacomyma commented on June 3, 2024

I think that there is a non-technical discussion here. I will reopen this issue so that we have this discussion if needed.

The coincidence of a web entity and the limits of a crawl is intentional. We want to crawl the web, and we need to define the limits of a crawl. We tried to stick to users' needs, and users think in terms of websites (most of the time). To fit that need, we implemented web entities. They are what you have crawled. Of course, you can edit a web entity and then reach a state where a web entity is only partially crawled. But this is a side effect, and we want users to fix that situation so that every web entity is crawled. In other words, web entities serve the purpose of helping users manage their crawl.

Web entities are good because they are a simple way to cope with a difficult problem. This problem is to define the limits of a crawl so that we have meaningful entities even if the web is large, heterogeneous, and full of singularities (such as redirects). Web entities are the incarnation of a design strategy. We aim at presenting features in terms of results for the user. The user comes with a need: "I want to have a website in my data." We would rather say "Let's define this website (we call that a web entity) and then harvest it, knowing that it requires several steps" than "We have a harvesting feature requiring several steps, starting with the definition of what you call a website". The user searches for a way to achieve goals, and features must appear as answers to these goals. Web entities are our concept for leading the user to cope with the issues of crawling. As a design trick I find it quite efficient, since users seem to quickly understand the concept, while we are able to use it as a solution to different hard-to-design features. We just ask the user to keep believing that web entities are the result of the crawl, and then we lead the user to the different methodological questions of the crawl.

How is it that some users like you want to separate the crawl from a web entity? Maybe the concept of a web entity is so transparent that people see different things in it. This is, in a way, a design success, since you accept the concept of a web entity while discussing the issues of crawling. But you probably understand now that if we separate crawl settings from web entities, we face a bigger issue: how to explain crawling to users. We can nevertheless explore this design space if you have ideas. Feel free to detail the system you would like to use!
