
Comments (6)

ethn commented on May 30, 2024

No there isn't a current mechanism, but I have several related thoughts:

  1. Currently we track references (by links, nests, queries...) to other cards in a separate table (card_references). This is what makes it fast/easy to query card relationships. I have often wanted to track external references as well. That would be a good building block for a system like this, because you could go through all the links without re-parsing content.
  2. WikiRate.org uses external links heavily, and this is a problem for them as you mention. One part of the planned solution (currently slotted for attention in about 2 months) would involve storing copies of external sources. Those may become canonical (?), but regardless, it would clearly be useful to know whether external links are valid.
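A minimal sketch of what point 1 might look like: pulling external links out of card content into rows that can be stored and queried without re-parsing. All names here (`external_references`, the row keys) are illustrative, not Decko's actual `card_references` schema.

```ruby
require "uri"

# Hypothetical row extractor: scan card content for external URLs and
# emit one trackable record per unique link. The regex is a deliberately
# simple approximation of "external link".
EXTERNAL_URL = %r{https?://[^\s)\]"'<>]+}

def external_references(card_id, content)
  content.scan(EXTERNAL_URL).uniq.map do |url|
    { referer_id: card_id, referee_url: url, status: nil }
  end
end
```

With rows like these in a table, a link checker only needs to walk the table, never the rendered pages.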

I'll have to give more thought as to whether something like #1 above could come together quickly enough to make it a better solution for us than something like link-checker. On the one hand, #1 would clearly be much more efficient. decko sites can be pretty hard to crawl efficiently because of all the content reuse, and wikirate.org is especially complex. And we already have all the parsing mechanisms in place, so it wouldn't be a massive new undertaking for us. On the other hand, new development always takes time.

I guess part of the question may come down to the utility of making the references queryable. I can certainly imagine that having benefits for wikirate down the road.

from decko.

tukanos commented on May 30, 2024

Thank you for your answer.

> would involve storing copies of external sources

In my eyes, that is probably nearly impossible. Imagine all the content you would want to save every time the source gets updated. The source can contain plenty of JavaScript, AJAX, etc. Of course, it depends on what exactly you store and how much you plan to store. If your database grows, it will probably also take a toll on performance.

On the other hand, if the source is simple enough, it can make sense.


ethn commented on May 30, 2024

Our solution isn't super ambitious: it involves one static version of the source document as it was at citation time. In WikiRate's case, static is arguably preferable, because citations can have specific content references that will get lost in an update. But WikiRate is probably unusual in the need for cached source files; I wouldn't expect many other sites to borrow that functionality.

The shared functionality is what you proposed, the external link tracking. That's just a record of the link and its validity. In WikiRate's case, it makes sense to provide a link to the external source version so long as the reference is still valid. That will avail users of any dynamic (JS, etc) functionality, so long as the source is there.

Re resources, while our (cloud-based) file storage will undoubtedly grow, we won't be updating it with every source update. The only major database growth would be the external link tracking, which isn't storing much more than (1) referring card id, (2) referee uri, and (3) current http status. That should be manageable.
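A sketch of that three-field record plus a status refresh, with the HTTP check injectable so a crawl can be stubbed out. The names (`ExternalReference`, `refresh_status`) are illustrative assumptions, not Decko's schema; a HEAD request is assumed to keep crawl traffic minimal.

```ruby
require "net/http"
require "uri"

# Hypothetical record: referring card id, referee URI, current HTTP status.
ExternalReference = Struct.new(:referer_id, :referee_url, :http_status)

# Refresh a reference's status; `fetcher` is injectable for testing.
def refresh_status(ref, fetcher: method(:head_status))
  ref.http_status = fetcher.call(ref.referee_url)
  ref
end

# Issue a HEAD request and return the numeric status, or nil if the host
# is unreachable (no status to record).
def head_status(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).code.to_i
  end
rescue StandardError
  nil
end
```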


tukanos commented on May 30, 2024

> Our solution isn't super ambitious: it involves one static version of the source document as it was at citation time. In WikiRate's case, static is arguably preferable, because citations can have specific content references that will get lost in an update. But WikiRate is probably unusual in the need for cached source files; I wouldn't expect many other sites to borrow that functionality.

I see. The static version would be quite nice to have. Some really important information gets lost to internet history.

> The shared functionality is what you proposed, the external link tracking. That's just a record of the link and its validity. In WikiRate's case, it makes sense to provide a link to the external source version so long as the reference is still valid. That will avail users of any dynamic (JS, etc) functionality, so long as the source is there.

Yes, that is exactly what I proposed: external link tracking that checks whether the site is still alive at the link. It would also be good to have some mass-update functionality for when a source has merely moved to a different link, e.g. domain.com/I_was_here is now at domain.com/new_site/old_information/I_m_here.
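The mass update described above could be as simple as a prefix rewrite over the tracked links, analogous to how a card rename updates inbound references. A hypothetical sketch (the function name and row shape are assumptions, not Decko API):

```ruby
# Rewrite every tracked external link that starts with `from_prefix` so
# it points at `to_prefix` instead, leaving other links untouched.
def bulk_move(refs, from_prefix, to_prefix)
  refs.each do |ref|
    next unless ref[:referee_url].start_with?(from_prefix)
    ref[:referee_url] = ref[:referee_url].sub(from_prefix, to_prefix)
  end
end
```

In a real system this would be a single SQL `UPDATE` over the tracking table rather than an in-memory loop, but the logic is the same.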

> Re resources, while our (cloud-based) file storage will undoubtedly grow, we won't be updating it with every source update. The only major database growth would be the external link tracking, which isn't storing much more than (1) referring card id, (2) referee uri, and (3) current http status. That should be manageable.

Yes, that is reasonable. I was talking more about the case of copying the source site into WikiRate. External link tracking alone should be manageable, even desirable.


ethn commented on May 30, 2024

I like the bulk update idea. It would be pretty similar to what happens when a card gets renamed, provided we have the link tracking.

I suppose we could also consider updating the link in the case of redirects, but that's not always desirable (e.g. when a more permanent link redirects to a more temporary one).
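One way to encode that caution is to auto-update only on redirects the server itself declares permanent (HTTP 301/308) and leave temporary redirects (302/307) alone. A hedged sketch of such a policy, with illustrative names:

```ruby
# Redirect statuses that the HTTP spec defines as permanent.
PERMANENT_REDIRECTS = [301, 308].freeze

# Return the URL the tracker should store: follow the Location header
# only when the redirect is permanent, otherwise keep the original.
def updated_url(url, status, location)
  PERMANENT_REDIRECTS.include?(status) ? location : url
end
```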


tukanos commented on May 30, 2024

> I like the bulk update idea. It would be pretty similar to what happens when a card gets renamed, provided we have the link tracking.

Yes, the logic would be similar.

I am still thinking about how to deal with the situation where a "dead" link is found. Would you keep the information and add a [deadlink] tag, or would you prefer to delete it completely? Maybe an option to copy it from the Internet Archive's Wayback Machine, or any other "backup" source, would also be nice.
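The Wayback Machine does expose a public availability endpoint (`https://archive.org/wayback/available?url=...`) that returns the closest archived snapshot as JSON, so a dead-link fallback could be sketched like this. The function names are illustrative, and the JSON parsing is kept separate from the network call so it can be tested offline:

```ruby
require "json"
require "net/http"
require "uri"

# Given a dead link, ask the Wayback availability API for the closest
# archived snapshot; return its URL, or nil if nothing is archived.
# `fetch` is injectable for testing.
def wayback_snapshot(url, fetch: method(:fetch_availability))
  closest = fetch.call(url).dig("archived_snapshots", "closest")
  closest && closest["available"] ? closest["url"] : nil
end

def fetch_availability(url)
  api = URI("https://archive.org/wayback/available?url=" \
            "#{URI.encode_www_form_component(url)}")
  JSON.parse(Net::HTTP.get(api))
end
```

A [deadlink] tag could then be replaced (or supplemented) with the snapshot URL whenever one exists.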

