Web browsers let us access most of human knowledge, but they present it in whatever way the site hosting the resource prescribes, and do very little to capture relevant information into a local knowledge base. This experiment explores ways to view the web through different lenses, e.g. viewing a web page as an image catalog or a readable article, adding local annotations, and possibly more... on the assumption that captured web artifacts can seed ideas and help us identify connections.
At the moment this is a research experiment that is unlikely to be useful. However, if you feel inclined to try, you can use the Artifacts bookmarklet and run it on an arbitrary page to see what happens. The bookmarklet does not work on sites that use content security policies to block third-party scripts; in the future we plan to provide a web extension to overcome this limitation.
When the bookmarklet is activated it loads a bookmarklet host, a script that injects an iframe into the document. The iframe loads a bookmarklet client, which communicates with the host via the MessagePort API.
In the future we plan to load the host as a browser extension content script in order to overcome content security restrictions. Other than that, the design will remain equivalent.
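The host side of this design can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the `CLIENT_URL` constant and function names are assumptions.

```javascript
// Hypothetical sketch: the host injects the client iframe and hands it
// one end of a MessageChannel to communicate over.
const CLIENT_URL = "https://example.com/artifacts/client.html"; // assumed URL

function injectClient(doc) {
  const iframe = doc.createElement("iframe");
  iframe.src = CLIENT_URL;
  doc.body.appendChild(iframe);
  return iframe;
}

function connect(iframe) {
  const channel = new MessageChannel();
  // Once the client loads, transfer one port to it; the host keeps the other.
  iframe.addEventListener("load", () => {
    iframe.contentWindow.postMessage({ type: "connect" }, "*", [channel.port2]);
  });
  return channel.port1;
}
```

Transferring a dedicated `MessagePort` (rather than messaging the window directly) gives the host and client a private channel that other scripts on the page cannot listen in on.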
Once loaded, the client issues requests to the host to scrape document metadata, archive the page, etc. The host fulfills each request and transfers the response back to the client, which then renders it in the UI.
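The request/response exchange over the port could look like the sketch below. The message shapes (`id`, `type`, `params`) and request names are assumptions for illustration, not the project's actual protocol.

```javascript
// Hypothetical request/response protocol between client and host.
// Each request carries an id so the client can match up the response.
function handleRequest(request, handlers) {
  const handler = handlers[request.type];
  if (!handler) {
    return { id: request.id, error: `Unknown request type: ${request.type}` };
  }
  return { id: request.id, result: handler(request.params) };
}

// Host side: dispatch incoming client requests arriving on the port.
function serve(port, handlers) {
  port.onmessage = (event) => {
    port.postMessage(handleRequest(event.data, handlers));
  };
}
```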
On the client's request, the host will scrape metadata from the document for the "preview card" (similar to those in Twitter, Slack, Apple Messages, etc.)
The scraper attempts to extract the following information from the document:
- URL
- Hero images
- Title
- Summary
- Site name
To accomplish this it looks, to varying extents, for the following information:
- Open Graph metadata.
- Twitter Card metadata.
- Apple Web Application metadata.
- Microsoft Tile metadata.
- Structured data used by Google search.
- Microformats.
If none of the above is found, it falls back to a best-effort guess via a primitive algorithm inspired by the Mozilla Readability library.
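The lookup order above can be sketched as a cascade of meta-tag queries with generic document properties as the last resort. The helper names and exact selector priority here are assumptions; the real scraper covers more sources (Apple, Microsoft, structured data, microformats).

```javascript
// Hypothetical metadata scraper: return the content of the first
// matching meta tag, or null if none is present.
function firstContent(doc, selectors) {
  for (const selector of selectors) {
    const content = doc.querySelector(selector)?.getAttribute("content");
    if (content) return content;
  }
  return null;
}

function scrapeMetadata(doc) {
  return {
    url: firstContent(doc, ['meta[property="og:url"]']) || doc.location?.href,
    title:
      firstContent(doc, ['meta[property="og:title"]', 'meta[name="twitter:title"]']) ||
      doc.title,
    summary: firstContent(doc, [
      'meta[property="og:description"]',
      'meta[name="twitter:description"]',
      'meta[name="description"]',
    ]),
    siteName: firstContent(doc, ['meta[property="og:site_name"]']),
    heroImages: [
      firstContent(doc, ['meta[property="og:image"]', 'meta[name="twitter:image"]']),
    ].filter(Boolean),
  };
}
```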
On the client's request, the host will archive a page via a (custom fork of) the excellent freeze-dry library as a web bundle file containing all the linked resources, and transfer its content back to the client in the form of an ArrayBuffer.
There is no shortage of file formats for representing web bundles. However, none is part of a web standard or widely supported by mainstream browsers; therefore, figuring out the right format for the task is part of this research.
The received web bundle then gets loaded into a special web bundle viewer.
Given that no browser (except for Safari) supports viewing web bundles natively, for this research we create a custom viewer using a service worker registered at /webarchive/ and a sandboxed iframe.
This allows us to access an archived web bundle via a URL like:
https://gozala.io/artifacts/webarchive/blob/dc265246-d4ca-f644-91a5-d4b33c4512fd
The service worker takes care of decoding the corresponding web bundle file and serving all of its linked resources per request.
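The service worker's role can be sketched as follows, using the blob URL layout from the example above. The bundle-decoding step is stubbed out (`loadBundle` is an assumed helper), since the bundle format itself is still an open question in this research.

```javascript
// Hypothetical sketch of the viewer's service worker logic.
// Extract the bundle id from a /webarchive/blob/<id> path.
function blobIdFromPath(pathname) {
  const match = pathname.match(/\/webarchive\/blob\/([^/]+)/);
  return match ? match[1] : null;
}

// Inside the service worker registered with scope /webarchive/:
// answer requests for archived resources, let everything else fall
// through to the network.
function handleFetch(event, loadBundle) {
  const id = blobIdFromPath(new URL(event.request.url).pathname);
  if (id === null) return;
  // loadBundle is an assumed helper that decodes the stored bundle
  // and returns a Response for the requested resource.
  event.respondWith(loadBundle(id, event.request));
}
```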