worldbrain / legacy-research-engine

WorldBrain's Chrome Extension to full-text search through your browser history & bookmarks.

Home Page: http://www.worldbrain.io

License: GNU General Public License v3.0

CSS 7.83% HTML 7.08% JavaScript 85.09%

legacy-research-engine's People

Contributors

altonius, andrewilyas, artisin, blackforestboi, bohrium272, chaitya62, colejohnson66, fuzzmz, gastonche, girishramnani, lengstrom, michaelmior, mirko911, niieani, obsidianart, raj-maurya, the-fallen, tobeorla

legacy-research-engine's Issues

Malware Warning stops upload of bookmarks (Chrome)

Google Chrome: 54.0.2840.98 (64-bit)

Add aknewelt, or another page that Chrome flags as malware, to your bookmarks.
[Screenshot: 2016-11-25 at 17:45:20]

Start bookmark upload.
A malware warning comes as a popup, or even as a page redirect.

Expected result: The warning appears and this particular page is put on hold, but the remaining bookmarks continue to upload.
Another expected result: The browser notifies the user if the upload has been stopped.
Actual result: The warning stops the upload without notifying the user.

Add progress bar to download page

Currently the download page is static; it could use a progress bar.

I would suggest simply showing a running count of the pages that have been downloaded successfully, in the form 31/3948, and counting up.

Furthermore, add a second counter for all the pages that have failed.
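
A minimal sketch of the two counters suggested above. The helper name `createProgressTracker` and its methods are hypothetical and not part of the extension; the download code would call them after each page finishes.

```javascript
// Hypothetical helper: tracks successful and failed downloads separately.
function createProgressTracker(total) {
  let succeeded = 0;
  let failed = 0;
  return {
    recordSuccess() { succeeded += 1; },
    recordFailure() { failed += 1; },
    // Label in the "31/3948" form suggested in the issue.
    successLabel() { return `${succeeded}/${total}`; },
    // Separate label for pages that could not be downloaded.
    failureLabel() { return `${failed} failed`; },
    // True once every page has either succeeded or failed.
    isDone() { return succeeded + failed >= total; },
  };
}
```

The download page would only need to re-render both labels whenever one of the counters changes.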

Retrieve history via Chrome API

Summary

This task is a necessary part of the onboarding process, so that we know which pages to download and store.
https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198644906
The URLs are not downloaded and stored yet.
This module should only get a list of all the URLs from the Chrome API.

Technical implementation

The complete history/bookmarks stored in chrome can be retrieved via: https://developer.chrome.com/extensions/history
https://developer.chrome.com/extensions/bookmarks

The data we need are the url and the lastVisit

Maybe you have to handle some prompts as well, that have to be confirmed by the user.
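
A rough sketch of how the URL list could be retrieved, assuming the extension declares the "history" permission. `listHistoryUrls` is a hypothetical name, and the `maxResults` value is an assumed upper bound, not a number from the project. The history API is passed in as a parameter so the function can be tried outside the browser.

```javascript
// historyApi is expected to behave like chrome.history.
function listHistoryUrls(historyApi, callback) {
  const query = {
    text: '',           // an empty string matches all history items
    startTime: 0,       // default is the last 24 hours, so start from 0
    maxResults: 100000, // assumed upper bound; the API default is 100
  };
  historyApi.search(query, (items) => {
    // Keep only the two fields this task needs.
    const urls = items.map((item) => ({
      url: item.url,
      lastVisit: item.lastVisitTime,
    }));
    callback(urls);
  });
}
```

In the extension's background page this would be called as `listHistoryUrls(chrome.history, (urls) => console.log(urls.length))`.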

Increase result showing in addressbar

Good morning developer, first of all I want to thank you for creating this extension. I really like it; it is much better than Falcon because of its active development...

Back to business: is there any way you can increase the number of results shown in the address bar? Right now it only shows 5 results; could you make it at least 20?


Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

Port it to Edge

What do you think? Right now there is no way to add it to the store manually; you have to talk to the Microsoft developers to get your extension added to the store.

Implement Onboarding Process

Summary

As seen in https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198964146 we need to connect #1 #2 #3 #4 into one workflow.

Implementation as simple HTML pages; take the current "design" (bootstrap).

Sub-Tasks:

TODO:
"Import" button should call importHistory (background.js, line 367) and switch to the next page, "analyse_urls".

TODO:

  • Read the value of localStorage.getItem('number_urls') (background.js line 381) and use it to roughly calculate the time and size of the download. THIS IS STATIC FOR NOW.
    • Size per item: 50 KB
    • Time per item: 1.5 seconds
  • Focus on getting this value displayed, no styling or text whatsoever. Oliver does that.
    • "Import" button should then call downloadHistory() (background.js line 429) and switch to the next page
  • 3. Implement "Downloading Content" page (https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/199481756)

TODO:

  • As soon as download is finished, switch to Finish page. The end is indicated with line 389 in background.js
  • 4. Implement "Finish" Page

TODO:

  • Button "close window" should close tab

Technical Implementation

  • React?
    • Can/should we change the framework that Falcon uses at this point?
    • Would it unnecessarily blow up the size?
    • Be aware that everything runs locally.
    • What do they use now? Bare JS?
      • In the front-end its just pure HTML at the moment.

Sometimes PDF is not read

Sometimes PDF is not parsed.

Found cases: 1

=================
Page: http://www.german-asa.de/wp-content/uploads/2012/05/Schuelerwettbewerb_web1.pdf
Error-Message:

Uncaught TypeError: Cannot read property 'appendChild' of undefined
    at Function.PDFJS.Util.b.loadScript (pdf.min.js:25)
    at b.setupFakeWorker (pdf.min.js:53)
    at b (pdf.min.js:52)
    at Object.PDFJS.getDocument (pdf.min.js:42)
    at FileReader.fileReader.onload (pdftotext.js:24)

=================



Migration of old data to new PouchDB

When we update the plugin in production, users shouldn't lose their old data.

This is why we need to implement a migration process that runs automatically as soon as the plugin is updated on the user's machine.

This process takes every entry from the SQLite DB and imports it into the PouchDB.

This is an invisible process for the user.
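
One possible shape for such a migration, as a sketch only. It assumes each legacy entry carries the fields named elsewhere in this tracker (title, text, url, lastVisitTime) and, as an assumption, reuses lastVisitTime as the document `_id`. `toPouchDoc` and `migrateRows` are hypothetical names; `db` is expected to be a PouchDB instance, whose `bulkDocs` reports one result per document.

```javascript
// Convert one legacy row into a PouchDB document.
// Assumption: lastVisitTime is unique enough to serve as the _id.
function toPouchDoc(row) {
  return {
    _id: String(row.lastVisitTime), // PouchDB ids must be strings
    title: row.title,
    text: row.text,
    url: row.url,
    lastVisitTime: row.lastVisitTime,
  };
}

// Write all rows in one batch and return how many documents failed.
async function migrateRows(rows, db) {
  const results = await db.bulkDocs(rows.map(toPouchDoc));
  // Failed entries in the bulkDocs result carry an `error` property.
  return results.filter((result) => result.error).length;
}
```

Because the `_id` is derived from the row itself, running the migration a second time would report conflicts for already-imported rows instead of creating duplicates.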



Download and store content

Take the list of urls that are retrieved by #16 and put them through a parsing & storing module similar to what is used to store the different history elements in js/download_history.js

In this process it makes sense to single out the part of the code that does the parsing (lines 49-77), so it can be reused by download_bookmarks.js more easily.

Checklist before starting

Welcome Page not showing after extension installation

After installation the URL chrome-extension://[random]/assets/about.html says:

Your file was not found

It may have been moved or deleted.
ERR_FILE_NOT_FOUND

The favicon(?) is shown, so there must be something. Refresh doesn't help. The extension was installed successfully and is usable, too. Not a big bug, then. However, this is reproducible.

Chrome Version 54.0.2840.98 (64-bit) on macOS. Same issue on Version 55.0.2883.75 (64-bit) on macOS (latest).

Urls are parsed and stored, but cannot be retrieved via the search

Problem:

Urls are parsed and stored, but cannot be retrieved via the search

How I tested it and what were the bugs?

Test:

  1. before installing the plugin I visited a website and remembered a word
  2. installing plugin (not visiting site again)
  3. going to background console
  4. typing importHistory() (worked with message "undefined")
  5. typing downloadHistory() worked and gave me response that my last visited article was actually downloaded first.

Stored History item: Record hot 2015 gave us a glimpse at the future of global warming | Dana Nuccitelli | Environment | The Guardian

But when I then tried w + tab + the aforementioned word (the one I remembered), it can't find it.

Source of problem:

  • The body text is not properly extracted in the module between lines 397-409 in background.js, and the search therefore cannot find the right keywords.
    I compared the data.text output from data.msg === pageContent and saveHistory.

They are completely different. It seems that the HTML delivered by Chrome's document object and the HTML delivered by the XMLHttpRequest produce different data sets that are handled differently by the module processPageText.

These are the two output files for data.text

Coming from savehistory.txt

coming from chrome document.txt

Retrieve bookmarks via Chrome API

Summary

This task is a necessary part of the onboarding process, so that we know which pages to download and store.
https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198644906
This module should only get a list of all the bookmark URLs from the Chrome API.

Technical implementation

The complete bookmarks list stored in chrome can be retrieved via:
https://developer.chrome.com/extensions/bookmarks

Look into import_history.js for inspiration (how we have done it for the history)
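
As an illustration, the bookmark tree returned by the Chrome API can be flattened into a URL list roughly like this. `collectBookmarkUrls` is a hypothetical helper, not code from import_history.js; it relies only on the fact that folder nodes have `children` and no `url`.

```javascript
// Walk a bookmark tree and collect every node that has a url.
function collectBookmarkUrls(nodes, out = []) {
  for (const node of nodes) {
    if (node.url) {
      out.push({ url: node.url, title: node.title });
    }
    if (node.children) {
      // Folders carry their entries in `children`.
      collectBookmarkUrls(node.children, out);
    }
  }
  return out;
}
```

Inside the extension, with the "bookmarks" permission declared, this would be used as `chrome.bookmarks.getTree((tree) => collectBookmarkUrls(tree))`.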

Checklist before starting

Download and store content data

Currently done by using the built in storage function of Falcon.

Summary

We need to download the data of each URL where downloaded=false into our local storage (IndexedDB) and capture the following data:

  • domain_name (already there from #1)
  • url (already there from #1)
  • lastVisitTime (already there from #1)
  • text,
  • keywords(list),
  • authors(list),
  • publish_date,
  • tags(list),
  • summary,
  • links(list)

If successful, change downloaded=true
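
The record described above could look roughly like the following. The field names come from the list in this issue; the two helper functions and the empty default values are assumptions for illustration only.

```javascript
// Create a record for a URL that has not been downloaded yet.
function createPendingRecord({ domain_name, url, lastVisitTime }) {
  return {
    domain_name,
    url,
    lastVisitTime,
    text: '',
    keywords: [],
    authors: [],
    publish_date: null,
    tags: [],
    summary: '',
    links: [],
    downloaded: false,
  };
}

// Merge the extracted fields into a copy of the record and flag it as done.
function applyExtraction(record, extracted) {
  return Object.assign({}, record, extracted, { downloaded: true });
}
```

Returning a copy instead of mutating the record keeps the pending version available if storing the result fails.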

Technical implementation

We could use Newspaper3K again or readability(as it is used now).

Questions to clear up

Before working on this task, the following has to be researched and cleared up. Please answer in the comments.

  1. Can we store additional data in the current storage? This way we don't have to change anything in the search functions and storage of the tool, but can just add data. See #4 for more information about the current data storage.
  2. Which library is the most stable and precise, and provides us with the most data? Tip: look for benchmarks.
  3. In which format can we store data in IndexedDB, and what are the constraints of the storage?

Notification system

We need a way to notify users with messages.

To display them, I see 4 options:

  1. in the results list as the first result, if there are currently notifications.
  2. as a change in the icon in the top right corner and then in the popup
  3. as a popup that automatically appears
  4. in the dashboard as a separate page (there we also list all past notifications)

The plugin has to fetch them from our servers.



Discussion: How to approach server architecture

The plan for a user's autonomy / Server infrastructure

Hey folks,

thanks for your input on this.

What is the goal?

Our promise is to keep users in full control of their data and to allow effective/asynchronous sharing of content recommendations, content associations and metadata.
Therefore there is a need to host the data somewhere in the cloud, but under the control of users.
This means that as soon as a server is needed, the idea is to make it as easy as possible for users to set up their own server with our firmware (i.e. a Docker container) that handles all the data storage/processing.

Secondary effect.

Making this form of decentralisation the default architecture could contribute to a more decentralised internet infrastructure, since we also reach many non-technical internet users.
Because the data of a user is always available via their servers, this could also build the foundation for other decentralised projects to reach broader use
(e.g. P2P social networks or decentralised web search engines like Yacy/Sersia).

We hope that it would lead to a shift towards ecosystems that form around users, not around platforms.
The current centralised, platform-focussed situation leads to an unhealthy amassing of power on the web.

The architecture chart below shows how I imagine the system.

Stage 1:

In the first stage, it's just client-side software: the browser extension.
There is no communication with the outside world needed yet.
The DB currently used is PouchDB.

Stage 2:

We provide a server that handles all the logic of syncing with the attached services and processing of data (like building the search indexes or analysing for related content).
It also has the first version of the communication API built in (called "Ragnorok-Module" as an homage to Daniel Suarez' Daemon & Freedom books ;) ).
In this stage the API is there to communicate with the different clients a user uses, as well as to provide a web-based interface the user can access from anywhere on the web.
Here we possibly have to sync an index to the local machine in order to provide offline support.

Stage 3:

As soon as the system is working for the users themselves, we update the API to be able to talk to other APIs in the network and exchange information, like content recommendations, or provide searchable indexes of the pages other users visited.
In this stage people can start following each other and therefore build circles of trust.

I have a couple of questions:

  • How seamless can the server setup process be made for the user? (Important for non-technical users, as most of ours will be)
  • What kind of problems do you see with this architecture?
  • How can we make the code that runs on the server replicable and independent of the server choice?
  • What storage solution do you know that is capable of running/syncing in an extension as well as on the server? Maybe also including a built-in permission system to handle access. Afaik Pouch/CouchDB don't have that. remotestorage.js has, for example.
  • If we use a system like the IPFS/IPDB, can we also host and run code there?
  • As far as I know, searching in encrypted datasets is not yet mature, so the question is: is it possible to add an encrypted access layer that would effectively sandbox data and its processing, making it unavailable to outside people without the right credentials?


Update Onboarding Process

To reflect the new feature of importing your bookmarks, we have to update the onboarding process.

Subtasks:

  • Update Mockups
  • Update Process
    • Update Info-Alert

Setting up a server, where we can host documents to be displayed in the extension

There are some pages currently in the extension that will be updated fairly often, like the FAQ or the contributions page.
If we want to update them now, we have to update the tool and also the version number.
This is a rather unsuitable situation.

The idea is that we put all those documents on a server and then embed them in an iframe inside the extension. In case there is no internet connection, a cached copy of the page is shown.



Estimate Size and Download time

Part of #6

Summary

Based on the number of URLs to download, we need to estimate the size and download time.
This is important, since we develop the tool for heavy web researchers who could have collected many links; the size on the hard drive could therefore be big, or it could take a substantial amount of time to crawl all the pages.

Technical Implementation:

I would say we estimate a size of 50 KB per page that needs to be downloaded and 2 seconds for the parsing. Downloads will run in parallel: 3 downloads simultaneously.
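
A small sketch of this estimate, using the figures above (50 KB per page, 2 seconds per page, 3 simultaneous downloads). The function name, the constant names and the assumption that the 2 seconds apply per page within each of the 3 parallel slots are illustrative, not taken from the code base.

```javascript
const SIZE_PER_PAGE_KB = 50;  // estimated size of one stored page
const SECONDS_PER_PAGE = 2;   // estimated time to fetch and parse one page
const PARALLEL_DOWNLOADS = 3; // pages processed at the same time

function estimateImport(urlCount) {
  // Total disk usage in megabytes.
  const sizeMb = (urlCount * SIZE_PER_PAGE_KB) / 1024;
  // Pages are processed in batches of PARALLEL_DOWNLOADS.
  const batches = Math.ceil(urlCount / PARALLEL_DOWNLOADS);
  const seconds = batches * SECONDS_PER_PAGE;
  return { sizeMb, seconds };
}
```

The onboarding page could call this with the stored URL count, for example `estimateImport(Number(localStorage.getItem('number_urls')))`.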

Twitter profile is not indexed

When visiting a twitter profile, it is not indexed.

The assumption is that content.js has no callback that waits until the page has been parsed completely via document.body.innerText, so the page gets saved before that.

Why this assumption?
When running document.body.innerText manually, it takes about 5-8 seconds until a result is returned.

Weirdly it only happens with this profile: https://twitter.com/Protohedgehog

Not with others.



Implement PouchDB as local storage system

Summary

Current Architecture:
https://github.com/WorldBrain/Research-Engine/wiki

V1 of this task:

Replace current storages with PouchDB and current search with PouchDB Quick Search.
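
To give an idea of what the replacement search could look like, here is a hedged sketch of the options object accepted by the `search()` method that the PouchDB Quick Search plugin adds to a database. The helper `buildSearchRequest`, the chosen fields and the default limit of 20 are assumptions for illustration.

```javascript
// Build the options for db.search() from a user query.
function buildSearchRequest(query, limit = 20) {
  return {
    query,                            // the text typed by the user
    fields: ['title', 'text', 'url'], // document fields to index and search
    include_docs: true,               // return full documents, not only ids
    limit,                            // maximum number of results
  };
}
```

Once the plugin is registered, a search would then read `db.search(buildSearchRequest('global warming')).then((res) => res.rows.map((row) => row.doc))`.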

For later:

Status Quo
We currently save the data in SQLite3 format, with a new file for each domain.
I don't know if that is the best approach for us, but in general we have to store the data in a way, that allows us later to locally run more analysis on it, like ML or creating better filters.
Here is an example file:
https_www.theguardian.com_0.localstorage.txt
The data points that are stored for each URL are the title, text, url and the lastVisitTime as the key.

Product Goal:

We plan a web based results and filter page (domain, is_bookmark) and need a queryable database for the browser local storage for that.
Features:

  • Filter methods (time, domain, entities, metadata)
  • Connected content (see how you move through the web)
  • ML Analysis

Requirements data structure

Prio 1:

  1. Implementable in a chrome extension
  2. Extensibility of data model possible so we can add/change fields in later updates, without losing data.
  3. Ability to locally query and analyse data in bulk so we can create better filters and run algorithms on it
  4. Storage is cross-browser compatible, so we don't have to refactor major parts when porting to Firefox et al.
  5. Possibility to add search layer on top
  6. implementation for linked data possible (RDF?)

Questions to clear up

Please answer in the comments.
PRIO 1:

  • Can we query the current DB and how?
  • Can we easily extend the fields of the current DB?

PRIO 2:

  • Research possibilities and summarise options (descriptions, upsides, downsides)
  • What are the constraints of putting everything into local storage?
  • Which of the requirements can't we meet when putting data into local storage?
  • Is there an existing framework that already allows us to do this? Save development time.

Resources for Exploration:



Deploy Travis and test 1 function

We need to test our extension - no brainer :)

Scope of task

We need to deploy CI into our tool and make it run for 1 function. This should be the test if everything works. If this task is done, we can start testing all the other functions in our tool.

Subtasks:

  • Deploy travis in "Research-Engine" repo
  • Write test for 1 function

Important:

Chrome extensions have limitations on how JavaScript can be used; for example, you cannot use require() functions.
It is advised to look for tutorials/guides on the Travis + Web Extension combination.

Resources:



Web Interface for search results and filter options

Currently users can only retrieve 5 results at a time via the drop-down of the address bar and have limited filter options. Right now I can only filter by time, and only by typing e.g. after:"yesterday".

We need a relatively simple web interface that can be run locally and lets users filter for various metrics:

  • Keywords (via search bar)
  • Timeframe
  • Author
  • Source

Mockup:
mockup_results


