worldbrain / legacy-research-engine

279.0 279.0 65.0 14.94 MB

WorldBrain's Chrome Extension to full-text search through your browser history & bookmarks.

Home Page: http://www.worldbrain.io

License: GNU General Public License v3.0

CSS 7.83% HTML 7.08% JavaScript 85.09%

legacy-research-engine's Issues

Include a "Blacklist current site" button in the extension's Button Menu

This will provide an easy and convenient way of blacklisting the site that the user is currently on.

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

Not initiating by pressing w on Vivaldi

As the title says, WorldBrain does not start when pressing w + space / tab in the Vivaldi browser.
The extension installed correctly and every feature works, except pressing W to enter research mode.

Google Docs don't get indexed

Google Docs don't get indexed when visiting.

Not the title, not the url, not the content.

Malware Warning stops upload of bookmarks (Chrome)

Google Chrome: 54.0.2840.98 (64-bit)

Add aknewelt, or another page that Chrome flags as malware, to your bookmarks.

Start bookmark upload.
A malware warning comes as a popup, or even as a page redirect.

Expected result: The warning appears and puts this particular page on hold, but does not stop any other bookmarks from uploading.
Another expected result: The browser notifies the user if the upload has been stopped.
Actual result: The warning stops the upload without notifying the user.

Broken intro video link in README.md

Implement a feature to highlight the word on the searched page

As the title says, implement a feature to highlight the search term on the result page, like when a user presses Ctrl+F and types a word: all matching words are highlighted automatically.

thanx for reading

Add progress bar to download page

Currently the download page is static, but could need a progress bar.

I would suggest just doing a count on the pages that have been successful in the form of : 31/3948 and counting up.

Furthermore adding another one for all the ones that have failed.
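
A minimal sketch of the two counters suggested above, assuming a hypothetical helper named createProgressTracker (not part of the extension's code). It keeps one count for successful pages and one for failed pages and produces the "31/3948" style label.

```javascript
// Hypothetical helper: tracks successful and failed page downloads.
function createProgressTracker(total) {
  let succeeded = 0;
  let failed = 0;
  return {
    recordSuccess() { succeeded += 1; },
    recordFailure() { failed += 1; },
    // Text for the download page, in the "succeeded/total" form.
    label() {
      return succeeded + '/' + total + ' downloaded, ' + failed + ' failed';
    },
    // Fraction of pages handled so far (0..1), usable for a progress bar.
    fraction() {
      return total === 0 ? 1 : (succeeded + failed) / total;
    }
  };
}
```

The download loop would call recordSuccess() or recordFailure() after each page and re-render label() on the page.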

Retrieve history via Chrome API

Summary

This task is a necessary part of the onboarding process, so that we know which pages to download and store.
https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198644906
The URLs are not downloaded and stored yet.
This module should only get a list of all the URLs from the Chrome API.

Technical implementation

The complete history/bookmarks stored in chrome can be retrieved via: https://developer.chrome.com/extensions/history
https://developer.chrome.com/extensions/bookmarks

The data we need are the url and the lastVisit

Maybe you have to handle some prompts as well, that have to be confirmed by the user.
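
A rough sketch of how the list could be fetched with chrome.history.search. The function name listHistoryUrls and the idea of passing the API object in as a parameter are assumptions for illustration. The visit timestamp on a history item is called lastVisitTime, and the extension needs the "history" permission in its manifest.

```javascript
// Sketch: collect { url, lastVisitTime } for every history entry.
// `historyApi` is expected to look like chrome.history; passing it in
// (instead of using the global) is an assumption made for testability.
function listHistoryUrls(historyApi, callback) {
  historyApi.search(
    {
      text: '',            // empty string matches all pages
      startTime: 0,        // go back to the beginning of the history
      maxResults: 1000000  // assumed upper bound; the API default is 100
    },
    function (items) {
      callback(items.map(function (item) {
        return { url: item.url, lastVisitTime: item.lastVisitTime };
      }));
    }
  );
}
```

In the background page this would be called as listHistoryUrls(chrome.history, function (urls) { ... }).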

Implement Designs in Front-End

Implementation of the designs as an HTML5/JS interface that runs completely locally.

Also important to connect the fields with the query: #26

Connect Front-End to DB

The web based client needs to be connected to the database and search engine.

Increase result showing in addressbar

Good morning developer, first of all I want to thank you for creating this extension. I really like it; it is much better than Falcon because of its active development.

Back to business: is there any way you can increase the number of results shown in the address bar? Right now it only shows 5 results; could you make it at least 20?

Port it to Edge

What do you think? Right now there is no way to add it to the store yourself; you have to talk to the Microsoft developers to get your extension added to the store.

Save progress, if people (accidentally) leave download page.

When people close the download page without actively pausing the download, it causes a mess in the background.

Should be somehow prevented.

Ideas:

Bring popup when people leave page, where they can choose to store
if page closes, store the progress and restart plugin.
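
One way the second idea could look, as a sketch only. The key name download_progress and the functions saveProgress and loadProgress are hypothetical and not taken from the extension's code.

```javascript
// Hypothetical storage key; not taken from the extension's code.
const PROGRESS_KEY = 'download_progress';

// Persist which URLs are still pending so the import can resume later.
// `storage` is anything with getItem/setItem, e.g. window.localStorage.
function saveProgress(storage, pendingUrls, doneCount) {
  storage.setItem(PROGRESS_KEY, JSON.stringify({
    pendingUrls: pendingUrls,
    doneCount: doneCount
  }));
}

// Returns the saved state, or null if nothing was stored.
function loadProgress(storage) {
  const raw = storage.getItem(PROGRESS_KEY);
  return raw === null ? null : JSON.parse(raw);
}
```

On the download page, saveProgress could be called from a beforeunload listener, and loadProgress when the extension starts again.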

Change Falcon Request to new data format in local storage

Summary

If we change the current data model that falcon uses as described in #4 we need to change the request accordingly.

Technical Implementation

Has to be investigated.
Please state your approach before working on this task in the comments

Implement Onboarding Process

Summary

As seen in https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198964146, we need to connect #1 #2 #3 #4 to one workflow.

Implementation as simple HTML pages; take the current "design" (Bootstrap).

Sub-Tasks:

1. Adding section "Import History" to preferences.html.

TODO:
"Import" button should request module importHistory line 367 in background.js and switch to next page "analyse_urls"

2. Implementing page: "Step 1: Analysing Size"

TODO:

Call the key in localStorage.getItem('number_urls') (background.js line 381) and use the resulting value to roughly calculate the time and size of the download. THIS IS STATIC FOR NOW.
- Size per item: 50kb
- Time per item: 1.5 seconds
Focus on getting this value displayed, no styling or text whatsoever. Oliver does that.
- "Import" Button should then activate downloadHistory() (backgroundJS line: 429) and switch to next page
3. Implement "Downloading Content" page (https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/199481756)

TODO:

As soon as download is finished, switch to Finish page. The end is indicated with line 389 in background.js
4. Implement "Finish" Page

TODO:

Button "close window" should close tab

Technical Implementation

React?
- Can/should we change the framework that Falcon uses at this point?
- Would it unnecessarily blow up the size?
- Be aware that everything runs locally.
- What do they use now? Bare JS?
  - In the front-end its just pure HTML at the moment.

Implement Testing Framework & Test tool

We need to have a testing framework in place and test the application thoroughly.

Since it is a web extension, it is a little trickier, but from what I read about the abilities provided by Browserify, it could be used to include NPM testing libraries.

Display search results along side google results

If people search without typing in 'w' + 'space'/'tab' in address bar, they will automatically search in google.
On the google page, we then should display the results on top of the regular ones

Inspiration:
Pinboard: https://github.com/nhoizey/PinboardInGoogle

Also make PDFs searchable

we can probably use the textract parser for pdfs:
https://github.com/dbashford/textract/blob/master/lib/extractors/pdf.js

Sometimes PDF is not read

Sometimes PDF is not parsed.

Found cases: 1

=================
Page: http://www.german-asa.de/wp-content/uploads/2012/05/Schuelerwettbewerb_web1.pdf
Error-Message:

Uncaught TypeError: Cannot read property 'appendChild' of undefined
    at Function.PDFJS.Util.b.loadScript (pdf.min.js:25)
    at b.setupFakeWorker (pdf.min.js:53)
    at b (pdf.min.js:52)
    at Object.PDFJS.getDocument (pdf.min.js:42)
    at FileReader.fileReader.onload (pdftotext.js:24)

=================

Store all existing URLs to local storage

Right now there are only the first 100 urls stored, we need to store all of them.

Migration of old data to new PouchDB

When we update the plugin in production, users shouldn't lose their old data.

This is why we need to implement a migration process that automatically runs as soon as the plugin is updated on the users machine.

This process takes every entry from the SQLite DB and imports it into the PouchDB.

This is an invisible process for the user.
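
A sketch of the conversion step, under stated assumptions: the row shape (title, text, url, lastVisitTime) follows the fields named in the PouchDB issue, while the function name rowsToDocs and the choice of the URL as _id are hypothetical. How the rows are read out of the old store is not shown.

```javascript
// Convert rows from the old store into documents for PouchDB.
// Using the URL as `_id` is an assumption, not the project's decision.
function rowsToDocs(rows) {
  const byUrl = new Map();
  rows.forEach(function (row) {
    const existing = byUrl.get(row.url);
    // _id must be unique, so keep only the most recent visit per URL.
    if (!existing || row.lastVisitTime > existing.lastVisitTime) {
      byUrl.set(row.url, {
        _id: row.url,
        title: row.title,
        text: row.text,
        lastVisitTime: row.lastVisitTime
      });
    }
  });
  return Array.from(byUrl.values());
}
```

The resulting array could then be written in one call with PouchDB's bulkDocs, for example db.bulkDocs(rowsToDocs(rows)).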

If item is deleted from preferences, it can not be re-imported

This is because we have a list of all the already imported URLs in the local storage.

This will probably be gone as soon as we have the new DB implemented and can query the db better.

Skip pages with basic http auth during importing

When you encounter sites requiring basic http authentication during the import the whole process stops and waits. So maybe just skip these pages altogether.
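
A possible approach, sketched with the chrome.webRequest.onAuthRequired event (Manifest V2, which needs the "webRequest" and "webRequestBlocking" permissions). The function name skipAuthPages and the onSkipped hook are hypothetical. As written it cancels every auth challenge, so a real implementation would probably enable it only while the import runs.

```javascript
// Sketch: cancel any request that asks for HTTP basic auth, so the
// import moves on instead of waiting for credentials.
// `webRequestApi` is expected to look like chrome.webRequest.
function skipAuthPages(webRequestApi, onSkipped) {
  webRequestApi.onAuthRequired.addListener(
    function (details) {
      onSkipped(details.url);   // hypothetical hook, e.g. to log the URL
      return { cancel: true };  // abort this request
    },
    { urls: ['<all_urls>'] },
    ['blocking']
  );
}
```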

Download and store content

Take the list of urls that are retrieved by #16 and put them through a parsing & storing module similar to what is used to store the different history elements in js/download_history.js

In this process it makes sense to single out the part of the code that does the parsing (lines 49-77), so it can be reused by download_bookmarks.js more easily.

Checklist before starting

Make list of data points we need to store in the DB for the bookmarks
Make README about the module and put it in READMEs folder. How do I make a good readme?

Welcome Page not showing after extension installation

After installation the URL chrome-extension://[random]/assets/about.html says:

Your file was not found

It may have been moved or deleted.
ERR_FILE_NOT_FOUND

The favicon(?) is shown, so there must be something. Refresh doesn't help. The extension was installed successfully and is usable, too. Not a big bug, then. However, this is reproducible.

Chrome Version 54.0.2840.98 (64-bit) on macOS. Same issue on Version 55.0.2883.75 (64-bit) on macOS (latest).

Make (semi) mobile support for chrome possible.

When users visit pages on a mobile chrome browser and are syncing their history with google, we can download them as well.

Idea is just to have the history scanned for new entries every fixed interval and download them.

Find solution for code and non-code documentation

We need to agree on a way to document our code and non-code from now on.
Best would be an all in one solution.

Options:

GitBook?

Checklist before starting

How would we implement it?
Any other option?

Urls are parsed and stored, but cannot be retrieved via the search

Problem:

Urls are parsed and stored, but cannot be retrieved via the search

How I tested it and what were the bugs?

Test:

before installing the plugin I visited a website and remembered a word
installing plugin (not visiting site again)
going to background console
typing importHistory() (worked with message "undefined")
typing downloadHistory() worked and gave me response that my last visited article was actually downloaded first.

Stored History item: Record hot 2015 gave us a glimpse at the future of global warming | Dana Nuccitelli | Environment | The Guardian

But when I then tried w + tab + the word I remembered, it couldn't find it.

Source of problem:

The body text is not properly extracted in the module between lines 397-409 in background.js, and the search therefore cannot find the right keywords.
I compared the data.text output from data.msg === pageContent and saveHistory.

They are completely different. It seems that the HTML delivered by Chrome's document object and by the XMLHttpRequest produce different data sets that are handled differently by the module processPageText.

These are the two output files for data.text

Coming from savehistory.txt

coming from chrome document.txt

Add FAQ

Indexing doesn't support AJAX requests

As this reddit comment suggests, any modern single-page app or paginated content won't be indexed.

This should change...

@obsidianart how can we do that?

Retrieve bookmarks via Chrome API

Summary

This task is a necessary part of the onboarding process, so that we know which pages to download and store.
https://projects.invisionapp.com/share/KZ8XQZ1BR#/screens/198644906
This module only should get a list of all the bookmark-urls from the chrome API

Technical implementation

The complete bookmarks list stored in chrome can be retrieved via:
https://developer.chrome.com/extensions/bookmarks

Look into import_history.js for inspiration (how we have done it for the history)
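
A sketch of how the bookmark URLs could be collected with chrome.bookmarks.getTree. The function names are hypothetical, and keeping title and dateAdded next to the URL is only a suggestion for extra data points. The extension needs the "bookmarks" permission in its manifest.

```javascript
// Walk a bookmark tree and collect every node that has a URL.
// Folders have `children` and no `url`; bookmarks have a `url`.
function flattenBookmarks(nodes) {
  const result = [];
  (nodes || []).forEach(function (node) {
    if (node.url) {
      result.push({ url: node.url, title: node.title, dateAdded: node.dateAdded });
    }
    if (node.children) {
      flattenBookmarks(node.children).forEach(function (b) { result.push(b); });
    }
  });
  return result;
}

// `bookmarksApi` is expected to look like chrome.bookmarks.
function listBookmarkUrls(bookmarksApi, callback) {
  bookmarksApi.getTree(function (tree) {
    callback(flattenBookmarks(tree));
  });
}
```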

Checklist before starting

Look through the bookmarks API and give suggestions on which data points we should also extract and store
Make README about the module and put it in the READMEs folder (https://github.com/WorldBrain/Research-Engine/tree/master/READMEs). And how do I make a good readme?

Download and store content data

Currently done by using the built in storage function of Falcon.

Summary

For each URL in our local IndexedDB storage where downloaded=false, we need to download the page and capture the following data:

domain_name (already there from #1)
url (already there from #1)
lastVisitTime (already there from #1)
text,
keywords(list),
authors(list),
publish_date,
tags(list),
summary,
links(list)

If successful, change downloaded=true
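
To make the list above concrete, here is a sketch of what one stored record could look like. The field names come from the list; the functions createRecord and markDownloaded and the empty default values are assumptions.

```javascript
// Hypothetical record shape based on the list above; the empty defaults
// are placeholders until the page has been downloaded and parsed.
function createRecord(url, lastVisitTime) {
  return {
    domain_name: new URL(url).hostname,
    url: url,
    lastVisitTime: lastVisitTime,
    text: '',
    keywords: [],
    authors: [],
    publish_date: null,
    tags: [],
    summary: '',
    links: [],
    downloaded: false
  };
}

// Return a copy with the parsed fields filled in and downloaded=true.
function markDownloaded(record, parsed) {
  return Object.assign({}, record, parsed, { downloaded: true });
}
```

Returning a copy instead of mutating the record keeps the "not yet downloaded" state intact if parsing fails halfway.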

Technical implementation

We could use Newspaper3K again or readability(as it is used now).

Questions to clear up

Before working on this task, the following has to be researched and cleared up. Please answer in the comments.

Can we store additional data in the current storage? This way we don't have to change anything in the search functions and storage of the tool, but can just add data. See #4 for more information about the current data storage.
Which library is the most stable and precise, and provides us with the most data? Tip: look for benchmarks.
In which format can we store data in IndexedDB, and what are the storage constraints?

Restructuring of code and files

Place to add notes and discuss the clean up of the code and file structure.

Notification system

We need a way to notify users with messages.

To display them, I see 4 options:

in the results list as the first result, if there are currently notifications
as a change in the icon in the top right corner and then in the popup
as a popup that automatically appears
in the dashboard as a separate page (there we also list all past notifications)

The plugin has to fetch them from our servers.

Discussion: How to approach server architecture

The plan for a user's autonomy / Server infrastructure

Hey folks,

thanks for your input on this.

What is the goal?

Our promise is to keep the user in full control over his data and to allow effective/asynchronous sharing of content-recommendations, content-associations and metadata.
Therefore there is a need to host the data somewhere in the cloud, but in the control of users.
This means that as soon as a server is needed, the idea is to make it as easy as possible for users to set up their own server with our firmware (i.e. a Docker container) that handles all the data storage and processing.

Secondary effect.

Making this form of decentralisation the default architecture could contribute to a more decentralised internet infrastructure, since we also reach many non-technical internet users.
Because the data of a user is always available via their servers this could also build the foundation for other decentralised projects to reach broader use.
(i.e. P2P social networks or decentralised web search engines like Yacy/Sersia)

We hope that it would lead to a shift of ecosystems that form around users, not around platforms.
This current, centralised and platform focussed, circumstance leads to unhealthy amassing of power on the web.

The architecture chart below shows how I imagine the system.

Stage 1:

In the first stage, it's just client-side software, the browser extension.
There is no communication with the outside world needed yet.
Currently the used DB is PouchDB.

Stage 2:

Providing a server that handles all the logic of syncing with the attached services and processing of data (like building the search indexes or analysing for related content).
It also has built in the first version of the communication API (called "Ragnorok-Module" as an homage to Daniel Suarez' Daemon & Freedom Books ;) )
In this stage this API is there to communicate with the different clients a user uses as well as provide a web-based interface the user can access from anywhere on the web.
Here we possibly have to sync an index to the local machine in order to provide off-line support.

Stage 3:

As soon as the system is working for the users themselves, we update the API to be able to talk to other APIs in the network and exchange information, like content recommendations, or provide searchable indexes of the pages other users visited.
In this stage people can start following each other and therefore build circles of trust.

I have a couple of questions:

How seamless can the process of setting up the servers be made for the user? (Important for non-technical users, as most of ours will be)
What kind of problems do you see with this architecture?
How can we make the code that runs on the server replicatable and agnostic from the server choice?
What storage solution do you know that is capable of running/syncing in an extension as well as on the server? Maybe also including a built in permission system to handle access. Afaik Pouch/CouchDB don't have that. remotestorage.js has for example.
If we use a system like the IPFS/IPDB, can we also host and run code there?
As far as I know, searching in encrypted datasets is not yet mature, so the question is: is it possible to add an encrypted access layer that would effectively sandbox data and its processing, making it unavailable to outside people without the right credentials?

Sync data across multiple devices + Manual backup (import/export)

Hey, I just thought about it: is sync possible with the amount of data that Research Engine has?
I am not a developer, but even I can tell that the data is really big to sync. If sync isn't possible at all, then what about import/export?

Update Onboarding Process

To reflect the new feature of importing your bookmarks, we have to update the onboarding process.

Subtasks:

Update Mockups
Update Process
- Update Info-Alert

Setting up a server, where we can host documents to be displayed in the extension

There are some pages currently in the extension, that will be updated fairly often, like the FAQ or the contributions page.
If we now want to update them, we have to update the tool and also the version number.
This is a rather unsuitable situation.

The idea is that we put those documents all on a server and then embed them in an iframe inside the extension. In case there is no internet connection, the cached page is shown.

Decide on data points that are stored locally for each URL.

Here we need to gather a list of all the data points we want to store for a URL and how we plan to get them.

Update database when bookmarks change

It can happen that users remove or add new bookmarks and therefore we need to update this in the database.
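
A sketch of how this could be wired up with the chrome.bookmarks change events (onCreated, onRemoved, onChanged). The function name watchBookmarks and the db object with add and remove methods are hypothetical stand-ins for whatever storage layer is used.

```javascript
// `bookmarksApi` is expected to look like chrome.bookmarks; `db` is a
// hypothetical object with add(id, url) and remove(id) methods.
function watchBookmarks(bookmarksApi, db) {
  bookmarksApi.onCreated.addListener(function (id, bookmark) {
    if (bookmark.url) { db.add(id, bookmark.url); } // folders have no url
  });
  bookmarksApi.onRemoved.addListener(function (id) {
    db.remove(id);
  });
  bookmarksApi.onChanged.addListener(function (id, changeInfo) {
    if (changeInfo.url) { db.add(id, changeInfo.url); } // URL was edited
  });
}
```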

Estimate Size and Download time

Part of #6

Summary

Based on the amount of urls to download, we need to estimate the size and download time.
This is important, since we develop the tool for heavy web researchers who could have collected many links; the size on the hard drive could therefore be big, or it could take a substantial amount of time to crawl all the pages.

Technical Implementation:

I would say we estimate a size of 50kb per page that needs to be downloaded and 2 seconds for the parsing. Download will run multithreaded. 3 downloads simultaneously.
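
The arithmetic can be sketched as follows. The constants come from the numbers above; the function name estimateImport is hypothetical, and treating the 2 seconds as the total time per page, spread over 3 parallel downloads, is an assumption.

```javascript
const SIZE_PER_PAGE_KB = 50;
const SECONDS_PER_PAGE = 2;
const PARALLEL_DOWNLOADS = 3;

// Rough estimate of disk size and duration for a given number of URLs.
function estimateImport(urlCount) {
  return {
    sizeMb: (urlCount * SIZE_PER_PAGE_KB) / 1024,
    // Pages are handled in batches of three, so round the batch count up.
    seconds: Math.ceil(urlCount / PARALLEL_DOWNLOADS) * SECONDS_PER_PAGE
  };
}
```

For 3948 URLs, this gives roughly 193 MB and 2632 seconds, which is about 44 minutes.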

Twitter profile is not indexed

When visiting a twitter profile, it is not indexed.

The assumption is that in content.js there is no callback that waits for the page to be parsed completely via document.body.innerText and the page gets saved before that.

Why this assumption?
When doing document.body.innerText manually, then it takes about 5-8 seconds until a result is given

Weirdly it only happens with this profile: https://twitter.com/Protohedgehog

Not with others.

Add Contribute to development page

There should be a tab in the admin interface that links to the contributor page here on GitHub

Importing PDF's?

Is there no option in the settings for importing PDFs?

Make diagram and write explanation for current architecture

Make a chart of how the current tool processes data and explain the processes.

Should be done by @arpitgogia and @oliversauter because they built the first version and saw how the code works.

Make diagram with anticipated processes and modules

Here we should figure out a first version of the updated architecture, split the research work to multiple people and come together again to discuss the chances and hurdles.

Here is the place to gather ideas on how to improve it, so we then can build the architecture diagram.

Implement PouchDB as local storage system

Summary

Current Architecture:
https://github.com/WorldBrain/Research-Engine/wiki

V1 of this task:

Replace current storages with PouchDB and current search with PouchDB Quick Search.

For later:

Status Quo
We currently save the data in SQLite3 format and save a new file for each domain.
I don't know if that is the best approach for us, but in general we have to store the data in a way, that allows us later to locally run more analysis on it, like ML or creating better filters.
Here is an example file:
https_www.theguardian.com_0.localstorage.txt
The data points that are stored for each URL are the title, text, url and the lastVisitTime as the key.

Product Goal:

We plan a web based results and filter page (domain, is_bookmark) and need a queryable database for the browser local storage for that.
Features:

Filter methods (time, domain, entities, metadata)
Connected content (see how you move through the web)
ML Analysis

Requirements data structure

Prio 1:

Implementable in a chrome extension
Extensibility of data model possible so we can add/change fields in later updates, without losing data.
Ability to locally query and analyse data in bulk so we can create better filters and run algorithms on it
Storage is cross-browser compatible, so we don't have to refactor major parts when porting to Firefox et al.
Possibility to add search layer on top
implementation for linked data possible (RDF?)

Questions to clear up

Please answer in the comments.
PRIO 1:

Can we query the current DB and how?
Can we easily extend the fields of the current DB?

PRIO 2:

Research possibilities and summarise options (descriptions, upsides, downsides)
What are the constraints of putting everything into local storage?
Which of the requirements can't we meet when putting data into local storage?
Is there an existing framework that already allows us to do this? Save development time.

Resources for Exploration:

Deploy Travis and test 1 function

We need to test our extension - no brainer :)

Scope of task

We need to deploy CI into our tool and make it run for 1 function. This should be the test if everything works. If this task is done, we can start testing all the other functions in our tool.

Subtasks:

Deploy travis in "Research-Engine" repo
Write test for 1 function

Important:

Chrome extensions have limitations on how JavaScript can be used; for example, you cannot use require() functions.
It is advised to look for tutorials/guides on the Travis + Web Extension combination.

Resources:

Web Interface for search results and filter options

Currently users can only retrieve 5 results at a time via the drop-down of the address bar and have limited filter options. Right now I can only filter by time, and only by typing e.g. after:"yesterday".

We need a relatively simple web interface that can be run locally and lets users filter for various metrics:

Keywords (via search bar)
Timeframe
Author
Source

Mockup:

Implement Sharing Buttons

We should implement buttons for sharing in the top right corner.
One big button labelled "spread the word" with a drop-down listing all the social networks. Found this library:
https://github.com/carrot/share-button
