anyfetch / deprecated_box-provider.anyfetch.com

anyFetch + Box

Home Page: http://box.provider.anyfetch.com

License: MIT License

JavaScript 100.00%

Box-provider

While talking with the Box team here at TC Disrupt, we realised search was a challenge for modern applications. Users want more, and they want it faster.

Unlike its competitors, Box already offers a basic search interface. However, we thought we could do better.

We got to work, hoping to search the contents of every document: whether it be a PowerPoint presentation, a photo of a sheet of paper or a Markdown file, we wanted to generate a preview of the file and a meaningful snippet.

Since Box offers a fully featured API, we started coding in Node.js.

Node.js is our tool of choice for writing asynchronous code. It lets us fetch the contents of up to 15 folders at the same time while downloading files and uploading them to our indexing server.
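
As an illustration of capping the number of folder requests in flight, here is a minimal sketch. It is not the provider's actual code: `mapWithLimit` and `fakeFetchFolder` are hypothetical names, and the fake fetch stands in for a real Box API call.

```javascript
// Run `worker` over `items`, with at most `limit` calls in flight at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Hypothetical stand-in for a Box folder listing request.
function fakeFetchFolder(id) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`folder-${id}`), 10);
  });
}

// Fetch 40 folders, never more than 15 at the same time.
const folderIds = Array.from({ length: 40 }, (_, i) => i);
mapWithLimit(folderIds, 15, fakeFetchFolder).then((names) => {
  console.log(`fetched ${names.length} folders`);
});
```

The same helper could throttle file downloads by passing a smaller limit.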

Installation

Clone the repository, then run npm install.

To run the code, you'll need to specify a set of tokens. Write this into a keys.sh file and source it before running npm start:

# Go to https://www.box.com/ to register a new app
export BOX_ID="your_box_id"
export BOX_SECRET="your_box_secret"

# Callback after box consent, most probably https://your-host/init/callback
export BOX_CALLBACK_URL="http://localhost:8000/init/callback"
export BOX_CONNECT_URL="http://localhost:8000/init/connect"

# AnyFetch app id and secret
export BOX_ANYFETCH_ID=""
export BOX_ANYFETCH_SECRET=""

export BOX_TEST_REFRESH_TOKEN="waiting for box"
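
A quick way to check that keys.sh was sourced correctly is to look for unset or empty variables before starting. The snippet below is only a sketch and not part of this repository; `missingKeys` is a hypothetical helper, and the list of names is copied from the example above (the test-only refresh token is left out).

```javascript
// Variable names taken from the keys.sh example.
const REQUIRED_KEYS = [
  'BOX_ID',
  'BOX_SECRET',
  'BOX_CALLBACK_URL',
  'BOX_CONNECT_URL',
  'BOX_ANYFETCH_ID',
  'BOX_ANYFETCH_SECRET',
];

// Return the names of required variables that are unset or empty in `env`.
function missingKeys(env) {
  return REQUIRED_KEYS.filter((name) => !env[name]);
}

const missing = missingKeys(process.env);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(', ')}`);
}
```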

How it works

We first get authorization from the Box.com user to use our API. We then retrieve metadata about every file in every folder of the Box user, using a recursive breadth-first search. Each file is then downloaded (throttled to 5 concurrent files) and uploaded to our indexer server. This indexer server, available at http://anyfetch.com, begins "hydrating" the document using the open source Tika project to retrieve a textual representation of the file. For images, we use Tesseract, a free OCR engine, to extract content.

Finally, every piece of data is put into ElasticSearch for fast and accurate querying.
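
The breadth-first walk over folders can be sketched as follows. This is a simplified illustration under stated assumptions, not the provider's real implementation: `TREE`, `listFolder` and `collectFiles` are hypothetical, and the in-memory tree replaces actual Box API responses.

```javascript
// Hypothetical in-memory folder tree standing in for Box API responses.
const TREE = {
  root: { folders: ['docs', 'pics'], files: ['readme.md'] },
  docs: { folders: ['old'], files: ['slides.ppt'] },
  pics: { folders: [], files: ['scan.png'] },
  old: { folders: [], files: ['notes.txt'] },
};

// Hypothetical listing call; a real provider would query the Box API here.
async function listFolder(id) {
  return TREE[id];
}

// Visit folders in breadth-first order, collecting file metadata on the way.
async function collectFiles(rootId) {
  const files = [];
  const queue = [rootId];
  while (queue.length > 0) {
    const folderId = queue.shift();
    const { folders, files: found } = await listFolder(folderId);
    for (const name of found) {
      files.push({ folder: folderId, name });
    }
    // Subfolders go to the back of the queue, so shallower folders come first.
    queue.push(...folders);
  }
  return files;
}

collectFiles('root').then((files) => {
  console.log(files.map((f) => f.name).join(', '));
  // prints "readme.md, slides.ppt, scan.png, notes.txt"
});
```

Each collected entry would then be handed to the download and upload step.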

About the tests

The test suite is quite poor, because the Box.com API forbids using the same refresh_token twice: you need to request a new set of tokens every time you want to run the tests. The only alternative is to write tests with the access_token, but they'll break after an hour :(

We spoke with the Box staff, and they said they may add a developer refresh_token with an infinite lifespan... can't wait.
