karpathy / arxiv-sanity-lite

arxiv-sanity lite: tag arxiv papers of interest and get recommendations of similar papers in a nice UI, using SVMs over tfidf feature vectors based on paper abstracts.

Home Page: https://arxiv-sanity-lite.com

License: MIT License

Python 68.00% JavaScript 7.40% CSS 8.09% HTML 16.18% Makefile 0.32%
arxiv deep-learning machine-learning flask

arxiv-sanity-lite's Introduction

arxiv-sanity-lite

A much lighter-weight arxiv-sanity from-scratch re-write. Periodically polls arxiv API for new papers. Then allows users to tag papers of interest, and recommends new papers for each tag based on SVMs over tfidf features of paper abstracts. Allows one to search, rank, sort, slice and dice these results in a pretty web UI. Lastly, arxiv-sanity-lite can send you daily emails with recommendations of new papers based on your tags. Curate your tags, track recent papers in your area, and don't miss out!

I am running a live version of this code on arxiv-sanity-lite.com.

To run

To run this locally I usually run the following script to update the database with any new papers. I typically schedule this via a periodic cron job:

#!/bin/bash

python3 arxiv_daemon.py --num 2000

if [ $? -eq 0 ]; then
    echo "New papers detected! Running compute.py"
    python3 compute.py
else
    echo "No new papers were added, skipping feature computation"
fi

You can see that updating the database is a matter of first downloading the new papers via the arxiv api using arxiv_daemon.py, and then running compute.py to compute the tfidf features of the papers. Finally to serve the flask server locally we'd run something like:

export FLASK_APP=serve.py; flask run

The entire database is stored inside the data directory. If you'd like to run your own instance on the interwebs, I recommend simply running the above on a Linode; e.g. I am currently running this code on the smallest "Nanode 1 GB" instance, indexing about 30K papers, which costs $5/month.

(Optional) Finally, if you'd like to send periodic emails to users about new papers, see the send_emails.py script. You'll also have to pip install sendgrid. I run this script in a daily cron job.
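For intuition about the feature and recommendation steps described in this section, here is a minimal standard-library sketch. It is not the code from compute.py or the server: the function names (tfidf_vectors, train_linear_svm, recommend), the toy abstracts, and the hyperparameters are all assumptions made for illustration. Tagged papers are treated as positives and every other paper as a negative, a small linear SVM is fit by subgradient descent on tf-idf vectors, and the untagged papers are ranked by their score.

```python
import math
from collections import Counter


def tfidf_vectors(abstracts):
    """Return one L2-normalised tf-idf dict (term -> weight) per abstract."""
    docs = [Counter(text.lower().split()) for text in abstracts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed idf so that no term ends up with a zero weight.
        vec = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def dot(weights, vec):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(t, 0.0) * w for t, w in vec.items())


def train_linear_svm(vectors, labels, lam=1.0, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss (no bias).

    lam=1.0 is a strong regulariser chosen only to keep this toy example stable.
    """
    weights = {}
    n = len(vectors)
    for _ in range(epochs):
        grad = {t: lam * w for t, w in weights.items()}
        for vec, y in zip(vectors, labels):
            if y * dot(weights, vec) < 1.0:  # margin violated
                for t, w in vec.items():
                    grad[t] = grad.get(t, 0.0) - y * w / n
        for t, g in grad.items():
            weights[t] = weights.get(t, 0.0) - lr * g
    return weights


def recommend(abstracts, tagged):
    """Rank untagged paper indices by SVM score, highest first."""
    vectors = tfidf_vectors(abstracts)
    labels = [1 if i in tagged else -1 for i in range(len(abstracts))]
    weights = train_linear_svm(vectors, labels)
    candidates = [i for i in range(len(abstracts)) if i not in tagged]
    return sorted(candidates, key=lambda i: dot(weights, vectors[i]), reverse=True)


# Hypothetical abstracts; papers 0 and 1 carry the user's tag.
abstracts = [
    "deep neural networks for image classification",
    "convolutional neural networks for image recognition",
    "quantum entanglement in cold atoms",
    "neural networks for image segmentation",
    "dark matter halo simulations",
]
ranking = recommend(abstracts, tagged={0, 1})
print(ranking)  # paper 3 comes first; it shares the most vocabulary with the tagged papers
```

A real deployment would more likely rely on an established machine-learning library for both steps; the hand-rolled version is only meant to show the shape of the computation.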

Requirements

Install via requirements:

pip install -r requirements.txt

Todos

  • Make website mobile friendly with media queries in css etc
  • The metas table should not be a sqlitedict but a proper sqlite table, for efficiency
  • Build a reverse index to support faster search, right now we iterate through the entire database
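As a rough illustration of the reverse-index item above, the sketch below builds a token-to-paper-ids map once and answers a query by intersecting posting sets instead of scanning every paper. The names (tokenize, build_index, search), the paper ids, and the sample text are hypothetical, and the AND-only query semantics is an assumption rather than a description of the existing search.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(papers):
    """Map each token to the set of paper ids whose text contains it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for token in tokenize(text):
            index[token].add(paper_id)
    return index


def search(index, query):
    """Return ids of papers containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    # Start from the smallest posting set to keep the intersections cheap.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


papers = {  # hypothetical ids and text
    "p1": "attention models for translation",
    "p2": "graph attention networks",
    "p3": "diffusion models for image synthesis",
}
index = build_index(papers)
print(sorted(search(index, "attention networks")))  # ['p2']
```

The index would need to be rebuilt or updated whenever new papers are added, so it fits naturally after the feature computation step.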

License

MIT

arxiv-sanity-lite's People

Contributors

atdino, karpathy

arxiv-sanity-lite's Issues

Missing paper(s)?

I noticed that our paper is missing from arxiv-sanity. It was stuck in moderation for a while, so maybe it couldn't be indexed properly? I assume there might be other papers affected by the same issue.

Link to missing paper

Connection reset by peer

When running the arxiv_daemon, I mostly get the response "Connection reset by peer" from arxiv, and the request loops 1000 times before I get the message

"ok we tried 1,000 times, something is srsly wrong. exiting."

Is there a reason you set it to 1000? Why is this looped in the first place, is arxiv supposed to be finicky about this? Regardless, I feel like hammering arxiv so much is probably not preferred. Perhaps set it to a lower value?

The strange thing is, it doesn't always happen. Sometimes I get a connection immediately and a proper response from arxiv. But a successful response never arrives once a few "Connection reset" loops have occurred. Then, if I try again a minute later, it loops for the full 1000 times again. Is this an issue on arxiv's side (like I'm on a blocklist of one of their load-balancing servers), or is this an arxiv-sanity-lite issue? Any ideas?
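One common way to avoid hammering a remote API is a small, bounded number of attempts with an exponentially growing wait between them. The sketch below is not taken from arxiv_daemon.py; fetch_with_backoff, its parameters, and the flaky_fetch stand-in are hypothetical names used only to illustrate the idea, and the delays are arbitrary defaults.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # ConnectionResetError is a subclass of OSError
            if attempt == max_attempts:
                raise  # give up after a small, bounded number of attempts
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a stand-in fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError("Connection reset by peer")
    return "<feed/>"


waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # <feed/> [1.0, 2.0]
```

In real use, fetch would wrap the actual HTTP request, and the default sleep would pause between attempts.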

BioArxiv integration

Hi, great site :) Would there be capacity to integrate bioarxiv articles in the future? I am aware of forks that have done this, but they seem to be offline.

papers.labml.ai

Hi @karpathy,

We built papers.labml.ai in May (introductory tweet) to discover research papers based on popularity on Twitter. We were using arxiv-sanity to discover papers and I started this as a side project inspired by it (partly because it was down from time to time).

We worked on it on and off since May and have added a bunch of features, such as:

  • Popular papers based on Tweets
  • Links to source code, annotated implementations, videos, Reddit and Hacker News discussions, and other resources related to the paper
  • Conferences (iclr 2022, neurips 2021)
  • Short two-line summaries of the papers to quickly browse through lists of papers
  • Similar papers based on language model embeddings

And we are working on something very similar to tags on sanity-lite (which we call lists).

We would love to hear your feedback and suggestions. Thanks for releasing your work.

We are building Skim - inspired by arXiv Sanity with improvements :)

Hi @karpathy, thank you for introducing arXiv Sanity Lite!

A few of my peers and I are developing Skim (https://skimhq.tech), a "Spotify for the ML world" inspired by arXiv Sanity.

Currently it supports:

  • Creating a list of papers as a "Rack" (similar to a Spotify playlist)
  • Seeing similar papers based on TF-IDF features
  • Seeing popular conferences and their racks (arXiv papers bundled into racks based on their yearly proceedings), along with conference statistics, giving you complete information about a conference :)
  • Searching across all papers, racks, conferences and the user base

We would like to discuss this further and share an invite with you, so that we can collaborate on this and improve it over time.
Please let us know - [email protected]
