karpathy / arxiv-sanity-lite

arxiv-sanity lite: tag arxiv papers of interest and get recommendations of similar papers in a nice UI, using SVMs over tfidf feature vectors based on paper abstracts.

Home Page: https://arxiv-sanity-lite.com

License: MIT License

Python 68.00% JavaScript 7.40% CSS 8.09% HTML 16.18% Makefile 0.32%

arxiv deep-learning machine-learning flask

arxiv-sanity-lite's Introduction

arxiv-sanity-lite

A much lighter-weight arxiv-sanity from-scratch re-write. Periodically polls arxiv API for new papers. Then allows users to tag papers of interest, and recommends new papers for each tag based on SVMs over tfidf features of paper abstracts. Allows one to search, rank, sort, slice and dice these results in a pretty web UI. Lastly, arxiv-sanity-lite can send you daily emails with recommendations of new papers based on your tags. Curate your tags, track recent papers in your area, and don't miss out!

I am running a live version of this code on arxiv-sanity-lite.com.

To run

To run this locally I usually run the following script to update the database with any new papers. I typically schedule this via a periodic cron job:

#!/bin/bash

python3 arxiv_daemon.py --num 2000

if [ $? -eq 0 ]; then
    echo "New papers detected! Running compute.py"
    python3 compute.py
else
    echo "No new papers were added, skipping feature computation"
fi

You can see that updating the database is a matter of first downloading the new papers via the arxiv api using arxiv_daemon.py, and then running compute.py to compute the tfidf features of the papers. Finally to serve the flask server locally we'd run something like:
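As a rough illustration of what "computing the tfidf features" of abstracts means, here is a minimal standard-library sketch. It is not the code in compute.py: the function names (tokenize, tfidf_vectors, cosine), the tokenizer, and the smoothed idf formula are all assumptions chosen for brevity, and the SVM step that turns these features into per-tag recommendations is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep alphabetic runs of 2+ letters.
    return re.findall(r"[a-z]{2,}", text.lower())


def tfidf_vectors(abstracts):
    """Return one {term: weight} dict per abstract, L2-normalized."""
    docs = [Counter(tokenize(a)) for a in abstracts]
    n = len(docs)
    # Document frequency: the number of abstracts each term appears in.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed idf (an assumption), so terms found in every abstract
        # still receive a positive weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # Vectors are unit length, so the dot product is the cosine similarity.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

With vectors like these, abstracts that share distinctive vocabulary score closer to 1 and abstracts with no words in common score 0, which is the signal a per-tag classifier can then learn from.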

export FLASK_APP=serve.py; flask run

The entire database is stored inside the data directory. If you'd like to run your own instance on the interwebs I recommend simply running the above on a Linode, e.g. I am currently running this code on the smallest "Nanode 1 GB" instance indexing about 30K papers, which costs $5/month.

(Optional) Finally, if you'd like to send periodic emails to users about new papers, see the send_emails.py script. You'll also have to pip install sendgrid. I run this script in a daily cron job.

Requirements

Install via requirements:

pip install -r requirements.txt

Todos

Make website mobile friendly with media queries in css etc
The metas table should not be a sqlitedict but a proper sqlite table, for efficiency
Build a reverse index to support faster search, right now we iterate through the entire database
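The reverse index todo could look roughly like the sketch below: map each word to the set of papers containing it, so a query only touches the postings for its own words instead of scanning every paper. The function names, the paper ids, and the AND semantics are assumptions for illustration and say nothing about how the existing search ranks results.

```python
import re
from collections import defaultdict


def build_index(papers):
    """Map each lowercase word to the set of paper ids whose text contains it."""
    index = defaultdict(set)
    for pid, text in papers.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(pid)
    return index


def search(index, query):
    """Return ids of papers containing every query word (AND semantics, an assumption)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings = sorted((index.get(w, set()) for w in words), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


# Hypothetical paper ids and titles, for illustration only.
papers = {
    "p1": "Attention is all you need",
    "p2": "Graph attention networks",
    "p3": "Deep residual learning",
}
index = build_index(papers)
print(search(index, "graph attention"))
```

The index would need to be rebuilt or updated whenever new papers are added, which fits naturally into the same step that recomputes features.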

License

MIT

arxiv-sanity-lite's Issues

Missing paper(s)?

I noticed that our paper is missing from arxiv-sanity. It was stuck in moderation for a while, so maybe it couldn't be indexed properly? I assume there might be other papers affected by the same issue.

Link to missing paper

Impossible to unsubscribe if login is forgotten

If I've added an e-mail to my account and managed to forget the login, then there is no way to unsubscribe from what I can tell. One simple solution is to include the login in the e-mail body.

Connection reset by peer

When running arxiv_daemon, I mostly get the response "Connection reset by peer" from arxiv, which loops 1000 times before I get the message

"ok we tried 1,000 times, something is srsly wrong. exiting."

Is there a reason you set it to 1000? Why is this looped in the first place, is arxiv supposed to be finicky about this? Regardless, I feel like hammering arxiv so much is probably not preferred. Perhaps set it to a lower value?

Strange thing is, it doesn't always happen. Sometimes I do get a connection immediately and a proper response from arxiv. That never happens after a few loops of "Connection reset". Then, a minute later, if I try again it loops for the full 1000 times. Is this an issue on the arxiv side (like I'm on a blocklist of one of their load-balancing servers), or is this an arxiv-sanity-lite issue? Any ideas?
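For reference, one common alternative to a fixed 1000-try loop is a small number of attempts with exponential backoff. The sketch below is not what arxiv_daemon.py does; fetch_with_backoff, its parameters, and the default values are hypothetical. It also assumes the fetch callable raises ConnectionError (ConnectionResetError is a subclass); an HTTP library may wrap that in its own exception type, in which case the except clause would need widening.

```python
import random
import time


def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each connection error."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the wait each time, cap it, and add jitter so that
            # repeated retries do not hit the server in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

With the hypothetical defaults above this makes at most five requests, with base waits of 1, 2, 4 and 8 seconds (plus jitter) between them, rather than hammering the server.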

bioRxiv integration

Hi, great site :) Would there be capacity to integrate bioRxiv articles in the future? I am aware of forks that have done this, but they seem to be offline.

papers.labml.ai

Hi @karpathy,

We built papers.labml.ai in May (introductory tweet) to discover research papers based on popularity on Twitter. We were using arxiv-sanity to discover papers and I started this as a side project inspired by it (partly because it was down from time to time).

We worked on it on and off since May and have added a bunch of features, such as:

Popular papers based on Tweets
Link source codes, annotated implementations, videos, Reddit and Hackernews discussions, and other resources related to the paper
Conferences (iclr 2022, neurips 2021)
Short two-line summaries of the papers to quickly browse through lists of papers
Similar papers based on language model embeddings

And we are working on something very similar to tags on sanity-lite (which we call lists).

We would love to hear your feedback and suggestions. Thanks for releasing your work.

Suggestion: add an entry indicating where the article was published

Add a line under each entry on the home page stating where the article was published. This corresponds to the "Comments" field on the arxiv.org website.

We are building Skim - inspired by arXiv Sanity with improvements :)

Hi @karpathy, thank you for introducing arXiv Sanity Lite!

A few of my peers and I are developing Skim (https://skimhq.tech), a "Spotify for the ML world" inspired by arXiv Sanity.

Currently it supports:

Creating a list of papers as "Rack" (similar to Spotify playlist)
See similar papers based on TF-IDF features
See popular conferences and their racks (arXiv papers bundled into racks based on their yearly proceedings) and conference statistics as well - giving you complete information about a conference :)
Search across all papers, racks, conferences and user base

We would like to discuss this further and share an invite with you, so that we can collaborate on this and improve it over time.
Please let us know - [email protected]