I like deep neural nets.
karpathy / arxiv-sanity-preserver
Web interface for browsing, search and filtering recent arxiv submissions
Home Page: http://www.arxiv-sanity.com/
License: MIT License
The conclusion section is often more compact and to the point than the abstract.
It would be great to have two summary tabs, "abstract" and "conclusion" (where available), for each paper in the list.
A general preference that sets which of the two is shown by default would also be nice.
How feasible would it be to expand to all categories in arXiv?
Per #33, you mention that it's important to keep communities small so that "top papers" are still relevant. Couldn't this still be maintained by having a user specify as part of their account which subcategories they work in? And then top papers for a user would do some sort of cross-category normalization to account for multiple communities of different sizes. Maybe we could also crowdsource clustering of categories into different research areas and have those preset (like it has been done for ML currently).
Would love to see this platform become widely adopted!
Would you consider a PR to use the list of txt files in analyze.py instead of querying the database? This would make it easier to make use of this script in other contexts. In addition, the script already skips over files when the text doesn't exist anyway. The only way the behaviour should be different is if there are some txt files that were manually placed in the folder for some reason.
Currently it's very difficult to view the website on mobile. It would be nice to have a mobile view.
Great project, btw.
I was wondering if using both a topic vector (LSA/LDA based, or even paragraph2vec...) and tf-idf would improve results.
The topic-vector score would be added to the tf-idf score with a low weight, so that shared words (with high tf-idf weight) stay very important, while the topic is still taken into account and can affect the document order.
What do you think?
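A minimal sketch of the weighting described above, not the project's actual ranking code. The function names, the tuple layout, and the default weight of 0.1 are all assumptions made for illustration; both scores are assumed to be similarities where higher means more similar.

```python
def combined_score(tfidf_score, topic_score, topic_weight=0.1):
    """Add a down-weighted topic similarity to the tf-idf similarity.

    topic_weight is a hypothetical tuning knob; 0.1 is an arbitrary default.
    """
    return tfidf_score + topic_weight * topic_score


def rank(candidates, topic_weight=0.1):
    """Sort (paper_id, tfidf_score, topic_score) tuples, best match first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], topic_weight),
        reverse=True,
    )


# Two papers with nearly equal tf-idf scores: the topic score breaks the tie.
candidates = [("a", 0.50, 0.1), ("b", 0.49, 0.9)]
print([paper_id for paper_id, _, _ in rank(candidates)])
```

With `topic_weight=0` the ordering falls back to plain tf-idf, which makes it easy to compare the two rankings side by side.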
I'm glad I found what I was looking for.
Again, would be nice to be able to rank/sort papers in a distributed fashion, e.g. among members of a research group. The more likes a paper collects the more likely it is to be discussed in the next reading group or the like.
Is arxiv-sanity-preserver under an open source license? Can we make modifications and contribute back?
I registered a week or so ago and added some papers to my library, but the recommended papers tab is empty.
These papers don't appear in arxiv-sanity:
I guess it's because they are listed under cs.IR, which isn't indexed by arxiv-sanity.
This is a bit strange as those papers could have been published under stat.ML or cs.CL.
Do you think cs.IR could be added to arxiv-sanity?
This issue is similar to #39 which seems to be fixed.
It would be really nice to have a rudimentary date filter.
https://arxiv.org/help/robots states: "Robots Beware: Indiscriminate automated downloads from this site are not permitted."
This means your code is doing something that arXiv explicitly forbids.
I've started indexing some of the Physics categories. My plan is to cover all of them, but I've started with physics.* and astro-ph.* for now.
The site is currently hosted at http://physics.arxiv-sanity.nolife.de/
I'd be willing to host it long term, if you want to focus on the already covered categories. Alternatively I could forward the PDFs, thumbnails and extracted texts to you, if you want to incorporate them in your site. What is your plan at the moment?
How do you want to handle domain names for forks? As a sub domain, or should I register a different one?
Hi!
I was recently thinking about similar service!
What do you think about ontology/connections graph visualization option?
I'm from the quantum information field and would be interested to use a similar service.
By the way, we have a crowd-rating website for arXiv papers https://scirate3.herokuapp.com/
It might be interesting to combine these two features.
Can a newsletter feature be added for suggested papers?
Hi @karpathy,
A few weeks ago I forked your code to add Twitter trends. I ended up with a different architecture (I wanted something more robust): I use Postgres and SQLAlchemy to record the Twitter stream.
I just open-sourced it, so you can use it if you want. It's pretty straightforward if you have a Postgres db.
=> https://github.com/BenderV/twitter_stream/tree/arxiv
I also have the same thing (SQLAlchemy) set up for arxiv (authors/papers/tags) if you are interested (I just need to do a little work before open-sourcing it).
Rather surprisingly, putting an ID, e.g. 1412.7210, directly into the search field fails to procure the corresponding paper.
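One way a search handler could special-case this is to check whether the query looks like an arXiv identifier before falling back to text search. The sketch below is only an illustration: `parse_arxiv_id` and the pattern are hypothetical, and it covers only new-style IDs such as 1412.7210 (optionally with a version suffix or an `arXiv:` prefix), not old-style IDs.

```python
import re

# New-style arXiv IDs: four digits, a dot, four or five digits,
# optionally followed by a version suffix such as "v2".
ARXIV_ID_RE = re.compile(r"^(?:arxiv:)?(\d{4}\.\d{4,5})(v\d+)?$", re.IGNORECASE)


def parse_arxiv_id(query):
    """Return the bare arXiv ID if the query looks like one, else None."""
    match = ARXIV_ID_RE.match(query.strip())
    if match is None:
        return None
    return match.group(1)


print(parse_arxiv_id("1412.7210"))      # 1412.7210
print(parse_arxiv_id("deep learning"))  # None
```

A caller would look the paper up directly when an ID is returned and run the normal text search otherwise.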
Papers often arrive in batches, or sometimes I can't check them for a few days. It would be nice to be able to see a chronological list that spans maybe a week.
As the text already contains $ and LaTeX code everywhere, it should be quite simple to display it properly using MathJax, or the very cool KaTeX.
Mathjax instructions: https://docs.mathjax.org/en/v2.6-latest/start.html
KaTeX instructions: https://github.com/Khan/KaTeX/blob/master/contrib/auto-render/README.md
See here for a version of the sanity-preserver that uses Mathjax: https://arxiv.babushk.in/.
I have saved 74 papers. Yes that is a bit much, but not that much. I thought it would help the recommendation algorithm, and also store papers that looked interesting that I might want to read in the future.
Now arxiv-sanity refuses to show all of my papers. The papers I have saved most recently do not appear on the list at all, whereas others do appear but are near the bottom of the list. And scrolling down hits "You hit the limit of number of papers to show is one result. [sic]"
I'm wondering if I now have to go through and unsave every paper and go back to using bookmarks or something. But I really like the convenience of arxiv-sanity, and the ability to take advantage of its recommendation algorithm.
Thanks for making this 💪
I think the site would benefit from having security improved. Unfortunately, people have a tendency to re-use passwords, and as of now, the password and the session cookie can be intercepted on the same network and in man-in-the-middle attacks.
Perhaps you can use Certbot (Let's Encrypt) for this?
Amazing work here!
Unfortunately I can't generate the images:
convert: unable to open image `pdf/********.pdf[0-7]`
But [0-7] should not be part of the file name; it is the range of pages I want. Any hint?
Like Google Alerts for search results, or citation notifications in Google Scholar: users would be able to set an alert for a search query, and any new submission that matches the query would trigger a notification/email.
Hi
I would like to add an RSS feed to the most recent papers tab. I was trying to set it up on my local machine based on the instructions in the README. It failed when I ran analyze.py:
C:\Users\<user>\Desktop\arxiv-sanity-preserver>python analyze.py
Traceback (most recent call last):
  File "analyze.py", line 29, in <module>
    txt = f.read()
  File "C:\Users\<user>\AppData\Local\Continuum\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2705: character maps to <undefined>
Any idea how to fix it?
Ravi
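The traceback shows the file being decoded with cp1252, the default text encoding on many Windows setups, which has no character for byte 0x9d. A common workaround is to open the text files with an explicit encoding. The helper below is a sketch only: `read_text` is a hypothetical name, and it assumes the extracted text files are UTF-8, which the source does not confirm.

```python
import io


def read_text(path):
    """Read a text file as UTF-8 regardless of the platform default.

    errors="replace" substitutes U+FFFD for undecodable bytes instead of
    raising UnicodeDecodeError. UTF-8 is an assumption about the files.
    """
    with io.open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```

Replacing a bare `open(path, 'r')` with a call like this keeps behaviour the same on Linux and macOS, where UTF-8 is usually the default already.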
The PDF link on articles goes to the article page, not the PDF. Is this by design?
Can support for http://www.jmlr.org/ papers be added?
Hi, sorry for my poor bug report. I'm new to GitHub and such.
I'm trying to use your program with two topics in the astrophysics domain.
Everything processed fine until the webserver-like thing tries to read some tables.
~/arxiv-sanity-preserver ❯❯❯ ./venv/bin/python serve.py --prod
/$HOME/arxiv-sanity-preserver/venv/lib/python2.7/site-packages/flask_limiter/extension.py:124: UserWarning: Use of the default get_ipaddr function is discouraged. Please refer to https://flask-limiter.readthedocs.org/#rate-limit-domain for the recommended configuration
UserWarning
Namespace(num_results=200, port=5000, prod=True)
loading db.p...
loading tfidf_meta.p...
loading sim_dict.p...
loading user_sim.p...
precomputing papers date sorted...
computing top papers...
Traceback (most recent call last):
  File "serve.py", line 415, in <module>
    top_counts = get_popular()
  File "serve.py", line 409, in get_popular
    libs = sqldb.execute('''select * from library''').fetchall()
sqlite3.OperationalError: no such table: library
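The error says that the SQLite database opened by serve.py has no `library` table. One plausible cause, which is an assumption and not confirmed here, is that the database file was created empty and the project's schema was never loaded into it. A quick check for whether a table exists, using only the standard `sqlite3` module (`table_exists` is a hypothetical helper, and the one-column table is a stand-in for the real schema):

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table with this name exists in the database."""
    row = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = ?",
        (name,),
    ).fetchone()
    return row is not None


# In-memory database used purely for demonstration.
conn = sqlite3.connect(":memory:")
print(table_exists(conn, "library"))  # False

# Stand-in table; the real columns come from the project's schema.
conn.execute("create table library (id integer primary key)")
print(table_exists(conn, "library"))  # True
```

Pointing `sqlite3.connect` at the actual database file instead of `:memory:` shows whether the table is missing there.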
http://www.arxiv-sanity.com/top?timefilter=3days&vfilter=all
This view is sometimes empty, just showing the 'load more' button. Might be related to #57.
Hi,
I'm not sure if this is normal, but analyzing a corpus of 800MB (ca. 16000 articles) runs out of memory on my machine with 8GB of RAM + 2GB of swap. Can someone with a background in data analysis judge if this is expected?
This might be the main obstacle to scaling the database to the physics section of arXiv, as I have only run the analysis on a small portion of it (less than a year for most sections, and not all relevant categories).
I'll try to profile the memory usage, but I hope the attempt isn't futile. :p
Hi Andrej,
Thanks a lot for your wonderful work, and especially for your efforts to further democratize AI.
Quick question: is there any way to reset the password? I looked at the code and at http://www.arxiv-sanity.com/ but didn't find anything for that.
Thanks,
Rasool
I'm trying to do a study of texts of ML papers, and am using these scripts to acquire paper texts. After running download_pdfs.py (with about 20000 candidate papers acquired using fetch_papers.py) I was seemingly blocked by arxiv after downloading 1201 papers.
Has anyone experienced this sort of rate-limiting, not during fetch_papers but during download_pdfs? I can't access arxiv at all (including for my regular research), and am wondering whether this sort of blocking goes away after a little while, or if I need to start worrying.
Thanks!
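For bulk downloads, spacing requests out reduces the chance of tripping a rate limit. The sketch below shows only the pacing logic and is not how download_pdfs.py works: `download_all`, the `fetch` callable, and the 5-second delay are hypothetical choices, and the delay is not an official arXiv figure.

```python
import time


def download_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that downloads one URL; sleep is injectable so
    the pacing can be exercised without actually waiting.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request but the first
        results[url] = fetch(url)
    return results
```

Passing the real download function as `fetch` keeps the pacing separate from the network code, so the delay can be tuned or tested on its own.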
Right now the most recent paper on arxiv-sanity is from the 11th of April, while on arxiv there are several new papers since then.
Is there a problem with the refresh?
Some feature requests/suggestions:
P.S. this is awesome...Many thanks!
This is a great tool, and I was thinking about extending it with paper archives beyond arXiv. But if everyone set up their own version of arxiv-sanity, the benefit of having access to other users' libraries for the recommendation system disappears.
Would you consider a system for exporting the library database? I guess something as simple as providing a regular static dump of DOI sets corresponding to anonymized libraries would be a good start. Then every admin of an arxiv-sanity setup (or other tools for that matter) can benefit from the user base of others.
elasticsearch is easy to use, to index your data, and it provides a RESTful API.
Would be great, e.g. for research groups to discuss papers/add comments/thoughts/etc
Hi,
ImageMagick is needed for the thumbnail creation, but is not listed in the readme.
Sorry for this silly question, but I couldn't find a sign-up option. How can I create an account and log in?
Thanks
poppler is required for the pdftotext dependency and should IMHO be mentioned in the readme
First of all: I love the web app! I had actually built something similar (PubVis), when a reviewer made me aware of the arxiv sanity preserver. One feature that I had implemented and that I think from your setup you could probably easily add as well is a search using full text similarity. The idea here is that when you start drafting a paper, you want to make absolutely sure you didn't miss any essential references. Instead of conducting multiple keyword searches, with the full text search you can just paste your existing abstract (+ other text) and it is transformed into a tf-idf vector and then used to find related papers by computing the cosine similarity to the existing papers.
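The full-text similarity search described above can be sketched in a few lines. This is an illustrative, dependency-free version and not the code of PubVis or arxiv-sanity: the whitespace tokenizer, the smoothed idf formula, and all function names are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer; a real system would do more.
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log(float(1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector (dict) over terms known to the corpus."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, docs, top_k=3):
    """Return indices of the top_k documents most similar to the query."""
    idf = build_idf(docs)
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)[:top_k]]
```

Usage: paste a draft abstract as `query` and pass the existing paper texts as `docs`; the returned indices point at the closest papers.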
This recent deepmind paper doesn't appear in arxiv-sanity. I believe it's because it's listed under cs.AI, which isn't indexed by arxiv-sanity.
This is a bit strange as it seems relevant to the other categories included in arxiv-sanity. This particular paper could have just as easily been posted in stat.ML or cs.LG.
For instance, I'm reading this paper and I see that it refers to ideas from previously published papers. I want to put this paper as a child of those research papers and maintain a tree so that I can keep track of the ideas from the paper in a systematic manner to aid my research. In other words, I want to visualize the path of knowledge that flows from one research paper to another.
It would be nice to receive an alert of some form when the authors upload a new revision of a paper I have previously saved in my library.
Let's add Top Hype papers for Last Year and All Time.
It would be interesting to know how social media reacted to papers written last year.
I notice you started accepting pull requests. I'm adding stemming now. Should I prepare it for you to merge when it's done?
Also I put the data into dockerized ElasticSearch/Kibana (similar to @rsarxiv suggestion). Just several lines of code and you've got a nice Kibana GUI for exploration. Found some interesting insights there. Interested in this as well? But it likes good disks, preferably SSD, for indexing.
I'm writing an iOS app named RSarXiv, which aims to recommend arxiv papers based on the user's behavior.
Maybe you can try it.
I would be honored if you could give me some advice.
You can search for rsarxiv to find the app in the App Store.
Thanks a lot