anopsy / arxiv-frontpage

This project forked from koaning/arxiv-frontpage

My personal frontpage app

Home Page: https://koaning.github.io/arxiv-frontpage/

Python 27.27% Makefile 0.09% HTML 72.64%

Arxiv Frontpage

Today's frontpage can be viewed here:

https://koaning.github.io/arxiv-frontpage/

What's this?

This project is an attempt at making my own frontpage of Arxiv. Every day it git-scrapes new Arxiv articles via a Python API. Then another cronjob runs a script that attempts to make recommendations based on the annotations that reside in this repo. The result is committed as a new index.html page, which is hosted by GitHub Pages.

This project is very much a personal one and may certainly see a bunch of changes in the future. But I figured it would be nice to host it publicly so that it may inspire other folks to make their own feed as well.

The project assumes that you're using Prodigy to annotate your data. You're still free to copy the code and change it to use alternative labelling tools but you will have to make some code changes to get that to work.

Contents

  • There is a config.yml file that contains definitions of the labels. All scripts will make assumptions based on the contents of this file.
  • There is a taskfile that contains some common commands.
  • There is a .github folder that contains all the cronjobs.
  • There is a frontpage Python module that contains all the logic to prepare data for annotation, to train sentence-models and to build the new site.
  • There are two benchmark*.ipynb files that contain some scripts that I've used to run benchmarks. Some attempts were done with LLMs via spacy-llm, while others were done with pretrained embeddings.
  • This project assumes a .env file, which you can use if you intend to use Weights & Biases to store custom sentence transformers or to use external embedding providers.

Notes

This project also explores how to pragmatically bootstrap a predictive project. There are a few things in particular worth highlighting that feel somewhat unique.

First off, instead of active learning this project assumes active teaching. There are multiple methods available to select a subset of interest which help the user steer the algorithm.

If you want to explore the options, you can run:

python -m frontpage annotate

This will give a menu that you can use to select the subset selection method. You can annotate on a sentence-level or abstract-level and select from a number of tricks to find an interesting subset.

In terms of modelling, this project employs a sentence-model that makes a prediction per sentence. It's possibly not the most performant modelling approach, but it is easy to interpret. It also helps make the model more understandable in the UI.
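To make the per-sentence idea concrete, here is a minimal sketch. It is not the code from the frontpage module: the naive regex splitter, the `toy_score` stand-in for a trained model, and the choice to summarise an abstract by its highest-scoring sentence are all assumptions made for illustration.

```python
import re


def split_sentences(abstract):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def score_abstract(abstract, score_sentence):
    # Score every sentence on its own, so each score can be shown next to its sentence.
    scored = [(s, score_sentence(s)) for s in split_sentences(abstract)]
    # Assumption: summarise the abstract by its best sentence score.
    best = max((score for _, score in scored), default=0.0)
    return scored, best


def toy_score(sentence):
    # Hypothetical stand-in for a trained sentence classifier.
    return 0.9 if "benchmark" in sentence.lower() else 0.1


scored, best = score_abstract("We propose a new benchmark. It is fast.", toy_score)
for sentence, score in scored:
    print(f"{score:.1f}  {sentence}")
```

Because every sentence carries its own score, a UI can highlight exactly which sentence triggered a recommendation.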

This sentence-model can be trained by first finetuning the embedding with a SetFit-like approach. We first use all the labels to finetune a new embedding layer, after which we train a classifier head for each label.
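As a rough illustration of the first step, contrastive finetuning in the SetFit style is typically driven by sentence pairs: two sentences with the same label should end up close in embedding space, and two with different labels far apart. The sketch below only builds such pairs. The `make_pairs` helper and the single-label-per-sentence input are assumptions for illustration, not this project's implementation.

```python
from itertools import combinations


def make_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label) examples.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs


# Hypothetical annotated sentences and label names.
examples = [
    ("We release a new dataset.", "new-dataset"),
    ("This corpus contains 10k documents.", "new-dataset"),
    ("We finetune a large language model.", "llm"),
]
pairs = make_pairs(examples)
```

These triples could then be fed to a similarity-based loss when finetuning the embedding model, after which the embeddings are frozen and the per-label heads are trained on top.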

I ended up writing custom implementations for a lot of this because my annotations don't really fit the "multi-label classification" setting. Some sentences may have two labels, but many will only have a single one. That means that my label matrix has lots of NaN values, which goes against the assumptions of many libraries out there.
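One way to cope with such a sparse label matrix is to train each label's head only on the rows where that label was actually annotated. The sketch below shows that masking step. The nearest-centroid head is a deliberately simple stand-in, not the classifier used in this repo, and the toy embeddings and labels are made up.

```python
import math


def labelled_rows(X, Y, label_idx):
    # Keep only the rows where this label was annotated (value is not NaN).
    kept = [(x, int(y[label_idx])) for x, y in zip(X, Y)
            if not math.isnan(y[label_idx])]
    return [x for x, _ in kept], [t for _, t in kept]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_head(X, Y, label_idx):
    # Assumes the label has at least one positive and one negative annotation.
    X_sub, y_sub = labelled_rows(X, Y, label_idx)
    pos = centroid([x for x, t in zip(X_sub, y_sub) if t == 1])
    neg = centroid([x for x, t in zip(X_sub, y_sub) if t == 0])
    return pos, neg


def predict(head, x):
    pos, neg = head
    return 1 if math.dist(x, pos) < math.dist(x, neg) else 0


nan = float("nan")
# Toy 2-d "embeddings" and a label matrix with missing annotations.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Y = [[1, nan], [1, 0], [0, 1], [nan, 1]]

heads = [fit_head(X, Y, j) for j in range(2)]
predictions = [predict(head, [1.0, 0.0]) for head in heads]
```

Each head sees a different subset of rows, so an unannotated cell is never mistaken for a negative example.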

Contributors

koaning
