bcqguo / papergraph

This project forked from dennybritz/papergraph

AI/ML citation graph with postgres + graphql

Home Page: http://papergraph.dbz.dev/

Rust 0.96% PLpgSQL 0.19% Jupyter Notebook 98.77% Dockerfile 0.05% TSQL 0.03%

papergraph's Introduction

papergraph

papergraph is a Rust library and binary to build and manage a citation graph from Semantic Scholar data, focused on AI/ML papers (for now). Data is stored in a postgres database with a Hasura GraphQL backend (schema) on top for easy graph queries. It comes with Jupyter notebooks that show you how to analyze and visualize the data.

Live version at https://papergraph.dbz.dev

Thanks to @ArtirKel for the useful feedback and ideas.

Notebooks

The following notebooks work out of the box using a publicly available API endpoint for the data. You can run them locally, or in the cloud via Google Colab. Please read the caveats about the public endpoint below!

Use Cases

  • Finding landmark papers - Papers with a large number of citations may be considered landmark papers. The ideas in such papers often form the foundation for incremental improvements. Given some arbitrary paper you're interested in, you may want to know which landmark papers you should study for the required background knowledge.
  • Reference research - When writing a paper, you don't want to miss prior work. Looking through the citation graph for a related paper can help you find potentially interesting papers to read and cite.
  • Graph Analysis - Run sophisticated graph algorithms on the dataset to gain insights.
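
The landmark-paper use case can be illustrated on a toy citation graph. The following is a minimal Python sketch, assuming the citation edges have already been fetched into a list of (citing, cited) pairs. The paper IDs and the function name `landmark_papers` are hypothetical and not part of papergraph; the notebooks may take a different approach.

```python
from collections import Counter, deque


def landmark_papers(citations, start, top_n=3):
    """Rank papers reachable from `start` through references by citation count.

    `citations` is an iterable of (citing_id, cited_id) pairs.
    """
    # Build the reference adjacency list and count incoming citations.
    references = {}
    cited_count = Counter()
    for citing, cited in citations:
        references.setdefault(citing, []).append(cited)
        cited_count[cited] += 1

    # Breadth-first search over the reference graph, starting at `start`.
    seen = {start}
    queue = deque([start])
    while queue:
        paper = queue.popleft()
        for ref in references.get(paper, []):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    seen.discard(start)

    # Most-cited first; ties are broken by ID so the result is deterministic.
    ranked = sorted(seen, key=lambda p: (-cited_count[p], p))
    return ranked[:top_n]


# Hypothetical toy data: each pair means "first paper cites second paper".
toy_citations = [
    ("p4", "p1"), ("p4", "p3"),
    ("p3", "p1"), ("p3", "p2"),
    ("p5", "p1"), ("p2", "p1"),
]
print(landmark_papers(toy_citations, "p4"))  # ['p1', 'p2', 'p3']
```

Counting citations over the whole edge list, rather than only inside the reachable subgraph, favors papers that are widely cited overall. Restricting the count to the subgraph would be an equally reasonable choice.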

Graph Example

IMPORTANT! Using the public endpoint

The database is publicly available at http://34.107.246.233/v1/graphql, so please be gentle with your queries! This is running on a small postgres server that I'm paying for, so please don't overload it with automated scripts. Be nice :) As long as you're running queries by hand through notebooks everything should be fine.
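
One way to stay gentle when querying by hand is to enforce a small pause between requests. The sketch below uses only the Python standard library and the usual GraphQL-over-HTTP convention of a JSON POST body with `query` and `variables` keys. The one-second interval and the names `build_request` and `PoliteClient` are assumptions for illustration, not an official limit or part of papergraph.

```python
import json
import time
import urllib.request

ENDPOINT = "http://34.107.246.233/v1/graphql"  # public endpoint named in this README
MIN_INTERVAL_SECONDS = 1.0  # assumed polite delay, not an official rate limit


def build_request(query, variables=None, endpoint=ENDPOINT):
    """Build (but do not send) a GraphQL POST request."""
    payload = json.dumps({"query": query, "variables": variables or {}})
    return urllib.request.Request(
        endpoint,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


class PoliteClient:
    """Send queries one at a time, waiting at least `min_interval` between them."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS):
        self.min_interval = min_interval
        self._last_call = None

    def run(self, query, variables=None):
        # Sleep if the previous request was sent too recently.
        if self._last_call is not None:
            wait = self.min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        self._last_call = time.monotonic()
        request = build_request(query, variables)
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
```

For example, `PoliteClient().run("query { __typename }")` sends a trivial query that is valid against any GraphQL schema. Real queries need the table and field names from the exported schema.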

If you want to do lots of queries you should clone this repo and build the database yourself locally or in the cloud. Instructions for this are below. If you are running Kubernetes, you can also use the scripts in deploy/.

Building the database from a postgresql snapshot

TODO. See this issue

Building the database from scratch

Requirements:

  • Docker

If you want to build the database from scratch, you must download the full S2 research corpus. The total compressed size is currently around 120GB.

Clone the repo and download the corpus

git clone https://github.com/dennybritz/papergraph
cd papergraph
aws s3 sync --no-sign-request s3://ai2-s2-research-public/open-corpus/2020-04-10/ data/s2-research-corpus

Start up an empty postgres database server and create the schema

export DATABASE_URL=postgres://papergraph:papergraph@postgres:5432/papergraph
export RUST_LOG=info

# Run the postgres docker container
docker-compose up postgres

# Set up the database and run migrations
docker run --rm --network papergraph_default \
  -e DATABASE_URL \
  dennybritz/papergraph \
  diesel database setup

Now that we have a postgres server with the right database schema running, we need to insert the data:

# Assuming you downloaded the data into data/s2-research-corpus
# as shown in the AWS command above
DATA_PATH=data/s2-research-corpus/s2-corpus-017.gz

# Repeat this for all files you want to insert
# This will take a while. On my laptop, each file takes around 1min.
docker run --rm -it --network papergraph_default \
  -e DATABASE_URL -e RUST_LOG \
  -v `pwd`/${DATA_PATH}:/data/${DATA_PATH} \
  dennybritz/papergraph \
  papergraph insert -d /data/${DATA_PATH}
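
Repeating the insert command by hand for every file gets tedious, so the loop can be scripted. The following Python sketch builds the same `docker run` invocation for each corpus file. It assumes all files match `s2-corpus-*.gz` (inferred from the single file name above), that `DATABASE_URL` and `RUST_LOG` are exported in the calling shell, and it drops `-it` because the script is non-interactive. The function names are hypothetical.

```python
import glob
import os
import subprocess


def insert_command(data_path, cwd):
    """Return the docker command from this README for one corpus file."""
    host_path = os.path.join(cwd, data_path)
    container_path = "/data/" + data_path
    return [
        "docker", "run", "--rm", "--network", "papergraph_default",
        "-e", "DATABASE_URL", "-e", "RUST_LOG",
        "-v", f"{host_path}:{container_path}",
        "dennybritz/papergraph",
        "papergraph", "insert", "-d", container_path,
    ]


def insert_all(pattern="data/s2-research-corpus/s2-corpus-*.gz", dry_run=True):
    """Print (dry run) or execute the insert command for every matching file."""
    cwd = os.getcwd()
    commands = [insert_command(path, cwd) for path in sorted(glob.glob(pattern))]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop at the first failing insert.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    insert_all(dry_run=True)  # switch to dry_run=False to actually insert
```

Running it with `dry_run=True` first lets you confirm the mounted paths before the long-running inserts start.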

Now that we have seeded the database, we can also start Hasura to serve the GraphQL API. Stop the postgres docker process with ctrl+c and run

docker-compose up

You should now be able to access the API via http://localhost:8080.

Freshness

papergraph is updated when new data snapshots become available. This typically happens once a month. This means it will not contain all the latest papers.

Misc

Generating postgres database dumps

pg_dump -h localhost -p 15432 -F tar -U papergraph papergraph > pg_dump.tar

Build docker image

docker build -t dennybritz/papergraph .

Export graphql schema

gq http://34.107.246.233/v1/graphql --introspect > hasura/schema.graphql  
