sirrice / pygg


ggplot2 syntax in Python. Actually a wrapper around Wickham's ggplot2 in R.

License: MIT License

Python 95.98% R 1.61% Jupyter Notebook 2.41%

pygg's Introduction

pygg

Particularly good if you have preprocessed CSVs or Postgres data to render. Passable support for simple data in Python lists, dictionaries, and pandas DataFrame objects.

pygg allows you to use ggplot2 syntax nearly verbatim in Python, and execute the ggplot program in R. Since this is just a wrapper and passes all arguments to the R backend, it is almost completely API compatible.

For a nearly exhaustive list of supported ggplot2 functions, see bin/make_ggplot2_functions.R.

Setup and Usage

Setup

  • install R
# on osx
brew install R

# on Linux, e.g., Ubuntu (the R package is named r-base)
sudo apt-get install r-base
  • install R packages (run the following in the R shell)
install.packages("ggplot2")
install.packages("RPostgreSQL")   # optional

Install

pip install pygg

Command line usage

runpygg.py --help
runpygg.py -c "ggplot('diamonds', aes('carat', 'price')) + geom_point()" -o test.pdf
runpygg.py -c "ggplot('diamonds', aes('carat', 'price')) + geom_point()" -csv foo.csv

For Python usage, see tests/example.py

from pygg import *

# Example using diamonds dataset (comes with ggplot2)
p = ggplot('diamonds', aes('carat', y='price'))
g = geom_point() + facet_wrap(None, "color")
ggsave("test1.pdf", p+g, data=None)

Details, Utils, and Quirks

The library performs a simple syntactic translation from python ggplot objects to R code. Because of this, there are some quirks regarding datasets and how we deal with strings.

Datasets

In R, ggplot directly references the data frame object present in the runtime (e.g., ggplot(<datasetname>, aes(...))). However, the Python objects being plotted are not directly available in the R runtime.
pygg provides two ways of loading datasets from Python into R.

The primary way is to explicitly pass the data object to ggsave using its data keyword argument. ggsave then converts the data object to a suitable CSV file, writes it to a temp file, and loads it into the data variable in R for use with the ggplot2 functions.

For example (notice that the string "data" is passed to ggplot()):

    df = pandas.DataFrame(...)
    p = ggplot("data", aes(...)) + geom_point()
    ggsave("out.pdf", p, data=df)

In addition, we provide several convenience functions that generate the appropriate R code for common python dataset formats:

  • csv file: if you have a CSV file already, provide the filename to data
        p = ggplot("data", aes(...)) + geom_point()
        ggsave("out.pdf", p, data="file.csv")

        # or more explicitly, pass a wrapped object that represents the csv file:

        ggsave("out.pdf", p, data=data_py("file.csv"))

  • python object: if your data is a python object in columnar ({x: [1,2], y: [3,4]}) or row ([{x:1,y:3}, {x:2,y:4}]) format
        p = ggplot("data", aes(...)) + geom_point()
        ggsave("out.pdf", p, data={'x': [1,2], 'y': [3,4]})
  • pandas dataframe: if your data is a pandas data frame object already you can just provide the dataframe df directly to data
        p = ggplot("data", aes(...)) + geom_point()
        ggsave("out.pdf", p, data=df)
  • PostgreSQL: if your data is stored in a Postgres database
        p = ggplot("data", aes(...)) + geom_point()
        ggsave("out.pdf", p, data=data_sql('DBNAME', 'SELECT * FROM ...'))
  • existing R datasets: you can refer to any R dataframe object using the first argument to ggplot()
        p = ggplot('diamonds', aes(...)) + geom_point()
        ggsave("out.pdf", p, data=None)
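
The conversion described above (Python object to CSV, then a load statement in R) can be sketched with the standard library alone. This is a minimal illustration, not pygg's actual implementation: the helper names rows_from, to_csv_text, and r_load_statement are hypothetical, and the shape of the R line is only modelled on the read.csv call that appears in pygg's error output.

```python
import csv
import io


def rows_from(data):
    """Normalize columnar or row-oriented data into (header, rows)."""
    if isinstance(data, dict):
        # Columnar format: {"x": [1, 2], "y": [3, 4]}
        header = list(data.keys())
        rows = list(zip(*(data[name] for name in header)))
    else:
        # Row format: [{"x": 1, "y": 3}, {"x": 2, "y": 4}]
        # (assumes a non-empty list whose rows share the same keys)
        header = list(data[0].keys())
        rows = [tuple(row[name] for name in header) for row in data]
    return header, rows


def to_csv_text(data):
    """Serialize the normalized data as CSV text."""
    header, rows = rows_from(data)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def r_load_statement(csv_path, varname="data"):
    """Build the R line that loads a CSV file into a variable (hypothetical helper)."""
    return '{} = read.csv("{}",sep=",")'.format(varname, csv_path)
```

Both input formats yield the same CSV text, which is why the same ggplot("data", ...) call works regardless of how the Python data was laid out.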

String arguments

By default, the library directly prints a Python string argument into the R code string. For example, the following Python code to set the x axis label would generate incorrect R code:

    # incorrect python code
    scale_x_continuous(name="special label")

    # incorrect generated R code
    scale_x_continuous(name=special label)

    # correct python code
    scale_x_continuous(name="'special label'")

    # correct generated R code
    scale_x_continuous(name='special label')

    # less convenient but more explicit alternative syntax
    scale_x_continuous(name=pygg.esc('special label'))

You'll need to explicitly wrap these types of strings (intended as R strings) in a layer of quotes. For convenience, we automatically provide wrapping for common functions:

    # "filename.pdf" is wrapped
    ggsave("filename.pdf", p)
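
To make the quoting rule concrete, here is a small stand-alone sketch of the "paste strings verbatim" translation. It does not use pygg itself: render_call and quote_r_string are hypothetical names invented for this illustration, and the real library's internals may differ.

```python
def quote_r_string(text):
    """Wrap text in single quotes so R reads it as a string literal.

    Hypothetical helper; backslashes and single quotes are escaped first.
    """
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def render_call(func_name, **kwargs):
    """Render an R function call, pasting every value verbatim."""
    args = ",".join("{}={}".format(key, value) for key, value in kwargs.items())
    return "{}({})".format(func_name, args)


# Unquoted: R would see two bare symbols and fail to parse.
print(render_call("scale_x_continuous", name="special label"))

# Quoted: R sees a string literal.
print(render_call("scale_x_continuous", name=quote_r_string("special label")))
```

The same rule applies to any argument that R expects as a string, for example stat='identity' in geom_bar.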

Convenience Functions

Passing data to ggplot() directly

It feels silly to pass a dummy "data" string to ggplot() and then pass the object to ggsave. We have extended the ggplot() call so it recognizes non-string Python data objects and uses the data object by default during the ggsave call:

    df = pandas.DataFrame(...)
    p = ggplot(df, aes(...)) + geom_point()
    ggsave("out.pdf", p)

    p = ggplot(dict(x=[0,1], y=[3,4]), aes(x='x', y='y')) + geom_point()
    ggsave("out.pdf", p)

Note that unlike ggsave, ggplot() is not smart enough to distinguish string arguments that are R variable names from those that are file names. Thus, the following will likely lead to an error because it assumes the R variable data.csv exists in the environment, when in reality it is the name of a CSV file to be loaded:

    p = ggplot("data.csv", aes(x='x', y='y')) + geom_point()
    ggsave("out.pdf", p)

Simply wrap the filename with a data_py() call:

    p = ggplot(data_py("data.csv"), aes(x='x', y='y')) + geom_point()
    ggsave("out.pdf", p)

Axis Labels

axis_labels() is a shortcut for setting the x and y axis titles and scale types. The following names the x axis "Dataset Size (MB)" and sets it to log scale, names the y axis "Latency (sec)" with the default continuous scale, and sets the breaks for the x axis to [0, 10, 100, 5000]:

    p = ggplot(...)
    p += axis_labels("Dataset Size (MB)", 
                    "Latency (sec)", 
                    "log10",  
                    xkwargs=dict(breaks=[0, 10, 100, 5000]))
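
As a rough mental model, the call above can be thought of as shorthand for one ggplot2 scale function per axis. The sketch below generates that R text in plain Python. It is a guess at the expansion, not pygg's implementation: axis_labels_sketch and r_vector are hypothetical names, and the exact R code pygg emits may differ.

```python
def r_vector(values):
    """Render a Python sequence as an R vector literal, e.g. c(0, 10)."""
    return "c({})".format(", ".join(str(v) for v in values))


def axis_labels_sketch(xlabel, ylabel, xtype="continuous", ytype="continuous",
                       xkwargs=None, ykwargs=None):
    """Build R text for one scale_<axis>_<type>() call per axis (assumed expansion)."""

    def scale(axis, label, kind, extra):
        args = ["name='{}'".format(label)]
        for key, value in (extra or {}).items():
            if isinstance(value, (list, tuple)):
                value = r_vector(value)
            args.append("{}={}".format(key, value))
        return "scale_{}_{}({})".format(axis, kind, ", ".join(args))

    return scale("x", xlabel, xtype, xkwargs) + " + " + scale("y", ylabel, ytype, ykwargs)


r_code = axis_labels_sketch("Dataset Size (MB)", "Latency (sec)", "log10",
                            xkwargs=dict(breaks=[0, 10, 100, 5000]))
print(r_code)
```

Under these assumptions the x axis maps to ggplot2's scale_x_log10 and the y axis to scale_y_continuous.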

Questions

Alternatives

  • yhat's ggplot: yhat's port of ggplot is really awesome. It runs everything natively in python, works with numpy data structures, and renders using matplotlib. pygg exists partly due to personal preference, and partly because the R version of ggplot2 is more mature, and its layout algorithms are really really good.

  • pyggplot: Pyggplot does not adhere strictly to R's ggplot syntax but pythonifies it, making it harder to transpose ggplot2 examples. Also pyggplot requires rpy2.

  • plotnine: another implementation of ggplot2 in Python

pygg's People

Contributors

psfotis, sirrice, tyberiusprime

pygg's Issues

Pass data objects into ggplot call

Current version of ggplot() takes a variable name as input, by default "data", and relies on ggsave()'s prefix argument to set the data object.

ggplot('data', aes(...)) + ggsave(..., prefix=data_py(dataobject))

Modify the ggplot() call to accept a data object as input, and let it configure the prefix under the covers. ggsave's prefix argument can still be used for full control.

Support Python 3

I am not familiar with the Python-R bindings, but glancing over pygg's codebase it seems fairly doable.

ggplot with two data

In this R script, there are two datasets: mean_wt and mtcars. However, pygg currently resolves only one dataset and cannot resolve mean_wt.

mean_wt <- data.frame(cyl = c(4, 6, 8), wt = c(2.28, 3.11, 4.00))
ggplot(mtcars, aes(mpg, wt, colour = wt)) +
  geom_point() +
  geom_hline(aes(yintercept = wt, colour = wt), mean_wt) +
  facet_wrap(~ cyl)

Example source

Error with stat='identity' in geom_bar

A minimal example is

import pygg
import pandas as pd

data = pd.DataFrame({'x': range(10), 'y': range(10, 20)})
p = pygg.ggplot(data, pygg.aes(x='x', y='y'))
g = pygg.geom_bar(stat='identity')
pygg.ggsave('file.png', p + g, data=None)

The traceback says

ValueError: ggplot2 bridge failed for program: library(ggplot2)

data = read.csv("/tmp/tmpICcyg2",sep=",")

p = ggplot(data,aes(x=x,y=y)) + geom_bar(stat=identity)
ggsave("file.png",p,height=8,scale=1,width=10). Check for an error

Unit tests are too limited

Current unit tests for pygg are too limited in scope. Many individual functions (e.g., data_py, is_pandas_df, to_r) can be tested quite effectively.

Update ggplot functions generator

Maybe the R script can generate a standalone pygg_functions.py file that pygg.py can import *? That way we don't need to copy and paste in the future.

Generalize to_r function to convert more python expressions to R

Right now to_r doesn't understand basic Python types like lists and dictionaries, and it would be more natural to use those data structures when invoking ggplot via pygg in some function calls. to_r should be generalized to recursively convert common Python data structures to reasonable R equivalents.

For example, right now you have to say:

    p += pygg.scale_y_continuous(limits="c(0, 1)")

but it would be more natural to be able to say this as:

    p += pygg.scale_y_continuous(limits=[0, 1])
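
One possible shape for such a recursive converter is sketched below. It is only an illustration of the proposal, not the existing to_r: the name to_r_sketch and the chosen mappings (list to c(...), dict to list(...), None to NULL) are assumptions, and strings are passed through verbatim to match the behaviour described under "String arguments".

```python
def to_r_sketch(value):
    """Recursively render a Python value as R source text (hypothetical)."""
    if value is None:
        return "NULL"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Pasted verbatim: callers still quote strings meant as R strings.
        return value
    if isinstance(value, (list, tuple)):
        return "c({})".format(", ".join(to_r_sketch(v) for v in value))
    if isinstance(value, dict):
        items = ("{}={}".format(k, to_r_sketch(v)) for k, v in value.items())
        return "list({})".format(", ".join(items))
    raise TypeError("cannot convert {!r} to R".format(type(value)))


print(to_r_sketch([0, 1]))
```

With a converter like this, limits=[0, 1] and limits="c(0, 1)" would produce the same R text.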

Remove special code for facet_*

facet_grid and facet_wrap have a special API right now to handle the formulas in R. Update the code so that the interface isn't specialized and you use the functions directly with escaped strings:

Today:

p = p + facet_grid('x', 'y')

Proposed API:

p = p + facet_grid(esc("x ~ y"))
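
For comparison, the translation that the current two-argument API has to perform can be written as a small formula builder. This is a sketch under assumptions: facet_formula is a hypothetical helper, and pygg's real handling of missing sides may differ.

```python
def facet_formula(lhs=None, rhs=None):
    """Build an R facet formula string from optional left and right sides."""
    if lhs is None and rhs is None:
        raise ValueError("at least one side of the formula is required")
    if lhs is None:
        # One-sided formula, as in facet_wrap(~ color)
        return "~ {}".format(rhs)
    if rhs is None:
        # "." stands for "no faceting variable on this side"
        return "{} ~ .".format(lhs)
    return "{} ~ {}".format(lhs, rhs)


print(facet_formula("x", "y"))
print(facet_formula(None, "color"))
```

Under the proposed API this helper would be unnecessary, since the caller writes the formula string directly.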

Use tempfile to store temporary CSV files

The current code uses a hardcoded temp file, /tmp/_pygg_data.csv, which may cause conflicts between multiple versions of pygg running on the same machine, and leaks data onto the filesystem that outlives the program using pygg. Replace it with a true temporary file from tempfile.
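
A minimal sketch of what that replacement could look like, using only the standard library. The context manager name temporary_csv is hypothetical. delete=False is used so that a separate R process can open the file by name, and the file is removed explicitly once the caller is done.

```python
import csv
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_csv(header, rows):
    """Write rows to a uniquely named CSV file and remove it afterwards."""
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    )
    try:
        with handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # The file is closed here, so another process (e.g. R) can read it.
        yield handle.name
    finally:
        os.remove(handle.name)


with temporary_csv(["x", "y"], [(1, 3), (2, 4)]) as path:
    # In pygg this is where the R script reading `path` would run.
    with open(path, newline="") as f:
        print(list(csv.reader(f)))
```

Because each call gets its own file name, concurrent runs do not overwrite each other's data, and nothing is left behind after the block exits.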

Python 3?

Is there any progress on python 3 compatibility?
