plait.py - a fake data modeler

License: MIT License

Languages: Python 98.27%, Shell 1.14%, Makefile 0.60%
Topics: declarative, synthetic-data, modeling

plaitpy's Introduction

plait.py

plait.py is a program for generating fake data from composable yaml templates.

The idea behind plait.py is that it should be easy to model fake data that has an interesting shape. Currently, many fake data generators model their data as a collection of IID variables; with plait.py we can stitch together those variables into a more coherent model.

some example uses for plait.py are:

  • generating mock application data in test environments
  • validating the usefulness of statistical techniques
  • creating synthetic datasets for performance tuning databases

features

  • declarative syntax
  • use basic faker.rb fields with #{} interpolators
  • sample and join data from CSV files
  • lambda expressions, switch and mixture fields
  • nested and composable templates
  • static variables and hidden fields

an example template

# a person generator
define:
  min_age: 10
  minor_age: 13
  working_age: 18

fields:
  age:
    random: gauss(25, 5)
    # minimum age is $min_age
    finalize: max($min_age, value)

  gender:
    mixture:
      - value: M
      - value: F

  name: "#{name.name}"
  job:
    value: "#{job.title}"
    onlyif: this.age > $working_age

  address:
    template: address/usa.yaml
  phone: # add a phone if the person is older than the minor age
    template: device/phone.yaml
    onlyif: this.age > ${minor_age}

  # we model our height as a gaussian that varies based on
  # age and gender
  height:
    lambda: this._base_height * this._age_factor
  _base_height:
    switch:
      - onlyif: this.gender == "F"
        random: gauss(60, 5)
      - onlyif: this.gender == "M"
        random: gauss(70, 5)

  _age_factor:
    switch:
      - onlyif: this.age < 15
        lambda: 1 - (20 - (this.age + 5)) / 20
      - default:
        value: 1
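
The dependencies between fields in the template above can be read as ordinary sequential sampling. The following is a minimal plain-Python sketch of that reading, not plait.py's implementation. It assumes that the `mixture` picks each value with equal weight and that fields prefixed with `_` are hidden from the output; the `gen_person` helper and the placeholder job title are hypothetical, and the `name`, `address` and `phone` fields are left out.

```python
import random

# Constants mirroring the template's `define:` block.
MIN_AGE = 10
WORKING_AGE = 18


def gen_person(rng):
    """Generate one record following the field dependencies in the template."""
    # age: gaussian draw, then `finalize` clamps it to the minimum age.
    age = max(MIN_AGE, rng.gauss(25, 5))

    # gender: mixture of two values (equal weights are an assumption).
    gender = rng.choice(["M", "F"])

    # _base_height: the switch on gender picks which gaussian to sample.
    base_height = rng.gauss(60, 5) if gender == "F" else rng.gauss(70, 5)

    # _age_factor: the switch on age scales height down below 15, else 1.
    age_factor = 1 - (20 - (age + 5)) / 20 if age < 15 else 1

    record = {
        "age": age,
        "gender": gender,
        # height combines the two hidden fields, which are not emitted.
        "height": base_height * age_factor,
    }
    # job: only present when the `onlyif` condition holds.
    if age > WORKING_AGE:
        record["job"] = "placeholder job title"  # stands in for #{job.title}
    return record


rng = random.Random(42)
people = [gen_person(rng) for _ in range(1000)]
```

Because `height` is derived from `age` and `gender` rather than sampled independently, the generated columns are correlated, which is the difference from a collection of IID variables.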

how it's different

some specific examples of what plait.py can do:

  • generate proportional populations using census data and CSVs
  • create realistic zipcodes by state, city or region (also using CSVs)
  • create a taxi trip dataset with a cost model based on geodistance
  • add seasonal patterns (daily, weekly, etc) to data

usage

installation

# install with python
pip install plaitpy

# or with pypy
pypy-pip install plaitpy

cloning the repo for development

git clone https://github.com/plaitpy/plaitpy
cd plaitpy

# get the fakerb repo
git submodule init
git submodule update

generating records from command line

specify a template as a yaml file, then generate records from that yaml file.

# a simple example (if cloning plait.py repo)
python main.py templates/timestamp/uniform.yaml

# if plait.py is installed via pip
plait.py templates/timestamp/uniform.yaml

generating records from API

import plaitpy
t = plaitpy.Template("templates/timestamp/uniform.yaml")
# print() with parentheses works on both Python 2 and Python 3
print(t.gen_record())
print(t.gen_records(10))
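
A common next step is to write the generated records to a file. The sketch below writes one JSON object per line. It assumes that `gen_records(n)` returns an iterable of JSON-serializable dicts, which the README does not state; `write_jsonl` and `StubTemplate` are hypothetical names, and the stub only stands in for `plaitpy.Template` so the example runs without plaitpy installed.

```python
import io
import json


def write_jsonl(template, count, out):
    """Write `count` generated records to `out`, one JSON object per line."""
    for record in template.gen_records(count):
        out.write(json.dumps(record) + "\n")


class StubTemplate:
    """Hypothetical stand-in for plaitpy.Template (same gen_records method)."""

    def gen_records(self, n):
        return [{"id": i} for i in range(n)]


buf = io.StringIO()
write_jsonl(StubTemplate(), 3, buf)
lines = buf.getvalue().splitlines()
```

With plaitpy installed, the stub would be replaced by `plaitpy.Template("templates/timestamp/uniform.yaml")` and `buf` by an open file.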

looking up faker fields

plait.py also simplifies looking up faker fields:

# list faker namespaces
plait.py --list
# lookup faker namespaces
plait.py --lookup name

# lookup faker keys
# (--ll is short for --lookup)
plait.py --ll name.suffix

documentation

yaml file commands

  • see docs/FORMAT.md

datasets

  • see docs/EXAMPLES.md
  • also see templates/ dir

troubleshooting

  • see docs/TROUBLESHOOTING.md

Dependent Markov Processes

To simulate data that comes from many Markov processes (a Markov ecosystem), see the plaitpy-ipc repository.

future direction

If you have ideas for features to add, open an issue. Feedback is appreciated!

License

MIT

plaitpy's People

Contributors

isidentical

plaitpy's Issues

performance benchmark

Hey, we had a correspondence on HN. I didn't realize how bad the faker library was until you pointed out the performance issues. Your benchmark isn't accurate, though, because of how you're timing.
You're not benchmarking plait as well as you could be. Instead, try the "perf" library from PyPI:
pip install perf

Use this gist as a template to run plait through. The example is intuitive enough that you can figure out what to change for plait: https://gist.github.com/Dowwie/499d9fb0344d7f4345ff0e669a7d4a36

Simply execute the benchmark Python script from the command line and perf will run, eventually printing results to stdout.

I was curious how much faster a Rust binding would be than faker, so I started to work on a port. The library currently only generates a fake full name, but it won't take much time to port the rest over.

These are the perf results for pyfakers vs faker -- roughly 70 times faster using Rust, even with a little bit of cffi overhead:
pyfakers: Mean +- std dev: 1.73 ms +- 0.08 ms
faker: Mean +- std dev: 127 ms +- 1 ms

What are your plait perf results?
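
For readers who want a quick measurement without installing anything, here is a rough sketch that uses the standard library's `timeit` and `statistics` modules instead of perf. It is a simpler technique than what perf does and only imitates the "Mean +- std dev" output format quoted above. The `bench` helper and `fake_workload` are hypothetical; the workload is a placeholder to be swapped for a call that generates one record.

```python
import statistics
import timeit


def bench(func, number=100, repeat=5):
    """Time batches of `number` calls, `repeat` times; return (mean, stdev) in ms."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    millis = [t * 1000.0 for t in timings]
    return statistics.mean(millis), statistics.stdev(millis)


def fake_workload():
    # Hypothetical placeholder for the code under test.
    return sum(range(1000))


mean_ms, std_ms = bench(fake_workload)
print("workload: Mean +- std dev: %.2f ms +- %.2f ms" % (mean_ms, std_ms))
```

Each reported value covers a batch of `number` calls, so results are only comparable between runs that use the same `number`.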

Introduce support for JSON Templates

With something like...

yaml.safe_dump(json.load(f), default_flow_style=False)

You can easily add support for JSON templates, so that other tools that use the more popular JSON format can also exploit this mother of a weaver xD
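
The one-liner above can be wrapped into a small conversion helper. This is a sketch of the suggestion rather than an existing plait.py feature. It assumes PyYAML is installed, the `json_template_to_yaml` name and the example template are hypothetical, and whether plait.py accepts the converted output unchanged is not confirmed here.

```python
import json

import yaml  # PyYAML, a third-party package (pip install pyyaml)


def json_template_to_yaml(json_text):
    """Convert a JSON template string to block-style YAML text."""
    return yaml.safe_dump(json.loads(json_text), default_flow_style=False)


# Hypothetical JSON template with a single field.
example = '{"fields": {"age": {"random": "gauss(25, 5)"}}}'
yaml_text = json_template_to_yaml(example)
print(yaml_text)
```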

How to get formatted dates in a field?

I am trying to do something like the following using the datetime library:

define:
  now: time.time()
  max_offset: 60 * 1000

fields:
  _offset:
    random: (random() * $max_offset) - ($max_offset / 2)

  time:
    lambda: str(datetime.datetime.utcfromtimestamp($now + this._offset))

hide:
  - offset

But this always returns None. Is there something specific with using datetime?

Thanks-- love this package btw!
