GithubHelp home page GithubHelp logo

datagen's Introduction

datagen: make sh[2] up

Datagen helps you create sample delimited data using a simple schema format. It runs on Python 2.6-3.4 and particularly well on PyPy.

Installation

pip install datagen

Or:

$ git clone https://github.com/toddwilson/datagen.git
$ cd datagen
$ python setup.py install

Usage

usage: datagen [-h] [-d DELIMITER] [--with-header] -n NUM_ROWS -s SCHEMA [output]

1. Create a schema file

$ cat > schema.txt <<EOL
    #name      type[argument]
    id         int[6]
    first      firstname
    last       lastname
    email      email
    dob        date[after=1945-01-01, before=2001-01-01]
    password   string[8]
    is_active  bool
    language   randomset[python,ruby,go,java,c,js,brainfuck]
EOL

2. Make data

$ datagen -s schema.txt -n 5 --with-header
id|first|last|email|dob|password|is_active|language
238476|Velma|Medrano|[email protected]|1948-01-12|KmAcXnnS|1|python
202490|Kathy|Wellman|[email protected]|1960-11-12|BwtZnRUN|1|java
905703|Fern|Odell|[email protected]|1972-12-12|ipVagvEB|0|c
130211|Khadijah|Sheffield|[email protected]|1961-02-02|ijAVDWUY|0|java
643257|Patricia|Cummings|[email protected]|1960-05-01|GJdImZaw|0|ruby

3. Actually start working on what you should be working on

Types

bool: 1 or 0 randomly.

int[length]: Random unsigned integer.

Params:

  • length: max-length

Example:

number  numberint[3]

509
49
783

incrementing_int: Automatically incrementing unsigned integer.

Example:

id  incrementing_int

1
2
3

string[length]: Random case-insensitive string.

Params:

  • length: max-length

Example:

code  string[4]

FiwH
Acbj
EtGM

randomset[list]: Random member from a list

Params:

  • set: a comma-separated list of values

Example:

country  randomset[US,UK,MX,CA,NZ]

MX
US
CA

ipv4: IPv4 address

Example:

ip  ipv4

18.149.184.112
66.170.176.163
186.49.28.83

date: ISO 8601 date (YYYY-MM-DD)

Params:

  • before: ISO 8601 date top limit
  • after: ISO 8601 bottom limit

Example:

start_date  date[after=2013-01-01, before=2014-01-01]

2013-10-05
2013-01-10
2013-05-14

datetime: ISO 8601 datetime (YYYY-MM-DDTHH:MM:SS)

Params:

  • before: ISO 8601 datetime top limit
  • after: ISO 8601 bottomtime limit

Example:

start_at  datetime[after=2013-01-01T00:00:00, before=2014-01-01T00:00:00]

2013-10-03T13:00:23
2013-05-12T00:00:06
2013-09-20T03:18:02

ssn: 9-digit Social Security Number

Example:

ssn  ssn

421-87-2421
889-27-3485
861-33-1570

firstname: Randomized first name (from top names in US Census data)

Example:

first  firstname

Todd
Jessika
Dustin

lastname: Randomized last name (from top names in US Census data)

Example:

last  lastname

Rivers
Akins
Reardon

zipcode: 5-digit zipcode

Example:

zip  zipcode

47245
59502
20191

state: US States (2 letter)

Example:

state  state

ID
KY
AK

email: Email address

Example:

email  email

[email protected]
[email protected]
[email protected]

Adding Your Own Types

It's really easy to add your own types to use in a schema file. Just create a method that accepts a single argument and decorate it with datagen.types.reg_type.

Example:

<my_datagen.py>

from random import uniform
from datagen.types import reg_type
from datagen import main


@register_type("price")  # the decorator sets the name of the type
def price(arg):  # the method must accept one argument (even if not used)
    return round(uniform(0, 100), 2)


if __name__ == '__main__':
    main()

<schema.txt>

item_id   int[5]
price     price
$ python my_datagen.py -s schema.txt -n 3
41746|7.32
4077|40.55
12814|43.82

Adding Arguments to Your Types

<my_datagen.py>

from random import uniform
from datagen.types import register_type, type_arg
from datagen import main


@type_arg("price")  # Use the same name as the type defined in reg_type()
def price_argument(arg):  # This method is passed the contents of what's in price[]
    return int(arg)  # This will get passed to price() when iterating


@register_type("price")  # the decorator sets the name of the type
def price(max_price):  # the method must accept one argument (even if not used)
    return round(uniform(0, max_price), 2)


if __name__ == '__main__':
    main()

<schema.txt>

item_id   int[5]
price     price[10]
$ python my_datagen.py -s schema.txt -n 3
66995|5.08
5894|7.86
53659|9.26

Performance

If you need datagen to write faster, use PyPy:

$ time python my_datagen.py -s schema.txt -n 1000000 > test_data
python my_datagen.py -s schema.txt -n 1000000 > test_data  7.87s user 0.07s system 99% cpu 7.950 total

$ time pypy my_datagen.py -s schema.txt -n 1000000 > test_data
pypy my_datagen.py -s schema.txt -n 1000000 > test_data  2.79s user 0.06s system 99% cpu 2.863 total

datagen's People

Contributors

toddwilson avatar

Watchers

 avatar  avatar  avatar

Forkers

isabella232

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.