finraos / datagenerator Goto Github PK

DataGenerator is a Java library for systematically producing large volumes of data. DataGenerator frames data production as a modeling problem, with a user providing a model of dependencies among variables and the library traversing the model to produce relevant data sets.

Home Page: http://finraos.github.io/DataGenerator

License: Apache License 2.0

Java 64.79% Scala 35.21%

datagenerator's Introduction

Quick Start Videos

https://www.youtube.com/playlist?list=PLB0Zha5q-7wJp3TLH782J7ZDQ2RwPS_hQ

Contributing

We encourage contribution from the open source community to make DataGenerator better. Please refer to the development page for more information on how to contribute to this project.

Maven Dependency

For the core

<dependency>
    <groupId>org.finra.datagenerator</groupId>
    <artifactId>dg-core</artifactId>
    <version>2.2</version>
</dependency>

For the commons library

<dependency>
    <groupId>org.finra.datagenerator</groupId>
    <artifactId>dg-common<artifactId>
    <version>2.2</version>
</dependency>

Building

DataGenerator uses Maven for build. Please install Maven by downloading it from here.

# Clone DataGenerator git repo
git clone git://github.com/FINRAOS/DataGenerator.git
cd DataGenerator

# Checkout master branch
git checkout master

# Run package to compile and create jar (also runs unit tests)
mvn package

# Compile and run unit tests only
mvn test

License

The DataGenerator project is licensed under Apache License Version 2.0

Overview

Data Generator generates pattern using two pieces of user provided information:

An SCXML state chart representing interactions between different states, and setting values to output variables
A user Transformer that formats the variables and stores them.

The user can optionally provide their own distributor that distributes the search of bigger problems on systems like hadoop. By default, DataGenerator will use a multithreaded distributor.

Quick start

For the full compilable code please see the default example

First step, define an SCXML model:

<scxml xmlns="http://www.w3.org/2005/07/scxml"
       xmlns:cs="http://commons.apache.org/scxml"
       version="1.0"
       initial="start">

    <state id="start">
        <transition event="SETV1" target="SETV1"/>
    </state>

    <state id="SETV1">
        <onentry>
            <assign name="var_out_V1_1" expr="set:{A1,B1,C1}"/>
            <assign name="var_out_V1_2" expr="set:{A2,B2,C2}"/>
            <assign name="var_out_V1_3" expr="77"/>
        </onentry>
        <transition event="SETV2" target="SETV2"/>
    </state>

    <state id="SETV2">
        <onentry>
            <assign name="var_out_V2" expr="set:{1,2,3}"/>
            <assign name="var_out_V3" expr="#{customplaceholder}"/>
        </onentry>
        <transition event="end" target="end"/>
    </state>

    <state id="end">
        <!-- We're done -->
    </state>
</scxml>

This model contains five variables controlled by two states. The transition between those states is unconditional. One of those variables is always constant ( var_out_V1_3 ). Three will acquire every value from a set ( var_out_V1_1, var_out_V1_2 and var_out_V2 ). var_out_V3 will be set to a holder value that will be replaced by the user at a later point.

The second step will be to write a Transformer. The code is here

public class SampleMachineTransformer implements DataTransformer {

    private static final Logger log = Logger.getLogger(SampleMachineTransformer.class);
    private final Random rand = new Random(System.currentTimeMillis());

    /**
     * The transform method for this DataTransformer
     * @param cr a reference to DataPipe from which to read the current map
     */
    public void transform(DataPipe cr) {
        for (Map.Entry<String, String> entry : cr.getDataMap().entrySet()) {
            String value = entry.getValue();

            if (value.equals("#{customplaceholder}")) {
                // Generate a random number
                int ran = rand.nextInt();
                entry.setValue(String.valueOf(ran));
            }
        }
    }

}

The above transformer will intercept every generated row, and convert the place holder "#customplaceholder" with a random number.

The last step will be writing a main function that ties both pieces together. Code is here

    public static void main(String[] args) {

        Engine engine = new SCXMLEngine();

        //will default to samplemachine, but you could specify a different file if you choose to
        InputStream is = CmdLine.class.getResourceAsStream("/" + (args.length == 0 ? "samplemachine" : args[0]) + ".xml");

        engine.setModelByInputFileStream(is);

        // Usually, this should be more than the number of threads you intend to run
        engine.setBootstrapMin(1);

        //Prepare the consumer with the proper writer and transformer
        DataConsumer consumer = new DataConsumer();
        consumer.addDataTransformer(new SampleMachineTransformer());
        consumer.addDataWriter(new DefaultWriter(System.out,
                new String[]{"var_out_V1_1", "var_out_V1_2", "var_out_V1_3", "var_out_V2", "var_out_V3"}));

        //Prepare the distributor
        DefaultDistributor defaultDistributor = new DefaultDistributor();
        defaultDistributor.setThreadCount(1);
        defaultDistributor.setDataConsumer(consumer);
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        engine.process(defaultDistributor);
    }

The first few lines will open an input stream on the SCXML file and pass the stream to the engine. Calling setBootStrapMin will attempt to split the graph generated from the state chart to at least the given number of splits. Here we passed 1 but in case you will execute the same code over hadoop or use a multithreaded version, you will need to increase that number to be at least the number of threads or mappers you wish to run. The rest of the code will set our transformer to the engine and create a writer based on the DefaultWriter. The function of the writer is to write the output to the user's desired destination.

The final piece sets the number of threads and called engine.process.

datagenerator's People

Contributors

Stargazers

Watchers

Forkers

dpulitano mibrahim kood1 sirithink smxjrz bryantrobbins yukareal mpeter28 amax1mov isian14 koshya mchao47 nguyen02 hanhanx ewd-avengers dawsonwu rad-raiders dedgar1 anshaa nperera101 checkmyfunk moonkim abing313 wnilkamal yourfrienddhruv prathithacb alexlee316 mailaugus m-dub rnalakurthi ejoynes frictionlesscoin aditi457 ankitad87 pateldhavalj mauricio1990silva stmcpherson ibachabi maheshchandra-cs sumtimes amyorchid188 nolanc33 nids786 gryinn nellorish sunileman codeaudit begolas tdnhome aayushidwivedi01 soyjesusguillen shaneebersole skryzhya willsp96 aayushidwivedi mzbotr smartpcr kitsushadow hasfari vbriabrin sameradra johanwitters vijayachander cakelly wmudge alexocculate victoremmanuel justudin knappk tmdufresne shraddha-patel leeny324 cagerber medamineosm gitrdd cocoon24 lingjiangxi dchisarick zsmj513 goutham470 mostafamowafy castillag ajfritsch lkafle skrahul brijeshrpatel9 k23936 mikev37 mauriciosilva1990 santiagos11 katespue vcdurmo nithinmangalore muhammadali14 tspannhw newtex junglekan gitter-badger petermaffessanti brandi1221

datagenerator's Issues

CheckStyle: DataWriter.java

Update the pom to reflect DataGenerator name

Job controller

Allow the user the execute jobs from the GUI editor.

Create a DG maven archetype

We need standard way of postprocess data

One customer want to generate json data (to use it as mocks in testing his system)...

Maybe we can provide easy way to:

export as json
export as xml
export to rational DB
...

Add output hdfs

CheckStyle: ChartExec.java

Implement Interface and Standard Behavior to help Map Reduce based Jobs stop generation

Create an interface to allow the user of DG to generate a set amount of rows (like 1 million rows instead of 15 million). Provide a standard implementation, as well as ensure the interface is useful for people using DG across other distributed frameworks #36

Ability define expressions and calculations on nodes and edges.

R1.4.8: Ability define expressions and calculations on nodes and edges. Expressions should be able to reference variable types described in R1.4.2 to R1.4.5.

R1.4.8.1: Simple mathematical expressions. (Add, Subtract, Divide, Multiply)
R1.4.8.2: Simple string manipulation. (Concatenate, Replace, Trim, Sub String, RegEx replace)

https://github.com/FINRAOS/DataGenerator/wiki/DataGenerator-V2-Requirements#r14-common-rules-for-r11-r12-and-r13

Code cleanup

Fix all checkstyle issues. Switch checkstyle to break the build on errors.

CheckStyle: SearchDistributor.java

CheckStyle: ScXmlUtils.java

Create a disk based list using JE

We need result analyser

Now we need to analysis / verify results manually.
It's easy to make a mistake in xml script and generate little bit different data set from desire one.

So, if we will have some kind of analysis tool, we will be able to analysis generated data and easy verify it.

Also we can use it for analysis generated data and external/original data to see difference.

BFS hangs sometimes when the number of splits don't exactly match

BFS hangs sometimes when the number of splits don't exactly match the ranks of the sets in the states.

CheckStyle: DefaultWriter.java

We need page with example of use

Now new user will be frustrated about real power of DG because of lack of knowledge.

Review #39

Setup the release process with SONATYPE

https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide

Prepare a release
Deploy release to a staging repository (see 7.a.3 in the docs above)
Use the Nexus UI to manually push a staged and “all clear” release to Maven central (See section 8 in docs above) They have several style checks (I think all are related to the POM)

Ability to set complex objects as values for variables within the dependency model

Ability to set complex objects as values for variables within the dependency model.

Refer to:
https://github.com/FINRAOS/DataGenerator/wiki/DataGenerator-V2-Requirements#r14-common-rules-for-r11-r12-and-r13

Implement the following:
R1.4.3
R1.4.4

CheckStyle: ReportingHost.java

Develop a format-preserving encryption

R1.4.12: Ability define conditionals on nodes and edges which would effectively enable selective traversal dependent on the condition evaluating to true.

R1.4.12: Ability define conditionals on nodes and edges which would effectively enable selective traversal dependent on the condition evaluating to true.

Note: This means if a conditional fails on a node or an edge there will be no further traversal till the end node. The variables already set in the path traversed will be either considered or not considered as a valid scenario set depending on the global job configuration variable set.

Research further features possibly required to support transformation of other formats into SCXML

Research further features possibly required to support transformation of other formats into SCXML.

e.g.
Requirement formats --> SCXML Model XML

http://www.w3.org/TR/scxml/

Implementation of all combinations and pairwise combinations within the dependency model

Implementation of all combinations and pairwise combinations within the dependency model.

Refer to section R1.4.20(1.4.20.1 , 1.4.20.2 and1.4.20.3) in https://github.com/FINRAOS/DataGenerator/wiki/DataGenerator-V2-Requirements#r14-common-rules-for-r11-r12-and-r13

Implement ALL NODES traversal based scenario generation

Ability to generate data such that all nodes of a dependency model will be exercised in the final resultant data set.

GUI editor

A locally running jetty can share a GUI (jenkins style) that can be used to edit state charts and save them locally, possibly allowing user collaboration.

Jetty's data distribution

Allow the main job to fire a jetty instance that responds to requests from the tasks and distribute computations during execution.

Add checkstyle & PMD

Add checkstyle & PMD to the pom such that travis build will fail on violations.

Default style is here:
https://java.net/projects/woodstock/sources/svn/content/trunk/webui/tools/checkstyle/sun_checks.xml

We can copy it, and use it as a starting point for checkstyle.

Add more unit tests

#17

Negative scenarios

Features that allow DataGenerator to insert negative scenarios in the generated data set.

Implement ALL EDGES traversal based scenario generation

Ability to generate data such that all edges of a dependency model will be exercised in the final resultant data set.

SearchWorker Race Condition

In simple models, the first SearchWorker finishes its DFS and sets the exit flag before subsequent SearchWorkers can enqueue results.

Split into multiple modules

Split into multiple modules:

The main DG library
Samples depending on the DG library
Main pom governing both. Make sure checkstyle still works for the new configuration.

CheckStyle: SystemParameter.java

CheckStyle: DataTransformer.java

Have the ability to learn a given data set.

https://github.com/FINRAOS/DataGenerator/wiki/DataGenerator-V2-Requirements#r30-have-the-ability-to-learn-a-given-data-set

Test the impact of SCXML2.0

http://commons.apache.org/proper/commons-scxml/roadmap.html

We need to check what are the differences between the current library and what's coming next. Also if we can contribute back to them.

We need a good documentation + Getting start in 5 / 30 docs

Review Issue #39

CheckStyle: LogInitializer.java

Ability define a database queries and simple flat files(R1.4.9.2) as a way to define variable assignments.

R1.4.9: Ability define a database queries(R1.4.9.1) and simple flat files(R1.4.9.2) as a way to define variable assignments. Look at the diagram below; the specification on the left is identical to the specification on the right. The only difference is the specification on the right is using a SQL query to define its dataset. Note the SQL can be replaced with a file URL which would contain the table/joined data in a CSV format.

https://github.com/FINRAOS/DataGenerator/wiki/DataGenerator-V2-Requirements#r14-common-rules-for-r11-r12-and-r13

CheckStyle: DataPipe.java

CheckStyle: DefaultDistributor.java

Update DataGenerator landing website with V2 changes.

DataGenerator Website

Update website to contain V2 release notice/advertisement banner.
Quick Start for V2 on landing page. (#50)
Update User Docs/ API Docs/FAQs
Update download latest executable to have V1 and V2.

Multiple variables with sets in a single state

Having multiple variables with sets that split into different states within a single state does not work properly.

Use SCXML as the state machine engine in DG

Allow DataGenerator to Generate Data based off of another Model's dataset

I don't think we've solved this one yet:

Suppose we wanted to generate a set of accounts, which contains some identifying information that is tied to that account (like an owner's name, birthday etc,.).

We now want generate a set of transactions that tie the transactions to each account that we generated above. As it stands right now, we need to generate the set of accounts and generate the transactions using a custom consumer that parses through the account file.

Would like to see if it's possible to eliminate the need for consumers to handle parsing previously generated datasets, and move towards generating the two data sets side by side (if feasable)
#55