gems-uff / noworkflow Goto Github PK

Supporting infrastructure to run scientific experiments without a scientific workflow management system.

Home Page: http://gems-uff.github.io/noworkflow

License: MIT License

Python 18.19% CSS 0.24% JavaScript 8.82% HTML 0.01% Prolog 0.38% Jupyter Notebook 68.16% TypeScript 2.12% Cython 0.03% Less 1.02% SCSS 1.03% Shell 0.01%

noworkflow's Introduction

noWorkflow

The noWorkflow project aims at allowing scientists to benefit from provenance data analysis even when they don't use a workflow system. Also, the goal is to allow them to avoid using naming conventions to store files originated in previous executions. Currently, when this is not done, the result and intermediate files are overwritten by every new execution of the pipeline.

noWorkflow was developed in Python and it currently is able to capture provenance of Python scripts using Software Engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, to collect provenance without the need of a version control system or any other environment.

Installing and using noWorkflow is simple and easy. Please check our installation and basic usage guidelines below, and the tutorial videos at our Wiki page.

Team

The main noWorkflow team is composed by researchers from Universidade Federal Fluminense (UFF) in Brazil and New York University (NYU), in the USA.

João Felipe Pimentel (UFF) (main developer)
Juliana Freire (NYU)
Leonardo Murta (UFF)
Vanessa Braganholo (UFF)
Arthur Paiva (UFF)

Collaborators

David Koop (University of Massachusetts Dartmouth)
Fernando Chirigati (NYU)
Paolo Missier (Newcastle University)
Vynicius Pontes (UFF)
Henrique Linhares (UFF)
Eduardo Jandre (UFF)
Jessé Lima

Publications

History

The project started in 2013, when Leonardo Murta and Vanessa Braganholo were visiting professors at New York University (NYU) with Juliana Freire. At that moment, David Koop and Fernando Chirigati also joined the project. They published the initial paper about noWorkflow in IPAW 2014. After going back to their home university, Universidade Federal Fluminense (UFF), Leonardo and Vanessa invited João Felipe Pimentel to join the project in 2014 for his PhD. João, Juliana, Leonardo and Vanessa integrated noWorkflow and IPython and published a paper about it in TaPP 2015. They also worked on provenance versioning and fine-grained provenance collection and published papers in IPAW 2016. During the same time, David, João, Leonardo and Vanessa worked with the YesWorkflow team on an integration between noWorkflow & YesWorkflow and published a demo in IPAW 2016. The research and development on noWorkflow continues and is currently under the responsibility of João Felipe, in the context of his PhD thesis.

Quick Installation

To install noWorkflow, you should follow these basic instructions:

First your Python version must be 3.7, then if you have pip, just run:

$ pip install noworkflow[all]

This installs noWorkflow, PyPosAST, SQLAlchemy, python-future, flask, IPython, Jupyter and PySWIP. The only requirements for running noWorkflow are PyPosAST, SQLAlchemy and python-future. The other libraries are only used for provenance analysis.

If you only want to install noWorkflow, PyPosAST, SQLAlchemy and python-future please do:

$ pip install noworkflow

If you do not have pip, but already have Git (to clone our repository) and Python:

$ git clone [email protected]:gems-uff/noworkflow.git
$ cd noworkflow/capture
$ python setup.py install

This installs noWorkflow on your system. It will download the dependencies from PyPI

If you want to install the dependencies to run the demos execute the following commands:

$ cd noworkflow
$ pip install -e capture[demo]

Upgrade

To upgrade the version of a previously installed noWorkflow using pip, you should run the following command:

$ pip install --upgrade noworkflow[all]

Basic Usage

noWorkflow is transparent in the sense that it requires neither changes to the script, nor any laborious configuration. Run

$ now --help

to learn the usage options.

noWorkflow comes with a demonstration project. To extract it, you should run

$ now demo 1
$ cd demo1

To run noWorkflow with the demo script called simulation.py with input data data1.dat and data2.dat, you should run

$ now run -v simulation.py data1.dat data2.dat

The -v option turns the verbose mode on, so that noWorkflow gives you feedback on the steps taken by the tool. The output, in this case, is similar to what follows.

$ now run -v simulation.py data1.dat data2.dat
[now] removing noWorkflow boilerplate
[now] setting up local provenance store
[now] using content engine noworkflow.now.persistence.content.plain_engine.PlainEngine
[now] collecting deployment provenance
[now]   registering environment attributes
[now] collection definition and execution provenance
[now]   executing the script
[now] the execution of trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb finished successfully

Each new run produces a different trial that will be stored with a universally unique identifier in the relational database.

Verifying the module dependencies is a time consuming step, and scientists can bypass this step by using the -b flag if they know that no library or source code has changed. The current trial then inherits the module dependencies of the previous one.

To list all trials, just run

$ now list

Assuming we run the experiment again and then run now list, the output would be as follows. Note that 9 trials were extracted from the demonstration.

$ now list
[now] trials available in the provenance store:
  [f]Trial 7fb4ca3d-8046-46cf-9c54-54923d2076ba: run -v .\simulation.py .\data1.dat .\data2.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:38:50.234485 to 2023-04-12 19:38:51.672300
                                                 duration: 0:00:01.437815
  [*]Trial 01482b72-2005-4319-bd57-773291f9f7b1: run -v .\simulation.py .\data1.dat .\data2.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:40:18.747749 to 2023-04-12 19:40:48.401719
                                                 duration: 0:00:29.653970
  [*]Trial c320d339-09d1-4d10-ad38-e565fa1f1f08: run simulation.py data1.dat data2.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:44:28.459500 to 2023-04-12 19:44:43.310089
                                                 duration: 0:00:14.850589
  [f]Trial 28a6e5da-9a3c-473b-902c-44574beeef29: run simulation_complete.py
                                                 with code hash 78b5b11f3e6f7dca48a6ab9851df2cc0fb5157bc
                                                 ran from 2023-04-12 19:44:44.987635 to 2023-04-12 19:44:58.970957
                                                 duration: 0:00:13.983322
  [*]Trial 4a30be20-e295-4a38-8aea-6b36e4fd2bcd: run simulation.py data1.dat data2.dat
                                                 with code hash 8f73e09f17e877cb2d3ce3604cc66293abed2300
                                                 ran from 2023-04-12 19:45:00.667359 to 2023-04-12 19:45:15.783596
                                                 duration: 0:00:15.116237
  [*]Trial 87161c9c-9a8b-4742-ab3a-df1cdf1779d5: run simulation.py data2.dat data1.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:45:19.122164 to 2023-04-12 19:45:35.050733
                                                 duration: 0:00:15.928569
  [b]Trial 8bf59cf5-cd06-409e-97f6-185063b1cfc3: restore 3
                                                 with code hash c3aeb4cb9af363b375aec603010dd1b97460f6b1
                                                 ran from 2023-04-12 19:45:36.937565 to 2023-04-12 19:45:37.141808
                                                 duration: 0:00:00.204243
  [*]Trial 0adee409-bebf-4119-ae57-8a9d5ba345ce: run simulation.py data1.dat data2.dat
                                                 with code hash 8f73e09f17e877cb2d3ce3604cc66293abed2300
                                                 ran from 2023-04-12 19:45:38.873199 to 2023-04-12 19:45:53.370662
                                                 duration: 0:00:14.497463
  [f]Trial 035a4749-1c58-4f1b-b296-d708779e258a: run simulation.py data1.dat data2.dat
                                                 with code hash c3aeb4cb9af363b375aec603010dd1b97460f6b1
                                                 ran from 2023-04-12 19:45:54.945150 to 2023-04-12 19:46:08.792798
                                                 duration: 0:00:13.847648
  [f]Trial b14bf7b9-a0e5-4f12-a1ae-fb3922c1cd5f: run simulation_complete.py
                                                 with code hash c7c8de76eb564530131abfab4d510bb187ec4b04
                                                 ran from 2023-04-12 19:46:10.360999 to 2023-04-12 19:46:23.811610
                                                 duration: 0:00:13.450611
  [*]Trial 231368e0-786a-4bf4-8e21-a8d05cc72585: run simulation.py data1.dat data2.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:46:25.385022 to 2023-04-12 19:46:42.141455
                                                 duration: 0:00:16.756433
  [*]Trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb: run -v simulation.py data1.dat data2.dat
                                                 with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
                                                 ran from 2023-04-12 19:48:29.463034 to 2023-04-12 19:48:46.930577
                                                 duration: 0:00:17.467543

Each symbol between brackets is its respective trial status. They can express if

a trial is a backup: b

a trial has not finished: f

a trial has finished: *

To look at details of an specific trial, use

$ now show [trial]

This command has several options, such as -m to show module dependencies; -d to show function definitions; -e to show the environment context; -a to show function activations; and -f to show file accesses.

Running

$ now show -a 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb

would show details of trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb. Notice that the function name is preceded by the line number where the call was activated.

$ now show -a 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb
[now] trial information:
  Id: 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb
  Sequence Key: 21
  Status: Finished
  Inherited Id: None
  Script: simulation.py
  Code hash: 6a28e58e34bbff0facaf55f80313ab2fd2505a58
  Start: 2023-04-12 19:48:29.463034
  Finish: 2023-04-12 19:48:46.930577
  Duration: 0:00:17.467543
[now] this trial has the following function activation tree:
  1: __main__ (2023-04-12 19:48:30.263701 - 2023-04-12 19:48:42.070729)
     Return value: <module '__main__' from '/home/joao/demotest/demo1/simulation.py'>
    38: run_simulation (2023-04-12 19:48:38.590221 - 2023-04-12 19:48:40.676348)
        Parameters: data_a = 'data1.dat', data_b = 'data2.dat'
        Return value: [['0.0', '0.6'], ['1.0', '0.0'], ['1.0', '0.0']
        ...

To restore files used by trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb, run

$ now restore 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb

By default, the restore command will restore the trial script, imported local modules and the first access to files. Use the option -s to leave out the script; the option -l to leave out modules; and the option -a to leave out file accesses. The restore command track the evolution history. By default, subsequent trials are based on the previous Trial (e.g. Trial 01482b72-2005-4319-bd57-773291f9f7b1 is based on 7fb4ca3d-8046-46cf-9c54-54923d2076ba). When you restore a Trial, the next Trial will be based on the restored Trial (e.g. c320d339-09d1-4d10-ad38-e565fa1f1f08 based on Trial 7fb4ca3d-8046-46cf-9c54-54923d2076ba).

The restore command also provides a -f path option. This option can be used to restore a single file. With this command there are extra options: -t path2 specifies the target of restored file; -i id identifies the file. There are 3 possibilities to identify files: by access time, by code hash, or by number of access.

$ now restore 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb -f data1.dat -i "A|2023-04-12 19:48:46"
$ now restore 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb -f output.png -i 90451b101 -t output_trial1.png
$ now restore 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb -f simulation.py -i 1

The first command queries data1.dat of Trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb accessed at "2023-04-12 19:48:46", and restores the resulting content after the access. The second command restores output.png with subhash 90451b101, and save it to output_trial1.png. The third command restores the first access to simulation.py, which represents the trial script.

The option -f does not affect evolution history.

The remaining options of noWorkflow are diff, export, history, dataflow, and vis.

The diff option compares two trials. It has options to compare modules (-m), environment (-e), file accesses (-f). It has also an option to present a brief diff, instead of a full diff (--brief)

The export option exports provenance data of a given trial to Prolog facts, so inference queries can be run over the database.

The history option presents a textual history evolution graph of trials.

The dataflow option exports fine-grained provenance data to a graphviz dot representing the dataflow. This command has many options to change the resulting graph. Please, run "now dataflow -h" to get their descriptions.

$ now dataflow 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb -l -m prospective | dot -Tpng -o prospective.png

The vis option starts a visualization tool that allows interactive analysis:

$ now vis -b

The visualization tool shows the evolution history, the trial information, an activation graph. It is also possible to compare different trials in the visualization tool.

The visualization tool requires Flask to be installed. To install Flask, you can run

$ pip install flask==2.1.3

Collaboration Usage

noWorkflow can also be used to run collaborative experiments. Scientists with different computers can work on the same experiments without much trouble. To do this they must do push and pull operations to a server.

The server can be a central one or a peer-to-peer connection. To set up a server or connection online the command below must be run

$ now vis --force true

The command line output will show the server address

 * Serving Flask app 'noworkflow.now.vis.views'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://localhost:5000
Press CTRL+C to quit

In the case above it's http://localhost:5000

To create a new experiment you must open the server address and choose the "Add Experiment" option

Then you must give the experiment a name, write its description, and choose "Confirm"

If the experiment is successfully created you should see a message stating so As you can see in the image above, an id and an url for the experiment will be generated after the experiment is created. The url is extremely important since it will be used to do the push and pull operations.

To get the experiment on a computer you first need to navigate to the folder where you want the experiment, then execute the pull command. The pull command accepts a --url parameter that must be followed by the experiment's url. For example

$ now pull --url http://localhost:5000/experiments/958273cc-b90a-4d1c-b617-43bd2dca20de

The command will download the experiment's files and provenience in the folder. If there are already any files or trials in the experiment you must execute the command "now restore" with or without a trial id.

To push(or commit) to the server(or peer-to-peer connection) you must run the push command. The push command accepts a --url parameter that must be followed by the experiment's url. For example

$ now push --url http://localhost:5000/experiments/958273cc-b90a-4d1c-b617-43bd2dca20de

You can also add groups to a server by navigating to the "Group Information" tab and choosing the "Add Group" option

Then you should write the group's name and choose "Confirm"

If the group is added successfully, you should see a message confirming that the group was created. You should also see the options to add a user to a group or to delete the group

If the option to add a user is chosen, you must select the user from a list and choose "Confirm".

To delete a group just select "Delete Group", then "OK" on the alert that will appear on the screen

Annotations

You can also add annotations to an experiment. To do this you need to access the experiment's url, then go to the "Annotation" tab, and select "Add Annotation"

After filling the annotation's information, choose "Confirm"

If the annotation is added, you will see a success message and will be able to download the annotation as seen below

Annotations can also be added to a trial by following the same procedure above. But first, you must select a trial, choose "Manage Annotations"

IPython Interface

Another way to run, visualize, and query trials is to use Jupyter notebook with IPython kernel. To install Jupyter notebook and IPython kernel, you can run

$ pip install jupyter
$ pip install ipython
$ jupyter nbextension install --py --sys-prefix noworkflow
$ jupyter nbextension enable noworkflow --py --sys-prefix

Then, to run Jupyter notebook, go to the project directory and execute:

$ jupyter notebook

It will start a local webserver where you can create notebooks and run python code.

Before loading anything related to noworkflow on a notebook, you must initialize it:

In  [1]: %load_ext noworkflow
    ...: import noworkflow.now.ipython as nip

It is equivalent to:

In  [1]: %load_ext noworkflow
    ...: nip = %now_ip

After that, you can either run a new trial or load an existing object (History, Trial, Diff).

There are two ways to run a new trial:

1- Load an external file

In  [1]: arg1 = "data1.dat"
         arg2 = "data2.dat"

In  [2]: trial = %now_run simulation.py {arg1} {arg2}
    ...: trial
Out [2]: <Trial "7fb4ca3d-8046-46cf-9c54-54923d2076ba"> # Loads the trial object represented as a graph

2- Load the code inside a cell

In  [3]: arg = 4

In  [4]: %%now_run --name new_simularion --interactive
    ...: l = range(arg)
    ...: c = sum(l)
    ...: print(c)
         6
Out [4]: <Trial "01482b72-2005-4319-bd57-773291f9f7b1"> # Loads the trial object represented as a graph

In  [5]: c
Out [5]: 6

Both modes supports all the now run parameters.

The --interactive mode allows the cell to share variables with the notebook.

Loading existing trials, histories and diffs:

In  [6]: trial = nip.Trial("7fb4ca3d-8046-46cf-9c54-54923d2076ba") # Loads trial with Id = 7fb4ca3d-8046-46cf-9c54-54923d2076ba
    ...: trial # Shows trial graph
Out [6]: <Trial 7fb4ca3d-8046-46cf-9c54-54923d2076ba>

In  [7]: history = nip.History() # Loads history
    ...: history # Shows history graph
Out [7]: <History>

In  [8]: diff = nip.Diff("7fb4ca3d-8046-46cf-9c54-54923d2076ba", "01482b72-2005-4319-bd57-773291f9f7b1") # Loads diff between trial 7fb4ca3d-8046-46cf-9c54-54923d2076ba and 01482b72-2005-4319-bd57-773291f9f7b1
    ...: diff # Shows diff graph
Out [8]: <Diff "7fb4ca3d-8046-46cf-9c54-54923d2076ba" "01482b72-2005-4319-bd57-773291f9f7b1">

To visualize the dataflow of a trial, it is possible to use the dot attribute of trial objects:

In  [9]: trial.dot
Out [9]: <png image>

This command requires an installation of graphviz.


There are attributes on those objects to change the graph visualization, width, height and filter values. Please, check the documentation by running the following code on jupyter notebook:
```python
In  [10]: trial?

In  [11]: history?

It is also possible to run prolog queries on IPython notebook. To do so, you will need to install SWI-Prolog with shared libraries and the pyswip module.

You can install pyswip module with the command:

$ pip install pyswip-alt

Check how to install SWI-Prolog with shared libraries at https://github.com/yuce/pyswip/blob/master/INSTALL

To query a specific trial, you can do:

In  [12]: result = trial.query("activation(_, 550, X, _, _, _)")
    ...: next(result) # The result is a generator
Out [12]: {'X': 'range'}

To check the existing rules, please do:

In  [13]: %now_schema prolog -t
Out [13]: [...]

Finally, it is possible to run the CLI commands inside ipython notebook:

In  [14]: !now export {trial.id}
Out [14]: %
     ...: % FACT: activation(trial_id, id, name, start, finish, caller_activation_id).
     ...: %
     ...: ...

Contributing

Pull requests for bugfixes and new features are welcome!

For installing the python dependencies locally, clone the repository and run:

pip install -e noworkflow/capture

For changes on the now vis or IPython integration files, install nodejs, Python 3 and run:

cd noworkflow/npm
python watch.py

Included Software

Parts of the following software were used by noWorkflow directly or in an adapted form:

The Python Debugger
Copyright (c) 2001-2016 Python Software Foundation.
All Rights Reserved.

Acknowledgements

We would like to thank CNPq, FAPERJ, and the National Science Foundation (CNS-1229185, CNS-1153503, IIS-1142013) for partially supporting this work.

License Terms

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

noworkflow's People

Contributors

Stargazers

Watchers

noworkflow's Issues

Collect environment provenance bug

I tried to run in ubuntu, and I got the exception:

~/data_mining$ python ../noworkflow/capture/noworkflow/now.py  run id3.py 
Traceback (most recent call last):
  File "../noworkflow/capture/noworkflow/now.py", line 59, in <module>
    now.main()
  File "/home/joao/noworkflow/capture/noworkflow/now.py", line 54, in main
    args.func(args)
  File "/home/joao/noworkflow/capture/noworkflow/cmd_run.py", line 43, in execute
    prov_deployment.collect_provenance(args)
  File "/home/joao/noworkflow/capture/noworkflow/prov_deployment.py", line 80, in collect_provenance
    environment = collect_environment_provenance()
  File "/home/joao/noworkflow/capture/noworkflow/prov_deployment.py", line 18, in collect_environment_provenance
    environment[name] = os.sysconf(name)
OSError: [Errno 22] Invalid argument

The bug is in the function os.sysconf(name). (https://github.com/gems-uff/noworkflow/blob/master/capture/noworkflow/prov_deployment.py#L18)
It only occurs when it tries to read 'SC_EQUIV_CLASS_MAX' or the code 41. All the other values of os.sysconf_names are fine
41 == os.sysconf_names['SC_EQUIV_CLASS_MAX']

Metaheuristic for matching activation graphs

Use a metaheuristics to improve the activation graph matching on diffs

Add parameter to select directory for show, list, export, diff, vis and checkout

Rename checkout operation to restore

Sumatra

Paper: Automated capture of experiment context for easier reproducibility in computational research
http://andrewdavison.info/media/files/reproducible_research_CiSE.pdf

Implementation: https://pypi.python.org/pypi/Sumatra

Slides: http://icerm.brown.edu/materials/Slides/tw-12-5/Sumatra:_a_toolkit_for_provenance_capture_and_reuse_%5D_Andrew_Davison,_Centre_National_de_la_Recherche_Scientifique_(CNRS).pdf

Pyprov 1.0.1: A Python implementation of PROV data model

https://pypi.python.org/pypi/pyprov/1.0.1

AttributeError on self.calls.append when there are no functions

Issue occurs in prov_definition.py when the script has no functions

self.calls, self.global_vars, self.arguments are never initialized, because the initialization occurs in visit_FunctionDef (https://github.com/gems-uff/noworkflow/blob/master/capture/noworkflow/prov_definition.py#L39) that is just called where there are defined functions

Script:

f1 = open('in.dat', 'r')
total = 0
for myline in f1:
    myval = int(myline.strip())
    total += myval
f1.close()

f2 = open('out.dat', 'w')
print >>f2, total
f2.close()

Traceback:

$ now run -v script2.py
[now] removing noWorkflow boilerplate
[now] setting up local provenance store
[now] collecting definition provenance
[now]   registering user-defined functions
Traceback (most recent call last):
  File "/usr/local/bin/now", line 9, in <module>
    load_entry_point('noworkflow==0.3', 'console_scripts', 'now')()
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/now.py", line 63, in main
    args.func(args)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/cmd_run.py", line 43, in execute
    prov_definition.collect_provenance(args)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/prov_definition.py", line 92, in collect_provenance
    functions = find_functions(args.script, code)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/prov_definition.py", line 75, in find_functions
    visitor.visit(tree)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 241, in visit
    return visitor(node)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/prov_definition.py", line 32, in generic_visit
    ast.NodeVisitor.generic_visit(self, node)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 249, in generic_visit
    self.visit(item)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 241, in visit
    return visitor(node)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/prov_definition.py", line 32, in generic_visit
    ast.NodeVisitor.generic_visit(self, node)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 251, in generic_visit
    self.visit(value)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 241, in visit
    return visitor(node)
  File "/Library/Python/2.7/site-packages/noworkflow-0.3-py2.7.egg/noworkflow/prov_definition.py", line 62, in visit_Call
    self.calls.append(func.id)
AttributeError: 'NoneType' object has no attribute 'append'

Git in Python

http://pythonhosted.org/GitPython/0.3.1/index.html

Semantic Versioning for Trial

I liked the idea proposed to Vitor to use a trial id using the format MAJOR.MINOR.PATCH, and I think it can be applied here

Meaning:
MAJOR: code version
MINOR: params version
PATCH: execution number

For example, let`s say I have the following script

# script.py
import sys
print sys.argv

If I execute now run script.py 1, it should generate the trial id 0.0.0 (or maybe another starting number)
If I execute the exactly same command again, it generate the trial id 0.0.1 and so on

If I change the params and execute now run script.py 2, it should generate the trial id 0.1.0
If I change the params again and execute now run script.py 3, it should generate the trial id 0.2.0

If I decide to change the code:

# script.py
import sys
print sys.argv[1]

And execute it again now run script.py 1, the new version should be 1.0.0

Questions:
1- Is it natural to have a Trial 0.0.0? Or it is better to start with 1.1.1? Or even 0.0.1?
2- Are the params independent from the code version? In the previous example, after changing the code, if I run now run script.py 3, should it generate the trial id 1.2.0 (knowing that the param '3' was used in 0.2.0) or generate the trial id 1.1.0, because the last MINOR was 0?

Sample pipelines in Python: Rosalind

http://rosalind.info/problems/locations/

IPython Required?

I think that the .now.ipython import in noworkflow/__init__.py means that IPython is required to run noWorkflow conflicting the README. However, I tried moving that import inside the load_ipython_extension method, and that did allow the help message to be printed without IPython installed.

tests fails

Installed from pip and running with python 2.7.6.

File "/User/lib/python2.7/site-packages/noworkflow/now/prov_definition/function_visitor.py", line 89, in extract_disasm
self.metascript['code'], self.metascript['path'], 'exec')
File "/noworkflow/tests/example1.py", line 22
print 40
^
SyntaxError: invalid syntax

Integrate noWorkflow and ReproZip

ReproZip could have provenance analysis like noWorkflow; also, noWorkflow can use the provenance captured by ReproZip to improve its analysis. Many of the applied math examples use Python + Fortran: in fact, many of the file operations happen in the Fortran side. When combining both tools, even the non-Python operations could be detected and integrated.

Program slicing

Extract the program slicing to better identify function/data dependencies

Prolog export bug

I wrote a simple python script, executed it (now run script.py) and tried to export it to prolog (now export -r >> result.pl)
The generated .pl didn't have the access facts and when I tried to query the results, I received errors

script.py

def fn(arg1):
    print arg1
def main():
    fn("teste")
if __name__ == '__main__':
    main()

result.pl

%
% FACT: activation(id, name, start, finish, caller_activation_id).
%

activation(1, '/home/joao/projects/exemplo/script.py', 1408974736.650279, 1408974736.651658, nil).
activation(2, 'main', 1408974736.650899, 1408974736.651635, 1).
activation(3, 'fn', 1408974736.651465, 1408974736.651602, 2).

%
% FACT: access(id, name, mode, content_hash_before, content_hash_after, timestamp, activation_id).
%


%
% ID-BASED ACCESSOR RULES FOR
% activation(id, name, start, finish, caller_activation_id).
% access(id, name, mode, content_hash_before, content_hash_after, timestamp, activation_id).
%

name([], []).
name([Id|Ids], [Name|Names]) :- name(Id, Name), name(Ids, Names).
name(Id, Name) :- activation(Id, Name, _, _, _).
name(Id, Name) :- access(Id, Name, _, _, _, _, _).

timestamp_id(Id, Start, start) :- activation(Id, _, Start, _, _).
timestamp_id(Id, Finish, finish) :- activation(Id, _, _, Finish, _).
timestamp_id(Id, Timestamp) :- access(Id, _, _, _, _, Timestamp, _).
duration_id(Id, Duration) :- timestamp_id(Id, Start, start), timestamp_id(Id, Finish, finish), Duration is Finish - Start.
successor_id(Before, After) :- timestamp_id(Before, TS1, start), timestamp_id(After, TS2, finish), TS1 =< TS2.
successor_id(Before, After) :- timestamp_id(Before, TS1), timestamp_id(After, TS2), TS1 =< TS2.

activation_id(Caller, Called) :- activation(Called, _, _, _, Caller).

mode_id(Id, Mode) :- access(Id, _, Mode, _, _, _, _).
file_read_id(Id) :- mode_id(Id, Mode), atom_prefix(Mode, 'r').
file_written_id(Id) :- mode_id(Id, Mode), atom_prefix(Mode, 'w').

hash_id(Id, Hash, before) :- access(Id, _, _, Hash, _, _, _).
hash_id(Id, Hash, after) :- access(Id, _, _, _, Hash, _, _).
changed_id(Id) :- hash_id(Id, Hash1, before), hash_id(Id, Hash2, after), Hash1 \== Hash2.

access_id(Function, File) :- access(File, _, _, _, _, _, Function).

%
% ID-BASED INFERENCE RULES
%

activation_stack_id(Called, []) :- activation_id(nil, Called). 
activation_stack_id(Called, [Caller|Callers]) :- activation_id(Caller, Called), activation_stack_id(Caller, Callers).

indirect_activation_id(Caller, Called) :- activation_stack_id(Called, Callers), member(Caller, Callers).

% Naive! Should check arguments and return values (program slicing?) to avoid false positives
activation_influence_id(Influencer, Influenced) :- successor_id(Influencer, Influenced).

access_stack_id(File, [Function|Functions]) :- access_id(Function, File), activation_stack_id(Function, Functions).

indirect_access_id(Function, File) :- access_stack_id(File, Functions), member(Function, Functions).

access_influence_id(Influencer, Influenced) :- file_read_id(Influencer), file_written_id(Influenced), successor_id(Influencer, Influenced), access_id(F1, Influencer), access_id(F2, Influenced), activation_influence_id(F1, F2).

%
% NAME-BASED ACCESSOR RULES
%

timestamp(Name, Timestamp, Moment) :- timestamp_id(Id, Timestamp, Moment), name(Id, Name).
timestamp(Name, Timestamp) :- timestamp_id(Id, Timestamp), name(Id, Name).
duration(Name, Duration) :- duration_id(Id, Duration), name(Id, Name). 
successor(Before, After) :- successor_id(Before_id, After_id), name(Before_id, Before), name(After_id, After).
mode(Name, Mode) :- mode(Id, Mode), name(Id, Name).
file_read(Name) :- file_read_id(Id), name(Id, Name).
file_written(Name) :- file_written_id(Id), name(Id, Name).
hash(Name, Hash, Moment) :- hash_id(Id, Hash, Moment), name(Id, Name).
changed(Name) :- changed_id(Id), name(Id, Name).

%
% NAME-BASED INFERENCE RULES
%

activation_stack(Called, Callers) :- activation_stack_id(Called_id, Caller_ids), name(Called_id, Called), name(Caller_ids, Callers).
indirect_activation(Caller, Called) :- indirect_activation_id(Caller_id, Called_id), name(Called_id, Called), name(Caller_id, Caller).
activation_influence(Influencer, Influenced) :- activation_influence_id(Influencer_id, Influenced_id), name(Influencer_id, Influencer), name(Influenced_id, Influenced).
access_stack(File, Functions) :- access_stack_id(File_id, Functions_id), name(File_id, File), name(Functions_id, Functions).
indirect_access(Function, File) :- indirect_access_id(Function_id, File_id), name(Function_id, Function), name(File_id, File).
access_influence(Influencer, Influenced) :- access_influence_id(Influencer_id, Influenced_id), name(Influencer_id, Influencer), name(Influenced_id, Influenced).

Execution:

$ swipl
?- [result].
% result compiled 0.00 sec, 44 clauses
true.

?- name(X, 'fn').
X = 3
?- activation_stack('fn', X).
ERROR: name/2: Undefined procedure: access/7
   Exception: (8) access(1, fn, _G1015, _G1016, _G1017, _G1018, _G1019) ?

Wanted: An Entry-Level Provenance Library

http://software-carpentry.org/blog/2012/10/wanted-an-entry-level-provenance-library.html

Diff for function activations

Compare activations parameters and returns between trials

Erro ao tentar pegar valor atual da SEQUENCE no BD

Tentei rodar a versão mais recente do master e recebo o seguinte erro. Minha versão do Python é 2.7.3:

camundongo:weather leomurta$ now run simulation.py data1.dat data2.dat 
Traceback (most recent call last):
  File "/Users/leomurta/Library/Enthought/Canopy_64bit/User/bin/now", line 9, in <module>
    load_entry_point('noworkflow==0.3.1-dev', 'console_scripts', 'now')()
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/now.py", line 65, in main
    args.func(args)
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/cmd_run.py", line 64, in execute
    prov_execution.store()  # TODO: exceptions should be registered as return from the activation and stored in the database. We are currently ignoring all the activation tree when exceptions are raised.
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/prov_execution.py", line 269, in store
    provider.store()
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/prov_execution.py", line 226, in store
    persistence.update_trial(now, self.function_activation)
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/persistence.py", line 132, in update_trial
    store_function_activation(function_activation, None)
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/persistence.py", line 219, in store_function_activation
    function_activation_id = function_activation_id_seq()
  File "/Users/leomurta/workspace/noworkflow/capture/noworkflow/persistence.py", line 114, in function_activation_id_seq
    (an_id,) = db.execute("select seq from SQLITE_SEQUENCE WHERE name='function_activation'").fetchone()
TypeError: 'NoneType' object is not iterable

Complex data slicing

The current program slicing does not handle complex data such as lists, dicts and objects very well.
ToDo:

Consider getitem: a[b]
Consider getattr: a.b
Treat parameters and assignments of variables that are references.

Current issues:

def fn(x, y, z, w):
    x.attr = y 

o = SomeObject()
a, b, c = 1, 2, 3
o[a] = b 
v = o
v[a] = c 

fn(o, a, b, c)

With the line "v[a] = c", both "o[a]" and "v[a]" should depend on "c".
With the line "x.attr = y", both "x.attr", "v.attr" and "o.attr" should depend on "y".
How to deal with these situations?

There are some discussion regarding this issue in #19

Navigate to previous versions

I just realized that I didn't create an issue for this part of the cm project

I need a suggestion for the command name: navigate, checkout, restore?

Argument overlap bug

Noworkflow tries to read the script arguments as being its own arguments.

Code:

# argv.py
import sys
if __name__ == '__main__':
    print(sys.argv)

Execution:

$ python argv.py -c test
['argv.py', '-c', 'test']

$ now run argv.py -c test
usage: now run [-h] [-v] [-b] [-c {non-user,all}] [-d DEPTH] script
now run: error: argument -c/--depth-context: invalid choice: 'test' (choose from 'non-user', 'all')

Python 3 support

Provide python 3 support, and keep python 2.7 support

Find a Python database to store the hashes and execution results

Include data in the provenance graph

When possible, include argument and return values,
This would not be possible for loops

Integrate provenance at different levels

E.g.: Integrate operating system level with function level

Provenance

http://software-carpentry.org/v4/essays/provenance.html

Dynamic Slicing of python programs

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6899220&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6899220

Very slow execution

A script using netCDF4 that executes in less than 5s with pure python is taking more than 2 hours to execute with noWorkflow. I believe it is an issue with the storage of activations with sqlite since the script is O(n³) and n seems to be big

Cell Magic for IPython Notebook

Support cell magic for ipython notebook:
Example

In [1]: my_var = 2
In [2]: %%now --name script1 $my_var 30
   ...: import sys
   ...: print(sys.argv[1])
   ...: print(sys.argv[2])
2
30
Out [2]: <Trial 1>

This should run the code in the cell using noworkflow and return a trial object.
The default visualization of a trial object is an activation graph using ipython notebook.

Invalid call identification by column causes KeyError with Tracer

The error appears by running “now run -e Tracer tests/weather/simulation.py tests/weather/data1.dat tests/weather/data2.dat” on version 0.7.0. The error message is attached below.

[now] the execution finished with an uncaught exception. Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/cmd/cmd_run.py", line 109, in run
    exec(metascript['compiled'], __main__.__dict__)
  File "/Users/syc/Documents/noworkflow/tests/weather/simulation.py", line 38, in <module>
    data = run_simulation(data_a, data_b)
  File "/Users/syc/Documents/noworkflow/tests/weather/simulation.py", line 7, in run_simulation
    a = csv_read(data_a)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 330, in tracer
    return super(Tracer, self).tracer(frame, event, arg)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/profiler.py", line 174, in tracer
    super(Profiler, self).tracer(frame, event, arg)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/base.py", line 44, in tracer
    self.event_map[event](frame, event, arg)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 306, in trace_return
    super(Tracer, self).trace_return(frame, event, arg)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/profiler.py", line 160, in trace_return
    self.close_activation(event, arg)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 176, in close_activation
    self.slice_line(*line)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 169, in slice_line
    add_dependencies(variables[vid], activation, others)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 142, in add_dependencies
    add_dependency(var, dep, activation)
  File "/Library/Python/2.7/site-packages/noworkflow-0.7.0-py2.7.egg/noworkflow/now/prov_execution/slicing.py", line 120, in add_dependency
    call = self.call_by_col[dep[0]][dep[1]]
KeyError: -1

Bug reported by Suzanna Yang Cao

Workflow evolution story

Create a graph with trial nodes as revisions. (VisTrails-style provenance)
When you clink on a node, it should open the trial workflow

Eg.: http://www.vistrails.org/index.php/File:Cmop-ss.png

IPython notebook support for data visualization

Since IPython notebook files can be shared, it would be good to support an export method for trials, history and even prolog facts that could be loaded on IPython notebook without relying on the visualization tool server.

Configure granularity of capture

Possible configurations

Capture all variables being used
Capture provenance at user-defined function level (default)
Capture provenance at deeper function levels

Python Omniscient Debugger

Python Omniscient Debugger: https://github.com/rodsenra/pode

They capture variables and events in a similar way that we do for program slicing (using python tracer).
The main difference is that they don't capture the dependencies. They capture only the values after the assignments.
Another difference is that while we look at the AST, they look at the disasm to extract assingments, which is probably easier. Maybe we can explore it in the future since we already look at the disasm to extract the position of function calls.

There is a hangout in portuguese explaining the code: https://www.youtube.com/watch?v=MxzZXBI5T1s

Investigate techniques for summarizing and visualizing provenance graphs

Also analyze different ways to contrast different trials.

Detect variable usage during del operation for program slicing

When a variable is deleted, its ctx is set as "Del". Variable usages are created using only ctx "Load". So it is not possible to query "Where did the variable X deleted in line 7 come from", as this deletion is not stored.

Parametrize the graph visualization (combine counting vs independent counting)

As @leomurta suggested in #28:

This can be parametrized. Depending on the analysis scenario, the user may switch the plug and observe the data in a different way. Actually, try to play with both visualizations to figure out which questions they help on answering.

The problem that I foresee in not propagating is this: if the function has a control structure that switches 90% of the time to a() and 10% of the time to b(), you would show two call edges, both with weight 1, one to a() and the other to b(). This does not clearly reflect the call distribution.

My feeling: the visualization without propagation mimics the syntactic structure of the script. The visualization with propagation mimics the dynamic structure of the system. In a more complex graph, the explicit count (with propagation) is probably easier to see how many times each function is being executed.

yield during slicing bug

The following code does not work with the program slicing:

def f(l):
    for x in l:
        yield x

a = range(10)
for x in f(a):
    print x

Traceback

python ~/noworkflow/capture/noworkflow/main.py run -e Tracer test_cov.py 
33
[now] the execution finished with an uncaught exception. Traceback (most recent call last):
  File "/home/joao/noworkflow/capture/noworkflow/now/cmd/cmd_run.py", line 91, in run
    exec(metascript['compiled'], __main__.__dict__)
  File "/home/joao/noworkflow/tests/test_cov.py", line 6, in <module>
    for x in f(a):
  File "/home/joao/noworkflow/tests/test_cov.py", line 1, in f
    def f(l):
  File "/home/joao/noworkflow/capture/noworkflow/now/prov_execution/slicing.py", line 323, in tracer
    return super(Tracer, self).tracer(frame, event, arg)
  File "/home/joao/noworkflow/capture/noworkflow/now/prov_execution/profiler.py", line 151, in tracer
    super(Profiler, self).tracer(frame, event, arg)
  File "/home/joao/noworkflow/capture/noworkflow/now/prov_execution/base.py", line 34, in tracer
    self.event_map[event](frame, event, arg)
  File "/home/joao/noworkflow/capture/noworkflow/now/prov_execution/slicing.py", line 285, in trace_call
    self.add_argument_variables(frame)
  File "/home/joao/noworkflow/capture/noworkflow/now/prov_execution/slicing.py", line 243, in add_argument_variables
    call = self.call_by_lasti[back.f_lineno][back.f_lasti]
KeyError: 33

Disasm

  1           0 LOAD_CONST               0 (<code object f at 0x7f66e3267730, file "/home/joao/noworkflow/tests/test_cov.py", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (f)

  5           9 LOAD_NAME                1 (range)
             12 LOAD_CONST               1 (10)
             15 CALL_FUNCTION            1 | F(func=['range'], args=[[]], keywords={}, *args=[], **kwargs=[])
             18 STORE_NAME               2 (a)

  6          21 SETUP_LOOP              25 (to 49)
             24 LOAD_NAME                0 (f)
             27 LOAD_NAME                2 (a)
             30 CALL_FUNCTION            1 | F(func=['f'], args=[['a']], keywords={}, *args=[], **kwargs=[])
             33 GET_ITER            
        >>   34 FOR_ITER                11 (to 48)
             37 STORE_NAME               3 (x)

  7          40 LOAD_NAME                3 (x)
             43 PRINT_ITEM          
             44 PRINT_NEWLINE       
             45 JUMP_ABSOLUTE           34
        >>   48 POP_BLOCK           
        >>   49 LOAD_CONST               2 (None)
             52 RETURN_VALUE        
  2           0 SETUP_LOOP              19 (to 22)
              3 LOAD_FAST                0 (l)
              6 GET_ITER            
        >>    7 FOR_ITER                11 (to 21)
             10 STORE_FAST               1 (x)

  3          13 LOAD_FAST                1 (x)
             16 YIELD_VALUE         
             17 POP_TOP             
             18 JUMP_ABSOLUTE            7
        >>   21 POP_BLOCK           
        >>   22 LOAD_CONST               0 (None)
             25 RETURN_VALUE

line = 6
last_i = 33 
call_by_lasti = defaultdict(<type 'dict'>, {
    1: {}, 
    2: {}, 
    3: {}, 
    5: {15: F(func=['range'], args=[[]], keywords={}, *args=[], **kwargs=[])}, 
    6: {30: F(func=['f'], args=[['a']], keywords={}, *args=[], **kwargs=[])}, 
    7: {}})

In the line 6, it captures the function "f" with last_i 30, but during the execution the "for" calls the function through get_iter (last_i 33)

Function args not being captured

After the refactoring that replaced the inspect.getargvalues(frame), the sequential args stopped being captured.

_args and *_kwargs still work

Cache activation graphs

Some activation graphs are taking a long time for calculations.
Maybe a cache can minimize this problem

A cache may also help on loading pre-calculated results of #49

Strong Links

http://dl.acm.org/citation.cfm?id=1876071

The id column in the database does not express anymore the order in which the element appears

Previous versions of noWorkflow assumed that the id column in the database reflects the order in which functions are activated, arguments are passed, and so on. After the change to store data in batch, this property got lost. Not sure if it is important to keep this property or not, but this should be checked.

The graph visualization, for instance, processes function activations ordered by id. I am changing it to order by start time. However, it is worth it to search for other "order by id" in the project and evaluate possible side effects. Arguments may need to have an extra field to express the order they appear.

Data Provenance with GitPython

http://penandpants.com/2013/04/25/data-provenance-with-gitpython/

A Python Library for Provenance Recording and Querying

http://link.springer.com/chapter/10.1007%2F978-3-540-89965-5_24

Python prov

A suites of Python modules for encoding, tracking, and storing provenance assertions

ReStore: Reusing Results of MapReduce Jobs

Proceedings of the VLDB Endowment (PVLDB), volume 5, number 6, pages 586-597, February 2012
https://cs.uwaterloo.ca/~ashraf/pubs/pvldb12restore.pdf

SIGMOD DEMO:
https://cs.uwaterloo.ca/~ashraf/pubs/sigmod12restoredemo.pdf

Set, dict, generic comprehension and disasm bug

The only comprehension that is working is the list comprehension. The others (dict, set, parameter) fail during the disasm phase

a = range(10)
b = [x for x in a] # ok
c = sum(x for x in a) # fails
d = {x for x in a} # fails
e = {x:x for x in a} # fails

It seems that those comprehensions call a function in the disassembly but the function is not being caught in the AST phase, so the disasm phase cannot match the function in the following line

calls_by_lasti[f_lasti] = calls_by_line[col]