VevestaX

Library to track machine learning experiments, features as well as automatic EDA in a spreadsheet

VevestaX is an open source Python package for ML engineers and data scientists. It includes modules for EDA and for tracking features sourced from data, engineered features, and variables. The output is an Excel file with tabs for data sourcing, feature engineering, modelling, performance plots that track variables (accuracy, etc.) over multiple experiments, and multiple EDA plots. The library can be used in Jupyter notebooks, IDEs such as Spyder, Colab, Kaggle notebooks, or when running a Python script from the command line. VevestaX is framework agnostic: you can use it with any machine learning or deep learning framework.

How to install VevestaX

pip install vevestaX

How to import VevestaX and create the experiment object

#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment()

How to extract features present in input pandas or pyspark dataframe

Code snippet:

#read the dataset
import pandas as pd
df=pd.read_csv("salaries.csv")
df.head(2)

#Extract the columns names for features
V.ds=df
# you can also use:
#   V.dataSourcing = df

How to extract engineered features

Code snippet:

#Extract features engineered
V.fe=df
# you can also use:
#   V.featureEngineering = df

How to track variables used

V.start() and V.end() delimit a code block and can be called multiple times in the code to track the variables used within each block. Any technique, such as XGBoost or a decision tree, can be used within this code block. All variables computed between V.start() and V.end() are tracked. If V.start() and V.end() are not used, all the variables used in the code are tracked.

Code snippet:

#Track variables which have been used for modelling
V.start()
# you can also use: V.startModelling()


# All the variables mentioned here will be tracked
epochs=100
seed=3
accuracy = computeAccuracy() #placeholder for your own function; this will be a computed variable
recall = computeRecall() #placeholder for your own function; this will be a computed variable
loss='rmse'


#end tracking of variables
V.end()
# or, you can also use : V.endModelling()
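To make the idea of a tracked code block more concrete, here is a minimal conceptual sketch in plain Python. It is not VevestaX's implementation: the BlockTracker class, its methods, and the choice to compare namespace snapshots are all assumptions made for illustration. It records primitive variables that are created or changed between start() and end(), and merges the results of several blocks into one dictionary.

```python
# Conceptual sketch only: NOT how VevestaX is implemented.
# BlockTracker and its methods are hypothetical names.
PRIMITIVES = (int, float, str, bool)
_MISSING = object()  # sentinel for "name did not exist at start()"


class BlockTracker:
    """Records primitive variables created or changed between start() and end()."""

    def __init__(self):
        self.tracked = {}     # merged variables from all code blocks
        self._before = None   # snapshot taken at start()

    def start(self, namespace):
        # Copy the namespace so later assignments do not alter the snapshot.
        self._before = dict(namespace)

    def end(self, namespace):
        if self._before is None:
            raise RuntimeError("end() called before start()")
        for name, value in namespace.items():
            if not isinstance(value, PRIMITIVES):
                continue
            old = self._before.get(name, _MISSING)
            # Track the variable if it is new, changed type, or changed value.
            if old is _MISSING or type(old) is not type(value) or old != value:
                self.tracked[name] = value
        self._before = None


# A plain dict stands in for the script's variables.
ns = {"seed": 3}
tracker = BlockTracker()

tracker.start(ns)          # first code block
ns["epochs"] = 100
ns["loss"] = "rmse"
tracker.end(ns)

tracker.start(ns)          # second code block
ns["accuracy"] = 0.91
tracker.end(ns)

print(tracker.tracked)
```

In this sketch, seed is not recorded because it was set before the first block and never changed, while the variables from both blocks end up in a single dictionary, mirroring the way variables from several blocks are logged as one experiment.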

How to track all variables in the code while writing less code

You can skip the V.start() and V.end() calls entirely. By default, all variables of primitive data types used in the code are tracked and written to the Excel file. Note: on Colab and Kaggle, the V.start() and V.end() feature has not been rolled out yet; all the variables used in the code are tracked by default instead.
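As a rough illustration of what "tracking primitive variables" could mean, the sketch below filters a dictionary of variables down to those holding int, float, str or bool values. The function name primitive_variables, the exact set of types, and the rule of skipping names that start with an underscore are assumptions for this example and may differ from what VevestaX does internally.

```python
# Illustrative sketch; the type list and the underscore rule are assumptions.
def primitive_variables(namespace):
    """Return the int, float, str and bool variables, skipping private names."""
    return {
        name: value
        for name, value in namespace.items()
        if not name.startswith("_") and isinstance(value, (int, float, str, bool))
    }


example_namespace = {
    "epochs": 100,
    "loss": "rmse",
    "use_gpu": False,
    "weights": [0.1, 0.2],   # a list is not a primitive, so it is skipped
    "_cache": 5,             # private name, skipped
}

tracked = primitive_variables(example_namespace)
print(tracked)
```

In a real script, the same function could be applied to globals() to see which variables would qualify under these assumed rules.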

How to write the features and modelling variables to a given Excel file

Code snippet:

# Dump the data sourcing, the features engineered and the variables tracked into an xlsx file
V.dump(techniqueUsed='XGBoost',filename="vevestaDump1.xlsx",message="XGboost with data augmentation was used",version=1)

Alternatively, write the experiment to the default file, vevesta.xlsx.

Code snippet:

V.dump(techniqueUsed='XGBoost')
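The effect of calling dump once per run, with each run becoming one row, can be pictured with the following sketch. It uses a CSV file from the standard library as a stand-in for the Excel workbook; the append_run function, the column names, and the rule that the experiment ID is the row count plus one are assumptions for illustration, not a description of VevestaX internals.

```python
import csv
import os
import tempfile


def append_run(path, technique, message, variables):
    """Append one experiment as a row and return its (assumed) experiment ID."""
    rows = []
    if os.path.exists(path):
        with open(path, newline="") as handle:
            rows = list(csv.DictReader(handle))

    row = {"experimentID": len(rows) + 1, "techniqueUsed": technique, "message": message}
    row.update(variables)
    rows.append(row)

    # Union of all column names in first-seen order, so new variables add columns.
    fieldnames = []
    for existing in rows:
        for key in existing:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return row["experimentID"]


# Hypothetical log file in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "runs.csv")
first_id = append_run(log_path, "XGBoost", "baseline", {"epochs": 100})
second_id = append_run(log_path, "XGBoost", "with data augmentation", {"epochs": 200, "seed": 3})
print(first_id, second_id)
```

A variable that appears only in a later run (seed here) adds a new column, and earlier rows are left blank for it.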

How to commit file, features and parameters to Vevesta

Vevesta is a next-generation knowledge repository/GitHub for data science projects. The tool is free to use. Please create a login on Vevesta. Then go to the Setting section and download the access token. Place this token in the same folder as the Jupyter notebook or Python script. If by chance you face difficulties, please mail [email protected].

You can commit the file (code), features and parameters to Vevesta using the following command. You will find the project ID for your project on the home page.

Code snippet:

V.commit(techniqueUsed = "XGBoost", message="increased accuracy", version=1, projectId=1, attachmentFlag=True)

A sample output Excel file has been uploaded to Google Sheets. Its URL is here

Snapshots of output excel file

Each time the dump or commit function is called during a run of the code, the features used, the features engineered and the variables used in the experiment are logged into the Excel file. In the experiment below, the commit/dump function is called 6 times, and each time an experiment/code run is written into the Excel sheet.

For example, the code snippet used to track code runs/experiments is as follows:

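#import the vevesta library and pandas
import pandas as pd
from vevestaX import vevesta as v
V=v.Experiment()
#read the dataset and record the sourced features
df = pd.read_csv("wine.csv")
V.ds = df
#engineer a feature and record the engineered features
df["salary_Ratio1"] = df["alchol_content"]/5
V.fe = df
epoch = 1000
accuracy = 90 #this would normally be a computed variable, e.g. an output of the XGBoost algorithm
recall = 89  #this would normally be a computed variable, e.g. an output of the XGBoost algorithm
#write this run as one row in the default Excel file
V.dump(techniqueUsed='XGBoost')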

For the above code snippet, each row in the Excel sheet corresponds to an experiment/code run. The Excel sheet will have the following:

Data Sourcing tab: Marks which features (or columns) in wine.csv were read from the input file. Presence of a feature is marked as 1 and absence as 0.
Feature Engineering tab: Engineered features such as salary_Ratio1 appear as columns in the Excel file. A value of 1 means the feature was engineered in that particular experiment and 0 means it was absent.
Modelling tab: Tracks all the variables used in the code. For example, if the variable precision was computed in experiment ID i, then precision is a column whose value in that experiment's row is the computed precision. Note: V.start() and V.end() delimit code blocks that you may define, and the code can contain multiple such blocks. The variables in all these blocks are tracked together. Suppose the code defines 3 code blocks: the first with precision, the second with recall and accuracy, and the third with epoch, seed and number of trees. Then all of these variables, namely precision, recall, accuracy, epoch, seed and number of trees, are tracked as one experiment and dumped in a single row under that experiment's ID. If no code blocks are defined, all the variables are logged in the Excel file.
Messages tab: Data scientists like to create new files when they change the technique or approach to a problem. So every time you run the code, this tab records the experiment ID along with the name of the file that contained the variables, features and engineered features.
EDA-correlation: Correlation is calculated on the input data automatically.
EDA-box Plot tab: Box plots for numeric features.
EDA-Numeric Feature Distribution: Scatter plot with the index in the data on the x axis and the value of the data point on the y axis.
EDA-Feature Histogram: Histogram of numeric features.
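The 1/0 marking described for the Data Sourcing and Feature Engineering tabs can be sketched as a small presence matrix. This is only an illustration of the layout: the presence_rows function, the experimentID column name, and the feature names are hypothetical and not taken from the library.

```python
# Illustrative sketch of the 1/0 presence layout; all names are hypothetical.
def presence_rows(experiments):
    """experiments maps an experiment ID to the list of column names it used."""
    # Collect every feature seen in any experiment, in first-seen order.
    all_features = []
    for columns in experiments.values():
        for column in columns:
            if column not in all_features:
                all_features.append(column)

    rows = []
    for experiment_id, columns in experiments.items():
        row = {"experimentID": experiment_id}
        for feature in all_features:
            row[feature] = 1 if feature in columns else 0
        rows.append(row)
    return rows


runs = {
    1: ["alcohol", "ph"],
    2: ["alcohol", "ph", "sulphates"],
}
rows = presence_rows(runs)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the tab: experiment 1 has a 0 under sulphates because that column was absent in that run.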

Please note that EDA computation can be skipped by passing True when creating the object, as in v.Experiment(True). The following is the code snippet:

#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment(True)

Sourced Data tab

Feature Engineering tab

Modelling tab

Messages tab

Sample data tab

EDA-correlation tab

Overall data profile report tab

Variables data profile report tab

Scatterplot for numeric features

Histogram for numeric features

Box plot for numeric features

Experiments performance plots

How to speed up the code

The library does EDA automatically on the data. In order to accelerate compute and skip EDA, set the flag speedUp=True as shown in the code snippet.

#import the vevesta Library
from vevestaX import vevesta as v
V = v.Experiment(True)
#or you can also use
#V=v.Experiment(speedUp = True)

If you liked the library, please give us a GitHub star and retweet.

For additional features, explore our tool at Vevesta. For comments, suggestions and early access to the tool, reach out at [email protected]

We are looking for beta users for the library. Register here

We at Vevesta Labs maintain this library and welcome feature requests. Find a detailed blog on VevestaX on Medium
