anasaijaz / vevestax

This project forked from vevesta/vevestax

2 Lines of code to track features + machine learning experiments + EDA in a spreadsheet

Home Page: https://vevesta.com

License: Apache License 2.0

Python 23.91% Jupyter Notebook 76.09%

vevestax's Introduction

VevestaX

Library to track machine learning experiments, features as well as automatic EDA in a spreadsheet

VevestaX is an open source Python package for ML engineers and data scientists. It includes modules for EDA and for tracking features sourced from data, engineered features and variables. The output is an excel file with tabs for data sourcing, feature engineering, modelling, performance plots that track variables (accuracy, etc.) over multiple experiments and, lastly, multiple EDA plots. The library can be used with Jupyter notebooks, IDEs like Spyder, Colab, Kaggle notebooks or when running a Python script from the command line. VevestaX is framework agnostic: you can use it with any machine learning or deep learning framework.

Table of Contents

  1. How to Install VevestaX
  2. How to import VevestaX and create the experiment object
  3. How to extract features present in input pandas/pyspark dataframe
  4. How to extract engineered features
  5. How to track variables used
  6. How to track all variables in the code while writing less code
  7. How to write the features and modelling variables in a given excel file
  8. How to commit file, features and parameters to Vevesta
  9. Snapshots of output excel file
  10. How to speed up the code

How to install VevestaX

pip install vevestaX

How to import VevestaX and create the experiment object

#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment()

How to extract features present in input pandas or pyspark dataframe

Code snippet:

#read the dataset
import pandas as pd
df=pd.read_csv("salaries.csv")
df.head(2)

#Extract the columns names for features
V.ds=df
# you can also use:
#   V.dataSourcing = df

How to extract engineered features

Code snippet:

#Extract features engineered
V.fe=df  
# you can also use:
#   V.featureEngineering = df

How to track variables used

V.start() and V.end() form a code block and can be called multiple times in the code to track the variables used within each block. Any technique, such as XGBoost or a decision tree, can be used within this code block. All variables computed between V.start() and V.end() will be tracked. If V.start() and V.end() are not used, all the variables used in the code will be tracked.

Code snippet:

#Track variables which have been used for modelling
V.start()
# you can also use: V.startModelling()


# All the variables mentioned here will be tracked
epochs=100
seed=3
accuracy = computeAccuracy() # computed variable; computeAccuracy() stands in for your own code
recall = computeRecall() # computed variable; computeRecall() stands in for your own code
loss='rmse'


#end tracking of variables
V.end()
# or, you can also use : V.endModelling()
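
To picture what a start/end code block does, the sketch below snapshots a namespace when the block starts and compares it when the block ends. It is only an illustration of the idea, not VevestaX's implementation: the class name BlockTracker and the plain dict standing in for a script's variables are assumptions made for this example.

```python
# Hypothetical sketch: NOT VevestaX's implementation.
# It shows one way a start()/end() pair could collect the variables
# assigned between the two calls.

class BlockTracker:
    """Collects variables created or changed between start() and end()."""

    def __init__(self, namespace):
        self.namespace = namespace   # stand-in for a script's variables
        self.tracked = {}            # variables gathered from all code blocks
        self._before = None

    def start(self):
        # Remember what the namespace looked like before the block ran.
        self._before = dict(self.namespace)

    def end(self):
        if self._before is None:
            raise RuntimeError("end() called without a matching start()")
        for name, value in self.namespace.items():
            is_new = name not in self._before
            is_changed = not is_new and self._before[name] is not value
            if (is_new or is_changed) and not name.startswith("_"):
                self.tracked[name] = value
        self._before = None


ns = {"df_path": "salaries.csv"}    # existed before any block, so not tracked
tracker = BlockTracker(ns)

tracker.start()
ns["epochs"] = 100
ns["seed"] = 3
tracker.end()

tracker.start()                      # a second code block in the same run
ns["loss"] = "rmse"
tracker.end()

print(tracker.tracked)
```

Variables from both blocks end up in the same dictionary, which mirrors the statement above that multiple code blocks are tracked together.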

How to track all variables in the code while writing less code

You can omit the V.start() and V.end() function calls entirely. In that case, all the variables of primitive data types used in the code are tracked and written to the excel file by default. Note: on Colab or Kaggle, the V.start() and V.end() feature hasn't been rolled out; all the variables used in the code are tracked by default instead.
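
As a rough illustration of tracking only primitive variables, the sketch below filters a dictionary of variables by type. The set of types treated as primitive (int, float, str, bool), the underscore rule and the function name primitive_variables are assumptions for this example; the library may use different rules.

```python
# Hypothetical sketch: NOT VevestaX's implementation.
# The tuple below is an assumed definition of "primitive data type".
PRIMITIVE_TYPES = (int, float, str, bool)


def primitive_variables(namespace):
    """Return the entries of `namespace` whose values are primitive types."""
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, PRIMITIVE_TYPES) and not name.startswith("_")
    }


script_vars = {
    "epochs": 100,
    "learning_rate": 0.01,
    "loss": "rmse",
    "use_augmentation": True,
    "layers": [64, 32],      # a list, so it is skipped
    "_internal": 7,          # leading underscore, so it is skipped
}

print(primitive_variables(script_vars))
```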

How to write the features and modelling variables in a given excel file

Code snippet:

# Dump the data sourcing, features engineered and the variables tracked in an xlsx file
V.dump(techniqueUsed='XGBoost',filename="vevestaDump1.xlsx",message="XGboost with data augmentation was used",version=1)

Alternatively, write the experiment into the default file, vevesta.xlsx.

Code snippet:

V.dump(techniqueUsed='XGBoost')
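
The idea of appending one run per call to a log file can be sketched with the standard library alone. The example below writes to a CSV file rather than an xlsx workbook, and the function name append_run and the column name experimentID are made up for illustration; it is not how V.dump is implemented.

```python
# Hypothetical sketch: a CSV stand-in for "one row per experiment run".
# NOT VevestaX's implementation; append_run and experimentID are invented names.
import csv
import os
import tempfile


def append_run(path, technique_used, variables, message="", version=None):
    """Append one experiment run as a row and return its experiment ID."""
    rows = []
    if os.path.exists(path):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))

    row = {
        "experimentID": len(rows) + 1,
        "techniqueUsed": technique_used,
        "message": message,
        "version": version,
    }
    row.update(variables)
    rows.append(row)

    # Columns are the union of all keys seen so far, in first-seen order.
    fieldnames = []
    for r in rows:
        for key in r:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return row["experimentID"]


log_path = os.path.join(tempfile.mkdtemp(), "experiments.csv")
first_id = append_run(log_path, "XGBoost", {"epochs": 100, "accuracy": 0.90})
second_id = append_run(log_path, "XGBoost", {"epochs": 200, "recall": 0.89},
                       message="more epochs")

with open(log_path, newline="") as f:
    logged = list(csv.DictReader(f))
print(first_id, second_id, len(logged))
```

A variable that appears in only some runs simply leaves an empty cell in the other rows.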

How to commit file, features and parameters to Vevesta

Vevesta is a next generation knowledge repository/GitHub for data science projects. The tool is free to use. Please create a login on Vevesta. Then go to the Setting section and download the access token. Place this token in the same folder as the Jupyter notebook or Python script. If by chance you face difficulties, please mail [email protected].

You can commit the file (code), features and parameters to Vevesta by using the following command. You will find the project id for your project on the home page.

Code Snippet:

V.commit(techniqueUsed = "XGBoost", message="increased accuracy", version=1, projectId=1, attachmentFlag=True)

A sample output excel file has been uploaded to Google Sheets. Its URL is here

Snapshots of output excel file

After the dump or commit function is called for each run of the code, the features used, the features engineered and the variables used in the experiments are logged into the excel file. In the experiment below, the commit/dump function is called 6 times, and each time an experiment/code run is written into the excel sheet.

For example, the code snippet used to track code runs/experiments is as below:

#import pandas and the vevesta Library
import pandas as pd
from vevestaX import vevesta as v
V=v.Experiment()
df = pd.read_csv("wine.csv") 
V.ds = df
df["salary_Ratio1"] = df["alchol_content"]/5
V.fe = df
epoch = 1000
accuracy = 90 #this will be a computed variable, may be an output of XGBoost algorithm
recall = 89  #this will be a computed variable, may be an output of XGBoost algorithm

For the above code snippet, each row in the excel sheet corresponds to an experiment/code run. The excel sheet will have the following:

  1. Data Sourcing tab: Marks which features (or columns) in wine.csv were read from the input file. Presence of a feature is marked as 1 and absence as 0.
  2. Feature Engineering tab: Engineered features such as salary_Ratio1 exist as columns in the excel file. Value 1 means that the feature was engineered in that particular experiment and 0 means it was absent.
  3. Modelling tab: This tab tracks all the variables used in the code. Say the variable precision was computed in the experiment; then for experiment ID i, precision will be a column whose value is the computed precision. Note: V.start() and V.end() delimit code blocks that you might define, and the code can have multiple such blocks. The variables in all these code blocks are tracked together. Suppose we define 3 code blocks in the code: the first with precision, the second with recall and accuracy, and the third with epoch, seed and no. of trees. Then for experiment ID i, all the variables, namely precision, recall, accuracy, epoch, seed and no. of trees, will be tracked as one experiment and dumped in a single row with experiment ID i. Note that if code blocks are not defined, all the variables are logged in the excel file.
  4. Messages tab: Data scientists like to create new files when they change the technique or approach to a problem. So every time you run the code, this tab tracks the experiment ID along with the name of the file that had the variables, features and engineered features.
  5. EDA-correlation: correlation is calculated on the input data automatically.
  6. EDA-box Plot tab: Box plots for numeric features
  7. EDA-Numeric Feature Distribution: Scatter plot with x axis as index in the data and y axis as the value of the data point.
  8. EDA-Feature Histogram: Histogram of numeric features
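
The 1/0 marking used in the Data Sourcing and Feature Engineering tabs can be sketched as a small presence matrix with one row per experiment and one column per feature. This is an illustration of the layout only, not the library's code; the function name presence_matrix and the feature names other than salary_Ratio1 are assumptions.

```python
# Hypothetical sketch of the 1/0 layout: NOT VevestaX's implementation.

def presence_matrix(runs):
    """Build one row per experiment, marking each known feature as 1 or 0.

    `runs` is a list of lists of feature names, one inner list per experiment.
    """
    all_features = []
    for features in runs:
        for name in features:
            if name not in all_features:
                all_features.append(name)
    return [
        {name: int(name in features) for name in all_features}
        for features in runs
    ]


# Engineered features seen in three consecutive runs (names are examples).
engineered_per_run = [
    [],                                   # run 1: nothing engineered
    ["salary_Ratio1"],                    # run 2
    ["salary_Ratio1", "salary_Ratio2"],   # run 3
]

for experiment_id, row in enumerate(presence_matrix(engineered_per_run), start=1):
    print(experiment_id, row)
```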

Please note, EDA computation can be skipped by passing True when creating the object, i.e. v.Experiment(True). The following is the code snippet:

#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment(True)

Sourced Data tab

image

Feature Engineering tab

image

Modelling tab

image

Messages tab

image

Sample data tab

image

EDA-correlation tab

image

Overall data profile report tab

image

Variables data profile report tab

image

Scatterplot for numeric features

image

Histogram for numeric features

image

Box plot for numeric features

image

Experiments performance plots

image image

How to speed up the code

The library does EDA automatically on the data. In order to accelerate compute and skip EDA, set the flag speedUp=True as shown in the code snippet.

#import the vevesta Library
from vevestaX import vevesta as v
V = v.Experiment(True)
# or you can also use:
#V=v.Experiment(speedUp = True)
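
Why such a flag saves time can be shown with a toy class that skips its profiling step when the flag is set. The class MiniExperiment and its profile method are invented for this sketch and only borrow the speedUp name from the snippet above; the real library's internals may differ.

```python
# Hypothetical sketch: NOT VevestaX's implementation.
# It only illustrates skipping an expensive profiling step behind a flag.
import statistics


class MiniExperiment:
    def __init__(self, speedUp=False):
        self.speedUp = speedUp

    def profile(self, columns):
        """Return simple per-column statistics, or nothing when speedUp is set."""
        if self.speedUp:
            return {}   # the profiling work is skipped entirely
        return {
            name: {
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
            }
            for name, values in columns.items()
        }


data = {"x": [1, 2, 3], "y": [10, 20, 30]}

print(MiniExperiment().profile(data))
print(MiniExperiment(speedUp=True).profile(data))
```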

If you liked the library, please give us a GitHub star and retweet.

For additional features, explore our tool at Vevesta. For comments, suggestions and early access to the tool, reach out at [email protected]

We are looking for beta users for the library. Register here

We at Vevesta Labs maintain this library and we welcome feature requests. Find a detailed blog post on vevestaX on Medium.
