
nginyc / rafiki


Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.

License: Apache License 2.0

Dockerfile 1.22% Python 53.67% Shell 4.88% JavaScript 39.44% HTML 0.23% CSS 0.56%

rafiki's People

Contributors

airlovelq, cadmusthefounder, nginyc, nudles, pinpom, wild-flame, zhaoxuanwu


rafiki's Issues

Simplify the installation and starting steps

I have gone through all the steps in the README and on the docs site, and found no bugs.
Below are some comments.

  1. Can we use a single script to deploy/start Rafiki, including creating the Docker Swarm and overlay network and starting the admin, DB and cache? (See the sketch after this list.)
  2. The Docker images should be pre-built and pushed to a Docker Hub account.
  3. Add docs for each role, e.g. how can a model contributor add a new model? How can an app developer train a model using their own data?
  4. https://nginyc.github.io/rafiki2/docs/docs/guides/rafiki-tasks.html is not very clear to me about the training and inference data formats.
  5. Where is the data stored, e.g. the user data and the models?
  6. Will the training workers stop and terminate automatically?
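
A minimal sketch of what such a single start script could look like (item 1), written here in Python over subprocess; the image names, overlay-network name and port are placeholders rather than Rafiki's actual ones:

    import subprocess

    def sh(cmd, check=True):
        print('$ ' + cmd)
        subprocess.run(cmd, shell=True, check=check)

    def start_rafiki():
        # Create the swarm and overlay network; ignore errors if they already exist
        sh('docker swarm init', check=False)
        sh('docker network create -d overlay --attachable rafiki', check=False)
        # Start the stateful services first, then the admin API
        sh('docker run -d --name rafiki_db --network rafiki postgres:10')
        sh('docker run -d --name rafiki_cache --network rafiki redis:5')
        sh('docker run -d --name rafiki_admin --network rafiki -p 3000:3000 rafiki/admin')

    if __name__ == '__main__':
        start_rafiki()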

repo folder structure

  • src
    • client.py
    • admin.py
    • worker.py (includes the Worker class and Advisor class for hyper-parameter tuning, e.g., GP)
    • model.py (includes the Model and Knob class)
    • auth.py (for authentication for both admin and frontend)
    • container.py (includes Docker/Kubernetes for managing containers)
    • frontend.py (for query batching and results ensembling)
    • db.py (for database operations)
  • doc
    • architecture.md
    • client.rst
  • example
  • script
  • test
  • requirements.txt
  • .travis.yml
  • LICENSE
  • README

code review Sep 23

Since we have a running version now, we can pause work on functionality for a while and try to make the code clean and tidy, including naming, folder structure and documentation. I am reviewing the code and listing some issues. Please let me know your comments. Once we reach an agreement, we can proceed to update the code. The goal is to make the code easy to read and the system easy to use (which is more important to me than extensibility).

  1. Shall we separate the DB, admin and advisor into 3 Docker containers, or launch them in a single container? Putting them together would simplify the launching process; the drawback is that if one of them fails, we have to restart the whole container. Any other related issues?

  2. cache.Dockerfile and db.Dockerfile are not necessary? Just pull the redis and postgres images from Docker Hub.

  3. Rename the workdir to /root/rafiki? app sounds like the Rafiki application. Rename model.Dockerfile to worker.Dockerfile? Any better (single-word) name for query_frontend?

  4. Shall we rearrange the variables in .env.sh and src/config.py? E.g. put environment variables for other tools (the DB, Redis, etc.), like the Postgres host and port, in .env.sh, and put Rafiki-specific variables (introduced by Rafiki), like the superuser, in config.py. Another option is to put variables read by Python programs into config.py and variables read by bash into .env.sh (is any variable read by both Python and bash?). The advantage of config.py is that we copy the file into the Docker container instead of exporting each variable explicitly; later, when we want to change or add a variable, we do not need to change the code.

  5. Should self._db.create_inference_job_worker be called inside self._create_service? In other words, the former should always be called inside the latter.

  6. stop_train_job_worker is not called by stop_train_job_services?

  7. Rename stop_train_job_services to stop_train_services?

  8. Can we merge ServiceManager.py into Admin.py? The code in both files is similar; most of it interacts with the database. ServiceManager does not have many attributes/properties, and Admin has only one ServiceManager instance.

  9. Let's unify the terminology for hyper-parameter tuning: one knob is one hyper-parameter; one trial is one assignment of all knobs of a model; one study tunes the hyper-parameters until the given budget is used up.

  10. I do not quite understand the advisor folder. From my view, the advisor service can be implemented like this: keep a dict of (job_id -> advisor instance) in app.py; insert a new advisor instance when a job starts; generate the next trial by calling the corresponding advisor; append each trial and its result to a database; for failure recovery (low priority), rebuild the advisor by replaying the (trial, result) pairs from the database. (A sketch follows this list.)

  11. Move VGG and SingleHiddenLayerTensorflowModel into a top-level example folder. Only keep the base model in src.

  12. Try to use a flat folder structure as much as possible. The advantage is that import statements become simpler. It is not common to see a requirements.txt in every subfolder: if a project has multiple sub-packages that can be installed independently, then we'd better have multiple requirements.txt files (one per sub-package); otherwise, we usually put a single requirements.txt at the top folder.

  13. Use a consistent naming style: name all files like xxx_yyy.py and all classes like XxxYyy.

  14. Move the JSON files at the root folder somewhere else?
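
A minimal sketch of the scheme described in item 10; the advisor's propose()/feedback() methods and the DB layer are stand-ins for the real ones:

    advisors = {}     # job_id -> advisor instance, kept in app.py
    trial_log = {}    # job_id -> list of (knobs, score); stands in for the trials table

    def next_trial(job_id):
        # The next trial is generated by calling the corresponding advisor
        return advisors[job_id].propose()

    def report_result(job_id, knobs, score):
        advisors[job_id].feedback(knobs, score)
        trial_log.setdefault(job_id, []).append((knobs, score))  # persisted for recovery

    def recover_advisor(job_id, advisor):
        # Low priority: rebuild advisor state by replaying (knobs, score) pairs
        for knobs, score in trial_log.get(job_id, []):
            advisor.feedback(knobs, score)
        advisors[job_id] = advisor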

sentiment analysis application

We are using image classification as our application.
It would be better to consider at least one more application during development to make sure that the design is extensible; sometimes we have to trade off between extensibility and usability.
Sentiment analysis is quite different from image classification: the input is a sequence of words and the output is a probability (the larger, the more positive). We can use SVM/XGBoost as the model and tf-idf as the features (see the sketch below).
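
A minimal sketch of the suggested baseline (tf-idf features plus a linear SVM) using scikit-learn; the toy dataset is illustrative only:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ['great movie, loved it', 'terrible plot and acting']
    labels = [1, 0]  # 1 = positive, 0 = negative

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(texts, labels)

    # decision_function returns a real-valued score; the larger, the more positive
    print(model.decision_function(['what a wonderful film']))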

Add new POS tagging task

I have 2 ready-to-use custom models for POS tagging - a bi-gram HMM and a CNN+RNN - that I can contribute to Rafiki as new models under a new task.

database schemas and design issues

  1. App developer A trained a job; can B train a new version for the same app?

  2. app_name to app?

  3. get_train_job_workers_by_train_job to get_workers_of_train_job

  4. get_train_jobs and get_train_job have quite different inputs and outputs. How about:

    get_train_job_of_app(app:str, version=-1) -> info of the job for the given version of the app; -1 for latest version.
    get_all_train_job_of_app(app:str) -> list of jobs
    get_train_job(job_id: int) -> job info
    get_workers_of_job(job_id: int) -> list of workers  
    

Similar functions for the inference job? The inference jobs and training jobs share the same version space?

  1. We should not let the developer of app A create an inference job for app B?
  2. Who calls predict_with_trial?
  3. create_model(..., docker_image_name) to create_model(..., docker_image, docker_file): either use the built-in docker_image or a Dockerfile provided by the model contributor.
  4. For application users, we need a field for the user's tier, e.g. basic, premium, golden, etc.
  5. app.py needs to validate the data from get_request_params(), e.g. data types or missing fields (see the sketch after this list).
  6. To be consistent with the Rafiki paper, use knob and advisor for hyperparameter and tuner respectively.
  7. Can Docker Swarm and Kubernetes give us hardware resource info?
  8. TrainJobWorker -> Worker?
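
A minimal sketch of the validation asked for in item 5; the field names and error format are illustrative only:

    def check_request_params(params, required_fields, field_types=None):
        # Return a list of validation errors for the parsed request params
        field_types = field_types or {}
        errors = ['missing field: {}'.format(f) for f in required_fields if f not in params]
        for field, expected_type in field_types.items():
            if field in params and not isinstance(params[field], expected_type):
                errors.append('field {} should be of type {}'.format(field, expected_type.__name__))
        return errors

    # e.g. in app.py, after params = get_request_params():
    # errors = check_request_params(params, ['app', 'task'], {'budget': dict})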

Resolve SINGA-397

https://issues.apache.org/jira/browse/SINGA-397

"Thank you! I'm sorry that I have made a mistake. Now the client-usage.py can successfully run on my machine. By the way, I strongly recommend you highlight the pr-requisites of the python libraries such as tensorflow and keras. It seems that the installation of new versions of these python libraries can also support the whole training and inference process(on my machine) but I still want to confirm whether the specific version is necessary. Since we should firstly create models, would it be okay for you to shift the Quickstart(Model Developers) to the first part of the User Guide?It seems now the order is alphabetical."

Rename terms & methods

  • app_name to app
  • get_train_job_workers_by_train_job to get_workers_of_train_job
  • get_train_job_of_app(app:str, version=-1) -> info of the job for the given version of the app; -1 for latest version.
  • get_all_train_jobs_of_app(app:str) -> list of jobs
  • get_train_job(train_job_id: int) -> job info
  • get_workers_of_job(job_id: int) -> list of workers
  • get_all_inference_jobs_of_app(app:str) -> list of jobs
  • get_inference_job(job_id: int) -> job info
  • hyperparameter to knob
  • tuner to advisor

Integrate Clipper

  • App developer can deploy ATM-trained TF models & create apps with models
  • App users can make predictions with deployed models on clipper

Add independent advisor component

Sketch

  • Has its own DB
  • Accepts hyperparameter config and trial scores, recommends hyperparameters

APIs

AdvisorService

  • constructor()
  • create_advisor(knob_config): advisor_id
  • propose(advisor_id): (knobs, proposal_id)
  • add_result(advisor_id, proposal_id, score)
  • add_feedback(advisor_id, proposal_id, score): should_stop
  • delete_advisor(advisor_id)

AdvisorStore

  • constructor()
  • create_advisor(advisor_inst, knob_config): advisor
  • get_advisor(advisor_id): advisor?
  • update_advisor(advisor, advisor_inst): advisor
  • add_proposal(advisor, knobs): proposal
  • get_proposals(advisor_id): [proposal]
  • get_proposal(advisor_id, proposal_id): proposal?
  • update_proposal(advisor, proposal, result_score, feedback_scores): proposal
  • delete_advisor(advisor_id)

Advisor

  • id
  • advisor_inst
  • knob_config
  • proposals: [Proposal]

Proposal

  • id
  • knobs
  • result_score: double?
  • feedback_scores: [double]

KnobConfig: Dict<string, any>
Knobs: Dict<string, any>
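
A minimal sketch of the AdvisorService interface listed above; the in-memory storage, uuid-based IDs and random proposals are placeholders for the real AdvisorStore and tuning logic:

    import random
    import uuid

    class AdvisorService:
        def __init__(self):
            self._advisors = {}  # advisor_id -> {'knob_config': ..., 'proposals': ...}

        def create_advisor(self, knob_config):
            advisor_id = str(uuid.uuid4())
            self._advisors[advisor_id] = {'knob_config': knob_config, 'proposals': {}}
            return advisor_id

        def propose(self, advisor_id):
            advisor = self._advisors[advisor_id]
            # Placeholder strategy: sample each knob uniformly from its (min, max) range
            knobs = {name: random.uniform(lo, hi)
                     for name, (lo, hi) in advisor['knob_config'].items()}
            proposal_id = str(uuid.uuid4())
            advisor['proposals'][proposal_id] = {'knobs': knobs, 'result_score': None,
                                                 'feedback_scores': []}
            return knobs, proposal_id

        def add_result(self, advisor_id, proposal_id, score):
            self._advisors[advisor_id]['proposals'][proposal_id]['result_score'] = score

        def add_feedback(self, advisor_id, proposal_id, score):
            proposal = self._advisors[advisor_id]['proposals'][proposal_id]
            proposal['feedback_scores'].append(score)
            return False  # should_stop; a real advisor could early-stop on poor feedback

        def delete_advisor(self, advisor_id):
            self._advisors.pop(advisor_id, None)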

Model developers to tune architecture

With "Efficient Neural Architecture Search via Parameter Sharing"

Planned major changes

To better support architecture tuning with ENAS, I'm planning changes to Rafiki's current model training framework:

Replacing budget option MODEL_TRIAL_COUNT with TIME_HOURS

Context

Currently, when application developers create model training jobs, they pass a budget like { 'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20 }, with MODEL_TRIAL_COUNT deciding the no. of trials to conduct for each model template.

Change

Replace the MODEL_TRIAL_COUNT option with a TIME_HOURS option, which specifies how long the train job should run for, as a soft time target. At the same time, I'll be reworking the Advisor component (which proposes trials' knobs) so that it is additionally in charge of deciding how many trials to run, when to stop each worker and when to stop the train job, given the budget, e.g. GPU_COUNT and TIME_HOURS.
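
For illustration, the same train job's budget before and after the proposed change (the exact values are examples only):

    budget_before = {'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20}  # current: fixed trial count per model
    budget_after = {'GPU_COUNT': 1, 'TIME_HOURS': 12}          # proposed: soft time target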

Reasons for change

  • It may not be intuitive for application developers to specify a no. of trials when creating a train job ("how many trials should I put as the budget? how long do I need to wait?"), especially since they're not supposed to be familiar with details like how models are trained and tuned. In contrast, TIME_HOURS is more straightforward.
  • Currently, different models & model tuning strategies require different no. of trials to be effective. For example, the original ENAS tuning strategy requires maybe (301x150+10+1) trials for sufficient train-eval cycles.
  • In the future, it gives more flexibility to model tuning strategies at the Advisor component - for example, I'll be adding a new tuning strategy that takes models with no hyperparameters (e.g. a knob config consisting only of fixed values) and conducts just a single trial (since there's nothing to tune). It's also possible for a new tuning strategy to situationally conduct more or fewer trials based on feedback from workers.

Introducing PolicyKnob

Motivation

I have been integrating ENAS as a new model tuning strategy in Rafiki (i.e. at the Advisor component). If a model template wants to do architecture tuning with ENAS, the model's training code needs to switch between different "modes":

  • During the ENAS architecture search phase, the model needs to alternate between "train my parameters for 1 epoch" and "don't train my parameters; just evaluate on the validation dataset"
  • At the end of the architecture search, the model needs to switch to training its parameters from scratch with a full-sized architecture stacked with more cells, and train for 310 epochs

Similarly, when you think about a standard hyperparameter tuning procedure, you might want the model to do early-stopping for the first e.g. 100 trials, then conduct a final trial for a full e.g. 300 epochs.

In both architecture tuning & hyperparameter tuning, the model needs to be configured by Rafiki somehow to switch between these "modes" on a per-trial basis.

Change

We can model the configuration of a model template for different training "modes" with different model policies. For example, if a model is to engage the policy QUICK_TRAIN, it should prematurely speed up its training step, e.g. by doing early stopping or by reducing the no. of epochs. The model communicates to Rafiki which policies it supports by adding PolicyKnob(policy_name) to its knob_config. In turn, Rafiki configures the activation of the model's policies on a per-trial basis by realising the values of PolicyKnobs to either True (activated) or False (not activated).

For example, here is an example knob config of a model that supports the policy QUICK_TRAIN:

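The original screenshot is not reproduced here; a minimal sketch of what such a knob config might look like, assuming knob types like FixedKnob and PolicyKnob are importable from rafiki.model (the import path and the other knob are illustrative):

    from rafiki.model import FixedKnob, PolicyKnob

    def get_knob_config():
        return {
            'max_epochs': FixedKnob(300),
            # Declares that this model supports the QUICK_TRAIN policy; Rafiki
            # realises its value to True or False on a per-trial basis
            'quick_train': PolicyKnob('QUICK_TRAIN'),
        }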

Whenever the model is to do early-stopping, Rafiki will pass quick_train=True as part of the model's knobs. Otherwise, the model defaults to full-length training.

Here is my current documentation for PolicyKnob:

'''
    Knob type representing whether a certain policy should be activated, as a boolean.
    E.g. the `QUICK_TRAIN` policy knob decides whether the model should stop training early.
    Offering the ability to activate different policies can optimize the hyperparameter search for your model.
    Activation of all policies defaults to false.

    =====================       =====================
    **Policy**                  Description
    ---------------------       --------------------- 
    ``SHARE_PARAMS``            Whether model supports parameter sharing       
    ``QUICK_TRAIN``             Whether model should stop training early in `train()`, e.g. with use of early stopping or reduced no. of epochs
    ``SKIP_TRAIN``              Whether model should skip training its parameters
    ``QUICK_EVAL``              Whether model should stop evaluation early in `evaluate()`, e.g. by evaluating on only a subset of the validation dataset
    ``DOWNSCALE``               Whether a smaller version of the model should be constructed e.g. with fewer layers
    =====================       =====================
    
'''

Add per-role documentation

Add docs for each role, e.g. how can a model contributor add a new model? How can an app developer train a model using their own data?

Also, add installation instructions to the docs site.
