
nginyc / rafiki


Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.

License: Apache License 2.0

Dockerfile 1.22% Python 53.67% Shell 4.88% JavaScript 39.44% HTML 0.23% CSS 0.56%

rafiki's People

Contributors

airlovelq, cadmusthefounder, nginyc, nudles, pinpom, wild-flame, zhaoxuanwu


rafiki's Issues

Simplify the installation and starting steps

I have gone through all the steps in the README and on the docs site, and found no bugs.
Below are some comments.

  1. Can we use a single script to deploy/start Rafiki, including creating the Docker Swarm and overlay network and starting the admin, DB and cache? (See the sketch after this list.)
  2. The Docker images should be pre-built and pushed to a Docker Hub account.
  3. Add docs for each role, e.g. how can a model contributor add a new model? How can an app developer train a model using their own data?
  4. https://nginyc.github.io/rafiki2/docs/docs/guides/rafiki-tasks.html is not very clear to me about the training and inference data formats.
  5. Where is the data stored, e.g. the user data and the models?
  6. Will the training workers stop and terminate automatically?
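
A minimal sketch of what such a single start script could look like (item 1), written here in Python over subprocess; the image names, overlay-network name and port are placeholders rather than Rafiki's actual ones:

    import subprocess

    def sh(cmd, check=True):
        print('$ ' + cmd)
        subprocess.run(cmd, shell=True, check=check)

    def start_rafiki():
        # Create the swarm and overlay network; ignore errors if they already exist
        sh('docker swarm init', check=False)
        sh('docker network create -d overlay --attachable rafiki', check=False)
        # Start the stateful services first, then the admin API
        sh('docker run -d --name rafiki_db --network rafiki postgres:10')
        sh('docker run -d --name rafiki_cache --network rafiki redis:5')
        sh('docker run -d --name rafiki_admin --network rafiki -p 3000:3000 rafiki/admin')

    if __name__ == '__main__':
        start_rafiki()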

repo folder structure

  • src
    • client.py
    • admin.py
    • worker.py (includes the Worker class and Advisor class for hyper-parameter tuning, e.g., GP)
    • model.py (includes the Model and Knob class)
    • auth.py (for authentication for both admin and frontend)
    • container.py (includes Docker/Kubernetes for managing containers)
    • frontend.py (for query batching and results ensembling)
    • db.py (for database operations)
  • doc
    • architecture.md
    • client.rst
  • example
  • script
  • test
  • requirements.txt
  • .travis.yml
  • LICENSE
  • README

code review Sep 23

Since we have a running version now, we can pause work on functionality for a while and try to make the code clean and tidy, including naming, folder structure and documentation. I am reviewing the code and listing some issues. Please let me know your comments. Once we reach an agreement, we can proceed to update the code. The goal is to make the code easy to read and the system easy to use (which is more important to me than extensibility).

  1. Shall we separate the DB, admin and advisor into 3 Docker containers, or launch them in a single container? Putting them together would simplify the launching process; the drawback is that if one of them fails, we have to restart the whole container. Any other related issues?

  2. cache.Dockerfile and db.Dockerfile are not necessary? Just pull the redis and postgres images from Docker Hub.

  3. Rename the workdir to /root/rafiki? app sounds like the Rafiki application. Rename model.Dockerfile to worker.Dockerfile? Any better (single-word) name for query_frontend?

  4. Shall we rearrange the variables in .env.sh and src/config.py? E.g. put environment variables for other tools (the DB, Redis, etc.), like the Postgres host and port, in .env.sh, and put Rafiki-specific variables (introduced by Rafiki), like the superuser, in config.py. Another option is to put variables read by Python programs into config.py and variables read by bash into .env.sh (is any variable read by both Python and bash?). The advantage of config.py is that we copy the file into the Docker container instead of exporting each variable explicitly; later, when we want to change or add a variable, we do not need to change the code.

  5. Should self._db.create_inference_job_worker be called inside self._create_service? In other words, the former should always be called inside the latter.

  6. stop_train_job_worker is not called by stop_train_job_services?

  7. Rename stop_train_job_services to stop_train_services?

  8. Can we merge ServiceManager.py into Admin.py? The code in both files is similar; most of it interacts with the database. ServiceManager does not have many attributes/properties, and Admin has only one ServiceManager instance.

  9. Let's unify the terminology for hyper-parameter tuning: one knob is one hyper-parameter; one trial is one assignment of all knobs of a model; one study tunes the hyper-parameters until the given budget is used up.

  10. I do not quite understand the advisor folder. From my view, the advisor service can be implemented like this: keep a dict of (job_id -> advisor instance) in app.py; insert a new advisor instance when a job starts; generate the next trial by calling the corresponding advisor; append each trial and its result to a database; for failure recovery (low priority), rebuild the advisor by replaying the (trial, result) pairs from the database. (A sketch follows this list.)

  11. Move VGG and SingleHiddenLayerTensorflowModel into a top-level example folder. Only keep the base model in src.

  12. Try to use a flat folder structure as much as possible. The advantage is that import statements become simpler. It is not common to see a requirements.txt in every subfolder: if a project has multiple sub-packages that can be installed independently, then we'd better have multiple requirements.txt files (one per sub-package); otherwise, we usually put a single requirements.txt at the top folder.

  13. Use a consistent naming style: name all files like xxx_yyy.py and all classes like XxxYyy.

  14. Move the JSON files at the root folder somewhere else?
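
A minimal sketch of the scheme described in item 10; the advisor's propose()/feedback() methods and the DB layer are stand-ins for the real ones:

    advisors = {}     # job_id -> advisor instance, kept in app.py
    trial_log = {}    # job_id -> list of (knobs, score); stands in for the trials table

    def next_trial(job_id):
        # The next trial is generated by calling the corresponding advisor
        return advisors[job_id].propose()

    def report_result(job_id, knobs, score):
        advisors[job_id].feedback(knobs, score)
        trial_log.setdefault(job_id, []).append((knobs, score))  # persisted for recovery

    def recover_advisor(job_id, advisor):
        # Low priority: rebuild advisor state by replaying (knobs, score) pairs
        for knobs, score in trial_log.get(job_id, []):
            advisor.feedback(knobs, score)
        advisors[job_id] = advisor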

sentiment analysis application

We are using image classification as our application.
It would be better to consider at least one more application during development to make sure that the design is extensible; sometimes we have to trade off between extensibility and usability.
Sentiment analysis is quite different from image classification: the input is a sequence of words and the output is a probability (the larger, the more positive). We can use SVM/XGBoost as the model and tf-idf as the features (see the sketch below).
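
A minimal sketch of the suggested baseline (tf-idf features plus a linear SVM) using scikit-learn; the toy dataset is illustrative only:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ['great movie, loved it', 'terrible plot and acting']
    labels = [1, 0]  # 1 = positive, 0 = negative

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(texts, labels)

    # decision_function returns a real-valued score; the larger, the more positive
    print(model.decision_function(['what a wonderful film']))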

Add new POS tagging task

I have 2 ready-to-use custom models for POS tagging - a bi-gram HMM and a CNN+RNN - that I can contribute to Rafiki as new models under a new task.

database schemas and design issues

  1. App developer A trained a job; can B train a new version for the same app?

  2. app_name to app?

  3. get_train_job_workers_by_train_job to get_workers_of_train_job

  4. get_train_jobs and get_train_job have quite different inputs and outputs. How about:

    get_train_job_of_app(app:str, version=-1) -> info of the job for the given version of the app; -1 for latest version.
    get_all_train_job_of_app(app:str) -> list of jobs
    get_train_job(job_id: int) -> job info
    get_workers_of_job(job_id: int) -> list of workers  
    

Similar functions for the inference job? The inference jobs and training jobs share the same version space?

  1. We should not let the developer of app A create an inference job for app B?
  2. Who calls predict_with_trial?
  3. create_model(..., docker_image_name) to create_model(..., docker_image, docker_file): either use the built-in docker_image or a Dockerfile provided by the model contributor.
  4. For application users, we need a field for the user's tier, e.g. basic, premium, golden, etc.
  5. app.py needs to validate the data from get_request_params(), e.g. data types or missing fields (see the sketch after this list).
  6. To be consistent with the Rafiki paper, use knob and advisor for hyperparameter and tuner respectively.
  7. Can Docker Swarm and Kubernetes give us hardware resource info?
  8. TrainJobWorker -> Worker?
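
A minimal sketch of the validation asked for in item 5; the field names and error format are illustrative only:

    def check_request_params(params, required_fields, field_types=None):
        # Return a list of validation errors for the parsed request params
        field_types = field_types or {}
        errors = ['missing field: {}'.format(f) for f in required_fields if f not in params]
        for field, expected_type in field_types.items():
            if field in params and not isinstance(params[field], expected_type):
                errors.append('field {} should be of type {}'.format(field, expected_type.__name__))
        return errors

    # e.g. in app.py, after params = get_request_params():
    # errors = check_request_params(params, ['app', 'task'], {'budget': dict})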

Resolve SINGA-397

https://issues.apache.org/jira/browse/SINGA-397

"Thank you! I'm sorry that I have made a mistake. Now the client-usage.py can successfully run on my machine. By the way, I strongly recommend you highlight the pr-requisites of the python libraries such as tensorflow and keras. It seems that the installation of new versions of these python libraries can also support the whole training and inference process(on my machine) but I still want to confirm whether the specific version is necessary. Since we should firstly create models, would it be okay for you to shift the Quickstart(Model Developers) to the first part of the User Guide?It seems now the order is alphabetical."

Rename terms & methods

  • app_name to app
  • get_train_job_workers_by_train_job to get_workers_of_train_job
  • get_train_job_of_app(app:str, version=-1) -> info of the job for the given version of the app; -1 for latest version.
  • get_all_train_jobs_of_app(app:str) -> list of jobs
  • get_train_job(train_job_id: int) -> job info
  • get_workers_of_job(job_id: int) -> list of workers
  • get_all_inference_jobs_of_app(app:str) -> list of jobs
  • get_inference_job(job_id: int) -> job info
  • hyperparameter to knob
  • tuner to advisor

Integrate Clipper

  • App developer can deploy ATM-trained TF models & create apps with models
  • App users can make predictions with deployed models on clipper

Add independent advisor component

Sketch

  • Has its own DB
  • Accepts hyperparameter config and trial scores, recommends hyperparameters

APIs

AdvisorService

  • constructor()
  • create_advisor(knob_config): advisor_id
  • propose(advisor_id): (knobs, proposal_id)
  • add_result(advisor_id, proposal_id, score)
  • add_feedback(advisor_id, proposal_id, score): should_stop
  • delete_advisor(advisor_id)

AdvisorStore

  • constructor()
  • create_advisor(advisor_inst, knob_config): advisor
  • get_advisor(advisor_id): advisor?
  • update_advisor(advisor, advisor_inst): advisor
  • add_proposal(advisor, knobs): proposal
  • get_proposals(advisor_id): [proposal]
  • get_proposal(advisor_id, proposal_id): proposal?
  • update_proposal(advisor, proposal, result_score, feedback_scores): proposal
  • delete_advisor(advisor_id)

Advisor

  • id
  • advisor_inst
  • knob_config
  • proposals: [Proposal]

Proposal

  • id
  • knobs
  • result_score: double?
  • feedback_scores: [double]

KnobConfig: Dict<string, any>
Knobs: Dict<string, any>
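
A minimal sketch of the AdvisorService interface listed above; the in-memory storage, uuid-based IDs and random proposals are placeholders for the real AdvisorStore and tuning logic:

    import random
    import uuid

    class AdvisorService:
        def __init__(self):
            self._advisors = {}  # advisor_id -> {'knob_config': ..., 'proposals': ...}

        def create_advisor(self, knob_config):
            advisor_id = str(uuid.uuid4())
            self._advisors[advisor_id] = {'knob_config': knob_config, 'proposals': {}}
            return advisor_id

        def propose(self, advisor_id):
            advisor = self._advisors[advisor_id]
            # Placeholder strategy: sample each knob uniformly from its (min, max) range
            knobs = {name: random.uniform(lo, hi)
                     for name, (lo, hi) in advisor['knob_config'].items()}
            proposal_id = str(uuid.uuid4())
            advisor['proposals'][proposal_id] = {'knobs': knobs, 'result_score': None,
                                                 'feedback_scores': []}
            return knobs, proposal_id

        def add_result(self, advisor_id, proposal_id, score):
            self._advisors[advisor_id]['proposals'][proposal_id]['result_score'] = score

        def add_feedback(self, advisor_id, proposal_id, score):
            proposal = self._advisors[advisor_id]['proposals'][proposal_id]
            proposal['feedback_scores'].append(score)
            return False  # should_stop; a real advisor could early-stop on poor feedback

        def delete_advisor(self, advisor_id):
            self._advisors.pop(advisor_id, None)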

Model developers to tune architecture

With "Efficient Neural Architecture Search via Parameter Sharing"

Planned major changes

To better support architecture tuning with ENAS, I'm planning changes to Rafiki's current model training framework:

Replacing budget option MODEL_TRIAL_COUNT with TIME_HOURS

Context

Currently, when application developers create model training jobs, they pass a budget like { 'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20 }, with MODEL_TRIAL_COUNT deciding the no. of trials to conduct for each model template.

Change

Replace the MODEL_TRIAL_COUNT option with a TIME_HOURS option, which specifies how long the train job should run for, as a soft time target. At the same time, I'll be reworking the Advisor component (which proposes trials' knobs) so that it is additionally in charge of deciding how many trials to run, when to stop each worker and when to stop the train job, given the budget, e.g. GPU_COUNT and TIME_HOURS.
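
For illustration, the same train job's budget before and after the proposed change (the exact values are examples only):

    budget_before = {'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20}  # current: fixed trial count per model
    budget_after = {'GPU_COUNT': 1, 'TIME_HOURS': 12}          # proposed: soft time target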

Reasons for change

  • It may not be intuitive for application developers to specify a no. of trials when creating a train job ("how many trials should I put as the budget? how long do I need to wait?"), especially since they're not supposed to be familiar with details like how models are trained and tuned. In contrast, TIME_HOURS is more straightforward.
  • Currently, different models & model tuning strategies require different no. of trials to be effective. For example, the original ENAS tuning strategy requires maybe (301x150+10+1) trials for sufficient train-eval cycles.
  • In the future, it gives more flexibility to model tuning strategies at the Advisor component - for example, I'll be adding a new tuning strategy that takes models with no hyperparameters (e.g. a knob config consisting only of fixed values) and conducts just a single trial (since there's nothing to tune). It's also possible for a new tuning strategy to situationally conduct more or fewer trials based on feedback from workers.

Introducing PolicyKnob

Motivation

I have been integrating ENAS as a new model tuning strategy in Rafiki (i.e. at the Advisor component). If a model template wants to do architecture tuning with ENAS, the model's training code needs to switch between different "modes":

  • During the ENAS architecture search phase, the model needs to alternate between "train my parameters for 1 epoch" and "don't train my parameters; just evaluate on the validation dataset"
  • At the end of the architecture search, the model needs to switch to training its parameters from scratch with a full-sized architecture stacked with more cells, and train for 310 epochs

Similarly, when you think about a standard hyperparameter tuning procedure, you might want the model to do early-stopping for the first e.g. 100 trials, then conduct a final trial for a full e.g. 300 epochs.

In both architecture tuning & hyperparameter tuning, the model needs to be configured by Rafiki somehow to switch between these "modes" on a per-trial basis.

Change

We can model the configuration of a model template for different training "modes" with different model policies. For example, if a model is to engage the policy QUICK_TRAIN, it should prematurely speed up its training step, e.g. by doing early stopping or by reducing the no. of epochs. The model communicates to Rafiki which policies it supports by adding PolicyKnob(policy_name) to its knob_config. In turn, Rafiki configures the activation of the model's policies on a per-trial basis by realising the values of PolicyKnobs to either True (activated) or False (not activated).

For example, here is an example knob config of a model that supports the policy QUICK_TRAIN:

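The original screenshot is not reproduced here; a minimal sketch of what such a knob config might look like, assuming knob types like FixedKnob and PolicyKnob are importable from rafiki.model (the import path and the other knob are illustrative):

    from rafiki.model import FixedKnob, PolicyKnob

    def get_knob_config():
        return {
            'max_epochs': FixedKnob(300),
            # Declares that this model supports the QUICK_TRAIN policy; Rafiki
            # realises its value to True or False on a per-trial basis
            'quick_train': PolicyKnob('QUICK_TRAIN'),
        }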

Whenever the model is to do early-stopping, Rafiki will pass quick_train=True as part of the model's knobs. Otherwise, the model defaults to full-length training.

Here is my current documentation for PolicyKnob:

'''
    Knob type representing whether a certain policy should be activated, as a boolean.
    E.g. the `QUICK_TRAIN` policy knob decides whether the model should stop training early.
    Offering the ability to activate different policies can optimize the hyperparameter search for your model.
    Activation of all policies defaults to false.

    =====================       =====================
    **Policy**                  Description
    ---------------------       --------------------- 
    ``SHARE_PARAMS``            Whether model supports parameter sharing       
    ``QUICK_TRAIN``             Whether model should stop training early in `train()`, e.g. with use of early stopping or reduced no. of epochs
    ``SKIP_TRAIN``              Whether model should skip training its parameters
    ``QUICK_EVAL``              Whether model should stop evaluation early in `evaluate()`, e.g. by evaluating on only a subset of the validation dataset
    ``DOWNSCALE``               Whether a smaller version of the model should be constructed e.g. with fewer layers
    =====================       =====================
    
'''

Add per-role documentation

Add docs for each role, e.g. how can a model contributor add a new model? How can an app developer train a model using their own data?

Also, add installation instructions to the docs site.
