tencentmusic / cube-studio

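cube studio is an open-source, cloud-native, one-stop machine learning/deep learning AI platform. It supports SSO login, multi-tenancy and multiple project groups, integration with big data platforms, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference services, edge computing, serverless, a labeling platform, automated labeling, dataset management, large-model fine-tuning, vllm large-model inference, llmops, private knowledge bases, and an AI model app store. It also supports one-click model development/inference/fine-tuning, domestic (Chinese) cpu/gpu/npu chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.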

License: Other

Dockerfile 0.29% Shell 0.62% Python 26.65% HTML 1.35% JavaScript 0.93% TypeScript 7.72% Mako 0.01% CSS 0.42% Jupyter Notebook 60.19% Less 1.62% Mustache 0.19% Smarty 0.02%
kubernetes inference mlops workflow ai pytorch spark argo kubeflow automl aihub gpt llmops notebook pipeline vgpu

cube-studio's Introduction

Cube Studio

Infra

cube-studio is a one-stop cloud-native machine learning platform open-sourced by Tencent Music. It currently includes the following main functions:

  • 1. Data management: feature store with online and offline features; dataset management for structured data and media data; data labeling platform
  • 2. Development: notebook (vscode/jupyter); docker image management; online image building
  • 3. Training: online drag-and-drop pipelines; open template market; distributed computing/training tasks, for example tf/pytorch/mxnet/spark/ray/horovod/kaldi/volcano; batch priority scheduling; resource monitoring/alerting/balancing; cron scheduling
  • 4. AutoML: nni, ray
  • 5. Inference: model management; serverless traffic control; tf/pytorch/onnx/tensorrt model deployment with tfserving/torchserve/onnxruntime/triton inference; VGPU; load balancing, high availability, elastic scaling
  • 6. Infra: multi-user; multi-project; multi-cluster; edge cluster mode; blockchain sharing

Doc

https://github.com/tencentmusic/cube-studio/wiki

WeChat group

For learning, deployment, consulting, contribution, or cooperation, join the group: add WeChat ID luanpeng1234 with the remark <open source>. See also the construction guide.

Job Template

tips:

  • 1. You can develop your own templates; they are easy to develop and can better suit your own scenarios
| template | type | description |
|---|---|---|
| linux | base | Custom stand-alone operating environment, free to implement all custom stand-alone functions |
| datax | import export | Import and export of heterogeneous data sources |
| hadoop | data processing | hdfs, hbase, sqoop, spark client |
| sparkjob | data processing | spark serverless |
| volcanojob | data processing | volcano multi-machine distributed framework |
| ray | data processing | python ray multi-machine distributed framework |
| ray-sklearn | machine learning | sklearn based on ray framework supports multi-machine distributed parallel computing |
| xgb | machine learning | xgb model training and inference |
| tfjob | deep learning | Multi-machine distributed training of tensorflow |
| pytorchjob | deep learning | Multi-machine distributed training of pytorch |
| horovod | deep learning | Multi-machine distributed training of horovod |
| paddle | deep learning | Multi-machine distributed training of paddle |
| mxnet | deep learning | Multi-machine distributed training of mxnet |
| kaldi | deep learning | Multi-machine distributed training of kaldi |
| tfjob-train | model train | distributed training of tensorflow: plain and runner |
| tfjob-runner | model train | distributed training of tensorflow: runner method |
| tfjob-plain | model train | distributed training of tensorflow: plain method |
| tf-model-evaluation | model evaluate | distributed model evaluation of tensorflow2.3 |
| tf-offline-predict | model inference | distributed offline model inference of tensorflow2.3 |
| model-register | model service | register model to platform |
| model-offline-predict | model service | distributed offline model inference of framework |
| deploy-service | model service | deploy inference service |
| media-download | multimedia data processing | Distributed download of media files |
| video-audio | multimedia data processing | Distributed extraction of audio from video |
| video-img | multimedia data processing | Distributed extraction of pictures from video |
| yolov7 | machine vision | object-detection with yolov7 |

Deploy

wiki

Company

cube-studio's People

Contributors

674345386, cdllp2, chendile, clementine124, colorfuldick, cyxnzb, data-infra, datascientistsamchan, ferdinandward, goldworker, gxin0426, harry201706, jacktao007, jlwll, kalenhaha, ldd91, lkad, nowbug, nutsjian, paopjian, stewart482, winifred43, xiaoyangmai, yanghua, yann-su, zhangchunsheng, zhuyaguang, znanjie

cube-studio's Issues

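Notebook cannot be accessed when deployed behind NAT traversal

I am using frp for NAT traversal. At the moment only the platform itself opens; the notebooks I create do not open and return 404 right away. What should I do? A simple, clear solution would be much appreciated, thanks!
With frp I only forwarded port 80.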

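Deploying a notebook on the platform immediately shows "unknown", and choosing reset raises an error

The version is the latest master branch, pulled in the early hours of June 16, 2022
Single-node deployment
All nodes show a normal status in k8s
Steps to trigger:
{
1. Create a new notebook project
2. After seeing the status "unknown", click reset
}
Clicking "name" goes straight to a 404 page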

mysql error

Workload: mysql
It shows: ReplicaSet "mysql-69b7f785c9" has timed out progressing.

ImagePullBackOff: Back-off pulling image "mysql:5.7"

How can this be fixed?

Image pull still fails after updating to the latest version

Failed to pull image "ai.tencentmusic.com/tme-public/notebook:jupyter-ubuntu-cpu-1.0.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ai.tencentmusic.com/tme-public/notebook, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Can claimRef be added to the PV?

The PV cannot be bound to the PVC; adding claimRef solves it.

Sometimes I add storageClassName: local to the PV, but it works only some of the time, which confuses me.
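
As an illustration of the claimRef approach described above, here is a minimal sketch that builds a PersistentVolume manifest reserved for one specific PVC. In Kubernetes, `spec.claimRef` names the PVC (namespace and name) that the volume is reserved for, and a PVC only binds to a PV whose `storageClassName` matches its own. One possible cause of the intermittent behaviour is a PVC that omits `storageClassName` and is assigned the cluster default class, but that is a guess, not something confirmed in this thread. The names `pv-demo` and `pvc-demo`, the capacity, and the host path are hypothetical placeholders, not values from the cube-studio install scripts.

```python
import json


def pv_with_claim_ref(pv_name, pvc_name, pvc_namespace, capacity, host_path,
                      storage_class="local"):
    """Build a PersistentVolume manifest reserved for one PVC via spec.claimRef."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": capacity},
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            # Must equal the storageClassName requested by the PVC.
            "storageClassName": storage_class,
            # Reserves this volume for exactly one claim.
            "claimRef": {"namespace": pvc_namespace, "name": pvc_name},
            "hostPath": {"path": host_path},
        },
    }


if __name__ == "__main__":
    # Hypothetical example values; adjust to the real PVC before applying.
    manifest = pv_with_claim_ref("pv-demo", "pvc-demo", "pipeline",
                                 "10Gi", "/data/demo")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.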


Task deployment

Can this be deployed locally on Windows 11? Docker Desktop can deploy k8s with one click, although Linux would be easier.

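Bugs encountered in a single-node deployment on Tencent Cloud

  1. Tencent Cloud machines cannot reach Google's image registry. If an image the script needs is missing and cannot be pulled from the available mirrors, the deployment fails; in my case it was kubeflow-prometheus-adapter;
  2. The start.sh script needs to download kfctl, which requires access to the external internet. The Tencent Cloud server cannot do that, so the kubeflow base components cannot be installed;
  3. A Tencent Cloud CVM has two network interfaces, with an internal IP and an external IP. Use the internal IP in the k8s configuration and the external IP to open the web UI in a browser;
  4. Before deploying cube-studio, first deploy the matching versions of docker and k8s. If you deploy k8s with rancher, remember to use rancher v2.5.2 rather than latest. It is also best to install docker by following the official site and setting up the repository first, otherwise the install fails easily;
  5. Write-up of a single-node cube-studio deployment on Tencent Cloud: https://blog.csdn.net/weixin_39750084/article/details/124986488?spm=1001.2014.3001.5502.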

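Could there be a one-click docker compose deployment?

This project looks great: it supports data processing, modeling, and analysis as pipeline operations. But deployment currently looks quite complex. Could the process be simplified so that a plain docker compose up is enough? An online demo address would be ideal too.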

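A small suggestion: I hope the author can update the images to cuda11

I don't know which cuda version everyone else uses, but my machine runs cuda 11.x, so I hit some problems when deploying notebook nodes
The main problems are that the GPU is not detected, and that GPU training complains about missing cuda11-related files and missing cuDNN files
So every time I have to reinvent the wheel by reinstalling cuda11 and then cuDNN on the node I pulled, which is tedious
Building my own image makes it far too large, and it can be troublesome for someone without experience.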

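Add big data capabilities to jupyter lab

Is there any plan to add a spark component for big data processing to jupyter lab, connecting the data exchange flow through a livy server?

P.S. How can I join your development?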

Some namespaces have no pods

In create_ns_secret.sh:
for namespace in 'infra' 'kubeflow' 'istio-system' 'knative-serving' 'pipeline' 'katib' 'jupyter' 'kfserving' 'service' 'pre-service' 'cert-manager' 'monitoring' 'logging' 'kube-system' 'volcano-system'

katib jupyter kfserving pre-service cert-manager logging

These namespaces have no pods.
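
To reproduce the check reported above, here is a minimal sketch that compares the namespace list from create_ns_secret.sh against a pod listing. It assumes the listing is the text printed by `kubectl get pods --all-namespaces --no-headers`, whose first column is the namespace. The function name and the sample listing are hypothetical and only illustrate the comparison.

```python
EXPECTED_NAMESPACES = [
    "infra", "kubeflow", "istio-system", "knative-serving", "pipeline",
    "katib", "jupyter", "kfserving", "service", "pre-service",
    "cert-manager", "monitoring", "logging", "kube-system", "volcano-system",
]


def namespaces_without_pods(expected, kubectl_output):
    """Return the expected namespaces that do not appear in the pod listing."""
    seen = set()
    for line in kubectl_output.splitlines():
        fields = line.split()
        if fields:
            # First column of the listing is the namespace.
            seen.add(fields[0])
    return [ns for ns in expected if ns not in seen]


if __name__ == "__main__":
    # Made-up sample listing, not output from a real cluster.
    sample = (
        "infra        mysql-0       1/1  Running  0  1d\n"
        "kube-system  coredns-abc   1/1  Running  0  1d\n"
    )
    print(namespaces_without_pods(EXPECTED_NAMESPACES, sample))
```

An empty namespace is not necessarily a fault: whether a namespace is expected to hold pods right after installation depends on which components the deployment actually uses.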

dataset+job-template+pipeline+inference demo

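Vision: training and inference support for yolo-related models, darknet-related models, PaddleSeg image segmentation, OCR-related models, and so on

Speech: training and inference support for wenet speech recognition.

Recommendation: training and inference service support for algorithms such as bin, deepfm, and ple

Text: training and inference support for bert-based models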

A small suggestion

As a full-featured platform, security should be very important. Many functions involve multi-machine cluster deployment and computing, and some even require calls between internal and external networks. In my humble opinion, the author could add warnings for sensitive operations in the related functions.
For example:

  1. At login, compare the login location and time
  2. Run a comparison when a training task is published
  3. Verify identity when the password is changed
  4. Bind sensitive operations to social tools or SMS for reminders, etc.
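
The first suggestion, comparing login location and time, could look roughly like the sketch below. Everything in it is hypothetical: the `Login` record, the `login_warnings` function, and the 07:00 to 22:59 window of "usual" hours are illustrative choices and not part of cube-studio.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Login:
    user: str
    city: str
    at: datetime


def login_warnings(history, current, usual_hours=range(7, 23)):
    """Return warning strings for a login that differs from the user's history."""
    warnings = []
    known_cities = {entry.city for entry in history if entry.user == current.user}
    # Only flag the location when there is a history to compare against.
    if known_cities and current.city not in known_cities:
        warnings.append(f"new login location: {current.city}")
    if current.at.hour not in usual_hours:
        warnings.append(f"unusual login hour: {current.at.hour:02d}:00")
    return warnings


if __name__ == "__main__":
    history = [Login("alice", "Shenzhen", datetime(2022, 6, 1, 10, 0))]
    attempt = Login("alice", "Beijing", datetime(2022, 6, 2, 3, 0))
    for message in login_warnings(history, attempt):
        print(message)
```

The returned messages could then be passed to whatever notification channel is bound to the account, such as the SMS reminders mentioned in item 4.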
