federatedai / fate-flow

Solution for deploying and managing end-to-end federated learning workflows

License: Apache License 2.0

Shell 2.66% Python 97.34%
pipelines workflow resource-management model-management data-management

fate-flow's People

Contributors

chengtcc, dylan-fan, elvis-xiao, gps949, gxcuit, jarviszeng-zjc, jat001, jingchen23, jsuper, mdgbdgmg, mgqa34, mikkiyang, owlet42, sagewe, shadowxz, webankaiteam, wfangchi, zhihuiwan


fate-flow's Issues

Using `flow table bind` to connect to MySQL returns no data, and connecting to HDFS raises an error

We are using FATE 1.9.0, deployed in distributed mode across three virtual machines; the backend computing and storage engine is EggRoll.

  1. Connecting to MySQL returns no table data
    The MySQL instance runs from the docker image mysql:5.7 on another virtual machine; the modeling data was inserted into the test table of the huian database. The contents of the test table are shown below.
    [screenshot]
    Inside the confs-xxx_client_1 container of one FATE party, we wrote a JSON file for the MySQL connection named data.json, shown below.
    [screenshot]
    After running flow table bind -c data.json, a connection to the remote database is established, but flow table info reports no data in the table: the count field of the result is null, so the DTable cannot be used for modeling tasks.
    [screenshot]
    How should data.json be modified so that the data in the MySQL table is loaded into FATE storage? Or, after flow table bind -c data.json, is some other command needed to load the data into FATE storage? (See the config sketch after this list.)

  2. Error when connecting to HDFS
    HDFS runs from the singularities/hadoop:latest image (the Hadoop version should be 2.8.3) on one virtual machine, with one namenode and three datanodes. The data.json file for the storage mapping is shown below.
    [screenshot]
    flow table bind -c data.json fails with the error shown below.
    [screenshot]
    How should this be configured to import files from HDFS correctly?
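
For reference, a hedged sketch of the two bind configs: the field names follow the table-bind examples in the upstream FATE storage docs (an assumption for this exact version), and every host, credential, path, and table name below is a placeholder. Note that, as far as we understand, flow table bind only registers metadata pointing at the external address; the rows are pulled in when a job's reader component consumes the bound table.

import json

# Hedged sketch of data.json for the MySQL case; all values are placeholders.
mysql_bind = {
    "engine": "MYSQL",
    "address": {
        "user": "root",            # placeholder credentials
        "passwd": "******",
        "host": "192.168.0.100",   # placeholder host
        "port": 3306,
        "db": "huian",
        "name": "test"
    },
    "namespace": "experiment",
    "name": "test"
}

# Hedged sketch of data.json for the HDFS case.
hdfs_bind = {
    "engine": "HDFS",
    "address": {
        # NameNode RPC address, not the WebHDFS/HTTP port
        "name_node": "hdfs://namenode:9000",
        "path": "/data/test.csv"
    },
    "id_delimiter": ",",
    "head": 1,
    "partitions": 8,
    "namespace": "experiment",
    "name": "test_hdfs"
}

with open("data.json", "w") as f:
    json.dump(mysql_bind, f, indent=4)  # or hdfs_bind for case 2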

New feature: algorithm security audit

System information

  • FATE Flow version (use command: python fate_flow_server.py --version): 1.8
  • Python version (use command: python --version): 3.6.9
  • Are you willing to contribute it (yes/no): yes

Describe the feature and the current behavior/state.
Add log output for every step in which an algorithm component transmits information to the other party via multi-party communication: print the transmitted content together with a description of the operation, writing the output to a dedicated file under the corresponding job's log directory. Also provide an API so that users can inspect, for a given job and algorithm component, what information was transmitted during multi-party communication and whether it was encrypted. Whether these logs are printed will be controlled by a switch; the audit log is only produced when the switch is on, to reduce the performance cost. In my own view, compared with the multi-party RPC itself, the extra logging does not add much overhead, but for productization I will still provide a switch that enables multi-party communication logging during algorithm execution.
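
A minimal sketch of the proposed switch-controlled audit log, assuming plain stdlib logging (every name here is hypothetical; FATE's real transfer API differs):

import json
import logging

AUDIT_ENABLED = False  # the proposed switch, off by default to avoid overhead

audit_logger = logging.getLogger("federation_audit")

def audit_remote(operation: str, dst_role: str, payload) -> None:
    """Record one cross-party transfer when auditing is switched on."""
    if not AUDIT_ENABLED:
        return
    audit_logger.info(json.dumps({
        "operation": operation,               # description of this transfer
        "dst_role": dst_role,                 # receiving party
        "payload_repr": repr(payload)[:512],  # truncated content summary
    }))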

Will this change the current api? How?
No changes to existing APIs; new APIs will be added.

Who will benefit with this feature?
Every user building a federated learning product around FATE will benefit. Without an audit log of multi-party communication during algorithm execution, users will question the security of the federated platform; adding these logs greatly improves the product's security story, makes it easier to understand, and strengthens users' trust in federated learning products.
Any Other info.

Add Worker Manager to manage workers

System information

  • FATE Flow version (use command: python fate_flow_server.py --version): 1.7.0
  • Python version (use command: python --version): 3.6.5
  • Are you willing to contribute it (yes/no): yes

Describe the feature and the current behavior/state.

Will this change the current api? How?

Who will benefit with this feature?

Any Other info.

homo_nn fails when a validation set is added via pipeline

The arbiter reports an error:
[screenshot]
Pipeline code (the error appears after adding a validation set; reader_0 is the training set, reader_1 is the validation set):
[screenshot]
With exactly the same code logic, homo_lr runs fine, but homo_nn fails.

hetero_lr fails with "processor in session meta is not valid"; asking for ideas on how to locate the problem

FATE cluster deployed on Kubernetes with KubeFATE.
FATE version: 1.7.2
KubeFATE version: 1.4.3

Error log of the hetero_lr component on the corresponding arbiter node:
[ERROR] [2022-03-29 12:24:59,438] [202203291210246798510] [30:140092026906368] - [job_tracker.clean_task] [line:491]: cleanup error
Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/operation/job_tracker.py", line 455, in clean_task
sess.init_computing(computing_session_id=f"{computing_temp_namespace}_clean", options=session_options)
File "/data/projects/fate/fate/python/fate_arch/session/_session.py", line 116, in init_computing
session_id=computing_session_id, options=options
File "/data/projects/fate/fate/python/fate_arch/computing/eggroll/_csession.py", line 36, in init
self._rp_session = session_init(session_id=session_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 42, in session_init
er_session = ErSession(session_id=session_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 199, in init
self.__session_meta = self._cluster_manager_client.get_or_create_session(session_meta)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 189, in get_or_create_session
serdes_type=self.__serdes_type))
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 243, in __check_processors
raise ValueError(f"processor in session meta is not valid: {session_meta}")
ValueError: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002_clean, name=, status=NEW, tag=, processors=[, len=1], options=[{'eggroll.session.processors.per.node': '1', 'eggroll.session.deploy.mode': 'cluster'}]) at 0x7f6ab8e38f28>
[ERROR] [2022-03-29 12:25:42,204] [202203291210246798510] [283:139704828536640] - [_session.get_session_from_record] [line:394]: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002, name=, status=NEW, tag=, processors=[, len=1], options=[{'python.venv': '/opt/app-root', 'eggroll.session.deploy.mode': 'cluster', 'python.path': '/data/projects/fate/fate/python/federatedml:/opt/rh/rh-nodejs10/root/usr/lib/python2.7/site-packages:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.session.processors.per.node': '1'}]) at 0x7f0f6a7a5908>
[ERROR] [2022-03-29 12:26:14,137] [202203291210246798510] [304:139838163154752] - [_session.get_session_from_record] [line:394]: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002, name=, status=NEW, tag=, processors=[***, len=1], options=[{'eggroll.session.deploy.mode': 'cluster', 'python.venv': '/opt/app-root', 'python.path': '/data/projects/fate/fate/python/federatedml:/opt/rh/rh-nodejs10/root/usr/lib/python2.7/site-packages:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.session.processors.per.node': '1'}]) at 0x7f2e75d72908>

Data in the session_processor table on the corresponding arbiter node:
[screenshot]

job reuse

Add a job inheritance function to reuse tasks that succeeded in a previous job.

To have every participant of a job complete the bind operation, must each party's bind API be called manually?

In the model deploy/load/bind flow, the source code shows that both deploy and load are initiated from the guest, and the request is then distributed through rollsite to the other participants of the job, so that after the guest performs deploy or load, the other participants execute the corresponding operation as well.
The bind API, however, does not appear to distribute the bind request to the participants.
So, to have every participant of the current job complete the bind operation, do we need to manually call each party's bind API?

Do not print an error log if the process no longer exists when trying to kill a worker process

System information

  • FATE Flow version (use command: python fate_flow_server.py --version): 1.8
  • Python version (use command: python --version): 3.6.5
  • Are you willing to contribute it (yes/no): yes

Describe the feature and the current behavior/state.
Do not print an error log if the process no longer exists when trying to kill a worker process.
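
A minimal sketch of the proposed behavior, assuming a POSIX kill path (illustrative only, not FATE's actual implementation):

import errno
import os
import signal

def kill_worker(pid: int) -> None:
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # The worker already exited; nothing to clean up, so stay silent
        # instead of logging an error.
        pass
    except OSError as e:
        if e.errno != errno.ESRCH:  # re-raise anything other than "no such process"
            raise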

Will this change the current api? How?
no

Who will benefit with this feature?

Any Other info.

provider register does not work as expected

  1. Try to register a SQL-related component with

{
  "name": "fate_sql",
  "version": "0.1",
  "path": "python/fate_sql"
}
  2. Exception

A DB-related exception occurs with a KeyError.

As registering a user-provided provider is a planned feature in 1.7.0, we hope this issue can be fixed as soon as possible 🙏

FATE 1.7.0 may be incompatible with FATE-Serving 2.0.5 (or older versions)

This bug affects FATE-Serving versions before 2.1.0 (including 2.0.5, 2.0.4, etc.). FATE-Serving may fail to find the pre-trained models if FATE 1.7.0 is used with FATE-Serving 2.0.5 (or an older version).

We strongly suggest using FATE-Serving 2.1.1 or 2.1.0 with FATE-1.7.0.



If you want to keep using FATE-Serving 2.0.5 (or older versions), here is a temporary solution.

Open the config file of FATE-Serving conf/serving-server.properties and change model.transfer.url to the correct URL, e.g. http://127.0.0.1:9380/v1/model/transfer.


Is there automatic cleanup of the models under model_local_cache and the job config files under jobs?

A question: does FATE 1.7.x automatically clean up the models in the model_local_cache directory and the historical job config files in the jobs directory (the items marked with red lines in the screenshot below)? I looked through the 1.7.2 fateflow source code and could not find such logic, but I am not sure. Also, will automatic cleanup of these two directories be added in 1.8? Looking forward to your reply.
[screenshot]

Originally posted by @MiKKiYang in #204 (comment)

writer component

The writer component supports outputting FATE storage data to designated external storage.

Doc of `flow-client` is currently missing `data upload-history`

Describe the current behavior
The data doc in the current version is missing a description of upload-history, which is a command in the current implementation.

Contributing

  • Do you want to contribute a PR? (Yes):
  • Briefly describe your candidate solution(if contributing):
    Update the corresponding doc.

add `drop` parameter to `flow table bind` command for flow cli client

When rebinding a table with the flow table bind command:

$ flow table bind -c bind_local_path.json 
{
    "retcode": 100,
    "retmsg": "The data table already exists.If you still want to continue uploading, please add the parameter -drop.1 means to add again after deleting the table"
}

Yet we find that the --drop parameter is not supported in this version 😢
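
Since the flow CLI is built on click, the requested option might look roughly like this (a sketch only, not the actual fate_client code):

import click

@click.command("bind")
@click.option("-c", "--conf-path", required=True, type=click.Path(exists=True))
@click.option("--drop", is_flag=True, default=False,
              help="Drop the existing table binding before re-binding.")
def table_bind(conf_path, drop):
    """Hypothetical shape of `flow table bind` with a --drop flag."""
    ...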

REST doc update request

Describe the feature and the current behavior/state.
It would be more convenient for the developers if the REST document is updated.

Thanks!

Who will benefit from this feature?
Developers

Tracker Client support dict arguments

System information

  • FATE Flow version (use command: python fate_flow_server.py --version): 1.7.0
  • Python version (use command: python --version): 3.6.5
  • Are you willing to contribute it (yes/no): yes

Describe the feature and the current behavior/state.

Will this change the current api? How?

Who will benefit with this feature?

Any Other info.

FATE-Flow 1.9: model store and restore tasks fail

I deployed FATE v1.9.0; calling flow model export to export a model to the storage engine fails with:
TypeError: _run() got an unexpected keyword argument 'cpn_input'

A first look shows that the _run() method at fateflow/component/_base.py line 132 names its parameter "cpn_input",
but ModelStore and ModelRestore override this method with a parameter named "input_cpn". After renaming the parameter to "cpn_input", the task runs normally.
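
A minimal reproduction of the signature mismatch (class and method names mirror the report; the bodies are stand-ins):

class ComponentBase:
    def run(self, cpn_input):
        self._run(cpn_input=cpn_input)  # keyword call, so the parameter name matters

    def _run(self, cpn_input):
        raise NotImplementedError

class ModelStore(ComponentBase):
    def _run(self, input_cpn):          # mismatched parameter name
        pass

try:
    ModelStore().run(cpn_input=object())
except TypeError as e:
    print(e)  # _run() got an unexpected keyword argument 'cpn_input'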

Also, this task's log prints the entire model_store_address section of service_conf.yaml (including the password), which is probably inappropriate.

Uploading a 100 MB file to HDFS fails

  • json

   cat /data/projects/fate/examples/data-1W/upload_lr7w_host.json
  {
    "file": "/data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv",
    "table_name": "lr7w_h",
    "namespace": "RATEST",
    "head": 1,
    "partition": 8,
    "work_mode": 1,
    "backend": 1,
    "use_local_data": 0,
    "drop": 1,
    "storage_engine": "HDFS"
  }

   ls -hl "/data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv"
  -rw------- 1 root root 109M Feb 23 10:10 /data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv

flow data upload -c /data/projects/fate/examples/data-1W/upload_lr7w_host.json --drop



  • Error log

[INFO] [2022-02-23 10:12:54,048] [202202231012241074990] [57899:139746043860800] - [_table._put_all] [line:64]: put in hdfs file: hdfs://namenode:9000//fate/input_data/RATEST/lr7w_h
[ERROR] [2022-02-23 10:12:54,181] [202202231012241074990] [57899:139746043860800] - [task_executor._run_] [line:243]: HDFS Flush failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Traceback (most recent call last):
 File "./fate/fate/python/fate_arch/storage/hdfs/_table.py", line 77, in _put_all
   writer.write(hdfs_utils.serialize(k, v))
 File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
 File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
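
errno 255 from libhdfs usually means the client reached the wrong port, e.g. a WebHDFS/HTTP port instead of the NameNode RPC port (commonly 8020 or 9000). A quick connectivity check with pyarrow, using the same address as the log above (the host and port here are assumptions to adjust):

from pyarrow import fs

# Requires the libhdfs environment (HADOOP_HOME, CLASSPATH) that FATE itself uses.
hdfs = fs.HadoopFileSystem(host="namenode", port=9000)  # assumed RPC endpoint
print(hdfs.get_file_info(fs.FileSelector("/", recursive=False)))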

Question about the new FATE Flow high-availability feature

FATE Flow high availability and load balancing are supported in v1.9.0. At what data volume and in which scenarios does high availability need to be deployed, and, crucially, how is it deployed? The deployment documentation does not seem to cover this.

Fix parameter inheritance problem.

A. When using model loader to train a new model, for example using model_loader to load hetero_nn_0's model and then training hetero_nn_1, hetero_nn_0's parameters are not inherited, which is wrong.
B. If a job contains hetero_nn_0 and hetero_nn_1, where hetero_nn_0 is a training task and hetero_nn_1 is a predicting task, and the user deploys hetero_nn_1, then hetero_nn_1 cannot inherit parameters from hetero_nn_0 in a single predicting task.

Bug In Download Component Output Data

When downloading a component's output data with "flow data download -c", the CSV file contains strings like "<federatedml.feature.instance.Instance object at 0x7fc7514621d0>"; but if we use "component_output_data" to download the data directly, the data is normal. There may be a bug in "flow data download -c".

site authentication optimization

System information

  • FATE Flow version (use command: python fate_flow_server.py --version):
  • Python version (use command: python --version):
  • Are you willing to contribute it (yes/no):

Describe the feature and the current behavior/state.

Will this change the current api? How?

Who will benefit with this feature?

Any Other info.

Prediction tasks cannot run after cluster migration

System information

CentOS 7, FATE 1.7

Describe the current behavior
Following the doc https://federatedai.github.io/FATE-Flow/latest/zh/fate_flow_model_migration/#_2,
after completing the model migration, running a prediction task on the new cluster fails.
Screenshot of the migrate result on the source machine:
[screenshot]
Screenshot of the import on the target machine:
[screenshot]

Describe the expected behavior

Other info / logs

Configuration:
migrate json
{
    "job_parameters": {"federated_mode": "MULTIPLE"},
    "role": {"guest": [10005], "host": [10003]},
    "execute_party": {"guest": [10005], "host": [10003]},
    "migrate_initiator": {"role": "guest", "party_id": 10005},
    "migrate_role": {"guest": [10001], "host": [10002]},
    "model_id": "guest-10005#host-10003#model",
    "model_version": "202204161500311591470"
}

import config
{
    "role": "guest",
    "party_id": 10001,
    "model_id": "guest-10001#host-10002#model",
    "model_version": "202204210856597457810",
    "file": "/data/projects/fate/examples/zhanht/migrate/guest#10001#guest-10001#host-10002#model_202204210856597457810.zip"
}

Log:
[ERROR] [2022-04-21 09:22:10,777] [202204210916388364990] [37172:140459868743488] - [model_loader._run] [line:133]: Get 'model_alias' failed. Trying to find a checkpoint...
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 130, in _run
self.get_model_alias()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 56, in get_model_alias
raise ValueError('The job was not found.')
ValueError: The job was not found.
[ERROR] [2022-04-21 09:22:10,778] [202204210916388364990] [37172:140459868743488] - [model_loader._run] [line:144]: Read checkpoint error.
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 142, in _run
return self.read_checkpoint()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 101, in read_checkpoint
raise ValueError('The checkpoint was not found.')
ValueError: The checkpoint was not found.
[ERROR] [2022-04-21 09:22:10,779] [202204210916388364990] [37172:140459868743488] - [task_executor.run] [line:243]: No checkpoint was found.
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 142, in _run
return self.read_checkpoint()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 101, in read_checkpoint
raise ValueError('The checkpoint was not found.')
ValueError: The checkpoint was not found.

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/worker/task_executor.py", line 195, in run
cpn_output = run_object.run(cpn_input)
File "./fate/fateflow/python/fate_flow/components/_base.py", line 149, in run
method(cpn_input)
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 147, in _run
else 'No checkpoint was found.')
OSError: No checkpoint was found.

tag is not defined

file: model_app.py

At line 493, tag is not defined:
else:
    if not (tag_operation is TagOperation.RETRIEVE and not request_data.get('with_model')):
        try:
            tag = Tag.get(Tag.f_name == tag_name)
        except peewee.DoesNotExist:
            raise Exception("Can not found '{}' tag.".format(tag_name))

    if tag_operation is TagOperation.RETRIEVE:
        if request_data.get('with_model', False):
            res = {'models': []}
            # `tag` is only assigned if the guard above matched
            models = (MLModel.select()
                      .join(ModelTag, on=ModelTag.f_m_id == MLModel.f_model_version)
                      .where(ModelTag.f_t_id == tag.f_id))
            for model in models:
                res["models"].append({
                    "model_id": model.f_model_id,
                    "model_version": model.f_model_version,
                    "model_size": model.f_size,
                    "role": model.f_role,
                    "party_id": model.f_party_id
                })
            res["count"] = models.count()
            return get_json_result(data=res)
        else:
            tags = Tag.filter(Tag.f_name.contains(tag_name))
            if not tags:
                return get_json_result(retcode=100, retmsg="No tags found.")
            res = {'tags': []}
            for tag in tags:
                res['tags'].append({'name': tag.f_name, 'description': tag.f_desc})
            return get_json_result(data=res)

    elif tag_operation is TagOperation.UPDATE:
        new_tag_name = request_data.get('new_tag_name', None)
        new_tag_desc = request_data.get('new_tag_desc', None)
        # `tag` is dereferenced here as well
        if (tag.f_desc == new_tag_name) and (tag.f_desc == new_tag_desc):
            return get_json_result(100, "Nothing to be updated.")
        else:
            if request_data.get('new_tag_name'):
                if not Tag.get_or_none(Tag.f_name == new_tag_name):
                    tag.f_name = new_tag_name
                else:
                    return get_json_result(100, retmsg="'{}' tag already exists.".format(new_tag_name))

            tag.f_desc = new_tag_desc
            tag.save()
            return get_json_result(retmsg="Infomation of '{}' tag has been updated successfully.".format(tag_name))

    else:
        delete_query = ModelTag.delete().where(ModelTag.f_t_id == tag.f_id)
        delete_query.execute()
        Tag.delete_instance(tag)
        return get_json_result(retmsg="'{}' tag has been deleted successfully.".format(tag_name))

Add a pass status for tasks when the need_run component parameter is used

System information

  • FATE Flow version (use command: python fate_flow_server.py --version):
    1.7.0

  • Python version (use command: python --version):
    3.6.5

  • Are you willing to contribute it (yes/no):
    yes

Describe the feature and the current behavior/state.

  1. Tasks will have a pass party status and a pass task status.
  2. When calculating the task status from the task party statuses, and the job status from the task statuses, the pass status is treated the same as the success status.
  3. The pass status applies when the need_run parameter is set to false in the job configuration, e.g. setting need_run=false for an evaluation component (see the sketch after this list).
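
A hedged sketch of the configuration case in item 3, written as the Python-dict equivalent of a job-conf fragment (the component name and nesting follow the FATE DSL v2 job-conf examples and should be treated as assumptions for your version):

# Hypothetical job-conf fragment disabling evaluation_0 via need_run;
# nesting follows "component_parameters.role.<role>.<party index>.<component>".
conf_fragment = {
    "component_parameters": {
        "role": {
            "guest": {
                "0": {"evaluation_0": {"need_run": False}}
            }
        }
    }
}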

Will this change the current api? How?
no

Who will benefit with this feature?

Any Other info.

The reader component can't load another job's output data table

System information

  • Have I written custom code (yes/no):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • FATE Flow version (use command: python fate_flow_server.py --version):
  • Python version (use command: python --version):

Describe the current behavior

Describe the expected behavior

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.

  • fateflow/logs/$job_id/fate_flow_schedule.log: scheduling log for a job
  • fateflow/logs/$job_id/* : all logs for a job
  • fateflow/logs/fate_flow/fate_flow_stat.log: a log of server stat
  • fateflow/logs/fate_flow/fate_flow_schedule.log: the starting scheduling log for all jobs
  • fateflow/logs/fate_flow/fate_flow_detect.log: the starting detecting log for all jobs

Contributing

  • Do you want to contribute a PR? (yes/no):
  • Briefly describe your candidate solution(if contributing):

Can the parameter-update API update only specific components?

System information

  • FATE Flow version (use command: python fate_flow_server.py --version):
  • 1.7
  • Python version (use command: python --version):
  • 3.6.10

Regarding the parameter update API /parameter/update: could it determine exactly which parameters changed and which components are affected, generate new versioned tasks for those components, then find the most upstream modified component and rerun the job from there? That would seem more reasonable. Looking at the 1.7 code, updating parameters requires submitting the complete modified configuration, even if only one component was changed. The code contains a variable for the modified components, but the related logic is not implemented. I would like to ask what the original design intent of this parameter update was, and whether the behavior I describe will be implemented later.

Clean up task temporary tables when cleaning a job

System information

  • FATE Flow version (use command: python fate_flow_server.py --version): 1.8.0
  • Python version (use command: python --version): 3.6.5
  • Are you willing to contribute it (yes/no): yes

Describe the feature and the current behavior/state.
Clean up task temporary tables when cleaning a job.

Will this change the current api? How?
No.

Who will benefit with this feature?

Any Other info.

Example jobs fail to run when the FATE backend uses spark_rabbitmq

In the docker-deploy deployment of KubeFATE v1.8.0 with two virtual machines, the example pipeline runs fine with eggroll, but example jobs do not run with spark and rabbitmq.

To use spark and rabbitmq, based on the information we collected:

  1. First set backend=spark_rabbitmq in the parties.conf file.
    [screenshot]

  2. Then modify the default_engines section of training_template/public/fate_flow/conf/service_conf.yaml,
    setting computing: spark, federation: rabbitmq, storage: hdfs.
    [screenshot]

  3. After the changes, run generate_config.sh and docker_deploy.sh all to start all docker containers on both virtual machines. On the host side, enter the client_1 container, modify fateflow/examples/upload/upload_host.json by appending "storage_engine": "HDFS", then submit the data with flow data upload.
    [screenshot]

  4. On the guest side, modify upload_guest.json the same way, appending "storage_engine": "HDFS", and submit the data with flow data upload. Then modify fateflow/examples/lr/test_hetero_lr_job_conf.json, adding "spark_run" and "rabbitmq_run" settings to job_parameters (see the sketch after this issue).
    [screenshot]

  5. Submit the job with flow job submit.
    [screenshot]

The whole procedure reports no errors, but after the job is submitted every f_status stays in the waiting state and the training pipeline never runs. What could be causing this?
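
For step 4, a hedged sketch of the job_parameters fragment: the key names follow the FATE-on-Spark examples for the 1.8 line, but the exact keys under spark_run/rabbitmq_run depend on the version, and all values are placeholders.

# Hypothetical job_parameters fragment for a spark/rabbitmq backend,
# expressed as a Python dict that can be merged into the job-conf JSON.
job_parameters_fragment = {
    "job_parameters": {
        "common": {
            "spark_run": {
                "num-executors": 2,   # placeholder resource sizing
                "executor-cores": 1
            },
            "rabbitmq_run": {
                "queue": {"durable": True}  # placeholder queue options
            }
        }
    }
}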

flow toy test bug in 1.7.0

  1. There is no timeout error on the client when the job times out.
  2. The toy job log is written to /data/projects/fate, which is a poor experience.
