federatedai / fate-flow
Solution for deploying and managing end-to-end federated learning workflows
License: Apache License 2.0
The FATE version in use is 1.9.0, started in distributed mode on three virtual machines; the backend computing and storage engine is EggRoll.
Cannot read table data after connecting to MySQL
The MySQL application runs on another virtual machine from the docker image mysql:5.7, and the modeling data has been inserted into the test table of the huian database. The contents of the test table are shown below.
Inside the confs-xxx_client_1 container of one FATE party, a JSON file for the MySQL connection, named data.json, is written as follows.
After running flow table bind -c data.json, a connection to the remote database can be established, but flow table info shows no data in the table: the count field of the result is null, so the data in this DTable cannot be used for modeling jobs.
How should data.json be modified so that the data in the MySQL table is loaded into FATE's storage system? Or, after running flow table bind -c data.json, is another command required to load the data into FATE storage?
Error after connecting to HDFS
HDFS is started on one virtual machine from the singularities/hadoop:latest image (the Hadoop version should be 2.8.3), with one namenode and three datanodes. The data.json file for the storage mapping is written as follows.
Running flow table bind -c data.json reports an error, shown below.
How should this be configured to import files from HDFS correctly?
System information
Describe the feature and the current behavior/state.
When each step of an algorithm component transmits information to the other parties, add log output recording what was sent together with a description of the transfer, written to a separate file in the corresponding job's log directory. Also provide an API so users can inspect, for a given job and algorithm component, what was transmitted over multi-party communication and whether it was encrypted. Whether these logs are printed will be controlled by a switch: only when the switch is on will the audit logs for multi-party communication during algorithm execution be emitted, so as to reduce the performance cost. In my view, compared with the RPC cost of the multi-party communication itself, the extra logging adds little overhead, but for productization I will still provide a switch to enable the multi-party communication logging during algorithm execution.
Will this change the current api? How?
No changes to existing APIs; new APIs will be added.
Who will benefit with this feature?
Every user building a federated learning product around FATE will benefit. Without audit logs of the multi-party communication during algorithm execution, users will question the security of the federation platform. Adding these logs greatly improves the product's security story, makes it easier to understand, and strengthens users' trust in the federated learning product.
Any Other info.
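To make the proposal concrete, here is a minimal, hypothetical sketch of a switch-gated audit wrapper around a multi-party send. This is illustrative code, not FATE's actual federation API; all function and field names below are assumptions:

```python
import json

def audited_send(payload, send_fn, audit_sink, enabled, component, operation):
    """Wrap a multi-party transfer: when the audit switch is on, record what
    was transmitted and why, then forward it. audit_sink stands in for a
    per-job audit log file; here it is any object with append()."""
    if enabled:
        audit_sink.append(json.dumps({
            "component": component,
            "operation": operation,
            # truncate the payload summary so large tensors do not bloat the log
            "payload_summary": repr(payload)[:200],
        }))
    return send_fn(payload)

sink = []
# switch on: the transfer is recorded, then delivered as usual
result = audited_send({"gradient": [0.1, 0.2]}, lambda p: "delivered", sink,
                      enabled=True, component="hetero_lr_0",
                      operation="send encrypted gradient to arbiter")
# switch off: only one extra branch runs, so the overhead is negligible
audited_send({"loss": 0.03}, lambda p: "delivered", sink,
             enabled=False, component="hetero_lr_0", operation="send loss")
```

The list-based sink keeps the sketch self-contained; a real implementation would append to a dedicated file under the job's log directory and expose it through the proposed query API.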
When a job reaches the evaluation component, canceling the job times out.
Fix the parameter name: hava_head => have_head
A FATE cluster deployed on k8s with KubeFATE
FATE version: 1.7.2
KubeFATE version: 1.4.3
Error log of the hetero LR algorithm component on the corresponding arbiter node:
[ERROR] [2022-03-29 12:24:59,438] [202203291210246798510] [30:140092026906368] - [job_tracker.clean_task] [line:491]: cleanup error
Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/operation/job_tracker.py", line 455, in clean_task
sess.init_computing(computing_session_id=f"{computing_temp_namespace}_clean", options=session_options)
File "/data/projects/fate/fate/python/fate_arch/session/_session.py", line 116, in init_computing
session_id=computing_session_id, options=options
File "/data/projects/fate/fate/python/fate_arch/computing/eggroll/_csession.py", line 36, in __init__
self._rp_session = session_init(session_id=session_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 42, in session_init
er_session = ErSession(session_id=session_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 199, in __init__
self.__session_meta = self._cluster_manager_client.get_or_create_session(session_meta)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 189, in get_or_create_session
serdes_type=self.__serdes_type))
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 243, in __check_processors
raise ValueError(f"processor in session meta is not valid: {session_meta}")
ValueError: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002_clean, name=, status=NEW, tag=, processors=[, len=1], options=[{'eggroll.session.processors.per.node': '1', 'eggroll.session.deploy.mode': 'cluster'}]) at 0x7f6ab8e38f28>
[ERROR] [2022-03-29 12:25:42,204] [202203291210246798510] [283:139704828536640] - [_session.get_session_from_record] [line:394]: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002, name=, status=NEW, tag=, processors=[, len=1], options=[{'python.venv': '/opt/app-root', 'eggroll.session.deploy.mode': 'cluster', 'python.path': '/data/projects/fate/fate/python/federatedml:/opt/rh/rh-nodejs10/root/usr/lib/python2.7/site-packages:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.session.processors.per.node': '1'}]) at 0x7f0f6a7a5908>
[ERROR] [2022-03-29 12:26:14,137] [202203291210246798510] [304:139838163154752] - [_session.get_session_from_record] [line:394]: processor in session meta is not valid: <ErSessionMeta(id=202203291210246798510_hetero_lr_0_0_arbiter_10002, name=, status=NEW, tag=, processors=[, len=1], options=[{'eggroll.session.deploy.mode': 'cluster', 'python.venv': '/opt/app-root', 'python.path': '/data/projects/fate/fate/python/federatedml:/opt/rh/rh-nodejs10/root/usr/lib/python2.7/site-packages:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.session.processors.per.node': '1'}]) at 0x7f2e75d72908>
Added a job inheritance function to reuse tasks that have already succeeded in a previous job
In the model deploy/load/bind process: from the source code, deploy and load are initiated from the guest, and the request is then dispatched via rollsite to the other participants of the job, so after deploy/load on the guest side, the other participants also execute the corresponding deploy/load operations.
However, from the source code, the bind interface does not dispatch the bind request to the participants.
So if every participant of the job should complete the bind operation, does each party's bind interface have to be called manually?
System information
Describe the feature and the current behavior/state.
Do not print an error log if the process no longer exists when trying to kill a worker process
Will this change the current api? How?
no
Who will benefit with this feature?
Any Other info.
Registering a SQL-related component with
{
"name": "fate_sql",
"version": "0.1",
"path": "python/fate_sql"
}
raises a DB-related exception with a KeyError.
As registering user-provided providers is a planned feature in 1.7.0, we hope this issue can be fixed as soon as possible 🙏
I want to get the party_id by cmd or api. How do I get the party_id of the current node?
This bug affects FATE-Serving versions before 2.1.0 (including 2.0.5, 2.0.4, etc.). FATE-Serving may fail to find the pre-trained models if FATE 1.7.0 is used with FATE-Serving 2.0.5 (or an older version).
We strongly suggest using FATE-Serving 2.1.1 or 2.1.0 with FATE 1.7.0.
If you want to keep using FATE-Serving 2.0.5 (or an older version), here is a temporary workaround: open the FATE-Serving config file conf/serving-server.properties and change model.transfer.url to the correct URL, e.g. http://127.0.0.1:9380/v1/model/transfer.
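Concretely, the workaround is a one-line change in conf/serving-server.properties; 127.0.0.1:9380 is the example value from above, so substitute the host and HTTP port of your own FATE Flow server:

```properties
model.transfer.url=http://127.0.0.1:9380/v1/model/transfer
```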
A question: does FATE 1.7.x currently have any automatic cleanup of the models under the model_local_cache directory and of the historical job config files under the jobs directory (the items marked with red lines in the screenshot below)? I looked through the 1.7.2 fateflow source code and could not seem to find it, but I am not sure. Also, will automatic cleanup of these two directories be added in the upcoming 1.8 release? Looking forward to your reply.
Originally posted by @MiKKiYang in #204 (comment)
The writer component supports outputting fate storage data to designated external storage
Describe the current behavior
The data doc in the current version is missing the description of upload-history,
which is a command in the current implementation.
Contributing
When rebinding a table with the flow table bind
command:
$ flow table bind -c bind_local_path.json
{
"retcode": 100,
"retmsg": "The data table already exists.If you still want to continue uploading, please add the parameter -drop.1 means to add again after deleting the table"
}
Yet we find the --drop
parameter is not supported in this version 😢
Describe the feature and the current behavior/state.
It would be more convenient for developers if the REST documentation were updated.
Thanks!
Who will benefit from this feature?
Developers
The table parameter of the flow client cli in the current version (1.7) has changed to table-name; please refer to https://github.com/gxcuit/FATE/blob/6a9c2dd6ff95fcab6f336a2e188a0c58f3777d39/python/fate_client/flow_client/flow_cli/utils/cli_args.py#L78
However, the doc still uses name.
Contributing
I deployed FATE v1.9.0; calling flow model export to export a model to the storage engine fails.
The error is: TypeError: _run() got an unexpected keyword argument 'cpn_input'
A first look shows that the _run() method at fateflow/component/_base.py line 132 names its parameter "cpn_input",
but ModelStore and ModelRestore override this method with the parameter named "input_cpn". After I renamed the parameter to "cpn_input", the task runs normally.
Also, the log of this task prints the entire model_store_address section of service_conf.yaml (including the password), which is probably inappropriate.
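The failure mode can be reproduced in isolation with a minimal sketch; the class and method names below mirror the report but are illustrative code, not the actual fateflow source:

```python
class ComponentBase:
    def run(self, cpn_input):
        # the base class invokes the subclass hook by keyword
        return self._run(cpn_input=cpn_input)

    def _run(self, cpn_input):
        raise NotImplementedError

class ModelStoreBuggy(ComponentBase):
    # the override renames the parameter to 'input_cpn',
    # so the keyword call from the base class fails
    def _run(self, input_cpn):
        return "stored"

class ModelStoreFixed(ComponentBase):
    # matching the base class's parameter name fixes the TypeError
    def _run(self, cpn_input):
        return "stored"

try:
    ModelStoreBuggy().run({"model": "demo"})
    error = None
except TypeError as e:
    error = str(e)  # mentions the unexpected keyword argument 'cpn_input'

ok = ModelStoreFixed().run({"model": "demo"})
```

This is why renaming the override's parameter to "cpn_input" (or calling the hook positionally) makes the task run.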
Two-way connectivity detection api
Uploading a 100 MB file to HDFS fails
cat /data/projects/fate/examples/data-1W/upload_lr7w_host.json
{
"file": "/data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv",
"table_name": "lr7w_h",
"namespace": "RATEST",
"head": 1,
"partition": 8,
"work_mode": 1,
"backend": 1,
"use_local_data": 0,
"drop": 1,
"storage_engine": "HDFS"
}
ls -hl "/data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv"
-rw------- 1 root root 109M Feb 23 10:10 /data/projects/fate/examples/data-1W/data_A402_1_party2_7w.csv
flow data upload -c /data/projects/fate/examples/data-1W/upload_lr7w_host.json --drop
[INFO] [2022-02-23 10:12:54,048] [202202231012241074990] [57899:139746043860800] - [_table._put_all] [line:64]: put in hdfs file: hdfs://namenode:9000//fate/input_data/RATEST/lr7w_h
[ERROR] [2022-02-23 10:12:54,181] [202202231012241074990] [57899:139746043860800] - [task_executor._run_] [line:243]: HDFS Flush failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Traceback (most recent call last):
File "./fate/fate/python/fate_arch/storage/hdfs/_table.py", line 77, in _put_all
writer.write(hdfs_utils.serialize(k, v))
File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Support forwarding request headers
Version 1.9.0 adds high availability and load balancing for fateflow. At what data volumes and in what scenarios is an HA deployment needed, and, crucially, how should it be deployed? The deployment docs do not seem to cover this.
A. When using model loader to train a new model, for example using model_loader to load hetero_nn_0's model and then train hetero_nn_1, hetero_nn_0's parameters are not inherited, which is wrong.
B. If a job contains hetero_nn_0 and hetero_nn_1, where hetero_nn_0 is a training task and hetero_nn_1 is a predicting task, and the user deploys hetero_nn_1, then hetero_nn_1 cannot inherit parameters from hetero_nn_0 in a single predicting task.
When downloading a component's output data using "flow data download -c", the csv file contains strings like "<federatedml.feature.instance.Instance object at 0x7fc7514621d0>", but if we use "component_output_data" to download the data directly, the data is normal. There may be a bug in "flow data download -c".
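The symptom is consistent with the download path writing row values into the csv without deserializing them first. A hedged, self-contained illustration of that failure mode (the Instance class below is a stand-in, not the real federatedml one):

```python
import csv
import io

class Instance:
    """Stand-in for federatedml.feature.instance.Instance (illustrative only)."""
    def __init__(self, features, label):
        self.features = features
        self.label = label

rows = [("id1", Instance([1.0, 2.0], 1))]

# Buggy path: the object itself is written, so csv stores its string form,
# e.g. "<__main__.Instance object at 0x...>"
buggy = io.StringIO()
writer = csv.writer(buggy)
for key, inst in rows:
    writer.writerow([key, inst])

# Correct path: serialize the fields explicitly before writing
fixed = io.StringIO()
writer = csv.writer(fixed)
for key, inst in rows:
    writer.writerow([key, inst.label] + inst.features)
```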
System information
CentOS 7, FATE 1.7
Describe the current behavior
Reference doc: https://federatedai.github.io/FATE-Flow/latest/zh/fate_flow_model_migration/#_2
After completing the model migration, running a prediction task on the new cluster reports an error.
Screenshot of the migrate result on the original machine
Screenshot of the import on the target machine
Describe the expected behavior
Other info / logs Include any logs or source code that would be helpful to
Configuration:
migrate json
{
  "job_parameters": {"federated_mode": "MULTIPLE"},
  "role": {"guest": [10005], "host": [10003]},
  "execute_party": {"guest": [10005], "host": [10003]},
  "migrate_initiator": {"role": "guest", "party_id": 10005},
  "migrate_role": {"guest": [10001], "host": [10002]},
  "model_id": "guest-10005#host-10003#model",
  "model_version": "202204161500311591470"
}
import config
{
  "role": "guest",
  "party_id": 10001,
  "model_id": "guest-10001#host-10002#model",
  "model_version": "202204210856597457810",
  "file": "/data/projects/fate/examples/zhanht/migrate/guest#10001#guest-10001#host-10002#model_202204210856597457810.zip"
}
Logs
[ERROR] [2022-04-21 09:22:10,777] [202204210916388364990] [37172:140459868743488] - [model_loader._run] [line:133]: Get 'model_alias' failed. Trying to find a checkpoint...
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 130, in _run
self.get_model_alias()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 56, in get_model_alias
raise ValueError('The job was not found.')
ValueError: The job was not found.
[ERROR] [2022-04-21 09:22:10,778] [202204210916388364990] [37172:140459868743488] - [model_loader._run] [line:144]: Read checkpoint error.
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 142, in _run
return self.read_checkpoint()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 101, in read_checkpoint
raise ValueError('The checkpoint was not found.')
ValueError: The checkpoint was not found.
[ERROR] [2022-04-21 09:22:10,779] [202204210916388364990] [37172:140459868743488] - [task_executor.run] [line:243]: No checkpoint was found.
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 142, in _run
return self.read_checkpoint()
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 101, in read_checkpoint
raise ValueError('The checkpoint was not found.')
ValueError: The checkpoint was not found.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./fate/fateflow/python/fate_flow/worker/task_executor.py", line 195, in run
cpn_output = run_object.run(cpn_input)
File "./fate/fateflow/python/fate_flow/components/_base.py", line 149, in run
method(cpn_input)
File "./fate/fateflow/python/fate_flow/components/model_loader.py", line 147, in _run
else 'No checkpoint was found.')
OSError: No checkpoint was found.
file: model_app.py
line 493: tag is not defined (the uses of tag marked with comments below are the ones the report highlighted)
else:
    if not (tag_operation is TagOperation.RETRIEVE and not request_data.get('with_model')):
        try:
            tag = Tag.get(Tag.f_name == tag_name)
        except peewee.DoesNotExist:
            raise Exception("Can not found '{}' tag.".format(tag_name))
    if tag_operation is TagOperation.RETRIEVE:
        if request_data.get('with_model', False):
            res = {'models': []}
            models = (MLModel.select().join(ModelTag, on=ModelTag.f_m_id == MLModel.f_model_version).where(ModelTag.f_t_id == tag.f_id))  # highlighted use of 'tag'
            for model in models:
                res["models"].append({
                    "model_id": model.f_model_id,
                    "model_version": model.f_model_version,
                    "model_size": model.f_size,
                    "role": model.f_role,
                    "party_id": model.f_party_id
                })
            res["count"] = models.count()
            return get_json_result(data=res)
        else:
            tags = Tag.filter(Tag.f_name.contains(tag_name))
            if not tags:
                return get_json_result(retcode=100, retmsg="No tags found.")
            res = {'tags': []}
            for tag in tags:
                res['tags'].append({'name': tag.f_name, 'description': tag.f_desc})
            return get_json_result(data=res)
    elif tag_operation is TagOperation.UPDATE:
        new_tag_name = request_data.get('new_tag_name', None)
        new_tag_desc = request_data.get('new_tag_desc', None)
        if (tag.f_desc == new_tag_name) and (tag.f_desc == new_tag_desc):  # highlighted use of 'tag'
            return get_json_result(100, "Nothing to be updated.")
        else:
            if request_data.get('new_tag_name'):
                if not Tag.get_or_none(Tag.f_name == new_tag_name):
                    tag.f_name = new_tag_name
                else:
                    return get_json_result(100, retmsg="'{}' tag already exists.".format(new_tag_name))
            tag.f_desc = new_tag_desc
            tag.save()
            return get_json_result(retmsg="Infomation of '{}' tag has been updated successfully.".format(tag_name))
    else:
        delete_query = ModelTag.delete().where(ModelTag.f_t_id == tag.f_id)
        delete_query.execute()
        Tag.delete_instance(tag)
        return get_json_result(retmsg="'{}' tag has been deleted successfully.".format(tag_name))
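Independent of the exact fateflow code path, the class of bug reported here is a name bound only inside a conditional and read afterwards. A minimal, generic Python illustration (not the fateflow source):

```python
def retrieve(skip_lookup: bool):
    # 'tag' is bound only when the guard allows the lookup
    if not skip_lookup:
        tag = {"f_id": 42, "f_name": "demo"}
    # if the guard skipped the lookup, this read raises
    # UnboundLocalError (a subclass of NameError)
    return tag["f_id"]

assert retrieve(False) == 42

try:
    retrieve(True)
    failed = False
except NameError:
    failed = True
```

The usual fixes are binding a default (tag = None) before the conditional, or restructuring so every path that reads the name also binds it.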
System information
FATE Flow version (use command: python fate_flow_server.py --version):
1.7.0
Python version (use command: python --version):
3.6.5
Are you willing to contribute it (yes/no):
yes
Describe the feature and the current behavior/state.
Will this change the current api? How?
no
Who will benefit with this feature?
Any Other info.
System information
For the parameter update API /parameter/update: could it determine which parameters were actually updated and which components they affect, generate new versioned tasks for those components, find the most upstream modified component, and rerun the job from that component? That would seem more reasonable. Looking at the 1.7 code, updating parameters requires passing in the complete modified configuration even if only one component changed. The code also has a variable for the modified components, but the related logic is not implemented. I would like to ask what the original intent behind this parameter-update design was, and whether the behavior I described will be implemented later.
System information
Describe the feature and the current behavior/state.
Clean up task temporary tables when cleaning a job
Will this change the current api? How?
No.
Who will benefit with this feature?
Any Other info.
With the docker-deploy method of KubeFATE v1.8.0, using two virtual machines, the example pipeline runs fine with eggroll, but the example data cannot be run with spark and rabbitmq.
To use spark and rabbitmq, based on the information I collected,
I modified the default_engines section of training_template/public/fate_flow/conf/service_conf.yaml,
setting computing: spark, federation: rabbitmq, storage: hdfs.
After the changes, I ran generate_config.sh and docker_deploy.sh all, which started all the docker containers on both virtual machines. On the host side, I entered the client_1 container, modified fateflow/examples/upload/upload_host.json by appending "storage_engine": "HDFS", and submitted the data with flow data upload.
On the guest side I modified upload_guest.json the same way, appended "storage_engine": "HDFS", and submitted the data with flow data upload. Then I modified fateflow/examples/lr/test_hetero_lr_job_conf.json, adding the "spark_run" and "rabbitmq_run" configuration to job_parameters.
The whole process reports no errors, but after the job is submitted, every f_status stays in the waiting state and training never runs. What could cause these jobs to hang like this?