
ansiblefate's People

Contributors

dylan-fan, hainingzhang, jat001, or0or1, zhihuiwan


ansiblefate's Issues

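Does ansibleFATE support multi-host deployment?

For example, when running the command:
sh deploy/deploy.sh init -h="10000:192.168.0.1" -g="9999:192.168.1.1" -n=spark
is a multi-host form such as -h="10000:192.168.0.1|10001:192.168.0.2" supported?
If not, is there another way to deploy multiple hosts, or is multi-host deployment not supported at all?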

Deployment by Ansible

fatal: [ip]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ip' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic,password).", "unreachable": true}

All prerequisite steps are complete, but deployment fails with 'ansible_ssh_host' is undefined

The log is as follows:


PLAY [fate] ********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [192.168.1.182]
ok: [192.168.1.177]
ok: [192.168.1.183]

.....

TASK [check : update(deploy): check.sh] ****************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.182]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.183]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.177]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
192.168.1.177              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
192.168.1.182              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
192.168.1.183              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0
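
When a log like the one above is long, it can help to pull out which variable each host reported as undefined. The following is a minimal Python sketch, not part of AnsibleFATE; the helper name undefined_variables is hypothetical. As background, Ansible 2.0 renamed the inventory variable ansible_ssh_host to ansible_host, so one possible cause (an assumption that the log does not confirm) is an inventory that only defines the newer name.

```python
import re

# A "fatal" line as printed by ansible-playbook for a failed host.
FATAL_LINE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})\s*$")
# The undefined-variable message embedded in the payload.
UNDEFINED = re.compile(r"AnsibleUndefinedVariable: '(?P<name>[^']+)' is undefined")


def undefined_variables(log_text):
    """Map each failed host to the sorted list of variables it reported as undefined."""
    found = {}
    for line in log_text.splitlines():
        fatal = FATAL_LINE.match(line)
        if not fatal:
            continue
        names = {m.group("name") for m in UNDEFINED.finditer(fatal.group("payload"))}
        if names:
            found[fatal.group("host")] = sorted(names)
    return found
```

Passing the contents of the deploy log to undefined_variables returns one entry per failed host, which makes it easy to confirm that every host fails on the same variable.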

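Installation on CCLinux fails with fatal: [10.32.123.27]: FAILED! => {"changed": false, "msg": "please check json"}

I am installing FATE 1.8 with Ansible on a CCLinux system.
While eggroll is being installed, the following error occurs:
[image]
I traced it to json-replace.sh, which calls jcheck.py.
That script checks the JSON format, so I changed it as follows:
import sys
sys.exit(0)
However, the same error still occurs when I redeploy.
I also found that, in json-replace.sh, the statement involving
/data/projects/backups/fate/eggroll/conf is never executed,
yet in the code before that statement I still cannot find a second place that checks the JSON format.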
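
Rather than bypassing the check, it may be quicker to find out which file is actually malformed. The sketch below is a stand-alone Python example and makes no claim about what jcheck.py does internally; the helper names check_json_text and check_json_file are hypothetical.

```python
import json


def check_json_text(text):
    """Return None if text is valid JSON, otherwise a 'line N, column M: reason' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_json_file(path):
    """Validate one file on disk and report the first syntax error, if any."""
    with open(path, encoding="utf-8") as handle:
        return check_json_text(handle.read())
```

Running check_json_file on each JSON file under the eggroll conf directory would show the position of the first syntax error, if there is one.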

After deploying FATE 2.0.0 with ansiblefate, the nodes cannot communicate

Each node passes the single-party test, but the two-party test fails:
(venv) app@VM_0_1_centos:/home/guo$ flow test toy -gid 9999 -hid 10000
{
"code": 1002,
"data": {
"model_id": "202402280731033092380",
"model_version": "0"
},
"job_id": "202402280731033092380",
"message": "Traceback (most recent call last):\n File "/data/projects/fate/fate_flow/python/fate_flow/scheduler/scheduler.py", line 376, in create_all_job\n raise Exception("create job failed", response)\nException: ('create job failed', {'guest': {'9999': {'code': 0, 'message': 'success'}}, 'host': {'10000': {'code': 104, 'message': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = ""\n\tdebug_error_string = "UNKNOWN:Error received from peer {grpc_message:"", grpc_status:2, created_time:"2024-02-28T07:31:03.581646839+00:00"}"\n>'}}})\n"
}

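Deploying FATE 1.7.0 with Ansible fails with an access-denied error while installing MySQL

Following the deployment guide, I am deploying a single-party guest on the local machine, and it always gets stuck at the MySQL installation. The deployment script is run as the app user, but the log shows that MySQL is still accessed as root: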

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_status.get('stderr_lines')": []
}

TASK [mysql : check(deploy): check if change admin password or not] ************
ok: [My IP]

TASK [mysql : chpasswd(deploy): admin password] ********************************
changed: [My IP]

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_chpasswd_status.get('stderr_lines')": [
        "\u0007mysqladmin: connect to server at '127.0.0.1' failed",
        "error: 'Access denied for user 'root'@'localhost' (using password: NO)'"
    ]
}

TASK [mysql : check(deploy): check if load data or not] ************************
ok: [My IP]

TASK [mysql : commit(deploy): load.sh] *****************************************
changed: [My IP]

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_load.get('stderr_lines')": [
        "mysql: [Warning] Using a password on the command line interface can be insecure.",
        "ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)"
    ]
}

Two-party Ansible deployment: yum cannot install the snappy-devel dependency

TASK [base : yum(deploy): install dependency packages] *************************
fatal: [11.50.192.7]: FAILED! => {"changed": false, "failures": ["No package snappy-devel available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}
fatal: [11.50.192.8]: FAILED! => {"changed": false, "failures": ["No package snappy-devel available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}

In xxx/tools/install_base.sh I found the command yum -y install gcc gcc-c++ make openssl-devel gmp-devel mpfr-devel libmpc-devel libaio numactl autoconf automake libtool libffi-devel snappy snappy-devel zlib zlib-devel bzip2 bzip2-devel lz4-devel libasan lsof sysstat telnet psmisc iperf3 erlang, but even after I moved that file away, the same error is still reported.

Two-party Ansible deployment: build.tar.gz cannot be found

fatal: [ip]: FAILED! => {"changed": false, "msg": "Could not find or access 'build.tar.gz'
Searched in: /data/projects/ansibleFATE-1.7.2-release-online/roles/check/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/base/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/base/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/tasks/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/tasks/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/deploy/../files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/deploy/../build.tar.gz on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option"}

Installation fails: /bin/bash deploy/deploy.sh deploy reports archive. Command \"/usr/bin/gtar\

Error when running:

/bin/bash deploy/deploy.sh deploy

-------------------1 get base data--------------------------------
deploy in progress, please check the log in /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log
or commit "tail -f /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log"
[root@ip-172-31-2-212 AnsibleFATE]# tail -f /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log
TASK [check : untar(deploy): deploy.tar.gz] ************************************
fatal: [172.31.0.87]: FAILED! => {"changed": false, "msg": "Failed to find handler for "/root/.ansible/tmp/ansible-tmp-1645251944.9-18521-127955827148116/source". Make sure the required command to extract the file is installed. Command "/usr/bin/unzip" could not handle archive. Command "/usr/bin/gtar" could not handle archive."}
fatal: [172.31.2.212]: FAILED! => {"changed": false, "msg": "Failed to find handler for "/root/.ansible/tmp/ansible-tmp-1645251944.92-18523-257278277195642/source". Make sure the required command to extract the file is installed. Command "/usr/bin/unzip" could not handle archive. Command "/usr/bin/gtar" could not handle archive."}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
172.31.0.87 : ok=4 changed=0 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
172.31.2.212 : ok=4 changed=0 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0

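I checked, and both /usr/bin/gtar and /usr/bin/unzip exist on the hosts.

Suggestion: could this be implemented with copy plus shell instead?

  - copy:
      src: /source-dir/1.tar.gz
      dest: /dest-dir/1.tar.gz
  - shell: gunzip /dest-dir/tar.gz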
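
Since both extractors are present on the hosts, one thing that may be worth ruling out is that the file being unpacked is not actually a tar archive (for example, a failed download saved under a .tar.gz name). This is only a guess; the log does not confirm it. A minimal Python sketch, using the hypothetical helper name describe_archive, that classifies a file on the control node:

```python
import tarfile

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def describe_archive(path):
    """Classify a file as 'tar', 'gzip-not-tar', or 'not-an-archive'."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    # tarfile.is_tarfile handles plain and compressed tar archives transparently.
    if tarfile.is_tarfile(path):
        return "tar"
    return "gzip-not-tar" if is_gzip else "not-an-archive"
```

A result other than "tar" for deploy.tar.gz would suggest the package itself needs to be fetched again rather than the extraction step being changed.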

FATE 1.7.0 deployment cannot open port 9360

I tested that other services can use port 9360 normally, but when deploying FATE 1.7.0 the open port 9360 step always fails:

TASK [fateflow : update(deploy): /data/projects/fate/conf/service_conf.yaml] ***
ok: [My IP]

TASK [fateflow : update(deploy): /data/projects/common/supervisord/supervisord.d/fate-fateflow.conf] ***
changed: [My IP]

TASK [fateflow : flush_handlers] ***********************************************

RUNNING HANDLER [fateflow : reload fate-fateflow] ******************************
changed: [My IP]

RUNNING HANDLER [fateflow : restart fate-fateflow] *****************************
changed: [My IP]

TASK [fateflow : wait(deploy)): open port 9360( guest )] ***********************
fatal: [My IP]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for My IP:9360"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
My IP : ok=157 changed=52 unreachable=0 failed=1 skipped=57 rescued=0 ignored=0
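
The failing task simply waits for a TCP listener on port 9360, so the same check can be repeated by hand while reading the fateflow logs. A minimal Python sketch; port_is_open is a hypothetical helper, not part of FATE:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If port_is_open returns False for the host address but True for 127.0.0.1, the service may be bound to a different interface than the one the playbook waits on; that is an assumption to check against service_conf.yaml.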

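Deployment problem

I deployed by following the documentation. With the built-in MySQL, the connection times out; with an external MySQL, it reports that the mysql command needed to connect cannot be found.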

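Deploying fateflow in high-availability mode: eggroll fails to start on port 9370, and route_table.json has duplicate keys

Version:
fate-1.10.0

Environment:
centos7.9

Configuration:
Three virtual machines with IPs 172.16.4.34, 172.16.4.37, and 172.16.4.38.
34 is one party, with non-HA FATE already installed.
37 and 38 together form one party, on which I want to install HA FATE; eggroll is installed on 37.

Steps:
Following the HA deployment procedure, I modified the conf files after the ansible init step:
https://github.com/FederatedAI/AnsibleFATE/blob/c7d450945ce5338be606bd90642d1edd966c99cd/docs/ansible_deploy_HA.md

[image]

[image]

I then ran the installation, which failed because port 9370 could not be started. According to the log, route_table.json contains duplicate keys.
[image]

eggroll log:
[image]

route_table.json does indeed contain duplicate keys:
[image]

How can this be resolved? Thanks.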
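
To see exactly which keys are duplicated, the file can be parsed with a hook that inspects every object before the duplicates are silently merged. This is a minimal Python sketch using the standard json module; find_duplicate_keys is a hypothetical helper name.

```python
import json


def find_duplicate_keys(text):
    """Return the sorted keys that occur more than once inside any single JSON object."""
    duplicates = set()

    def record_pairs(pairs):
        # Called once per JSON object with its (key, value) pairs in file order.
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.add(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return sorted(duplicates)
```

Reading route_table.json and passing its contents to find_duplicate_keys lists the repeated party IDs or service names, which can then be compared with the values entered in the conf files after init.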

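The test documentation for version 2.0 needs updating

I installed AnsibleFATE_2.1.0_release_offline.tar.gz and set up a three-node environment.

  • Documentation on the 2.0 branch:
    Two-party test
    flow test min -gid 9999 -hid 10000 -aid 10000
    Running it shows that the test command has no min argument:
    [image]

  • Documentation on the main branch:
    Two-party test
    python run_task.py -gid 9999 -hid 10000 -aid 10000 -f fast
    run_task.py has been removed, and copying it over does not work either:
    [image]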

Two-party Ansible deployment: a Python KeyError (missing key) occurs when fate_flow starts

Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/fate_flow_server.py", line 88, in <module>
ComponentRegistry.load()
File "/data/projects/fate/fateflow/python/fate_flow/db/component_registry.py", line 35, in load
component_registry = cls.get_from_db(file_utils.load_json_conf_real_time(FATE_FLOW_DEFAULT_COMPONENT_REGISTRY_PATH))
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/peewee.py", line 394, in inner
return fn(*args, **kwargs)
File "/data/projects/fate/fateflow/python/fate_flow/db/component_registry.py", line 179, in get_from_db
for component_alias in component_registry["components"][module.f_component_name]["alias"]:
KeyError: 'custnn'

The t_component_registry table contains 51 rows, but t_component_info contains only 16:
mysql> select f_provider_name,f_version,f_component_name from t_component_registry;
+-----------------+-----------+----------------------------+
| f_provider_name | f_version | f_component_name |
+-----------------+-----------+----------------------------+
| fate_flow | 1.11.2 | apireader |
| fate_flow | 1.11.2 | cacheloader |
| fate_flow | 1.11.2 | download |
| fate_flow | 1.11.2 | modelloader |
| fate_flow | 1.11.2 | modelrestore |
| fate_flow | 1.11.2 | modelstore |
| fate_flow | 1.11.2 | reader |
| fate_flow | 1.11.2 | upload |
| fate_flow | 1.11.2 | writer |
| fate | 1.11.3 | columnexpand |
| fate | 1.11.3 | custnn |
| fate | 1.11.3 | dataio |
| fate | 1.11.3 | datastatistics |
| fate | 1.11.3 | datatransform |
| fate | 1.11.3 | evaluation |
| fate | 1.11.3 | featureimputation |
| fate | 1.11.3 | featurescale |
| fate | 1.11.3 | federatedsample |
| fate | 1.11.3 | feldmanverifiablesum |
| fate | 1.11.3 | ftl |
| fate | 1.11.3 | heterodatasplit |
| fate | 1.11.3 | heterofastsecureboost |
| fate | 1.11.3 | heterofeaturebinning |
| fate | 1.11.3 | heterofeatureselection |
| fate | 1.11.3 | heterokmeans |
| fate | 1.11.3 | heterolinr |
| fate | 1.11.3 | heterolr |
| fate | 1.11.3 | heteronn |
| fate | 1.11.3 | heteropearson |
| fate | 1.11.3 | heteropoisson |
| fate | 1.11.3 | heterosecureboost |
| fate | 1.11.3 | heterosshelinr |
| fate | 1.11.3 | heterosshelr |
| fate | 1.11.3 | homodatasplit |
| fate | 1.11.3 | homofeaturebinning |
| fate | 1.11.3 | homolr |
| fate | 1.11.3 | homonn |
| fate | 1.11.3 | homoonehotencoder |
| fate | 1.11.3 | homosecureboost |
| fate | 1.11.3 | intersection |
| fate | 1.11.3 | labeltransform |
| fate | 1.11.3 | localbaseline |
| fate | 1.11.3 | onehotencoder |
| fate | 1.11.3 | positiveunlabeled |
| fate | 1.11.3 | psi |
| fate | 1.11.3 | sampleweight |
| fate | 1.11.3 | scorecard |
| fate | 1.11.3 | secureaddexample |
| fate | 1.11.3 | secureinformationretrieval |
| fate | 1.11.3 | spdztest |
| fate | 1.11.3 | union |
+-----------------+-----------+----------------------------+
51 rows in set (0.02 sec)

mysql> select * from t_component_info;
mysql> select f_component_name,f_component_alias from t_component_info;
+----------------------+--------------------------+
| f_component_name | f_component_alias |
+----------------------+--------------------------+
| apireader | ["ApiReader"] |
| cacheloader | ["CacheLoader"] |
| columnexpand | ["ColumnExpand"] |
| download | ["Download"] |
| featurescale | ["FeatureScale"] |
| heterofeaturebinning | ["HeteroFeatureBinning"] |
| heterolr | ["HeteroLR"] |
| homolr | ["HomoLR"] |
| homonn | ["HomoNN"] |
| modelloader | ["ModelLoader"] |
| modelrestore | ["ModelRestore"] |
| modelstore | ["ModelStore"] |
| positiveunlabeled | ["PositiveUnlabeled"] |
| reader | ["Reader"] |
| upload | ["Upload"] |
| writer | ["Writer"] |
+----------------------+--------------------------+
16 rows in set (0.01 sec)

Why do the component names in the two tables not match?
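
To list precisely which names appear in one query result but not the other, a set difference is enough. A minimal Python sketch with a hypothetical helper, missing_components; the names would be copied from the two query outputs above.

```python
def missing_components(registry_names, info_names):
    """Return the names present in the first listing but absent from the second, sorted."""
    return sorted(set(registry_names) - set(info_names))


# A few names taken from the two listings above, for illustration only.
registry_sample = ["custnn", "homolr", "reader", "dataio"]
info_sample = ["homolr", "reader"]
print(missing_components(registry_sample, info_sample))
```

In the listings as posted, custnn (the key named in the KeyError) appears in the t_component_registry output but not in the t_component_info output.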

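flow test fails after enabling certificate authentication

Preconditions

I deployed three nodes with Ansible FATE-1.7.0, each as a single-party deployment. The detailed configuration is as follows:
Node 210, Exchange role: certificates were generated with /bin/bash deploy/deploy.sh keys, and both server-side and client-side authentication are enabled.
Its route_table.json is configured as follows: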
{
"route_table":
{
"211":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.211",
"port": 9371
}
]
},
"213":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.213",
"port": 9371
}
]
},
.....
The relevant settings in eggroll.properties are:
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/exchange-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/exchange-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/exchange-client-client.key

eggroll.core.security.ca.crt.path=/data/projects/data/fate/keys/exchange-ca.pem
eggroll.core.security.crt.path=/data/projects/data/fate/keys/exchange-server.pem
eggroll.core.security.key.path=/data/projects/data/fate/keys/exchange-server.key

Node 213, Host role: the certificates from node 210 were copied into the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/host-client-ca.pem
.....
Its route_table.json is configured as follows:
{
"route_table":
{
"default":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.210",
"port": 9370
}
]
},
"213":
{
"default":[
{
"ip": "10.32.122.213",
"port": 9370
}
],
"fateflow":[
{
"ip": "10.32.122.213",
"port": 9360
}
]
}
},
"permission":
{
"default_allow": true
}
}

The relevant settings in eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371

eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/host-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/host-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/host-client-client.key

Node 211, Guest role: the certificates from node 210 were copied into the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/guest-client-ca.pem
.....
Its route_table.json is configured as follows:
{
"route_table":
{
"default":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.210",
"port": 9371
}
]
},
"211":
{
"default":[
{
"ip": "10.32.122.211",
"port": 9370
}
],
"fateflow":[
{
"ip": "10.32.122.211",
"port": 9360
}
]
}
},
"permission":
{
"default_allow": true
}
}

The relevant settings in eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371

eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/guest-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/guest-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/guest-client-client.key

Test

The fate-rollsite service was restarted on all nodes.
On node 211, run:
source /data/projects/fate/bin/init.sh
flow test toy -gid 211 -hid 213

The command fails with the following error:
(venv) app@cestc211:/data/projects/fate/eggroll/conf$ flow test toy -gid 211 -hid 213
{
"jobId": "202204211045098230110",
"retcode": 103,
"retmsg": "Traceback (most recent call last):\n File "/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py", line 124, in submit\n raise Exception("create job failed", response)\nException: ('create job failed', {'guest': {211: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {213: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, Please check rollSite and fateflow network connectivityrpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat 
io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.32.122.213:9371\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:748)\n \n\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)"\n\tdebug_error_string = 
"{"created":"@1650509120.047216058","description":"Error received from peer ipv4:10.32.122.211:9370","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\\n\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:748)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.32.122.213:9371\\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:748)\\n \\n\\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)","grpc_status":14}"\n>'}}})\n"
}

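Additional information

Without certificates (is_secure: false, port: 9370), everything runs normally.

Expected

The correct configuration when certificates are used.

Thanks!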
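
The stack trace ends in "Connection refused: /10.32.122.213:9371", which means nothing accepted the connection on that port at the Host. One way to review the setup is to list every endpoint each route table marks as secure and then confirm that a listener exists there. The Python sketch below assumes only the route_table.json layout shown above; secure_endpoints is a hypothetical helper name.

```python
def secure_endpoints(route_table_doc):
    """Return (party_id, service, ip, port) for every endpoint whose is_secure is true."""
    found = []
    for party_id, services in route_table_doc.get("route_table", {}).items():
        for service, endpoints in services.items():
            for endpoint in endpoints:
                if endpoint.get("is_secure"):
                    found.append((party_id, service, endpoint["ip"], endpoint["port"]))
    return found


# Excerpt of the Exchange (node 210) table as posted above.
exchange_table = {
    "route_table": {
        "211": {"default": [{"is_secure": True, "ip": "10.32.122.211", "port": 9371}]},
        "213": {"default": [{"is_secure": True, "ip": "10.32.122.213", "port": 9371}]},
    }
}
for entry in secure_endpoints(exchange_table):
    print(entry)
```

In the tables as posted, the Host's default route points at 10.32.122.210 port 9370 with is_secure true, while the Guest's points at port 9371. The post does not say whether that difference is intended.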

ValueError: source code string cannot contain null bytes

The Ansible installation fails with: "ValueError: source code string cannot contain null bytes"

System: CentOS 7
Ansible: AnsibleFATE_1.9.2_release_offline.tar.gz

Failing step:
TASK [build(deploy): python virtual env] ***************************************
changed: [192.168.13.163]
changed: [192.168.13.164]

TASK [python : check(deploy): venv again] **************************************
ok: [192.168.13.164]
ok: [192.168.13.163]

TASK [python : pip(deploy): venv install must packages] ************************
fatal: [192.168.13.163]: FAILED!

Detailed error log:
fatal: [192.168.13.163]: FAILED! => {"changed": false, "cmd": ["/data/projects/fate/common/python/venv/bin/pip", "install", "-U", "-f", "/data/temp/fate/pypi", "--no-index", "pip", "setuptools", "wheel"], "msg": "
:stderr: Traceback (most recent call last):
File "/data/projects/fate/common/python/venv/bin/pip", line 5, in <module>
from pip._internal.cli.main import main
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/main.py", line 9, in <module>
from pip._internal.cli.autocompletion import autocomplete
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/autocompletion.py", line 10, in <module>
from pip._internal.cli.main_parser import create_main_parser
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/main_parser.py", line 8, in <module>
from pip._internal.cli import cmdoptions
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/cmdoptions.py", line 24, in <module>
from pip._internal.cli.parser import ConfigOptionParser
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/parser.py", line 12, in <module>
from pip._internal.configuration import Configuration, ConfigurationError
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/configuration.py", line 20, in <module>
from pip._internal.exceptions import (
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/exceptions.py", line 13, in <module>
from pip._vendor.requests.models import Request, Response
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/requests/__init__.py", line 43, in <module>
from pip._vendor import urllib3
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/__init__.py", line 13, in <module>
from .connectionpool import HTTPConnectionPool, HTTPSConnectionPool, connection_from_url
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/connectionpool.py", line 12, in <module>
from .connection import (
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/connection.py", line 15, in <module>
from .util.proxy import create_proxy_ssl_context
ValueError: source code string cannot contain null bytes
"}

Command to reproduce:
sudo /data/projects/fate/common/python/venv/bin/pip install -U --no-index setuptools

Thanks in advance to anyone who can take a look.
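
Python raises this error when a source file it tries to import contains NUL bytes, which can happen if a file was damaged during unpacking or copying. That cause is an assumption here; the traceback only shows that the import of .util.proxy failed. A minimal Python sketch that scans a directory tree for such files, using the hypothetical helper name files_with_null_bytes:

```python
import os


def files_with_null_bytes(root, suffix=".py"):
    """Return the sorted paths under root that end with suffix and contain a NUL byte."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                if b"\x00" in handle.read():
                    hits.append(path)
    return sorted(hits)
```

Pointing it at the venv's site-packages/pip directory would show whether any of pip's own files are affected. It should be run with a Python interpreter outside the broken venv.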

Deployment problem: fate-flow Timeout when waiting for x.x.xx.x:9360

Version 1.11.4. After running bash deploy/deploy.sh deploy, the following error is reported:

Error in deploy-xxxx.log:

fatal: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for x.x.xx.x:9360"}

Error in /data/logs/fate/fateflow/error.log:

Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/fate_flow_server.py", line 110, in <module>
server.add_insecure_port(f"{HOST}:{GRPC_PORT}")
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/grpc/_server.py", line 969, in add_insecure_port
return _common.validate_port_binding_result(
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/grpc/_common.py", line 166, in validate_port_binding_result
raise RuntimeError(_ERROR_MESSAGE_PORT_BINDING_FAILED % address)
RuntimeError: Failed to bind to address x.x.x.x:9360; set GRPC_VERBOSITY=debug environment variable to see detailed error message.

Output in /data/logs/fate/fateflow/info.log:

PROJECT_BASE: /data/projects/fate
PYTHONPATH: /data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python
found service conf: /data/projects/fate/conf/service_conf.yaml
fate flow http port: 9380, grpc port: 9360

check process by http port and grpc port
/data/projects/fate/fateflow/bin/service.sh: line 74: lsof: command not found
/data/projects/fate/fateflow/bin/service.sh: line 75: lsof: command not found
PROJECT_BASE: /data/projects/fate
PYTHONPATH: /data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python
found service conf: /data/projects/fate/conf/service_conf.yaml
fate flow http port: 9380, grpc port: 9360

check process by http port and grpc port
/data/projects/fate/fateflow/bin/service.sh: line 74: lsof: command not found
/data/projects/fate/fateflow/bin/service.sh: line 75: lsof: command not found
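
The error.log shows that gRPC could not bind to x.x.x.x:9360, and service.sh cannot check the ports because lsof is not installed. Common reasons for a bind failure are that the port is already in use or that the address is not assigned to a local interface; which one applies here is not known from the logs. A minimal Python sketch to test binding directly, using the hypothetical helper name can_bind:

```python
import socket


def can_bind(host, port):
    """Return True if a TCP socket can be bound to host:port at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Raised when the port is in use or the address is not local.
            return False
    return True
```

Running can_bind with the address from service_conf.yaml and port 9360 on the fateflow machine, while fateflow is stopped, separates these cases from problems inside FATE itself.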
