
ansiblefate's Issues

Ansible two-party deployment: fate_flow startup fails with a Python KeyError

Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/fate_flow_server.py", line 88, in
ComponentRegistry.load()
File "/data/projects/fate/fateflow/python/fate_flow/db/component_registry.py", line 35, in load
component_registry = cls.get_from_db(file_utils.load_json_conf_real_time(FATE_FLOW_DEFAULT_COMPONENT_REGISTRY_PATH))
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/peewee.py", line 394, in inner
return fn(*args, **kwargs)
File "/data/projects/fate/fateflow/python/fate_flow/db/component_registry.py", line 179, in get_from_db
for component_alias in component_registry["components"][module.f_component_name]["alias"]:
KeyError: 'custnn'

The t_component_registry table has 51 rows, but t_component_info has only 16:
mysql> select f_provider_name,f_version,f_component_name from t_component_registry;
+-----------------+-----------+----------------------------+
| f_provider_name | f_version | f_component_name |
+-----------------+-----------+----------------------------+
| fate_flow | 1.11.2 | apireader |
| fate_flow | 1.11.2 | cacheloader |
| fate_flow | 1.11.2 | download |
| fate_flow | 1.11.2 | modelloader |
| fate_flow | 1.11.2 | modelrestore |
| fate_flow | 1.11.2 | modelstore |
| fate_flow | 1.11.2 | reader |
| fate_flow | 1.11.2 | upload |
| fate_flow | 1.11.2 | writer |
| fate | 1.11.3 | columnexpand |
| fate | 1.11.3 | custnn |
| fate | 1.11.3 | dataio |
| fate | 1.11.3 | datastatistics |
| fate | 1.11.3 | datatransform |
| fate | 1.11.3 | evaluation |
| fate | 1.11.3 | featureimputation |
| fate | 1.11.3 | featurescale |
| fate | 1.11.3 | federatedsample |
| fate | 1.11.3 | feldmanverifiablesum |
| fate | 1.11.3 | ftl |
| fate | 1.11.3 | heterodatasplit |
| fate | 1.11.3 | heterofastsecureboost |
| fate | 1.11.3 | heterofeaturebinning |
| fate | 1.11.3 | heterofeatureselection |
| fate | 1.11.3 | heterokmeans |
| fate | 1.11.3 | heterolinr |
| fate | 1.11.3 | heterolr |
| fate | 1.11.3 | heteronn |
| fate | 1.11.3 | heteropearson |
| fate | 1.11.3 | heteropoisson |
| fate | 1.11.3 | heterosecureboost |
| fate | 1.11.3 | heterosshelinr |
| fate | 1.11.3 | heterosshelr |
| fate | 1.11.3 | homodatasplit |
| fate | 1.11.3 | homofeaturebinning |
| fate | 1.11.3 | homolr |
| fate | 1.11.3 | homonn |
| fate | 1.11.3 | homoonehotencoder |
| fate | 1.11.3 | homosecureboost |
| fate | 1.11.3 | intersection |
| fate | 1.11.3 | labeltransform |
| fate | 1.11.3 | localbaseline |
| fate | 1.11.3 | onehotencoder |
| fate | 1.11.3 | positiveunlabeled |
| fate | 1.11.3 | psi |
| fate | 1.11.3 | sampleweight |
| fate | 1.11.3 | scorecard |
| fate | 1.11.3 | secureaddexample |
| fate | 1.11.3 | secureinformationretrieval |
| fate | 1.11.3 | spdztest |
| fate | 1.11.3 | union |
+-----------------+-----------+----------------------------+
51 rows in set (0.02 sec)

mysql> select * from t_component_info;
mysql> select f_component_name,f_component_alias from t_component_info;
+----------------------+--------------------------+
| f_component_name | f_component_alias |
+----------------------+--------------------------+
| apireader | ["ApiReader"] |
| cacheloader | ["CacheLoader"] |
| columnexpand | ["ColumnExpand"] |
| download | ["Download"] |
| featurescale | ["FeatureScale"] |
| heterofeaturebinning | ["HeteroFeatureBinning"] |
| heterolr | ["HeteroLR"] |
| homolr | ["HomoLR"] |
| homonn | ["HomoNN"] |
| modelloader | ["ModelLoader"] |
| modelrestore | ["ModelRestore"] |
| modelstore | ["ModelStore"] |
| positiveunlabeled | ["PositiveUnlabeled"] |
| reader | ["Reader"] |
| upload | ["Upload"] |
| writer | ["Writer"] |
+----------------------+--------------------------+
16 rows in set (0.01 sec)

Why don't the component names in the two tables match?
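The KeyError suggests that every f_component_name recorded in t_component_registry is expected to have a matching row (with its alias list) behind component_registry["components"], which is backed by t_component_info. A minimal diagnostic sketch that lists the mismatch, assuming pymysql is available and using placeholder connection settings (adjust them to the database section of your service_conf.yaml):

# Hypothetical diagnostic script, not part of FATE: list component names that
# appear in t_component_registry but have no row in t_component_info, which is
# exactly the condition that makes ComponentRegistry.load() raise KeyError.
import pymysql  # assumed to be installed in the deployment venv

conn = pymysql.connect(host="127.0.0.1", port=3306, user="fate",
                       password="<mysql password>", database="fate_flow")  # placeholders
with conn.cursor() as cur:
    cur.execute("SELECT DISTINCT f_component_name FROM t_component_registry")
    registered = {row[0] for row in cur.fetchall()}
    cur.execute("SELECT DISTINCT f_component_name FROM t_component_info")
    known = {row[0] for row in cur.fetchall()}
conn.close()

print("registered but missing from t_component_info:")
for name in sorted(registered - known):
    print(" ", name)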

Installation on a CCLinux system fails with fatal: [10.32.123.27]: FAILED! => {"changed": false, "msg": "please check json"}

Installing FATE 1.8 with Ansible on a CCLinux system.
When installing eggroll, the following error appears:
[screenshot]
I traced the problem to json-replace.sh, which calls jcheck.py.
That script performs a JSON format check, so I modified it as follows:
import sys
sys.exit(0)
But after redeploying, the same error still appears.
I also noticed that in json-replace.sh the code before the statement referencing
/data/projects/backups/fate/eggroll/conf is never executed,
and I still cannot find a second place where the JSON format is checked.
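For reference, a JSON check of the kind jcheck.py presumably performs can be reproduced in a few lines of Python (a sketch based on the description above, not the actual script): it exits non-zero when the file is not valid JSON so the calling shell script aborts.

# Hypothetical stand-in for jcheck.py: validate the JSON file given as argv[1].
import json
import sys

try:
    with open(sys.argv[1], "r", encoding="utf-8") as f:
        json.load(f)
except (ValueError, OSError) as e:
    print("invalid json in %s: %s" % (sys.argv[1], e), file=sys.stderr)
    sys.exit(1)
sys.exit(0)

If the "please check json" message persists even after jcheck.py is short-circuited, the rejected configuration file itself is worth validating by hand in the same way.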

Deployment problem: fate-flow Timeout when waiting for x.x.xx.x:9360

Version 1.11.4. After running bash deploy/deploy.sh deploy, the following errors appear:

Error in deploy-xxxx.log:

fatal: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for x.x.xx.x:9360"}

Error in /data/logs/fate/fateflow/error.log:

Traceback (most recent call last):
File "/data/projects/fate/fateflow/python/fate_flow/fate_flow_server.py", line 110, in
server.add_insecure_port(f"{HOST}:{GRPC_PORT}")
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/grpc/_server.py", line 969, in add_insecure_port
return _common.validate_port_binding_result(
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/grpc/_common.py", line 166, in validate_port_binding_result
raise RuntimeError(_ERROR_MESSAGE_PORT_BINDING_FAILED % address)
RuntimeError: Failed to bind to address x.x.x.x:9360; set GRPC_VERBOSITY=debug environment variable to see detailed error message.

Error in /data/logs/fate/fateflow/info.log:

PROJECT_BASE: /data/projects/fate
PYTHONPATH: /data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python
found service conf: /data/projects/fate/conf/service_conf.yaml
fate flow http port: 9380, grpc port: 9360

check process by http port and grpc port
/data/projects/fate/fateflow/bin/service.sh: line 74: lsof: command not found
/data/projects/fate/fateflow/bin/service.sh: line 75: lsof: command not found
PROJECT_BASE: /data/projects/fate
PYTHONPATH: /data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python
found service conf: /data/projects/fate/conf/service_conf.yaml
fate flow http port: 9380, grpc port: 9360

check process by http port and grpc port
/data/projects/fate/fateflow/bin/service.sh: line 74: lsof: command not found
/data/projects/fate/fateflow/bin/service.sh: line 75: lsof: command not found
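Two things stand out in these logs: service.sh cannot run its port check because lsof is missing, and fate_flow then fails to bind 9360, which usually means something (possibly an earlier fate_flow instance that the broken check did not detect) is already listening on that port. A minimal stand-in for the missing lsof check, run on the fateflow host (a sketch, not part of service.sh):

# Check whether the fate_flow gRPC/HTTP ports can still be bound on this host.
import socket

for port in (9360, 9380):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        print("port %d is free" % port)
    except OSError as e:
        print("port %d is already in use: %s" % (port, e))
    finally:
        s.close()

Installing lsof (or pointing service.sh at ss/netstat) lets the bundled check work again.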

FATE 1.7.0 deployment cannot open port 9360

Other services can use port 9360 normally, but when deploying FATE 1.7.0 the "open port 9360" step never succeeds.

TASK [fateflow : update(deploy): /data/projects/fate/conf/service_conf.yaml] ***
ok: [My IP]

TASK [fateflow : update(deploy): /data/projects/common/supervisord/supervisord.d/fate-fateflow.conf] ***
changed: [My IP]

TASK [fateflow : flush_handlers] ***********************************************

RUNNING HANDLER [fateflow : reload fate-fateflow] ******************************
changed: [My IP]

RUNNING HANDLER [fateflow : restart fate-fateflow] *****************************
changed: [My IP]

TASK [fateflow : wait(deploy)): open port 9360( guest )] ***********************
fatal: [My IP]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for My IP:9360"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
My IP : ok=157 changed=52 unreachable=0 failed=1 skipped=57 rescued=0 ignored=0
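The failing wait(deploy) task is essentially polling for a TCP connection to port 9360 on the target host, so the same check can be done by hand while fateflow's own error log is inspected (a sketch; replace "My IP" with the real address, as in the log above):

import socket

try:
    socket.create_connection(("My IP", 9360), timeout=5).close()
    print("port 9360 is reachable")
except OSError as e:
    print("port 9360 is not reachable:", e)

If the connection is refused even on the host itself, fateflow never started listening, and the cause will be in its error log rather than in the Ansible playbook.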

Deploying fateflow in HA mode: eggroll fails to start on port 9370 because route_table.json has duplicate keys

Version:
fate-1.10.0

Environment:
CentOS 7.9

Setup:
Three VMs with IPs 172.16.4.34, 172.16.4.37, and 172.16.4.38.
34 is one party, with non-HA FATE already installed.
37 and 38 form the other party, where I want to install HA FATE, with eggroll on 37.

Steps:
Following the HA deployment guide, after the Ansible init I modified the conf files:
https://github.com/FederatedAI/AnsibleFATE/blob/c7d450945ce5338be606bd90642d1edd966c99cd/docs/ansible_deploy_HA.md

[screenshot]

[screenshot]

Then I ran the installation. It fails saying port 9370 cannot start, and the log says route_table.json contains duplicate keys.
[screenshot]

eggroll log:
[screenshot]

route_table.json does indeed contain duplicate keys:
[screenshot]

How can this be resolved? Thanks.
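Pending a fix in the playbooks, the duplicate can be confirmed and located directly, since Python's json module can be told to reject repeated keys. A small diagnostic sketch, assuming the default conf path:

# Hypothetical check, not part of eggroll: fail loudly on duplicate keys in
# route_table.json instead of silently keeping the last occurrence.
import json

def reject_duplicates(pairs):
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError("duplicate key: %r" % key)
        obj[key] = value
    return obj

with open("/data/projects/fate/eggroll/conf/route_table.json") as f:
    json.load(f, object_pairs_hook=reject_duplicates)
print("no duplicate keys")

Removing the duplicated party entry by hand and restarting fate-rollsite should then let port 9370 come up.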

All prerequisite steps are done, but deployment fails with 'ansible_ssh_host' is undefined

Log:


PLAY [fate] ********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [192.168.1.182]
ok: [192.168.1.177]
ok: [192.168.1.183]

.....

TASK [check : update(deploy): check.sh] ****************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.182]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.183]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined
fatal: [192.168.1.177]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible_ssh_host' is undefined. 'ansible_ssh_host' is undefined"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
192.168.1.177              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
192.168.1.182              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
192.168.1.183              : ok=4    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0

Installation fails: /bin/bash deploy/deploy.sh deploy reports "Command \"/usr/bin/gtar\" could not handle archive"

The command fails:

/bin/bash deploy/deploy.sh deploy

-------------------1 get base data--------------------------------
deploy in progress, please check the log in /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log
or commit "tail -f /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log"
[root@ip-172-31-2-212 AnsibleFATE]# tail -f /root/AnsibleFATE/deploy/../logs/deploy-1645251940.log
TASK [check : untar(deploy): deploy.tar.gz] ************************************
fatal: [172.31.0.87]: FAILED! => {"changed": false, "msg": "Failed to find handler for "/root/.ansible/tmp/ansible-tmp-1645251944.9-18521-127955827148116/source". Make sure the required command to extract the file is installed. Command "/usr/bin/unzip" could not handle archive. Command "/usr/bin/gtar" could not handle archive."}
fatal: [172.31.2.212]: FAILED! => {"changed": false, "msg": "Failed to find handler for "/root/.ansible/tmp/ansible-tmp-1645251944.92-18523-257278277195642/source". Make sure the required command to extract the file is installed. Command "/usr/bin/unzip" could not handle archive. Command "/usr/bin/gtar" could not handle archive."}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
172.31.0.87 : ok=4 changed=0 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
172.31.2.212 : ok=4 changed=0 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0

I checked the hosts: both /usr/bin/gtar and /usr/bin/unzip exist.

Suggestion: could this be implemented with copy plus a shell step instead? For example:

  • copy:
      src: /source_dir/1.tar.gz
      dest: /dest_dir/1.tar.gz
  • shell: gunzip /dest_dir/1.tar.gz

Deployment by Ansible

fatal: [ip]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ip' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic,password).", "unreachable": true}

Ansible two-party deployment: build.tar.gz not found

fatal: [ip]: FAILED! => {"changed": false, "msg": "Could not find or access 'build.tar.gz'
Searched in: /data/projects/ansibleFATE-1.7.2-release-online/roles/check/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/base/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/base/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/tasks/files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/roles/check/tasks/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/deploy/../files/build.tar.gz
/data/projects/ansibleFATE-1.7.2-release-online/deploy/../build.tar.gz on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option"}

Ansible two-party deployment: yum cannot install the dependency package snappy-devel

TASK [base : yum(deploy): install dependency packages] *************************
fatal: [11.50.192.7]: FAILED! => {"changed": false, "failures": ["No package snappy-devel available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}
fatal: [11.50.192.8]: FAILED! => {"changed": false, "failures": ["No package snappy-devel available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}

I found yum -y install gcc gcc-c++ make openssl-devel gmp-devel mpfr-devel libmpc-devel libaio numactl autoconf automake libtool libffi-devel snappy snappy-devel zlib zlib-devel bzip2 bzip2-devel lz4-devel libasan lsof sysstat telnet psmisc iperf3 erlang in xxx/tools/install_base.sh, but even after moving that file away, the same error still occurs.

flow test fails after enabling certificate authentication

Preconditions

Three nodes were deployed (single-sided) with Ansible FATE-1.7.0, configured as follows:
Node 210, Exchange role. Certificates were generated with /bin/bash deploy/deploy.sh keys, and both server-side and client-side authentication are enabled.
Its route_table.json is configured as follows:
{
"route_table":
{
"211":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.211",
"port": 9371
}
]
},
"213":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.213",
"port": 9371
}
]
},
.....
The relevant settings in its eggroll.properties are:
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/exchange-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/exchange-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/exchange-client-client.key

eggroll.core.security.ca.crt.path=/data/projects/data/fate/keys/exchange-ca.pem
eggroll.core.security.crt.path=/data/projects/data/fate/keys/exchange-server.pem
eggroll.core.security.key.path=/data/projects/data/fate/keys/exchange-server.key

Node 213, Host role. The certificates from node 210 were copied to the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/host-client-ca.pem
.....
In addition, its route_table.json is configured as follows:
{
"route_table":
{
"default":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.210",
"port": 9370
}
]
},
"213":
{
"default":[
{
"ip": "10.32.122.213",
"port": 9370
}
],
"fateflow":[
{
"ip": "10.32.122.213",
"port": 9360
}
]
}
},
"permission":
{
"default_allow": true
}
}

The relevant settings in its eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371

eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/host-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/host-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/host-client-client.key

Node 211, Guest role. The certificates from node 210 were copied to the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/guest-client-ca.pem
.....
In addition, its route_table.json is configured as follows:
{
"route_table":
{
"default":
{
"default":[
{
"is_secure": true,
"ip": "10.32.122.210",
"port": 9371
}
]
},
"211":
{
"default":[
{
"ip": "10.32.122.211",
"port": 9370
}
],
"fateflow":[
{
"ip": "10.32.122.211",
"port": 9360
}
]
}
},
"permission":
{
"default_allow": true
}
}

The relevant settings in its eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371

eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/guest-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/guest-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/guest-client-client.key

Test

The fate-rollsite service was restarted on all nodes.
On node 211, I ran:
source /data/projects/fate/bin/init.sh
flow test toy -gid 211 -hid 213

The command fails with the following error:
(venv) app@cestc211:/data/projects/fate/eggroll/conf$ flow test toy -gid 211 -hid 213
{
"jobId": "202204211045098230110",
"retcode": 103,
"retmsg": "Traceback (most recent call last):\n File "/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py", line 124, in submit\n raise Exception("create job failed", response)\nException: ('create job failed', {'guest': {211: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {213: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, Please check rollSite and fateflow network connectivityrpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.32.122.213:9371\nCaused by: java.net.ConnectException: finishConnect(..) 
failed: Connection refused\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:748)\n \n\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)"\n\tdebug_error_string = "{"created":"@1650509120.047216058","description":"Error received from peer ipv4:10.32.122.211:9370","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\\n\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:748)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: /10.32.122.213:9371\\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:748)\\n \\n\\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)","grpc_status":14}"\n>'}}})\n"
}

Additional information

Without certificates (is_secure: false, port 9370), everything works normally.

Expected

A correct configuration for running with certificates enabled.

Thanks!
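One thing worth checking before adjusting the certificates: the deepest error in the trace is "Connection refused: /10.32.122.213:9371", i.e. the Host's rollsite never opened its secure port. In the eggroll.properties snippet shown for node 213 only the client-side cert paths appear, whereas the Exchange's config also sets the server-side eggroll.core.security.ca.crt.path/crt.path/key.path entries; if they really are missing on 213, the secure listener cannot start. A quick connectivity probe from the Guest or Exchange (a diagnostic sketch, not part of FATE):

import socket
import ssl

host, port = "10.32.122.213", 9371
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # only testing whether a TLS listener answers

try:
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            print("TLS listener is up, negotiated", tls.version())
except ConnectionRefusedError:
    print("connection refused: rollsite on %s is not listening on %d" % (host, port))
except ssl.SSLError as e:
    print("TCP connect worked but the TLS handshake failed:", e)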

ValueError: source code string cannot contain null bytes

The Ansible installation fails with "ValueError: source code string cannot contain null bytes".

System: CentOS 7
Package: AnsibleFATE_1.9.2_release_offline.tar.gz

Where it fails:
TASK [build(deploy): python virtual env] ***************************************
changed: [192.168.13.163]
changed: [192.168.13.164]

TASK [python : check(deploy): venv again] **************************************
ok: [192.168.13.164]
ok: [192.168.13.163]

TASK [python : pip(deploy): venv install must packages] ************************
fatal: [192.168.13.163]: FAILED!

Full error log:
fatal: [192.168.13.163]: FAILED! => {"changed": false, "cmd": ["/data/projects/fate/common/python/venv/bin/pip", "install", "-U", "-f", "/data/temp/fate/pypi", "--no-index", "pip", "setuptools", "wheel"], "msg": "
:stderr: Traceback (most recent call last):
File "/data/projects/fate/common/python/venv/bin/pip", line 5, in
from pip._internal.cli.main import main
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/main.py", line 9, in
from pip._internal.cli.autocompletion import autocomplete
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/autocompletion.py", line 10, in
from pip._internal.cli.main_parser import create_main_parser
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/main_parser.py", line 8, in
from pip._internal.cli import cmdoptions
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/cmdoptions.py", line 24, in
from pip._internal.cli.parser import ConfigOptionParser
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/cli/parser.py", line 12, in
from pip._internal.configuration import Configuration, ConfigurationError
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/configuration.py", line 20, in
from pip._internal.exceptions import (
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_internal/exceptions.py", line 13, in
from pip._vendor.requests.models import Request, Response
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/requests/init.py", line 43, in
from pip._vendor import urllib3
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/init.py", line 13, in
from .connectionpool import HTTPConnectionPool, HTTPSConnectionPool, connection_from_url
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/connectionpool.py", line 12, in
from .connection import (
File "/data/projects/fate/common/python/venv/lib/python3.8/site-packages/pip/_vendor/urllib3/connection.py", line 15, in
from .util.proxy import create_proxy_ssl_context
ValueError: source code string cannot contain null bytes
"}

Command to reproduce:
sudo /data/projects/fate/common/python/venv/bin/pip install -U --no-index setuptools

Thanks in advance for taking a look.
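The error means some .py file imported during pip startup contains NUL bytes, which usually points at files corrupted while the offline package was extracted or copied. A short sketch to locate every corrupted file under the venv (the path is taken from the traceback above):

# Scan the venv's site-packages for Python files containing null bytes.
import pathlib

site_packages = pathlib.Path(
    "/data/projects/fate/common/python/venv/lib/python3.8/site-packages")
for py in sorted(site_packages.rglob("*.py")):
    if b"\x00" in py.read_bytes():
        print("null bytes in:", py)

Re-extracting the offline package (or recreating the virtualenv) and rerunning the deploy step is the usual remedy once the corrupt files are confirmed.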

After deploying FATE 2.0.0 with AnsibleFATE, the nodes cannot communicate

Every node passes the single-party test, but the two-party test fails:
(venv) app@VM_0_1_centos:/home/guo$ flow test toy -gid 9999 -hid 10000
{
"code": 1002,
"data": {
"model_id": "202402280731033092380",
"model_version": "0"
},
"job_id": "202402280731033092380",
"message": "Traceback (most recent call last):\n File "/data/projects/fate/fate_flow/python/fate_flow/scheduler/scheduler.py", line 376, in create_all_job\n raise Exception("create job failed", response)\nException: ('create job failed', {'guest': {'9999': {'code': 0, 'message': 'success'}}, 'host': {'10000': {'code': 104, 'message': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = ""\n\tdebug_error_string = "UNKNOWN:Error received from peer {grpc_message:"", grpc_status:2, created_time:"2024-02-28T07:31:03.581646839+00:00"}"\n>'}}})\n"
}

Does AnsibleFATE support multi-host deployment?

For example, when running:
sh deploy/deploy.sh init -h="10000:192.168.0.1" -g="9999:192.168.1.1" -n=spark
is a multi-host form such as -h="10000:192.168.0.1|10001:192.168.0.2" supported?
Or is there some other way to do a multi-host deployment, or is it simply not supported?

Deployment problem

Deploying according to the docs: with the bundled MySQL the connection times out, and with an external MySQL it complains that the mysql command used to connect cannot be found.

The 2.0 test documentation needs to be updated

I installed AnsibleFATE_2.1.0_release_offline.tar.gz and set up a three-node environment.

  • 2.0 branch docs:
    Two-party test:
    flow test min -gid 9999 -hid 10000 -aid 10000
    Running it reports that the test command has no min argument:
    [screenshot]

  • main branch docs:
    Two-party test:
    python run_task.py -gid 9999 -hid 10000 -aid 10000 -f fast
    run_task.py has already been deleted, and copying it over does not work either:
    [screenshot]

Deploying FATE 1.7.0 with Ansible: access denied error when installing MySQL

Following the deployment docs, a single-sided guest deployment on this machine keeps getting stuck at the MySQL installation. The deployment script is run as the app user, but the log shows that the root user is still being used to access MySQL:

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_status.get('stderr_lines')": []
}

TASK [mysql : check(deploy): check if change admin password or not] ************
ok: [My IP]

TASK [mysql : chpasswd(deploy): admin password] ********************************
changed: [My IP]

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_chpasswd_status.get('stderr_lines')": [
        "\u0007mysqladmin: connect to server at '127.0.0.1' failed",
        "error: 'Access denied for user 'root'@'localhost' (using password: NO)'"
    ]
}

TASK [mysql : check(deploy): check if load data or not] ************************
ok: [My IP]

TASK [mysql : commit(deploy): load.sh] *****************************************
changed: [My IP]

TASK [mysql : debug] ***********************************************************
ok: [My IP] => {
    "mysql_load.get('stderr_lines')": [
        "mysql: [Warning] Using a password on the command line interface can be insecure.",
        "ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)"
    ]
}
