alibaba / elastic-federated-learning-solution

License: Apache License 2.0

Shell 1.25% Python 60.35% Dockerfile 0.33% Mako 0.04% TypeScript 14.34% JavaScript 1.15% Less 0.74% EJS 0.47% Java 2.90% CMake 1.37% C++ 17.06%

elastic-federated-learning-solution's Introduction

Elastic-Federated-Learning-Solution

Elastic-Federated-Learning-Solution (EFLS) is a federated learning framework for information cooperation across Internet enterprises. It has been verified in industrial scenarios at the 10-billion scale. EFLS has the following core features: a large-scale, highly available cloud-native architecture, and more powerful and convenient algorithm models based on horizontal aggregation and hierarchical aggregation.

EFLS pays particular attention to privacy protection and encrypted computing. On this basis, EFLS links isolated app data islands, builds machine learning models, and integrates a private set intersection algorithm, a differential privacy algorithm, a large-scale sparse machine learning algorithm, and a visual process console. Together these help users apply and practice federated learning in very large-scale sparse scenarios in search, recommendation, and advertising.

Installation

Git Clone

EFLS depends on third-party libraries that must be cloned recursively. Because the network may be unstable, check after the clone completes that the third-party libraries were actually downloaded.

git clone https://github.com/alibaba/Elastic-Federated-Learning-Solution.git --recursive

# Further check that the third-party libraries were downloaded
cd ${EFLS}
git submodule init && git submodule update --recursive
cd ${EFLS}/efls-train/third_party/grpc
git submodule init && git submodule update --recursive
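
The check described above can also be scripted. The following is a minimal sketch, not part of EFLS: it relies on the fact that `git submodule status` prefixes a submodule that has not been initialized with `-`. The function names and the repository path passed to `check_repo` are hypothetical.

```python
import subprocess


def uninitialized_submodules(status_output: str) -> list:
    """Return paths of submodules that `git submodule status` marks with '-'."""
    missing = []
    for line in status_output.splitlines():
        if line.startswith("-"):
            # Line format is "-<sha1> <path>"; the path is the second field.
            parts = line[1:].split()
            if len(parts) >= 2:
                missing.append(parts[1])
    return missing


def check_repo(repo_dir: str) -> list:
    """Run `git submodule status --recursive` in repo_dir and list missing submodules."""
    result = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return uninitialized_submodules(result.stdout)
```

Calling `check_repo` on the cloned directory returns an empty list when every submodule has been fetched; any path it returns still needs `git submodule update --init --recursive`.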

EFLS provides two deployment modes, stand-alone deployment and cloud native deployment. Users can choose according to their own needs.

Stand-alone Deployment

Environment requirements: docker

EFLS provides stand-alone deployment mode. Users can quickly deploy and test EFLS in stand-alone mode by using docker. Please refer to Standalone Deployment Guide for more information.

Cloud Native Deployment

Environment requirements: docker, kubectl

EFLS provides cloud native deployment and supports large-scale distributed federated learning on the public network. Please refer to Cloud native Deployment Guide for more information.

Documentation

Parameter introduction

We provide an introduction to the parameters of the EFLS-data and EFLS-train part. Please refer to documentation

API Documentation

We provide documentation for dataio, communicator and model in EFLS-train section.

Forward Encryption Introduction

We provide the introduction and usage of forward encryption algorithm in EFLS-train section. Please refer to documentation

Differential Privacy Introduction

We provide the introduction and usage of differential privacy algorithm in EFLS-train section. Please refer to documentation

Algorithm Documentation

We propose a feature fusion method based on horizontal aggregation and a feature fusion method based on hierarchical aggregation. Please refer to documentation

Users can design training algorithms according to their own needs for federated learning training.

elastic-federated-learning-solution's People

Contributors

chrisliu2013, finalljx, jacobisong, sisyphus235, universe-hcy, yanzhangn, zonghua94

elastic-federated-learning-solution's Issues

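In the gradient back-propagation computation of PaillierPassiveWeight, why is the following part there (the data passed by the follower needs to be compared with the leader's data)?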

dw = x * fixedpoint_encode(dy, decrease_precision=True)
sum_mantissa = dw.mantissa.tensor[0]
sum_exponent = dw.exponent[0]
keypair = dw.mantissa.keypair
cond = lambda i, m, e: tf.less(i, tf.shape(inputs)[0])
def body(i, m, e):
    # Add the i-th encrypted fixed-point element to the running sum (m, e).
    fp = (FixedPointTensor(PaillierTensor(keypair, m), e) +
          FixedPointTensor(PaillierTensor(keypair, dw.mantissa.tensor[i]), dw.exponent[i]))
    m = fp.mantissa.tensor
    e = fp.exponent
    return tf.add(i, 1), m, e
_, m, e = tf.while_loop(cond=cond, body=body, loop_vars=[tf.constant(1), sum_mantissa, sum_exponent])
dw = FixedPointTensor(PaillierTensor(keypair, m), e)
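
Read on its own, the loop in the snippet above appears to fold the per-sample products along the first axis into one running sum: the accumulator starts at element 0 and elements 1..n-1 are added one at a time. The plain-Python sketch below mirrors only that control flow on unencrypted numbers. The encoding used here (value = mantissa * BASE ** exponent with BASE = 16) and all function names are assumptions for illustration, not the actual FixedPointTensor or PaillierTensor behavior in EFLS.

```python
BASE = 16  # assumed radix of the fixed-point encoding (hypothetical)


def fp_add(a, b):
    """Add two (mantissa, exponent) pairs after aligning to the smaller exponent."""
    (ma, ea), (mb, eb) = a, b
    e = min(ea, eb)
    return ma * BASE ** (ea - e) + mb * BASE ** (eb - e), e


def reduce_batch(mantissas, exponents):
    """Fold all elements into one sum, starting from element 0 like the while_loop."""
    m, e = mantissas[0], exponents[0]
    i = 1
    while i < len(mantissas):
        m, e = fp_add((m, e), (mantissas[i], exponents[i]))
        i += 1
    return m, e


def decode(m, e):
    """Turn a (mantissa, exponent) pair back into a float."""
    return m * float(BASE) ** e
```

For example, `reduce_batch([3, 5, 2], [-1, -2, 0])` aligns everything to exponent -2 and yields one pair whose decoded value equals the sum of the three inputs.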

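Paillier memory leak

When running the demo I found that memory keeps growing. After a lot of investigation I tentatively traced it to a memory leak in Paillier. Extracting the Paillier module and testing it in isolation does show a memory leak. Have the maintainers observed this? (Tested with a large dataset such as criteo.)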

[centOS] undefined symbol error

As the title says: I tried to build the efls-train image on CentOS. After a series of Makefile adjustments it compiles successfully, but running the demo fails with an undefined symbol error.

Q: How can this project be deployed and run on CentOS?
python3 leader.py --federal_role=leader & python3 follower.py --federal_role=follower
[1] 282
Traceback (most recent call last):
  File "follower.py", line 19, in <module>
Traceback (most recent call last):
  File "leader.py", line 19, in <module>
    import efl
    import efl
  File "/usr/local/lib/python3.6/site-packages/efl/__init__.py", line 24, in <module>
  File "/usr/local/lib/python3.6/site-packages/efl/__init__.py", line 24, in <module>
    from efl import lib
  File "/usr/local/lib/python3.6/site-packages/efl/lib.py", line 27, in <module>
    from efl import lib
  File "/usr/local/lib/python3.6/site-packages/efl/lib.py", line 27, in <module>
    ctypes.CDLL(_LIB_SERVICE_PATH)
  File "/usr/lib64/python3.6/ctypes/__init__.py", line 343, in __init__
    ctypes.CDLL(_LIB_SERVICE_PATH)
  File "/usr/lib64/python3.6/ctypes/__init__.py", line 343, in __init__
    self._handle = _dlopen(self._name, mode)
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.6/site-packages/efl/libefl_service_discovery.so: undefined symbol: _ZN6google8protobuf5Arena18CreateMaybeMessageIN10tensorflow6JobDefEIEEEPT_PS1_DpOT0_
OSError: /usr/local/lib/python3.6/site-packages/efl/libefl_service_discovery.so: undefined symbol: _ZN6google8protobuf5Arena18CreateMaybeMessageIN10tensorflow6JobDefEIEEEPT_PS1_DpOT0_

The version information of the CentOS image used is as follows:

sh-4.2# uname -a
Linux b1dbfb3a4c5a 5.10.47-linuxkit #1 SMP Sat Jul 3 21:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.2# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

docker build of efls-data fails in standalone mode

After pulling the complete project, I ran the following two commands:
cd ${EFLS}/efls-data
sudo docker build -t efls-data:v1 -f ./Dockerfile ./
The error message is:

Err:101 https://mirrors.aliyun.com/debian buster/main amd64 manpages-dev all 4.16-2
Undetermined Error [IP: 150.138.203.250 443]
E: Failed to fetch https://mirrors.aliyun.com/debian/pool/main/m/manpages/manpages-dev_4.16-2_all.deb Undetermined Error [IP: 150.138.203.250 443]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I am working on a Windows 10 machine. How can this be resolved?

grpc_cpp_plugin program not found or is not executable

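Building the EFLS project directly succeeds. However, I set up a new project of my own with the same TF version (v1.15.5), protobuf version (v3.8.0), and gRPC version (v1.25.x) as EFLS, following the way the CMake files in EFLS are written, and the build sometimes succeeds and sometimes fails with "grpc_cpp_plugin program not found or is not executable". Do you know the cause?
Also, the way EFLS builds gRPC differs from the official description. See "install after build" in https://github.com/grpc/grpc/blob/master/BUILDING.md#build-from-source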

EFLS-data for ECDH-PSI join

EFLS-data supports ECDH-PSI and uses curve25519 as the underlying elliptic curve.
In this setup, only the server provides RPC services. The current version computes the intersection result on the server. Computing the intersection on the client will be released in the next version.
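
To illustrate the message flow of a Diffie-Hellman style PSI in which the server ends up with the intersection, here is a toy sketch. It is not the EFLS-data implementation: it swaps curve25519 for modular exponentiation modulo the Mersenne prime 2**127 - 1, which is insecure and only demonstrates the commutative double-blinding idea. All function and parameter names are hypothetical.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus; stands in for the curve25519 group (assumption)


def hash_to_group(item: str) -> int:
    """Map an identifier to an integer in [1, P-1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(values, key):
    """Raise every value to the secret key; blinding with two keys commutes."""
    return [pow(v, key, P) for v in values]


def psi_on_server(client_ids, server_ids, client_key=None, server_key=None):
    """Return the server-side ids that also occur on the client."""
    a = client_key or secrets.randbelow(P - 2) + 1  # client secret
    b = server_key or secrets.randbelow(P - 2) + 1  # server secret
    # Client -> server: client ids blinded once with a.
    client_once = blind([hash_to_group(x) for x in client_ids], a)
    # Server -> client: server ids blinded once with b.
    server_once = blind([hash_to_group(y) for y in server_ids], b)
    # Client -> server: server values blinded a second time, same order.
    server_twice = blind(server_once, a)
    # Server blinds the client values a second time and compares.
    client_twice = set(blind(client_once, b))
    return [y for y, t in zip(server_ids, server_twice) if t in client_twice]
```

Because H(x)^(ab) equals H(x)^(ba), identical ids collide after both parties have applied their keys, while neither side sees the other's raw ids.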

EFLS-data for feature increment join

EFLS-data provides a one-way (single-party) feature increment join. A party that owns both a primary table and an auxiliary table can use the auxiliary table to update the features of the primary table (adding columns that do not exist yet and overwriting columns that already exist).
For more detail, refer to /docs/efls-data/quick_start_feature_inc.md
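
The update rule in parentheses can be sketched with plain dictionaries. This is only an illustration of "add missing columns, overwrite existing ones", not the EFLS-data code path. The function name, the join key name `example_id`, and the choice to leave unmatched primary rows unchanged are assumptions.

```python
def feature_increment(primary, auxiliary, key="example_id"):
    """Update primary rows with columns from the auxiliary row sharing the same key."""
    aux_by_key = {row[key]: row for row in auxiliary}
    merged = []
    for row in primary:
        updated = dict(row)  # copy so the primary table is not mutated
        # Existing columns are overwritten, new columns are added.
        updated.update(aux_by_key.get(row[key], {}))
        merged.append(updated)
    return merged
```

With a primary row `{"example_id": 1, "age": 30}` and an auxiliary row `{"example_id": 1, "age": 31, "city": "HZ"}`, the result keeps one row for id 1 where `age` is overwritten and `city` is added.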

EFLS-data for csv,leveldb and client2multiserverbucket

EFLS-data supports reading and writing CSV files when the parameter "--inputfile_type='csv'" is set. The first line of the CSV file is read as the header and is used to determine which columns are hash_col_name and sort_col_name.
EFLS-data supports leveldb as the data storage structure when the parameters "--sample_store_type='leveldb', --db_root_path='/XXX'" are set.
EFLS-data supports intersecting a single client bucket with multiple server buckets. The client bucket is split further, and each part is then intersected with the corresponding server bucket one by one. To use this, set the parameter "--client2multiserver=2".
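
Two of the behaviors above lend themselves to a short sketch: locating the join columns from the CSV header, and splitting one client bucket into several sub-buckets. The code below is an illustration under stated assumptions, not the EFLS-data implementation. The source does not say which hash or split rule EFLS-data uses, so the CRC32-modulo rule and both function names are hypothetical.

```python
import csv
import io
import zlib


def locate_columns(csv_text, hash_col_name, sort_col_name):
    """Read the first CSV line as the header and return the two column indices."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return header.index(hash_col_name), header.index(sort_col_name)


def split_bucket(keys, factor):
    """Split one client bucket into `factor` sub-buckets by a stable hash (assumed rule)."""
    sub_buckets = [[] for _ in range(factor)]
    for k in keys:
        sub_buckets[zlib.crc32(k.encode("utf-8")) % factor].append(k)
    return sub_buckets
```

With a factor of 2, matching the "--client2multiserver=2" example, each client key lands in exactly one of two sub-buckets, and each sub-bucket would then be joined against its corresponding server bucket.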

Path error when building the docker image for standalone deployment: cd: can't cd to /xfl/third_party/curve25519

=> [ 2/11] WORKDIR /xfl 0.2s
=> [ 3/11] RUN sed -i 's#http://deb.debian.org#https://mirrors.aliyun.com#g' /etc/apt/sources.list && echo 'deb http://mirr 4596.0s
=> [ 4/11] COPY . /xfl 0.1s
=> [ 5/11] RUN cd /xfl && wget https://nginx.org/download/nginx-1.20.1.tar.gz && tar -zxvf nginx-1.20.1.tar.gz && cd ./ng 16.5s
=> [ 6/11] RUN cd /xfl/xfl-java && wget -q https://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-b 659.3s
=> [ 7/11] RUN ln -s /usr/bin/python3 /usr/bin/python

[ 8/11] RUN cd /xfl/third_party/curve25519 && python setup.py install:
#12 0.614 /bin/sh: 1: cd: can't cd to /xfl/third_party/curve25519


executor failed running [/bin/sh -c cd /xfl/third_party/curve25519 && python setup.py install]: exit code: 2

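Does building the image for standalone deployment of efls-data require a FLINK-K8S environment?

https://github.com/alibaba/Elastic-Federated-Learning-Solution/blob/master/docs/English/Standalone_Deployment_CN.md
After following the standalone deployment guide above, running python /xfl/test/test_data_join.py produces the error below.
It is triggered by env = StreamExecutionEnvironment.get_execution_environment().
Is it necessary to deploy a Flink environment on k8s before starting the docker image in order to use standalone mode?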

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "Thread-4" java.lang.NoClassDefFoundError: org/apache/flink/table/functions/python/PythonFunction
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
        at java.lang.Class.privateGetPublicMethods(Class.java:2902)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1188, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1014, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1193, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

        at java.lang.Class.getMethods(Class.java:1615)
        at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:451)
        at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:339)
        at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:639)
        at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:557)
        at java.lang.reflect.WeakCache$Factory.get(WeakCache.java:230)
Traceback (most recent call last):

  File "./test_data_join.py", line 180, in <module>
        at java.lang.reflect.WeakCache.get(WeakCache.java:127)
        at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:419)
        at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:719)
        at org.apache.flink.api.python.shaded.py4j.Gateway.createProxy(Gateway.java:368)
        at org.apache.flink.api.python.shaded.py4j.Protocol.getPythonProxy(Protocol.java:433)
        at org.apache.flink.api.python.shaded.py4j.Protocol.getObject(Protocol.java:311)
    t.test_psi_join()
        at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:82)
  File "./test_data_join.py", line 174, in test_psi_join

        at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:77)
        at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
    run_client_and_server()

  File "./test_data_join.py", line 91, in run_client_and_server
Caused by: java.lang.ClassNotFoundException: org.apache.flink.table.functions.python.PythonFunction
    'example_id', 'example_id', 8)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  File "./test_data_join2.py", line 78, in __init__

        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    conf=conf)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  File "/xfl/xfl/data/pipelines.py", line 71, in __init__
        ... 19 more
    env = get_flink_batch_env(conf)
  File "/xfl/xfl/data/pipelines.py", line 41, in get_flink_batch_env
    env = StreamExecutionEnvironment.get_execution_environment()
  File "/usr/local/lib/python3.7/dist-packages/pyflink/datastream/stream_execution_environment.py", line 688, in get_execution_environment
    gateway = get_gateway()
  File "/usr/local/lib/python3.7/dist-packages/pyflink/java_gateway.py", line 75, in get_gateway
    _gateway.entry_point.put("PythonFunctionFactory", PythonFunctionFactory())
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1286, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/dist-packages/pyflink/util/exceptions.py", line 146, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling t.put
