intel-bigdata / spark-pmof

Spark Shuffle Optimization with RDMA+AEP

License: Apache License 2.0

Scala 10.08% Java 2.89% C++ 85.04% Makefile 0.09% CMake 0.37% C 1.53%

aep rdma shuffle spark

spark-pmof's Issues

Enable PMoF run in fsdax mode

Spark-PMoF currently does not work in AEP's FSDAX mode.
FSDAX support should be available as an option so that users without an RDMA NIC can still run on persistent memory (RDMA is complex to set up).
The code therefore needs to be modified to run in FSDAX mode.

Create the fsdax pool file and specify its size in Scala (this differs from devdax).
When creating a pool, identify whether the device type is fsdax or devdax, and add the corresponding conditional checks to the native code.
Other potential risks
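
One way to tell the two modes apart is to look at the file type of the configured path: a devdax namespace is exposed as a character device such as /dev/dax0.0, while an fsdax pool is a regular file on a DAX-mounted filesystem. The sketch below illustrates that check in Python; it is not the project's native code. The function name `classify_pmem_path` is hypothetical, and the sketch assumes that any character device passed in is a devdax device.

```python
import os
import stat


def classify_pmem_path(path):
    """Return "devdax" for a character device, "fsdax" for anything else.

    Assumption: the caller only passes paths from the PMem device list, so a
    character device is taken to be a devdax namespace and a regular (or not
    yet created) file is taken to be an fsdax pool file.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # An fsdax pool file may not exist before the pool is created.
        return "fsdax"
    return "devdax" if stat.S_ISCHR(mode) else "fsdax"
```

With such a check in place, the pool-creation path could pass an explicit size only in the fsdax case, as the first item suggests.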

The error "Heartbeat exception: Failed to send heart beat to active proxy" is thrown randomly.

Version:
wip_spark_rpmp branch

Describe the bug
The error in the title is thrown on the proxy server, the standby proxy server, and the data server after they have been running idle for some time.

To Reproduce

Launch proxy
Launch standby proxy
Launch data server

Expected Behavior
All services stay healthy all the time.

A configuration property is redundant

In rpmp.conf, there is a property called rpmp.node.list, which appears redundant with rpmp.network.server.address. We could consider keeping only one of them.

The unit test put_and_get will fail when there are multiple data servers

Version:
wip_spark_rpmp branch

Describe the bug
As the title says, a timeout error is thrown.

To Reproduce

Launch proxy
Launch standby proxy
Launch data server
Launch another data server
Launch put_and_get

Expected Behavior
The test suite finishes successfully

libhpnl exception when RDMA is disabled

The Spark job starts, but at the map stage (stage 1) it is terminated with the following error:
java:229613 terminated with signal 11 at PC=7f1dec8dfeab SP=7f1dc4ba23a0. Backtrace:
/usr/local/lib/libhpnl.so(_ZN23CQExternalDemultiplexer10wait_eventEPP6fid_eqPiS3_+0x2f)[0x7f1dec8dfeab]
/usr/local/lib/libhpnl.so(_ZN17ExternalCqService13wait_cq_eventEiPP6fid_eqPiS3_+0x66)[0x7f1dec8e2804]
/usr/local/lib/libhpnl.so(Java_com_intel_hpnl_core_CqService_wait_1cq_1event+0x56)[0x7f1dec8e23e9]

dependency spark version 2.4.4 causes the compilation to fail

On the master branch, the Spark version in the pom file has been updated to 2.4.4.
As a result, the import of org.apache.spark.internal.Logging in PmofShuffleManager fails.

Failed to open pmem pool

Hello guys.

I am trying to run your project (Release v1.0.2) on a Spark standalone cluster over Intel Optane persistent memory, but I have some problems with the deployment. I followed this guide (https://github.com/Intel-bigdata/Spark-PMoF/blob/master/doc/Spark-PMoF-enabling-guide.pdf), but I found some differences from the master branch:

spark.shuffle.manager org.apache.spark.shuffle.pmof.RdmaShuffleManager (I can't find this class in the project).
I use spark.shuffle.manager org.apache.spark.shuffle.pmof.PmofShuffleManager instead of RdmaShuffleManager.

But when I run the Databricks TPC benchmark, I receive this error:

Metastore DB connected: jdbc:sqlite:/tmp/spark-e2ba2c50-4d03-4bf1-aac5-430c740ef8ab/executor-c2b3dd32-9736-42e7-b7f5-71bf1b0820e7/spark_shuffle_meta.db
UPDATE devices SET mount_count = 4 WHERE device = '/dev/dax0.0'

Metastore DB: get unused device, should be /dev/dax0.0.
**failed to open pmem pool, errmsg: invalid major version (0)**

Previously, I formatted my namespace as described in your document:
Install and configure DCPM

1) Please install ipmctl and ndctl according to your OS version
2) Run ipmctl show -dimm to check whether dimms can be recognized
3) Run ipmctl create -goal PersistentMemoryType=AppDirect to create AD mode
4) Run ndctl list -R , you will see region0 and region1 in screen
Suppose we have 4x DCPM on two sockets.
a) Run ndctl create-namespace -m devdax -r region0 -s 120g
e) Then we will see /dev/dax0.0

My spark-defaults configuration is (testing PMem only, no RDMA):

spark.executor.extraClassPath      /opt/benchmarks_directory/Spark-PMoF/core/target/java-1.0-jar-with-dependencies.jar:/opt/benchmarks_directory/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.1-SNAPSHOT.jar
spark.driver.extraClassPath        /opt/benchmarks_directory/Spark-PMoF/core/target/java-1.0-jar-with-dependencies.jar:/opt/benchmarks_directory/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.1-SNAPSHOT.jar

spark.shuffle.manager org.apache.spark.shuffle.pmof.PmofShuffleManager

#new version
#spark.shuffle.manager org.apache.spark.shuffle.pmof.RdmaShuffleManager
spark.shuffle.pmof.enable_rdma false
spark.shuffle.pmof.enable_pmem true
spark.shuffle.pmof.max_stage_num 1
spark.shuffle.pmof.max_task_num 50000
spark.shuffle.spill.pmof.MemoryThreshold 16777216
spark.shuffle.pmof.pmem_capacity 100340914688
spark.shuffle.pmof.pmem_list /dev/dax0.0
spark.shuffle.pmof.dev_core_set dax0:0-71,dax0:0-71,dax1:0-71,dax1:0-71,dax0:0-71,dax0:0-71
spark.shuffle.pmof.server_buffer_nums 64
spark.shuffle.pmof.client_buffer_nums 64
spark.shuffle.pmof.map_serializer_buffer_size 262144
spark.shuffle.pmof.reduce_serializer_buffer_size 262144
spark.shuffle.pmof.chunk_size 262144
spark.shuffle.pmof.server_pool_size 3
spark.shuffle.pmof.client_pool_size 3
spark.shuffle.pmof.shuffle_block_size 2097152

My third-party library stack is as follows (I chose these versions according to https://github.com/Intel-bigdata/Spark-PMoF/blob/master/docker/ubuntu18/DockerFile):

spark-2.3.0-bin-hadoop2.7
pmdk 1.6
libfabric v1.8.0
HPNL spark-pmof-test branch

Can you help me? If there is a library stack that you recommend, I would appreciate knowing it.

Spark+PMEM: long read block time

From the log, we found that the remote block fetch time is quite long; see the benchmark result.
The current conclusion is that the long read block time is caused by Netty.

taskset to the node where pmem device sits on

Instead of pinning to C0 with taskset, we should check which node the PMEM device is on and bind the thread to that node.
For example, if /dev/dax0.0 is on node 0, we taskset to a core on node 0.

https://github.com/Intel-bigdata/SSO/blob/a7ae44c3317d4e328169351ef628cc56c2d46578/src/main/scala/org/apache/spark/shuffle/pmof/PmemShuffleWriter.scala#L133
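
The binding described in this issue could look roughly like the following Python sketch. It assumes Linux, where /sys/devices/system/node/node<N>/cpulist lists the cores of a node in a format such as 0-17,36-53 and `os.sched_setaffinity` restricts the caller to a set of cores. How the node of a given device is discovered is left to the caller, and the helper names are hypothetical; this is not the code used in PmemShuffleWriter.

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def bind_to_node(node_id):
    """Pin the caller to the cores of node `node_id` (Linux only)."""
    with open(f"/sys/devices/system/node/node{node_id}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the caller
    return cpus
```

For example, if /dev/dax0.0 sits on node 0, calling `bind_to_node(0)` before the shuffle write starts would restrict execution to node 0's cores instead of a fixed core.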

[RPMP][bug] The proxy will kill itself if no RPMP node connected in a period of time.

Launch the proxy with ./proxyMain and wait for a while without launching any RPMP nodes; the proxy then kills itself.

[RPMP] The stored node status table might be emptied when more than one node connects.

Launch the proxy.
Launch one node.
Check NODE_STATUS in Redis.
Launch another node.
Check NODE_STATUS again; the stored node status table might have been reset.

RDMA Enabling cannot allocate memory

ENV:

Spark 2.3.1(hdp)
Hadoop 3.1.0
HiBench Terasort workload 500G data
RDMA nic CX4 ( or CX3 Pro)
When I enabled RDMA, I encountered the error fi_mr_reg: cannot allocate memory, which subsequently caused an NPE. When using the CX4 NIC, there is also an ArrayIndexOutOfBoundsException. The rping test passes.
The initial RDMA configuration is very complicated. Is there a simpler way to enable PMoF with RDMA?

Delete the residual file of fsdax

See "Enable PMoF run in fsdax mode".
There seems to be a new problem: after the job finishes, the shuffle file is not automatically deleted,
even though pmempool info --stats <file> reports that the utilization rate is almost zero.
A file delete operation needs to be added for fsdax mode.

Tips:
When using fsdax mode, you can adjust the number of executors more freely, and the program may run faster.

Client connection and RPMP data server connection failure issue

In my deployment with one proxy and one data server, everything works normally before any RPMP client request arrives: the data server periodically sends heartbeats to the proxy as expected. But after a client writes or reads data one or more times (I used the put_and_get test), the data server fails to send heartbeats to the proxy, and subsequent client writes and reads fail. I found that some threads in the proxy exit, which at least means that heartbeat messages from the data server get no response.

The commit below is involved in this bug. Please help fix it.
Persist data put job status for future potential job recovery. (#118)

Pass IP address to RPMP server process from start script

The start script can get the server IP address from the config, and it then connects to that host over ssh to launch the server. At launch, the corresponding IP address can be passed to the server process. This looks more straightforward and can avoid some potential issues.

Free the devdax pool Error

In devdax mode, an unknown error occurs in the pool cleanup process while running a task with a large data volume, causing the process to be killed.
This does not affect the correctness of the current job, but it may cause an exception in the next job, such as a busy or unavailable devdax device.
It needs to be fixed.

Delete fsdax files independently

When running PMoF jobs with large volumes of data, FSDAX files often cannot be deleted, which affects the next job.
For example, when running a 2TB Terasort test, the FSDAX file is not always deleted.
The problem might lie in cleaning up the pool, but FSDAX does not need pool cleanup and can delete the files directly using POSIX operations.
Therefore, it is recommended to separate the fsdax file deletion from the devdax cleanup.
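
A minimal sketch of that separation, in Python for illustration only: fsdax pool files are removed with a plain POSIX unlink, while devdax paths are skipped and left to the existing pool cleanup. The function name and the idea of passing the paths as a list are assumptions, not the project's API.

```python
import os
import stat


def delete_fsdax_files(paths):
    """Unlink regular files in `paths`; return the paths that were removed.

    Character devices (assumed to be devdax) and missing files are skipped,
    so devdax cleanup stays with the pool-level code path.
    """
    removed = []
    for path in paths:
        try:
            mode = os.lstat(path).st_mode
        except FileNotFoundError:
            continue  # already gone, nothing to do
        if stat.S_ISREG(mode):
            os.remove(path)  # POSIX unlink; no pool cleanup involved
            removed.append(path)
    return removed
```

Because the deletion never touches the pool handle, a failure in devdax pool cleanup would no longer leave fsdax files behind for the next job.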
