hdk's Introduction

PROJECT NOT UNDER ACTIVE MANAGEMENT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: [email protected]

HDK - Heterogeneous Data Kernels

HDK is a low-level execution library for data analytics processing.

HDK is used as a fast execution backend in Modin. The HDK library provides a set of components for federating analytic queries to an execution backend based on OmniSciDB. Currently, HDK targets OLAP-style queries expressed as relational algebra or SQL. The APIs required for Modin support have been exposed in a library installed from this repository, pyhdk. Major and immediate project priorities include:

  • Introducing an HDK-specific IR and a set of optimizations to reduce reliance on RelAlg and improve the extensibility of the query API.
  • Supporting heterogeneous device execution, where a query is split across a set of hardware devices (e.g. CPU and GPU) for best performance. We have developed an initial cost model for heterogeneous execution.
  • Improving performance of the CPU backend on Modin-specific queries and current-generation data science workstations and servers by > 2x.

We are committed to supporting a baseline set of functionality on all x86 CPUs, later-generation NVIDIA GPUs (supporting CUDA 11+), and Intel GPUs. The x86 backend uses LLVM ORCJIT for x86 byte code generation. The NVIDIA backend uses NVPTX extensions in LLVM to generate PTX, which is JIT-compiled by the CUDA runtime compiler. The Intel GPU backend leverages the LLVM SPIR-V translator to produce SPIR-V. Device code is generated using the Intel Graphics Compiler (IGC) via the oneAPI L0 driver.

Components

Config

Config controls library-wide properties and must be passed to Executor and DataMgr. Default config objects should suffice for most installations. Instantiate a config first as part of library setup.
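A minimal setup sketch, assuming pyhdk exposes a buildConfig() helper that returns a default Config object (the helper name is an assumption, not taken from this README):

import pyhdk

# Build a default configuration; buildConfig() is assumed here to be the
# helper returning a Config with default settings. Check the installed
# pyhdk version for the exact name.
config = pyhdk.buildConfig()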

Storage

ArrowStorage is currently the default (and only available) HDK storage layer. ArrowStorage provides storage support for Apache Arrow format data. The storage layer must be explicitly initialized:

import pyhdk
storage = pyhdk.storage.ArrowStorage(1)

The parameter passed to the ArrowStorage constructor is the database ID, which allows storage instances to be kept logically separate.
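For example, two storage instances created with different database IDs hold independent sets of tables (a small sketch using only the constructor shown above):

# Tables imported into storage_a are not visible through storage_b.
storage_a = pyhdk.storage.ArrowStorage(1)
storage_b = pyhdk.storage.ArrowStorage(2)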

ArrowStorage automatically converts Arrow datatypes to OmniSciDB datatypes. Some variable-length types are not yet supported, but scalar types are available. pyarrow can be used to convert pandas DataFrames to Arrow:

import pandas
import pyarrow

at = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
)

The Arrow table can then be imported using the Arrow storage interface:

opt = pyhdk.storage.TableOptions(2)
storage.importArrowTable(at, "test", opt)
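The same interface can be used to register additional tables, for example a second integer-only DataFrame (a sketch reusing the calls above; the table name and columns are illustrative):

at2 = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "c": [100, 200, 300]})
)
storage.importArrowTable(at2, "test2", pyhdk.storage.TableOptions(2))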

Data Manager

The Data Manager controls the storage and in-memory buffer pools for all queries. Storage engines must be registered with the data manager:

data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)

Query Execution

Three high-level components are required to execute a query:

  1. Calcite: This is a wrapper around Apache Calcite handling SQL parsing and relational algebra optimization. Queries are first sent to Calcite for parsing and conversion to relational algebra. Depending on the query, some optimization of the relational algebra occurs in Calcite.
  2. RelAlgExecutor: Handles execution of a relational algebra tree. Only one should be created per query.
  3. Executor: The JIT compilation and query execution engine. Holds state which spans queries (e.g. code cache). Should be created as a singleton and re-used per query.

The complete flow is as follows:

calcite = pyhdk.sql.Calcite(storage)
executor = pyhdk.Executor(data_mgr)
ra = calcite.process("SELECT * FROM t;")
rel_alg_executor = pyhdk.sql.RelAlgExecutor(
    executor, storage, data_mgr, ra
)
res = rel_alg_executor.execute()

Calcite reads the schema information from storage, and the Executor stores a reference to Data Manager for buffer/storage access during a query.

The return from RelAlgExecutor is a ResultSet object which can be converted to Arrow and to pandas:

df = res.to_arrow().to_pandas()
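For debugging, the same executor can also produce a plan explanation instead of results; a short sketch mirroring the just_explain usage shown in the pyhdk example later on this page:

res = rel_alg_executor.execute(just_explain=True)
print(res.to_explain_str())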

Examples

Standalone examples are available in the examples directory. Most examples run via Jupyter notebooks.

Build

Dependencies

A Miniconda installation is required (Anaconda may produce build issues). Use one of the official Miniconda installers.

Conda environments are used for HDK development. Use the YAML file in omniscidb/scripts/:

conda env create -f omniscidb/scripts/mapd-deps-conda-dev-env.yml
conda activate omnisci-dev

Compilation

If using a Conda environment, run the following to build and install HDK:

mkdir build && cd build
cmake ..
make -j 
make install

By default GPU support is disabled.

To verify the installation, check that python -c 'import pyhdk' runs without an error.
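A slightly larger smoke test, a sketch using only the calls introduced above, also checks that the storage layer is constructible:

import pyhdk

# If the build and install succeeded, both lines should run without errors.
storage = pyhdk.storage.ArrowStorage(1)
print("pyhdk imported and ArrowStorage constructed")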

Compilation with Intel GPU support

Dependencies

Install extra dependencies into the existing environment:

conda install -c conda-forge level-zero-devel pkg-config

Compilation

mkdir build && cd build
cmake -DENABLE_L0=on ..
make -j
make install

Compilation with CUDA support

Dependencies

Install extra dependencies into an existing environment or a new one.

conda install -c conda-forge cudatoolkit-dev arrow-cpp-proc=3.0.0=cuda arrow-cpp=11.0=*cuda

Compilation

mkdir build && cd build
cmake -DENABLE_CUDA=on ..
make -j
make install

Issues

If you encounter issues during the build, refer to .github/workflows/build.yml. This file describes the compilation steps used for the CI build.

If you are still facing issues, please create a GitHub issue.

Test

Python tests can be run from the python source directory using pytest.

HDK interface tests

pytest python/tests/*.py 

Modin integration tests

pytest python/tests/modin

All pytests

pytest python/tests/ 

(Optional dependency) Modin installation

To install Modin into the conda environment, clone Modin and run:

cd modin && pip install -e .

Pytest logging

To enable logging, add the following call to the setup_class(..) body:

pyhdk.initLogger(debug_logs=True)
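For example (a sketch; the test class name is illustrative):

import pyhdk

class TestQueries:
    @classmethod
    def setup_class(cls):
        # Enable debug logging before any HDK objects are created.
        pyhdk.initLogger(debug_logs=True)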

Logs are by default located in the hdk_log/ folder.


hdk's Issues

Allow GPU execution

Add GPU manager initialization and options for controlling device selection at the user level.

Replicate jit-engine CI here

In preparation for repo merge, we should add the GitHub actions from the other repo here, to run under the omniscidb folder. We will need to copy the actions over and update the paths.

Allow building HDK from an arbitrary folder

Our current flow assumes the exact location of the build folder. This is a request to lift that restriction to allow something like this:

cd /my/build/folder
cmake /path/to/hdk

Heterogeneous execution fails on assert

There seems to be a flaw with recompilation when QueryMustRunOnCPU is thrown.
ArrowBasedExecutionTest fails:

2022-11-17T08:53:04.345347 F 2829187 0 0 RelAlgExecutor.cpp:622 Check failed: co.device_type == ExecutorDeviceType::GPU

Modin doesn't work with PyHDK with submodule updated to the latest jit-engine branch

If I update the omniscidb submodule to the latest jit-engine branch, I get this error when trying to parse RelAlg JSON queries in Calcite:

java.lang.NoClassDefFoundError: com/fasterxml/jackson/annotation/JsonIncludeProperties
        at com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector.findPropertyInclusionByName(JacksonAnnotationIntrospector.java:321) ~[calcite-1.0-SNAPSHOT-jar-with-dependencies.jar:?]

It looks like it is related to the latest change in the jackson-databind version used by Calcite. The problem can be reproduced using the ienkovich/config branch of HDK and the ienkovich/pyhdk-config branch of Modin.

Support extract/date time runtime in L0 backend

For Taxi Q3/Q4, we need to determine how to pull the Date/Time runtime into SPIRV. For CUDA, we compile the extension functions into a CUDA FatBinary at build time, then use the CUDA linker. We could follow a similar approach with SPIRV, or move the time extraction functions to the module and inline them during the JIT process. The downside to this could be increased module compile time (though there are some optimizations meant to keep such increases to a minimum), so we are considering building a benchmark to test.

Bringing in jit-engine causes compiler error with date/time runtime

Compiler error bringing in extract from time code:

heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp: In function 'int64_t ExtractFromTime(ExtractField, int64_t)':
heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp:156:1: error: inlining failed in call to always_inline 'int64_t extract_epoch(int64_t)': function body can be overwritten at link time
  156 | extract_epoch(const int64_t timeval) {
      | ^~~~~~~~~~~~~
heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp:270:27: note: called from here
  270 |       return extract_epoch(timeval);
      |              ~~~~~~~~~~~~~^~~~~~~~~

Support bringing jit-engine branch in as module

Functionality required:

  • QueryEngine
  • Analyzer
  • DataMgr / ArrowStorage
  • Possibly Calcite, though initially we can directly generate Analyzer nodes
  • ArrowStorageExecuteTest

Functionality not required/desired:

  • Parser/ParserNode
  • Catalog
  • Thrift/DBHandler

Initial attempts have failed due to linking problems, but we can try again once https://github.com/intel-ai/omniscidb/pull/332 lands.

Also requires:

  • document endpoints exposed for integration
  • support build and minimal test w/ CI to prevent regressions (on either side)

JVM Initialization preventing back-to-back test runs

Running pytest from the hdk tests directory causes a crash on the second test.

Specifically, this line is failing:

if (JNI_CreateJavaVM(&jvm, (void**)&env, &vm_args) != JNI_OK) {
  LOG(FATAL) << "Couldn't initialize JVM.";
}

Because the logger is no longer available at that point, we fail to log the message and abort.

The problem appears to be the JNI context trying to initialize the JVM twice.

Full backtrace:

Thread 1 "python3" received signal SIGABRT, Aborted.
0x00007ffff7c8e36c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7c8e36c in ?? () from /usr/lib/libc.so.6
#1 0x00007ffff7c3e838 in raise () from /usr/lib/libc.so.6
#2 0x00007ffff7c28535 in abort () from /usr/lib/libc.so.6
#3 0x00007fff29d4cac0 in logger::Logger::~Logger (this=0x7fffffff6110, __in_chrg=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Logger/Logger.cpp:459
#4 0x00007fff29959551 in (anonymous namespace)::JVM::createJVM (max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:144
#5 (anonymous namespace)::JVM::getInstance (max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:88
#6 CalciteJNI::Impl::Impl (this=0x5555564e1060, schema_provider=..., udf_filename=..., calcite_max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:171
#7 0x00007fff2995ab13 in std::make_unique<CalciteJNI::Impl, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long&> ()
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/unique_ptr.h:857
#8 CalciteJNI::CalciteJNI (this=0x555556a04190, schema_provider=..., udf_filename=..., calcite_max_mem_mb=1024)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:602
#9 0x00007fff2997906e in __gnu_cxx::new_allocator<CalciteJNI>::construct<CalciteJNI, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (this=<optimized out>, __p=0x555556a04190)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/new_allocator.h:146
#10 std::allocator_traits<std::allocator<CalciteJNI> >::construct<CalciteJNI, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=..., __p=0x555556a04190)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/alloc_traits.h:483
#11 std::_Sp_counted_ptr_inplace<CalciteJNI, std::allocator<CalciteJNI>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (
__a=..., this=0x555556a04180)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:548
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<CalciteJNI, std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=...,
__p=<optimized out>, this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:679
#13 std::__shared_ptr<CalciteJNI, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__tag=...,
this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:1344
#14 std::shared_ptr<CalciteJNI>::shared_ptr<std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__tag=..., this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:359
#15 std::allocate_shared<CalciteJNI, std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=...)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:702

Allow building the engine with L0 support in conda environment

Since the default L0 driver location is in the system libraries, there's an issue when building the jit engine. CMake's find_package looks for the headers and finds them in /usr/include, but CMake does not add that directory to the include paths even if target_include_directories is set explicitly to include the system paths (see https://gitlab.kitware.com/cmake/cmake/-/issues/17966 for details). Providing hints/paths to find_package breaks the linking process for other libraries due to conflicts.
There is also currently no conda package that we could use to avoid the system includes/libraries. We need to either build a package or create a workaround for building jit-engine with L0 under a conda env.

Executor holds dangling reference to data mgr

In the unit tests, we delete storage between each test. This deletes DataMgr, but the Executor ends up with a pointer to the old DataMgr. This results in a segfault the next time an Executor is created, because the Python Executor class calls getExecutor which pulls from the Executor pool.
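A rough reproduction sketch based on the description above, using only the Python API shown earlier on this page (the actual unit tests tear storage down differently):

storage = pyhdk.storage.ArrowStorage(1)
data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)
executor = pyhdk.Executor(data_mgr)

# The test teardown drops storage and its DataMgr...
del storage, data_mgr

# ...but the pooled Executor still holds a pointer to the deleted DataMgr,
# so the next Executor construction (which pulls from the pool) can segfault.
executor = pyhdk.Executor(pyhdk.storage.DataMgr())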

Add working pyhdk example -- to readme?

e.g.:

import pyhdk
storage = pyhdk.storage.ArrowStorage(1) # 1 is schema id
data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)

calcite = pyhdk.sql.Calcite(storage)
executor = pyhdk.Executor(data_mgr)

import pyarrow
import pandas
at = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
)
opt = pyhdk.storage.TableOptions(2)
storage.importArrowTable(at, "test", opt)

sql = "SELECT * FROM test;"
ra = calcite.process(sql)
rel_alg_executor = pyhdk.sql.RelAlgExecutor(executor, storage, data_mgr, ra)
print(rel_alg_executor.execute().to_arrow().to_pandas())

print(rel_alg_executor.execute(just_explain = True).to_explain_str())

create manylinux2014_x86_64 build

Status:

The manylinux2014_x86_64 container does not work because of an outdated repo URL. The container from cibuildwheel cannot be used because it does not have sudo.

It seems people build their own containers for their builds and check them using auditwheel.

Fix import of Arrow table with time32[s] data

Currently, ArrowStorage fails to import such data due to improper schema checks, but other issues might also exist in the actual data import.

C++ exception with description "Mismatched type for column col4: timestamp[s] vs. time32[s]" thrown in the test body.

Flaky Select.FilterAndSimpleAggregation test

After CUDA tests were introduced to CI, we saw failures of the Select.FilterAndSimpleAggregation test. The failure is flaky, but if it fails, it always fails in the same way:

Expected equality of these values:
  20
  v<int64_t>( run_simple_agg("SELECT COUNT(*) FROM test WHERE MOD(x, 7) <> 7;", dt))
    Which is: 22

I found out that it's enough to leave only this particular query in the test to reproduce the failure. The query is supposed to return the number of rows but somehow returns a greater value. The input table has 10 fragments, 2 rows each.

I dumped generated IR module and all data copied to a CUDA device. Dumps are the same for good and bad runs. It looks like we run the same code on the same data but get different results.

I was able to reproduce it on an August 30 version of the jit-engine branch, so the problem is not new. I don't know when it was introduced.

pyhdk from conda-forge segfaults

Here is the scenario:

conda env remove -n omnisci-dev
conda env update -f omniscidb/scripts/mapd-deps-conda-dev-env.yml
git clone https://github.com/intel-ai/modin.git
git checkout ienkovich/pyhdk
conda activate omnisci-dev
mamba install -c conda-forge pyhdk
cd modin/
pip install -e .
cd ..
python python/tests/modin/modin_smoke_test.py

Here is the error:

UserWarning: Distributing <class 'list'> object. This may take some time.
FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
0    12
Name: a, dtype: int64
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2bcc5c3b85, pid=2895537, tid=2895537
#
# JRE version: OpenJDK Runtime Environment (11.0.15) (build 11.0.15-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.15-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjimage.so+0x2b85]  ImageStrings::find(Endian*, char const*, int*, unsigned int)+0x65
#
# Core dump will be written. Default location: /localdisk2/afedotov/git/hdk/core
#
# An error report file with more information is saved as:
# /localdisk2/afedotov/git/hdk/hs_err_pid2895537.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

pyhdk failing inside Jupyter notebook

When running inside a Jupyter notebook we get:

      [3] data_mgr = pyhdk.storage.DataMgr()
      [4] data_mgr.registerDataProvider(storage)
----> [6] calcite = pyhdk.sql.Calcite(storage)
      [7] executor = pyhdk.Executor(data_mgr)
      [9] import pyarrow

File _sql.pyx:36, in pyhdk._sql.Calcite.__cinit__()

RuntimeError: Couldn't initialize JVM.

Enable HDK on Windows

Enabling includes successful execution of all OmniSci and HDK tests and integration with Modin.

InsertOrderFragmenter depends on Catalog code

The method insertData depends on Catalog::getTableEpochs for error handling. This requires linking Catalog into Fragmenter, and Fragmenter is currently a dependency of data fetch in QueryEngine. We need to elevate the Catalog accesses to remove the dependency.

Running tests puts `${sys:MAPD_LOG_DIR}` directories in test folder

It appears that the environment variable MAPD_LOG_DIR set here https://github.com/intel-ai/omniscidb/blob/jit-engine/Calcite/CMakeLists.txt#L30 is not being picked up by the log4j properties file(s) https://github.com/intel-ai/omniscidb/blob/jit-engine/Calcite/java/calcite/src/main/resources/log4j2.properties.

To reproduce, build as normal, then enter build/Tests and run ArrowBasedExecuteTest --gtest_filter=Select.GroupBy (keeps the tests brief).

@vlad-penkin Ilya suggested you might have some ideas about how to debug?

Add Modin tests to HDK suite

Add a smoke/sanity test for Modin powered by HDK to the GitHub Actions tests in this repo. Will need to use the HDK branch in the Modin repository for now.
