hdk's Introduction

PROJECT NOT UNDER ACTIVE MANAGEMENT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: [email protected]

HDK - Heterogeneous Data Kernels

HDK is a low-level execution library for data analytics processing.

HDK is used as a fast execution backend in Modin. The HDK library provides a set of components for federating analytic queries to an execution backend based on OmniSciDB. Currently, HDK targets OLAP-style queries expressed as relational algebra or SQL. The APIs required for Modin support have been exposed in a library installed from this repository, pyhdk. Major and immediate project priorities include:

  • Introducing an HDK-specific IR and a set of optimizations to reduce reliance on RelAlg and improve the extensibility of the query API.
  • Supporting heterogeneous device execution, where a query is split across a set of hardware devices (e.g. CPU and GPU) for best performance. We have developed an initial cost model for heterogeneous execution.
  • Improving performance of the CPU backend on Modin-specific queries and current-generation data science workstations and servers by > 2x.

We are committed to supporting a baseline set of functionality on all x86 CPUs, later-generation NVIDIA GPUs (supporting CUDA 11+), and Intel GPUs. The x86 backend uses LLVM ORCJIT for x86 byte code generation. The NVIDIA backend uses NVPTX extensions in LLVM to generate PTX, which is JIT-compiled by the CUDA runtime compiler. The Intel GPU backend leverages the LLVM SPIR-V translator to produce SPIR-V. Device code is generated using the Intel Graphics Compiler (IGC) via the oneAPI L0 driver.

Components

Config

Config controls library-wide properties and must be passed to Executor and DataMgr. Default config objects should suffice for most installations. Instantiate a config first as part of library setup.
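A minimal setup sketch, assuming pyhdk exposes a buildConfig() helper that returns a default Config object (the helper name is an assumption, not taken from this README):

import pyhdk

# Build a default configuration; buildConfig() is assumed here to be the
# helper returning a Config with default settings. Check the installed
# pyhdk version for the exact name.
config = pyhdk.buildConfig()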

Storage

ArrowStorage is currently the default (and only available) HDK storage layer. ArrowStorage provides storage support for Apache Arrow format data. The storage layer must be explicitly initialized:

import pyhdk
storage = pyhdk.storage.ArrowStorage(1)

The parameter passed to the ArrowStorage constructor is the database ID, which allows storage instances to be kept logically separate.
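For example, two storage instances created with different database IDs hold independent sets of tables (a small sketch using only the constructor shown above):

# Tables imported into storage_a are not visible through storage_b.
storage_a = pyhdk.storage.ArrowStorage(1)
storage_b = pyhdk.storage.ArrowStorage(2)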

ArrowStorage automatically converts Arrow datatypes to OmniSciDB datatypes. Some variable-length types are not yet supported, but scalar types are available. pyarrow can be used to convert pandas DataFrames to Arrow:

import pandas
import pyarrow

at = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
)

The Arrow table can then be imported using the Arrow storage interface:

opt = pyhdk.storage.TableOptions(2)
storage.importArrowTable(at, "test", opt)
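The same interface can be used to register additional tables, for example a second integer-only DataFrame (a sketch reusing the calls above; the table name and columns are illustrative):

at2 = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "c": [100, 200, 300]})
)
storage.importArrowTable(at2, "test2", pyhdk.storage.TableOptions(2))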

Data Manager

The Data Manager controls the storage and in-memory buffer pools for all queries. Storage engines must be registered with the data manager:

data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)

Query Execution

Three high-level components are required to execute a query:

  1. Calcite: This is a wrapper around Apache Calcite handling SQL parsing and relational algebra optimization. Queries are first sent to Calcite for parsing and conversion to relational algebra. Depending on the query, some optimization of the relational algebra occurs in Calcite.
  2. RelAlgExecutor: Handles execution of a relational algebra tree. Only one should be created per query.
  3. Executor: The JIT compilation and query execution engine. Holds state which spans queries (e.g. code cache). Should be created as a singleton and re-used per query.

The complete flow is as follows:

calcite = pyhdk.sql.Calcite(storage)
executor = pyhdk.Executor(data_mgr)
ra = calcite.process("SELECT * FROM t;")
rel_alg_executor = pyhdk.sql.RelAlgExecutor(
    executor, storage, data_mgr, ra
)
res = rel_alg_executor.execute()

Calcite reads the schema information from storage, and the Executor stores a reference to Data Manager for buffer/storage access during a query.

The return from RelAlgExecutor is a ResultSet object which can be converted to Arrow and to pandas:

df = res.to_arrow().to_pandas()
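For debugging, the same executor can also produce a plan explanation instead of results; a short sketch mirroring the just_explain usage shown in the pyhdk example later on this page:

res = rel_alg_executor.execute(just_explain=True)
print(res.to_explain_str())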

Examples

Standalone examples are available in the examples directory. Most examples run via Jupyter notebooks.

Build

Dependencies

A Miniconda installation is required (Anaconda may produce build issues). Use one of the official Miniconda installers.

Conda environments are used for HDK development. Use the YAML file in omniscidb/scripts/:

conda env create -f omniscidb/scripts/mapd-deps-conda-dev-env.yml
conda activate omnisci-dev

Compilation

If using a Conda environment, run the following to build and install HDK:

mkdir build && cd build
cmake ..
make -j 
make install

By default GPU support is disabled.

To verify the installation, check that python -c 'import pyhdk' runs without an error.
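A slightly larger smoke test, a sketch using only the calls introduced above, also checks that the storage layer is constructible:

import pyhdk

# If the build and install succeeded, both lines should run without errors.
storage = pyhdk.storage.ArrowStorage(1)
print("pyhdk imported and ArrowStorage constructed")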

Compilation with Intel GPU support

Dependencies

Install extra dependencies into the existing environment:

conda install -c conda-forge level-zero-devel pkg-config

Compilation

mkdir build && cd build
cmake -DENABLE_L0=on ..
make -j
make install

Compilation with CUDA support

Dependencies

Install extra dependencies into an existing environment or a new one.

conda install -c conda-forge cudatoolkit-dev arrow-cpp-proc=3.0.0=cuda arrow-cpp=11.0=*cuda

Compilation

mkdir build && cd build
cmake -DENABLE_CUDA=on ..
make -j
make install

Issues

If you encounter issues during the build, refer to .github/workflows/build.yml. This file describes the compilation steps used for the CI build.

If you are still facing issues, please create a GitHub issue.

Test

Python tests can be run from the python source directory using pytest.

HDK interface tests

pytest python/tests/*.py 

Modin integration tests

pytest python/tests/modin

All pytests

pytest python/tests/ 

(Optional dependency) Modin installation

To install Modin into the conda environment, clone Modin and run:

cd modin && pip install -e .

Pytest logging

To enable logging, add the following call to the setup_class(..) body:

pyhdk.initLogger(debug_logs=True)
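For example (a sketch; the test class name is illustrative):

import pyhdk

class TestQueries:
    @classmethod
    def setup_class(cls):
        # Enable debug logging before any HDK objects are created.
        pyhdk.initLogger(debug_logs=True)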

Logs are by default located in the hdk_log/ folder.


hdk's Issues

Allow GPU execution

Add GPU manager initialization and options for controlling device selection at the user level.

Replicate jit-engine CI here

In preparation for repo merge, we should add the GitHub actions from the other repo here, to run under the omniscidb folder. We will need to copy the actions over and update the paths.

Allow building HDK from an arbitrary folder

Our current flow assumes the exact location of the build folder. This is a request to lift that restriction to allow something like this:

cd /my/build/folder
cmake /path/to/hdk

Heterogeneous execution fails on assert

There seems to be a flaw with recompilation when QueryMustRunOnCPU is thrown.
ArrowBasedExecutionTest fails:

2022-11-17T08:53:04.345347 F 2829187 0 0 RelAlgExecutor.cpp:622 Check failed: co.device_type == ExecutorDeviceType::GPU

Modin doesn't work with PyHDK with submodule updated to the latest jit-engine branch

If I update the omniscidb submodule to the latest jit-engine branch, I get this error when trying to parse RelAlg JSON queries in Calcite:

java.lang.NoClassDefFoundError: com/fasterxml/jackson/annotation/JsonIncludeProperties
        at com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector.findPropertyInclusionByName(JacksonAnnotationIntrospector.java:321) ~[calcite-1.0-SNAPSHOT-jar-with-dependencies.jar:?]

It looks like it is related to the latest change in the jackson-databind version used by Calcite. The problem can be reproduced using the ienkovich/config branch of HDK and the ienkovich/pyhdk-config branch of Modin.

Support extract/date time runtime in L0 backend

For Taxi Q3/Q4, we need to determine how to pull the Date/Time runtime into SPIRV. For CUDA, we compile the extension functions into a CUDA FatBinary at build time, then use the CUDA linker. We could follow a similar approach with SPIRV, or move the time extraction functions to the module and inline them during the JIT process. The downside to this could be increased module compile time (though there are some optimizations meant to keep such increases to a minimum), so we are considering building a benchmark to test.

Bringing in jit-engine causes compiler error with date/time runtime

Compiler error bringing in extract from time code:

heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp: In function 'int64_t ExtractFromTime(ExtractField, int64_t)':
heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp:156:1: error: inlining failed in call to always_inline 'int64_t extract_epoch(int64_t)': function body can be overwritten at link time
  156 | extract_epoch(const int64_t timeval) {
      | ^~~~~~~~~~~~~
heterogeneous-data-kernels/omniscidb/QueryEngine/ExtractFromTime.cpp:270:27: note: called from here
  270 |       return extract_epoch(timeval);
      |              ~~~~~~~~~~~~~^~~~~~~~~

Support bringing jit-engine branch in as module

Functionality required:

  • QueryEngine
  • Analyzer
  • DataMgr / ArrowStorage
  • Possibly Calcite, though initially we can directly generate Analyzer nodes
  • ArrowStorageExecuteTest

Functionality not required/desired:

  • Parser/ParserNode
  • Catalog
  • Thrift/DBHandler

Initial attempts have failed due to linking problems, but we can try again once https://github.com/intel-ai/omniscidb/pull/332 lands.

Also requires:

  • document endpoints exposed for integration
  • support build and minimal test w/ CI to prevent regressions (on either side)

JVM Initialization preventing back-to-back test runs

Running pytest from the hdk tests directory causes a crash on the second test.

Specifically, this line is failing:

if (JNI_CreateJavaVM(&jvm, (void**)&env, &vm_args) != JNI_OK) {
  LOG(FATAL) << "Couldn't initialize JVM.";
}

Because the logger is no longer available at that point, we fail to log the message and abort.

The problem appears to be the JNI context trying to initialize the JVM twice.

Full backtrace:

Thread 1 "python3" received signal SIGABRT, Aborted.
0x00007ffff7c8e36c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7c8e36c in ?? () from /usr/lib/libc.so.6
#1 0x00007ffff7c3e838 in raise () from /usr/lib/libc.so.6
#2 0x00007ffff7c28535 in abort () from /usr/lib/libc.so.6
#3 0x00007fff29d4cac0 in logger::Logger::~Logger (this=0x7fffffff6110, __in_chrg=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Logger/Logger.cpp:459
#4 0x00007fff29959551 in (anonymous namespace)::JVM::createJVM (max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:144
#5 (anonymous namespace)::JVM::getInstance (max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:88
#6 CalciteJNI::Impl::Impl (this=0x5555564e1060, schema_provider=..., udf_filename=..., calcite_max_mem_mb=<optimized out>)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:171
#7 0x00007fff2995ab13 in std::make_unique<CalciteJNI::Impl, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long&> ()
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/unique_ptr.h:857
#8 CalciteJNI::CalciteJNI (this=0x555556a04190, schema_provider=..., udf_filename=..., calcite_max_mem_mb=1024)
at /home/alexb/Projects/hdk/omniscidb/Calcite/CalciteJNI.cpp:602
#9 0x00007fff2997906e in __gnu_cxx::new_allocator<CalciteJNI>::construct<CalciteJNI, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (this=<optimized out>, __p=0x555556a04190)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/new_allocator.h:146
#10 std::allocator_traits<std::allocator<CalciteJNI> >::construct<CalciteJNI, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=..., __p=0x555556a04190)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/alloc_traits.h:483
#11 std::_Sp_counted_ptr_inplace<CalciteJNI, std::allocator<CalciteJNI>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (
__a=..., this=0x555556a04180)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:548
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<CalciteJNI, std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=...,
__p=<optimized out>, this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:679
#13 std::__shared_ptr<CalciteJNI, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__tag=...,
this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:1344
#14 std::shared_ptr<CalciteJNI>::shared_ptr<std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__tag=..., this=<optimized out>)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:359
#15 std::allocate_shared<CalciteJNI, std::allocator<CalciteJNI>, std::shared_ptr<SchemaProvider>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, unsigned long&> (__a=...)
at /home/alexb/.conda/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:702

Allow building the engine with L0 support in conda environment

Since the default L0 driver location is in the system libraries, there's an issue when building the jit engine. CMake's find_package looks for the headers and finds them in /usr/include, but CMake does not add that directory to the include paths even if target_include_directories is set explicitly to include the system paths (see https://gitlab.kitware.com/cmake/cmake/-/issues/17966 for details). Providing hints/paths to find_package breaks the linking process for other libraries due to conflicts.
There is also currently no conda package that we could use to avoid the system includes/libraries. We need to either build a package or create a workaround for building jit-engine with L0 under a conda env.

Executor holds dangling reference to data mgr

In the unit tests, we delete storage between each test. This deletes DataMgr, but the Executor ends up with a pointer to the old DataMgr. This results in a segfault the next time an Executor is created, because the Python Executor class calls getExecutor which pulls from the Executor pool.
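A rough reproduction sketch based on the description above, using only the Python API shown earlier on this page (the actual unit tests tear storage down differently):

storage = pyhdk.storage.ArrowStorage(1)
data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)
executor = pyhdk.Executor(data_mgr)

# The test teardown drops storage and its DataMgr...
del storage, data_mgr

# ...but the pooled Executor still holds a pointer to the deleted DataMgr,
# so the next Executor construction (which pulls from the pool) can segfault.
executor = pyhdk.Executor(pyhdk.storage.DataMgr())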

Add working pyhdk example -- to readme?

e.g.:

import pyhdk
storage = pyhdk.storage.ArrowStorage(1) # 1 is schema id
data_mgr = pyhdk.storage.DataMgr()
data_mgr.registerDataProvider(storage)

calcite = pyhdk.sql.Calcite(storage)
executor = pyhdk.Executor(data_mgr)

import pyarrow
import pandas
at = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
)
opt = pyhdk.storage.TableOptions(2)
storage.importArrowTable(at, "test", opt)

sql = "SELECT * FROM test;"
ra = calcite.process(sql)
rel_alg_executor = pyhdk.sql.RelAlgExecutor(executor, storage, data_mgr, ra)
print(rel_alg_executor.execute().to_arrow().to_pandas())

print(rel_alg_executor.execute(just_explain = True).to_explain_str())

create manylinux2014_x86_64 build

Status:

The manylinux2014_x86_64 container does not work because of an outdated repo URL. The container from cibuildwheel cannot be used because it does not have sudo.

It seems people build their own containers for their builds and check them using auditwheel.

Fix import of Arrow table with time32[s] data

Currently, ArrowStorage fails to import such data due to improper schema checks, but other issues might also exist in the actual data import.

C++ exception with description "Mismatched type for column col4: timestamp[s] vs. time32[s]" thrown in the test body.

Flaky Select.FilterAndSimpleAggregation test

After CUDA tests were introduced to CI, we saw failures of the Select.FilterAndSimpleAggregation test. The failure is flaky, but if it fails, it always fails in the same way:

Expected equality of these values:
  20
  v<int64_t>( run_simple_agg("SELECT COUNT(*) FROM test WHERE MOD(x, 7) <> 7;", dt))
    Which is: 22

I found out that it's enough to leave only this particular query in the test to reproduce the failure. The query is supposed to return the number of rows but somehow returns a greater value. The input table has 10 fragments, 2 rows each.

I dumped generated IR module and all data copied to a CUDA device. Dumps are the same for good and bad runs. It looks like we run the same code on the same data but get different results.

I was able to reproduce it on an August 30 version of the jit-engine branch, so the problem is not new. I don't know when it was introduced.

pyhdk from conda-forge segfaults

Here is the scenario:

conda env remove -n omnisci-dev
conda env update -f omniscidb/scripts/mapd-deps-conda-dev-env.yml
git clone https://github.com/intel-ai/modin.git
git checkout ienkovich/pyhdk
conda activate omnisci-dev
mamba install -c conda-forge pyhdk
cd modin/
pip install -e .
cd ..
python python/tests/modin/modin_smoke_test.py

Here is the error:

UserWarning: Distributing <class 'list'> object. This may take some time.
FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
0    12
Name: a, dtype: int64
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2bcc5c3b85, pid=2895537, tid=2895537
#
# JRE version: OpenJDK Runtime Environment (11.0.15) (build 11.0.15-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.15-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjimage.so+0x2b85]  ImageStrings::find(Endian*, char const*, int*, unsigned int)+0x65
#
# Core dump will be written. Default location: /localdisk2/afedotov/git/hdk/core
#
# An error report file with more information is saved as:
# /localdisk2/afedotov/git/hdk/hs_err_pid2895537.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

pyhdk failing inside Jupyter notebook

When running inside a Jupyter notebook we get:

      [3] data_mgr = pyhdk.storage.DataMgr()
      [4] data_mgr.registerDataProvider(storage)
----> [6] calcite = pyhdk.sql.Calcite(storage)
      [7] executor = pyhdk.Executor(data_mgr)
      [9] import pyarrow

File _sql.pyx:36, in pyhdk._sql.Calcite.__cinit__()

RuntimeError: Couldn't initialize JVM.

Enable HDK on Windows

Enabling includes successful execution of all OmniSci and HDK tests and integration with Modin.

InsertOrderFragmenter depends on Catalog code

The method insertData depends on Catalog::getTableEpochs for error handling. This requires linking Catalog into Fragmenter, and Fragmenter is currently a dependency of data fetch in QueryEngine. We need to elevate the Catalog accesses to remove the dependency.

Running tests puts `${sys:MAPD_LOG_DIR}` directories in test folder

It appears that the environment variable MAPD_LOG_DIR set here https://github.com/intel-ai/omniscidb/blob/jit-engine/Calcite/CMakeLists.txt#L30 is not being picked up by the log4j properties file(s) https://github.com/intel-ai/omniscidb/blob/jit-engine/Calcite/java/calcite/src/main/resources/log4j2.properties.

To reproduce, build as normal, then enter build/Tests and run ArrowBasedExecuteTest --gtest_filter=Select.GroupBy (keeps the tests brief).

@vlad-penkin Ilya suggested you might have some ideas about how to debug?

Add Modin tests to HDK suite

Add a smoke/sanity test for Modin powered by HDK to the GitHub Actions tests in this repo. Will need to use the HDK branch in the Modin repository for now.
