
Comments (18)

antonmks commented on September 2, 2024

Do you have string lengths of 50 and 100 defined in strings.cu? And the UNROLL count?

Anton

from alenka.

AlexeyAB commented on September 2, 2024

Hi Anton.
Yes, I have strings with lengths 50 and 100.
Sorry about the second example - the mistake in the number of fields was mine.

In the first example, the bug happens in the following simple case.

Take the simplest file e.txt with 4 lines:
abc
abc
abc
abc

Run this SQL:

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));

EF := FILTER E BY event_type == "abc";

STORE EF INTO 'res.txt' USING ('|') LIMIT 10;

This throws an exception.

And this SQL works:

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));

STORE E INTO 'res.txt' USING ('|') LIMIT 10;

But res.txt contains these 4 lines:
|
|
|
|

Additionally, in the TPC-H file q2m.sql you can change this line:

RF := FILTER R BY r_name == "EUROPE";

to this:

RF := FILTER R BY r_name == "EUROPE.";

or to this:

RF := FILTER R BY r_name == "EUROP";

and you get an exception in the JOIN: J_PS := SELECT ...
This happens for both the 10GB and 100GB TPC-H datasets.

Alexey.


antonmks commented on September 2, 2024

This seems to be caused by a bug in the join operation - it incorrectly handles empty datasets.
I fixed it, so q2m.sql works now.

Currently, Alenka doesn't work directly with text files - the only thing you can do with a text file is load it into a binary file.
So please run the SQL commands against binary data files. It wouldn't be difficult to support text files, but it doesn't make sense from a performance standpoint, because execution would always be bound by text-file parsing.


AlexeyAB commented on September 2, 2024

Thanks for the fix.
You are right, working directly with text files is not a priority feature.
But currently I get an empty result even when working with binary files.
For example, again we have the simplest file e.txt with 4 lines:

abc
abc
abc
abc

Run from the command line:

AlenkaDB.exe load.sql
AlenkaDB.exe filter.sql

load.sql

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));
STORE E INTO 'e' BINARY;

filter.sql

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
EF := FILTER E BY event_type == "abc";
STORE EF INTO 'res.txt' USING ('|') LIMIT 10;

res.txt is empty (file size = 0).

Now if I change filter.sql to this:

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
STORE E INTO 'res.txt' USING ('|') LIMIT 10;

res.txt now contains:

|
|
|
|


antonmks commented on September 2, 2024

Fixed.
Let me know if it is not working for you.

Regards, Anton


AlexeyAB commented on September 2, 2024

Thanks, it's working.
But now I cannot save intermediate data in binary form.
Take the previous example, but at the end store the result in BINARY instead of pipe-delimited text.
Again we have the simplest file e.txt with 4 lines:

abc
abc
abc
abc

Run from the command line:

AlenkaDB.exe load.sql
AlenkaDB.exe filter.sql

load.sql

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));
STORE E INTO 'e' BINARY;

filter.sql

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
EF := FILTER E BY event_type == "abc";
STORE EF INTO 'ef' BINARY;

I get these messages:

Process count = 6200000
BINARY LOAD: E e
Reading 4 records
FILTER EF E 583483392
MAP CHECK segment 0 R
filter is finished 4 582434816
filter time 0.798 583483392
STORE: EF ef
LOADING
SQL scan parse worked
cycle time 0.802

But no files are created and the EF table is not saved.

Why is this critical for me? When working with real data, I'm trying to get around the lack of memory by using intermediate temporary binary tables, because in 90% of my test queries the program crashes with an exception.

Best regards,
Alexey


antonmks commented on September 2, 2024

This part is missing right now. I understand that it is very important to be able to store results as binary data, so I'll get to it as soon as possible.
Basically, my schedule right now is as follows:
1. Change the join implementation to sorted merge - it is much more convenient for me, and there is no need to use CUDPP.
2. Add support for nulls.
3. Add support for storing binary results.
4. Fix a lot of other bugs.
5. Add insert/update/delete operation support.
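
For reference, the sorted-merge approach in item 1 can be sketched like this (a generic Python illustration of an inner sort-merge join on (key, value) pairs; this is not Alenka's actual implementation):

```python
def sort_merge_join(left, right):
    # Sort both inputs on the join key, then advance two cursors.
    left = sorted(left)
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs of equal keys.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0  # rewind for the next duplicate on the left
    return out
```

The appeal over a hash-based join is that both phases work on sequential runs, which partition naturally into pieces that fit in GPU memory.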

Best regards,

Anton


AlexeyAB commented on September 2, 2024

OK. But I still have a big request: if you can, please solve the problem with the lack of GPU RAM.
I cannot use AlenkaDB on any real queries for that reason :)

If you are going to use the sort-merge join, and if you move sorting for all data types into separate files, as is already done for strings (strings_sort_host.cu & strings_sort_device.cu), then I could implement hybrid sorting: sort parts on the GPU while, in parallel, merging them (std::inplace_merge / thrust::merge) on the CPU. The lack of GPU RAM makes this necessary, and it would solve both the shortage of GPU RAM and the slowness of the CPU.
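
The hybrid scheme could be sketched as follows (a minimal CPU-only Python illustration: sorted() stands in for a GPU sort kernel and heapq.merge for the CPU-side merge; no actual GPU work or parallelism is shown):

```python
import heapq

def hybrid_sort(data, chunk_size):
    # Phase 1: sort fixed-size chunks. On real hardware each chunk
    # would be sorted on the GPU, sized to fit in GPU RAM.
    runs = [sorted(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]
    # Phase 2: k-way merge of the sorted runs. This is the part that
    # would run on the CPU, overlapping with the sorting of later chunks.
    return list(heapq.merge(*runs))
```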

The same could be done for GROUP BY and JOIN, for all types, on the GPU or hybrid (CPU/GPU). As I understand it, at the moment only string sorting can be performed on the host - or can GROUP BY and JOIN also run on the host for all types?

Are you going to take a break from working on the project over the summer?


antonmks commented on September 2, 2024

GPU memory limits shouldn't be an issue, because all processing is done on pieces small enough to fit in GPU memory. It is just that I have 3 GB of GPU memory, and the partitioning logic is probably off somewhere, so it fails on GPUs with less memory.
Can you give me a couple of queries that fail specifically because the GPU runs out of memory?
I would try to run them and see what I can do.
Oh, by the way, all types can be sorted on the host, not just strings.

Regards, Anton


AlexeyAB commented on September 2, 2024

Here is why I think the problem is the lack of GPU RAM.
I watch GPU-Z 0.5.5 under Sensors -> Memory Used. GPU-Z can be downloaded for free from the Internet.
For example, if I have many tabs open in Chrome, MS Outlook running, and other programs, it displays Memory Used: 450 MB of GPU RAM - and q1.sql on TPC-H 10GB fails with an error on D := SELECT ... GROUP BY. At that point 550 MB of GPU RAM and 4 GB of CPU RAM are free.
If I close some programs, Memory Used drops to 193 MB, and the same exception occurs.
Only when I close all programs and Memory Used is 129 MB does the query run well. At that point 871 MB of GPU RAM and 5.5 GB of CPU RAM are free.

GPU-Z also shows a graph of used GPU RAM (in a small window), where peaks of 100% Memory Used are visible.

Of course, on a real server the GPU Memory Used would be near 0. But if this problem occurs now at a small data volume, it will occur there at a slightly larger volume.

Unfortunately, for my real data even Memory Used of 129 MB (871 MB free) is not enough. And I cannot send the data, because it is real customer financial data. But the problem I described is reproduced on q1.sql of TPC-H 10GB, as above.

Regards, Alexey


antonmks commented on September 2, 2024

Well, I can suggest creating a database with smaller segments:
alenka.exe -l 3000000 load_lineitems.sql
This will create data files with 3,000,000 records per segment, as opposed to the default of 6,000,000.
You can play with this parameter and see what the optimal segment size is for your GPU.

Regards, Anton


AlexeyAB commented on September 2, 2024

Thanks. This helped in the case of TPC-H lineitem, but did not help with my real data.
Best regards,
Alexey


AlexeyAB commented on September 2, 2024

Some additional questions.
Does the -l option affect only loading data from a text file into a binary file, or other cases too?
How is data stored in RAM after, for example, a JOIN of tables with different segment sizes?
And when that result is grouped, what size are the pieces processed in GPU RAM?

Best regards,
Alexey


antonmks commented on September 2, 2024

-l affects only loading data from a text file into a binary file.
After a join, the data is stored in host memory, uncompressed, in one big piece.
When grouping, Alenka estimates how much GPU memory it needs to aggregate the data, divides the source data into pieces, and then processes every piece. The number of pieces is calculated in the setSegments function in cm.cu. You can change the current value to a bigger one if you think this might be the problem.
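
As a rough illustration of that piece calculation (a hypothetical Python sketch of the idea, not the actual setSegments code from cm.cu):

```python
import math

def estimate_pieces(source_bytes, free_gpu_bytes):
    # Split the source data into enough pieces that each piece fits in
    # the available GPU memory; always at least one piece.
    return max(1, math.ceil(source_bytes / free_gpu_bytes))

# e.g. ~10 GB of uncompressed join output on a card with ~900 MB free
pieces = estimate_pieces(10 * 1024**3, 900 * 1024**2)
```

If the estimate of needed memory is too low, a piece overflows the GPU, which would match the out-of-memory failures described above; hard-coding a larger piece count is the workaround Anton suggests.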

Best regards, Anton


AlexeyAB commented on September 2, 2024

Hi Anton!
I sent an email to [email protected] with an example script on which the error occurs during grouping.

The -l parameter does not solve the problem. Also, if it is set below 500000, an error occurs even when loading data from a text file into a binary file.

Best regards,
Alexey


antonmks commented on September 2, 2024

I didn't receive the email ...

Regards, Anton


AlexeyAB commented on September 2, 2024

Ok. I re-sent it.
Best regards,
Alexey


antonmks commented on September 2, 2024

I sent you an email.

Best regards, Anton

