Comments (15)
@kgpai agreed, and sounds good.
I propose we write up a new issue on the larger topic of what to do with SQL functions in Prestissimo. Happy to take a stab at that.
from presto.
Agreed that this seems like a feature request to make testing easier, which is fine.
That being said, it seems strange that by disabling function inlining, we prefer an alternative implementation. Was this function implemented in Velox because it's considered a part of the core "canon" of Presto functions? I wonder if it makes sense to differentiate core functions, typically coded in Java, vs. convenience SQL defined functions by putting them into a separate namespace, for example, presto.helpers.map_top_n. Otherwise, we need a plan for how to enable custom user-defined SQL functions (where we surely want to inline these).
from presto.
Why? I don't think we need to do that. We can say it's non-deterministic when there are ties. Also the tie breaking may be accidental. IIRC the SQL UDF does not do that.
from presto.
IOW, there is a cost to introducing more and more built-in functions. It increases our API footprint and the total maintenance size of Presto functions. It also apparently means we'll need to eventually create a dedicated C++ implementation for each of them.
If the goal of many of these functions is to provide convenience shorthand for things that can already be accomplished through other functions, perhaps we could package them as a plugin that one opts in to, so as to reduce this footprint. I imagine it would be a non-goal to reimplement each of these helper functions in Velox.
from presto.
cc: @rschlussel @mbasmanova @kagamiori
from presto.
> Also the tie breaking may be accidental.
@kaikalur Can you explain what you mean by accidental here?
> Why?
Making it deterministic helps with verification, and it also makes the behavior consistent: keys are compared whenever values are equal.
from presto.
@kgpai it seems this inconsistency only occurs when inline-sql-functions is set to false, is that correct?
from presto.
> @kgpai it seems this inconsistency only occurs when inline-sql-functions is set to false, is that correct?
Hmm, that means the native implementation is different from the SQL function. Like I said on many occasions, what we have in current Presto (Java) is the behavior we should try to conform to. IMO there are no correctness "bugs" in the Java version lol
from presto.
@kaikalur Sreeni, here is the definition of map_top_n in Java. Notice that the logic has 2 steps: (1) take the entries with non-null values, sort them by value and break ties using keys; (2) append the entries with null values without any sorting or tie breaking.
This makes the results deterministic for non-null values, but non-deterministic for null values. Hope this makes sense.
@SqlInvokedScalarFunction(value = "map_top_n", deterministic = true, calledOnNullInput = true)
@Description("Truncates map items. Keeps only the top N elements by value.")
@TypeParameter("K")
@TypeParameter("V")
@SqlParameters({@SqlParameter(name = "input", type = "map(K, V)"), @SqlParameter(name = "n", type = "bigint")})
@SqlType("map(K, V)")
public static String mapTopN()
{
return "RETURN IF(n < 0, fail('n must be greater than or equal to 0'), map_from_entries(slice(array_sort(map_entries(map_filter(input, (k, v) -> v is not null)), (x, y) -> IF(x[2] < y[2], 1, IF(x[2] = y[2], IF(x[1] < y[1], 1, -1), -1))) || map_entries(map_filter(input, (k, v) -> v is null)), 1, n)))";
}
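To make the two steps concrete, here is a minimal Java sketch that mimics the SQL body above on an in-memory map. It is an illustration only, not Presto or Velox code: the names `MapTopNSketch` and `mapTopN` are hypothetical, and the sketch assumes that the iteration order of the input map stands in for whatever order `map_entries` returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapTopNSketch {
    // Hypothetical helper mirroring the SQL body of map_top_n; not Presto code.
    public static <K extends Comparable<K>, V extends Comparable<V>> LinkedHashMap<K, V> mapTopN(
            Map<K, V> input, long n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be greater than or equal to 0");
        }
        List<Map.Entry<K, V>> nonNull = new ArrayList<>();
        List<Map.Entry<K, V>> nulls = new ArrayList<>();
        for (Map.Entry<K, V> entry : input.entrySet()) {
            if (entry.getValue() != null) {
                nonNull.add(entry);
            } else {
                nulls.add(entry);
            }
        }
        // Step 1: descending by value, ties broken by descending key
        // (same ordering as the array_sort comparator in the SQL body).
        nonNull.sort((x, y) -> {
            int byValue = y.getValue().compareTo(x.getValue());
            return byValue != 0 ? byValue : y.getKey().compareTo(x.getKey());
        });
        // Step 2: null-valued entries are appended in whatever order the map yields them.
        nonNull.addAll(nulls);
        // Equivalent of slice(..., 1, n): keep the first n entries.
        LinkedHashMap<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : nonNull) {
            if (result.size() >= n) {
                break;
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

With the input {a=1, b=1, c=null, d=3} and n = 2, the sketch keeps d and then b: the tie between a and b is resolved by key. If n reaches into the null-valued entries, the outcome depends on the order in which the map happens to iterate them.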
from presto.
I don't think producing inconsistent results here is a bug, even if it's only for null values. But the non-determinism is a nuisance for testing. Making the results deterministic would enable us to do better correctness verification with less manual work, so I think making this change could still be worthwhile.
(also, fwiw there are definitely correctness bugs in the Java version #22040)
from presto.
SQL functions in Presto are quite inefficient. For example, array_sum is a very simple function that's implemented using 'reduce' lambda, which makes it 20x slower on small arrays, 40x slower on medium size arrays, 270x slower on large arrays than a straightforward "native" implementation.
[Screenshot 2024-05-20 at 9 42 30 PM]
array_average is similarly inefficient.
array_duplicates and array_has_duplicates are implemented inefficiently as well.
normalize was recently optimized from terribly inefficient to merely inefficient: #22211
map_top_n sorts the entire map (could be hundreds of entries) even if only the top 5 entries are needed. This is inefficient in both CPU time and memory usage.
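As a rough illustration of the map_top_n point, a top-N selection can keep a bounded min-heap of size n instead of sorting every entry, which costs O(m log n) for m entries rather than O(m log m). The Java sketch below is an assumption-laden example, not the Velox or Presto implementation: `BoundedTopNSketch` and `topN` are hypothetical names, values are fixed to `Long` for brevity, n is assumed to be non-negative, and null values are simply skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class BoundedTopNSketch {
    // Hypothetical helper: returns the n entries with the largest values,
    // strongest first, with ties broken by descending key.
    public static <K extends Comparable<K>> List<Map.Entry<K, Long>> topN(Map<K, Long> input, int n) {
        // Orders entries from weakest to strongest (ascending value, then ascending key).
        Comparator<Map.Entry<K, Long>> weakestFirst = (x, y) -> {
            int byValue = x.getValue().compareTo(y.getValue());
            return byValue != 0 ? byValue : x.getKey().compareTo(y.getKey());
        };
        // Min-heap: the head is always the weakest entry currently retained.
        PriorityQueue<Map.Entry<K, Long>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<K, Long> entry : input.entrySet()) {
            if (entry.getValue() == null) {
                continue; // null handling is out of scope for this sketch
            }
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest entry so the heap never exceeds n
            }
        }
        // Only the n survivors are sorted, not the whole map.
        List<Map.Entry<K, Long>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }
}
```

The heap never holds more than n + 1 entries, so memory stays proportional to n rather than to the size of the map.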
Currently, there is no way to selectively disable SQL inlining for a subset of functions; hence, we are forced to disable inlining for ALL SQL functions in Prestissimo. Perhaps a better way would be to allow selectively disabling SQL inlining. Alternatively, we may want to review existing functions and decide to implement them "natively" (both in Java and C++), as it is hard to justify these inefficiencies for either stack.
Finally, 'reduce' lambda in Presto is defined as a data-dependent loop over all elements of an array. This makes it practically impossible to implement it efficiently in a vectorized engine. It might be helpful to not allow 'reduce' in the implementation of the SQL functions.
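A small Java model may help show why this shape is hard to vectorize. The sketch below is an assumption, not Presto code: `ReduceSketch` and its methods are hypothetical, and the four-argument `reduce` only approximates the shape of Presto's reduce(array, initialState, inputFunction, outputFunction). Each iteration needs the state produced by the previous one, so the elements cannot be processed independently, whereas a direct sum is a plain loop over primitives.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

public class ReduceSketch {
    // Hypothetical model of reduce(array, initialState, inputFunction, outputFunction).
    public static <T, S, R> R reduce(
            List<T> array,
            S initialState,
            BiFunction<S, T, S> inputFunction,
            Function<S, R> outputFunction) {
        S state = initialState;
        for (T element : array) {
            // Each step consumes the state produced by the previous step.
            state = inputFunction.apply(state, element);
        }
        return outputFunction.apply(state);
    }

    // Sum expressed through the generic reduce: one lambda call per element.
    public static long sumViaReduce(List<Long> values) {
        return ReduceSketch.<Long, Long, Long>reduce(values, 0L, (sum, x) -> sum + x, sum -> sum);
    }

    // Direct sum over a primitive array: no lambda calls and no boxed state.
    public static long sumDirect(long[] values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }
}
```

Both methods return the same total; the difference is only in how the work is expressed. The sketch makes no claim about the actual speedups, which are the ones reported in the comment above.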
from presto.
That's a different issue from not allowing them. These are things that DEs wrote for convenience. We have to support SQL functions regardless; we can't implement all of them in C++ (or Java). And they are not widely used, or not used in realtime/ad hoc queries. When they are good enough for users, we should be OK to keep them as they are. ARRAY_SUM/AVG etc. are done using REDUCE, which we know is bad in native, so we should fix that root cause. We can't say "SQL functions are inefficient".
Also, microbenchmarks are not reflective of the big picture. If this is good enough for batch queries, we are good for now, and we should keep improving SQL functions.
from presto.
> Finally, 'reduce' lambda in Presto is defined as a data-dependent loop over all elements of an array. This makes it practically impossible to implement it efficiently in a vectorized engine. It might be helpful to not allow 'reduce' in the implementation of the SQL functions.
This IMO is unacceptable. I also remember giving ideas on improving reduce but they were not acted on. Let's try those things. But in any case, we can't handicap users like this. Anyone can use reduce anywhere so we should fix it.
from presto.
Note: map_top_n SQL implementation in Presto uses array_sort lambda function which cannot be translated to a 'transform' and therefore is not supported in Velox: https://velox-lib.io/blog/array-sort
from presto.
I don't think it's harmful to fix the non-determinism by breaking ties among NULL values by comparing keys. It can't possibly break any existing user behavior. If there are no major concerns, I can submit a PR fixing this.
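A sketch of what the proposed behavior could look like, written in Java for illustration only. The names `DeterministicMapTopNSketch` and `mapTopN` are hypothetical, and the direction of the tie break among NULL values (descending key, matching the existing tie break for non-null values) is an assumption, since the proposal above does not specify it. Keys are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeterministicMapTopNSketch {
    // Hypothetical variant of map_top_n in which NULL-valued entries are also ordered by key.
    public static <K extends Comparable<K>, V extends Comparable<V>> LinkedHashMap<K, V> mapTopN(
            Map<K, V> input, long n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be greater than or equal to 0");
        }
        Comparator<Map.Entry<K, V>> order = (x, y) -> {
            V xValue = x.getValue();
            V yValue = y.getValue();
            if (xValue == null || yValue == null) {
                if (xValue != null) {
                    return -1; // non-null values sort before NULL values
                }
                if (yValue != null) {
                    return 1;
                }
                // Both values are NULL: break the tie by key (assumed descending).
                return y.getKey().compareTo(x.getKey());
            }
            int byValue = yValue.compareTo(xValue);
            return byValue != 0 ? byValue : y.getKey().compareTo(x.getKey());
        };
        List<Map.Entry<K, V>> entries = new ArrayList<>(input.entrySet());
        entries.sort(order);
        LinkedHashMap<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : entries) {
            if (result.size() >= n) {
                break;
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Because every pair of entries now has a defined order, two maps with the same contents yield the same result regardless of the order in which their entries are iterated.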
from presto.