Hello, I started looking into this and wanted to share my current thought process on possible approaches.
Calls to getColumnMask
Not including here calls from tests and calls from System/CatalogAccessControl
implementations that mostly just delegate (I didn't spend a lot of time on these, please correct me if there is more functionality to consider).
io.trino.sql.analyzer.StatementAnalyzer.Visitor#visitInsert
io.trino.sql.analyzer.StatementAnalyzer.Visitor#visitDelete
io.trino.sql.analyzer.StatementAnalyzer.Visitor#visitUpdate
io.trino.sql.analyzer.StatementAnalyzer.Visitor#visitMerge
io.trino.sql.analyzer.StatementAnalyzer.Visitor#visitTableExecute
io.trino.sql.analyzer.StatementAnalyzer.Visitor#analyzeFiltersAndMasks
All of these iterate over the columns in a table and call getColumnMask on each column.
visitInsert/Delete/Update/Merge/TableExecute have pretty much the same code block:
for (ColumnMetadata tableColumn : tableMetadata.getColumns()) {
    if (accessControl.getColumnMask(session.toSecurityContext(), tableName, tableColumn.getName(), tableColumn.getType()).isPresent()) {
        throw semanticException(NOT_SUPPORTED, node, "ALTER TABLE EXECUTE is not supported for table with column masks");
    }
}
analyzeFiltersAndMasks actually uses the returned column masks.
Bulk getColumnMask?
As discussed in the Slack thread, combining the fetching of all column masks needed for a query into a single call is somewhere between hard and practically impossible. I think we can still get a good performance gain by fetching the masks for all the columns of a given table in one pass.
The signature of the bulk function could look like the following (this isn't the best approach, but it gets the point across and it's pretty easy to add). The default implementation would delegate to getColumnMask.
List<Optional<ViewExpression>> getColumnMasks(SystemSecurityContext context, CatalogSchemaTableName tableName, List<Pair<String, Type>> columns);
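A minimal sketch of what that default delegation could look like. The types here (`ColumnRef`, `MaskSource`) are simplified stand-ins invented for illustration, with masks reduced to plain strings; they are not Trino's SPI types, and the real signature would use `SystemSecurityContext`, `CatalogSchemaTableName`, and `ViewExpression` as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a (column name, type) pair; not a Trino class.
record ColumnRef(String name, String type) {}

// Hypothetical, simplified access-control interface; masks are plain strings here.
interface MaskSource {
    // Existing per-column lookup.
    Optional<String> getColumnMask(String tableName, ColumnRef column);

    // Bulk variant: by default it delegates to the per-column method and
    // returns results in the same order as the input columns.
    default List<Optional<String>> getColumnMasks(String tableName, List<ColumnRef> columns) {
        List<Optional<String>> masks = new ArrayList<>(columns.size());
        for (ColumnRef column : columns) {
            masks.add(getColumnMask(tableName, column));
        }
        return masks;
    }
}
```

With a default like this, existing implementations would keep working unchanged, and only plugins that can do better (for example by batching remote calls) would need to override the bulk method.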
OPA access control plugin
This discussion began in the context of the OPA access control plugin, which makes an HTTP call for every getColumnMask function call.
If the plugin received a bulk column masks call through the API, there are two main ways of resolving it:
- send one HTTP request to the OPA server for each column, but in parallel
  - pros: no changes to policy code, no changes to Trino configs needed to benefit from the speedup
  - cons: high network traffic, limited by the number of connections the HTTP client of the OPA access control plugin can have open at one time. We can assume that the OPA server is running as a sidecar to the Trino coordinator (i.e. on the same machine), which might make this manageable
- use a "bulk column masks" endpoint on the OPA server, similar in concept to opa.policy.batched-uri
  - pros: low network traffic, lower communication costs, fewer things that can break
  - cons: additional logic needs to be implemented in Rego; Trino configuration needs to change
I don't know which of these is better.
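To make the first option concrete, here is a rough sketch of fanning out one lookup per column and collecting the results in order. `ParallelMaskFetcher` and the `fetchOne` function are hypothetical names; `fetchOne` stands in for "one HTTP request to OPA for one column", and the executor plays the role of the HTTP client's bounded thread pool. This is not the plugin's actual code.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
final class ParallelMaskFetcher {
    private ParallelMaskFetcher() {}

    // fetchOne is an assumed stand-in for a single-column request to OPA.
    // The executor bounds how many lookups are in flight at once.
    static List<Optional<String>> fetchAll(
            List<String> columns,
            Function<String, Optional<String>> fetchOne,
            ExecutorService executor) {
        // Start every lookup first so they can overlap.
        List<CompletableFuture<Optional<String>>> futures = columns.stream()
                .map(column -> CompletableFuture.supplyAsync(() -> fetchOne.apply(column), executor))
                .toList();
        // Then wait for each result in input order.
        // join() rethrows a failed lookup as a CompletionException.
        return futures.stream()
                .map(CompletableFuture::join)
                .toList();
    }
}
```

The policy side stays untouched in this variant; the trade-off is that the number of requests still grows linearly with the number of columns, only their latencies overlap.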
Next steps
I will look into gathering some data on the performance of the current implementation and one (or both if I have time) of these "solutions".
Would love to hear the thoughts of the community on this!
from trino.
+1 to the bulk getColumnMask. Even if we cannot add a fully bulk mode, in that we'll still get several calls for a given query, moving from one call per column to one call per table is a considerable improvement.
As for OPA, I don't have a strong opinion. However, I feel that ideally OPA would be co-located with Trino deployments and as such we can worry less about network overhead and focus more on ensuring policies remain easy to implement.
Each possible type of request that OPA needs to support adds complexity to the policies. Policy implementers currently need to consider 4 possible types of requests:
- GetColumnMask: returning a single mask
- GetRowFilters: returning a list of masks
- Standard operations: returning a boolean allow/disallow field
- (If enabled) Batch operations (e.g., FilterCatalogs): returning a list of indices denoting which elements from the request should be allowed
If we add a bulk column mask request, policy implementers need to consider whether they want to use the feature, and if so, implement it in a way that is performant.
While the documentation suggests that a policy implementer can use recursion within Rego to turn a non-batch policy into a batch one (calling itself and using with to evaluate itself against different objects), this is far from efficient if the request is large. The cost of adding a new type of request is not negligible, so I'd suggest keeping OPA as-is for now, unless we have a clear use case where parallelizing it would not suffice.
from trino.
This has been a long-standing issue with the interface io.trino.spi.security.SystemAccessControl.
The Ranger plugin suffers from the same problem. That said, the Ranger plugin isn't widely used, so it has not presented itself as a front-and-center problem.
Now, with OPA and the renewed interest in data access control, it's time we went back and fundamentally fixed the flaws that cause these access-control storms.
What @vagaerg mentioned is the correct course of action.
from trino.
I don't think this will work for OPA, but the access control implementation could cache on a per-query basis. So when the first request comes in for a table, you just fetch all the column masks in one shot, and if the engine asks again you already have the answer.
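A rough sketch of that per-query caching idea follows. Everything here is hypothetical: `PerQueryMaskCache`, its `loadAllMasks` loader (assumed to return the masks of every masked column in a table with one backend call, and never null), and the `queryFinished` hook. How an implementation would actually learn that a query has completed is left open.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-query cache, for illustration only.
final class PerQueryMaskCache {
    // One cache entry per (query, table).
    private record Key(String queryId, String tableName) {}

    private final Map<Key, Map<String, String>> cache = new ConcurrentHashMap<>();

    // Assumed loader: table name -> (column name -> mask expression),
    // containing only the columns that have a mask. Must not return null.
    private final Function<String, Map<String, String>> loadAllMasks;

    PerQueryMaskCache(Function<String, Map<String, String>> loadAllMasks) {
        this.loadAllMasks = loadAllMasks;
    }

    // The first call for a (query, table) loads all masks for the table;
    // later calls for the same pair are answered from memory.
    Optional<String> getColumnMask(String queryId, String tableName, String columnName) {
        Map<String, String> masks = cache.computeIfAbsent(
                new Key(queryId, tableName),
                key -> loadAllMasks.apply(key.tableName()));
        return Optional.ofNullable(masks.get(columnName));
    }

    // Drop all entries of a query once it is done, so the cache does not grow without bound.
    void queryFinished(String queryId) {
        cache.keySet().removeIf(key -> key.queryId().equals(queryId));
    }
}
```

Keying on the query ID keeps the masks consistent within one query while still picking up policy changes for the next one.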
from trino.
As promised, I am back with some findings.
For these tests I added a SystemAccessControl#getBatchColumnMasks function that takes a list of columns (all from the same table). See mosiac1/trino. The changes here are experimental; I will clean them up if we decide to go ahead with this approach.
Edit: Tests were done on mosiac1@ff6e225; commits added afterwards are for cleaning up and improving the APIs.
Findings
Current SystemAccessControl interface
The current interface treats each getColumnMask call independently, and when it needs to get multiple column masks it does so in sequence.
No. cols | Avg. getMask (ms) |
---|---|
100 | 46.1910 |
250 | 86.9650 |
1000 | 248.1755 |
Updated SystemAccessControl interface, with getBatchColumnMasks, Using Parallel OPA Requests
The interface was updated to include a getBatchColumnMasks function that can fetch the masks of multiple columns from a table. The trino-opa plugin was updated to use parallel requests when fetching bulk column masks, using the default HTTP client configuration for these tests (the thread pool size is most relevant in this case; default settings are 8 for minThreads and 200 for maxThreads). The Rego policy is unchanged.
No. cols | Avg. getMask (ms) |
---|---|
100 | 9.9290 |
250 | 34.0555 |
1000 | 72.1950 |
Updated SystemAccessControl interface, with getBatchColumnMasks, Using 1 Request to OPA
The conditions listed above still hold. Additionally, an extra configuration key, opa.policy.batch-column-masking-uri, is added to trino-opa; it is used to request column masks for a list of columns. A new rule is added to the Rego policy, batchColumnMasks := [].
No. cols | Avg. getMask (ms) |
---|---|
100 | 1.8125 |
250 | 2.2990 |
1000 | 6.1945 |
Plot
The plot has the same data as above.
Methodology
All tests were run locally on my WSL instance (20 GB of RAM, 20 vcores). All tests are done using a Trino Development Server. A memory catalog is used, where a single table with N varchar columns is created. Tracing is enabled and all traces go to a local instance of Jaeger (docker run -p 4317:4317 -p 16686:16686 jaegertracing/all-in-one); this is used to get the timings. An OPA server is running locally with a simple policy that allows all Trino requests and returns no column masks.
from trino.
@mosiac1 That looks great. Are you planning to open a pull request for that?
It is a great improvement from the current state.
I think the "Updated SystemAccessControl interface, with getBatchColumnMasks, Using 1 Request to OPA" variant should be optional, as it is with the other batch requests. By default you would then get the improved version with multiple calls to OPA, unless you configure opa.policy.batch-column-masking-uri.
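The opt-in behaviour suggested here could be sketched as below. `ColumnMaskDispatcher` and its two request functions are hypothetical placeholders for "one batch request to the configured URI" and "one request per column, in parallel"; the only name taken from the thread is the opa.policy.batch-column-masking-uri setting, modelled as an optional value.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical dispatcher, for illustration only.
final class ColumnMaskDispatcher {
    // Value of opa.policy.batch-column-masking-uri, empty when not configured.
    private final Optional<String> batchColumnMaskingUri;
    // Assumed: (uri, columns) -> masks, using a single request.
    private final BiFunction<String, List<String>, List<Optional<String>>> batchRequest;
    // Assumed: columns -> masks, using one request per column.
    private final Function<List<String>, List<Optional<String>>> perColumnRequests;

    ColumnMaskDispatcher(
            Optional<String> batchColumnMaskingUri,
            BiFunction<String, List<String>, List<Optional<String>>> batchRequest,
            Function<List<String>, List<Optional<String>>> perColumnRequests) {
        this.batchColumnMaskingUri = batchColumnMaskingUri;
        this.batchRequest = batchRequest;
        this.perColumnRequests = perColumnRequests;
    }

    // Use the batch endpoint only when it is configured;
    // otherwise fall back to per-column requests.
    List<Optional<String>> getColumnMasks(List<String> columns) {
        return batchColumnMaskingUri
                .map(uri -> batchRequest.apply(uri, columns))
                .orElseGet(() -> perColumnRequests.apply(columns));
    }
}
```

Under this arrangement, existing policies would need no changes, and the single-request path would only be used by deployments that add the batch rule to their Rego policy.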
from trino.
@shohamyamin Yes, I will open a pull request soon
from trino.
@shohamyamin @dain Created the PR - #21997
from trino.
@shohamyamin The PR was merged, I think we can close the issue.
from trino.