Round-robin: data is distributed evenly across distributions in a random fashion
Best for:
- Tables with no clear distribution key
- Tables that are not frequently joined with other tables
- Data that should be spread uniformly
- Temporary staging tables
- A simple starting point
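The even, key-less spreading described above is commonly called round-robin distribution. As a rough sketch only, the Python below cycles through distribution ids to show why every distribution ends up with the same share of rows. The function name `assign_round_robin`, the sample rows, and the count of 60 distributions are assumptions made for illustration; the real engine chooses placements internally and does not promise this exact order.

```python
from collections import Counter
from itertools import cycle

# Assumption: 60 distributions, the count used by Azure Synapse dedicated SQL pools.
NUM_DISTRIBUTIONS = 60


def assign_round_robin(rows, num_distributions=NUM_DISTRIBUTIONS):
    """Pair each row with a distribution id by cycling through the ids in order."""
    targets = cycle(range(num_distributions))
    return [(next(targets), row) for row in rows]


# Hypothetical staging rows; no column is used as a distribution key.
rows = [{"order_id": i} for i in range(600)]
placement = assign_round_robin(rows)

rows_per_distribution = Counter(dist for dist, _ in placement)
print(min(rows_per_distribution.values()), max(rows_per_distribution.values()))  # 10 10
```

Because no key decides where a row goes, a join on any column generally has to move data between distributions first, which is why this layout suits staging tables better than heavily joined ones.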
Hash: data is distributed deterministically by applying a hash function to the distribution column
Best for:
- Large tables like fact tables or historical transaction tables
- Tables with frequent inserts, updates and deletes
Example DDL:
CREATE TABLE YOUR_TABLE
(
    COLUMN1 INT NOT NULL,
    COLUMN2 INT NOT NULL,
    COLUMN3 INT NOT NULL,
    COLUMN4 INT NOT NULL,
    COLUMN5 VARCHAR(20)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(COLUMN1)  -- rows with the same COLUMN1 value land in the same distribution
);
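To make the deterministic placement concrete, here is a minimal Python sketch of hash distribution. It is not the engine's actual hash function: `hashlib.sha256`, the helper name `distribution_for`, and the count of 60 distributions are assumptions chosen for illustration. It shows two properties: the same key value always maps to the same distribution, and a column with few distinct values concentrates rows in a handful of distributions.

```python
import hashlib
from collections import Counter

# Assumption: 60 distributions, as in Azure Synapse dedicated SQL pools.
NUM_DISTRIBUTIONS = 60


def distribution_for(key, num_distributions=NUM_DISTRIBUTIONS):
    """Map a distribution-column value to a distribution id deterministically."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_distributions


# The same key value always lands in the same distribution.
same_place = distribution_for(42) == distribution_for(42)

# Many distinct key values spread across the distributions ...
even = Counter(distribution_for(k) for k in range(60_000))
# ... while a column with only three distinct values causes skew.
skewed = Counter(distribution_for(k % 3) for k in range(60_000))

print(same_place, len(even), len(skewed))
```

This is why a good hash column has many distinct values and is frequently used in joins: matching rows from two tables hashed on the same column are already colocated.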
Replicated: a full copy of the table is stored on every compute node
Best for:
- Small lookup or dimension tables that are frequently joined with larger tables
Data skipping information is collected automatically when data is written into a Delta table
Delta Lake on Databricks uses this information (the minimum and maximum column values per file) at query time to speed up queries
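The idea of using minimum and maximum values to avoid reading files can be sketched in a few lines of Python. This is an illustration of the principle, not Delta Lake's implementation: the `FileStats` class, the `files_to_read` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lower, upper):
    """Keep only files whose [min, max] range can contain lower <= value <= upper."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A filter such as "value BETWEEN 150 AND 160" only needs the middle file.
selected = files_to_read(files, 150, 160)
print(selected)  # ['part-1.parquet']
```

The narrower and less overlapping the per-file ranges are, the more files a query can skip.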
Z-ordering is a technique for colocating related information in the same set of files. The data-skipping algorithms of Delta Lake on Databricks use this colocation automatically to substantially reduce the amount of data that must be read
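One way to picture colocation on several columns is the classic Z-order (Morton) curve, which interleaves the bits of the column values into a single sort key. The sketch below is an assumption-laden illustration of that curve, not a description of how Databricks implements it; `z_value`, the 4x4 grid of points, and the file size of four rows are made up for the example.

```python
def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return code


# Hypothetical rows with two columns (x, y), each in the range 0..3.
points = [(x, y) for x in range(4) for y in range(4)]

# Sort by the interleaved key, then cut the sorted rows into "files" of four rows.
ordered = sorted(points, key=lambda p: z_value(*p))
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

for rows in files:
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    print((min(xs), max(xs)), (min(ys), max(ys)))
```

Each resulting file covers a small range of both columns at once, so min/max statistics on either column can rule out most files.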
Dynamic file pruning allows files to be skipped within partitions
It is useful for non-partitioned tables, or for joins on non-partitioned columns
spark.databricks.optimizer.dynamicPartitionPruning
spark.databricks.optimizer.deltaTableSizeThreshold
spark.databricks.optimizer.deltaTableFilesThreshold
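Skipping files during a join can be pictured as follows: the filter on the small side of the join is evaluated first, and the join keys that survive are then compared with the per-file minimum and maximum keys of the large table. The Python sketch below illustrates that idea only; `prune_fact_files`, the table contents, and the file names are hypothetical, and it does not use or model the Spark settings listed above.

```python
def prune_fact_files(fact_files, dim_rows, dim_predicate, key):
    """Use the join keys that survive the dimension filter to skip fact files."""
    keys = {row[key] for row in dim_rows if dim_predicate(row)}
    if not keys:
        return []  # nothing can match, so no fact file needs to be read
    lo, hi = min(keys), max(keys)
    return [f["path"] for f in fact_files if f["max_key"] >= lo and f["min_key"] <= hi]


# Hypothetical small dimension table.
dim_rows = [
    {"store_id": 1, "region": "EU"},
    {"store_id": 2, "region": "EU"},
    {"store_id": 50, "region": "US"},
]

# Hypothetical per-file key ranges of a large, non-partitioned fact table.
fact_files = [
    {"path": "sales-0.parquet", "min_key": 1, "max_key": 10},
    {"path": "sales-1.parquet", "min_key": 11, "max_key": 40},
    {"path": "sales-2.parquet", "min_key": 41, "max_key": 60},
]

kept = prune_fact_files(fact_files, dim_rows, lambda r: r["region"] == "EU", "store_id")
print(kept)  # ['sales-0.parquet']
```

The pruning decision is made at run time because the surviving keys are only known after the dimension filter has been applied.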
Rowstore
- Row Compression
- Page Compression
Columnstore
- Columnstore compression by default
- Columnstore archival compression
- Available in Azure SQL Database
- Row or page compression can be enabled or disabled either online or offline
- Disk space requirements when enabling or disabling are the same as when you are creating or rebuilding an index
Example
ALTER TABLE TABLE1 REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = ROW)
- Columnstore compression is enabled by default
- Used by clustered and nonclustered columnstore indexes
- Indexes with columnstore archival compression are slower than those without it
- Use archival compression only to reduce the storage size of data that is not accessed frequently