absaoss / spark-hats

Nested array transformation helper extensions for Apache Spark

License: Apache License 2.0

Scala 100.00%

scala spark schema nested-structures arrays

spark-hats's People

Contributors

raffael-dzikowski yruslan snbaskarraj fossabot

spark-hats's Issues

Select an element from an array of arrays

Background

If you have an array of arrays, select does not work.

Feature

Be able to select a column from an array of arrays.

Example [Optional]

scala> res0.printSchema
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemid: string (nullable = true)
 |    |    |-- qty: integer (nullable = false)
 |    |    |-- price: double (nullable = false)
 |    |    |-- payments: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- payid: string (nullable = true)
 |    |    |    |    |-- amount: double (nullable = false)

scala> res0.selectFromArray(col("items.payments.payid"))

Feature

Add a Jenkins file to run CI

Use currying interface for methods that accept lambdas

Feature

Currying is preferred when a function is passed as a parameter. Switch the signature of nested transformation methods to use currying instead of a parameter list.
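To illustrate the difference between the two styles, here is a minimal sketch in plain Scala. The names mapItemsUncurried and mapItemsCurried are hypothetical stand-ins that operate on a simple Seq; they are not the spark-hats methods, whose exact signatures are not shown in this issue.

```scala
object CurryingSketch {
  // Hypothetical stand-in, not the spark-hats API.
  // Single parameter list: the lambda is just one more argument.
  def mapItemsUncurried(items: Seq[Int], f: Int => Int): Seq[Int] =
    items.map(f)

  // Curried form: the lambda gets its own parameter list.
  def mapItemsCurried(items: Seq[Int])(f: Int => Int): Seq[Int] =
    items.map(f)

  // Call site with a parameter list: the lambda sits inside the parentheses.
  val viaParameterList: Seq[Int] =
    mapItemsUncurried(Seq(1, 2, 3), x => x * 2)

  // Call site with currying: the lambda can be written as a trailing block.
  val viaCurrying: Seq[Int] =
    mapItemsCurried(Seq(1, 2, 3)) { x =>
      x * 2
    }
}
```

Both calls compute the same result. The curried call site tends to read better when the lambda spans several lines, which is the usual argument for this style.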

Background

Add code coverage support to be able to measure code quality.

Feature

New ability to measure current code coverage as one of the QA metrics.

Update:

Improve or add code coverage support to be able to measure code quality.
New ability to measure current code coverage as one of the QA metrics.
Add GH action to check changed file coverage.

Add advanced nested routines to the extensions

Background

Currently, the DataFrame extensions contain only basic nested routines.

Feature

Add routines that work with the error column as DataFrame extensions as well. Put these routines into a separate implicit class so that the advanced extensions must be enabled explicitly before use.

Update. (minor) Add an explanation for the 'hats' acronym.

Feature

Add support for Spark 3.2.1

Feature

Create a method that, given an input DataFrame and a desired schema, applies the schema to the DataFrame as long as:

Same fields have same/compatible data types.
The desired schema is a subset of the input schema.
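The two conditions above could be checked roughly as follows. This is only a sketch on a simplified, hypothetical type model (FieldType, StructOf, ArrayOf and conformsTo are made-up names, not Spark or spark-hats classes), and it narrows "compatible" to "equal" for leaf types, which is an assumption rather than something the issue specifies.

```scala
object SchemaSubsetSketch {
  // Simplified, hypothetical stand-ins for Spark data types.
  sealed trait FieldType
  case object StringT extends FieldType
  case object LongT extends FieldType
  case object DoubleT extends FieldType
  final case class ArrayOf(element: FieldType) extends FieldType
  final case class StructOf(fields: Map[String, FieldType]) extends FieldType

  // True when every field of `desired` exists in `input` with a matching type.
  // Assumption: leaf types are "compatible" only when they are equal.
  def conformsTo(desired: FieldType, input: FieldType): Boolean =
    (desired, input) match {
      case (StructOf(want), StructOf(have)) =>
        want.forall { case (name, wantedType) =>
          have.get(name).exists(haveType => conformsTo(wantedType, haveType))
        }
      case (ArrayOf(want), ArrayOf(have)) => conformsTo(want, have)
      case (want, have)                   => want == have
    }

  // Example: the desired schema keeps only items.itemid from the input.
  val input: FieldType = StructOf(Map(
    "id"    -> LongT,
    "items" -> ArrayOf(StructOf(Map("itemid" -> StringT, "price" -> DoubleT)))
  ))
  val desired: FieldType = StructOf(Map(
    "items" -> ArrayOf(StructOf(Map("itemid" -> StringT)))
  ))
  val accepted: Boolean = conformsTo(desired, input)
}
```

A real implementation would walk Spark's StructType instead and would also need to project the DataFrame onto the desired fields, which this sketch does not attempt.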

Errors are not retained if the input array is null

Background

If a processed array is null, all existing errors in the error column are removed.

Expected behavior

The API that supports adding errors to the error column should retain the list of errors for each record that is already there.
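The expected behavior can be modeled in plain Scala as below. This is a hedged sketch: appendErrors is a hypothetical helper, errors are reduced to plain strings, and a nullable Seq stands in for the nullable array column. It does not reproduce the library's actual code.

```scala
object ErrorRetentionSketch {
  // Hypothetical model, not the spark-hats API: `items` may be null,
  // mirroring a nullable array column, and errors are plain messages.
  def appendErrors(existing: Seq[String], items: Seq[Int])(
      validate: Int => Option[String]): Seq[String] = {
    // Treat a null array as empty, so it simply contributes no new errors.
    val safeItems = Option(items).getOrElse(Seq.empty[Int])
    val newErrors = safeItems.flatMap(item => validate(item).toList)
    // Always start from the errors already recorded for the record.
    (existing ++ newErrors).distinct
  }
}
```

The point of the sketch is that the null case is handled before the existing errors are combined with the new ones, so a null input array leaves the existing list untouched instead of discarding it.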

Add Spark 3 to the compatibility table in README

Background

Currently, the compatibility table contains only Spark 2.*

Feature

Add Spark 3 to the compatibility table

UDF "arrayDistinctErrors" is called but not registered

In NestedArrayTransformations.scala, the UDF "arrayDistinctErrors" is called, but it is not registered anywhere in the library. (To be precise, it is registered, but only in the test part of the code.)
Therefore, the library relies on the user application to define and register the function.

Support dataframes that have maps

Background

Currently, spark-hats does not support transformations inside nested maps (see #15). It should at least allow processing DataFrames that contain maps when the transformations happen outside the nested maps.

Feature

Allow applying transformations on nested structs and arrays outside of a map when the DataFrame contains a map.

Add support for Scala 2.13

Background

It has been requested to be able to use the library in a project built for Scala 2.13.

How to use spark-hats functions on a DataFrame in PySpark?

This looks like a super helpful extension for dealing with deeply nested fields in Spark. I'd love to see if it can help me with my problems, but I'm using PySpark in Python.

I think it's installing properly with:

from pyspark.sql.session import SparkSession
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'za.co.absa:spark-hats_2.12:0.2.2')
    .getOrCreate()
)

Since I see the following in the logs:

:: resolution report :: resolve 157ms :: artifacts dl 4ms
	:: modules in use:
	za.co.absa#spark-hats_2.12;0.2.2 from central in [default]
	za.co.absa#spark-hofs_2.12;0.4.0 from central in [default]

But then if I create a dataframe and try to access the functions, I'm not having success:

>>> empty_df = spark.createDataFrame([], schema="")
>>> empty_df.nestedWithColumn()
AttributeError: 'DataFrame' object has no attribute 'nestedWithColumn'
>>> empty_df._jdf.nestedWithColumn()
Py4JError: An error occurred while calling o63.nestedWithColumn. Trace:
py4j.Py4JException: Method nestedWithColumn([]) does not exist

So not sure if anyone has experience with PySpark here and has any insights. I'll also update this issue if I find a solution.

Multiple calls of nestedMapColumn cause dataframe to hang

Describe the bug
Hi there, I am using nestedMapColumn for multiple nested columns in a dataframe. After multiple uses of nestedMapColumn, any native method I call on the dataframe, such as .show(), .write(), or .select(), hangs. By "hang" I mean that the call blocks the application code. When I use nestedMapColumn only once on the dataframe, it works fine. Is this behavior expected? Is there any workaround?

I am using: libraryDependencies += "za.co.absa" %% "spark-hats" % "0.2.2"

To Reproduce
Steps to reproduce the behavior:

1. Have a dataframe with multiple nested fields.
2. Apply nestedMapColumn on multiple fields.
3. Try to run outputDataframe.show(), outputDataframe.select(), etc.

Expected behavior
I expect the dataframe to not hang after using the nestedMapColumn multiple times.

Add Unstruct functionality to flatten a nested struct

Background

Currently, there is no way to flatten a struct field in a certain level of nesting.

Feature

When calling df.nestedMapColumn(), the unstruct function should project the fields of a nested struct onto the same level as the parent.

Example

For a dataset of the following shape:

root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|    |    |-- c: struct (containsNull = true)
|    |    |    |-- nestedField1: string (nullable = true)
|    |    |    |-- nestedField2: long (nullable = true)

Applying df.nestedMapColumn("my_array.c", "my_array", c => unstruct(c)) should result in

root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|    |    |-- nestedField1: string (nullable = true)
|    |    |-- nestedField2: long (nullable = true)

Support nested Map transformations

Currently, when a nested Map is encountered, the following error is given:

java.lang.IllegalArgumentException: Field 'someNestedMap' is not a struct type or an array.

This would be a nice-to-have improvement. Are there plans to introduce this feature?

Add cross compilation for Spark 3

Background

Add cross compilation for Spark 3.

Move extended array transformations from Enceladus to spark-hats

Background

Extended array transformations were added to Enceladus to support broadcast join on array elements so that join conditions could contain fields on all parent array levels.

Feature

Move the extended array transformations from Enceladus to spark-hats and add a Spark extension interface to them.
