absaoss / spark-hats
Nested array transformation helper extensions for Apache Spark
License: Apache License 2.0
If you have an array of arrays, select doesn't work.
Be able to select a column from an array of arrays.
scala> res0.printSchema
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemid: string (nullable = true)
| | |-- qty: integer (nullable = false)
| | |-- price: double (nullable = false)
| | |-- payments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- payid: string (nullable = true)
| | | | |-- amount: double (nullable = false)
scala> res0.selectFromArray(col("items.payments.payid"))
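Plain Spark's dot notation traverses only one array level, so the nested payid cannot be reached directly; selectFromArray above is the proposed API, not an existing one. A minimal sketch of a plain-Spark workaround using the transform higher-order function (Spark 2.4+), assuming a DataFrame df with the schema shown:

```scala
import org.apache.spark.sql.functions.{col, expr}

// One array level works: yields array<string>
df.select(col("items.itemid"))

// Two array levels fail:
// df.select(col("items.payments.payid"))  // AnalysisException

// Workaround: map over the outer array and project the inner field,
// producing array<array<string>>.
val payids = df.select(
  expr("transform(items, item -> item.payments.payid)").as("payid")
)
```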
Add a Jenkinsfile to run CI.
Currying is preferred when a function is passed as a parameter. Switch the signatures of the nested transformation methods to use currying instead of a single parameter list.
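A dependency-free sketch of the signature change, with Int standing in for Spark's Column type (the names and signatures below are illustrative, not the library's actual API):

```scala
// Function in the same parameter list: the current style.
def nestedMapColumn(inputCol: String, outputCol: String,
                    f: Int => Int): Int = f(0)

// Curried style: the function gets its own parameter list,
// so callers can pass a multi-line block naturally.
def nestedMapColumnCurried(inputCol: String, outputCol: String)
                          (f: Int => Int): Int = f(0)

val before = nestedMapColumn("a", "b", x => x + 1)
val after = nestedMapColumnCurried("a", "b") { x =>
  x + 1 // block syntax reads better for non-trivial transformations
}
```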
Add code coverage support to be able to measure code quality.
New ability to measure current code coverage as one of the QA metrics.
Update:
Currently, the DataFrame extensions contain only basic nested routines.
Add routines that work with the error column as DataFrame extensions as well. Put these routines into a separate implicit class so that the advanced extensions have to be explicitly imported before use.
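The opt-in pattern can be sketched with plain Scala implicit classes; String stands in for DataFrame, and the object and method names are hypothetical:

```scala
object BasicExtensions {
  implicit class BasicOps(val df: String) extends AnyVal {
    def nestedWithColumn: String = df + ":basic"
  }
}

object AdvancedExtensions {
  // Error-column routines live in their own implicit class, so they
  // only become available after this object is explicitly imported.
  implicit class ErrorOps(val df: String) extends AnyVal {
    def nestedWithColumnAndError: String = df + ":error"
  }
}

import BasicExtensions._
val a = "df".nestedWithColumn            // "df:basic"

import AdvancedExtensions._
val b = "df".nestedWithColumnAndError    // "df:error"
```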
Update (minor): Add an explanation for the 'hats' acronym.
Add support for Spark 3.2.1.
Create a method that, given an input dataframe and the desired schema, applies the schema to the dataframe as long as:
If a processed array is null, all existing errors in the error column are removed.
The API that supports adding errors to the error column should retain each record's existing list of errors.
Currently, the compatibility table contains only Spark 2.*
Add Spark 3 to the compatibility table
In NestedArrayTransformations.scala, the UDF "arrayDistinctErrors" is called but is not registered anywhere in the library (to be precise, it is registered, but only in the test part of the code).
The library therefore relies on the user application to define and register the function.
Currently, spark-hats does not support transformations inside nested maps (see #15). But it should at least allow processing DataFrames that contain maps when the transformations happen outside the nested maps.
Allow applying transformations on nested structs and arrays outside of a map when the DataFrame contains a map.
It has been requested to be able to use the library in a project built for Scala 2.13.
This looks like a super helpful extension for dealing with deeply nested fields in Spark. I'd love to see if it can help me with my problems, but I'm using PySpark in Python.
I think it's installing properly with:
from pyspark.sql.session import SparkSession
spark = (
SparkSession.builder
.config('spark.jars.packages', 'za.co.absa:spark-hats_2.12:0.2.2')
.getOrCreate()
)
Since I see the following in the logs:
:: resolution report :: resolve 157ms :: artifacts dl 4ms
:: modules in use:
za.co.absa#spark-hats_2.12;0.2.2 from central in [default]
za.co.absa#spark-hofs_2.12;0.4.0 from central in [default]
But then if I create a dataframe and try to access the functions, I'm not having success:
>>> empty_df = spark.createDataFrame([], schema="")
>>> empty_df.nestedWithColumn()
AttributeError: 'DataFrame' object has no attribute 'nestedWithColumn'
>>> empty_df._jdf.nestedWithColumn()
Py4JError: An error occurred while calling o63.nestedWithColumn. Trace:
py4j.Py4JException: Method nestedWithColumn([]) does not exist
So not sure if anyone has experience with PySpark here and has any insights. I'll also update this issue if I find a solution.
Describe the bug
Hi there, I am using nestedMapColumn on multiple nested columns in a dataframe. After multiple uses of nestedMapColumn, any native method I call on the dataframe hangs: .show(), .write(), .select(), etc. By "hang" I mean it just blocks the application code when I use a native dataframe method. When I use nestedMapColumn only once on the dataframe, it works fine. Is this behavior expected? Is there any workaround?
I am using: libraryDependencies += "za.co.absa" %% "spark-hats" % "0.2.2"
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the dataframe not to hang after using nestedMapColumn multiple times.
Currently, there is no way to flatten a struct field at a certain level of nesting.
When calling f.nestedMapColumn(), the unstruct function should project the fields of a nested struct onto the same level as its parent.
For a dataset of the following shape:
root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: struct (containsNull = true)
| | | |-- nestedField1: string (nullable = true)
| | | |-- nestedField2: long (nullable = true)
Applying df.nestedMapColumn("my_array.c", "my_array", c => unstruct(c))
should result in
root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- nestedField1: string (nullable = true)
| | |-- nestedField2: long (nullable = true)
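Until such an unstruct helper exists, the same flattening can be sketched in plain Spark with the transform higher-order function (Spark 2.4+), rebuilding each array element so that c's fields are projected one level up; df and the field names follow the schema above:

```scala
import org.apache.spark.sql.functions.expr

val flattened = df.withColumn("my_array", expr(
  """transform(my_array, x -> struct(
    |  x.a as a,
    |  x.b as b,
    |  x.c.nestedField1 as nestedField1,
    |  x.c.nestedField2 as nestedField2))""".stripMargin))
```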
Currently, when a nested Map is encountered, the following error is given:
java.lang.IllegalArgumentException: Field 'someNestedMap' is not a struct type or an array.
This would be a nice-to-have improvement. Are there plans to introduce this feature?
Add cross-compilation for Spark 3.
Extended array transformations were added to Enceladus to support broadcast join on array elements so that join conditions could contain fields on all parent array levels.
Move the extended array transformations from Enceladus to spark-hats and add a Spark extension interface to them.