absaoss / spark-hats

Nested array transformation helper extensions for Apache Spark

License: Apache License 2.0

Scala 100.00%
scala spark schema nested-structures arrays

spark-hats's People

Contributors

dwfchu, fossabot, miroslavpojer, raffael-dzikowski, yruslan, zejnilovic

spark-hats's Issues

Select an element from an array of arrays

Background

If you have an array of arrays, select doesn't work.

Feature

Be able to select a column from an array of arrays.

Example

scala> res0.printSchema
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemid: string (nullable = true)
 |    |    |-- qty: integer (nullable = false)
 |    |    |-- price: double (nullable = false)
 |    |    |-- payments: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- payid: string (nullable = true)
 |    |    |    |    |-- amount: double (nullable = false)

scala> res0.selectFromArray(col("items.payments.payid"))
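
Until such a method exists, the same selection can probably be expressed with plain Spark SQL. The sketch below assumes Spark 2.4 or later (for the `transform` higher-order function) and the schema printed above. `SelectFromNestedArrays` and `selectPayIds` are hypothetical names used for illustration and are not part of spark-hats.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Hypothetical helper for illustration; not part of the spark-hats API.
object SelectFromNestedArrays {
  // For every element of `items`, extract `payments.payid`.
  // The resulting column has type array<array<string>>:
  // one inner array of payment ids per item.
  def selectPayIds(df: DataFrame): DataFrame =
    df.select(expr("transform(items, i -> i.payments.payid)").as("payid"))
}
```

Usage would be `SelectFromNestedArrays.selectPayIds(res0)`. The nesting of the output mirrors the nesting of the input arrays, which is one possible answer to what a select across two array levels should return.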

Add CI

Feature

Add a Jenkinsfile to run CI.

Add code coverage support

Background

Add code coverage support to be able to measure code quality.

Feature

New ability to measure current code coverage as one of the QA metrics.

Update:

  • Improve or add code coverage support to be able to measure code quality.
  • New ability to measure current code coverage as one of the QA metrics.
  • Add GH action to check changed file coverage.

Add advanced nested routines to the extensions

Background

Currently, the DataFrame extensions contain only basic nested routines.

Feature

Add routines that work with the error column as DataFrame extensions as well. Put these routines into a separate implicit class so that the advanced extensions must be explicitly enabled before use.

Update (minor): Add an explanation for the 'hats' acronym.

Add schema projection

Feature

Create a method that, given an input DataFrame and a desired schema, applies the schema to the DataFrame as long as:

  • Fields with the same name have the same or compatible data types.
  • The desired schema is a subset of the input schema.
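
The two preconditions above can be sketched as a recursive check over Spark's schema types. This is a minimal sketch under stated assumptions: "compatible" is narrowed to "equal" for leaf types, nullability of fields is ignored, and `SchemaProjectionCheck` and `isProjectable` are hypothetical names, not spark-hats API. It only validates the schemas and does not perform the projection itself.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Hypothetical helper for illustration; not part of the spark-hats API.
object SchemaProjectionCheck {
  // True when every field in `desired` exists in `input` with a matching type.
  // Structs are compared field by field (by name), arrays by element type,
  // and all other types must be equal (assumption: "compatible" means "equal").
  def isProjectable(desired: DataType, input: DataType): Boolean =
    (desired, input) match {
      case (d: StructType, i: StructType) =>
        d.fields.forall { wanted =>
          i.fields
            .find(_.name == wanted.name)
            .exists(found => isProjectable(wanted.dataType, found.dataType))
        }
      case (d: ArrayType, i: ArrayType) =>
        isProjectable(d.elementType, i.elementType)
      case (d, i) =>
        d == i
    }
}
```

A caller would run `SchemaProjectionCheck.isProjectable(desiredSchema, df.schema)` before attempting the projection and fail early with a clear message if it returns false.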

Errors are not retained if the input array is null

Background

If a processed array is null, all existing errors in the error column are removed.

Expected behavior

The API that supports adding errors to the error column should retain the list of errors for each record that is already there.
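
The issue does not say where the errors are lost. As general background, Spark's `concat` returns null when any of its inputs is null, so any step that appends to an array column needs to guard against a null operand. The sketch below shows one null-safe way to append, assuming Spark 2.4 or later (array support in `concat`) and that both columns have the same array type. `ErrorColumnAppend` and `appendErrors` are hypothetical names and do not describe how spark-hats implements this.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat, when}

// Hypothetical helper for illustration; not the spark-hats implementation.
object ErrorColumnAppend {
  // Appends `added` to `existing` without losing either side when the other is null:
  //  - if `added` is null, the existing errors are returned unchanged;
  //  - if `existing` is null, the added errors are returned;
  //  - otherwise the two arrays are concatenated.
  def appendErrors(existing: Column, added: Column): Column =
    when(added.isNull, existing)
      .when(existing.isNull, added)
      .otherwise(concat(existing, added))
}
```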

UDF "arrayDistinctErrors" is called but not registered

In NestedArrayTransformations.scala, the UDF "arrayDistinctErrors" is called, but it is not registered anywhere in the library. (To be precise, it is registered, but only in the test part of the code.)
The library therefore relies on the user application to define and register the function.

Support dataframes that have maps

Background

Currently, spark-hats does not support transformations inside nested maps (see #15). It should at least allow processing DataFrames that contain maps when the transformations happen outside the nested maps.

Feature

Allow applying transformations on nested structs and arrays outside of a map when the DataFrame contains a map.

How to use spark-hats functions on a DataFrame in PySpark?

This looks like a super helpful extension for dealing with deeply nested fields in Spark. I'd love to see if it can help me with my problems, but I'm using PySpark in Python.

I think it's installing properly with:

from pyspark.sql.session import SparkSession
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'za.co.absa:spark-hats_2.12:0.2.2')
    .getOrCreate()
)

Since I see the following in the logs:

:: resolution report :: resolve 157ms :: artifacts dl 4ms
	:: modules in use:
	za.co.absa#spark-hats_2.12;0.2.2 from central in [default]
	za.co.absa#spark-hofs_2.12;0.4.0 from central in [default]

But then if I create a dataframe and try to access the functions, I'm not having success:

>>> empty_df = spark.createDataFrame([], schema="")
>>> empty_df.nestedWithColumn()
AttributeError: 'DataFrame' object has no attribute 'nestedWithColumn'
>>> empty_df._jdf.nestedWithColumn()
Py4JError: An error occurred while calling o63.nestedWithColumn. Trace:
py4j.Py4JException: Method nestedWithColumn([]) does not exist

So not sure if anyone has experience with PySpark here and has any insights. I'll also update this issue if I find a solution.

Multiple calls of nestedMapColumn cause dataframe to hang

Describe the bug
Hi there, I am using nestedMapColumn for multiple nested columns in a dataframe. After multiple uses of nestedMapColumn, any native DataFrame method I call (.show(), .write(), .select(), etc.) hangs: it simply blocks the application code. When I use nestedMapColumn only once on the dataframe, it works fine. Is this behavior expected? Is there a workaround?

I am using: libraryDependencies += "za.co.absa" %% "spark-hats" % "0.2.2"

To Reproduce
Steps to reproduce the behavior:

  1. Have a dataframe with multiple nested fields.
  2. Apply the nestedMapColumn on multiple fields.
  3. Try to run outputDataframe.show(), outputDataframe.select(), etc.

Expected behavior
I expect the dataframe to not hang after using the nestedMapColumn multiple times.

Add Unstruct functionality to flatten a nested struct

Background

Currently, there is no way to flatten a struct field at a given level of nesting.

Feature

When calling df.nestedMapColumn(), the unstruct function should project the fields of a nested struct onto the same level as the parent.

Example

For a dataset of the following shape:

root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|    |    |-- c: struct (containsNull = true)
|    |    |    |-- nestedField1: string (nullable = true)
|    |    |    |-- nestedField2: long (nullable = true)

Applying df.nestedMapColumn("my_array.c", "my_array", c => unstruct(c)) should result in:

root
|-- id: long (nullable = true)
|-- my_array: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|    |    |-- nestedField1: string (nullable = true)
|    |    |-- nestedField2: long (nullable = true)
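
For comparison, the target shape above can be produced today with plain Spark SQL by rebuilding each array element. This is a sketch for this specific schema only, assuming Spark 2.4 or later (for `transform`). The field names are listed by hand, which is exactly the manual work the proposed unstruct function would remove. `UnstructSketch` and `flattenC` are hypothetical names, not spark-hats API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Hypothetical helper for illustration; not part of the spark-hats API.
object UnstructSketch {
  // Rebuilds every element of `my_array`, keeping `a` and `b` and lifting the
  // fields of the nested struct `c` to the element level.
  // A null `c` yields null nestedField1/nestedField2; a null array stays null.
  def flattenC(df: DataFrame): DataFrame =
    df.withColumn("my_array", expr(
      """transform(my_array, e -> named_struct(
        |  'a', e.a,
        |  'b', e.b,
        |  'nestedField1', e.c.nestedField1,
        |  'nestedField2', e.c.nestedField2))""".stripMargin))
}
```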

Support nested Map transformations

Currently, when a nested Map is encountered, the following error is thrown:

java.lang.IllegalArgumentException: Field 'someNestedMap' is not a struct type or an array.

This would be a nice-to-have improvement. Are there plans to introduce this feature?

Move extended array transformations from Enceladus to spark-hats

Background

Extended array transformations were added to Enceladus to support broadcast join on array elements so that join conditions could contain fields on all parent array levels.

Feature

Move the extended array transformations from Enceladus to spark-hats and add a Spark extension interface for them.
