Comments (24)

Vislesha avatar Vislesha commented on June 7, 2024 3

Hi Team (@imback82 , @Niharikadutta , @dbeavon, @suhsteve, @AFFogarty, @bamurtaugh),
Any update on the possible new release of this library?
Thanks.

from spark.

bmazzarol avatar bmazzarol commented on June 7, 2024 3

Quoting @AFFogarty: "Hi @Vislesha, the main branch is already on .NET 6.0. However, it would not be safe to do an official release until #1131 is fixed."

First off, this library is great and I want to commend all the hard work that has gone into it.

Just my two cents here, but I think it would be a good idea to consider a release with the Binary Serializer still in place, for the following reasons:

  • My reading of the deprecation was that it needs to be done, but projects can plan for it and start the process towards it. It's not designed in such a way as to block all releases.
  • Although the Binary Serializer cannot be made safe, that is because any schema-less serializer is equally unsafe; swapping it out for another, like protobuf, without a formal contract in place does not solve the problem.
  • The Spark Connect gRPC bindings provide a base for integration, minus UDFs, and can also be considered for a future state.
  • Not having a release of this library on a supported version of .NET is far more damaging than the security concerns around the Binary Serializer, and will kill the community engagement with it that is required to implement a proper fix.
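
To make the schema-less point concrete, here is a minimal Python sketch. It is an analogy only: it is not .NET code and not this library's serializer. It assumes Python's `pickle` as a stand-in for a type-bearing, schema-less format, and a hypothetical `decode_with_contract` helper as a stand-in for decoding behind a fixed contract. The names `Payload` and `EXPECTED_FIELDS` are made up for illustration.

```python
import json
import pickle


class Payload:
    """Hypothetical object whose pickled form asks the loader to call eval."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7") and returns the result.
        return (eval, ("6 * 7",))


# Schema-less, type-bearing format: the bytes decide what runs on load.
untrusted_bytes = pickle.dumps(Payload())
loaded = pickle.loads(untrusted_bytes)  # evaluates an expression chosen by the payload

# Contract-based decoding: only the agreed fields and types are accepted.
EXPECTED_FIELDS = {"name": str, "count": int}


def decode_with_contract(raw: str) -> dict:
    """Parse JSON and reject anything that does not match EXPECTED_FIELDS."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != set(EXPECTED_FIELDS):
        raise ValueError("unexpected fields")
    for field, expected_type in EXPECTED_FIELDS.items():
        if type(data[field]) is not expected_type:
            raise ValueError(f"wrong type for {field}")
    return data


record = decode_with_contract('{"name": "rows", "count": 3}')
```

In this sketch, `loaded` ends up holding whatever the payload's expression computed, while `decode_with_contract` raises `ValueError` for any input outside the agreed shape. The wire format matters less than whether the receiver enforces a contract.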

Hope my comments are clear. I look forward to hearing what others think.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024 3

Quoting @bmazzarol: "Not having a release of this library on a supported version of dotnet is far more damaging than the security concerns around the Binary Serializer, and will kill community engagement with it, required to implement a proper fix."

@bmazzarol <= well communicated, very much appreciated 👏

We need to quickly find ways to keep iterating on and improving this project.

bmazzarol avatar bmazzarol commented on June 7, 2024 2

@GeorgeS2019 Will do my best!

Spark Connect is a built-in set of gRPC bindings included with Spark 3.4+.

This provides a low-level API that can be used to drive Spark in a very similar way to how this project works. In fact, the latest version of PySpark already supports this client mode.

This solves the serializer issue, as it uses protobuf behind a defined gRPC contract.

However, my understanding is that a UDF needs to run on the Spark workers and be written in one of the supported languages to work via Spark Connect.

That said, it's not my intention to design a solution here. I just want to argue for a roadmap to be created and an "as is" release to be considered, so progress can be made incrementally.

At the very least, a counterargument against an "as is" release would be good, so the community can understand any issues that might not have been considered.

bmazzarol avatar bmazzarol commented on June 7, 2024 1

Quoting @GeorgeS2019: "I wonder if it is potentially feasible to replace the JVM part of the diagram with ikvm.NET?"

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

[image: diagram]

My understanding is that ikvm.NET allows Java programs to run on .NET. So it would not be equivalent to Py4J, which is a bridge that lets a Python interpreter access objects living in a JVM.

But again, my main point was that there are lots of ways to move forward; all take time and require planning. The bigger issue is that, in the meantime, there is no supported release of this library.

Vislesha avatar Vislesha commented on June 7, 2024 1

@GeorgeS2019! Sure,

dbeavon avatar dbeavon commented on June 7, 2024 1

I'm testing with .NET 8 on OSS and Azure HDI.

@AFFogarty It has been almost a year since you mentioned the concern related to #1131. Did you see there is a PR, #1166?

I'm eager to help get this merged. Let me know how we can help. I will start testing it on OSS and HDI as soon as possible.
I think MessagePack is as good a solution as other possible replacements for BinaryFormatter. My opinion is that it could be used as the default serialization/deserialization strategy, but that users should be able to revert to BinaryFormatter if desired.

Can we get this merged? After that, I will have follow-up changes to migrate to .NET 8. They are basically the same as your old changes to migrate to .NET 6.

AFFogarty avatar AFFogarty commented on June 7, 2024

Hi @Vislesha, the main branch is already on .NET 6.0. However, it would not be safe to do an official release until #1131 is fixed.

relcodedev avatar relcodedev commented on June 7, 2024

I have used the current version, 2.1.1, to write Delta format from .NET 7.0.

I used SDKMAN to install Java and Spark. The .NET app (dotnetapp) was compiled to native.

Environment:

  • Ubuntu 22.04
  • JDK 8
  • Spark 3.2.1
  • .NET 7.0
  • Microsoft.Spark.Worker-2.1.1

Run spark-submit:

spark-submit --packages io.delta:delta-core_2.12:2.0.2 --class org.apache.spark.deploy.dotnet.DotnetRunner --master local ./bin/Release/dotnetapp/microsoft-spark-3-2_2.12-2.1.1.jar ./bin/Release/dotnetapp/dotnetapp

There is still the bug with UDFs.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

UDFs are not working in Polyglot notebooks due to #1131.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@AFFogarty

Could you provide a working solution to make UDFs work in Polyglot by working with the Polyglot team => @claudiaregio?

Basically, it is the directory path problem associated with Polyglot.

Vislesha avatar Vislesha commented on June 7, 2024

Hi @AFFogarty & @GeorgeS2019, thank you for the quick reply!

We are heavily reliant on this library for our solution, which is now ready for production. The rest of our application is on .NET 6.0, and we would like this library to be upgraded as well. We are currently using the main branch and it's all working fine on .NET 6.0, as we are not using either UDFs or Polyglot notebooks. However, as we are going to production, we would like an official version, and it appears #1131 is a security vulnerability that would fail some security checks.

Also, we are looking for a complete port of Spark along with MLlib. We would greatly appreciate a new version of this library with full compatibility with the latest version of Spark.

Vislesha avatar Vislesha commented on June 7, 2024

Hi Team:
Could someone clarify the future of this library? The PRs have been pending for so long!

Also, is there a chance this library can be merged with SynapseML (https://github.com/microsoft/SynapseML)? It appears it is actively being developed and has a better technology to generate Spark bindings without much delay and also has so many other features integrated.

Thanks!

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@bmazzarol

Quoting @bmazzarol:

"Although the Binary Serializer can not be made safe, it's because any schema-less Serializer is equally unsafe, swapping it out for another like protobuf without the formal contract in place does not solve the problem."

"The spark connect grpc bindings provides a base for integration, minus UDFs and can also be considered in a future state."

Could you provide more information?

UDFs are only an issue with Polyglot notebooks.
That is a question for the Polyglot team.

Could you elaborate further, so others can continue to add more information and we iteratively get closer to a suitable solution?

I wonder if the block is due to legal issues rather than the software implementation.

Why is there no incentive to address this at the software level for the .NET community?

#AGAIN

@AFFogarty,

Not moving this forward could have UNDESIRABLE consequences for the entire Microsoft Big Data analytics offerings.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@bmazzarol

  • A UDF needs to run on the Spark workers.
  • It needs to be written in one of the supported languages to work via Spark Connect.

[image: diagram]

Is it feasible to make .NET one of the supported languages (e.g. Python, R, Go according to the diagram)?

I am still fuzzy on this.

Are you familiar with IKVM?

It targets .NET 6, and it is possible to load Java code files and compile them into .NET within VS2022.

If IKVM is feasible, then keeping Spark.NET always up to date is no longer an issue.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@bmazzarol

I wonder if it is potentially feasible to replace the JVM part of the diagram with ikvm.NET?

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

[image: diagram]

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

Initial protobuf definition for Spark Connect API

Vislesha avatar Vislesha commented on June 7, 2024

Hi @bmazzarol! It appears it's going to be a long time for any new version of this library. We'll explore alternatives. Thank you for the clarification!

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@Vislesha

What alternative(s) are you expecting?

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@Vislesha
WHY are you stopping the brainstorming?

Vislesha avatar Vislesha commented on June 7, 2024

@GeorgeS2019, we are moving to Java-based APIs for our Analytics Engine so we don't have to play catch-up with compatible libraries. It's going to be time-consuming, but it looks like the better alternative.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@Vislesha

You have abandoned it, but not everyone has YET. So, do consider leaving the issue open even if you are no longer interested.

mlafleur avatar mlafleur commented on June 7, 2024

This issue is certainly of interest to me. We are considering using Spark and Spark .NET, but this issue raises some obvious concerns.

GeorgeS2019 avatar GeorgeS2019 commented on June 7, 2024

@dbeavon

Thanks for helping to keep this project moving forward.
