Comments (24)
Hi Team (@imback82, @Niharikadutta, @dbeavon, @suhsteve, @AFFogarty, @bamurtaugh),
Any update on the possible new release of this library?
Thanks.
from spark.
Hi @Vislesha, the main branch is already on .NET 6.0. However, it would not be safe to do an official release until #1131 is fixed.
First off, this library is great, and I want to commend all the hard work that has gone into it.
Just my two cents here, but I think it would be a good idea to consider a release with the Binary Serializer still in place, for the following reasons:
- My reading of the deprecation was that it needs to be done, but projects can plan for it and start the process towards it. It's not designed in such a way as to block all releases.
- The Binary Serializer cannot be made safe, but that is because any schema-less serializer is equally unsafe. Swapping it out for another one, like protobuf, without a formal contract in place does not solve the problem.
- The Spark Connect gRPC bindings provide a base for integration, minus UDFs, and can also be considered in a future state.
- Not having a release of this library on a supported version of .NET is far more damaging than the security concerns around the Binary Serializer, and will kill the community engagement required to implement a proper fix.
Hope my comments are clear. I look forward to hearing what others think.
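To illustrate the schema-less point in the list above, here is a minimal sketch in Python rather than .NET. It is not code from this project, and the names `Payload`, `load_untyped`, and `load_with_contract` are made up for illustration. It uses Python's `pickle` as a stand-in for a schema-less serializer, where the byte stream itself decides what gets called on load, and contrasts it with a loader that only accepts fields declared up front.

```python
import json
import pickle


class Payload:
    """Hypothetical stand-in for attacker-controlled data."""

    def __reduce__(self):
        # The payload names the callable that runs on load. It is harmless
        # here (sorted), but any importable callable could be named instead.
        return (sorted, ([3, 1, 2],))


def load_untyped(blob: bytes):
    # Schema-less: the byte stream decides what is constructed or called.
    return pickle.loads(blob)


def load_with_contract(text: str) -> dict:
    # Contract-based: only the declared fields and types are accepted.
    data = json.loads(text)
    if set(data) != {"name", "count"}:
        raise ValueError("unexpected fields")
    if not isinstance(data["name"], str) or not isinstance(data["count"], int):
        raise ValueError("unexpected types")
    return data


# The receiver expected "some object" but got the result of a call the
# payload chose: result is [1, 2, 3], not a Payload instance.
result = load_untyped(pickle.dumps(Payload()))
```

The contract-based loader rejects anything outside the agreed shape, which is the property a defined gRPC/protobuf contract is meant to give; a different wire format alone does not.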
Not having a release of this library on a supported version of .NET is far more damaging than the security concerns around the Binary Serializer, and will kill the community engagement required to implement a proper fix.
@bmazzarol well communicated, much appreciated 👏
We need to find ways, fast, to keep iterating on and improving this project.
@GeorgeS2019 Will do my best!
Spark Connect is a built-in set of gRPC bindings included with Spark 3.4+.
It provides a low-level API that can be used to drive Spark in a very similar way to how this project works; in fact, the latest version of PySpark already supports this client mode.
This solves the serializer issue, as it uses protobuf behind a defined gRPC contract.
However, my understanding is that a UDF needs to run on the Spark workers and be written in one of the supported languages to work via Spark Connect.
That said, it's not my intention to propose a solution; I just want to argue for a roadmap to be created and an "as is" release to be considered, so progress can be made incrementally.
At the very least, a counter-argument against an "as is" release would be good, so the community can understand any issues that might not have been considered.
I wonder if it is potentially feasible to replace the JVM part of the diagram with ikvm.NET?
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
My understanding is that ikvm.NET allows Java programs to run on .NET, so it would not be equivalent to Py4J, which is essentially a Python interpreter running on the JVM with access to the JVM memory space.
But again, my main point was that there are lots of ways to move forward, and all of them take time and require planning. The bigger issue is that, in the meantime, there is no supported release of this library.
@GeorgeS2019! Sure,
I'm testing with .NET 8 on OSS and Azure HDI.
@AFFogarty It has been almost a year since you mentioned the concern related to #1131. Did you see there is a PR, #1166?
I'm eager to help get this merged. Let me know how we can help. I will start testing it on OSS and HDI as soon as possible.
I think MessagePack is as good a solution as other possible replacements for BinaryFormatter. My opinion is that it could be used as the default serialization/deserialization strategy, but that users should be able to revert to BinaryFormatter if desired.
Can we get this merged? After that, I will have follow-up changes to migrate to .NET 8. They are basically the same as your old changes to migrate to .NET 6.
I have used the current version, 2.1.1, to write Delta format from .NET 7.0.
I used SDKMAN to install Java and Spark. The .NET app (dotnetapp) was compiled to native. My setup:
ubuntu 22.04
jdk 8
spark 3.2.1
dotnet 7.0
Microsoft.Spark.Worker-2.1.1
Run spark-submit:
spark-submit --packages io.delta:delta-core_2.12:2.0.2 --class org.apache.spark.deploy.dotnet.DotnetRunner --master local ./bin/Release/dotnetapp/microsoft-spark-3-2_2.12-2.1.1.jar ./bin/Release/dotnetapp/dotnetapp
The bug with UDFs is still there.
UDFs are not working in the polyglot notebook due to #1131.
Could you provide a working solution to make UDFs work in polyglot by working with the polyglot team (@claudiaregio)?
Basically, it is the directory path problem associated with polyglot,
Hi @AFFogarty & @GeorgeS2019, thank you for the quick reply!
We rely heavily on this library for our solution, which is now ready for production. The rest of our application is on .NET 6.0, and we would like this library to be upgraded as well. We are currently using the main branch, and it all works fine on .NET 6.0, as we use neither UDFs nor the polyglot notebook. However, as we are going to production, we would like an official version, and it appears #1131 is a security vulnerability that would fail some security checks.
Also, we are looking for a complete port of Spark, including MLlib. We would greatly appreciate a new version of this library with full compatibility with the latest version of Spark.
Hi Team:
Could someone clarify the future of this library? The PRs have been pending for so long!
Also, is there a chance this library can be merged with SynapseML (https://github.com/microsoft/SynapseML)? It appears to be actively developed, has better technology for generating Spark bindings without much delay, and has many other features integrated.
Thanks!
The Binary Serializer cannot be made safe, but that is because any schema-less serializer is equally unsafe. Swapping it out for another one, like protobuf, without a formal contract in place does not solve the problem.
The Spark Connect gRPC bindings provide a base for integration, minus UDFs, and can also be considered in a future state.
Could you provide more information?
UDFs are only an issue with the polyglot notebook.
That is a question for the polyglot team.
Could you just elaborate further so others could continue to add more information and we iteratively get closer to a suitable solution?
I wonder if the block is due to legal issues rather than the software implementation.
Why is there no incentive to address this at the software level for the .NET community?
#AGAIN
Leaving this stalled could have UNDESIRABLE consequences for Microsoft's entire Big Data analytics offering.
- A UDF needs to run on the Spark workers
- A UDF needs to be written in one of the supported languages to work via Spark Connect
Is it feasible to make .NET one of the supported languages (e.g. Python, R, Go, according to the diagram)?
I am still fuzzy.
Are you familiar with ikvm?
It targets .NET 6, and it is possible to load Java code files and compile them into .NET within VS2022.
If ikvm is feasible, then keeping Spark.NET up to date is no longer an issue.
I wonder if it is potentially feasible to replace the JVM part of the diagram with ikvm.NET?
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Initial protobuf definition for Spark Connect API
Hi @bmazzarol! It appears it's going to be a long time before there is any new version of this library. We'll explore alternatives. Thank you for the clarification!
What alternative(s) are you expecting?
@Vislesha
WHY are you stopping the brainstorming?
@GeorgeS2019, we are moving to Java-based APIs for our Analytics Engine so we don't have to play catch-up with compatible libraries. It's going to be time-consuming, but it looks like the better alternative.
You have abandoned it, but not everyone has YET. So please consider leaving it open, even if you are no longer interested.
This issue is certainly of interest to me. We are considering using Spark and Spark .NET, but this issue raises some obvious concerns.
Thanks for helping to keep this project moving forward.