dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Home Page: https://dot.net/spark

License: MIT License

C# 74.33% Python 1.65% Scala 14.37% Shell 3.66% Batchfile 0.02% PowerShell 5.62% CMake 0.36%
spark csharp dotnet analytics bigdata spark-streaming spark-sql machine-learning fsharp dotnet-core

Introduction

.NET for Apache® Spark™

.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular Dataframe and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data.

.NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

.NET for Apache Spark runs on Windows, Linux, and macOS using .NET 6, or Windows using .NET Framework. It also runs on all major cloud providers including Azure HDInsight Spark, Amazon EMR Spark, AWS & Azure Databricks.

Note: We currently have a Spark Project Improvement Proposal JIRA at SPIP: .NET bindings for Apache Spark to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion.

Supported Apache Spark

Apache Spark | .NET for Apache Spark
2.4*         | v2.1.1
3.0          | v2.1.1
3.1          | v2.1.1
3.2          | v2.1.1

*2.4.2 is not supported.

Releases

.NET for Apache Spark releases are available here and NuGet packages are available here.

Get Started

These instructions will show you how to run a .NET for Apache Spark app using .NET 6.
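
As a rough sketch of what such an app looks like (mirroring the HelloSpark sample that appears in the issues below; the app name and the people.json path are illustrative, not part of the official guide):

using Microsoft.Spark.Sql;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) a Spark session.
            SparkSession spark = SparkSession.Builder().AppName("hello-spark").GetOrCreate();

            // Read a JSON file into a DataFrame and print it.
            DataFrame df = spark.Read().Json("people.json");
            df.Show();

            spark.Stop();
        }
    }
}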

Build Status

(Build status badges for Ubuntu and Windows.)

Building from Source

Building from source is very easy and the whole process (from cloning to being able to run your app) should take less than 15 minutes!

Instructions:
  • Windows
  • Ubuntu

Samples

There are two types of samples/apps in the .NET for Apache Spark repo:

  • Getting Started - .NET for Apache Spark code focused on simple and minimalistic scenarios.

  • End-to-End apps/scenarios - Real-world examples of industry-standard benchmarks, use cases, and business applications implemented using .NET for Apache Spark.

We welcome contributions to both categories!

Analytics Scenarios

  • Dataframes and SparkSQL - Simple code snippets to help you get familiarized with the programmability experience of .NET for Apache Spark.
    • Basic (C#, F#)

  • Structured Streaming - Code snippets to show you how to utilize Apache Spark's Structured Streaming (2.3.1, 2.3.2, 2.4.1, Latest).
    • Word Count (C#, F#)
    • Windowed Word Count (C#, F#)
    • Word Count on data from Kafka (C#, F#)

  • TPC-H Queries - Code to show you how to author complex queries using .NET for Apache Spark.
    • TPC-H Functional (C#)
    • TPC-H SparkSQL (C#)

Contributing

We welcome contributions! Please review our contribution guide.

Inspiration and Special Thanks

This project would not have been possible without the outstanding work from the following communities:

  • Apache Spark: Unified Analytics Engine for Big Data, the underlying backend execution engine for .NET for Apache Spark
  • Mobius: C# and F# language bindings and extensions to Apache Spark, a precursor project to .NET for Apache Spark from the same Microsoft group.
  • PySpark: Python bindings for Apache Spark, one of the implementations .NET for Apache Spark derives inspiration from.
  • SparkR: R bindings for Apache Spark, another implementation .NET for Apache Spark derives inspiration from.
  • Apache Arrow: A cross-language development platform for in-memory data. This library provides .NET for Apache Spark with efficient ways to transfer column major data between the JVM and .NET CLR.
  • Pyrolite - Java and .NET interface to Python's pickle and Pyro protocols. This library provides .NET for Apache Spark with efficient ways to transfer row major data between the JVM and .NET CLR.
  • Databricks: Unified analytics platform. Many thanks to all the suggestions from them towards making .NET for Apache Spark run on Azure and AWS Databricks.

How to Engage, Contribute and Provide Feedback

The .NET for Apache Spark team encourages contributions, both issues and PRs. The first step is finding an existing issue you want to contribute to; if you cannot find any, open a new one.

Support

.NET for Apache Spark is an open source project under the .NET Foundation and does not come with Microsoft Support unless otherwise noted by the specific product. For issues with or questions about .NET for Apache Spark, please create an issue. The community is active and is monitoring submissions.

.NET Foundation

The .NET for Apache Spark project is part of the .NET Foundation.

Code of Conduct

This project has adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community. For more information, see the .NET Foundation Code of Conduct.

License

.NET for Apache Spark is licensed under the MIT license.

spark's People

Contributors

adamsitnik, affogarty, ahsonkhan, bamurtaugh, dependabot[bot], dotnet-bot, dragorosson, eerhardt, elvaliuliuliu, goeddie, imback82, jonfortescue, kimkihyuk, laneser, maherjendoubi, mmitche, niharikadutta, rapoth, safern, samueleresca, saravana1501, seanshi007, serena-ruan, slang25, stephentoub, stevekirks, suhsteve, sunildixit, tadziqusky, usmanmohammed


spark's Issues

Use `dotnet pack` against a csproj to create nuget package

We are currently using nuget.exe and a .nuspec file to create the Spark NuGet package. We should instead use dotnet pack and MSBuild to do this.

The advantages are that we can:

  1. Maintain the dependencies in a single place (the .csproj).
  2. Use MSBuild properties to control things like the version number, which allows pre-release daily builds to have different version numbers.
  3. Build cross-platform.
  4. Take advantage of new features, like .snupkg symbol packages, Source Link, and the <repository /> tag in the .nuspec, which tells users which git repo and commit the package was built from.

@imback82 @safern @rapoth

Discussion: Idiomatic F# APIs

This user experience item describes idiomatic APIs for C# and F#: https://github.com/dotnet/spark/blob/master/ROADMAP.md#user-experience-1

I think this would be a good issue to discuss what idiomatic looks like for F# in the context of Spark.

Here's the (basic) sample from the .NET homepage:

// Create a Spark session
let spark =
    SparkSession.Builder()
        .AppName("word_count_sample")
        .GetOrCreate()

// Create a DataFrame
let df = spark.Read().Text("input.txt")

let words = df.Select(Split(df.["value"], " ").Alias("words"))

words.Select(Explode(words.["words"]).Alias("word"))
     .GroupBy("word")
     .Count()

Although this certainly isn't bad, a more idiomatic API could look something like this:

// Create a Spark session
let spark =
    SparkSession.initiate()
    |> SparkSession.appName "word_count_sample"
    |> SparkSession.getOrCreate

// Create a DataFrame
let df = spark |> Spark.readText "input.txt"

let words = df |> DataFrame.map (Split(df.["value"], " ").Alias("words"))

words
|> DataFrame.map (Explode(words.["words"]).Alias("word"))
|> DataFrame.groupBy "word"
|> DataFrame.count

The above is just a starting point for a conversation. It would assume a module of combinators for data frames (and potentially other collection-like structures). Although this wouldn't be difficult to implement or maintain - it would be proportional to maintaining the one-liners in the C# LINQ-style implementation - I wonder what else could be done to make it feel more natural for F#, and what the best bang for our buck here is.

In other words, I'd love to solicit feedback on the kinds of things that matter most to F# developers interested in using Spark, so that it's possible to stack these up relative to their implementation and maintenance costs.

Also including @isaacabraham, as he tends to be a lot more creative than I am when it comes to these things 😄

[BUG]: Trying to follow the "Getting Started" guide step by step

Describe the bug
I was following the Getting Started guide step by step. When I execute the following:

C:\Users\j.shaer\source\repos\HelloSpark\HelloSpark\bin\Debug\netcoreapp2.1>spark-submit 
`--class org.apache.spark.deploy.DotnetRunner ` --master local ` microsoft-spark-2.4.x-0.1.0.jar ` HelloSpark

I get this:

Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/C:/Users/j.shaer/source/repos/HelloSpark/HelloSpark/bin/Debug/netcoreapp2.1/%60--class
        at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
        at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
        at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:911)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

[BUG]: Sample From Readme Not Running on Ubuntu 18.04

Describe the bug
Running On Ubuntu 18.04 the following output / error appears:

19/04/25 06:39:40 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using <MY-PUBLIC-IP> instead (on interface eth0)
19/04/25 06:39:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/04/25 06:39:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
	at org.apache.spark.deploy.DotnetRunner$.<init>(DotnetRunner.scala:34)
	at org.apache.spark.deploy.DotnetRunner$.<clinit>(DotnetRunner.scala)
	at org.apache.spark.deploy.DotnetRunner.main(DotnetRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

To Reproduce

Using this code in Program.cs:

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Json("people.json");
            df.Show();
        }
    }
}

and this data in file people.json:

{"name":"Michael"} 
{"name":"Andy", "age":30} 
{"name":"Justin", "age":19} 

After entering the following command in the terminal

spark-submit \
--class org.apache.spark.deploy.DotnetRunner \
--master local \
./bin/Debug/netcoreapp2.1/linux-x64/publish/microsoft-spark-2.4.x-0.1.0.jar \
./bin/Debug/netcoreapp2.1/linux-x64/publish/HelloSpark

Expected behavior
No errors.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version 18.04

Additional context

See the draft publish in #13 for the steps used to set up the environment. Spark and Java are confirmed working.

error in installing on Azure Databricks

I am following the instructions to set up a new cluster with .NET for Apache Spark. I have not changed any path/config from what is shared in the document. When I try to start my cluster, it fails. Below is the message that gets logged.

bash: line 1: $'\r': command not found
bash: line 12: $'\r': command not found
bash: line 18: $'\r': command not found
bash: line 22: $'\r': command not found
bash: line 24: $'\r': command not found
bash: line 25: set: +
: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
/bin/bash: /dbfs/spark-dotnet
/install-worker.sh: Bad address

I have verified that install-worker.sh is present at dbfs/spark-dotnet/install-worker.sh.

[BUG]: DataFrameWriter PartitionBy fails - cannot find matching method class org.apache.spark.sql.DataFrameWriter.partitionBy

Describe the bug

DataFrameWriter.PartitionBy fails to execute. PartitionBy takes an array of strings as parameters (params string[]), and when the Scala code tries to find the matching method on org.apache.spark.sql.DataFrameWriter it fails: it expects either a Seq or a String[], but instead (I think, at least) the call passes one string argument for every column, so String, String.

19/04/25 21:55:50 WARN DotnetBackendHandler: cannot find matching method class org.apache.spark.sql.DataFrameWriter.partitionBy. Candidates are:
19/04/25 21:55:50 WARN DotnetBackendHandler: partitionBy(class [Ljava.lang.String;)
19/04/25 21:55:50 WARN DotnetBackendHandler: partitionBy(interface scala.collection.Seq)
...
19/04/25 21:55:50 ERROR DotnetBackendHandler: args:
19/04/25 21:55:50 ERROR DotnetBackendHandler: argType: java.lang.String, argValue: a
19/04/25 21:55:50 ERROR DotnetBackendHandler: argType: java.lang.String, argValue: b
[2019-04-25T20:55:50.0447941Z] [TEDTOP] [Error] [JvmBridge] JVM method execution failed: Nonstatic method partitionBy failed for class 13 when called with 2 arguments ([Index=1, Type=String, Value=a], [Index=2, Type=String, Value=b], )

To Reproduce

Steps to reproduce the behavior:

If you use the basic example to read a parquet file then call PartitionBy:

            SparkSession spark = SparkSession
                .Builder()
                .AppName(".NET Spark SQL basic example")
                .Config("spark.some.config.option", "some-value")
                .GetOrCreate();

            var df = spark.Read().Load(args[0]);
            df.Write()
                .PartitionBy("favorite_color")
                .SaveAsTable("people_partitioned_bucketed");

run it using:

spark-submit.cmd --class org.apache.spark.deploy.DotnetRunner `
 --master local  `
 C:\github\dotnet-spark\src\scala\microsoft-spark-2.4.x\target\microsoft-spark-2.4.x-0.1.0.jar `
 Microsoft.Spark.CSharp.Examples.exe Sql.Basic %SPARK_HOME%\examples\src\main\resources\users.parquet

Expected behavior

I would expect to be able to partition when writing

Desktop (please complete the following information):

  • OS: Windows 10

I tested with spark 2.3 and 2.4 (spark-2.3.2 + spark-2.4.1)

Microsoft.Spark nuget package shouldn't work on all TFMs

The Microsoft.Spark NuGet package has a build\Microsoft.Spark.targets file. Since this file isn't contained in a TFM folder, NuGet will think this package works on any TFM, even TFMs that aren't compatible with netstandard2.0.

See dotnet/machinelearning#370 for the same issue that came up in ML.NET.

We should move that .targets file into a build\netstandard2.0 folder.

[C#] For public APIs, check nulls for parameters.

For example,

public DataFrameWriter Format(string source)

source should be checked for null before being sent to the JVM side.

One thing to consider is whether we want to check it at the API level or at the JvmBridge level.

Checking at the API level will introduce lots of code, but is better for the user since the exception stack will be smaller. Checking at the JvmBridge level will be much less code, but the user will see the exception stack from the JvmBridge.
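
As a rough sketch (assuming the same _jvmObject.Invoke pattern shown in the DataFrameWriter.Option snippets further down this page), an API-level check could look like:

public DataFrameWriter Format(string source)
{
    // Fail fast on the .NET side with a small, readable stack trace,
    // instead of surfacing a JVM-side failure later.
    if (source == null)
    {
        throw new ArgumentNullException(nameof(source));
    }

    _jvmObject.Invoke("format", source);
    return this;
}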

System.BadImageFormatException

In Windows 10, we have created a netcoreapp2.1 project in VS2019 for the sample HelloSpark, building for x64 architecture. We use the binaries for the netcoreapp2.1 Microsoft.Spark.Worker, and we try to run on a v.2.4.1 Spark local instance with 64-Bit OpenJDK and 64-bit Python3.5.2 (Anaconda).

We get the following error:
Unhandled Exception: System.BadImageFormatException: Could not load file or assembly 'System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e' or one of its dependencies. An attempt was made to load a program with an incorrect format. ---> System.BadImageFormatException: An attempt was made to load a program with an incorrect format. (Exception from HRESULT: 0x8007000B)

Port spark/examples/src/main/python/sql to C# and verify

Python Examples

  • python/sql/streaming/structured_kafka_wordcount.py
    • CSharp
    • FSharp
  • python/sql/streaming/structured_network_wordcount.py
    • CSharp
    • FSharp
  • python/sql/streaming/structured_network_wordcount_windowed.py
    • CSharp
    • FSharp
  • python/sql/arrow.py
    • CSharp
    • FSharp
  • python/sql/basic.py
    • CSharp
    • FSharp
  • python/sql/datasource.py
    • CSharp
    • FSharp
  • python/sql/hive.py
    • CSharp
    • FSharp

Handle "kill" signal better in DaemonServer.

According to Spark's PythonRunner.scala, the daemon server receives the task runner id to kill. Currently, the server doesn't handle it and may leak a Task object. We need to introduce a cancellation token in TaskRunner and launch the Task with the token.

A more detailed description can be found in DaemonWorker.WaitForSignal().
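
A rough sketch of the cancellation-token idea (the registry class and its members below are hypothetical, not the actual TaskRunner/DaemonWorker types):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: track a CancellationTokenSource per task runner id so a
// "kill" signal from the JVM side can cancel the corresponding Task.
class TaskRunnerRegistry
{
    private readonly ConcurrentDictionary<int, CancellationTokenSource> _running =
        new ConcurrentDictionary<int, CancellationTokenSource>();

    public Task Launch(int taskRunnerId, Action<CancellationToken> runTask)
    {
        var cts = new CancellationTokenSource();
        _running[taskRunnerId] = cts;
        // Pass the token both to the work and to Task.Run so the Task is not leaked.
        return Task.Run(() => runTask(cts.Token), cts.Token)
                   .ContinueWith(_ => _running.TryRemove(taskRunnerId, out _));
    }

    public void Kill(int taskRunnerId)
    {
        if (_running.TryGetValue(taskRunnerId, out CancellationTokenSource cts))
        {
            cts.Cancel();
        }
    }
}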

[BUG]: org.apache.spark.sql.AnalysisException: Cannot resolve column name

Describe the bug
I am trying a word count program, and below is the code I am executing:

SparkSession spark = SparkSession.Builder().AppName("StructuredNetworkWordCount").GetOrCreate();

DataFrame lines = spark.Read().Text(@"E:\Hadoop\Data\Test.txt");
lines.Show();
DataFrame words = lines.Select(Explode(Split(lines["value"], " "))).Alias("word");
DataFrame wordCounts = words.GroupBy("word").Count();

wordCounts.Show();

When I execute it, I get the error below. Could you please help with this?

19/05/01 13:21:15 INFO FileScanRDD: Reading File path: file:///E:/Hadoop/Data/Test.txt, range: 0-564, partition values: [empty row]
19/05/01 13:21:15 INFO CodeGenerator: Code generated in 14.437193 ms
19/05/01 13:21:15 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1728 bytes result sent to driver
19/05/01 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 357 ms on localhost (executor driver) (1/1)
19/05/01 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/05/01 13:21:15 INFO DAGScheduler: ResultStage 0 (showString at <unknown>:0) finished in 0.471 s
19/05/01 13:21:15 INFO DAGScheduler: Job 0 finished: showString at <unknown>:0, took 0.536283 s
+--------------------+
|               value|
+--------------------+
|Of course, you ca...|
+--------------------+

19/05/01 13:21:15 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ASPLAPNAV118.xxxxx.com:64062 in memory (size: 4.0 KB, free: 366.3 MB)
19/05/01 13:21:15 ERROR DotnetBackendHandler: methods:
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.limit(int)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.cache()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public long org.apache.spark.sql.Dataset.count()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public java.lang.String org.apache.spark.sql.Dataset.toString()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public boolean org.apache.spark.sql.Dataset.isEmpty()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset,scala.collection.Seq)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset,scala.collection.Seq,java.lang.String)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset,org.apache.spark.sql.Column)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset,org.apache.spark.sql.Column,java.lang.String)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset,java.lang.String)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Dataset org.apache.spark.sql.Dataset.join(org.apache.spark.sql.Dataset)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.Column org.apache.spark.sql.Dataset.apply(java.lang.String)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public java.lang.Object org.apache.spark.sql.Dataset.collect()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.DataFrameWriter org.apache.spark.sql.Dataset.write()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public org.apache.spark.sql.catalyst.expressions.NamedExpression org.apache.spark.sql.Dataset.resolve(java.lang.String)
19/05/01 13:21:15 ERROR DotnetBackendHandler: public java.lang.Object org.apache.spark.sql.Dataset.first()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public java.lang.Object org.apache.spark.sql.Dataset.head()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public java.lang.Object org.apache.spark.sql.Dataset.head(int)
....
.....
....
19/05/01 13:21:15 ERROR DotnetBackendHandler: public native int java.lang.Object.hashCode()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public final native java.lang.Class java.lang.Object.getClass()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public final native void java.lang.Object.notify()
19/05/01 13:21:15 ERROR DotnetBackendHandler: public final native void java.lang.Object.notifyAll()
19/05/01 13:21:15 ERROR DotnetBackendHandler: args:
19/05/01 13:21:15 ERROR DotnetBackendHandler: argType: java.lang.String, argValue: word
19/05/01 13:21:15 ERROR DotnetBackendHandler: argType: scala.collection.mutable.WrappedArray.ofRef, argValue: WrappedArray()
[2019-05-01T07:51:15.6684168Z] [ASPLAPNAV118] [Error] [JvmBridge] JVM method execution failed: Nonstatic method groupBy failed for class 12 when called with 2 arguments ([Index=1, Type=String, Value=word], [Index=2, Type=String[], Value=System.String[]], )
[2019-05-01T07:51:15.6684616Z] [ASPLAPNAV118] [Error] [JvmBridge] org.apache.spark.sql.AnalysisException: Cannot resolve column name "word" among (col);
        at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
        at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:223)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.Dataset.resolve(Dataset.scala:222)
        at org.apache.spark.sql.Dataset$$anonfun$groupBy$2.apply(Dataset.scala:1622)
        at org.apache.spark.sql.Dataset$$anonfun$groupBy$2.apply(Dataset.scala:1622)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.Dataset.groupBy(Dataset.scala:1622)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:162)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.handleBackendRequest(DotnetBackendHandler.scala:102)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.channelRead0(DotnetBackendHandler.scala:29)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.channelRead0(DotnetBackendHandler.scala:24)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)

		....
		....
		
[2019-05-01T07:51:15.6941021Z] [ASPLAPNAV118] [Exception] [JvmBridge] JVM method execution failed: Nonstatic method groupBy failed for class 12 when called with 2 arguments ([Index=1, Type=String, Value=word], [Index=2, Type=String[], Value=System.String[]], )
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] args) in D:\a\1\s\src\csharp\Microsoft.Spark\Interop\Ipc\JvmBridge.cs:line 192

Unhandled Exception: System.Exception: JVM method execution failed: Nonstatic method groupBy failed for class 12 when called with 2 arguments ([Index=1, Type=String, Value=word], [Index=2, Type=String[], Value=System.String[]], )
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] args) in D:\a\1\s\src\csharp\Microsoft.Spark\Interop\Ipc\JvmBridge.cs:line 192
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object arg0, Object arg1) in D:\a\1\s\src\csharp\Microsoft.Spark\Interop\Ipc\JvmBridge.cs:line 138
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallNonStaticJavaMethod(JvmObjectReference objectId, String methodName, Object arg0, Object arg1) in D:\a\1\s\src\csharp\Microsoft.Spark\Interop\Ipc\JvmBridge.cs:line 94
   at Microsoft.Spark.Sql.DataFrame.GroupBy(String column, String[] columns) in D:\a\1\s\src\csharp\Microsoft.Spark\Sql\DataFrame.cs:line 429
   at NetworkWordCount.Program.Main(String[] args) in D:\Workspace\DotNet\DotNetSpark\NetworkWordCount\NetworkWordCount\Program.cs:line 29
19/05/01 13:21:18 INFO DotnetRunner: Closing DotnetBackend
19/05/01 13:21:18 INFO DotnetBackend: Requesting to close all call back sockets
19/05/01 13:21:18 INFO SparkContext: Invoking stop() from shutdown hook
19

Add SourceLink support

We aren't currently embedding SourceLink information into the PDB - SourceLink is what allows a debugger to automatically find and pull the source files.

https://github.com/dotnet/sourcelink has info

C:\dotnet>dotnet tool install --global sourcelink
Since you just installed the .NET Core SDK, you will need to reopen the Command Prompt window before running the tool you installed.
You can invoke the tool using the following command: sourcelink
Tool 'sourcelink' (version '3.0.0') was successfully installed.
…
C:\dotnet>sourcelink test C:\pkg\lib\netstandard2.0\Microsoft.Spark.pdb | find /i "failed"
sourcelink test failed

(Note you need 2.1 installed currently to run the tool: ctaggart/SourceLink#380)

We should re-enable running tests against netfx

In PR #10 we disabled the net461 target framework for our tests because they were failing with a strong-name failure.

 System.IO.FileLoadException : Could not load file or assembly 'Microsoft.Spark, Version=0.1.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51' or one of its dependencies. Strong name signature could not be verified. The assembly may have been tampered with, or it was delay signed but not fully signed with the correct private key. (Exception from HRESULT: 0x80131045)
Stack Trace:
 at Microsoft.Spark.UnitTest.SerDeTests.TestReadBytes()
Failed Microsoft.Spark.UnitTest.SerDeTests.TestReadAndWrite

cc: @eerhardt @rapoth @imback82 @danmosemsft

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path

Following all the steps stated in https://github.com/dotnet/spark#get-started, when executing the command
spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar HelloSpark
I am getting an error message as below:

19/04/26 15:24:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/26 15:24:29 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
        at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
        at scala.Option.getOrElse(Option.scala:138)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
        at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
        at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
        at scala.Option.map(Option.scala:163)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
        at org.apache.spark.deploy.DotnetRunner$.<init>(DotnetRunner.scala:34)
        at org.apache.spark.deploy.DotnetRunner$.<clinit>(DotnetRunner.scala)
        at org.apache.spark.deploy.DotnetRunner.main(DotnetRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Could you please help me find what I am missing and resolve this issue?

I am able to access the Spark shell using the spark-shell command.

System.Runtime issue

Hi All,

Might be a dumb question.

I am experienced in .NET but new to Git and Spark. I am trying to run Spark from .NET.

I tried to follow the step as shown here,

https://github.com/dotnet/spark/blob/master/README.md#get-started

After a couple of struggles I was able to get to the point of running spark-submit.

But when I run it, I get this error:

Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Runtime, Version=4.2.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The system cannot find the file specified.

I know it's a reference error and I know how to fix it in VS.

But since this is running under the Spark interop, I am not sure how to fix it here.

I tried installing System.Runtime from the NuGet package manager.

Thanks in advance and any help is much appreciated.

Use C# pattern matching for matching types.

There are many places where we are doing the following:
PayloadHelper.GetTypeId():

        if (type == typeof(int))
        {
            return new[] { Convert.ToByte('i') };
        }
        if (type == typeof(long))
        {
            return new[] { Convert.ToByte('g') };
        }
        ..

Use C#'s pattern matching instead if applicable.
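
For illustration, a sketch of what pattern matching over the boxed value (rather than a chain of typeof comparisons) could look like; only the 'i' and 'g' type codes come from the snippet above, the rest is assumed:

using System;

static class TypeIdSketch
{
    // Illustrative only: switch on the value's runtime type instead of comparing Type objects.
    public static byte[] GetTypeId(object value) =>
        value switch
        {
            int _ => new[] { Convert.ToByte('i') },
            long _ => new[] { Convert.ToByte('g') },
            // ... remaining supported types follow the same pattern ...
            _ => throw new NotSupportedException(value.GetType().ToString())
        };
}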

[Documentation]: Get Started Experience on Ubuntu

At the moment, the Get Started section in README.md talks about how users can get started on a Windows machine using .NET Core. We need a similar set of instructions for Ubuntu.

Success Criteria
User has clear instructions on:

  • Setting up all the pre-requisites
  • Creating a new .NET Core project
  • Obtaining the NuGet package on Ubuntu
  • Compiling the .NET app on Ubuntu
  • Running the app using spark-submit

Any plans to support Datasets?

Looking at the code and the roadmap, there is nothing about supporting Datasets in the future.
Are there any plans for it?


DataFrames are the untyped version of Datasets (Dataset<Row>).
Datasets would allow writing less code that is less prone to bugs, using Spark as one would use LINQ.

I'm talking about coding in C# using this:

Dataset<Person> dataset;
dataset.Filter(p => p.Age >= 18).Select(p => p.Name)

Over this:

DataFrame dataframe;
dataframe.Filter("Age >= 18").Select("Name")

[BUG]: DataFrameWriter Option isn't passing the key, only the value

Describe the bug

DataFrameWriter.Option takes a key and a value, but DataFrameWriter is only passing on the value, so the matching JVM method can't be found.

To Reproduce

            df.Write().Format("orc")
                .Option("orc.bloom.filter.columns", "favorite_color")
                .Option("orc.dictionary.key.threshold", 1.0)
                .Option("orc.column.encoding.direct", "name")
                .Mode(SaveMode.Overwrite)
                .Save("users_with_options.orc");

This causes:

"cannot find matching method class org.apache.spark.sql.DataFrameWriter.option. Candidates are:"

in DataFrameWriter we have:

public DataFrameWriter Option(string key, bool value)
{
    _jvmObject.Invoke("option", value);
    return this;
}

which I guess should be:

public DataFrameWriter Option(string key, bool value)
{
    _jvmObject.Invoke("option", key, value);
    return this;
}

There is one Option for each data type it supports.

Expected behavior

To be able to set individual options on the DataFrameWriter

[Attn] Support for Spark 2.4.2

Summary: You cannot use .NET for Apache Spark with Apache Spark 2.4.2

Details: Spark 2.4.2 was released on 4/23/19, and using it against microsoft.spark.2.4.x results in unexpected behavior (reported in #48, #49); the expected behavior would be an exception message such as "Unsupported spark version used: 2.4.2. Supported versions: 2.4.0, 2.4.1", which causes less confusion. This is likely due to the Scala version upgrade to 2.12 in Spark 2.4.2. Note that microsoft.spark.2.4.x is built with Scala 2.11.

There is an ongoing discussion about this (http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-4-2-tc27075.html#a27139), so depending on the outcome of the discussion, 2.4.2 may or may not be supported.

Do you want to help? While we are closely monitoring and working with the Apache Spark community in addressing this issue, you can also feel free to reply back to the main thread about any problems this issue has caused so we can avoid such mishaps in the future.

Dispose JvmBridge correctly

Currently, a JvmBridge instance is a static member of the SparkEnvironment class. Without forcing the user to call something like SparkEnvironment.JvmBridge.Dispose() in his/her application, there is no clean way to dispose of the JvmBridge, so the Scala side has to handle the disconnect gracefully (#121).

One approach to address this issue is to have a ref-counted SparkSession where JvmBridge.Dispose() is called when the last SparkSession object is disposed (see the sketch after the code below).

  • Note that the following should be handled:
using (var spark = SparkSession.Builder().GetOrCreate())
{
    // do something
}

// New JvmBridge should be instantiated with the following.
using (var spark = SparkSession.Builder().GetOrCreate())
{
    // do something
}
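
A minimal sketch of the ref-counting idea (the class and member names below are hypothetical, not actual Microsoft.Spark types):

using System.Threading;

// Hypothetical sketch: count live SparkSession instances and tear down the
// shared bridge when the last one is disposed.
static class BridgeRefCounter
{
    private static int s_liveSessions;

    // Called from the SparkSession constructor.
    public static void OnSessionCreated() => Interlocked.Increment(ref s_liveSessions);

    // Called from SparkSession.Dispose().
    public static void OnSessionDisposed()
    {
        if (Interlocked.Decrement(ref s_liveSessions) == 0)
        {
            // e.g. SparkEnvironment.JvmBridge.Dispose(); the next
            // SparkSession.Builder().GetOrCreate() would lazily create a new bridge.
        }
    }
}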

One issue with relying on SparkSession is that there are a few classes, such as SparkConf and Builder, that access the JvmBridge directly from SparkEnvironment, and these classes do not implement IDisposable (to be consistent with the Scala Spark API), so it is harder to enforce cleaning up the JvmBridge if a user does the following:

public static void Main(string[] args) {
    var conf = new SparkConf();
    // exits Main without creating SparkSession.
}

cc: @rapoth @stephentoub

Row unpickles long value incorrectly.

From Row.GetAs(int):
If the original type is "long" and its value can fit into an "int", Pickler will serialize the value as an int. Since the value is boxed, GetAs will throw an exception.

One way to address the issue is to convert to the right type in Row.Convert(). However, since this has to be done for each row (plus boxing/unboxing), it may not be the ideal solution. Also, check whether pickling can preserve the type.
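
A small illustration of the boxing behavior described above (plain C#, independent of the Spark types):

using System;

class UnboxSketch
{
    static void Main()
    {
        object boxed = 1;                        // a long whose value fits in an int arrives boxed as int
        // long direct = (long)boxed;            // would throw InvalidCastException: unboxing needs the exact type
        long converted = Convert.ToInt64(boxed); // works, but adds a per-value conversion cost
        Console.WriteLine(converted);
    }
}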

Add config option for installing .NET Core runtime in shared mode

Giving the option within the db-init.sh initialisation script to install .NET Core Runtime in shared mode would allow for smaller application bundles.

When running dotnet publish and then bundling the output into a .zip file, it produces around 200 files and an almost 30 MB .zip file. Having the .NET Core Runtime installed on the workers in shared mode would allow us to compile with the --self-contained=false flag, producing around 20 files which bundle into a .zip file of less than 1 MB.

[Documentation]: Upgrading for a newly release Apache Spark version

When a new version of Apache Spark is released, we'd have to make changes in the code base to add any new APIs, account for new protocol changes on the worker side, etc. We should document this to allow anyone to upgrade .NET for Apache Spark to newer versions.

Success Criteria

  • Clearly call out code changes with examples
  • Capture caveats one might encounter during this upgrade process

Extract common classes from src/scala/microsoft-spark-<version>.

We create multiple jars during our builds to accommodate multiple versions of Apache Spark. In the current approach, the implementation is copied from one version to another and then necessary changes are made.

An ideal approach would create a common directory and extract common classes from the duplicated code. Note that even if a class is exactly the same across versions, it cannot be pulled out into a common class if it depends on Apache Spark.

Success Criteria:

  • PR that refactors all the classes appropriately
  • Documentation for all the classes changed/added
  • Documentation on upgrading versions (if it doesn't already exist)

RDD APIs available?

Is there any way to manage RDDs and use Parallelize on the context with the NuGet package?
They all seem to be "Internal" and not accessible, or I'm missing something.
Also, I can't see anything more advanced in the samples, like what Mobius was providing.

Arrow enabled implementation and comparison with pySpark

It seems that using Arrow enabled Pandas could greatly improve pySpark efficiency. Did you compare against Arrow enabled Pandars or the version without? Is this detail something you could highlight in the comparison? As far as I know ,NET does not have proper Arrow bindings, and it would be of great benefit if it did, did you consider creating it? There's a bit of a discussion in dotnet/machinelearning#69 if you need more context.

Array/Map return types for UDF do not work correctly.

I have a scenario where, for structured streaming input, I have to write a custom function for each event/row that can return multiple rows.

It looks like UDFs only support basic return types, not lists/arrays.

Is there any workaround for this?

For example, my UDF is something like the one below, so that I can explode the result to create multiple rows.
Func<Column, Column> ToUpperList = Udf<string, string[]>((arg) =>
{
    var ret = new string[] { arg, arg.ToUpper() };
    return ret;
});

var query = inStream
    .Select(Explode(ToUpperList(Col("InputEventColumn"))))
    .WriteStream()
    .OutputMode("append")
    .Format("console")
    .Start();

Automate packaging release artifacts

The build pipeline should automatically produce the following worker binaries for release:

  • Microsoft.Spark.Worker.net461.win-x64-0.1.0.zip
  • Microsoft.Spark.Worker.netcoreapp2.1.linux-x64-0.1.0.tar.gz
  • Microsoft.Spark.Worker.netcoreapp2.1.win-x64-0.1.0.zip
