lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

License: Apache License 2.0

Python 98.89% Scala 0.94% Shell 0.17%
apache-spark distributed-computing machine-learning python scikit-learn

sparkit-learn's People

Contributors

chuckwoodraska, fulibacsi, gaborbarna, huandy, kszucs, nralat, tdna, vchollati, zseder


sparkit-learn's Issues

[RFC] Scikit interface for the `ml` and `mllib` packages

PySpark's machine learning packages are becoming more robust.
It is worth considering exposing the already implemented distributed algorithms through a scikit-learn compatible interface instead of porting the non-distributed ones.
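A rough sketch of the idea (not an existing splearn API; the wrapper class name and the assumption that Z is a Spark DataFrame with 'features' and 'label' columns are hypothetical): a thin scikit-learn style estimator could simply delegate training to the corresponding pyspark.ml estimator.

from sklearn.base import BaseEstimator, ClassifierMixin
from pyspark.ml.classification import LogisticRegression

class SparkMLLogisticRegression(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: training is delegated to pyspark.ml instead of
    re-implementing the distributed algorithm."""

    def __init__(self, maxIter=100, regParam=0.0):
        self.maxIter = maxIter
        self.regParam = regParam

    def fit(self, Z):
        # Z is assumed to be a Spark DataFrame with 'features' and 'label' columns
        self.model_ = LogisticRegression(maxIter=self.maxIter,
                                         regParam=self.regParam).fit(Z)
        return self

    def predict(self, Z):
        # returns a DataFrame with the 'prediction' column appended
        return self.model_.transform(Z).select('prediction')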

TypeError: 'Broadcast' object is unsubscriptable

We are trying to create a count vector using SparkCountVectorizer. We are using Python 2.6.6, and have therefore
replaced all the dict comprehensions in the code. We ran into the following error:

  File "base.py", line 19, in func_wrapper
    return func(*args, **kwargs)
  File "splearn_custom.py", line 176, in _count_vocab
    j_indices.append(vocabulary[feature])
TypeError: 'Broadcast' object is unsubscriptable

Here, splearn_custom.py refers to feature_extraction/text.py

Thanks in advance
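For reference, PySpark Broadcast objects are not indexable; the wrapped dictionary has to be read through their .value attribute, so the failing line in _count_vocab would presumably become something like:

vocab = vocabulary.value          # the broadcast dict, materialised on the worker
j_indices.append(vocab[feature])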

Error in creating DictRDD: Can only zip RDDs with same number of elements in each partition

I am trying to create a DictRDD as follows:

cleanedRdd=sc.sequenceFile(path="hdfs:///bdpilot/text_mining/sequence_input_with_target_v",minSplits=100)
train_rdd,test_rdd = cleanedRdd.randomSplit([0.7,0.3])
train_rdd.saveAsSequenceFile("hdfs:///bdpilot/text_mining/sequence_train_input")

train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input",minSplits=100)

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]))
train_text = train_rdd.map(lambda(x,y): y.split("~")[0])

train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)

But I get the following error when I do:

train_Z.first()
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition

I tried the following as well, but with no success:

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]), preservesPartitioning=True)
train_text = train_rdd.map(lambda(x,y): y.split("~")[0], preservesPartitioning=True)
train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
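One untested workaround, assuming DictRDD also accepts a single already-zipped RDD of tuples (as in the README's zipped example): derive both columns in one map so they can never get out of step per partition.

pairs = train_rdd.map(lambda kv: (kv[1].split("~")[0], int(kv[1].split("~")[1])))
train_Z = DictRDD(pairs, columns=('X', 'y'), bsize=50)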

Integrate skflow

An interesting option for me would also be the ability to use skflow (https://github.com/tensorflow/skflow). I guess that it would have to be done together with the skflow community.

Since skflow has a very similar interface to scikit-learn, integrating it here might be more logical than starting a completely new project.

There is actually already some effort towards this on the skflow side: tensorflow/skflow#97

Scala support?

Hi,

I wanted to know if I can use this functionality in Scala too. If yes, could you provide an example in Scala?

Syntax Errors

I am getting syntax errors in

  • splearn/base.py :
    • for name in self.transient}, line 12
  • splearn/feature_extraction/text.py :
    • vocabulary = {t: i for i, t in enumerate(accum.value)}, line 154
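Both lines use set/dict comprehensions, which Python 2.6 does not support; a 2.6-compatible rewrite of the second one would be, for example:

vocabulary = dict((t, i) for i, t in enumerate(accum.value))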

DBSCAN Import Error

I have been trying to run DBSCAN using Python from the command line, and I got this error:

ImportError: cannot import name _get_unmangled_double_vector_rdd

Can anyone help me with this?

Decision function for LinearSVC

Hi,

Can we get the confidence scores, as in scikit-learn's decision_function method?
I get the following error when I run the code:

svm_model.decision_function(Z[:,'X'])

error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/sklearn/linear_model/base.py", line 199, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 344, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

Thanks
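A possible (untested) workaround: Z[:, 'X'] is a block-wise ArrayRDD rather than a single matrix, so the scikit-learn decision_function could be applied per block and the results collected:

# apply the fitted model's decision_function to each feature block, then collect
scores = Z[:, 'X'].map(lambda X_block: svm_model.decision_function(X_block))
scores_local = scores.collect()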

ValueError: This solver needs samples of at least 2 classes in the data

Hi,

I am using SparkLinearSVC. The code is as follows:

svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z,classes=np.unique(train_y))

and I get the following error:

File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
    return f(iterator)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
    mapper = lambda X_y: super(cls, self).fit(
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
    self.loss
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

However, I do have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and the proportions of classes 0 and 1 are 92% and 8% respectively.
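One untested idea: the traceback shows the model is fitted block by block, so every 2000-row block needs rows from both classes. If the sequence file happens to be ordered by class, a deterministic pseudo-random reordering before building the DictRDD would spread the minority class across blocks:

shuffled = train_rdd.sortBy(lambda kv: hash(kv[0]))   # deterministic shuffle by key hash
train_y = shuffled.map(lambda kv: int(kv[1].split("~")[1]))
train_text = shuffled.map(lambda kv: kv[1].split("~")[0])
train_Z = DictRDD((train_text, train_y), columns=('X', 'y'), bsize=2000)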

Avoid using reduce() with large partition outputs

RDD.reduce() collects the results of mapPartitions() and then reduces locally. In the case of stateful transformers (SparkTfidfTransformer, SparkVarianceThreshold) this causes high memory consumption on the driver side.

Use treeReduce() or treeAggregate() instead.
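A toy illustration of the difference (generic PySpark, not splearn code; sc is an existing SparkContext):

import numpy as np

# toy stand-in for a stateful transformer's per-partition statistics
rdd = sc.parallelize(range(100000), 8)
partials = rdd.mapPartitions(lambda it: [np.sum(np.fromiter(it, dtype=float))])

total = partials.reduce(lambda a, b: a + b)               # every partial is shipped to the driver
total = partials.treeReduce(lambda a, b: a + b, depth=2)  # partials are combined on the executors first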

Validate Vocabulary issue

First of all, thanks for the repo. It is very helpful.

When running splearn/feature_extraction/text.py in the pyspark shell, I am getting "AttributeError: 'SparkCountVectorizer' object has no attribute '_validate_vocabulary'" in the fit_transform method. (Does it need to be '_init_vocab' or something?)

Python 2.7
Spark 1.2.0
Scikit 0.15.2
numpy 1.9.0

Code I am using -
vect = text.SparkCountVectorizer()
result_dist = vect.fit_transform(docs).collect()

If this is not the appropriate place to post issue details, please redirect me. Thanks in advance.

Instructions for installation

Are there any instructions for using it on a Spark cluster? It looks like it is not integrated with Spark Packages. Any help is appreciated.

Py4JJavaError while fit_transform(X_rdd)

From the documentation examples, I tried to vectorize a list of texts with SparkCountVectorizer:

In:
from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkCountVectorizer
import pandas as pd

df = pd.read_csv('/Users/user/Downloads/new_labeled_corpus.csv')
#print (df.head())
X = [df['text'].values]
print (type(X))
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext
print (X_rdd)

Out:
<class 'list'>
<class 'splearn.rdd.ArrayRDD'> from PythonRDD[27] at RDD at PythonRDD.scala:43
In:
dist = SparkCountVectorizer()
result_dist = dist.fit_transform(X_rdd)  # SparseRDD

However, I got this exception:

Out:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-15-5070dde8118a> in <module>()
      1 dist = SparkCountVectorizer()
----> 2 result_dist = dist.fit_transform(X_rdd)  # SparseRDD

/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in fit_transform(self, Z)
    291         # create vocabulary
    292         X = A[:, 'X'] if isinstance(A, DictRDD) else A
--> 293         self.vocabulary_ = self._init_vocab(X)
    294 
    295         # transform according to vocabulary

/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in _init_vocab(self, analyzed_docs)
    152             accum = analyzed_docs._rdd.context.accumulator(set(), SetAccum())
    153             analyzed_docs.foreach(
--> 154                 lambda x: accum.add(set(chain.from_iterable(x))))
    155             vocabulary = {t: i for i, t in enumerate(accum.value)}
    156         else:

/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
    172         """
    173         def bypass(*args, **kwargs):
--> 174             result = getattr(self._rdd, attr)(*args, **kwargs)
    175             if isinstance(result, RDD):
    176                 if result is self._rdd:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in foreach(self, f)
    745                 f(x)
    746             return iter([])
--> 747         self.mapPartitions(processPartition).count()  # Force evaluation
    748 
    749     def foreachPartition(self, f):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005 
   1006     def stats(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 35, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
    A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
    A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Any idea how to correctly vectorize X with sparkit-learn, and why this Py4JJavaError occurred?
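A probable fix (untested): X = [df['text'].values] makes the RDD contain a single element, the whole ndarray, so the analyzer ends up calling .lower() on an array instead of on a string. Parallelizing the documents themselves avoids that:

X = df['text'].tolist()                  # one string per element
X_rdd = ArrayRDD(sc.parallelize(X, 4))
result_dist = SparkCountVectorizer().fit_transform(X_rdd)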

Spark version 1.2.1, error AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
count_vector
<class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
sel_vt = SparkVarianceThreshold()
red_vt_vector = sel_vt.fit_transform(count_vector)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "base.py", line 63, in fit_transform
    return self.fit(Z, **fit_params).transform(Z)
  File "feature_selection.py", line 72, in fit
    _, _, self.variances_ = X.map(mapper).treeReduce(reducer)
  File "rdd.py", line 179, in __getattr__
    self.__class__, attr))
AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

I am using Spark 1.2.1, and I think RDD has the treeReduce method.
Would you have any idea why this error could be coming from the ArrayRDD extending BlockRDD?

Py4JJavaError while fitting a splearn.rdd.DictRDD?

From the documentation, I tried a very simple classification pipeline:

In:


X = df['text'].values
y = df['labels'].values

#<class 'numpy.ndarray'> :
X_rdd = sc.parallelize(X, 4)
#<class 'numpy.ndarray'> :
y_rdd = sc.parallelize(y, 4)


Z = DictRDD((X_rdd, y_rdd),
            columns=('text', 'labels'),
            dtype=[np.ndarray, np.ndarray])

Then:

Out:
<class 'splearn.rdd.DictRDD'> from PythonRDD[54] at RDD at PythonRDD.scala:43

Then I initialize both a distributed pipeline and a local one:

In:
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'text'])

However, I got the following exception:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-26-dfd551e98564> in <module>()
     11 
     12 local_pipeline.fit(X, y)
---> 13 dist_pipeline.fit(Z, clf__classes=np.unique(y))
     14 
     15 y_pred_local = local_pipeline.predict(X)

/usr/local/lib/python3.5/site-packages/splearn/pipeline.py in fit(self, Z, **fit_params)
    108         """
    109         Zt, fit_params = self._pre_transform(Z, **fit_params)
--> 110         self.steps[-1][-1].fit(Zt, **fit_params)
    111         return self
    112 

/usr/local/lib/python3.5/site-packages/splearn/svm/classes.py in fit(self, Z, classes)
    117         check_rdd(Z, {'X': (sp.spmatrix, np.ndarray)})
    118         self._classes_ = np.unique(classes)
--> 119         return self._spark_fit(SparkLinearSVC, Z)
    120 
    121     def predict(self, X):

/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py in _spark_fit(self, cls, Z, *args, **kwargs)
     82         )
     83         models = Z.map(mapper)
---> 84         avg = models.sum() / models.count()
     85         self.__dict__.update(avg.__dict__)
     86         return self

/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
    172         """
    173         def bypass(*args, **kwargs):
--> 174             result = getattr(self._rdd, attr)(*args, **kwargs)
    175             if isinstance(result, RDD):
    176                 if result is self._rdd:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 96, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Any idea why this is happening and how to solve this issue?

ImportError: pyspark home needs to be added to PYTHONPATH

During execution of the following simple code with Sparkit-learn:

from splearn.svm import SparkLinearSVC
spark=SparkLinearSVC()

I get the following error message:

ImportError: pyspark home needs to be added to PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:../

In accordance with these answers:
http://stackoverflow.com/questions/28829757/unable-to-add-spark-to-pythonpath
http://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
I have added every possible configuration of those PYTHONPATH entries to my .bashrc, but the error still occurs.

Currently my .bashrc paths look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export PATH=/home/123/anaconda2/bin:$PATH
export SPARK_HOME=/home/123/Downloads/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64/server:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Any possible solution? I am running this on Ubuntu 16.04 with PyCharm and spark-1.6.1-bin-hadoop2.6.
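One workaround to try (assumes the third-party findspark package is installed, e.g. via pip): let it locate Spark from SPARK_HOME at runtime, which sidesteps PyCharm not picking up the PYTHONPATH from .bashrc:

import findspark
findspark.init()   # uses $SPARK_HOME; a Spark path can also be passed explicitly

from splearn.svm import SparkLinearSVC
spark_svc = SparkLinearSVC()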

Invalid syntax issue

When running splearn/feature_extraction/text.py using the pyspark shell, I am getting an invalid syntax error at:
vocabulary = {t: i for i, t in enumerate(accum.value)}

python 2.6.6
spark 1.2.0
numpy 1.9.0
scikit 1.5.2

If this is not the appropriate place to put issue details, please redirect me. Thanks in advance.

SparseRDD multiplication

Hi,

I have generated a SparseRDD using SparkHashingVectorizer and an ArrayRDD containing the target values (1s and 0s). I want to do a matrix multiplication between these two. Is there a way to do matrix multiplication with sparse RDDs?
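There is no dedicated distributed matrix product in sparkit-learn as far as I know, but a block-wise sketch with plain PySpark and SciPy (toy data, not the splearn RDD wrappers) would look roughly like this: multiply matching blocks and add up the partial products.

import numpy as np
import scipy.sparse as sp

# toy blocks standing in for the hashed feature blocks and the matching targets
X_blocks = sc.parallelize([sp.csr_matrix(np.random.rand(5, 3)) for _ in range(4)], 2)
y_blocks = sc.parallelize([np.random.randint(0, 2, 5) for _ in range(4)], 2)

# X^T y computed block by block, then summed; requires identically blocked inputs
partials = X_blocks.zip(y_blocks).map(lambda block: block[0].T.dot(block[1]))
xty = partials.reduce(lambda a, b: a + b)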

[RFC] Integrate with numba / parakeet

Execution time could benefit from an optimizer like numba.

From numba's documentation:
"Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack."

Cassandra support?

Looks great. Can this support pulling RDDs from Cassandra with CQL?

Linear models fail with AttributeError: 'int' object has no attribute 'coef_'

Hello,

When I try to train a linear model (with SGD, SparkLinearSVC...) I get an exception:

AttributeError: 'int' object has no attribute 'coef_'

Here is a snippet which triggers the error:

from splearn.rdd import DictRDD
from splearn.svm import SparkLinearSVC
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_classes=2, n_samples=10000, n_features=2, 
                           n_informative=2, n_redundant=0, 
                           n_clusters_per_class=1, random_state=42)

X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

lr = SparkLinearSVC()

lr.fit(Z, classes=np.unique(y))

y_pred = lr.predict(Z[:, 'X'])

I get

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 111, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 866, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.4/dist-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

It is the same with Python 3.4.2 and Python 2.7.9. I use Spark 1.5.0.
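The traceback points at splearn/linear_model/base.py: the per-block models are combined with sum()/fold(0, operator.add), so __add__ eventually receives the integer seed 0 and dereferences 0.coef_. A hedged sketch of the kind of change that would avoid this (only the failing line is taken from the traceback; the class name and the rest of its body are assumed):

import copy

class ModelAveragingMixin(object):            # hypothetical stand-in for the splearn mixin
    def __add__(self, other):
        if isinstance(other, int) and other == 0:
            # 0 is only the seed of sum()/fold(); treat it as the identity element
            return self
        model = copy.deepcopy(self)
        model.coef_ += other.coef_
        model.intercept_ += other.intercept_
        return model

    __radd__ = __add__                        # so 0 + model works as well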
