lensacom / sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
License: Apache License 2.0
see http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.sort.html#numpy.ndarray.sort
This is NOT equivalent to ArrayRDD.sortBy() or sortByKey(), which are proxied to the underlying RDD object.
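A minimal sketch of what this could look like, assuming X_rdd is an ArrayRDD whose map applies per block: like ndarray.sort, it orders within each block only, with no global order across blocks.

import numpy as np

# block-wise ndarray-style sort; np.sort defaults to axis=-1, like ndarray.sort
sorted_rdd = X_rdd.map(lambda block: np.sort(block))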
PySpark's machine learning packages are getting more robust. It would be worthwhile to expose the already implemented distributed algorithms through an sklearn-compatible interface instead of porting the non-distributed ones.
We are trying to create a count vector using SparkCountVectorizer. We are using Python 2.6.6, and hence replaced all the dict comprehensions in the code. We ran into the following error:
File "base.py", line 19, in func_wrapper
return func(*args, **kwargs)
File "splearn_custom.py", line 176, in _count_vocab
j_indices.append(vocabulary[feature])
TypeError: 'Broadcast' object is unsubscriptable
Here, splearn_custom.py refers to feature_extraction/text.py
Thanks in advance
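A likely cause: when the vocabulary is shipped to workers as a pyspark Broadcast, the mapper has to index its payload through .value instead of subscripting the broadcast handle itself, e.g.:

j_indices.append(vocabulary.value[feature])  # if vocabulary is a Broadcast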
I am trying to create a DictRDD as follows:
cleanedRdd=sc.sequenceFile(path="hdfs:///bdpilot/text_mining/sequence_input_with_target_v",minSplits=100)
train_rdd,test_rdd = cleanedRdd.randomSplit([0.7,0.3])
train_rdd.saveAsSequenceFile("hdfs:///bdpilot/text_mining/sequence_train_input")
train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input",minSplits=100)
train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]))
train_text = train_rdd.map(lambda(x,y): y.split("~")[0])
train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
But I get the following error when I do:
train_Z.first()
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition
I tried the following as well, but with no success:
train_y = train_rdd.map(lambda (x, y): int(y.split("~")[1]), preservesPartitioning=True)
train_text = train_rdd.map(lambda (x, y): y.split("~")[0], preservesPartitioning=True)
train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
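One possible workaround (untested sketch): zip() needs both RDDs to have identical partitioning and per-partition element counts, so caching the parent before deriving the two columns, and preserving partitioning in each map, may keep them aligned.

train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input", minSplits=100).cache()  # freeze partition contents
train_y = train_rdd.map(lambda (x, y): int(y.split("~")[1]), preservesPartitioning=True)
train_text = train_rdd.map(lambda (x, y): y.split("~")[0], preservesPartitioning=True)
train_Z = DictRDD((train_text, train_y), columns=('X', 'y'), bsize=50)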
In README
It would also be interesting to have the option to use skflow
(https://github.com/tensorflow/skflow); I guess it would have to be done together with the skflow
community. Since skflow has a very similar interface to scikit-learn,
integration here might be more logical than starting a completely new project.
There is actually already some effort toward this on the skflow side:
tensorflow/skflow#97
Hi,
I wanted to know if I can use this functionality from Scala too. If yes, could you provide an example in Scala?
To increase efficiency, _idf_diag should be broadcast before transformation.
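A minimal sketch of the idea, assuming the transformer currently captures the full matrix in its mapper closure (the _idf_diag name comes from sklearn's TfidfTransformer; splearn's internals may differ):

bc_idf = Z._rdd.context.broadcast(self._idf_diag)  # shipped once per executor
mapper = lambda X: X * bc_idf.value                # dereferenced on the workers
Z.transform(mapper, column='X')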
I am getting syntax errors in
I have been trying to run DBSCAN, using Python from the command line, and I got this error:
ImportError: cannot import name _get_unmangled_double_vector_rdd
Can anyone help me with this?
Hi,
Can we get the confidence score, like we get in scikit-learn using the decision_function method?
I get the following error when I run the code:
svm_model.decision_function(Z[:,'X'])
error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/site-packages/sklearn/linear_model/base.py", line 199, in decision_function
X = check_array(X, accept_sparse='csr')
File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 344, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
Thanks
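A likely explanation: sklearn's decision_function expects a local matrix, while Z[:, 'X'] is a distributed block RDD, so the input validation chokes on it. One possible approach (untested) is to apply the fitted model block-wise and collect:

import numpy as np

scores = Z[:, 'X'].map(svm_model.decision_function)  # per-block score arrays
local_scores = np.concatenate(scores.collect())      # gather on the driver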
Hi,
I am using SparkLinearSVC. The code is as follows:
svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z,classes=np.unique(train_y))
and I get the following error:
File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
return f(iterator)
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
mapper = lambda X_y: super(cls, self).fit(
File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
self.loss
File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
" class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
whereas I have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and the percentages of classes 0 and 1 are 92% and 8%, respectively.
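A possible explanation: splearn fits one sklearn model per block and averages them, so a single block containing only class 0 is enough to trigger this error. Randomly shuffling the rows before building the DictRDD may spread the 8% minority class across all blocks; in this illustrative sketch, raw_rdd stands for the original (text, label) RDD:

import random

shuffled = raw_rdd.map(lambda row: (random.random(), row)) \
                  .sortByKey() \
                  .map(lambda kv: kv[1])  # rows now in random order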
see http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.var.html#numpy.ndarray.var
There is already an implementation in https://github.com/lensacom/sparkit-learn/blob/master/splearn/feature_selection/variance_threshold.py#L62, but it should use ndarray.var instead of mean_variance_axis.
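A minimal sketch of a block-wise mapper built on ndarray.var (illustrative only; combining blocks still needs a pooled count/mean/variance reducer):

import numpy as np

def mapper(X):
    # per-block statistics along axis 0: row count, feature means, feature variances
    return X.shape[0], X.mean(axis=0), X.var(axis=0)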
RDD.reduce() collects the results of mapPartitions() and then reduces locally. In the case of stateful transformers (SparkTfidfTransformer, SparkVarianceThreshold) this causes large memory consumption on the driver side.
Use treeReduce() or treeAggregate() instead.
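For illustration, assuming mapper emits per-partition statistics and reducer is commutative and associative, the change would look something like:

stats = X.map(mapper).treeReduce(reducer, depth=2)  # aggregates in a tree instead of on the driver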
First of all, thanks for the repo; it is very helpful.
When running splearn/feature_extraction/text.py in the pyspark shell, I am getting "AttributeError: 'SparkCountVectorizer' object has no attribute '_validate_vocabulary'" in the fit_transform method. (Does it need to be '_init_vocab' or something?)
Python 2.7
Spark 1.2.0
Scikit 0.15.2
numpy 1.9.0
Code I am using -
vect = text.SparkCountVectorizer()
result_dist = vect.fit_transform(docs).collect()
If it is inappropriate to post the issue details here, please redirect me. Thanks in advance.
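(A possible lead: _validate_vocabulary was added to scikit-learn's vectorizers after the 0.15.2 release listed above, so this may simply be a version mismatch; upgrading scikit-learn might resolve it.)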
Are there any instructions for using it on a Spark cluster? It looks like it's not integrated with Spark packages. Any help is appreciated.
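One possible way to ship it to executors (the zip path below is hypothetical; sparkit-learn is a plain Python package, so it can be distributed like any other):

from pyspark import SparkContext

sc = SparkContext(appName="splearn-demo")
sc.addPyFile("/path/to/splearn.zip")  # zip built from the installed splearn package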
From the documentation examples, I tried to vectorize a list of texts with SparkCountVectorizer:
In:
from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkCountVectorizer
import pandas as pd
df = pd.read_csv('/Users/user/Downloads/new_labeled_corpus.csv')
#print (df.head())
X = [df['text'].values]
print (type(X))
X_rdd = ArrayRDD(sc.parallelize(X, 4)) # sc is SparkContext
print (X_rdd)
Out:
<class 'list'>
<class 'splearn.rdd.ArrayRDD'> from PythonRDD[27] at RDD at PythonRDD.scala:43
In:
dist = SparkCountVectorizer()
result_dist = dist.fit_transform(X_rdd) # SparseRDD
However, I got this exception:
Out:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-15-5070dde8118a> in <module>()
1 dist = SparkCountVectorizer()
----> 2 result_dist = dist.fit_transform(X_rdd) # SparseRDD
/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in fit_transform(self, Z)
291 # create vocabulary
292 X = A[:, 'X'] if isinstance(A, DictRDD) else A
--> 293 self.vocabulary_ = self._init_vocab(X)
294
295 # transform according to vocabulary
/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in _init_vocab(self, analyzed_docs)
152 accum = analyzed_docs._rdd.context.accumulator(set(), SetAccum())
153 analyzed_docs.foreach(
--> 154 lambda x: accum.add(set(chain.from_iterable(x))))
155 vocabulary = {t: i for i, t in enumerate(accum.value)}
156 else:
/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
172 """
173 def bypass(*args, **kwargs):
--> 174 result = getattr(self._rdd, attr)(*args, **kwargs)
175 if isinstance(result, RDD):
176 if result is self._rdd:
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in foreach(self, f)
745 f(x)
746 return iter([])
--> 747 self.mapPartitions(processPartition).count() # Force evaluation
748
749 def foreachPartition(self, f):
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in count(self)
1002 3
1003 """
-> 1004 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1005
1006 def stats(self):
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
993 6.0
994 """
--> 995 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
996
997 def count(self):
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
867 # zeroValue provided to each partition is unique from the one provided
868 # to the final reduce call
--> 869 vals = self.mapPartitions(func).collect()
870 return reduce(op, vals, zeroValue)
871
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 35, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Any idea how to correctly vectorize X with sparkit-learn, and why this Py4JJavaError occurred?
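A likely fix: the traceback suggests each "document" reaching the analyzer is a whole ndarray, because X = [df['text'].values] wraps the entire column in a one-element list. Parallelizing the documents themselves should hand the analyzer one string at a time:

X = df['text'].astype(str).tolist()  # one string per document
X_rdd = ArrayRDD(sc.parallelize(X, 4))
result_dist = SparkCountVectorizer().fit_transform(X_rdd)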
countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
count_vector
<class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
sel_vt = SparkVarianceThreshold()
red_vt_vector = sel_vt.fit_transform(count_vector)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "base.py", line 63, in fit_transform
    return self.fit(Z, **fit_params).transform(Z)
  File "feature_selection.py", line 72, in fit
    _, _, self.variances = X.map(mapper).treeReduce(reducer)
  File "rdd.py", line 179, in __getattr__
    self.__class__, attr))
AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce
I am using Spark 1.2.1, and I think the RDD has the method treeReduce.
Would you have any idea why this error could be popping out of the ArrayRDD extending BlockRDD?
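A possible explanation: PySpark's RDD only gained treeReduce around Spark 1.3 (it lived in MLlib's Scala-side RDDFunctions before that), so on 1.2.1 the BlockRDD attribute proxy finds nothing to forward to. If that is right, upgrading Spark, or patching the fit to fall back to a plain reduce, e.g.

_, _, self.variances = X.map(mapper).reduce(reducer)  # hypothetical 1.2-compatible fallback

may work around it.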
From the documentation, I tried a very simple classification pipeline:
In:
X = df['text'].values
y = df['labels'].values
#<class 'numpy.ndarray'> :
X_rdd = sc.parallelize(X, 4)
#<class 'numpy.ndarray'> :
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('text', 'labels'),
            dtype=[np.ndarray, np.ndarray])
Then:
Out:
<class 'splearn.rdd.DictRDD'> from PythonRDD[54] at RDD at PythonRDD.scala:43
Then I initialize both a distributed pipeline and a local one:
In:
local_pipeline = Pipeline((
('vect', HashingVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
('vect', SparkHashingVectorizer()),
('tfidf', SparkTfidfTransformer()),
('clf', SparkLinearSVC())
))
local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'text'])
However, I got the following exception:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-26-dfd551e98564> in <module>()
11
12 local_pipeline.fit(X, y)
---> 13 dist_pipeline.fit(Z, clf__classes=np.unique(y))
14
15 y_pred_local = local_pipeline.predict(X)
/usr/local/lib/python3.5/site-packages/splearn/pipeline.py in fit(self, Z, **fit_params)
108 """
109 Zt, fit_params = self._pre_transform(Z, **fit_params)
--> 110 self.steps[-1][-1].fit(Zt, **fit_params)
111 return self
112
/usr/local/lib/python3.5/site-packages/splearn/svm/classes.py in fit(self, Z, classes)
117 check_rdd(Z, {'X': (sp.spmatrix, np.ndarray)})
118 self._classes_ = np.unique(classes)
--> 119 return self._spark_fit(SparkLinearSVC, Z)
120
121 def predict(self, X):
/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py in _spark_fit(self, cls, Z, *args, **kwargs)
82 )
83 models = Z.map(mapper)
---> 84 avg = models.sum() / models.count()
85 self.__dict__.update(avg.__dict__)
86 return self
/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
172 """
173 def bypass(*args, **kwargs):
--> 174 result = getattr(self._rdd, attr)(*args, **kwargs)
175 if isinstance(result, RDD):
176 if result is self._rdd:
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
993 6.0
994 """
--> 995 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
996
997 def count(self):
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
867 # zeroValue provided to each partition is unique from the one provided
868 # to the final reduce call
--> 869 vals = self.mapPartitions(func).collect()
870 return reduce(op, vals, zeroValue)
871
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 96, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
acc = op(obj, acc)
File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
acc = op(obj, acc)
File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Any idea why this is happening and how to solve this issue?
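A possible reading of the trace: RDD.sum() is implemented as fold(0, operator.add), so the very first addition is model + 0, and the __add__ at splearn/linear_model/base.py line 27 assumes the other operand always carries coef_. An illustrative guard (the class name below is invented for the sketch, not the library's):

import copy

class AveragedModelMixin(object):
    """Illustrative only: tolerate the integer zeroValue injected by fold(0, add)."""
    def __add__(self, other):
        if isinstance(other, int):  # the fold's zeroValue
            return self
        model = copy.deepcopy(self)
        model.coef_ += other.coef_
        model.intercept_ += other.intercept_
        return model
    __radd__ = __add__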
The scipy.sparse and ndarray interfaces are not fully compatible. It would be reasonable to separate them into their own classes.
During execution of the following simple code with Sparkit-Learn:
from splearn.svm import SparkLinearSVC
spark=SparkLinearSVC()
I get the following error message:
ImportError: pyspark home needs to be added to PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:../
In accordance with these answers:
http://stackoverflow.com/questions/28829757/unable-to-add-spark-to-pythonpath
http://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
I have added every possible configuration of those PYTHONPATHs to my .bashrc, but the error still occurs.
Currently my .bashrc paths look like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export PATH=/home/123/anaconda2/bin:$PATH
export SPARK_HOME=/home/123/Downloads/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64/server:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Any possible solution? I am running this on Ubuntu 16.04, PyCharm and spark-1.6.1-bin-hadoop2.6.
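One possible cause: if the variables work in a shell but not here, PyCharm is probably not sourcing .bashrc. Setting the paths in code, for instance with findspark (assuming it is installed), sidesteps that; the Spark location below is the one from the question:

import findspark
findspark.init("/home/123/Downloads/spark-1.6.1-bin-hadoop2.6")

from splearn.svm import SparkLinearSVC
spark = SparkLinearSVC()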
Like sklearn does.
When running splearn/feature_extraction/text.py using the pyspark shell, I am getting invalid syntax at:
vocabulary = {t: i for i, t in enumerate(accum.value)}
python 2.6.6
spark 1.2.0
numpy 1.9.0
scikit 0.15.2
If it is inappropriate to post the issue details here, please redirect me. Thanks in advance.
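Dict comprehensions require Python 2.7+; on 2.6 the equivalent generator-based form is:

vocabulary = dict((t, i) for i, t in enumerate(accum.value))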
Hi,
I have generated a SparseRDD using SparkHashingVectorizer, and an ArrayRDD containing the target values (1s and 0s). I want to do a matrix multiplication between these two. Is there a way to do matrix multiplication of sparse RDDs?
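One possible block-wise approach (untested; assumes both RDDs share the same block sizes and partitioning, with X_rdd the SparseRDD and y_rdd the ArrayRDD): each pair of blocks contributes a partial X_block.T * y_block, and the partials sum to the full product.

pairs = X_rdd._rdd.zip(y_rdd._rdd)                   # pair up matching blocks
partials = pairs.map(lambda Xy: Xy[0].T.dot(Xy[1]))  # per-block partial products
result = partials.sum()                              # full X.T * y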
Execution time could benefit from an optimizer like Numba.
From Numba's documentation:
"Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack."
see http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.std.html#numpy.ndarray.std
see ArrayRDD.mean()
Looks great. Can this support pulling RDDs from Cassandra with CQL?
Spark 1.4 now supports Python 3: https://spark.apache.org/releases/spark-release-1-4-0.html
Hello,
When I try to train a linear model (with SGD, SparkLinearSVC...) I get an exception:
AttributeError: 'int' object has no attribute 'coef_'
Here is a snippet which triggers the error:
from splearn.rdd import DictRDD
from splearn.svm import SparkLinearSVC
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_classes=2, n_samples=10000, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])
lr = SparkLinearSVC()
lr.fit(Z, classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
I get
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/worker.py", line 111, in main
process()
File "/usr/lib/spark/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/lib/spark/python/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/lib/spark/python/pyspark/rdd.py", line 866, in func
acc = op(obj, acc)
File "/usr/local/lib/python3.4/dist-packages/splearn/linear_model/base.py", line 27, in __add__
model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
It is the same with Python 3.4.2 and Python 2.7.9. I use Spark 1.5.0.
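This looks like the same fold(0, add) seed issue as above: RDD.sum() starts from an integer 0, and the model __add__ at splearn/linear_model/base.py line 27 does not expect it. A speculative runtime patch, without editing the installed package, might look like this (the mixin name is an assumption; check splearn/linear_model/base.py for the real one):

from splearn.linear_model import base as lm_base

_orig_add = lm_base.SparkLinearModelMixin.__add__  # assumed class name, may differ

def _tolerant_add(self, other):
    if isinstance(other, int):  # zeroValue injected by RDD.fold(0, add)
        return self
    return _orig_add(self, other)

lm_base.SparkLinearModelMixin.__add__ = _tolerant_add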