lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

License: Apache License 2.0

Python 98.89% Scala 0.94% Shell 0.17%
apache-spark distributed-computing machine-learning python scikit-learn

sparkit-learn's People

Contributors

chuckwoodraska, fulibacsi, gaborbarna, huandy, kszucs, nralat, tdna, vchollati, zseder


sparkit-learn's Issues

[RFC] Scikit interface for the `ml` and `mllib` packages

PySpark's machine learning packages are becoming more robust.
It is worth considering exposing the already implemented distributed algorithms through a scikit-learn compatible interface instead of porting the non-distributed ones.
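A rough sketch of the idea (not an existing splearn API; the wrapper class name and the assumption that Z is a Spark DataFrame with 'features' and 'label' columns are hypothetical): a thin scikit-learn style estimator could simply delegate training to the corresponding pyspark.ml estimator.

from sklearn.base import BaseEstimator, ClassifierMixin
from pyspark.ml.classification import LogisticRegression

class SparkMLLogisticRegression(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: training is delegated to pyspark.ml instead of
    re-implementing the distributed algorithm."""

    def __init__(self, maxIter=100, regParam=0.0):
        self.maxIter = maxIter
        self.regParam = regParam

    def fit(self, Z):
        # Z is assumed to be a Spark DataFrame with 'features' and 'label' columns
        self.model_ = LogisticRegression(maxIter=self.maxIter,
                                         regParam=self.regParam).fit(Z)
        return self

    def predict(self, Z):
        # returns a DataFrame with the 'prediction' column appended
        return self.model_.transform(Z).select('prediction')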

TypeError: 'Broadcast' object is unsubscriptable

We are trying to create a count vector using SparkCountVectorizer. We are using Python 2.6.6, and have therefore
replaced all the dict comprehensions in the code. We ran into the following error:

  File "base.py", line 19, in func_wrapper
    return func(*args, **kwargs)
  File "splearn_custom.py", line 176, in _count_vocab
    j_indices.append(vocabulary[feature])
TypeError: 'Broadcast' object is unsubscriptable

Here, splearn_custom.py refers to feature_extraction/text.py

Thanks in advance
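For reference, PySpark Broadcast objects are not indexable; the wrapped dictionary has to be read through their .value attribute, so the failing line in _count_vocab would presumably become something like:

vocab = vocabulary.value          # the broadcast dict, materialised on the worker
j_indices.append(vocab[feature])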

Error in creating DictRDD: Can only zip RDDs with same number of elements in each partition

I am trying to create a DictRDD as follows:

cleanedRdd=sc.sequenceFile(path="hdfs:///bdpilot/text_mining/sequence_input_with_target_v",minSplits=100)
train_rdd,test_rdd = cleanedRdd.randomSplit([0.7,0.3])
train_rdd.saveAsSequenceFile("hdfs:///bdpilot/text_mining/sequence_train_input")

train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input",minSplits=100)

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]))
train_text = train_rdd.map(lambda(x,y): y.split("~")[0])

train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)

But I get the following error when I do:

train_Z.first()
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition

I tried the following as well, but with no success:

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]), preservesPartitioning=True)
train_text = train_rdd.map(lambda(x,y): y.split("~")[0], preservesPartitioning=True)
train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
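One untested workaround, assuming DictRDD also accepts a single already-zipped RDD of tuples (as in the README's zipped example): derive both columns in one map so they can never get out of step per partition.

pairs = train_rdd.map(lambda kv: (kv[1].split("~")[0], int(kv[1].split("~")[1])))
train_Z = DictRDD(pairs, columns=('X', 'y'), bsize=50)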

Integrate skflow

An interesting option for me would also be the ability to use skflow (https://github.com/tensorflow/skflow). I guess that it would have to be done together with the skflow community.

Since skflow has a very similar interface to scikit-learn, integrating it here might be more logical than starting a completely new project.

There is actually already some effort towards this on the skflow side: tensorflow/skflow#97

Scala support?

Hi,

I wanted to know if I can use this functionality in Scala too. If yes, could you provide an example in Scala?

Syntax Errors

I am getting syntax errors in

  • splearn/base.py :
    • for name in self.transient}, line 12
  • splearn/feature_extraction/text.py :
    • vocabulary = {t: i for i, t in enumerate(accum.value)}, line 154
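Both lines use set/dict comprehensions, which Python 2.6 does not support; a 2.6-compatible rewrite of the second one would be, for example:

vocabulary = dict((t, i) for i, t in enumerate(accum.value))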

DBSCAN Import Error

I have been trying to run DBSCAN using Python from the command line, and I got this error:

ImportError: cannot import name _get_unmangled_double_vector_rdd

Can anyone help me with this?

Decision function for LinearSVC

Hi,

Can we get the confidence scores, as in scikit-learn's decision_function method?
I get the following error when I run the code:

svm_model.decision_function(Z[:,'X'])

error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/sklearn/linear_model/base.py", line 199, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "/usr/lib64/python2.6/site-packages/sklearn/utils/validation.py", line 344, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

Thanks
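A possible (untested) workaround: Z[:, 'X'] is a block-wise ArrayRDD rather than a single matrix, so the scikit-learn decision_function could be applied per block and the results collected:

# apply the fitted model's decision_function to each feature block, then collect
scores = Z[:, 'X'].map(lambda X_block: svm_model.decision_function(X_block))
scores_local = scores.collect()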

ValueError: This solver needs samples of at least 2 classes in the data

Hi,

I am using SparkLinearSVC. The code is as follows:

svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z,classes=np.unique(train_y))

and I get the following error:

File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
    return f(iterator)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
    mapper = lambda X_y: super(cls, self).fit(
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
    self.loss
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

However, I do have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and the proportions of classes 0 and 1 are 92% and 8% respectively.
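One untested idea: the traceback shows the model is fitted block by block, so every 2000-row block needs rows from both classes. If the sequence file happens to be ordered by class, a deterministic pseudo-random reordering before building the DictRDD would spread the minority class across blocks:

shuffled = train_rdd.sortBy(lambda kv: hash(kv[0]))   # deterministic shuffle by key hash
train_y = shuffled.map(lambda kv: int(kv[1].split("~")[1]))
train_text = shuffled.map(lambda kv: kv[1].split("~")[0])
train_Z = DictRDD((train_text, train_y), columns=('X', 'y'), bsize=2000)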

Avoid using reduce() with large partition outputs

RDD.reduce() collects the results of mapPartitions() and then reduces locally. In the case of stateful transformers (SparkTfidfTransformer, SparkVarianceThreshold) this causes high memory consumption on the driver side.

Use treeReduce() or treeAggregate() instead.
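A toy illustration of the difference (generic PySpark, not splearn code; sc is an existing SparkContext):

import numpy as np

# toy stand-in for a stateful transformer's per-partition statistics
rdd = sc.parallelize(range(100000), 8)
partials = rdd.mapPartitions(lambda it: [np.sum(np.fromiter(it, dtype=float))])

total = partials.reduce(lambda a, b: a + b)               # every partial is shipped to the driver
total = partials.treeReduce(lambda a, b: a + b, depth=2)  # partials are combined on the executors first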

Validate Vocabulary issue

First of all, thanks for the repo. It is very helpful.

When running splearn/feature_extraction/text.py in the pyspark shell, I am getting "AttributeError: 'SparkCountVectorizer' object has no attribute '_validate_vocabulary'" in the fit_transform method. (Does it need to be '_init_vocab' or something?)

Python 2.7
Spark 1.2.0
Scikit 0.15.2
numpy 1.9.0

Code I am using -
vect = text.SparkCountVectorizer()
result_dist = vect.fit_transform(docs).collect()

If this is not the appropriate place to post issue details, please redirect me. Thanks in advance.

Instructions for installation

Are there any instructions for using it on a Spark cluster? It looks like it is not integrated with Spark Packages. Any help is appreciated.

Py4JJavaError while fit_transform(X_rdd)

From the documentation examples, I tried to vectorize a list of texts with SparkCountVectorizer:

In:
from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkCountVectorizer
import pandas as pd

df = pd.read_csv('/Users/user/Downloads/new_labeled_corpus.csv')
#print (df.head())
X = [df['text'].values]
print (type(X))
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext
print (X_rdd)

Out:
<class 'list'>
<class 'splearn.rdd.ArrayRDD'> from PythonRDD[27] at RDD at PythonRDD.scala:43
In:
dist = SparkCountVectorizer()
result_dist = dist.fit_transform(X_rdd)  # SparseRDD

However, I got this exception:

Out:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-15-5070dde8118a> in <module>()
      1 dist = SparkCountVectorizer()
----> 2 result_dist = dist.fit_transform(X_rdd)  # SparseRDD

/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in fit_transform(self, Z)
    291         # create vocabulary
    292         X = A[:, 'X'] if isinstance(A, DictRDD) else A
--> 293         self.vocabulary_ = self._init_vocab(X)
    294 
    295         # transform according to vocabulary

/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py in _init_vocab(self, analyzed_docs)
    152             accum = analyzed_docs._rdd.context.accumulator(set(), SetAccum())
    153             analyzed_docs.foreach(
--> 154                 lambda x: accum.add(set(chain.from_iterable(x))))
    155             vocabulary = {t: i for i, t in enumerate(accum.value)}
    156         else:

/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
    172         """
    173         def bypass(*args, **kwargs):
--> 174             result = getattr(self._rdd, attr)(*args, **kwargs)
    175             if isinstance(result, RDD):
    176                 if result is self._rdd:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in foreach(self, f)
    745                 f(x)
    746             return iter([])
--> 747         self.mapPartitions(processPartition).count()  # Force evaluation
    748 
    749     def foreachPartition(self, f):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005 
   1006     def stats(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 35, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
    A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python3.5/site-packages/splearn/feature_extraction/text.py", line 289, in <lambda>
    A = Z.transform(lambda X: list(map(analyze, X)), column='X').persist()
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 204, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Any idea how to correctly vectorize X with sparkit-learn, and why this Py4JJavaError occurred?
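A probable fix (untested): X = [df['text'].values] makes the RDD contain a single element, the whole ndarray, so the analyzer ends up calling .lower() on an array instead of on a string. Parallelizing the documents themselves avoids that:

X = df['text'].tolist()                  # one string per element
X_rdd = ArrayRDD(sc.parallelize(X, 4))
result_dist = SparkCountVectorizer().fit_transform(X_rdd)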

Spark version 1.2.1, error AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
count_vector
<class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
sel_vt = SparkVarianceThreshold()
red_vt_vector = sel_vt.fit_transform(count_vector)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "base.py", line 63, in fit_transform
    return self.fit(Z, **fit_params).transform(Z)
  File "feature_selection.py", line 72, in fit
    _, _, self.variances_ = X.map(mapper).treeReduce(reducer)
  File "rdd.py", line 179, in __getattr__
    self.__class__, attr))
AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

I am using Spark 1.2.1, and I think RDD has the treeReduce method.
Would you have any idea why this error could be coming from the ArrayRDD extending BlockRDD?

Py4JJavaError while fitting a splearn.rdd.DictRDD?

From the documentation, I tried a very simple classification pipeline:

In:


X = df['text'].values
y = df['labels'].values

#<class 'numpy.ndarray'> :
X_rdd = sc.parallelize(X, 4)
#<class 'numpy.ndarray'> :
y_rdd = sc.parallelize(y, 4)


Z = DictRDD((X_rdd, y_rdd),
            columns=('text', 'labels'),
            dtype=[np.ndarray, np.ndarray])

Then:

Out:
<class 'splearn.rdd.DictRDD'> from PythonRDD[54] at RDD at PythonRDD.scala:43

Then I initialize both a distributed pipeline and a local one:

In:
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'text'])

However, I got the following exception:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-26-dfd551e98564> in <module>()
     11 
     12 local_pipeline.fit(X, y)
---> 13 dist_pipeline.fit(Z, clf__classes=np.unique(y))
     14 
     15 y_pred_local = local_pipeline.predict(X)

/usr/local/lib/python3.5/site-packages/splearn/pipeline.py in fit(self, Z, **fit_params)
    108         """
    109         Zt, fit_params = self._pre_transform(Z, **fit_params)
--> 110         self.steps[-1][-1].fit(Zt, **fit_params)
    111         return self
    112 

/usr/local/lib/python3.5/site-packages/splearn/svm/classes.py in fit(self, Z, classes)
    117         check_rdd(Z, {'X': (sp.spmatrix, np.ndarray)})
    118         self._classes_ = np.unique(classes)
--> 119         return self._spark_fit(SparkLinearSVC, Z)
    120 
    121     def predict(self, X):

/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py in _spark_fit(self, cls, Z, *args, **kwargs)
     82         )
     83         models = Z.map(mapper)
---> 84         avg = models.sum() / models.count()
     85         self.__dict__.update(avg.__dict__)
     86         return self

/usr/local/lib/python3.5/site-packages/splearn/rdd.py in bypass(*args, **kwargs)
    172         """
    173         def bypass(*args, **kwargs):
--> 174             result = getattr(self._rdd, attr)(*args, **kwargs)
    175             if isinstance(result, RDD):
    176                 if result is self._rdd:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 96, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/rdd.py", line 864, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.5/site-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Any idea why this is happening and how to solve this issue?

ImportError: pyspark home needs to be added to PYTHONPATH

During execution of the following simple code with Sparkit-learn:

from splearn.svm import SparkLinearSVC
spark=SparkLinearSVC()

I get the following error message:

ImportError: pyspark home needs to be added to PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:../

In accordance with these answers:
http://stackoverflow.com/questions/28829757/unable-to-add-spark-to-pythonpath
http://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
I have added every possible configuration of those PYTHONPATH entries to my .bashrc, but the error still occurs.

Currently my .bashrc paths look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export PATH=/home/123/anaconda2/bin:$PATH
export SPARK_HOME=/home/123/Downloads/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64/server:$PATH
export PATH=$JAVA_HOME/jre/lib/amd64:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Any possible solution? I am running this on Ubuntu 16.04 with PyCharm and spark-1.6.1-bin-hadoop2.6.
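One workaround to try (assumes the third-party findspark package is installed, e.g. via pip): let it locate Spark from SPARK_HOME at runtime, which sidesteps PyCharm not picking up the PYTHONPATH from .bashrc:

import findspark
findspark.init()   # uses $SPARK_HOME; a Spark path can also be passed explicitly

from splearn.svm import SparkLinearSVC
spark_svc = SparkLinearSVC()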

Invalid syntax issue

When running splearn/feature_extraction/text.py using the pyspark shell, I am getting an invalid syntax error at:
vocabulary = {t: i for i, t in enumerate(accum.value)}

python 2.6.6
spark 1.2.0
numpy 1.9.0
scikit 1.5.2

If this is not the appropriate place to put issue details, please redirect me. Thanks in advance.

SparseRDD multiplication

Hi,

I have generated a SparseRDD using SparkHashingVectorizer and an ArrayRDD containing the target values (1s and 0s). I want to do a matrix multiplication between these two. Is there a way to do matrix multiplication with sparse RDDs?
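There is no dedicated distributed matrix product in sparkit-learn as far as I know, but a block-wise sketch with plain PySpark and SciPy (toy data, not the splearn RDD wrappers) would look roughly like this: multiply matching blocks and add up the partial products.

import numpy as np
import scipy.sparse as sp

# toy blocks standing in for the hashed feature blocks and the matching targets
X_blocks = sc.parallelize([sp.csr_matrix(np.random.rand(5, 3)) for _ in range(4)], 2)
y_blocks = sc.parallelize([np.random.randint(0, 2, 5) for _ in range(4)], 2)

# X^T y computed block by block, then summed; requires identically blocked inputs
partials = X_blocks.zip(y_blocks).map(lambda block: block[0].T.dot(block[1]))
xty = partials.reduce(lambda a, b: a + b)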

[RFC] Integrate with numba / parakeet

Execution time could benefit from an optimizer like numba.

From numba's documentation:
"Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack."

Cassandra support?

Looks great. Can this support pulling RDDs from Cassandra with CQL?

Linear models fail with AttributeError: 'int' object has no attribute 'coef_'

Hello,

When I try to train a linear model (with SGD, SparkLinearSVC...) I get an exception:

AttributeError: 'int' object has no attribute 'coef_'

Here is a snippet which triggers the error:

from splearn.rdd import DictRDD
from splearn.svm import SparkLinearSVC
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_classes=2, n_samples=10000, n_features=2, 
                           n_informative=2, n_redundant=0, 
                           n_clusters_per_class=1, random_state=42)

X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

lr = SparkLinearSVC()

lr.fit(Z, classes=np.unique(y))

y_pred = lr.predict(Z[:, 'X'])

I get

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 111, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 866, in func
    acc = op(obj, acc)
  File "/usr/local/lib/python3.4/dist-packages/splearn/linear_model/base.py", line 27, in __add__
    model.coef_ += other.coef_
AttributeError: 'int' object has no attribute 'coef_'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

It is the same with Python 3.4.2 and Python 2.7.9. I use Spark 1.5.0.
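The traceback points at splearn/linear_model/base.py: the per-block models are combined with sum()/fold(0, operator.add), so __add__ eventually receives the integer seed 0 and dereferences 0.coef_. A hedged sketch of the kind of change that would avoid this (only the failing line is taken from the traceback; the class name and the rest of its body are assumed):

import copy

class ModelAveragingMixin(object):            # hypothetical stand-in for the splearn mixin
    def __add__(self, other):
        if isinstance(other, int) and other == 0:
            # 0 is only the seed of sum()/fold(); treat it as the identity element
            return self
        model = copy.deepcopy(self)
        model.coef_ += other.coef_
        model.intercept_ += other.intercept_
        return model

    __radd__ = __add__                        # so 0 + model works as well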
