
alteryx / automated-manual-comparison


Automated vs Manual Feature Engineering Comparison. Implemented using Featuretools.

Home Page: https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 98.95% Python 0.63% HTML 0.41% Shell 0.01%

automated-manual-comparison's People

Contributors

gsheni, thehomebrewnerd, willkoehrsen


automated-manual-comparison's Issues

app_types is null

In the Automated Loan Repayment notebook, the following code:

app_types = {}

# Handle the Boolean variables:
for col in app:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean

# Remove the `TARGET`
del app_types['TARGET']

print('There are {} Boolean variables in the application data.'.format(len(app_types)))

The result should be 0, not 32.
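One possible reading of this report, sketched under assumptions: the check `app[col].dtype == float` only matches float64 columns, so if the two-valued columns are loaded as integers the dictionary stays empty, and `del app_types['TARGET']` would then raise a `KeyError`. The toy dataframe, the `two_valued_columns` helper, and the `"Boolean"` placeholder string (standing in for `vtypes.Boolean`) are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the `app` dataframe; column names are hypothetical.
app = pd.DataFrame({
    "FLAG_A": [0, 1, 0, 1],              # two values, int64
    "FLAG_B": [0.0, 1.0, 1.0, 0.0],      # two values, float64
    "TARGET": [0, 1, 0, 0],              # two values, int64
    "AMOUNT": [10.5, 20.0, 30.25, 40.0], # four values, float64
})


def two_valued_columns(df, dtype_filter=None):
    """Return names of columns with exactly two distinct non-null values.

    If dtype_filter is given, keep only columns whose dtype equals it.
    """
    cols = []
    for col in df:
        if df[col].nunique() != 2:
            continue
        if dtype_filter is not None and df[col].dtype != dtype_filter:
            continue
        cols.append(col)
    return cols


# Same condition as the notebook snippet: only float columns qualify.
float_only = two_valued_columns(app, dtype_filter=float)

# Dtype-agnostic variant: integer flags are found as well.
any_dtype = two_valued_columns(app)

# Placeholder string instead of featuretools' vtypes.Boolean (assumption).
app_types = {col: "Boolean" for col in any_dtype}

# pop with a default does not raise if TARGET was never added.
app_types.pop("TARGET", None)

print("There are {} Boolean variables in the toy data.".format(len(app_types)))
```

Comparing `float_only` with `any_dtype` on the real data would show whether the dtype condition is what makes the count differ between environments.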

What are those 8 Labels?

Dear @WillKoehrsen,

Thanks for your amazing article; I really appreciate this work.
But I got confused at the end of the Metrics section, where you said: "One customer, 8 different labels. It seems like it might be difficult to predict this customer's spending given her fluctuating total spending! We'll have to see if Featuretools is up to the task."

I don't understand the 8 different labels you mention here. Is this related to binarizing the label?
Could you please explain?
Thank you very much!

Allen

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout

When I ran cell In [5] of your notebook Automated Engine Life.ipynb, I got error messages like this:

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/home/xuzhang/anaconda3/envs/featuretools/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
distributed.comm.tcp - WARNING - Closing dangling stream in

Any advice? Thanks

Performance problems

Hi,

Thanks for your article. Automated feature engineering is very promising.
I am running the Loan Repayment script right now to compare it with my own engineered features.
I am very curious about the results.

What hardware do you recommend to compute the result in one day (as mentioned in the article)? This is my current progress:
Elapsed: 18:50:30 | Remaining: 22358:53:57 | Progress: 0%| | Calculated: 3/3563 chunks

ft.py uses one job by default, and any value other than 1 crashes the script.
I am using an r4.2xlarge AWS EC2 instance, but with one job it cannot use more than one core.
Even with all eight cores, it would still take weeks.

Can you recommend some specs to speed this up?

Best regards
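For reference, the remaining-time figure in the progress line can be reproduced approximately with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes every chunk takes as long as the first three did, and the eight-core figure assumes perfect linear scaling, which the report does not confirm.

```python
from datetime import timedelta

# Values read from the progress line in the report.
elapsed = timedelta(hours=18, minutes=50, seconds=30)
chunks_done, chunks_total = 3, 3563

# Average time per completed chunk, extrapolated to the chunks left.
per_chunk = elapsed / chunks_done
remaining = per_chunk * (chunks_total - chunks_done)

remaining_hours = remaining.total_seconds() / 3600
remaining_days = remaining.total_seconds() / 86400

# Idealized estimate if the work split evenly across eight cores (assumption).
ideal_days_8_cores = remaining_days / 8

print(f"Per chunk: {per_chunk}")
print(f"Remaining on one core: {remaining_hours:.0f} hours ({remaining_days:.0f} days)")
print(f"Remaining on eight cores, ideal scaling: {ideal_days_8_cores:.0f} days")
```

The single-core estimate lands within a few minutes of the 22358:53:57 shown in the progress bar, so the bar is most likely extrapolating in a similar way.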

AttributeError: 'functools.partial' object has no attribute '__name__'

I ran the notebook Featuretools on Dask.ipynb on my local machine, but an error occurred when b.compute() ran.
Ten feature matrices had been generated when the error happened.
Here is the error output:

tornado.application - ERROR - Exception in callback <bound method BokehTornado._keep_alive of <bokeh.server.tornado.BokehTornado object at 0x7f9488d69d68>>
Traceback (most recent call last):
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/ioloop.py", line 1208, in _run
    self._next_timeout = self.io_loop.time()
  File "/home/lili/anaconda3/lib/python3.6/site-packages/bokeh/server/tornado.py", line 514, in _keep_alive
    c.send_ping()
  File "/home/lili/anaconda3/lib/python3.6/site-packages/bokeh/server/connection.py", line 46, in send_ping
    self._socket.ping(codecs.encode(str(self._ping_count), "utf-8"))
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 367, in ping
    self.ws_connection.write_ping(data)
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 882, in write_ping
    self._write_frame(True, 0x9, data)
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 846, in _write_frame
    return self.stream.write(frame)
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/iostream.py", line 525, in write
    future = self._set_read_callback(callback)
  File "/home/lili/anaconda3/lib/python3.6/site-packages/tornado/iostream.py", line 1058, in _check_closed
    size = 128 * 1024
tornado.iostream.StreamClosedError: Stream is closed
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-82e469b60feb> in <module>()
      1 overall_start = timer()
----> 2 b.compute()
      3 overall_end = timer()
      4 
      5 print(f"Total Time Elapsed: {round(overall_end - overall_start, 2)} seconds.")

~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    154         dask.base.compute
    155         """
--> 156         (result,) = compute(self, traverse=False, **kwargs)
    157         return result
    158 

~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    393     keys = [x.__dask_keys__() for x in collections]
    394     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 395     results = schedule(dsk, keys, **kwargs)
    396     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    397 

~/anaconda3/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
   2198             try:
   2199                 results = self.gather(packed, asynchronous=asynchronous,
-> 2200                                       direct=direct)
   2201             finally:
   2202                 for f in futures.values():

~/anaconda3/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1567             return self.sync(self._gather, futures, errors=errors,
   1568                              direct=direct, local_worker=local_worker,
-> 1569                              asynchronous=asynchronous)
   1570 
   1571     @gen.coroutine

~/anaconda3/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    643             return future
    644         else:
--> 645             return sync(self.loop, func, *args, **kwargs)
    646 
    647     def __repr__(self):

~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098     def set_result(self, key, result):
-> 1099         """Sets the result for ``key`` and attempts to resume the generator."""
   1100         self.results[key] = result
   1101         if self.yield_point is not None and self.yield_point.is_ready():

~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105             except:
   1106                 future_set_exc_info(self.future, sys.exc_info())
-> 1107             self.yield_point = None
   1108             self.run()
   1109 

~/anaconda3/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1443                             six.reraise(type(exception),
   1444                                         exception,
-> 1445                                         traceback)
   1446                     if errors == 'skip':
   1447                         bad_keys.add(key)

~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

~/anaconda3/lib/python3.6/site-packages/dask/bag/core.py in reify()
   1547 def reify(seq):
   1548     if isinstance(seq, Iterator):
-> 1549         seq = list(seq)
   1550     if seq and isinstance(seq[0], Iterator):
   1551         seq = list(map(list, seq))

~/anaconda3/lib/python3.6/site-packages/dask/bag/core.py in map_chunk()
   1707     else:
   1708         for a in zip(*args):
-> 1709             yield f(*a)
   1710 
   1711     # Check that all iterators are fully exhausted

<ipython-input-25-75ac088d04b8> in feature_matrix_from_entityset()
     11                                                  n_jobs = 1,
     12                                                  verbose = True,
---> 13                                                  chunk_size = es['app'].df.shape[0])
     14 
     15     feature_matrix.to_csv('data/fm/p%d_fm.csv' % es_dict['num'], index = True)

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix()
    256                                                  cutoff_df_time_var=cutoff_df_time_var,
    257                                                  target_time=target_time,
--> 258                                                  pass_columns=pass_columns)
    259 
    260     feature_matrix = pd.concat(feature_matrix)

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in linear_calculate_chunks()
    518                                           cutoff_df_time_var,
    519                                           target_time, pass_columns,
--> 520                                           backend=backend)
    521         feature_matrix.append(_feature_matrix)
    522         # Do a manual garbage collection in case objects from calculate_chunk

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_chunk()
    340                                            ids,
    341                                            precalculated_features=precalculated_features,
--> 342                                            training_window=window)
    343 
    344             id_name = _feature_matrix.index.name

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/utils.py in wrapped()
     32         def wrapped(*args, **kwargs):
     33             if save_progress is None:
---> 34                 r = method(*args, **kwargs)
     35             else:
     36                 time = args[0].to_pydatetime()

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calc_results()
    314                                                     precalculated_features=precalculated_features,
    315                                                     ignored=all_approx_feature_set,
--> 316                                                     profile=profile)
    317             return matrix
    318 

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/pandas_backend.py in calculate_all_features()
    194 
    195                     handler = self._feature_type_handler(test_feature)
--> 196                     result_frame = handler(group, input_frames)
    197 
    198                     output_frames_type = self.feature_tree.output_frames_type(test_feature)

~/anaconda3/lib/python3.6/site-packages/featuretools/computational_backends/pandas_backend.py in _calculate_agg_features()
    421                 funcname = func
    422                 if callable(func):
--> 423                     funcname = func.__name__
    424 
    425                 to_agg[variable_id].append(func)

AttributeError: 'functools.partial' object has no attribute '__name__'
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50460 remote=tcp://127.0.0.1:45867>
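The final frame of the traceback reads `func.__name__` on a `functools.partial` object. Partial objects do not get a `__name__` attribute automatically, which is enough to reproduce the `AttributeError` in isolation. The sketch below is not the Featuretools code: `percentile` and `p90` are hypothetical names, and the traceback does not show which aggregation actually supplied the partial. It illustrates two generic ways around the error, assigning a name to the partial or reading the name defensively.

```python
from functools import partial


def percentile(values, q):
    """Simple index-based percentile of a non-empty sequence (illustrative helper)."""
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(q * (len(ordered) - 1))))
    return ordered[index]


# A partial object wraps the function but has no __name__ of its own.
p90 = partial(percentile, q=0.9)
had_name = hasattr(p90, "__name__")

try:
    p90.__name__
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Option 1: read the name defensively, falling back to the wrapped function.
safe_name = getattr(p90, "__name__", p90.func.__name__)

# Option 2: give the partial an explicit name before passing it on.
p90.__name__ = "percentile_90"
named = p90.__name__

print(had_name, error_message, safe_name, named)
```

Option 2 changes only the caller's code, so it may be the easier workaround when the library reading `__name__` cannot be modified.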

