alteryx / featuretools

An open source Python library for automated feature engineering

Home Page: https://www.featuretools.com

License: BSD 3-Clause "New" or "Revised" License

Python 99.92% Makefile 0.08%
feature-engineering machine-learning data-science automated-machine-learning automl python scikit-learn automated-feature-engineering

featuretools's Introduction

Featuretools

"One of the holy grails of machine learning is to automate more and more of the feature engineering process." ― Pedro Domingos, A Few Useful Things to Know about Machine Learning



Featuretools is a Python library for automated feature engineering. See the documentation for more information.

Installation

Install with pip

python -m pip install featuretools

or from the Conda-forge channel on conda:

conda install -c conda-forge featuretools

Add-ons

You can install add-ons individually or all at once by running

python -m pip install "featuretools[complete]"

Update checker - Receive automatic notifications of new Featuretools releases

python -m pip install "featuretools[updater]"

Premium Primitives - Use Premium Primitives, including Natural Language Processing primitives:

python -m pip install "featuretools[premium]"

TSFresh Primitives - Use 60+ primitives from tsfresh within Featuretools

python -m pip install "featuretools[tsfresh]"

Dask Support - Use Dask DataFrames to create EntitySets or run DFS with n_jobs > 1

python -m pip install "featuretools[dask]"

SQL - Automatic EntitySet generation from relational data stored in a SQL database:

python -m pip install "featuretools[sql]"

Example

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

>> import featuretools as ft
>> es = ft.demo.load_mock_customer(return_entityset=True)
>> es.plot()

Featuretools can automatically create a single table of features for any "target dataframe":

>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
>> feature_matrix.head(5)
            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                  ...
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]

We now have a feature vector for each customer that can be used for machine learning. See the documentation on Deep Feature Synthesis for more examples.

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to define your own custom primitives.
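
As a rough sketch, here is what a custom transform primitive can look like with the make_trans_primitive helper that also appears in the issues below; the primitive name and function are illustrative, and the exact imports and dfs keywords vary across Featuretools versions:

import featuretools as ft
import featuretools.variable_types as vtypes
from featuretools.primitives import make_trans_primitive

# Illustrative custom primitive: absolute deviation from the column mean.
def abs_diff_from_mean(values):
    return (values - values.mean()).abs()

AbsDiffFromMean = make_trans_primitive(abs_diff_from_mean,
                                       input_types=[vtypes.Numeric],
                                       return_type=vtypes.Numeric)

es = ft.demo.load_mock_customer(return_entityset=True)
fm, defs = ft.dfs(entityset=es,
                  target_entity="customers",
                  trans_primitives=[AbsDiffFromMean])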

Demos

Predict Next Purchase

Repository | Notebook

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our demos page.

Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available here.

Support

The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

  1. For usage questions, use Stack Overflow with the featuretools tag.
  2. For bugs, issues, or feature requests start a Github issue.
  3. For discussion regarding development on the core library, use Slack.
  4. For everything else, the core developers can be reached by email at [email protected]

Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015.

BibTeX entry:

@inproceedings{kanter2015deep,
  author    = {James Max Kanter and Kalyan Veeramachaneni},
  title     = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages     = {1--10},
  year      = {2015},
  organization={IEEE}
}

Built at Alteryx

Featuretools is an open source project maintained by Alteryx. To see the other open source projects we’re working on visit Alteryx Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.

Alteryx Open Source

featuretools's People

Contributors

alexjwang, allisonportis, bschreck, charlesbradshaw, christopherbunn, cjstadler, ctduffy, davesque, dependabot[bot], dvreed77, frances-h, github-actions[bot], glentennis, gsheni, jeff-hernandez, jeremyliweishih, joshblum, kmax12, machinefl, ozzied, petejanuszewski1, rogertangos, rwedge, sbadithe, seth-rothschild, systemshift, tamargrey, thehomebrewnerd, tuethan1999, willkoehrsen


featuretools's Issues

EntitySet._related_instances doesn't respect time_last if no instances provided

Calling _related_instances with no instance ids but with a cutoff time, with start and final as the same entity id:

EntitySet._related_instances(entity_id, entity_id,
                             instance_ids=None, time_last=some_actual_time)

does not respect the cutoff time, and instead just returns all instances of the entity
I'm not exactly sure if we intended this, but I doubt it.

It's not currently a problem because we only call EntitySet._related_instances() from EntitySet.get_pandas_data_slice(), which is only called in 2 places, both times with instance_ids provided. However, I noticed it in my PR #91 when calling get_pandas_data_slice a second time inside of PandasBackend with no instances. I patched it up by specifying all the instances of the relevant entity, but we should fix this so it doesn't happen in the future.

Thanks to @Seth-Rothschild for finding this problem.

DFS stacks transform primitives in an incomplete manner

When creating transform features on an entity, DFS applies transform primitives to the entity's identity features to create new features. At this step, DFS could either use these new features as inputs for more transform features (up to max_depth or some other arbitrary depth) or stop making transform features.

Currently, DFS stacks transform features on top of each other, but only if the transform primitive being stacked on comes before the other in the trans_primitives list. So if trans_primitives=[Absolute, Percentile], DFS will only stack Percentile on Absolute, creating a Percentile(Absolute(base_feature)) feature, but if trans_primitives=[Percentile, Absolute], DFS will only stack Absolute on Percentile, creating an Absolute(Percentile(base_feature)) feature.

We should decide if we want transform features to stack like this.
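
For reference, a minimal sketch of the ordering effect described above, assuming the class-based primitives API of this era (the demo entityset is just for illustration):

import featuretools as ft
from featuretools.primitives import Absolute, Percentile

es = ft.demo.load_mock_customer(return_entityset=True)

# With this ordering, DFS can stack Percentile on Absolute,
# yielding features like PERCENTILE(ABSOLUTE(transactions.amount)) ...
fm1, defs1 = ft.dfs(entityset=es, target_entity="customers",
                    agg_primitives=[],
                    trans_primitives=[Absolute, Percentile],
                    max_depth=2)

# ... while reversing the list yields ABSOLUTE(PERCENTILE(...)) instead.
fm2, defs2 = ft.dfs(entityset=es, target_entity="customers",
                    agg_primitives=[],
                    trans_primitives=[Percentile, Absolute],
                    max_depth=2)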

datetime.date in utils.wrangle._check_time_type

Should the utils.wrangle._check_time_type include checking for datetime.date? If so, would you mind if I made a PR for it?

I apologize if this is a naive question. I recently began working with a not-quite-complete demo notebook from someone in Kalyan's group, and this is my first time working with Featuretools. Right now, the demo is throwing ValueErrors after sending datetime.date objects to _check_time_type, and I'm not sure whether this error is with the demo or with the library. (It does seem to run without errors when I make the suggested fix.)

If this is better asked on StackOverflow, please let me know, and I'll move the question there.

Allow strings to specify primitives to be used by Deep Feature Synthesis

Currently, you have to import primitives before passing them to dfs, as follows:

import featuretools as ft
from featuretools.primitives import Day, Percentile, CumMean, Count, Min, Max, Trend
ft.dfs(entityset=entityset,
         target_entity='customers',
         trans_primitives=[Day, Percentile, CumMean],
         agg_primitives=[Count, Min, Max, Trend])

We should support passing the primitives in as strings:

import featuretools as ft
ft.dfs(entityset=entityset,
         target_entity='customers',
         trans_primitives=['Day', 'Percentile', 'CumMean'],
         agg_primitives=['Count', 'Min', 'Max', 'Trend'])

Option for `calculate_feature_matrix` to fail quietly

Is it possible to have calculate_feature_matrix fail quietly? I'm thinking of something similar to how many R functions have na.action = na.pass. The reason is that when you are scoring new data, say in production, and there is some weird data in it, you may want a warning to come up, but you don't want the whole process to stop.

Basically, instead of erroring out, it would fill in the column with NA and continue on.
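
One way to approximate this behavior today, as an illustrative sketch (not a real Featuretools API): compute features one at a time and fill any failures with NaN. This is much slower than a single call and is only meant to show the requested semantics:

import numpy as np
import pandas as pd
import featuretools as ft

def calculate_feature_matrix_quietly(features, entityset, **kwargs):
    """Illustrative wrapper: compute each feature separately, and fill
    any feature that errors with NaN instead of aborting the whole run."""
    good, failed = [], []
    for f in features:
        try:
            good.append(ft.calculate_feature_matrix([f], entityset=entityset, **kwargs))
        except Exception as err:
            print("warning: feature %s failed: %s" % (f.get_name(), err))
            failed.append(f)
    fm = pd.concat(good, axis=1) if good else pd.DataFrame()
    for f in failed:
        fm[f.get_name()] = np.nan  # fill the failed column with NA and continue
    return fm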

Getting the internal function from a primitive

I've been looking at pulling out the internal function from a primitive. Having access to the function allows for making some very powerful custom primitives:

  1. It allows for compositions like Mean(CumSum(Absolute())) to be a depth 1 primitive.
  2. It gives easy access to changing the input and output types of primitives through the make_x_primitive API (e.g. make NumHour behave like Hour but with a vtypes.Numeric output)
  3. It allows for mild modifications of existing primitives as new primitives (e.g. return 3*Mean + 1 or Mean(3*array)+1).

The problem is that it seems kind of hard to get at that base function. In python3, I can do

from featuretools.primitives import Absolute
abs = Absolute.get_function([]) 

which calculates abs([-1, 0, 1]) to be array([1, 0, 1]). If we used the Sum primitive instead and did

sum = Sum.get_function([])

then sum([-1, 0, 1]) would throw an error. The underlying function for sum wants a pd.Series, so sum(pd.Series([-1, 0, 1])) would get the desired result of 0. I haven't managed to find a way to hack together the base function for Hour or CumSum, or in Python 2 at all.

Would it be worth putting together a get_underlying_function helper which smooths over these type discrepancies for the user? A sketch follows.
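
Something like the following sketch, which coerces list/array input to a pandas Series before calling the base function (the instantiation details vary by version):

import pandas as pd

def get_underlying_function(primitive):
    """Illustrative sketch: fetch a primitive's base function and coerce
    list/array input to a pandas Series, since some implementations
    (like Sum's) expect Series methods on their input."""
    func = primitive.get_function([])  # mirrors the hack above; varies by version

    def smoothed(values, *args, **kwargs):
        if not isinstance(values, pd.Series):
            values = pd.Series(values)
        return func(values, *args, **kwargs)

    return smoothed

# Following the Sum example above:
# sum_func = get_underlying_function(Sum)
# sum_func([-1, 0, 1])  # 0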

Custom primitive docstrings don't make it into the html documentation

Bug/Feature Request Description

In the API reference page of the docs (generated by docs/source/api_reference.rst), custom primitives have their description shown as simply "alias of python.path.to.primitive" rather than their actual description. I'm referring specifically to Min, which we define using make_agg_primitive().



Error while adding a relationship

I am trying to extract features automatically with Featuretools from synthetic data I have created. An error is raised when I add a particular relationship, es.add_relationship(ips_flows) (line 12 in synthetic_flows.py). The error says that the key ip does not exist, while it does exist in the input data frame.

Is it a bug? Or do the linked columns need to meet some specific constraints?

Context:
- installed with pip install featuretools
- run with Python 3.5

synthetic_flows.zip

featuretools for time series data?

Hi all,

I was wondering, does it make sense to use Featuretools for time series data?

For example:

   Time                 DayOfWeek  Target
0  2014-01-01 01:41:50  1          0
1  2014-01-01 02:06:50  2          0
2  2014-01-01 02:31:50  3          0
3  2014-01-01 02:56:50  4          0
4  2014-01-01 03:21:50  5          1

Multiprocessing feature is not supported on MacOS

Bug/Feature Request Description

Too bad that the multiprocessing feature is not supported on macOS:

File "/Users/don/anaconda/lib/python3.6/site-packages/featuretools/synthesis/dfs.py", line 213, in dfs
verbose=verbose)
File "/Users/don/anaconda/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 238, in calculate_feature_matrix
dask_kwargs=dask_kwargs or {})
File "/Users/don/anaconda/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 708, in parallel_calculate_chunks
workers = n_jobs_to_workers(n_jobs)
File "/Users/don/anaconda/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 790, in n_jobs_to_workers
cpus = len(os.sched_getaffinity(0))
AttributeError: module 'os' has no attribute 'sched_getaffinity'



Provide example of many-to-many relationship

In situations where we want to run Deep Feature Synthesis on a dataset that has a many-to-many relationship, we need to create an extra association entity to handle the mapping. Featuretools supports this use case, but it could be clearer to users of the library if we provided an example, either in the documentation or in a demo like the ones we have on the website.
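
For reference, a minimal sketch of the association-entity pattern using this era's API (the students/classes/enrollments tables are invented for illustration):

import pandas as pd
import featuretools as ft

# Students and classes have a many-to-many relationship, so we add an
# "enrollments" association entity that maps one to the other.
students = pd.DataFrame({"student_id": [1, 2, 3]})
classes = pd.DataFrame({"class_id": [10, 20]})
enrollments = pd.DataFrame({"enrollment_id": [100, 101, 102, 103],
                            "student_id": [1, 1, 2, 3],
                            "class_id": [10, 20, 10, 20]})

es = ft.EntitySet("school")
es.entity_from_dataframe("students", students, index="student_id")
es.entity_from_dataframe("classes", classes, index="class_id")
es.entity_from_dataframe("enrollments", enrollments, index="enrollment_id")
es.add_relationship(ft.Relationship(es["students"]["student_id"],
                                    es["enrollments"]["student_id"]))
es.add_relationship(ft.Relationship(es["classes"]["class_id"],
                                    es["enrollments"]["class_id"]))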

DFS errors by stacking on expanding features

Supplying the NMostCommon and NUnique features to DFS errors, as it stacks NUnique on NMostCommon, which is an expanding feature that returns lists as output. The resulting values supplied to NUnique look like:

pd.Series([['v1', 'v2', 'v3'], ['v2', 'v3', 'v1']])

We don't have a good way of handling the stacking of expanding features in pandas_backend, so instead we should just not allow them in DFS.

Generate new features by crossing features in a table?

Feature crossing is a very common technique for finding nonlinear relationships in a dataset.
Can I use Featuretools to generate new features by crossing features in a table? If so, how?



Update conda package

Bug/Feature Request Description

The Featuretools package on the featuretools conda channel still isn't up to v0.2.0; could you kindly update it?

A package for Windows would be a nice bonus.

Why calculate these features?

Hi featuretools team:

Why does Featuretools calculate features like the one below, e.g. std(SK_ID_PREV)?

ft.Relationship(
    self.__es["previous_application"]["SK_ID_PREV"],
    self.__es["installments_payments"]["SK_ID_PREV"]
)

Thanks

Issue with `dask==0.15.2`

Since dask isn't currently a requirement, featuretools won't check at install time which version you're using. This causes the line

import featuretools as ft

to error because of the optional requirement in load_flight when using dask==0.15.2.

There are two possible resolutions:

  1. If we're planning on going back to dask for parallelization we can put the appropriate line in requirements.txt
  2. If we're not planning on requiring dask, we can check the version before import

Null values for cutoff time

Sometimes we may have null values for the cutoff time. Consider the case where we are trying to generate features to predict whether or not we will sell a product to a customer. We want to exclude all data after the sale. So for customers we sold to, there is one point in time where they became a 'sold to' customer. But for customers we did not sell to, there is no single point where they became 'not a sale'. In this case, we want to include all the data. We could represent this with a null value in the cutoff time df, as sketched below.

Forgive me if I am missing something, but I am not aware of a way to do this currently in Featuretools. Won't null values in the cutoff time throw an error?
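
A sketch of the desired (not currently supported) cutoff time frame, where NaT would mean "no cutoff, use all data":

import pandas as pd

cutoff_times = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": [pd.Timestamp("2014-01-01"),  # became a 'sold to' customer here
             pd.NaT,                      # never sold to: include everything
             pd.Timestamp("2014-02-15")],
}, columns=["customer_id", "time"])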

Problem with `new_entity_secondary_time_index` in dfs

For the dataset I'm using, I have

es.normalize_entity('coordinates', 'plows', 'truck_name', 
                    make_time_index=True,
                    make_secondary_time_index={'date_fixed': []})

so that I can automatically calculate the first and last time that a plow is on the road. If I give the secondary time index a new name with new_entity_secondary_time_index:

es.normalize_entity('coordinates', 'plows', 'truck_name', 
                    make_time_index=True,
                    make_secondary_time_index={'date_fixed': []},
                    new_entity_secondary_time_index='last_coordinates_time')

then I get a KeyError in dfs. This seems to come from the line in _filter_and_sort:

    618             if time_last is not None and not df.empty:
--> 619                 mask = df[secondary_time_index] >= time_last
    620                 second_time_index_columns = self.secondary_time_index[secondary_time_index]
    621                 df.loc[mask, second_time_index_columns] = np.nan

Passing additional data through calculate feature matrix

It would be useful to be able to pass labels or other extra columns to calculate_feature_matrix and have them be added to the feature matrix that is returned. Proposed implementation: include these additional columns in the cutoff_time argument. The first two columns of cutoff_time will always be the instance ids and cutoff times. Any extra columns will be treated as additional data columns and included in the final feature matrix.
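
A sketch of what the proposed call could look like (features and es are assumed to already exist):

import pandas as pd
import featuretools as ft

# Proposed usage: the first two columns are instance ids and cutoff times;
# the extra "label" column would be passed through to the feature matrix.
cutoff_time = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": pd.to_datetime(["2014-01-01", "2014-01-15", "2014-02-01"]),
    "label": [True, False, True],
}, columns=["customer_id", "time", "label"])

# fm = ft.calculate_feature_matrix(features, entityset=es, cutoff_time=cutoff_time)
# fm would then contain the computed features plus the passed-through "label" column.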

Error during install on windows 10

I use a Windows computer and the installation fails, as I wrote to you on Twitter.
Microsoft Windows [Version 10.0.16299.125]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\WINDOWS\System32>python
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

Microsoft Windows [Version 10.0.16299.125]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\WINDOWS\System32>pip install featuretools
Requirement already satisfied: featuretools in c:\users\cde3\anaconda3\lib\site-packages
Requirement already satisfied: pympler>=0.5 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: tqdm>=4.19.2 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: numpy>=1.14.0 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: cloudpickle>=0.4.0 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: pyyaml>=3.12 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: s3fs>=0.1.2 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: scipy>=1.0.0 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: pandas>=0.20.3 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: future>=0.16.0 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: toolz>=0.8.2 in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: dask[complete] in c:\users\cde3\anaconda3\lib\site-packages (from featuretools)
Requirement already satisfied: boto3 in c:\users\cde3\anaconda3\lib\site-packages (from s3fs>=0.1.2->featuretools)
Requirement already satisfied: pytz>=2011k in c:\users\cde3\anaconda3\lib\site-packages (from pandas>=0.20.3->featuretools)
Requirement already satisfied: python-dateutil>=2 in c:\users\cde3\anaconda3\lib\site-packages (from pandas>=0.20.3->featuretools)
Collecting distributed>=1.10 (from dask[complete]->featuretools)
Using cached distributed-1.20.2-py2.py3-none-any.whl
Requirement already satisfied: partd>=0.3.5 in c:\users\cde3\anaconda3\lib\site-packages (from dask[complete]->featuretools)
Requirement already satisfied: s3transfer<0.2.0,>=0.1.10 in c:\users\cde3\anaconda3\lib\site-packages (from boto3->s3fs>=0.1.2->featuretools)
Requirement already satisfied: botocore<1.9.0,>=1.8.38 in c:\users\cde3\anaconda3\lib\site-packages (from boto3->s3fs>=0.1.2->featuretools)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in c:\users\cde3\anaconda3\lib\site-packages (from boto3->s3fs>=0.1.2->featuretools)
Requirement already satisfied: six>=1.5 in c:\users\cde3\anaconda3\lib\site-packages (from python-dateutil>=2->pandas>=0.20.3->featuretools)
Requirement already satisfied: psutil in c:\users\cde3\anaconda3\lib\site-packages (from distributed>=1.10->dask[complete]->featuretools)
Requirement already satisfied: click>=6.6 in c:\users\cde3\anaconda3\lib\site-packages (from distributed>=1.10->dask[complete]->featuretools)
Collecting tblib (from distributed>=1.10->dask[complete]->featuretools)
Using cached tblib-1.3.2-py2.py3-none-any.whl
Collecting tornado>=4.5.1 (from distributed>=1.10->dask[complete]->featuretools)
Using cached tornado-4.5.3-cp35-cp35m-win_amd64.whl
Collecting sortedcontainers (from distributed>=1.10->dask[complete]->featuretools)
Using cached sortedcontainers-1.5.9-py2.py3-none-any.whl
Collecting msgpack-python (from distributed>=1.10->dask[complete]->featuretools)
Using cached msgpack-python-0.5.4.tar.gz
Collecting zict>=0.1.3 (from distributed>=1.10->dask[complete]->featuretools)
Using cached zict-0.1.3-py2.py3-none-any.whl
Requirement already satisfied: locket in c:\users\cde3\anaconda3\lib\site-packages (from partd>=0.3.5->dask[complete]->featuretools)
Requirement already satisfied: docutils>=0.10 in c:\users\cde3\anaconda3\lib\site-packages (from botocore<1.9.0,>=1.8.38->boto3->s3fs>=0.1.2->featuretools)
Requirement already satisfied: heapdict in c:\users\cde3\anaconda3\lib\site-packages (from zict>=0.1.3->distributed>=1.10->dask[complete]->featuretools)
Building wheels for collected packages: msgpack-python
Running setup.py bdist_wheel for msgpack-python ... error
Complete output from command c:\users\cde3\anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\cde3\\AppData\\Local\\Temp\\pip-build-83u6dm4q\\msgpack-python\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\cde3\AppData\Local\Temp\tmpr1pyilt8pip-wheel- --python-tag cp35:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.5
creating build\lib.win-amd64-3.5\msgpack
copying msgpack\exceptions.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\fallback.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\_version.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\__init__.py -> build\lib.win-amd64-3.5\msgpack
running build_ext
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\cde3\AppData\Local\Temp\pip-build-83u6dm4q\msgpack-python\setup.py", line 136, in
'License :: OSI Approved :: Apache Software License',
File "c:\users\cde3\anaconda3\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 955, in run_commands
self.run_command(cmd)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\site-packages\wheel\bdist_wheel.py", line 179, in run
self.run_command('build')
File "c:\users\cde3\anaconda3\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\distutils\command\build.py", line 135, in run
self.run_command(cmd_name)
File "c:\users\cde3\anaconda3\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\distutils\command\build_ext.py", line 307, in run
force=self.force)
File "c:\users\cde3\anaconda3\lib\distutils\ccompiler.py", line 1031, in new_compiler
return klass(None, dry_run, force)
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 282, in init
CygwinCCompiler.init (self, verbose, dry_run, force)
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 157, in init
self.dll_libraries = get_msvcr()
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 86, in get_msvcr
raise ValueError("Unknown MS Compiler version %s " % msc_ver)
ValueError: Unknown MS Compiler version 1900


Failed building wheel for msgpack-python
Running setup.py clean for msgpack-python
Failed to build msgpack-python
Installing collected packages: tblib, tornado, sortedcontainers, msgpack-python, zict, distributed
Found existing installation: tornado 4.4.1
DEPRECATION: Uninstalling a distutils installed project (tornado) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling tornado-4.4.1:
Successfully uninstalled tornado-4.4.1
Running setup.py install for msgpack-python ... error
Complete output from command c:\users\cde3\anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\cde3\\AppData\\Local\\Temp\\pip-build-83u6dm4q\\msgpack-python\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\cde3\AppData\Local\Temp\pip-mgmr7_3f-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.5
creating build\lib.win-amd64-3.5\msgpack
copying msgpack\exceptions.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\fallback.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\_version.py -> build\lib.win-amd64-3.5\msgpack
copying msgpack\__init__.py -> build\lib.win-amd64-3.5\msgpack
running build_ext
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\cde3\AppData\Local\Temp\pip-build-83u6dm4q\msgpack-python\setup.py", line 136, in
'License :: OSI Approved :: Apache Software License',
File "c:\users\cde3\anaconda3\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 955, in run_commands
self.run_command(cmd)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\site-packages\setuptools-27.2.0-py3.5.egg\setuptools\command\install.py", line 61, in run
File "c:\users\cde3\anaconda3\lib\distutils\command\install.py", line 539, in run
self.run_command('build')
File "c:\users\cde3\anaconda3\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\distutils\command\build.py", line 135, in run
self.run_command(cmd_name)
File "c:\users\cde3\anaconda3\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "c:\users\cde3\anaconda3\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "c:\users\cde3\anaconda3\lib\distutils\command\build_ext.py", line 307, in run
force=self.force)
File "c:\users\cde3\anaconda3\lib\distutils\ccompiler.py", line 1031, in new_compiler
return klass(None, dry_run, force)
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 282, in init
CygwinCCompiler.init (self, verbose, dry_run, force)
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 157, in init
self.dll_libraries = get_msvcr()
File "c:\users\cde3\anaconda3\lib\distutils\cygwinccompiler.py", line 86, in get_msvcr
raise ValueError("Unknown MS Compiler version %s " % msc_ver)
ValueError: Unknown MS Compiler version 1900

----------------------------------------

Command "c:\users\cde3\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\cde3\AppData\Local\Temp\pip-build-83u6dm4q\msgpack-python\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\cde3\AppData\Local\Temp\pip-mgmr7_3f-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\cde3\AppData\Local\Temp\pip-build-83u6dm4q\msgpack-python\

C:\WINDOWS\System32>

Support/approach for sliding window/multiple snapshots in time

Hi there!
(first of all huge thx for dfs, vision & tools, superb work)

My question: the predict_next_purchase sample uses a single cutoff time, right? But doesn't that discard a lot of data that could help with the purchase prediction? And we're only using a single day for reference, right?

We only use this data/these users -> "Using users who had activity during training_window days before the cutoff_time, we look to see if they purchase the product in the prediction_window."

I would like to use all the data in a single final ML table for the models. Is there support for the cutoff being a sliding window (e.g. per customer) of features from the last x days, predicting purchase (yes/no) up to x days in the future? Each customer would then appear multiple times, depending on the chosen sliding window. (See the sketch below.)

I think it's a typical pattern in predicting future events (predictive maintenance, churn, healthcare) and it usually applies to any kind of event prediction (e.g. for every user or machine, predict the probability of event E over the next x days as of a specific point in time; obviously the training dataset has proper timestamps, so we can "recalculate" feature values for each user/machine as of any point in time).

The dataset becomes non-IID, obviously, so some caution applies.

Does that make sense? What's the approach for using DFS in these scenarios?
Thanks!
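
One possible shape for this, as a hedged sketch: a cutoff-time table with one row per customer per snapshot, combined with a training window (the entityset, entity name, and training_window spelling are assumptions that vary by version):

import pandas as pd

# Sketch: one row per (customer, snapshot) pair; DFS computes each
# customer's features as of each snapshot time.
customers = [1, 2]
snapshots = pd.date_range("2014-01-05", periods=4, freq="7D")
cutoff_times = pd.DataFrame([(c, t) for c in customers for t in snapshots],
                            columns=["customer_id", "time"])

# fm, defs = ft.dfs(entityset=es, target_entity="customers",
#                   cutoff_time=cutoff_times,
#                   training_window="30 days")  # limit each row to the last 30 days of data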

pip install fail

Using Python 2.7

I tried pip install featuretools. It seems to work but then fails here:

Using cached jmespath-0.9.3-py2.py3-none-any.whl
Collecting botocore<1.8.0,>=1.7.0 (from boto3->s3fs==0.1.2->featuretools)
Using cached botocore-1.7.43-py2.py3-none-any.whl
Requirement already satisfied: locket in c:\users\trader\downloads\winpython-64b
it-2.7.10.3\python-2.7.10.amd64\lib\site-packages (from partd>=0.3.8; extra == "
complete"->dask[complete]==0.15.4->featuretools)
Collecting click>=6.6 (from distributed>=1.16; extra == "complete"->dask[complet
e]==0.15.4->featuretools)
Using cached click-6.7-py2.py3-none-any.whl
Collecting msgpack-python (from distributed>=1.16; extra == "complete"->dask[com
plete]==0.15.4->featuretools)
Using cached msgpack_python-0.4.8-cp27-cp27m-win_amd64.whl
Requirement already satisfied: singledispatch; python_version < "3.4" in c:\user
s\trader\downloads\winpython-64bit-2.7.10.3\python-2.7.10.amd64\lib\site-package
s (from distributed>=1.16; extra == "complete"->dask[complete]==0.15.4->featuret
ools)
Collecting sortedcontainers (from distributed>=1.16; extra == "complete"->dask[c
omplete]==0.15.4->featuretools)
Using cached sortedcontainers-1.5.7-py2.py3-none-any.whl
Collecting zict>=0.1.3 (from distributed>=1.16; extra == "complete"->dask[comple
te]==0.15.4->featuretools)
Using cached zict-0.1.3-py2.py3-none-any.whl
Requirement already satisfied: psutil in c:\users\trader\downloads\winpython-64b
it-2.7.10.3\python-2.7.10.amd64\lib\site-packages (from distributed>=1.16; extra
== "complete"->dask[complete]==0.15.4->featuretools)
Collecting futures; python_version < "3.0" (from distributed>=1.16; extra == "co
mplete"->dask[complete]==0.15.4->featuretools)
Using cached futures-3.1.1-py2-none-any.whl
Collecting tblib (from distributed>=1.16; extra == "complete"->dask[complete]==0
.15.4->featuretools)
Using cached tblib-1.3.2-py2.py3-none-any.whl
Collecting tornado>=4.5.1 (from distributed>=1.16; extra == "complete"->dask[com
plete]==0.15.4->featuretools)
Requirement already satisfied: docutils>=0.10 in c:\users\trader\downloads\winpy
thon-64bit-2.7.10.3\python-2.7.10.amd64\lib\site-packages (from botocore<1.8.0,>
=1.7.0->boto3->s3fs==0.1.2->featuretools)
Collecting heapdict (from zict>=0.1.3->distributed>=1.16; extra == "complete"->d
ask[complete]==0.15.4->featuretools)
Requirement already satisfied: backports-abc>=0.4 in c:\users\trader\downloads\w
inpython-64bit-2.7.10.3\python-2.7.10.amd64\lib\site-packages (from tornado>=4.5
.1->distributed>=1.16; extra == "complete"->dask[complete]==0.15.4->featuretools
)
Requirement already satisfied: certifi in c:\users\trader\downloads\winpython-64
bit-2.7.10.3\python-2.7.10.amd64\lib\site-packages (from tornado>=4.5.1->distrib
uted>=1.16; extra == "complete"->dask[complete]==0.15.4->featuretools)
Building wheels for collected packages: scipy
Running setup.py bdist_wheel for scipy ... error
Complete output from command C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.
3\python-2.7.10.amd64\python.exe -u -c "import setuptools, tokenize;__file__='c:
\\users\\trader\\appdata\\local\\temp\\pip-build-kq6znx\\scipy\\setup.py';f=geta
ttr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.clos
e();exec(compile(code, __file__, 'exec'))" bdist_wheel -d c:\users\trader\appdat
a\local\temp\tmp1dnodvpip-wheel- --python-tag cp27:
lapack_opt_info:
lapack_mkl_info:
  libraries mkl_rt not found in ['C:\\Users\\Trader\\Downloads\\WinPython-64bi
t-2.7.10.3\\python-2.7.10.amd64\\lib', 'C:\\', 'C:\\Users\\Trader\\Downloads\\Wi
nPython-64bit-2.7.10.3\\python-2.7.10.amd64\\libs']
  NOT AVAILABLE

openblas_lapack_info:
C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\lib\sit
e-packages\numpy\distutils\system_info.py:655: UserWarning: Specified path c:\op
t\64\lib is invalid.
  return self.get_paths(self.section, key)
  libraries libopenblas_v0.2.20_mingwpy not found in []
  NOT AVAILABLE

atlas_3_10_threads_info:
Setting PTATLAS=ATLAS
  libraries tatlas,tatlas not found in C:\Users\Trader\Downloads\WinPython-64b
it-2.7.10.3\python-2.7.10.amd64\lib
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\lib
  libraries tatlas,tatlas not found in C:\
  libraries lapack_atlas not found in C:\
  libraries tatlas,tatlas not found in C:\Users\Trader\Downloads\WinPython-64b
it-2.7.10.3\python-2.7.10.amd64\libs
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\libs
<class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
  NOT AVAILABLE

atlas_3_10_info:
  libraries satlas,satlas not found in C:\Users\Trader\Downloads\WinPython-64b
it-2.7.10.3\python-2.7.10.amd64\lib
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\lib
  libraries satlas,satlas not found in C:\
  libraries lapack_atlas not found in C:\
  libraries satlas,satlas not found in C:\Users\Trader\Downloads\WinPython-64b
it-2.7.10.3\python-2.7.10.amd64\libs
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\libs
<class 'numpy.distutils.system_info.atlas_3_10_info'>
  NOT AVAILABLE

atlas_threads_info:
Setting PTATLAS=ATLAS
  libraries ptf77blas,ptcblas,atlas not found in C:\Users\Trader\Downloads\Win
Python-64bit-2.7.10.3\python-2.7.10.amd64\lib
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\lib
  libraries ptf77blas,ptcblas,atlas not found in C:\
  libraries lapack_atlas not found in C:\
  libraries ptf77blas,ptcblas,atlas not found in C:\Users\Trader\Downloads\Win
Python-64bit-2.7.10.3\python-2.7.10.amd64\libs
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\libs
<class 'numpy.distutils.system_info.atlas_threads_info'>
  NOT AVAILABLE

atlas_info:
  libraries f77blas,cblas,atlas not found in C:\Users\Trader\Downloads\WinPyth
on-64bit-2.7.10.3\python-2.7.10.amd64\lib
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\lib
  libraries f77blas,cblas,atlas not found in C:\
  libraries lapack_atlas not found in C:\
  libraries f77blas,cblas,atlas not found in C:\Users\Trader\Downloads\WinPyth
on-64bit-2.7.10.3\python-2.7.10.amd64\libs
  libraries lapack_atlas not found in C:\Users\Trader\Downloads\WinPython-64bi
t-2.7.10.3\python-2.7.10.amd64\libs
<class 'numpy.distutils.system_info.atlas_info'>
  NOT AVAILABLE

C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\lib\sit
e-packages\numpy\distutils\system_info.py:572: UserWarning:
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
lapack_info:
  libraries lapack not found in ['C:\\Users\\Trader\\Downloads\\WinPython-64bi
t-2.7.10.3\\python-2.7.10.amd64\\lib', 'C:\\', 'C:\\Users\\Trader\\Downloads\\Wi
nPython-64bit-2.7.10.3\\python-2.7.10.amd64\\libs']
  NOT AVAILABLE

C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\lib\sit
e-packages\numpy\distutils\system_info.py:572: UserWarning:
    Lapack (http://www.netlib.org/lapack/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [lapack]) or by setting
    the LAPACK environment variable.
  self.calc_info()
lapack_src_info:
  NOT AVAILABLE

C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\lib\sit
e-packages\numpy\distutils\system_info.py:572: UserWarning:
    Lapack (http://www.netlib.org/lapack/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [lapack_src]) or by setting
    the LAPACK_SRC environment variable.
  self.calc_info()
  NOT AVAILABLE

Running from scipy source directory.
non-existing path in 'scipy\\integrate': 'quadpack.h'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\users\trader\appdata\local\temp\pip-build-kq6znx\scipy\setup.py", l
ine 416, in <module>
    setup_package()
  File "c:\users\trader\appdata\local\temp\pip-build-kq6znx\scipy\setup.py", l
ine 412, in setup_package
    setup(**metadata)
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\core.py", line 135, in setup
    config = configuration()
  File "c:\users\trader\appdata\local\temp\pip-build-kq6znx\scipy\setup.py", l
ine 336, in configuration
    config.add_subpackage('scipy')
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 1029, in add_subpackage
    caller_level = 2)
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 998, in get_subpackage
    caller_level = caller_level + 1)
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 935, in _get_configuratio
n_from_setup_py
    config = setup_module.configuration(*args)
  File "scipy\setup.py", line 15, in configuration
    config.add_subpackage('linalg')
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 1029, in add_subpackage
    caller_level = 2)
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 998, in get_subpackage
    caller_level = caller_level + 1)
  File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64
\lib\site-packages\numpy\distutils\misc_util.py", line 935, in _get_configuratio
n_from_setup_py
    config = setup_module.configuration(*args)
  File "scipy\linalg\setup.py", line 20, in configuration
    raise NotFoundError('no lapack/blas resources found')
numpy.distutils.system_info.NotFoundError: no lapack/blas resources found

----------------------------------------
Failed building wheel for scipy
Running setup.py clean for scipy
Complete output from command C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.
3\python-2.7.10.amd64\python.exe -u -c "import setuptools, tokenize;__file__='c:
\\users\\trader\\appdata\\local\\temp\\pip-build-kq6znx\\scipy\\setup.py';f=geta
ttr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.clos
e();exec(compile(code, __file__, 'exec'))" clean --all:

`setup.py clean` is not supported, use one of the following instead:

  - `git clean -xdf` (cleans all files)
  - `git clean -Xdf` (cleans all versioned files, doesn't touch
                      files that aren't checked into the git repo)

Add `--force` to your command to use it anyway if you must (unsupported).


----------------------------------------
Failed cleaning build dir for scipy
Failed to build scipy
Installing collected packages: tqdm, futures, jmespath, botocore, s3transfer, bo
to3, s3fs, partd, click, msgpack-python, sortedcontainers, heapdict, zict, tblib
, tornado, distributed, dask, scipy, featuretools
Found existing installation: partd 0.3.2
  Uninstalling partd-0.3.2:
    Successfully uninstalled partd-0.3.2
Found existing installation: click 5.0
  Uninstalling click-5.0:
    Successfully uninstalled click-5.0
Found existing installation: tornado 4.2.1
  Uninstalling tornado-4.2.1:
    Successfully uninstalled tornado-4.2.1
Exception:
Traceback (most recent call last):
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\basecommand.py", line 215, in main
  status = self.run(options, args)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\commands\install.py", line 342, in run
  prefix=options.prefix_path,
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\req\req_set.py", line 795, in install
  requirement.commit_uninstall()
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\req\req_install.py", line 767, in commit_uninstall
  self.uninstalled.commit()
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\req\req_uninstall.py", line 142, in commit
  rmtree(self.save_dir)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
  return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\_vendor\retrying.py", line 212, in call
  raise attempt.get()
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\_vendor\retrying.py", line 247, in get
  six.reraise(self.value[0], self.value[1], self.value[2])
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\_vendor\retrying.py", line 200, in call
  attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\utils\__init__.py", line 102, in rmtree
  onerror=rmtree_errorhandler)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 247, in rmtree
  rmtree(fullname, ignore_errors, onerror)
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\shutil.py", line 252, in rmtree
  onerror(os.remove, fullname, sys.exc_info())
File "C:\Users\Trader\Downloads\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\l
ib\site-packages\pip\utils\__init__.py", line 114, in rmtree_errorhandler
  func(path)
WindowsError: [Error 5] Access is denied: 'c:\\users\\trader\\appdata\\local\\te
mp\\pip-4kia5z-uninstall\\users\\trader\\downloads\\winpython-64bit-2.7.10.3\\py
thon-2.7.10.amd64\\lib\\site-packages\\tornado\\speedups.pyd'

I tried installing the dependencies manually but couldn't make it happen. I need this to work.
Any help would be awesome!

Thanks, guys and gals!

support for non-relational data?

Great to see this "Deep Feature Synthesis" tool to help automate feature engineering.
Based on the paper "Deep feature synthesis: Towards automating data science endeavors" (DSAA 2015) and the online docs (https://docs.featuretools.com/index.html), it currently seems like DFS only supports relational data (a couple of entity tables and relationship tables)?
Am I missing something?
Any plan to include support for non-relational data (e.g. only one big table) as well?

What does dfs_filter do and what's the intention to use it?

I see that the traversal filter is used in dfs while recursively building forward and backward features, but I don't understand what this filter really does, and I haven't found any docs or comments that explain it.
It seems like this filter makes some check on an entity's variable count and unique_percent, which is confusing to me. Why do we need this kind of check while recursively building DFS features?
Also, there is another filter, LimitModeUniques, in dfs_filter.py, which is imported in dfs but seems not to be used. Is there any plan to use that filter in the future?
Thanks for reading my long questions; looking forward to your reply!

Scikit-Learn Pipelines and FeatureUnion

Bug/Feature Request Description

The requested feature is to be able to use the feature matrix transformation from within a scikit-learn pipeline. Something like the below:

from sklearn.pipeline import Pipeline

feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

pipe = Pipeline([
    ("features", feature_matrix_enc),
    ("feature_selection", feature_selection),
    ("estimator", estimator)
])

...then when pipe.predict(X_test) is called the transform() method encodes the new data the same as:

feature_matrix = ft.calculate_feature_matrix(saved_features, X_test)

So that, at the end of the sklearn pipeline the estimator can be called.
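
A hedged sketch of one possible shape for this, wrapping calculate_feature_matrix in a scikit-learn transformer; the class name and the entityset_builder callable are illustrative, not an existing API:

import featuretools as ft
from sklearn.base import BaseEstimator, TransformerMixin

class FeaturetoolsTransformer(BaseEstimator, TransformerMixin):
    """Illustrative wrapper: apply saved feature definitions to an
    EntitySet built from the incoming data inside an sklearn pipeline."""

    def __init__(self, features, entityset_builder):
        self.features = features                    # e.g. saved_features from ft.dfs
        self.entityset_builder = entityset_builder  # callable: raw data -> EntitySet

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        es = self.entityset_builder(X)
        return ft.calculate_feature_matrix(self.features, entityset=es)

# pipe = Pipeline([("features", FeaturetoolsTransformer(saved_features, build_es)),
#                  ("feature_selection", feature_selection),
#                  ("estimator", estimator)])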



Primitive stacking on direct features

Suppose we have an entityset with a parent entity E1 and a child entity E2, and we're building features on E2 with Deep Feature Synthesis. If E1 has a categorical variable, it seems that the direct feature of that categorical will be automatically generated and used in the resulting feature matrix. However, that feature won't be used for any stacked features.

This becomes a problem when we want a primitive of multiple variables from different tables. The following example shows an entityset where a user might expect the feature CAT_PRIMITIVE(values_row_1, transactions.categorical) to be generated.

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
datadict = {'values_row_1': [1, 1, 2],
            'transaction_id': [1, 2, 3],
            'categorical': ['cat', 'cat', 'lion']}

data = pd.DataFrame(datadict)

variable_types = {'values_row_1': vtypes.Numeric,
                  'transaction_id': vtypes.Categorical,
                  'categorical': vtypes.Categorical
                  }

es = ft.EntitySet('mock_entityset')

es.entity_from_dataframe(entity_id='values',
                         dataframe=data,
                         index='my_index',
                         variable_types=variable_types,
                         )

es.normalize_entity(base_entity_id='values',
                    new_entity_id='transactions',
                    index='transaction_id',
                    additional_variables=['categorical']
                    )

from featuretools.primitives import make_trans_primitive

def cat_primitive(value, categorical):
    return [x for x in categorical]

Prim = make_trans_primitive(cat_primitive,
                            input_types=[vtypes.Numeric, vtypes.Categorical],
                            return_type=vtypes.Numeric)

fm, features = ft.dfs(entityset=es, 
                      target_entity='values',
                      agg_primitives=[],
                      trans_primitives=[Prim])

features

Don't pin versions in requirements.txt

Per https://packaging.python.org/discussions/install-requires-vs-requirements/#install-requires,

featuretools is breaking convention by pinning specific versions of all its dependencies in install_requires: "It is not considered best practice to use install_requires to pin dependencies to specific versions"

Specifically, this is causing a dependency conflict when I try to install the following packages with pipenv:

pipenv --python 2.7
pipenv install numpy scipy pandas scikit-learn tensorflow featuretools

Creating feature with multi time step behind

Thank you very much for the tools.

I would like to build features that look multiple time steps behind a timestamp. I am just checking whether there is any example/manual for doing this.

For example, I have the following table:

My table:
row_id, id, time, purchase
1, id1, 2016-01-12 13:00:00, 10
2, id1, 2016-01-12 14:00:00, 15
3, id1, 2016-01-12 15:00:00, 20
4, id1, 2016-01-12 16:00:00, 25
5, id2, 2016-01-12 16:00:00, 13
6, id2, 2016-01-12 16:00:00, 17
7, id2, 2016-01-12 16:00:00, 19

The output I wish to have: for each row, calculate the average purchase over the last two hours. For example:

row_id, id, time, average_purchase
1, id1, 2016-01-12 13:00:00, NA
2, id1, 2016-01-12 14:00:00, 5
3, id1, 2016-01-12 15:00:00, 12.5
4, id1, 2016-01-12 16:00:00, 17.5
5, id2, 2016-01-12 16:00:00, NA
6, id2, 2016-01-12 16:00:00, 6.5
7, id2, 2016-01-12 16:00:00, 15

Thanks in advance,
Fayzur
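One way to sketch this with DFS, assuming the table above is in a dataframe df: build a per-id parent entity, then pass one cutoff row per transaction with a two-hour training_window. The cutoff frame follows the convention of id column first, time column second; whether training_window accepts a string or needs ft.Timedelta depends on the version.

import featuretools as ft

# Transactions entity keyed by row_id, with a per-id parent entity
es = ft.EntitySet('purchases')
es.entity_from_dataframe(entity_id='transactions', dataframe=df,
                         index='row_id', time_index='time')
es.normalize_entity(base_entity_id='transactions',
                    new_entity_id='customers', index='id')

# One cutoff row per transaction: features are computed as of each row's time
cutoff_times = df[['id', 'time']]

fm, features = ft.dfs(entityset=es,
                      target_entity='customers',
                      cutoff_time=cutoff_times,
                      training_window='2 hours',  # only look two hours back
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      cutoff_time_in_index=True)

Depending on the version's cutoff-time inclusivity, the window may include the current row itself; shifting each cutoff back slightly adjusts that.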

Support default_value in make_agg_primitive

If you are using make_agg_primitive to define a custom primitive, you currently cannot provide the default_value to use. This parameter is supported by the lower-level class-based definition of primitives, so it should be available through the helper method.
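Until the helper grows that parameter, a minimal sketch of the class-based route that does accept it (the primitive itself is hypothetical; the layout follows the AggregationPrimitive pattern of this era):

from featuretools.primitives import AggregationPrimitive
import featuretools.variable_types as vtypes


class MaxOrZero(AggregationPrimitive):
    name = 'max_or_zero'
    input_types = [vtypes.Numeric]
    return_type = vtypes.Numeric
    default_value = 0  # used when a parent instance has no child rows

    def get_function(self):
        return lambda values: values.max()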

Silent failure in 'cutoff_time'

It seems that it's possible to pass a string rather than a time to 'cutoff_time', which silently prevents dfs from using the cutoff time at all. On the #91 branch this shows up as using all values rather than only the valid ones for the cutoff time.

Thanks to @bschreck for finding this and the error in my code that led to it.

Why does featuretools use first column as an index but not the pandas index field?

Hi,

Pandas creates an implicit index if one isn't specified as a column. I want to use the pandas index in featuretools, but it can't be passed by name to the index argument; featuretools uses the first column by default, and that part is not clear to me. Why does featuretools use the first column as the index rather than the pandas index field? How can I make featuretools use the index field instead?

The code:
https://github.com/Featuretools/featuretools/blob/906777bbafc18892a927dfdc5ac3f3b8d40de1b5/featuretools/entityset/entityset.py#L441-L459
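A workaround that seems to fit the current behavior: promote the pandas index to a real column before handing the frame to featuretools (the column and entity names here are arbitrary, and es is assumed to be an existing EntitySet):

# Turn the implicit index into an explicit column featuretools can use
df = df.reset_index().rename(columns={'index': 'row_id'})
es.entity_from_dataframe(entity_id='my_entity',
                         dataframe=df,
                         index='row_id')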

error in function 'pd_time_unit' within 'transform_primitive.py'

Should we delete the '.values' in 'getattr(pd_index, time_unit).values'? Otherwise I get the error 'AttributeError: 'numpy.ndarray' object has no attribute 'values'' on this line.

def pd_time_unit(time_unit):
    def inner(pd_index):
        return getattr(pd_index, time_unit)
    return inner
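With the dedented return moved back inside the closure, a quick check of the behavior in question. The attribute lookup already returns an array-like (an ndarray or an Index, depending on the pandas version), so no trailing .values is needed:

import pandas as pd

year = pd_time_unit('year')
idx = pd.DatetimeIndex(['2016-01-12', '2017-03-01'])
print(year(idx))  # array-like of [2016, 2017]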

Python 3 support?

Hi guys,

Just wondering if there is a roadmap for Python 3 support?

Why not build Entity features after Direct features?

I find that direct features are currently built in the last step of the DFS implementation, which causes the problem in #81.
However, I've also read the paper 'Deep Feature Synthesis: Towards Automating Data Science Endeavors'. In Section II.C you say that "We must first synthesize rfeat and dfeat features so we can apply efeat feature to the results." I think the ordering in the paper is more reasonable. So why isn't that ordering used here? Is there some implicit difficulty?

Minor docstring inaccuracies

I came across these two issues:

  1. In entity_from_dataframe, the docstring says secondary_time_index is a string, when it should be a dictionary.
  2. In normalize_entity there's an undocumented kwarg variable_types.

I'm thinking of using the branch that fixes them to also standardize the notation for lists, dicts, and optional arguments in those two docstrings, e.g. list[str] and (dict[str->list[str]], optional).

Make column names in `load_retail` easier to read

Change the names of the columns in the load_retail function as follows:

column_map = {'Unnamed: 0': 'transaction_id',
              'InvoiceNo': 'invoice_id',
              'StockCode': 'product_id',
              'Description': 'description',
              'Quantity': 'quantity',
              'InvoiceDate': 'invoice_date',
              'UnitPrice': 'price',
              'CustomerID': 'customer_id',
              'Country': 'country'}

and update any connected tests. Suggestions welcome for alternate column names.
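For context, a minimal sketch of applying that mapping when the raw data is loaded (the CSV path here is hypothetical, not the function's actual source):

import pandas as pd

df = pd.read_csv('online_retail.csv')  # hypothetical source path
df = df.rename(columns=column_map)     # column_map is the dict above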

Multiple threading version to calculate feature matrix

Hello,

calculating the feature matrix takes a considerable amount of time on a single thread, while most of the other threads are doing nothing.

Is it possible to calculate the feature matrix in parallel with multithreading support?
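For reference, later releases grew a parallel path backed by Dask. A minimal sketch, assuming features and es were produced earlier (for example via ft.dfs with features_only=True) and the dask extra is installed:

import featuretools as ft

# Spread cutoff-time chunks across four worker processes
fm = ft.calculate_feature_matrix(features=features,
                                 entityset=es,
                                 n_jobs=4)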

Error with 'entity_from_csv'

I am trying to extract features automatically with featuretools from Windows event logs available online. When I run entity_from_csv on the authentication dataset, I get an error like "segmentation fault" or "free: invalid pointer".

To reproduce the error:

The csv file contains 13,000,000 lines. With only the first 12,000,000 lines there is no error; with only the last 1,000,000 lines there is no error either, so I don't think the problem is due to the content (like a special character) of the last lines.

Do you have an idea about the origin of this error?

Context:

  • installed with pip install featuretools
  • run with Python 2.7
  • 64 GB of RAM (more than enough for the dataset considered)
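One possible workaround while this is investigated: read the CSV in chunks with pandas and hand the assembled frame to entity_from_dataframe instead (file, entity, and column names below are hypothetical):

import pandas as pd
import featuretools as ft

# Read the 13M-line file in 1M-row chunks rather than in one shot
chunks = pd.read_csv('auth.csv', chunksize=1000000)
df = pd.concat(chunks, ignore_index=True)

es = ft.EntitySet('windows_logs')
es.entity_from_dataframe(entity_id='auth', dataframe=df,
                         index='event_id', make_index=True)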

Tuple handling in normalize entity

I get the following error running dfs after normalizing my entityset on an index which is a tuple. An MWE is forthcoming, but passing the normalize_entity kwarg convert_links_to_integers=True avoids the error.

Thanks to @bschreck for pointing me to convert_links_to_integers

KeyError                                  Traceback (most recent call last)
<ipython-input-17-f9a2ed177666> in <module>()
      6                       cutoff_time=cutoff_times,
      7                       max_depth=3,
----> 8                       verbose=True)
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/synthesis/dfs.pyc in dfs(entities, relationships, entityset, target_entity, cutoff_time, instance_ids, agg_primitives, trans_primitives, allowed_paths, max_depth, ignore_entities, ignore_variables, seed_features, drop_contains, drop_exact, where_primitives, max_features, cutoff_time_in_index, save_progress, features_only, training_window, approximate, verbose)
    181                                                   cutoff_time_in_index=cutoff_time_in_index,
    182                                                   save_progress=save_progress,
--> 183                                                   verbose=verbose)
    184     else:
    185         feature_matrix = calculate_feature_matrix(features,
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/computational_backends/calculate_feature_matrix.pyc in calculate_feature_matrix(features, cutoff_time, instance_ids, entities, relationships, entityset, cutoff_time_in_index, training_window, approximate, save_progress, verbose, backend_verbose, verbose_desc, profile)
    194                                           save_progress, backend,
    195                                           no_unapproximated_aggs, cutoff_df_time_var,
--> 196                                           target_time, pass_columns)
    197         feature_matrix.append(_feature_matrix)
    198         # Do a manual garbage collection in case objects from calculate_batch
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/computational_backends/calculate_feature_matrix.pyc in calculate_batch(features, group, approximate, entityset, backend_verbose, training_window, profile, verbose, save_progress, backend, no_unapproximated_aggs, cutoff_df_time_var, target_time, pass_columns)
    270                                        ids,
    271                                        precalculated_features=precalculated_features,
--> 272                                        training_window=window)
    273 
    274         id_name = _feature_matrix.index.name
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/computational_backends/calculate_feature_matrix.pyc in wrapped(*args, **kwargs)
    319         def wrapped(*args, **kwargs):
    320             if save_progress is None:
--> 321                 r = method(*args, **kwargs)
    322             else:
    323                 time = args[0].to_pydatetime()
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/computational_backends/calculate_feature_matrix.pyc in calc_results(time_last, ids, precalculated_features, training_window)
    243                                                 ignored=all_approx_feature_set,
    244                                                 profile=profile,
--> 245                                                 verbose=backend_verbose)
    246         return matrix
    247 
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/computational_backends/pandas_backend.pyc in calculate_all_features(self, instance_ids, time_last, training_window, profile, precalculated_features, ignored, verbose)
    103                                                  time_last=time_last,
    104                                                  training_window=training_window,
--> 105                                                  verbose=verbose)
    106         large_eframes_by_filter = None
    107         if any([f.uses_full_entity for f in self.feature_tree.all_features]):
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/entityset/entityset.pyc in get_pandas_data_slice(self, filter_entity_ids, index_eid, instances, entity_columns, time_last, training_window, verbose)
    172                                                      instance_ids=instances,
    173                                                      time_last=time_last,
--> 174                                                      training_window=training_window)
    175 
    176             eframes = {filter_eid: toplevel_slice}
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/entityset/entityset.pyc in _related_instances(self, start_entity_id, final_entity_id, instance_ids, time_last, add_link, training_window)
   1030                                               variable_id=rvar_new,
   1031                                               time_last=time_last,
-> 1032                                               training_window=window)
   1033 
   1034             # group the rows in the new dataframe by the instances of the first
/Users/featurelabs07/Documents/Repositories/featuretools/featuretools/entityset/entity.pyc in query_by_values(self, instance_vals, variable_id, columns, time_last, training_window, return_sorted, start, end, random_seed, shuffle)
    258 
    259         elif variable_id is None or variable_id == self.index:
--> 260             df = self.df.loc[instance_vals]
    261             df.dropna(subset=[self.index], inplace=True)
    262 
/Users/featurelabs07/homeenv/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1371 
   1372             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1373             return self._getitem_axis(maybe_callable, axis=axis)
   1374 
   1375     def _is_scalar_access(self, key):
/Users/featurelabs07/homeenv/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1614                     raise ValueError('Cannot index with multidimensional key')
   1615 
-> 1616                 return self._getitem_iterable(key, axis=axis)
   1617 
   1618             # nested tuple slicing
/Users/featurelabs07/homeenv/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
   1125             # if it cannot handle; we only act on all found values
   1126             indexer, keyarr = labels._convert_listlike_indexer(
-> 1127                 key, kind=self.name)
   1128             if indexer is not None and (indexer != -1).all():
   1129                 return self.obj.take(indexer, axis=axis)
/Users/featurelabs07/homeenv/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in _convert_listlike_indexer(self, keyarr, kind)
   1492             keyarr = self._convert_arr_indexer(keyarr)
   1493 
-> 1494         indexer = self._convert_list_indexer(keyarr, kind=kind)
   1495         return indexer, keyarr
   1496 
/Users/featurelabs07/homeenv/lib/python2.7/site-packages/pandas/core/indexes/interval.pyc in _convert_list_indexer(self, keyarr, kind)
    665         # we have missing values
    666         if (locs == -1).any():
--> 667             raise KeyError
    668 
    669         return locs
KeyError:
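For anyone hitting this before a fix lands, a minimal sketch of the workaround (the entity and index names are hypothetical):

es.normalize_entity(base_entity_id='transactions',
                    new_entity_id='groups',
                    index='group_key',               # a tuple-valued column
                    convert_links_to_integers=True)  # sidesteps the KeyError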

EntitySet Concat() does not reindex data

EntitySets store an internal dictionary of indexes over the variables that connect entities across relationships. This speeds up computation by enabling quicker joins, much like a database index does. To create these indexes, EntitySet.add_relationship internally calls:

self.index_data(relationship)

However, when we concat two EntitySets like this:

es_updated = es_old.concat(es_new)

We update the internal dataframes of es_old but do not update the indexes. This affects calculate_feature_matrix down the road, not just as a performance hit but also potentially in the actual feature values.

An easy fix is to add the following to EntitySet.concat:

for r in es.relationships:
    es.index_data(r)  # rebuild the relationship indexes after combining

LookupError: Time index not found in dataframe

Bug Description

When I run the quick start, I hit this issue:

      2 feature_matrix_customers,features_defs = ft.dfs(entities = entities,
      3                                                 relationships = relationships,
----> 4                                                 target_entity = "customers")
      5 feature_matrix_customers
LookupError: Time index not found in dataframe
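A likely cause, sketched below: each entry in the quick-start entities dictionary is (dataframe, index, time_index), and this LookupError is raised when the named time index column is missing from the dataframe. The names here follow the mock customer dataset and are illustrative:

entities = {
    "customers": (customers_df, "customer_id", "join_date"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
}
relationships = [("sessions", "session_id", "transactions", "session_id"),
                 ("customers", "customer_id", "sessions", "customer_id")]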
