Comments (5)
Hi @zhxt95,
Thanks for the interesting question! The reason for the ordering in Featuretools is to avoid creating duplicate features. If we were to create the direct features before creating the transform features, a primitive could be applied to make the same feature in two different ways.
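As an example, suppose we have two entities, `transactions` and `customers`, and a feature `date of birth` in `customers`. If direct features were built first, using `ft.dfs` with target entity `transactions` would create both `customers.TimeSince(date of birth)` and `TimeSince(customers.date of birth)`, which compute the same values. By building direct features after transforms, we only create the former, because no direct feature exists yet in `transactions` at the point where transform primitives are applied there.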
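To make the duplication concrete, here is a minimal sketch in plain Python (not Featuretools code). The toy data, the fixed `AS_OF` date, and the `time_since` helper are all assumptions made up for illustration. It shows that applying the transform on the parent and then looking it up gives the same values as looking up the raw column and then transforming it.

```python
from datetime import date

# Hypothetical toy data: each transaction points at one customer.
customers = {
    1: {"date_of_birth": date(1990, 5, 17)},
    2: {"date_of_birth": date(1985, 1, 3)},
}
transactions = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},
    {"id": 12, "customer_id": 1},
]
AS_OF = date(2020, 1, 1)  # fixed reference date so the result is reproducible


def time_since(d):
    """Stand-in for a TimeSince-style transform: days elapsed up to AS_OF."""
    return (AS_OF - d).days


# Path 1: transform on customers, then carry the result into transactions.
customers_time_since = {
    cid: time_since(row["date_of_birth"]) for cid, row in customers.items()
}
direct_of_transform = [customers_time_since[t["customer_id"]] for t in transactions]

# Path 2: carry date_of_birth into transactions, then transform it there.
direct_dob = [customers[t["customer_id"]]["date_of_birth"] for t in transactions]
transform_of_direct = [time_since(d) for d in direct_dob]

print(direct_of_transform == transform_of_direct)  # prints True
```

The equality holds here because the stand-in transform works row by row. A primitive that depends on the whole column would not necessarily agree across the two paths.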
This is something that could be enhanced in a future release of dfs in Featuretools.
from featuretools.
Well, as you said, duplicate features are indeed an annoying problem. Yet I still think it is necessary to build direct features first.
For example, suppose there are two entities. Entity `customer` contains the features `job` and weekly `outgoing`, while entity `payroll` tells us the `salary` of each job. If we want to know whether a customer's income can cover their outgoings, a feature like `Subtract(payroll.salary, outgoing)` is very useful.
With `ft.dfs`, I currently cannot generate a feature like this automatically, no matter what depth I set. But with the approach in your paper, I can get what I want.
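For illustration, the feature described above can be computed by hand in two steps. This is a minimal sketch in plain Python, not Featuretools code, and the data values are invented: first the direct feature `payroll.salary` is pulled into `customer`, then a two-input subtraction is applied to it and the local `outgoing` column.

```python
# Hypothetical toy data; the numbers are made up for illustration.
payroll = {
    "engineer": {"salary": 1500},
    "teacher": {"salary": 900},
}
customers = [
    {"id": 1, "job": "engineer", "outgoing": 1200},
    {"id": 2, "job": "teacher", "outgoing": 1000},
]

# Step 1 (direct feature): look up payroll.salary through the job key.
for c in customers:
    c["payroll.salary"] = payroll[c["job"]]["salary"]

# Step 2 (transform): subtract the local column from the direct feature.
for c in customers:
    c["Subtract(payroll.salary, outgoing)"] = c["payroll.salary"] - c["outgoing"]

surplus = [c["Subtract(payroll.salary, outgoing)"] for c in customers]
print(surplus)  # prints [300, -100]
```

Step 2 needs the output of step 1 as an input, which is why this feature is out of reach when transforms are only applied before direct features exist.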
I've noticed that the situation I described above applies a transform primitive (from `trans_primitives`) to two input features, of which one is a direct feature and the other is not.
In this kind of situation, duplication is no longer a problem.
So I think building efeat after dfeat, and applying transform primitives only in this situation, may be a better approach.
@zhxt95 That example makes sense. We'd be happy to extend `ft.dfs` to handle that case. Are you interested in attempting to make a PR with the change you're suggesting?
Basically, the PR would swap the order of building transform and direct features, but then add additional logic to avoid features that would calculate the same thing. The primary challenge here is to make that logic as straightforward and general as possible.
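One possible shape for that logic, sketched in plain Python under loud assumptions: the tuple encoding of features, `canonical`, and `dedupe` are all hypothetical names invented here, and none of this reflects how Featuretools actually represents or filters features. The idea is to rewrite every feature so that direct lookups sit directly on raw columns, then drop any feature whose rewritten form has already been seen.

```python
# Hypothetical feature encoding for this sketch only (not Featuretools classes):
#   ("col", name)                      a raw column
#   ("direct", parent, feature)        `feature` of entity `parent`, pulled into a child
#   ("transform", primitive, inputs)   a primitive applied to a tuple of features


def canonical(feature):
    """Rewrite a feature so that every direct lookup wraps a non-transform."""
    kind = feature[0]
    if kind == "col":
        return feature
    if kind == "transform":
        _, primitive, inputs = feature
        return ("transform", primitive, tuple(canonical(f) for f in inputs))
    # kind == "direct"
    _, parent, inner = feature
    inner = canonical(inner)
    if inner[0] == "transform":
        # Assumes a row-wise primitive: looking up T(a, b) on the parent gives
        # the same values as applying T to the looked-up inputs.
        _, primitive, inputs = inner
        pushed = tuple(canonical(("direct", parent, f)) for f in inputs)
        return ("transform", primitive, pushed)
    return ("direct", parent, inner)


def dedupe(features):
    """Keep the first feature seen for each canonical form."""
    seen = set()
    unique = []
    for f in features:
        key = canonical(f)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


dob = ("col", "date_of_birth")
a = ("direct", "customers", ("transform", "TimeSince", (dob,)))
b = ("transform", "TimeSince", (("direct", "customers", dob),))
c = ("transform", "Subtract", (("direct", "payroll", ("col", "salary")), ("col", "outgoing")))

print(dedupe([a, b, c]) == [a, c])  # prints True
```

In this sketch the two `TimeSince` variants collapse to one, while the mixed-input `Subtract` feature survives because no other feature shares its canonical form. The rewrite is only valid for primitives that operate row by row, so a real implementation would need some way to exclude primitives that depend on the whole column.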
If you take a stab at this, we'd be happy to discuss and help!
I've opened #123 as a possible solution, but I'm not sure it works well. I'd appreciate it if you could review it and give some suggestions :)