
mindsdb / lightwood


Lightwood is Legos for Machine Learning.

License: GNU General Public License v3.0

Python 55.54% Makefile 0.06% CSS 0.38% Jupyter Notebook 44.02%
ml automl encoders mindsdb probabilistic-programming pytorch machine-learning neural-networks hacktoberfest

lightwood's Introduction


MindsDB is the platform for building AI from enterprise data. You can create, serve, and fine-tune models in real-time from your database, vector store, and application data.

📖 About us

MindsDB is the platform for building AI from enterprise data.

With MindsDB, you can deploy, serve, and fine-tune models in real-time, utilizing data from databases, vector stores, or applications, to build AI-powered apps - using universal tools developers already know.

MindsDB integrates with numerous data sources, including databases, vector stores, and applications, and popular AI/ML frameworks, including AutoML and LLMs. MindsDB connects data sources with AI/ML frameworks and automates routine workflows between them. By doing so, we bring data and AI together, enabling the intuitive implementation of customized AI systems.

Learn more about features and use cases of MindsDB here.

🚀 Get Started

To get started, install MindsDB locally via Docker or Docker Desktop, following the instructions in linked doc pages.

MindsDB enhances SQL syntax to enable seamless development and deployment of AI-powered applications. Furthermore, users can interact with MindsDB not only via SQL API but also via REST APIs, Python SDK, JavaScript SDK, and MongoDB-QL.

🎯 Solutions | ⚙️ SQL Query Examples
🤖 Fine-Tuning | FINETUNE mindsdb.hf_model FROM postgresql.table;
📚 Knowledge Base | CREATE KNOWLEDGE_BASE my_knowledge FROM (SELECT contents FROM drive.files);
🔍 Semantic Search | SELECT * FROM rag_model WHERE question='What product is best for treating a cold?';
⏱️ Real-Time Forecasting | SELECT * FROM binance.trade_data WHERE symbol = 'BTCUSDT';
🕵️ Agents | CREATE AGENT my_agent USING model='chatbot_agent', skills = ['knowledge_base'];
💬 Chatbots | CREATE CHATBOT slack_bot USING database='slack',agent='customer_support';
⏲️ Time Driven Automation | CREATE JOB twitter_bot ( <sql_query1>, <sql_query2> ) START '2023-04-01 00:00:00';
🔔 Event Driven Automation | CREATE TRIGGER data_updated ON mysql.customers_data (sql_code)

💡 Examples

MindsDB enables you to deploy AI/ML models, send predictions to your application, and automate AI workflows.

Discover more tutorials and use cases here.

AI Workflow Automation

This category of use cases involves tasks that get data from a data source, pass it through an AI/ML model, and write the output to a data destination.

Common use cases are anomaly detection, data indexing/labeling/cleaning, and data transformation.

This example showcases the data enrichment flow, where input data comes from a PostgreSQL database and is passed through an OpenAI model to generate new content which is saved into a data destination.

We take customer reviews from a PostgreSQL database. Then, we deploy an OpenAI model that analyzes all customer reviews and assigns sentiment values. Finally, to automate the workflow for incoming customer reviews, we create a job that generates and saves AI output into a data destination.

-- Step 1. Connect a data source to MindsDB
CREATE DATABASE data_source
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT *
FROM data_source.amazon_reviews_job;

-- Step 2. Deploy an AI model
CREATE ML_ENGINE openai_engine
FROM openai
USING
    openai_api_key = 'your-openai-api-key';

CREATE MODEL sentiment_classifier
PREDICT sentiment
USING
    engine = 'openai_engine',
    model_name = 'gpt-4',
    prompt_template = 'describe the sentiment of the reviews
						strictly as "positive", "neutral", or "negative".
						"I love the product":positive
						"It is a scam":negative
						"{{review}}.":';

DESCRIBE sentiment_classifier;

-- Step 3. Join input data with AI model to get AI output
SELECT input.review, output.sentiment
FROM data_source.amazon_reviews_job AS input
JOIN sentiment_classifier AS output;

-- Step 4. Automate this workflow to accommodate real-time and dynamic data
CREATE DATABASE data_destination
WITH ENGINE = "engine-name",      -- choose the data source you want to connect to save AI output
PARAMETERS = {                    -- list of available data sources: https://docs.mindsdb.com/integrations/data-overview
    "key": "value",
	...
};

CREATE JOB ai_automation_flow (
	INSERT INTO data_destination.ai_output (
		SELECT input.created_at,
			   input.product_name,
			   input.review,
			   output.sentiment
		FROM data_source.amazon_reviews_job AS input
		JOIN sentiment_classifier AS output
		WHERE input.created_at > LAST
	);
);

AI System Deployment

This category of use cases involves creating AI systems composed of multiple connected parts, including various AI/ML models and data sources, and exposing such AI systems via APIs.

Common use cases are agents and assistants, recommender systems, forecasting systems, and semantic search.

This example showcases AI agents, a feature developed by MindsDB. AI agents can be assigned certain skills, including text-to-SQL skills and knowledge bases. Skills provide an AI agent with input data that can be in the form of a database, a file, or a website.

We create a text-to-SQL skill based on the car sales dataset and deploy a conversational model, which are both components of an agent. Then, we create an agent and assign this skill and this model to it. This agent can be queried to ask questions about data stored in assigned skills.

-- Step 1. Connect a data source to MindsDB
CREATE DATABASE data_source
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT *
FROM data_source.car_sales;

-- Step 2. Create a skill
CREATE SKILL my_skill
USING
    type = 'text2sql',
    database = 'data_source',
    tables = ['car_sales'],
    description = 'car sales data of different car types';

SHOW SKILLS;

-- Step 3. Deploy a conversational model
CREATE ML_ENGINE langchain_engine
FROM langchain
USING
    openai_api_key = 'your-openai-api-key';

CREATE MODEL my_conv_model
PREDICT answer
USING
    engine = 'langchain_engine',
    model_name = 'gpt-4',
    mode = 'conversational',
    user_column = 'question',
    assistant_column = 'answer',
    max_tokens = 100,
    temperature = 0,
    verbose = True,
    prompt_template = 'Answer the user input in a helpful way';

DESCRIBE my_conv_model;

-- Step 4. Create an agent
CREATE AGENT my_agent
USING
    model = 'my_conv_model',
    skills = ['my_skill'];

SHOW AGENTS;

-- Step 5. Query an agent
SELECT *
FROM my_agent
WHERE question = 'what is the average price of cars from 2018?';

SELECT *
FROM my_agent
WHERE question = 'what is the max mileage of cars from 2017?';

SELECT *
FROM my_agent
WHERE question = 'what percentage of sold cars (from 2016) are automatic/semi-automatic/manual cars?';

SELECT *
FROM my_agent
WHERE question = 'is petrol or diesel more common for cars from 2019?';

SELECT *
FROM my_agent
WHERE question = 'what is the most commonly sold model?';

Agents are accessible via API endpoints.

๐Ÿค Contribute

If youโ€™d like to contribute to MindsDB, install MindsDB for development following this instruction.

Youโ€™ll find the contribution guide here.

We are always open to suggestions, so feel free to open new issues with your ideas, and we can guide you!

This project is released with a Contributor Code of Conduct. By participating in this project, you agree to follow its terms.

Also, check out the rewards and community programs here.

๐Ÿค Support

If you find a bug, please submit an issue on GitHub here.

Here is how you can get community support:

If you need commercial support, please contact the MindsDB team.

💚 Current contributors

Made with contributors-img.

🔔 Subscribe to updates

Join our Slack community and subscribe to the monthly Developer Newsletter to get product updates, information about MindsDB events and contests, and useful content, like tutorials.

โš–๏ธ License

For detailed licensing information, please refer to the LICENSE file.

lightwood's People

Contributors

abitrolly, adripo, alexandre-dz-oscore, azulgarza, btseytlin, ea-rus, george3d6, hakunanatasha, hamishfagg, jaredc07, kination, lezcano, lyndonfan, maximlopin, michaellantz, mindsdb-devops, mrandri19, noraa-july-stoke, ongspxm, paxcema, quantumplumber, rajveer43, riadhlaabidi, stpmax, surendra1472, talaathasanin, tomhuds, torrmal, vaithak, zoranpandovski

lightwood's Issues

Run unit tests as part of the travis CI tests

Currently we have a bunch of relatively quick unit tests for each file. We should run some of these (or even all of them) as part of the CI tests.

This will also force us to keep them updated; a lot of them were/are deprecated since they were written when lightwood was first being developed.

Possible issue with encoder output size differences.

So, there's a possible issue that I can best formalize as:

Given n encoders with outputs of different size, that encode variables which are equally important for predicting the target variable, the network might have too many parameters dedicated to the larger inputs and thus learn very fast / overfit on the inputs encoded with the largest representation.

Some possible solutions:

  1. Make all mixers take a standard-size input (possibly equal to the size of the output variable[s], which could also be combined with training all encoders to predict the output variable...)

  2. Add "input networks" as part of the mixer for each different input. These networks will have an in-layer equal to the size of the output variable and an out-layer of a standard size (maybe equal to the largest encoded input size).

  3. This issue would mainly happen with categorical one-hot-encoded values with very few dimensions (e.g. categories with 2-20 possible values) and numbers (which are always represented by a 4-variable input vector [isnan, iszero, sign, normalized_numerical_value]).

We might be able to "hack" around this by simply "copy-pasting" the inputs from the smaller categories up to a certain threshold (pick magic number or some function in relation to the size of the largest encoded input).
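A minimal sketch of this "copy-pasting" workaround, assuming encoded inputs are plain Python lists. `numeric_vector` and `tile_to_min_size` are hypothetical helper names, not lightwood functions, and both the normalization by a known maximum and the "half the largest size" threshold are assumptions made for illustration.

```python
import math


def numeric_vector(value, max_abs):
    """Encode a number as [isnan, iszero, sign, normalized_numerical_value]."""
    if value is None or math.isnan(value):
        return [1.0, 0.0, 0.0, 0.0]
    is_zero = 1.0 if value == 0 else 0.0
    sign = 1.0 if value >= 0 else 0.0
    normalized = abs(value) / max_abs if max_abs else 0.0
    return [0.0, is_zero, sign, normalized]


def tile_to_min_size(encoded, min_size):
    """Repeat a short encoding until it has at least `min_size` entries."""
    if not encoded or len(encoded) >= min_size:
        return list(encoded)
    repeats = math.ceil(min_size / len(encoded))
    return list(encoded) * repeats


largest = 16                      # size of the largest encoded input (assumed)
threshold = largest // 2          # "magic number": half the largest size
small = numeric_vector(5.0, max_abs=10.0)
padded = tile_to_min_size(small, threshold)
```

Here the 4-entry numeric vector is repeated twice, so it occupies 8 of the mixer's inputs instead of 4.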

We could also change numerical encoders so that, instead of having a single value for the numerical value, we have two values per bucket of the numerical value: one represents the normalized value between the start and end of the bucket, the other represents how strongly we believe the value to be in this bucket. If we pick these buckets to be equal in size to the ones generated by MindsDB in the histogram, this is also helpful since then lightwood itself can give a numerical prediction plus a number of high-probability buckets for said numerical prediction. Even if the prediction is wrong, there's a high chance of the buckets being correct.
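The bucketed representation could look roughly like the sketch below. `bucket_encode` and the bucket boundaries are hypothetical, and the "belief" value is simplified to a hard 0/1 membership rather than a learned or smoothed strength.

```python
def bucket_encode(value, edges):
    """Return two numbers per bucket: position within the bucket and membership."""
    encoded = []
    for start, end in zip(edges[:-1], edges[1:]):
        last = end == edges[-1]
        inside = start <= value < end or (last and value == end)
        position = (value - start) / (end - start) if inside else 0.0
        encoded.extend([position, 1.0 if inside else 0.0])
    return encoded


edges = [0, 10, 20, 40]           # hypothetical histogram bucket boundaries
vector = bucket_encode(15, edges)
```

With these boundaries, 15 falls halfway through the second bucket, so only that bucket's pair is non-zero.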

@torrmal you were the one that came up with 2 and seemed to be very keen on 1, so if I'm missing anything there please add to them or correct me.

For now I'm leaning towards implementation number 2, since it would affect the least number of moving parts.

Weird segfault issue on import

Basically, there's a weird segfault happening when importing lightwood... in some cases.

Examples:

Segmentation fault:
import transformers

Segmentation fault:
import mindsdb
import lightwood

Segmentation fault:
import mindsdb
import transformers

Works:
import lightwood
import transformers

Works:
import lightwood
import mindsdb

Doesn't happen on all machines. No idea of the cause, investigating now.
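One way to narrow this down is to try each import order in a fresh interpreter, so a crash in one order cannot affect the others. The sketch below uses only the standard library; `import_order_exit_code`, `describe` and `report` are hypothetical helper names. On POSIX systems a process killed by a signal has a negative return code, so a segmentation fault typically shows up as -11; Windows reports crashes differently.

```python
import subprocess
import sys


def import_order_exit_code(modules):
    """Import `modules` in order in a fresh interpreter and return its exit code."""
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def describe(returncode):
    """Map an exit code to a short label; negative codes mean death by signal."""
    if returncode == 0:
        return "works"
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"failed with exit code {returncode}"


def report(orders):
    """Run every import order and collect a label for each."""
    return {" -> ".join(order): describe(import_order_exit_code(order)) for order in orders}
```

Calling `report([["mindsdb", "lightwood"], ["lightwood", "mindsdb"]])` on an affected machine would show which order crashes.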

Support training multiple mixers and add more mixers

Support the training of multiple mixers and add a few more mixers (especially one or two boosting models, since they often seem to beat our own mixer on certain datasets, or approach its accuracy in a fraction of the time).

We could do any of the following: input the predictions from these mixers into the final NN mixer, use these mixers instead of the NN mixers, or adopt an ensemble prediction style (where we trust the majority and give a confidence based on how the predictions align).

This architecture change could also be used to train multiple NN mixers (e.g. train one with selfaware on and one with selfaware off, and use the one with selfaware off if it's much more accurate on the testing data). We could even expose more mixers to the user/mindsdb and let them choose which one to use for a given prediction based on various criteria.
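The ensemble option can be illustrated with a simple majority vote, where the confidence is the share of mixers that agree with the winner. `ensemble_predict` is a hypothetical name and the example predictions are made up; ties are resolved in favour of the prediction that appears first.

```python
from collections import Counter


def ensemble_predict(predictions):
    """Majority vote over mixer predictions; confidence is the agreeing share."""
    votes = Counter(predictions)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(predictions)


# Hypothetical outputs of three mixers for one row.
label, confidence = ensemble_predict(["yes", "yes", "no"])
```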

installation error in python 3.8.1

The installation fails like this:
ERROR: Could not find a version that satisfies the requirement torch>=1.3.0 (from pyro-ppl>=0.4.1->lightwood) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.3.0 (from pyro-ppl>=0.4.1->lightwood)

Add column drop-out once the model converges

Once we reach a "maximum" accuracy for the model, start feeding it incomplete datasets by using the datasource's dropout feature. This should help the network predict better with missing input data in the future.

The important things to think about / implement:

  • Which columns do we choose to drop? Do we go through all of them one by one? Do we try to see how the awareness network reacts to certain missing columns and choose combinations of columns to drop based on that?
  • If the dropout-trained model has slightly worse accuracy than the best one trained without dropout, which one do we pick?
  • We need to change the dropout interface for the datasource so that it can be called during training, rather than only modified via the config during the setup of the datasource.
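As a rough illustration of what a callable dropout step might do, the sketch below blanks out a random subset of droppable columns in a single row. It uses the simplest strategy (independent random drops); `drop_columns` and its parameters are hypothetical and do not reflect the datasource's actual interface.

```python
import random


def drop_columns(row, droppable, dropout_rate, rng):
    """Return a copy of `row` with some droppable columns set to None."""
    return {
        column: None if column in droppable and rng.random() < dropout_rate else value
        for column, value in row.items()
    }


rng = random.Random(0)            # seeded so runs are repeatable
row = {"age": 42, "city": "Paris", "price": 10.5}
masked = drop_columns(row, droppable={"age", "city"}, dropout_rate=0.5, rng=rng)
```

Columns outside `droppable` (here `price`) are never touched, which is how target columns could be protected.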

Training seems to stop too early on small datasets

This is on a proprietary dataset, so sadly I can't include it here.

Training seems to stop too early on very small datasets, before the algorithm is allowed to converge to an optimal solution. If I just copy-paste the dataset 10 times in the same file we reach 100% accuracy, whilst leaving it as is only allows us to reach ~94%.

We should find a way to run a dataset multiple times (or not run it fully) during an epoch based on its size, or change the number of epochs before evaluation dynamically based on dataset size.
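One possible rule, sketched under the assumption that an epoch should see roughly a fixed number of rows regardless of dataset size. `passes_per_epoch` and the 10,000-row target are hypothetical, not values used by lightwood.

```python
import math


def passes_per_epoch(nr_rows, target_rows=10_000):
    """Number of times to iterate a dataset so an epoch sees ~target_rows rows."""
    if nr_rows <= 0:
        raise ValueError("dataset must contain at least one row")
    return max(1, math.ceil(target_rows / nr_rows))


small_passes = passes_per_epoch(800)       # 13 passes for a small dataset
large_passes = passes_per_epoch(50_000)    # 1 pass for a large dataset
```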

This could also be done in mindsdb but I'd prefer it if lightwood itself knew how to do this, as it's closely related to the training process.

Windows install via pypi repo fails

pip install lightwood will fail on Windows because we source the torch dependency from outside PyPI.

Not much we can do about it at the moment, since the PyPI version of torch for Windows seems to fail installing most of the time (or installs a version that's way too old).

We should look into fixing this when new releases of torch arrive on PyPI and/or new updates for the various libraries involved with torch appear on Windows.

Temporary "fix" is just asking people to install from github.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Your Environment

Python version: 3.7.4
Pip version: 19
Operating system: Ubuntu 18.04
Python environment used (e.g. venv, conda): venv
Mindsdb version you tried to install: 1.6.8

Describe the bug
ValueError is thrown when training

To Reproduce
Steps to reproduce the behavior, for example:

Use this example
You should see the error: ValueError: Input contains NaN

Additional context
Screenshot from 2019-10-08 13-41-47: https://user-images.githubusercontent.com/7192539/66394176-aa3d6680-e9d4-11e9-9e1b-d61fe50b26c4.png

Issues training on small datasets / Make sure nr parameters > nr of rows

Pretty self-explanatory: essentially, if we have more parameters in the mixer than unique rows in the input dataset, this might result in the model overfitting to predict each separate row. Then the testing dataset just becomes a selector between a number of over-fitted models.

To some extent implementing #73 might help with this.

I'm also not yet sure whether or not this is an actual issue we ran into. There were indeed cases where predicting with lightwood on n rows (where n is small, say around 800), that had a relatively large input representation (which would result in a network with dozens or hundreds of thousands of parameters), resulted in surprisingly poor accuracy and long training time.

However, copy-pasting the rows a few times seems to have fixed the issue... so I don't see how that would necessarily fit this model, since this issue would be caused by the number of distinct rows, not by the absolute number of rows.

@torrmal if you have any further opinions on this or if you think I'm misunderstanding your stance about this issue please correct me.

Add more checks to the CI tests

The lightwood CI tests don't cover much ground at the moment. We should add a few things to them:

a) Try turning various flags on/off (e.g. OVERSAMPLE, SELFAWARE, PLINEAR)

b) Run on one or two datasets which are either deterministic (and should reach ~100% accuracy) or for which we have a lot of previous benchmarks (e.g. default on credit), and look at the actual accuracy obtained on them. If it's surprisingly small, don't auto-deploy to PyPI. For why this is needed, see releases 0.11.6 and 0.11.7, which are essentially "broken" in that they don't reach a decent accuracy on any dataset, but passed the CI tests.

c) #65
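The accuracy check in (b) could be a small gate that runs after the benchmark datasets are trained. The sketch below is only illustrative: `accuracy_gate`, the dataset names and all thresholds and results are hypothetical.

```python
def accuracy_gate(results, minimums):
    """Return the datasets whose accuracy fell below the required minimum."""
    return [
        name for name, minimum in minimums.items()
        if results.get(name, 0.0) < minimum
    ]


# Hypothetical thresholds and results.
minimums = {"deterministic_sum": 0.99, "default_of_credit": 0.75}
results = {"deterministic_sum": 1.0, "default_of_credit": 0.52}
failed = accuracy_gate(results, minimums)
```

A CI script could exit with a non-zero status whenever `failed` is non-empty, which would stop the automatic deploy. A dataset with no recorded result counts as a failure.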

Training on large datasets (with no cache) OOM

When training on very large datasets, even if the cache is disabled, we sometimes run OOM, especially on GPUs, even on rather large ones, since memory is usually rather limited (<12GB).

Initially I thought this was due to the accumulating gradient tensors PyTorch stores during forwardprop, similar to what was happening in .predict, but I can't find any evidence of this.

I'll have to investigate this further. Any large dataset (say, > 2GB) that yields a few thousand input dimensions when encoded should do the trick for testing, except for image datasets, since we load those from disk when encoding and reduce the dimension by quite a lot.

Separate encoder logic

Currently, an encoder's encode function encodes all the data in the column and creates the mapping required for this encoding (e.g. the dictionary for one-hot or the min-max range for numerical encoders) in one go.

We should separate this into something like:

create_encoding_mapping
and
encode

Where the latter should be allowed to operate on an arbitrary amount of data from the column.
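A minimal sketch of this split for a one-hot encoder, using the two method names proposed above. The class itself is hypothetical and works on plain lists instead of lightwood's real data structures; mapping unknown values to an all-zero vector is an assumption.

```python
class OneHotEncoder:
    """One-hot encoder whose mapping is built separately from encoding."""

    def __init__(self):
        self.mapping = None

    def create_encoding_mapping(self, column_data):
        """Build the category -> index dictionary from the column."""
        categories = sorted({value for value in column_data if value is not None})
        self.mapping = {category: index for index, category in enumerate(categories)}

    def encode(self, values):
        """Encode any number of values; unknown or missing ones become all zeros."""
        if self.mapping is None:
            raise RuntimeError("call create_encoding_mapping() first")
        encoded = []
        for value in values:
            vector = [0] * len(self.mapping)
            if value in self.mapping:
                vector[self.mapping[value]] = 1
            encoded.append(vector)
        return encoded


encoder = OneHotEncoder()
encoder.create_encoding_mapping(["red", "green", "blue", "green"])
batch = encoder.encode(["green", "purple"])
```

Once the mapping exists, `encode` can be called on a single value, a batch, or the whole column, which is what makes disabling caches feasible.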

Input parameters error for callback_on_iter() function

Describe the bug
TypeError is thrown when training a new model.
In the latest version of lightwood, accuracy is sent to the callback_on_iter function, which doesn't accept that parameter:

callback_on_iter(epoch, training_error, test_error, delta_mean, self.calculate_accuracy(test_data_ds))

To Reproduce
Steps to reproduce the behavior:

  1. Train new data
  2. See error: callback_on_iter() takes 5 positional arguments but 6 were given

Screenshots
Screenshot from 2019-09-05 01-07-42
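One way for a caller to stay compatible with both the old and the new call is to accept the extra value as an optional argument. This is a sketch of a hypothetical callback, not the one used by mindsdb. Note that when the callback is a bound method, `self` counts as one of the positional arguments, which is consistent with the "takes 5 ... but 6 were given" message.

```python
def callback_on_iter(epoch, training_error, test_error, delta_mean, accuracy=None):
    """Per-iteration callback that tolerates the newer `accuracy` argument."""
    message = f"epoch={epoch} train={training_error} test={test_error} delta={delta_mean}"
    if accuracy is not None:
        message += f" accuracy={accuracy}"
    return message


old_style = callback_on_iter(1, 0.5, 0.6, 0.01)
new_style = callback_on_iter(1, 0.5, 0.6, 0.01, 0.9)
```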

Could not find a version that satisfies the requirement torch>=1.1.0.post2

Installing lightwood produces the following error:

ERROR: Could not find a version that satisfies the requirement torch>=1.1.0.post2 (from lightwood) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 0.4.1, 0.4.1.post2, 1.0.0, 1.0.1, 1.0.1.post2, 1.1.0)
ERROR: No matching distribution found for torch>=1.1.0.post2 (from lightwood)

The strange thing is that torch version 1.1.0.post2 is available on PyPI. Similar issues are reported on the PyTorch repo.

Make certain encoders "learn" a correlation with the target variable

We discussed doing this with the categorical autoencoder first, but it could be done with other encoders as well.

Essentially, we should try to predict the target variable from the intermediary representation when training the autoencoder. So instead of the autoencoder being:

column_value -> IR -> column_value

it would become:

column_value -> IR -> column_value + target

This could be useful for two reasons:

a) We know that if the IR can be used to reasonably well predict the target, it can be also used by the mixer to predict the target

b) We could obtain some sort of "column correlation" score from this which could help us determine the importance of the column, which might be interesting both for the user and for lightwood itself when it decides which columns to drop out during later stages of training (see #68)

Cuda-enabled Learning

Hi,

How do I modify the learn method to be cudified? I saw some stuff from earlier with specifying devices, etc, but wasn't sure about the details there as I couldn't find anything in the documentation

Columns that are too large or have too many dimensions for the categorical autoencoder

Certain categorical columns seem large enough that we run OOM when training the categorical autoencoder. This can happen for two reasons:

a) Columns with loads of dimensions. In this case we might just want to eliminate categorical columns with 5000+ or so dimensions for mindsdb, or encode them as text when appropriate

b) [More pressing] When the column is large but doesn't contain an excessive number of dimensions. Not sure why it would crash in this situation, but I'm too tired to check at the moment. Should look into this later.

Windows Installation issue

Describe the bug
Installation failed for lightwood==0.7.6. Error:

Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI. lightwood depends on torch@ https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-win_amd64.whl

To Reproduce
Steps to reproduce the behavior:

  1. pip install lightwood==0.7.6 or just pip install lightwood, since 0.7.6 is the latest

Desktop (please complete the following information):

  • OS: windows 10
  • Lightwood version 0.7.6
  • Python Version 3.7.3

Travis build not failing for failed tests

Describe the bug
Even if test scripts fail, the Travis build still passes.
To Reproduce
Run travis build

Expected behavior
If there is an error in travis unit test scripts then the build should fail

Additional context
This replicates a similar issue in mindsdb/mindsdb: mindsdb/mindsdb#343

Allow disabling the encoder and transformer caches

Allow the disabling of encoder and transformer caches, via a flag and/or automatically when the data in a column or in all the columns would result in caches that are too big.

Before doing this we need to implement #29

Add argument for whether and which GPUs to use

For debugging purposes, and for the sake of people wanting to try lightwood who have issues with cudnn, we should add an argument to allow forcing lightwood to use the CPU.

We should also add (ideally via the same argument) the ability to specify a list of GPUs to use (for people wishing to keep certain GPUs dedicated to other models).
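One low-effort way to support both cases is to translate the argument into the `CUDA_VISIBLE_DEVICES` environment variable before CUDA is initialised; an empty value hides every GPU. The function names and the `use_gpu` / `gpu_ids` arguments below are hypothetical, not part of lightwood's API.

```python
import os


def visible_devices_value(use_gpu=True, gpu_ids=None):
    """Build a CUDA_VISIBLE_DEVICES value from a (hypothetical) user argument."""
    if not use_gpu:
        return ""                 # empty string hides every GPU, forcing the CPU
    if gpu_ids is None:
        return None               # leave the environment untouched
    return ",".join(str(gpu_id) for gpu_id in gpu_ids)


def apply_device_choice(use_gpu=True, gpu_ids=None, environ=os.environ):
    """Set CUDA_VISIBLE_DEVICES; must run before CUDA is initialised."""
    value = visible_devices_value(use_gpu, gpu_ids)
    if value is not None:
        environ["CUDA_VISIBLE_DEVICES"] = value
    return value
```

For example, `apply_device_choice(gpu_ids=[0, 2])` would restrict the process to GPUs 0 and 2, while `apply_device_choice(use_gpu=False)` would force the CPU.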

Start training on a sub-set

Instead of training on the whole dataset at once, start training on a small sub-set (say, as large as 5 or so batches) and, as the network starts converging on that subset, start feeding it the whole dataset (or a bigger sub-set).

This sort of "priming" might help us achieve convergence quicker, which could be rather good considering 0.11.8 increased training times.
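A sketch of what such a schedule might look like, assuming the subset grows geometrically each time the network converges on the current one. `subset_schedule`, the starting size of 5 batches and the growth factor of 4 are all hypothetical choices.

```python
def subset_schedule(nr_rows, batch_size, start_batches=5, growth=4):
    """Yield growing subset sizes, ending with the full dataset."""
    size = min(nr_rows, start_batches * batch_size)
    while size < nr_rows:
        yield size
        size = min(nr_rows, size * growth)
    yield nr_rows


sizes = list(subset_schedule(nr_rows=10_000, batch_size=100))
```

For 10,000 rows and batches of 100 this yields subsets of 500, 2,000 and 8,000 rows before the full dataset.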

Training parallelism on multi-GPU machines

We need to make lightwood able to use multi-GPU machines in order to train faster. PyTorch should have support for this, so it shouldn't be terribly hard to implement, but testing to make sure nothing is broken by this might take a bit.

TypeError: object of type 'NoneType' has no len()

Describe the bug
TypeError: object of type 'NoneType' has no len() is thrown when using lightwood as backend.

To Reproduce
Steps to reproduce the behavior:

  1. Use this example

Screenshots
Screenshot from 2019-09-02 17-31-02

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Lightwood version 0.9.0
  • Python Version 3.7.4

Save encoder

Since we are now training certain encoders, and that process takes time, it would be nice if we could save each encoder once training is complete, so that if we run into various issues with the other encoder or mixers, or if we want to tweak the mixer behavior but not the encoders, we don't have to re-train everything again.

Only really an issue on very large datasets, but considering we have some datasets where it takes over a day to train the text encoders (on an undersampled version), I think this might be a time saver in the long run.

This kind of modular saving/freezing of certain components could also be a good lead into being able to partially re-train a model when new data points come.
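A minimal sketch of per-encoder saving, assuming each trained encoder can be pickled and that column names are safe to use as file names. `save_encoders` and `load_encoders` are hypothetical helpers, and the dictionary stands in for a real encoder object.

```python
import os
import pickle
import tempfile


def save_encoders(encoders, directory):
    """Pickle each trained encoder into its own file, keyed by column name."""
    os.makedirs(directory, exist_ok=True)
    for column, encoder in encoders.items():
        with open(os.path.join(directory, f"{column}.pickle"), "wb") as handle:
            pickle.dump(encoder, handle)


def load_encoders(columns, directory):
    """Load previously saved encoders; columns without a file are skipped."""
    loaded = {}
    for column in columns:
        path = os.path.join(directory, f"{column}.pickle")
        if os.path.exists(path):
            with open(path, "rb") as handle:
                loaded[column] = pickle.load(handle)
    return loaded


with tempfile.TemporaryDirectory() as directory:
    save_encoders({"review": {"vocab": ["good", "bad"]}}, directory)
    restored = load_encoders(["review", "title"], directory)
```

Columns that come back from `load_encoders` can skip encoder training, while missing ones (here `title`) are trained as usual.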

Add generic training loop

Currently, a lot of the elements of the training loop inside the NN mixer's iter_fit, and in the callback that the predictor API passes to it, are reused in the categorical autoencoder and the text encoder. It might be worthwhile to abstract a few of these away and reuse them in all 3 places.

ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing'

Describe the bug
Installing the latest version of lightwood throws an ImportError. I guess the issue is related to scikit-learn.

To Reproduce
Steps to reproduce the behavior:

  1. Train model, the issue is not related to a specific dataset
  2. See error
from cesium import featurize

File "/home/zoran/MyProjects/lightwood/l/lib/python3.7/site-packages/cesium-0.9.9-py3.7-linux-x86_64.egg/cesium/featurize.py", line 10, in
from sklearn.preprocessing import Imputer
ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing' (/home/zoran/MyProjects/lightwood/l/lib/python3.7/site-packages/scikit_learn-0.22rc3-py3.7-linux-x86_64.egg/sklearn/preprocessing/init.py)

Screenshots
Screenshot from 2019-12-02 14-40-15

Encoder transfer

There are certain encoders which we might want to train whenever lightwood learns a model, such as the categorical auto-encoder, the basic RNN text encoder and various other encoders (see #6).

We might want to re-train a model but not re-train all the encoders used for that model (e.g. if a new column was added).

Partially, this requires implementing modular re-training logic (i.e. in this case start the mixer training from scratch but leave the encoders as is), but also that we're able to train encoders for new columns or for a list of columns that we think have changed in such a way as to warrant re-encoding.

AX Optimization causes transformer error during training

Sometimes, when running the AX optimization (happened on a proprietary dataset, doesn't seem to happen on others), the transformer's self.feature_len_map mysteriously gets cleared before the first call to the callback function during training.

I'm not sure what's causing this; we need to look into it further.

Cache disabling breaks predict functionality

If we disable encoded value caching when making predictions, lightwood crashes because it tries to encode the missing target variable column[s].
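A likely direction for a fix is to skip the output columns when encoding on the fly at prediction time. The sketch below only illustrates that idea: `encode_row`, the `predicting` flag and the use of `float` as a stand-in encoder are hypothetical and unrelated to lightwood's real data source code.

```python
def encode_row(row, encoders, output_columns, predicting=False):
    """Encode a row column by column, skipping targets while predicting."""
    encoded = {}
    for column, encode in encoders.items():
        if predicting and column in output_columns:
            continue              # the target is what we are trying to predict
        encoded[column] = encode(row[column])
    return encoded


encoders = {"rooms": float, "price": float}     # stand-ins for real encoders
features = encode_row({"rooms": "3"}, encoders, {"price"}, predicting=True)
```

Without `predicting=True` the same call raises a `KeyError` for the missing target column, which mirrors the failure reported here.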

Example stack trace:

ERROR:mindsdb-logger-core-logger:libs/controllers/transaction.py:126 - Could not load module ModelInterface                                                                                   
                                                                                                                                                                                              
ERROR:mindsdb-logger-core-logger:libs/controllers/transaction.py:127 - Traceback (most recent call last):                                                                                     
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc                                                                                   
    return self._engine.get_loc(key)                                                                                                                                                          
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
KeyError: '<target_value>'                                                                                                                                                                       
                                                                                                                                                                                              
During handling of the above exception, another exception occurred:                                                                                                                           
                                                                                                                                                                                              
Traceback (most recent call last):                                                                                                                                                            
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/controllers/transaction.py", line 123, in _call_phase_module                                                                     
    return module(self.session, self)(**kwargs)                                                                                                                                               
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/phases/base_module.py", line 54, in __call__                                                                                     
    ret = self.run(**kwargs)                                                                                                                                                                  
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/phases/model_interface/model_interface.py", line 33, in run                                                                      
    self.transaction.hmd['predictions'] = self.transaction.model_backend.predict()                                                                                                            
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/backends/lightwood.py", line 228, in predict                                                                                     
    predictions = self.predictor.predict(when_data=run_df)                                                                                                                                    
  File "/home/ubuntu/george_experiments/lightwood/lightwood/api/predictor.py", line 353, in predict                                                                                           
    return self._mixer.predict(when_data_ds)                                                                                                                                                  
  File "/home/ubuntu/george_experiments/lightwood/lightwood/mixers/nn/nn.py", line 65, in predict                                                                                             
    for i, data in enumerate(data_loader, 0):                                                                                                                                                 
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 346, in __next__                                                                                
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration                                                                                                                      
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch                                                                                  
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                                                              
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>                                                                             
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                                                              
  File "/home/ubuntu/george_experiments/lightwood/lightwood/api/data_source.py", line 111, in __getitem__                                                                                     
    sample[feature_set][col_name] = self.get_encoded_column_data(col_name, feature_set, custom_data={col_name: [self.data_frame[col_name].iloc[idx]]})[0]                                     
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2995, in __getitem__                                                                                      
    indexer = self.columns.get_loc(key)                                                                                                                                                       
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc                                                                                   
    return self._engine.get_loc(self._maybe_cast_indexer(key))                                                                                                                                
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
KeyError: '<target_value>'

ModuleNotFoundError: No module named 'botorch.models.fidelity'

Describe the bug
There is an import error when using the latest lightwood version, lightwood==0.13.5.

To Reproduce
Steps to reproduce the behavior:

  1. Run full_test.py or any example from the mindsdb repository.

** Stacktrace **

  File "train.py", line 5, in <module>
    import mindsdb
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/mindsdb/__init__.py", line 7, in <module>
    import lightwood
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/__init__.py", line 8, in <module>
    import lightwood.model_building
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/model_building/__init__.py", line 1, in <module>
    from lightwood.model_building.basic_ax_optimizer.basic_ax_optimizer import BasicAxOptimizer
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/model_building/basic_ax_optimizer/basic_ax_optimizer.py", line 1, in <module>
    import ax
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/__init__.py", line 5, in <module>
    from ax.modelbridge import Models
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/__init__.py", line 6, in <module>
    from ax.modelbridge.factory import (
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/factory.py", line 13, in <module>
    from ax.modelbridge.discrete import DiscreteModelBridge
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/discrete.py", line 18, in <module>
    from ax.models.discrete_base import DiscreteModel
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/__init__.py", line 5, in <module>
    from ax.models.torch.botorch import BotorchModel
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 10, in <module>
    from ax.models.torch.botorch_defaults import (
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 12, in <module>
    from botorch.models.fidelity.gp_regression_fidelity import (
ModuleNotFoundError: No module named 'botorch.models.fidelity'

Ignore deployment for Doc type change

Describe the bug
The Travis deploy also runs for documentation-only changes.

To Reproduce
Send any PR with a doc-type change, such as a change to a .md file, LICENSE, or .travis.yml.

Expected behavior
Deployment should be skipped

Additional context
This is similar to what was requested in mindsdb/mindsdb, just a replication of mindsdb/mindsdb#352
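
One way to decide whether a deployment can be skipped is to check that every changed file is a doc-type file. The sketch below is only an illustration: the function name and the pattern sets are assumptions taken from the examples in this report (.md, LICENSE, .travis.yml), and it is not how the project's CI is actually configured.

```python
from pathlib import PurePosixPath

# Assumed doc-type patterns, based only on the examples named in this issue.
DOC_SUFFIXES = {".md"}
DOC_FILENAMES = {"LICENSE", ".travis.yml"}


def is_doc_only_change(changed_files):
    """Return True when every changed path looks like a doc-type file."""
    paths = [PurePosixPath(p) for p in changed_files]
    if not paths:
        # An empty change list is not treated as doc-only; the caller decides.
        return False
    return all(p.suffix in DOC_SUFFIXES or p.name in DOC_FILENAMES for p in paths)
```

A CI step could pass the list of files changed in the PR to this helper and skip the deploy stage when it returns True.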

Numerical encoder / Predict data frame iteration issue

There was an issue in mindsdb's CI tests: a list of valid numbers was passed as the values to predict (as a column in the dataframe), yet while encoding it the numerical encoder somehow encountered a single value that was None.

There's a hotfix at line 76 of the numerical encoder, but we still need to figure out why this happens. I suspect it's a lightwood bug, since there's no issue on the mindsdb side (the numbers being passed are all valid: not None and not infinite).
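
To narrow down where the None comes from, a small check could be run on the values both before they are handed to lightwood and again right before the encoder sees them. This is a hypothetical debugging helper, not part of lightwood or of the hotfix; the function name is an assumption.

```python
import math
import numbers


def find_invalid_numbers(values):
    """Return (index, value) pairs for entries that are None, non-numeric, NaN or infinite."""
    invalid = []
    for i, v in enumerate(values):
        if v is None or not isinstance(v, numbers.Real):
            invalid.append((i, v))
        elif isinstance(v, float) and not math.isfinite(v):
            # Catches NaN as well as positive and negative infinity.
            invalid.append((i, v))
    return invalid
```

If the list is clean on the mindsdb side but this helper reports an entry inside the data source iteration, the value is being lost somewhere between the two points.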

Setup linter

Describe the bug
Set up a Python linter for the project.

Expected behavior
Linting code

I know this is not the main purpose of the project, but I think that setting up a linter would make the project easier to maintain.

What do you think?

Training parallelism on multiple machines

We should start thinking about this once we're done with #63. PyTorch and various PyTorch-related frameworks should provide some support for this, but I doubt it's going to be very useful outside of very large datasets, due to the large data transfer and synchronization overhead.

A good candidate to try this out on would be the Ax Optimizer, which could run multiple trials on different machines.
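
The idea of farming out trials can be sketched locally, with worker threads standing in for separate machines. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callable that trains with one set of hyperparameters and returns a score (higher is better), and nothing here uses the Ax API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_trials_in_parallel(evaluate, candidate_params, max_workers=4):
    """Evaluate each hyperparameter candidate concurrently and return the best one.

    `evaluate` is a hypothetical callable mapping a params dict to a score.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of candidate_params in the results.
        scores = list(pool.map(evaluate, candidate_params))
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidate_params[best_index], scores
```

Each trial in this sketch only exchanges its parameters and a single score, so the transfer and synchronization overhead mentioned above is small compared with splitting one training run across machines.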

Training encoders

It might be worthwhile looking into training some of the final layers of the image and sequence encoders once the mixer starts achieving decent performance.

However, this would require (I think) making the encoders part of the actual network we optimize or adding a bunch of error propagation logic to the encoder objects and creating the glue code between the error on the final layer of the mixer and the first layer of each encoder.

Implementing this dynamic encoder modification might also help us test different encoders during the initial steps of training in the future, to automatically determine the best encoders for a specific dataset.

Make text encoders predict numerical targets (and maybe other types)

Attach a head to DistilBERT (or, preferably, make it generic, since all of the current ones output 768-dimensional embeddings anyway) that can predict numerical targets, and train it in a similar way to what we do for categorical targets.

Maybe try using the categorical head and see if it fits the bill with the right function (maybe change/remove the last layer if it's a softmax / some other exponential normalization function).

We could also do this for text/image/sequence outputs. However, for text there are the LM (language modeling) heads that Hugging Face already provides, we don't really support image outputs, and I'm not really familiar with the representations that come out of cesium, so I'm not sure how easy those would be to "predict" or what kind of loss/model one would want to use for them.

Provide column importance scores via lightwood

This should probably come after #68 and #69

We could try providing some sort of column importance score from lightwood based on:

a) The results of training with drop-out (#68), i.e. if dropping out a given column yields a low accuracy then it's probably very important and vice-versa
b) The correlation that certain encoders find between the IR of the column and the target variable (see #69)
c) Maybe some analysis of the input weights corresponding to the inputs derived from the column and/or a more in-depth analysis of their importance in the graph and/or maybe even the gradient flow during training.
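
Option (a) could look roughly like the sketch below, which ranks columns by how much accuracy falls when each one is dropped. It is a minimal illustration, not lightwood code: `score_without` is a hypothetical callable that returns the model's accuracy with the given columns masked out, and calling it with an empty tuple is assumed to give the baseline.

```python
def column_importance(columns, score_without):
    """Rank columns by the accuracy drop observed when each one is dropped.

    `score_without(dropped)` is a hypothetical callable returning accuracy
    with the columns in `dropped` masked; `score_without(())` is the baseline.
    """
    baseline = score_without(())
    drops = {col: baseline - score_without((col,)) for col in columns}
    # Largest drop first: losing that column hurts accuracy the most.
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))
```

A large positive drop suggests an important column, while a drop near zero (or negative) suggests the column adds little or is redundant with others.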
