mindsdb / mindsdb_native

Machine Learning in one line of code

Home Page: http://mindsdb.com

License: GNU General Public License v3.0

ml xai automl machine-learning machinelearning artificial-intelligence hacktoberfest

mindsdb_native's Introduction

This repository is now deprecated. Please consider using mindsdb proper for a high-level automatic machine learning solution, or the new lightwood if you want something lower-level.

MindsDB


MindsDB is an Explainable AutoML framework for developers, built on top of PyTorch. It enables you to build, train and test state-of-the-art ML models in as little as one line of code.


Try it out

Installation

  • Desktop: You can use MindsDB on your own computer in under a minute. If you already have a Python environment set up, just run the following command:
 pip install mindsdb_native --user

Note: a 64-bit version of Python is required. Depending on your environment, you might have to use pip3 instead of pip in the above command.

If for some reason this fails, don't worry: simply follow the complete installation instructions, which will lead you through a more thorough procedure that should fix most issues.

  • Docker: If you would like to run it all in a container, simply run:
sh -c "$(curl -sSL https://raw.githubusercontent.com/mindsdb/mindsdb/master/distributions/docker/build-docker.sh)"

Usage

Once you have MindsDB installed, you can use it as follows:

Import MindsDB:

from mindsdb_native import Predictor

One line of code to train a model:

# tell mindsDB what we want to learn and from what data
Predictor(name='home_rentals_price').learn(
    to_predict='rental_price', # the column we want to learn to predict given all the data in the file
    from_data="https://s3.eu-west-2.amazonaws.com/mindsdb-example-data/home_rentals.csv" # the path to the file where we can learn from, (note: can be url)
)

One line of code to use the model:

# use the model to make predictions
result = Predictor(name='home_rentals_price').predict(when_data={'number_of_rooms': 2, 'initial_price': 2000, 'number_of_bathrooms':1, 'sqft': 1190})

# you can now print the results
print('The predicted price is between ${price} with {conf} confidence'.format(price=result[0].explanation['rental_price']['confidence_interval'], conf=result[0].explanation['rental_price']['confidence']))

Visit the documentation to learn more

  • Google Colab: You can also try MindsDB right in your browser via Google Colab

Contributing

To contribute to MindsDB, please check out the Contribution guide.


Report Issues

Please help us by reporting any issues you may have while using MindsDB.

License

GNU General Public License v3.0

mindsdb_native's People

Contributors

analogworker, bhavin-tandel, btseytlin, cclauss, dependabot[bot], duanzhihua, george3d6, github-actions[bot], ilia-tsyplenkov, llgaetanll, maximlopin, mcroni, mindsdbadmin, paperpanks, paulfantom, paxcema, purvimisal, raineydavid, rbanffy, ricram2, ritwik12, sanketsaurav, sk-ip, sreeharicodes, stpmax, surendra1472, thiagoalmeidasa, tony, torrmal, zoranpandovski


mindsdb_native's Issues

Current MSSQL driver can't fetch big data

  • Python version: 3.6.9
  • Operating system: ubuntu
  • Mindsdb version: latest staging
  • Additional info if applicable: pytds==1.10.0

I have trouble creating a predictor with an MSSQL datasource. The reason is pytds: this library raises ClosedConnectionError('Server closed connection',) if the returned result set is bigger than some value.
I used the regular 'home_rentals' dataset, saved as a table in MSSQL. If I try to make a datasource from all rows, I get the error. If I make a datasource from, say, 1000 rows, it is created successfully. I can also make a datasource from 3-4 columns of all rows successfully. If I use pymssql instead of pytds, everything works well on the full dataset. I tried to find a pytds parameter to fix this issue, but didn't find one. I know pytds was chosen as a pure-Python library to make the installer build easier, but maybe an alternative exists?

Only store train/val/test indices instead of copying dataframes

In DataSplitter we do this:

self.transaction.input_data.train_df = self.transaction.input_data.data_frame.iloc[train_indexes[KEY_NO_GROUP_BY]].copy()
self.transaction.input_data.test_df = self.transaction.input_data.data_frame.iloc[test_indexes[KEY_NO_GROUP_BY]].copy()
self.transaction.input_data.validation_df = self.transaction.input_data.data_frame.iloc[validation_indexes[KEY_NO_GROUP_BY]].copy()

In fact, we can just store the indices while still using one dataframe under the hood.

This should improve memory consumption for large datasets and let us simplify the code in DataTransformer and the following phases.
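
A minimal sketch of the idea; the class and attribute names here (InputData, train_indexes, etc.) only illustrate the proposal, they are not the existing API:

import pandas as pd

class InputData:
    """Keeps a single dataframe; splits are exposed as views over stored indices."""
    def __init__(self, data_frame: pd.DataFrame):
        self.data_frame = data_frame
        self.train_indexes = []
        self.test_indexes = []
        self.validation_indexes = []

    @property
    def train_df(self):
        # No .copy(): rows are materialized only when a phase asks for them
        return self.data_frame.iloc[self.train_indexes]

    @property
    def test_df(self):
        return self.data_frame.iloc[self.test_indexes]

    @property
    def validation_df(self):
        return self.data_frame.iloc[self.validation_indexes]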

Select lightwood models [DRAFT]

Add the ability to test multiple lightwood configurations in lightwood.py and select the one with the highest accuracy (either self-reported accuracy or accuracy as validated by the model analysis phase)

Automatic sampling & transform cache disabling

We need to:

a) Start sampling training data when the dataset is too large (e.g. can barely fit in RAM)
b) Automatically disable the transformed cache for lightwood when the dataset is too large (e.g. ~5-10% or more of the total RAM)

For now let's say we have 4 potential states for the data:

  • small
  • medium
  • big
  • huge

The state is determined by how much of the available RAM the pandas dataframe uses (a sketch of this classification follows the list below).

If small --- do nothing
If medium --- use the current sampling algorithm to sample the data used for analysis
If big --- same as medium + disable the transform cache in the lightwood configuration for training
If huge --- same as big + use the current sampling algorithm to sample the data for the testing, training and validation sets.
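
A minimal sketch of how the classification might work, assuming psutil for RAM detection; the thresholds are placeholders, not decided values:

import psutil

def classify_dataset_size(df):
    """Classify a dataframe as small/medium/big/huge by its share of total RAM."""
    total_ram = psutil.virtual_memory().total
    share = df.memory_usage(deep=True).sum() / total_ram
    if share < 0.05:   # placeholder threshold
        return 'small'
    if share < 0.25:   # placeholder threshold
        return 'medium'
    if share < 0.5:    # placeholder threshold
        return 'big'
    return 'huge'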

Negative values for column importance.

Please describe your issue and how we can replicate it:

Dataset: African Crisis.
Model:
afr_year.zip


...
   {
            "column_name":"banking_crisis",
            "importance_score":-1.0,
            "data_type":"Categorical",
            "data_type_distribution":{
               "type":"categorical",
               "x":[
                  "Categorical"
               ],
               "y":[
                  1044
               ]
            },
...

Issue: importance scores are not supposed to be negative.

Inconsistencies between the test accuracy and the confusion matrix.

Given the following dataset:

Plackett-Burman_Coffee-July.xlsx

and trying to predict: Coffe_Malt,Chocolat,Gold,Medium_Barley,Dark_Barley,Dandelion,Beets,Chicory_Roots,Figs,Dates,Cacao

There seem to be some inconsistencies between the confusion matrices and the test accuracy (i.e. the one on the validation dataset) in scout.

Please make sure that the confusion matrix is computed on the validation dataset (the one we also use for the validation_accuracy and for the bayesian confidence models).

Barring that being an issue, please investigate whether this might just boil down to an issue with scout and, if so, follow up with @Ricram2

Add filter capabilities to the DataSource

Warning: This issue is intended as a test to be given to candidates for a job at Mindsdb. Please do not start working on it unless we've messaged you telling you to do so.

Mindsdb has a bunch of DataSources: https://github.com/mindsdb/mindsdb_native/tree/stable/mindsdb_native/libs/data_sources
All inheriting from the main DataSource class: https://github.com/mindsdb/mindsdb_native/blob/stable/mindsdb_native/libs/data_types/data_source.py

We need to add a filter function to all of these data sources. It should take a list of tuples of the form [column_name, operator, value] and filter (conceptually, a WHERE clause) based on those tuples. It should also accept a second argument (limit) which limits the number of rows returned by the filter method.

The filtered rows should be returned as part of a pandas data frame.

For a lot of datasources this can be a very simple method using the internal pandas dataframe, but for database-based datasources (mariadb, postgres, mysql, clickhouse) it might be wiser to "alter" the internal SQL statement used to query the database in order to add the filters & limit there and get the data more efficiently that way (with a fallback to filtering the dataframe).

You need not provide a query-based implementation for every single database-based datasource; it's sufficient to do so for one or two of them if the method is conceptually sound and could apply to every single one of them.

Make sure all datasources (even the database-based ones) can run filter on the internal pandas dataframe if you haven't implemented it using database queries for that particular datasource.

Examples of usage:

filter([('index','>',800),('name','=','John Doe')])

filter(where=[('foo','LIKE','%bar%')], limit=250)
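
A minimal pandas-based sketch of the fallback path; the self._df attribute and the exact operator set are assumptions, not the existing API:

_OPERATORS = {
    '>':    lambda s, v: s > v,
    '<':    lambda s, v: s < v,
    '=':    lambda s, v: s == v,
    '!=':   lambda s, v: s != v,
    # naive translation of SQL LIKE '%' wildcards into a regex
    'LIKE': lambda s, v: s.astype(str).str.match(v.replace('%', '.*')),
}

class DataSource:
    def filter(self, where=None, limit=None):
        df = self._df  # internal pandas dataframe (attribute name assumed)
        for column_name, operator, value in (where or []):
            df = df[_OPERATORS[operator](df[column_name], value)]
        return df if limit is None else df.head(limit)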

New Text Types

TEXT should become its own type instead of a subtype of sequential, with two subtypes: Short Text and Rich Text.

Lightwood should use a different encoder for each: the Hugging Face transformer for Rich Text, and the new encoder (to be written by @maximlopin) for Short Text.

Wrong string 'None' values in prediction result

  • Python version: 3.6.9
  • Mindsdb version: 2.00.0 staging

To reproduce the issue, repeat the steps from readme.md:

Predictor(name='home_rentals_price').learn(
    to_predict='rental_price', # the column we want to learn to predict given all the data in the file
    from_data="https://s3.eu-west-2.amazonaws.com/mindsdb-example-data/home_rentals.csv" # the path to the file where we can learn from, (note: can be url)
)
result = Predictor(name='home_rentals_price').predict(when_data={'number_of_rooms': 2, 'initial_price': 2000, 'number_of_bathrooms':1, 'sqft': 1190})

then check the data in the result:

result[0].as_dict() 
{'number_of_rooms': '2', 'number_of_bathrooms': '1', 'sqft': 1190, 'location': 'None', 'days_on_market': None, 'initial_price': 2000, 'neighborhood': 'None', 'rental_price': 3523, '__observed_rental_price': None, 'rental_price_model_confidence': 0.995, 'rental_price_confidence_range': [3785, 2962], 'rental_price_confidence': 0.8764537068240028}

Pay attention to the 'days_on_market' and 'location' fields: the first has the value None, which is correct, but the second has the value 'None' (a string) when it should be None.

ModelInterface: "ValueError: month must be in 1..12"

To reproduce
Download this dataset (covid.csv):
https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset

Run this code:

from mindsdb_native import Predictor
pred = Predictor('value_error')
pred.learn(
    from_data='covid.csv',
    to_predict='date_died',
    ignore_columns=['id'],
    advanced_args={'null_values': {'date_died': ['9999-99-99']}, 'debug': True}
)

Expected the predictor to successfully train a model, but there is an exception in the ModelInterface phase: ValueError: minute must be in 0..59

Wrong 'word_dist' in analysis of text column

Here is test data:
test.zip
Here is the result of analyzing column 'jjj':

'word_dist': {'b': 54, 'e': 6, 'l': 5, 'x': 4, 'k': 4, 'j': 3, 'g': 3, 'r': 2, 'w': 2, '0': 1, '1': 1, '2': 1, '3': 1, '4': 1, '5': 1, '6': 1, '7': 1, '8': 1, '9': 1, '10': 1, '11': 1, '12': 1, '13': 1, '14': 1, '15': 1, '16': 1, '17': 1, '18': 1, '19': 1, '20': 1, '21': 1, '22': 1, '23': 1, '24': 1, '25': 1, '26': 1, '27': 1, '28': 1, '29': 1, '30': 1, '31': 1, '32': 1, '33': 1, '34': 1, '35': 1, '36': 1, '37': 1, '38': 1, '39': 1, '40': 1, 'other words': 9}

In column 'jjj', "a" appears at least 50 times, but it is missing from 'word_dist'

Data cleaner runs twice to drop foreign keys

The whole design of mindsdb native is based on sequentially executed phases that process data.

But the DataCleaner phase runs twice in LearnTransaction, as a special case. This is a hack, because:

  • Data has to be cleaned before running analysis by dropping empty columns and duplicate rows
  • DataAnalysis identifies foreign keys, which by default have to be dropped too. It's wrong to have DataAnalysis drop anything (that's not the responsibility of that phase), so DataCleaner is executed again to drop the foreign keys.

This contradicts the core architecture of mindsdb phases. It is also very opaque - we only discovered it because of a bug. At some point we have to rework it so it's more transparent and contributors can understand what's going on.

My best idea for a new design:

  • DataCleaner stays after DataExtractor, because things like dropping duplicates should be done before data analysis
  • Dropping foreign keys is moved to DataTransformer
  • DataTransformer is moved before DataSplitter, except for the logic of DataTransformer that operates on separate splits (the reweighting), which is moved to DataSplitter.

The good thing is that we already have tests covering all the cases, as well as the bug that appeared when I touched that code:
#79

Change confidence determination on numerical benchmarks

r2_score is not ideal for some numerical benchmarks because we might care more about % difference than absolute difference. Take this into consideration and use r2 on the log values or some other score (e.g. mean % difference) for evaluating numerical value prediction tasks.
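
A minimal sketch of the two alternatives mentioned, assuming scikit-learn's r2_score; the function names are illustrative:

import numpy as np
from sklearn.metrics import r2_score

def log_r2(y_true, y_pred):
    # r2 on log values weights relative (%) error instead of absolute error;
    # log1p assumes non-negative targets
    return r2_score(np.log1p(y_true), np.log1p(y_pred))

def mean_pct_difference(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))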

Add failure state for models

We need to allow models (lightwood models for now, but any models in the future) to "fail" during the ModelAnalyzer phase.

This means that, if a model is performing really poorly, instead of giving it to the user to make predictions, mindsdb_native should deem it "unusable", throw an error if a .predict call is attempted, and inform the user about this with a final error log.

We should have an advanced argument that allows these "failed" models to still be used in .predict calls in the rare cases where this might be necessary or desirable for debugging.

Some "easy" ways to determine failure, we can look at various conditions on the validation data inside the ModelAnalysis phase:

  1. Accuracy is == to random
  2. The target is only positive but some results are negative (or vice versa)
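
A minimal sketch of these two checks; the function and argument names are hypothetical:

import numpy as np

def model_has_failed(y_true, y_pred, accuracy, num_classes=None):
    # 1. Classification accuracy no better than random guessing
    if num_classes is not None and accuracy <= 1.0 / num_classes:
        return True
    # 2. Sign mismatch: target strictly positive but some predictions
    #    negative, or vice versa
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true > 0).all() and (y_pred < 0).any():
        return True
    if (y_true < 0).all() and (y_pred > 0).any():
        return True
    return False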

Array columns encoded as timeseries

When I train a predictor on a dataset with an array column, I get a lot of output like:

WARNING:root:Weird element encountered in timeseries: 3, !
WARNING:root:Weird element encountered in timeseries: 9] !
WARNING:root:Weird element encountered in timeseries: [8, !
WARNING:root:Weird element encountered in timeseries: 9, !
WARNING:root:Weird element encountered in timeseries: 0] !
WARNING:root:Weird element encountered in timeseries: [7, !
WARNING:root:Weird element encountered in timeseries: 6, !
WARNING:root:Weird element encountered in timeseries: 6] !

Here is a test dataset:
test.zip

Existence of a column named 'id' in the dataset makes a bad predictor

  • Mindsdb version: latest staging
  • Lightwood version: 0.34

For example, take the 'home_rentals' dataset and train a predictor for 'rental_price'. I tested three cases:

  1. Train on the data as-is, without any changes. I got correct results.
  2. Train on the data with a serial column whose name is not 'id'. In this case the results are good too.
  3. Train on the data with a serial column named 'id'. In this case the results are absolutely bad: the predicted 'rental_price' is negative in most cases, and the min/max values look random.

Timestamps older than Unix epoch

Your Environment

  • Python version: 3.6.11
  • Operating system: Ubuntu 16.04
  • Mindsdb native version: 2.3.0
  • Additional info if applicable:

Please describe your issue and how we can replicate it

Discovered while analyzing the dataset mentioned here. Basically, all date timestamps prior to the Unix epoch (Jan 1st 1970) are transformed by pandas into very large negative values in this line of code, which affects network stability downstream.
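
A quick illustration of the pandas behavior in question:

import pandas as pd

# Pre-epoch dates become large negative nanosecond counts
ts = pd.Timestamp('1960-01-01')
print(ts.value)  # -315619200000000000 (nanoseconds since the epoch)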

Using argument use_gpu=False has no effect

I updated mindsdb_native to the latest staging (currently the same as stable 2.3.0), started training a predictor with the argument use_gpu=False, and got the following error:

ERROR:mindsdb-logger-ec5c640a-dd60-11ea-a269-2c56dc4ecd27:/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py:178 - CUDA error: no kernel image is available for execution on the device

Process PredictorProcess-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb/mindsdb/interfaces/native/predictor_process.py", line 33, in run
    **kwargs
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/predictor.py", line 260, in learn
    logger=self.log)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 51, in __init__
    self.run()
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 183, in run
    self._run()
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 179, in _run
    raise e
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 164, in _run
    self._call_phase_module(module_name='ModelInterface', mode='train')
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 121, in _call_phase_module
    return module(self.session, self)(**kwargs)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/phases/base_module.py", line 54, in __call__
    ret = self.run(**kwargs)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/phases/model_interface/model_interface.py", line 31, in run
    self.transaction.model_backend.train()
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/backends/lightwood.py", line 220, in train
    self.predictor.learn(from_data=train_df, test_data=test_df, callback_on_iter=self.callback_on_iter, eval_every_x_epochs=eval_every_x_epochs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/api/predictor.py", line 212, in learn
    from_data_ds.prepare_encoders()
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/api/data_source.py", line 304, in prepare_encoders
    training_data=input_encoder_training_data)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/api/data_source.py", line 267, in prepare_column_encoder
    encoder_instance.prepare_encoder(column_data)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/encoders/text/rnn.py", line 42, in prepare_encoder
    self._encoder = EncoderRNN(self._input_lang.n_words, hidden_size).to(device)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 161, in _apply
    self.flatten_parameters()
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 151, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CUDA error: no kernel image is available for execution on the device

I started looking into what happened and got to mindsdb_native/mindsdb_native/libs/backends/lightwood.py#172; lightwood.config.config.CONFIG.USE_CUDA is indeed equal to False.

ID column is not correctly identified as such

Your Environment

  • Python version: 3.6.11
  • Operating system: Ubuntu 16.04
  • Mindsdb native version: 2.3.0
  • Additional info if applicable:

Please describe your issue and how we can replicate it
For the automobile_insurance dataset, the column 'Customer' is an ID column (8220 distinct values for 8220 rows); however, it is not identified as such, which results in creating and training an unnecessarily big autoencoder.

Make sure lock is explicitly removed before closing

If there is a crash, in some situations, the lock file might remain open and the lock won't be removed.

We can fix this in two ways:

a) Add a try/except/finally to everything in functional.py and everything in controllers.py that uses a lock.

b) Add an atexit handler to everything mentioned above

Option a is annoying because we'd lose the crash logs and would have to add custom code (using traceback) to recover them.

Option b seems fairly good actually, but I don't know, maybe I'm missing something.
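
A minimal sketch of option b, assuming a simple file-based lock; the function name and lock mechanics here are illustrative:

import atexit
import os

def acquire_lock(lock_path):
    # O_EXCL makes this fail with FileExistsError if the lock is already held
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

    def _cleanup():
        # Runs on normal interpreter exit, including after unhandled
        # exceptions, but not on SIGKILL or a hard interpreter crash
        try:
            os.close(fd)
            os.remove(lock_path)
        except OSError:
            pass

    atexit.register(_cleanup)
    return fd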

nltk download path has no write permission

  • Mindsdb version: latest staging

I tried to upload a predictor and got the following error:

[nltk_data] Downloading package stopwords to /home/maxs/nltk_data...
ERROR:mindsdb-logger-core-logger:/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py:124 - Could not load module TypeDeductor

Exception on /datasources/zzz [PUT]
Traceback (most recent call last):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/Flask-1.1.2-py3.6.egg/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/Flask-1.1.2-py3.6.egg/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/flask_restx-0.2.0-py3.6.egg/flask_restx/api.py", line 375, in wrapper
    resp = resource(*args, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/Flask-1.1.2-py3.6.egg/flask/views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/flask_restx-0.2.0-py3.6.egg/flask_restx/resource.py", line 44, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/flask_restx-0.2.0-py3.6.egg/flask_restx/marshalling.py", line 248, in wrapper
    resp = f(*args, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb/mindsdb/api/http/namespaces/datasource.py", line 115, in put
    ca.default_store.save_datasource(ds_name, source_type, source, file_path)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb/mindsdb/interfaces/datastore/datastore.py", line 131, in save_datasource
    df_with_types = cast_df_columns_types(df, self.get_analysis(df)['data_analysis_v2'])
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb/mindsdb/interfaces/datastore/datastore.py", line 28, in get_analysis
    return self.mindsdb_native.analyse_dataset(ds)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb/mindsdb/interfaces/native/mindsdb.py", line 46, in analyse_dataset
    return F.analyse_dataset(ds)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/functional.py", line 91, in analyse_dataset
    logger=log
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 51, in __init__
    self.run()
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 192, in run
    self._call_phase_module(module_name='TypeDeductor', input_data=self.input_data)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/controllers/transaction.py", line 121, in _call_phase_module
    return module(self.session, self)(**kwargs)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/phases/base_module.py", line 54, in __call__
    ret = self.run(**kwargs)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/phases/type_deductor/type_deductor.py", line 303, in run
    col_name)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/phases/type_deductor/type_deductor.py", line 244, in get_column_data_type
    nr_words, word_dist, nr_words_dist = analyze_sentences(data)
  File "/home/maxs/dev/mdb/venv_new/sources/mindsdb_native/mindsdb_native/libs/helpers/text_helpers.py", line 64, in analyze_sentences
    nltk.download('stopwords')
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/nltk-3.5-py3.6.egg/nltk/downloader.py", line 779, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/nltk-3.5-py3.6.egg/nltk/downloader.py", line 643, in incr_download
    for msg in self._download_package(info, download_dir, force):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/nltk-3.5-py3.6.egg/nltk/downloader.py", line 703, in _download_package
    os.mkdir(os.path.join(download_dir, info.subdir))
PermissionError: [Errno 13] Permission denied: '/home/maxs/nltk_data/corpora'

We probably need to change the nltk download path to another location.

Increase sampling threshold and ensure learn & analyse_dataset consistency

  • Increase the sampling threshold such that no sampling is done for the statistical analysis up to, say, 10,000 rows. Sampling shouldn't be required on small datasets.
  • Make sure the default sampling settings for the Analyse phase are consistent between learn and analyse_dataset, such that the analysis produced by calling either of them without providing that argument is the same.

Allow for model training to fail

We need to allow models (lightwood models for now, but any models in the future) to "fail" during the ModelAnalyzer phase.

This means that, if a model is performing really poorly, instead of giving it to the user to make predictions, mindsdb_native should deem it "unusable", throw an error if a .predict call is attempted, and inform the user about this with a final error log.

We should have an advanced argument that allows these "failed" models to still be used in .predict calls in the rare cases where this might be necessary or desirable for debugging.

Some "easy" ways to determine failure, we can look at various conditions on the validation data inside the ModelAnalysis phase:

  1. Accuracy is == to random
  2. The target is only positive but some results are negative (or vice versa)
  3. ... any ideas ?

Allow external data split

Add a way of providing an external data split to mindsdb via the advanced_args

This should be done by having an advanced_arg called something like data_split_indexes that is a dictionary with 3 lists of indexes:

  • validation_indexes
  • train_indexes
  • test_indexes

The best way to implement this is probably just to add this custom splitting behavior to the DataSplitter.
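
Hypothetical usage, with the argument name and shape taken from this issue; the index ranges are placeholders:

from mindsdb_native import Predictor

Predictor(name='example').learn(
    from_data='data.csv',
    to_predict='target',
    advanced_args={
        'data_split_indexes': {
            'train_indexes': list(range(0, 700)),
            'validation_indexes': list(range(700, 850)),
            'test_indexes': list(range(850, 1000)),
        }
    }
)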

Text data analysis

  1. Get the Language Distribution using https://github.com/flairNLP/flair

   • No language => Category
   • A language => move on to step 2

  2. Get the Nr of Words distribution: a histogram of the number of words for each sample in the column (don't bother linking this methodology to the language for now, though we might have to in the future for non-Indo-European languages).

  3. Get the Word Distribution: a histogram with the nr of occurrences for every word. Cap this histogram at e.g. the 1000 most common words.

Determine "Rich Text" vs "Short Text" based on some heuristic using the above, say:

if len(word distribution) > 500 and mean(nr of words) > 5 THEN "Rich Text" else "Short Text"

Add the types as described in #11 first.
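
A direct translation of that heuristic into Python (thresholds taken from the text above; the function and argument names are illustrative):

def classify_text_subtype(word_dist, nr_words_per_sample):
    # word_dist: dict mapping word -> nr of occurrences
    # nr_words_per_sample: list with the nr of words in each sample
    mean_nr_words = sum(nr_words_per_sample) / len(nr_words_per_sample)
    if len(word_dist) > 500 and mean_nr_words > 5:
        return 'Rich Text'
    return 'Short Text'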

Support null output for dates

Look at this dataset: https://www.kaggle.com/zhijinzhai/loandata
It has a column paid_off_time which is of type Date. There are many null values, but that's natural for this dataset, and mindsdb should support using it as the output column without dropping nulls.

Currently mindsdb will drop all rows where paid_off_time is null in the DataCleaner phase, and will train a model to predict paid_off_time given "loan_status" == "PAIDOFF", because rows where "loan_status" != "PAIDOFF" are dropped.

This might make sense for other data types too.

This can be made optional (a flag in advanced_args).

Alternative outlier detection algorithm

Add an alternative bucket-based outlier detection algorithm (maybe allowing us to switch between the two with a variable inside the DataAnalysis phase while we prototype this internally).

Instead of using the actual values to detect the outliers, this will use the buckets themselves, characterized by three dimensions:

  1. Nr of elements in the bucket
  2. Bucket start value
  3. Bucket end value

If getting 2 or 3 proves difficult, just use one of them and it's fine (but it should be pretty easy, i.e. it should be enough to just look towards the previous/next bucket for the boundary + make an assumption about the min/max element in the start/end bucket being the outer limit of that bucket).

Then just throw these 3-dimensional points into LOF, or alternatively into the outlier forest implemented in this PR: https://github.com/mindsdb/mindsdb_native/tree/missing_the_outliers_for_the_peaks, or some other algorithm.

Ideally just pick what "looks right" by prototyping 2 or 3 approaches and looking at a few datasets in Scout, but if you think there's some theoretical or semi-theoretical-mixed-with-trustworthy-empiricism backing for a given method, go ahead and use that.
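
A minimal sketch of the bucket-based variant using scikit-learn's LocalOutlierFactor; the bucket representation and parameters are assumptions:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def bucket_outliers(bucket_counts, bucket_starts, bucket_ends):
    # Each bucket becomes a 3-dimensional point: (nr of elements, start, end)
    points = np.column_stack([bucket_counts, bucket_starts, bucket_ends])
    # n_neighbors must be smaller than the number of buckets
    lof = LocalOutlierFactor(n_neighbors=5)
    labels = lof.fit_predict(points)  # -1 marks outlier buckets
    return np.where(labels == -1)[0]  # indices of the outlier buckets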

Don't classify single-words as text

Currently, single words can sometimes be classified as text.

I think this should happen close to never in real data, and it might lead to some situations where we classify categories as text.

For now, if a sentence contains a single word, go ahead and just assume it's a category, much like we did before.

Note: Not a bug per se, I know I asked for this initially, I just now realized the probability of cock-ups there is great and the reward tiny.

Docker integration tests don't work in Travis on Windows/macOS

Currently integration tests only run on Linux because they require running docker-compose up to start database containers. Travis does support Docker for Windows, but when trying to run the db containers there is an error:
no matching manifest for windows/amd64 10.0.17763 in the manifest list entries

This basically means that Docker is in "Windows containers" mode while the container is Linux-based, and Docker should be switched to "Linux containers" mode. I haven't found a way to do this on Travis; it's probably a bug in Travis.
Here's the same issue on Travis Community:
https://travis-ci.community/t/running-linux-containers-as-part-of-tests-on-a-windows-build/8956

Hopefully a fix or workaround will come and we can turn on integration tests for Windows.

Logs of last failed build:

Worker information
Secret environment variables are not obfuscated on Windows, please refer to our documentation: https://docs.travis-ci.com/user/best-practices-security
0.06s0.04s3.55sDisabling Windows Defender
$ powershell -Command Set-MpPreference -DisableArchiveScanning \$true
$ powershell -Command Set-MpPreference -DisableRealtimeMonitoring \$true
$ powershell -Command Set-MpPreference -DisableBehaviorMonitoring \$true
0.17s
git.checkout
1.37s$ git clone --depth=50 https://github.com/mindsdb/mindsdb.git mindsdb/mindsdb
0.09s
Setting environment variables from repository settings
$ export GITHUB_TOKEN=[secure]
$ export PYPI_SYSADMIN_PASSWORD=[secure]
$ export REGISTRY_PASS=[secure]
$ export REGISTRY_USER=[secure]
Setting environment variables from .travis.yml
$ export PATH=/c/Python37:/c/Python37/Scripts:$PATH
$ export BADGE=windows
$ bash -c 'echo $BASH_VERSION'
4.4.23(1)-release
before_install.1
35.94s$ choco install python --version=3.7.3
before_install.2
3.89s$ python -m pip install --upgrade pip
before_install.3
9.23s$ choco install docker-compose
before_install.4
144.12s$ choco install docker-desktop
install.1
0.03s$ if [ "$TRAVIS_OS_NAME" != "windows" ]; then travis_wait 15 pip3 install --upgrade pip; fi
install.2
93.75s$ if [ "$TRAVIS_OS_NAME" = "windows" ]; then travis_wait 15 pip install --no-cache-dir -e .; fi
install.3
3.79s$ if [ "$TRAVIS_OS_NAME" = "windows" ]; then travis_wait 15 pip install --no-cache-dir -r requirements_test.txt; fi
install.4
4.07s$ if [ "$TRAVIS_OS_NAME" = "windows" ]; then travis_wait 15 pip install --no-cache-dir -r optional_requirements_extra_data_sources.txt; fi
install.5
0.03s$ if [ "$TRAVIS_OS_NAME" != "windows" ]; then travis_wait 15 pip3 install --no-cache-dir -e .; fi
install.6
0.03s$ if [ "$TRAVIS_OS_NAME" != "windows" ]; then travis_wait 15 pip3 install --no-cache-dir -r requirements_test.txt; fi
install.7
NaNs$ if [ "$TRAVIS_OS_NAME" != "windows" ]; then travis_wait 15 pip3 install --no-cache-dir -r optional_requirements_extra_data_sources.txt; fi
1.51s$ if [ "$TRAVIS_OS_NAME" != "osx" ]; then  docker-compose up -d; fi
/c/Users/travis/.travis/functions: line 607:  1821 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"
Pulling mysql (mysql:8.0)...
8.0: Pulling from library/mysql
The command "if [ "$TRAVIS_OS_NAME" != "osx" ]; then  docker-compose up -d; fi" failed and exited with 1 during .
Your build has been stopped.
no matching manifest for windows/amd64 10.0.17763 in the manifest list entries

Test Clickhouse Datasource with MergeTree Engine

Add a unit test for the clickhouse datasource that creates a table using the MergeTree engine and selects from it.

All main table engines people use are just offshoots of MergeTree, so if that works, everything should.

In theory all tables should behave the same way for basic select queries, but better not take our chances with this one, because 99% of users will have their data in tables with MergeTree engines.
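
A rough sketch of such a test; ClickhouseDS, its constructor arguments and its df attribute are assumptions about the datasource API, and the DDL and client fixture are illustrative:

def test_merge_tree_datasource(clickhouse_client):
    # Setup: create and populate a MergeTree table via any ClickHouse client
    clickhouse_client.execute('''
        CREATE TABLE test.rentals (id UInt32, rental_price Float64)
        ENGINE = MergeTree() ORDER BY id
    ''')
    clickhouse_client.execute('INSERT INTO test.rentals VALUES (1, 1200.0), (2, 2500.0)')

    # Select through the datasource and check a dataframe comes back
    from mindsdb_native import ClickhouseDS  # name assumed
    ds = ClickhouseDS(query='SELECT * FROM test.rentals',
                      host='localhost', user='default', password='')
    assert len(ds.df) == 2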

Tracking library (prolonged) usage

When mindsdb_native is imported we do an HTTP request to see if any updates are available.

These requests could be used to anonymously measure how people are using the library. The most interesting thing is Day-X Retention (users in cohort using the library after X days / total users in cohort): measuring the number of people who continue to use the library after 1 day, 7 days, 30 days, etc. This is a good proxy measure for understanding how useful the library is: the higher the retention, the more people continue using the library.

It can be used to measure the usefulness of a release: if we notice a drop in user retention after some major change then there probably is a problem with the release.

To do this we need to:

  • Find out if we can identify prolonged usage from our HTTP logs at all
  • If we can, somehow aggregate the metrics and integrate them with MixPanel or something else to display the metric

New sampling interface

We should have a new sampling interface that's a dictionary (named dictionary?) that can be passed to .learn. It should look something like:

sample_settings = {
    'sample_for_analysis': None,
    'sample_for_training': None,
    'sample_margin_of_error': 0.01,
    'sample_percentage': None,
    'sample_function': None,
}

See #4 for more details on how these arguments will behave. The gist of it is:

None == default behavior for both analysis and training sampling as specified in #4

  • If sample_function is specified (it should be a function pointer), it will be used for sampling from the dataset for both the data analysis and the training sample (if they are enabled via their respective arguments, or if we automatically detect they should be enabled based on #4)

  • Else, if sample_percentage is specified, it will be used to randomly sample from the dataset for both the data analysis and the training sample (if they are enabled or if we automatically detect they should be enabled)

  • Else, if sample_margin_of_error is specified, it will be used for our internal sampling algorithm for both the data analysis and training sampling (if they are enabled or if we automatically detect they should be enabled)

  • Else, use the default sample_margin_of_error of 0.01 + the behavior specified above

Report train, validate and test accuracy

Mindsdb should report 3 accuracies (all determined during the model analysis phase):

  • The accuracy on the training dataset
  • The accuracy on the validation dataset
  • The accuracy on the test dataset

They should be under model_analysis -> column_name -> accuracy -> {train/test/validation}

Use whatever accuracy metric is being used right now in the ModelAnalysis phase to compute the accuracy on the validation dataset.
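
The resulting structure might then look like this (a sketch; the values are placeholders):

model_analysis = {
    'rental_price': {          # column_name
        'accuracy': {
            'train': 0.97,     # placeholder values
            'validation': 0.93,
            'test': 0.92,
        }
    }
}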

Find redundant arguments

Some arguments may be redundant, especially in the learn and/or predict methods: maybe they are never used, maybe they are confusing and should be renamed, or maybe they are currently in the unstable_parameters_dict but are actually very important and should be exposed to the user directly.

Let's discuss what they might be in order to clear up the interface before the 2.0.0 release.

Wrong column type detection

Same dataset as in issue #111
test.zip
The analysis defines the type of column 'arrstr' as 'Categorical/Category'. But each value in this column is unique, so 'Text' would be a better type until we support Text/Array.

Split Transaction into subclasses

Currently, whenever Predictor.learn, .predict, .test or analyse_dataset is called, a Transaction object is created. However, it behaves differently for each of these operations. This is implemented using methods like Transaction._execute_learn, Transaction._execute_analyze and so on.

It would be cleaner to make subclasses of Transaction for each operation.
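
A minimal sketch of the proposed refactor; the subclass names and phase contents are assumptions:

class Transaction:
    def __init__(self, session, transaction_metadata):
        self.session = session
        self.metadata = transaction_metadata

    def run(self):
        raise NotImplementedError

class LearnTransaction(Transaction):
    def run(self):
        # the phases currently inside Transaction._execute_learn go here
        ...

class PredictTransaction(Transaction):
    def run(self):
        # the phases currently inside Transaction._execute_predict go here
        ...

class AnalyseTransaction(Transaction):
    def run(self):
        # the phases currently inside Transaction._execute_analyze go here
        ...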
