hazyresearch / meerkat

Create interactive views of any dataset.

License: Apache License 2.0

Languages: Python 69.82%, Svelte 27.35%, JavaScript 1.60%, TypeScript 0.93%, Jinja 0.13%, Makefile 0.07%, HTML 0.05%, Batchfile 0.04%, CSS 0.02%
Topics: ml, data-science, foundation-models, machine-learning, pandas

meerkat's Introduction


Create interactive views of any dataset.

Website | Quickstart | Docs | Contributing | Discord | Blogpost

⚡️ Quickstart

pip install meerkat-ml

Next Steps. Check out our Getting Started page and our documentation to start building with Meerkat.

Why Meerkat?

Meerkat is an open-source Python library that helps users visualize, explore, and annotate any dataset. It is especially useful when processing unstructured data types (e.g. free text, PDFs, images, video) with machine learning models.

✏️ Features and Design Principles

Here are four principles that inform Meerkat's design.

(1) Low overhead. With four lines of Python, start interacting with any dataset.

  • Zero-copy integrations with your preferred data abstractions: Pandas, Arrow, HF Datasets, Ibis, SQL.
  • Limited data movement. With Meerkat, you interact with your data where it already lives: no uploads to external databases and no reformatting.
import meerkat as mk
df = mk.from_csv("paintings.csv")
df["image"] = mk.files("image_url")
df
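
For instance, the zero-copy path from pandas might look like the sketch below (mk.from_pandas is assumed here as the pandas entry point; the underlying column data is shared rather than copied):

import meerkat as mk
import pandas as pd

# An ordinary pandas DataFrame...
pandas_df = pd.DataFrame({"image_url": ["fox.png", "dog.png"], "label": [0, 1]})

# ...wrapped as a Meerkat DataFrame without copying the column data
# (assumes the top-level `mk.from_pandas` constructor).
df = mk.from_pandas(pandas_df)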

(2) Diverse data types. Visualize and annotate almost any data type in Meerkat interfaces: text, images, audio, video, MRI scans, PDFs, HTML, JSON.

(3) "Intelligent" user interfaces. Meerkat makes it easy to embed machine learning models (e.g. LLMs) within user interfaces to enable intelligent functionality such as searching, grouping and autocomplete.

df["embedding"] = mk.embed(df["img"], engine="clip")
match = mk.gui.Match(df,
	against="embedding",
	engine="clip"
)
sorted_df = mk.sort(df,
	by=match.criterion.name,
	ascending=False
)
gallery = mk.gui.Gallery(sorted_df)
mk.gui.html.div([match, gallery])

(4) Declarative (think: Seaborn), but also infinitely customizable and composable. Meerkat visualization components can be composed and customized to create new interfaces.

plot = mk.gui.plotly.Scatter(df=plot_df, x="umap_1", y="umap_2")

@mk.gui.reactive
def filter(selected: list, df: mk.DataFrame):
    return df[df.primary_key.isin(selected)]

filtered_df = filter(plot.selected, plot_df)
table = mk.gui.Table(filtered_df, classes="h-full")

mk.gui.html.flex([plot, table], classes="h-[600px]") 

✨ Use cases where Meerkat shines

  • Exploratory analysis over unstructured data types. Demo
  • Spot-checking the behavior of large language models (e.g. GPT-3). Demo
  • Identifying systematic errors made by machine learning models. Demo
  • Rapid labeling of validation data.

🤔 Use cases where Meerkat may not be the right fit

  • Are you only working with structured data (e.g. numerical and categorical variables)? Popular data visualization libraries (e.g. Seaborn, Matplotlib) are often sufficient. If you're looking for interactivity, Plotly and Streamlit work well with structured data. Meerkat is differentiated in how it visualizes unstructured data types: long-form text, PDFs, HTML, images, video, audio...
  • Are you trying to make a straightforward demo of a machine learning model (single input/output, chatbot) and share it with the world? Gradio is likely a better fit! Though if your demo involves visualizing lots of data, you may find Meerkat useful.
  • Are you trying to manually label tens of thousands of data points? If you are looking for a data labeling tool to use with a labeling team, there are great open-source labeling solutions designed for this (e.g. LabelStudio). In contrast, Meerkat is a great fit for teams/individuals without access to a large labeling workforce who are using pretrained models (e.g. GPT-3) and need to label validation data or in-context examples.

✉️ About

Meerkat is being built by Machine Learning PhD students in the Hazy Research lab at Stanford. We're excited to build for a future where models make it easier for teams to sift and reason through large volumes of unstructured data.

Please reach out to kgoel [at] cs [dot] stanford [dot] edu, eyuboglu [at] stanford [dot] edu, or arjundd [at] stanford [dot] edu if you would like to use Meerkat for a project or at your company, or if you have any questions.

meerkat's People

Contributors

ad12, albertfgu, anarayan, dakuo, dastratakos, dilan-js, hannahkim24, jaraujo98, khaledsaab, krandiash, lorr1, priya2698, sam-randall, sandkoan, seyuboglu, shaozhang0115, tchang1997


meerkat's Issues

[FEATURE] Do away with `ListColumn` index in favor of something faster

Meerkat imposes a ListColumn index on every DataPanel. In many cases, this is the slowest column in the dp, and it bottlenecks performance since all the other columns are based on numpy, pandas, or torch.
We should do away with the index column in favor of something else...

It's also worth thinking about what purpose the "index" column serves in the DataPanel. It's not clear to me that it's providing anything important (though I may be missing something). Its main purpose seems to be providing a unique "id" for every row in the DataPanel, but I don't think this is something we should impose on the DataPanel. That said, I see the appeal of being able to designate "id" columns with special properties (e.g. always carried from one dp to the next when indexing).

I propose doing away with this single "index" column design in favor of a new design inspired by the idea of indexes in database management systems:

  • We can mark any column in a DataPanel as an index. For example:
mimic_dp = ...
mimic_dp.set_index(column="dicom_id")
  • Rows can then be looked up by that index in a shorthand form:
row = mimic_dp.idx["dicom_id", "id_e324198"]
sub_dp = mimic_dp.idx["dicom_id", ["id_e324198", "id_e1236493"]]
  • Under the hood, columns can specify fast implementations of indexes. By default, columns implement a naive O(n) scan:
def idx(self, index_name, index):
    return self[self[index_name] == index]

but columns can override this with faster implementations, e.g. based on a pandas.Index object, which is backed by a Cython dict and so provides O(1) lookups (https://pandas.pydata.org/pandas-docs/stable/development/internals.html).
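
A rough sketch of what the fast path could look like (the attribute and method names here are hypothetical; a real implementation would live on the column classes):

import pandas as pd

class DataPanel:
    def set_index(self, column):
        # Build the hash-backed index once up front.
        self._indexes = getattr(self, "_indexes", {})
        self._indexes[column] = pd.Index(self[column].data)

    def idx(self, index_name, index):
        keys = index if isinstance(index, list) else [index]
        # pandas.Index.get_indexer maps ids to positional offsets via a
        # hash table; then we reuse ordinary positional row selection.
        positions = self._indexes[index_name].get_indexer(keys)
        return self[positions if isinstance(index, list) else positions[0]]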

Make column testbeds more modular

The BlockManager (see #104) introduces a need for more robust DataPanel testing that covers DataPanels with a diverse set of columns. As we add more columns, we don't want to update the DataPanel tests for each new column. Instead, we should specify a TestBed for each column that plugs into the DataPanel tests.

Started this for NumpyArrayColumn with #108

Base writers off of concat and update concat to preserve subclass type

Right now, map relies on columns specifying writer classes, like the TorchWriter below for TensorColumn:

class TorchWriter(AbstractWriter):
    def __init__(
        self,
        *args,
        **kwargs,
    ):
        super(TorchWriter, self).__init__(*args, **kwargs)

    def open(self) -> None:
        self.outputs = []

    def write(self, data, **kwargs) -> None:
        self.outputs.extend(data)

    def flush(self, *args, **kwargs):
        return torch.stack(self.outputs)

    def close(self, *args, **kwargs):
        pass

    def finalize(self, *args, **kwargs) -> None:
        pass

Except in the memmap case, these writers are basically just doing a concat, so they can be consolidated into a single writer class built on concat.

Additionally, we need concat to preserve subclass type, which we can accomplish by converting the static method into an instance method.
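
A sketch of the consolidated writer (this assumes concat has already become an instance method on columns that preserves the subclass type):

class ConcatWriter(AbstractWriter):
    def open(self) -> None:
        self.outputs = []

    def write(self, data, **kwargs) -> None:
        self.outputs.append(data)

    def flush(self, *args, **kwargs):
        # Delegating to the column's own concat keeps the result a
        # TensorColumn, NumpyArrayColumn, etc.
        return self.outputs[0].concat(self.outputs)

    def close(self, *args, **kwargs):
        pass

    def finalize(self, *args, **kwargs) -> None:
        pass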

[BUG] DataPanel outputs in notebook not truncated

When I create a DataPanel and run dp.head() in a Jupyter notebook, the values of the entries are displayed in full instead of truncated.

On the Domino repo, gdro branch, run the following in a Jupyter notebook (on the most recent dev branch of Meerkat):

df = build_cxr_df.out(load=True)
dp = get_dp(df)
dp.head()

The output looks like the attached screenshot.

[BUG] ConstructorError module not imported

Using version 0.2.0.

import meerkat as mk
import meerkat.nn

dp['emb_col'] = mk.nn.EmbeddingColumn(embs)  # embs: some 2D numpy array

If I save the DataPanel and then load it back, I get this error. This was working previously, but I updated the meerkat package this week:

ConstructorError: while constructing a Python object
module 'meerkat.nn.embedding_column' is not imported

[BUG] `map` adds a new `index` column to `DataPanel`

When dp.map is passed a function that returns a dict, the returned DataPanel should use the same index (i.e. example ids) as the original DataPanel. Currently, it attaches a fresh index column that may not match the index column from dp.

[BUG] `add_column` method produces duplicate columns on overwriting

The add_column method, when used with overwrite=True, produces a duplicate column.

Code:

dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})

dp.add_column("label", [1,2,3], overwrite=True)

dp.columns gives ['text', 'label', 'index', 'label']. On printing this DataPanel, both of these label columns show the new data.

[BUG] Incorrect number of classes when predictions passed to ClassificationOutputColumn

## Data from test_prediction_column.py
logits = torch.as_tensor(
    [
        [-100, -2, -50, 0, 1],
        [0, 3, -1, 5, 4],
        [100, 0, 0, -1, 5],
        [-100, -2, -50, 0, 1],
    ]
).type(torch.float32)
expected_preds = torch.as_tensor([4, 3, 0, 4])

logit_col = ClassificationOutputColumn(logits=logits)
probs_col = ClassificationOutputColumn(probs=logit_col.probabilities().data)
preds_col = ClassificationOutputColumn(preds=logit_col.predictions().data)
print(logit_col.num_classes, probs_col.num_classes, preds_col.num_classes)

The output is 5 5 4, even though all three columns should report 5 classes.

[BUG] `ImageColumn` in README doesn't work

Describe the bug
To construct an ImageColumn from paths, ImageColumn.from_filepaths(...) has to be used; simply calling ImageColumn(...) does not work.

To Reproduce
The following code from the README gives an error; ImageColumn needs to be replaced with ImageColumn.from_filepaths to get the correct functionality.

from mosaic import DataPanel, ImageColumn

# Images are NOT read from disk at DataPanel creation...
dp = DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': ImageColumn(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
}) 

# ...only at this point is "fox.png" read from disk
dp["image"][0]

[FEATURE] Add caching functionality to LambdaColumn

I'm envisioning something in between a map and a LambdaColumn, where the computation happens lazily but is cached once it's computed. Right now, either you do it all up front or you don't get caching.

This idea was raised by @ANarayan, who pointed out that it would be helpful for caching feature preprocessing in NLP pipelines.
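
One possible shape for this, sketched with hypothetical names rather than the actual LambdaColumn API — lazy like a LambdaColumn, but each value is computed at most once:

class CachedLambdaColumn:
    def __init__(self, column, fn):
        self._column = column
        self._fn = fn
        self._cache = {}  # row position -> computed value

    def __getitem__(self, idx):
        # Compute lazily on first access, then serve from the cache.
        if idx not in self._cache:
            self._cache[idx] = self._fn(self._column[idx])
        return self._cache[idx]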

[FEATURE] Add `from_npy` class method to `NumpyArrayColumn`

It could have a signature matching np.load and look something like this:

@classmethod
def from_npy(cls, path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding="ASCII"):
    data = np.load(
        path,
        mmap_mode=mmap_mode,
        allow_pickle=allow_pickle,
        fix_imports=fix_imports,
        encoding=encoding,
    )
    return cls(data)

@Priya2698

[BUG] "'numpy.ndarray' object is not callable" when creating `ImageColumn`

Describe the bug
Error: TypeError: 'numpy.ndarray' object is not callable

To Reproduce

import os

import meerkat as mk
import numpy as np
import pandas as pd

from meerkat.contrib.imagenette import download_imagenette

print(pd.__version__)

BASE_DIR = "./datasets"
dataset_dir = download_imagenette(BASE_DIR)

dp = mk.DataPanel.from_csv(os.path.join(dataset_dir, "imagenette.csv"))
dp["img"] = mk.ImageColumn.from_filepaths(filepaths=dp["img_path"])
dp.head()

Running this code raises: TypeError: 'numpy.ndarray' object is not callable

Expected behavior
I expected a new ImageColumn called "img" to be created.

System Information
MacOS, Linux
pandas version 1.2.4


Add support for `list_datasets()` in registry

Add a list_datasets() function to the registry that returns a list of dataset names, so that people using the registry from an outside library can see what's available programmatically.
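
A minimal sketch, assuming the registry keeps a dict of builders keyed by dataset name (_registry is hypothetical):

def list_datasets() -> list:
    """Return the names of all datasets available in the registry."""
    return sorted(_registry.keys())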

[FEATURE] Visualize `DataPanel` without converting to pandas

Visualizing a DataPanel in a Jupyter notebook can be frustratingly slow because we are currently: (1) converting the DataPanel to a special "visual" Pandas DataFrame (via _repr_pandas_) and then (2) visualizing using Pandas visualization out of the box. Step 1 can be very slow for large dps.

We should make our own HTML visualization module that circumvents the conversion to pandas, borrowing heavily from/plugging into https://github.com/pandas-dev/pandas/blob/master/pandas/io/html.py

[FEATURE] Add better support for backwards compatibility with previously saved DataPanels

Since meerkat is changing quite quickly, and folks are often working off the dev branch, it's hard to ensure that every DataPanel that gets saved is backwards compatible with future versions of Meerkat. Eventually, once Meerkat is stable and folks are working off of major and minor versions, we should support backwards compatibility. But for the time being, when everyone's working off of dev, how should we support them? I see two options:

(1) Save the meerkat commit with every DataPanel and column, and create a conversion script that allows converting saved DataPanels between commits.

(2) Try to support backwards compatibility in the read and write code directly.

I think (2) will be quite challenging because we rely on Pickle, which runs into issues when classes and names change. One approach is to offer an "approximate" read that skips all data in pickles and raises appropriate warnings. This seems not ideal though and adds a lot of mess to the code.

I think my preference is for (1), but curious to hear other thoughts.
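
For option (1), the write path could stamp each save with the current commit. A sketch (the file name and fields are assumptions, and resolving the commit via git only works in a dev checkout):

import subprocess
import yaml

def write_meta(path: str) -> None:
    # Record the meerkat commit alongside the saved DataPanel so a
    # conversion script can later migrate the on-disk layout.
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    with open(path, "w") as f:
        yaml.safe_dump({"meerkat_commit": commit}, f)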

[BUG] ValueError: array is not C-contiguous with `EmbeddingColumn`

I'm getting the following ValueError when using EmbeddingColumn:

    faiss_index.add(embs)
  File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/__init__.py", line 104, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/swigfaiss.py", line 6016, in swig_ptr
    return _swigfaiss.swig_ptr(a)
ValueError: array is not C-contiguous

Steps and code snippet that reproduce the behavior:

# entity_dp is datapanel with "emb" as EmbeddingColumn
embs = entity_dp["emb"].numpy()
# THIS WAS MY FIX! - it broke without it
embs = np.ascontiguousarray(embs)
index = faiss.IndexFlatL2
faiss_index = index(embs.shape[1])
faiss_index.add(embs)

thanks @lorr1

[FEATURE] Avoid recomputing values when chaining `LambdaColumn` in one DataPanel

If a single DataPanel contains a chain of LambdaColumns, like so

dp["a_b"] = LambdaColumn(dp["a"], fn)
dp["a_b_c"] = LambdaColumn(dp["a_b"], fn_2)

then indexing the DataPanel with dp[0] will perform the materialization of dp["a_b"] twice.

Ideally, the DataPanel should be aware of these dependencies and only materialize things once.
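
One possible shape (a sketch: the fn and input_name attributes on LambdaColumn are assumed here) — memoize materialized values per column while resolving a row:

def materialize_row(dp, position):
    cache = {}  # column name -> materialized value for this row

    def resolve(name):
        if name not in cache:
            col = dp[name]
            if isinstance(col, LambdaColumn):
                # Materialize the input through the cache, so a shared
                # ancestor like "a_b" is computed only once.
                cache[name] = col.fn(resolve(col.input_name))
            else:
                cache[name] = col[position]
        return cache[name]

    return {name: resolve(name) for name in dp.columns}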

[FEATURE] Add support for rename in `DataPanel`

Currently, renaming the columns in a DataPanel is cumbersome. Say, for example, we want to rename a column "ppl" to "people"; this might look like:

dp["people"] = dp["ppl"]
dp.remove("ppl")

Further, if we wanted to do this out of place, we'd have to call an additional dp.view() at the top.

In Pandas, this can be done with a single line of code:
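
df = df.rename(columns={"ppl": "people"})

A Meerkat analogue could mirror this, e.g. dp = dp.rename({"ppl": "people"}) (a hypothetical API, sketched only for comparison).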

[BUG] Importing meerkat requires an updated version of pandas

Describe the bug

Importing meerkat requires an updated version of pandas

To Reproduce
Steps and code snippet that reproduce the behavior:

  1. Code: import meerkat as mk
  2. Instructions: Run
  3. Error: ModuleNotFoundError: No module named 'pandas.core.strings.accessor'; 'pandas.core.strings' is not a package

Expected behavior

I expected the import to function as normal.

System Information

  • MacOS, Linux

Context:
My current pandas version is 1.1.5. Updating pandas to version 1.2.4 resolves this issue.

[FEATURE] Sort DataPanel by a column

Add a sort function that can be used to sort the DataPanel by values in a column.

dp = mk.DataPanel({'a': [1, 3, 2], 'b': ['a', 'c', 'b']})
dp.sort('a') # sorted view into the dp
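
A minimal sketch of the implementation (assuming the column exposes a numpy-compatible .data array):

import numpy as np

def sort(dp, by, ascending=True):
    # Compute a sorted permutation of the row positions...
    order = np.argsort(dp[by].data)
    if not ascending:
        order = order[::-1]
    # ...and index the DataPanel with it to return a sorted view.
    return dp[order]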

Make `DataPanel(dp)` return a shallow-copied version of the original `dp`.

Issue

It is very natural for users (and developers) to construct new DataPanel objects from existing ones via DataPanel(dp).

Important Aside

An unexpected consequence of this issue is the need to decide which attributes should be recomputed and which should simply be shallow copied over.

As an example, two attributes that every DataPanel has are _data and _identifier. _data is typically large and heavyweight, so we will almost always want to shallow copy it. _identifier is quite lightweight and may be unique to each DataPanel, so maybe it is a property we recompute each time in __init__. Note this is just an example; we may want the identifier to persist.

This is especially relevant for subclassing DataPanel. As of PR #57, self.from_batch() is used to construct new DataPanel containers from existing ones with shared underlying data. However, as the PR mentions, self.from_batch() is called by many other ops (_get, merge, concat, etc.), and none of these methods have a seamless way of passing arguments other than data to __init__.

An example of this is EntityDataPanel, where the index_column should be passed from the current instance to the newly constructed instance. Because there is no way to plumb that information through different calls, the initializer of EntityDataPanel gets called with EntityDataPanel(index_column=None) even if the current instance has an index column. This results in a new column "_ent_index" being added to the new EntityDataPanel.

Proposed Solution 1

Implement a private instance method _clone(data=None, visible_columns=None, ...) -> DataPanel/subclass that implements the default functionality for constructing a new DataPanel, plumbing the relevant arguments from the current instance to the new one. We can then call self._clone(data=data) (optionally passing visible_columns) instead of self.from_batch() in ops like _get, merge, concat, etc.

Let's consider the EntityDataPanel case. We want to plumb self.index_column from a current EntityDataPanel to all EntityDataPanels constructed in its image. ._clone will look something like

class EntityDataPanel:
    def _clone(self, data=None) -> EntityDataPanel:
        if data is None:
            data = self.data
        return EntityDataPanel(
            data, identifier=self.identifier, index_column=self.index_column
        )

We can then have ops like DataPanel._get() use self._clone() instead of self.from_batch(). For example:

class DataPanel:
    def _get(self, index, materialize=False):
        ...
        # example case where indexing returns a DataPanel
        elif isinstance(index, slice):
            # slice index => multiple row selection (DataPanel)
            # return self.from_batch(
            #    {
            #        k: self._data[k]._get(index, materialize=materialize)
            #        for k in self.visible_columns
            #    })
            return self._clone({
                k: self._data[k]._get(index, materialize=materialize)
                for k in self.visible_columns
            })
        ...

Proposed Solution 2

Instead of having developers reimplement ._clone(), we can have them implement something like _state_keys() but for init args. Something like ._clone_kwargs():

class EntityDataPanel:
    def _clone_kwargs(self) -> dict:
        default_kwargs = super()._clone_kwargs()
        default_kwargs.update({"index_column": self.index_column})
        return default_kwargs

class DataPanel:
    def _clone_kwargs(self) -> dict:
        return {"data": self.data, "identifier": self.identifier}

    def _clone(self, **kwargs):
        default_kwargs = self._clone_kwargs()
        if kwargs:
            default_kwargs.update(kwargs)
        return self.__class__(**default_kwargs)

[FEATURE] Use of visible_columns should be limited to when materialize=False

Is your feature request related to a problem? Please describe.
When indexing a subset of columns in a DataPanel, we are currently always returning a view of the DataPanel with visible_columns set. We should only be using visible columns when materialize is False; otherwise, we should clone a new DataPanel that contains only the subset of columns. See below.

Describe the solution you'd like

if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")
    dp = self.view()
    dp.visible_columns = index
    return dp

Should become

if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")

    if materialize:
        dp = self._clone(data={k: self._data[k] for k in index})
    else:
        dp = self.view()
        dp.visible_columns = index
    return dp

[BUG] FileExistsError on `map` with multiple workers

When running a map over a DataPanel with multiple workers, I get the error below. This may be because we're creating a log directory for every batch. Consider changing this.

Traceback (most recent call last):
  File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
    self._accessor.mkdir(self, mode)
FileExistsError: [Errno 17] File exists: '/root/mosaic/RGDataset'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sabri/code/terra/terra/__init__.py", line 230, in _run
    out = self.fn(**args_dict)
  File "/home/sabri/code/spr-21/notebooks/05-23_cxr_forager.py", line 23, in convert_cxr_to_png
    paths = dp.map(
  File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 863, in map
    return super().map(
  File "/home/sabri/code/mosaic/mosaic/mixins/mapping.py", line 58, in map
    for i, batch in tqdm(
  File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 702, in batch
    yield DataPanel.from_batch({**cell_batch._data, **batch_batch._data})
  File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 550, in from_batch
    return cls(batch, identifier=identifier)
  File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 123, in __init__
    self._create_logdir()
  File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 384, in _create_logdir
    self.logdir.mkdir(parents=True, exist_ok=True)
  File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
    self._accessor.mkdir(self, mode)
  File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7921) is killed by signal: Killed.

[BUG] Can't overwrite save with block/manager.py

In block/manager.py line 188, there's a call to os.makedirs(block_dirs). If the folder already exists, this throws an error. I wasn't sure how you folks wanted to handle this. I think people will commonly save over existing folders, so maybe turn on the exist_ok flag?

Add support for prediction columns in mosaic

Include the ability to add prediction columns directly to testbenches for evaluating models.

Potential design:
prediction mixin --> ClassifierMixin [logits, probs, argmax -- moving between things]

Need to support more than classification (e.g. segmentation, text generation).

Linking the prediction to the task.

One caveat for ClassifierMixin is supporting multi-label problems (i.e. an image can belong to both class1 and class2).
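
For the single-label case, the conversions between forms are straightforward. A sketch in torch (self.logits is an assumed attribute; multi-label would need sigmoid-plus-threshold instead of argmax):

import torch

class ClassifierMixin:
    def probabilities(self) -> torch.Tensor:
        # Softmax over the class dimension turns logits into probabilities.
        return torch.softmax(self.logits, dim=-1)

    def predictions(self) -> torch.Tensor:
        # Single-label: take the highest-scoring class per row.
        return self.logits.argmax(dim=-1)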

[BUG] Error importing meerkat without write access to /tmp

I installed the latest version of meerkat v0.2.1 using pip install meerkat-ml.

I got the following permission-denied error when importing meerkat in Python, because by default it creates a directory under /tmp and I have no write access to /tmp.

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/__init__.py", line 10, in <module>
  initialize_logging()
 File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/logging/utils.py", line 26, in initialize_logging
  os.makedirs(log_path, exist_ok=True)
 File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 215, in makedirs
  makedirs(head, exist_ok=exist_ok)
 File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 225, in makedirs
  mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/2021_10_29/17_06_57'

A temporary workaround is to set export TMPDIR=<some-dir>.

[BUG] Appending along columns not working without `suffixes` argument

Appending to a DataPanel along columns does not work without the suffixes argument, even when the column names do not overlap.

import mosaic as ms

dp = ms.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})
dp2 = ms.DataPanel({
    'string': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'target': [0, 1, 0]
})
dp.append(dp2, axis=1)

This code throws a ValueError. It works when I provide suffixes, although they are not used.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-5f32282aa054> in <module>()
----> 1 dp.append(dp2, axis=1)

1 frames
/usr/local/lib/python3.7/dist-packages/mosaic/datapanel.py in append(self, dp, axis, suffixes, overwrite)
    422             if not overwrite and shared:
    423                 if suffixes is None:
--> 424                     raise ValueError()
    425                 left_suf, right_suf = suffixes
    426                 data = {

ValueError:

[FEATURE] Better support for missing values

Currently, missing values are supported only in the context of merge. There the solution is rather slipshod: we convert the column to a ListColumn that can store a mix of None and other types.

Replace this with something faster and more robust.
Consider a wrapper column around a smaller column holding only the non-missing values.
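
The wrapper idea could look roughly like this (a sketch: an Arrow-style validity mask over a dense column holding only the non-missing values):

import numpy as np

class MaskedColumn:
    def __init__(self, values, mask):
        # `values` holds only the non-missing entries; `mask[i]` is True
        # where row i is present. `_positions[i]` maps each present row
        # to its slot in `values`.
        self._values = values
        self._mask = np.asarray(mask)
        self._positions = np.cumsum(self._mask) - 1

    def __getitem__(self, i):
        if not self._mask[i]:
            return None
        return self._values[self._positions[i]]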

[BUG] `TensorColumn.view(...)` is overloaded

Describe the bug
TensorColumn is meant to operate like a torch.Tensor, but certain method names may conflict with torch.Tensor's.

For example, we often want to reshape a tensor without copying the underlying data (e.g. tensor.view). Should TensorColumn.view() call AbstractColumn.view() (as it does currently) or torch.Tensor.view()? The two take different args. If the former, we should be explicit that tensor-style view is not supported.
