Comments (36)

quinnj commented on May 11, 2024

Do we really want arbitrary metadata in the type? Seems like there are generally other ways to include metadata about your DataFrame w/o stuffing it into the type itself.

nalimilan commented on May 11, 2024

I think something like that would be useful, even if it's not the highest priority. How could you store metadata about columns without support in DataFrame itself?

HarlanH commented on May 11, 2024

Yes, I think this could be useful. At the DataVec level, we will need meta-data for factor-like behavior (#6). And some of the other things you suggest make sense too. On the other hand, we probably want to rely less on arbitrary attributes, as R does, and more on types, where possible.

houshuang commented on May 11, 2024

I would love to see this. I wrote about my wish for better support for things like questionnaires, with the code book integrated with the code, here (towards the bottom): http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/...

Of course, this also raises the issue of serialization.

johnmyleswhite commented on May 11, 2024

If we can make this work without performance degradation, I'm in.

nalimilan commented on May 11, 2024

Standardizing on a few meta-data attributes like variable label and unit would be wonderful. In R, Harrell's Hmisc offers this feature, but unfortunately very few packages use it since it's not standard at all. On the other hand, SAS has built-in support for variable labels, which are used e.g. to label tables and plot axes automatically. Stata also has this concept, and even allows associating longer "notes" with variables to make their meaning explicit.

More specialized attributes like question names would be useful, if there was an easy way for a separate package to create and use them.

johnmyleswhite commented on May 11, 2024

Adding units should be trivial, especially if Julia settles on a standard unit package soon. What are the variable labels for: descriptions of the columns to supplement the brief names?

nalimilan commented on May 11, 2024

Yeah, variable labels are just the readable, complete name of the variable, as opposed to the abbreviated form used for variable names, which is practical to type (no spaces, no special characters...) but often cryptic and ugly. The most typical use of variable labels is to provide a good default for axis labels, like "Annual GDP growth" rather than "GDPG". They could also be useful for describing the contents of a database, with a function like Hmisc's describe() [1] or SAS's proc contents.

1: http://www.inside-r.org/packages/cran/Hmisc/docs/describe

HarlanH commented on May 11, 2024

Yes, I agree with all of this. A long, human-readable Name (which could be leveraged for axis labels by plotting routines), Units, and Level-of-measurement would be very helpful. Possibly also Domain.

LoM could be very handy for statistical modeling routines and the creation of appropriate model matrices (or the throwing of warnings). I never intended PooledDataVector to be equivalent to Factor -- it's a representational optimization. It would be much better for statistical routines to look for Nominal or Ordinal types and act appropriately, even if the underlying type is a non-pooled integer or string.

houshuang commented on May 11, 2024

All great ideas: long name, LoM (for example, I'd love to indicate that something is a Likert item, which is more specific than just categorical), etc. Not sure what is meant by domain? Units are of course useful for measurements. An open-ended comment field would be great for code book stuff (how the data was collected, coded, etc.) - I could see some great ways of showing this, especially in the web view.

Not sure how this would fit in, but in my R code, I also have the concept of grouping columns - for example having five groups of questions.

Also curious about how we serialize this - we can't just spit this out into CSV again. What's DataFrame's "native" format for storing all this metadata? Ideally it would be something that was compatible with other tools as well. HDF5?

johnmyleswhite commented on May 11, 2024

I think separating Likert scales from other categorical variables might be too specific for something as generic as DataFrames: what functions would apply to them that don't apply to other categorical variables?

I believe domain is meant in the math sense of "allowable, but not necessarily present, values for entries in this column".

We once had grouped columns, but they were dropped because they proved difficult to maintain. They need to be added back in, but it will take a good chunk of work.

Serialization is kind of a nightmare. I think HDF5 may work, but that's a question for people with more expertise than I have in our current serialization infrastructure.

houshuang commented on May 11, 2024

I was thinking of LoM as user-defined; for example, I might want to graph Likert scales differently from a demographic categorical variable... But this isn't super important.

johnmyleswhite commented on May 11, 2024

You could actually do that already: you'd just make a DataArray{LikertResponse}, where LikertResponse is a custom type. This is one of the virtues of our approach to NA: you can create a DataArray for any type in Julia, not just those we've built into the system.
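
For illustration, a minimal sketch of that idea against the (now historical) DataArrays.jl API; LikertResponse and its field are hypothetical names, not part of any package:

```julia
using DataArrays  # historical package that provided DataArray and NA

# Hypothetical scalar type for a single Likert answer
struct LikertResponse
    value::Int  # e.g. 1 = "strongly disagree" ... 5 = "strongly agree"
end

# A column of Likert responses; missingness is tracked by the DataArray itself
responses = DataArray([LikertResponse(4), LikertResponse(2), LikertResponse(5)])
responses[2] = NA  # mark the second answer as missing without changing the element type
```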

houshuang commented on May 11, 2024

I think serialization becomes an important issue -- many of these things can probably be done already by subclassing DataFrame etc. (and whether it's better to extend DataFrame or subclass it becomes a design question). However, the key question is how I can set up my data the way I want it (with full names, groups, etc.) and then store it for future analysis by other scripts...

nalimilan commented on May 11, 2024

If you add support for arbitrary meta-data attributes to DataFrames, it will be easy for separate packages to mark some columns as grouped using a group index. No need to hardcode support for every specific feature - just make it easy to extend.

johnmyleswhite commented on May 11, 2024

What would arbitrary metadata consist of? A Dict called metadata that people can do anything with?

nalimilan commented on May 11, 2024

Sure, a Dict containing vectors with one value per column, or even just a DataFrame, since attributes would all have the same length. Only standard attributes would have a pre-specified type, others would be free. Of course setters and getters would make the whole process transparent.
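
As a rough sketch of what such a Dict-plus-accessors design could look like (all names here -- MetaDataFrame, setlabel!, label -- are hypothetical, and a nested Dict is just one possible layout):

```julia
using DataFrames

# Hypothetical wrapper: per-column metadata kept alongside the DataFrame,
# keyed by attribute name and then by column name.
struct MetaDataFrame
    data::DataFrame
    meta::Dict{Symbol,Dict{Symbol,Any}}  # attribute => (column => value)
end

MetaDataFrame(df::DataFrame) = MetaDataFrame(df, Dict{Symbol,Dict{Symbol,Any}}())

# Setters and getters make the Dict transparent to users
function setlabel!(mdf::MetaDataFrame, col::Symbol, lbl::AbstractString)
    get!(mdf.meta, :label, Dict{Symbol,Any}())[col] = lbl
end

label(mdf::MetaDataFrame, col::Symbol) =
    get(get(mdf.meta, :label, Dict{Symbol,Any}()), col, nothing)

mdf = MetaDataFrame(DataFrame(gdpg = [1.2, 0.8, 2.1]))
setlabel!(mdf, :gdpg, "Annual GDP growth")
label(mdf, :gdpg)  # => "Annual GDP growth"
```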

johnmyleswhite commented on May 11, 2024

If you're up for making a demo with that approach, it'd be nice to see. My instinct is that trying to avoid pre-specified types is going to make things slow, but I could be wrong.

tshort commented on May 11, 2024

I like the idea of metadata, but I'm worried that it complicates things, especially if applied to a DataFrame. As John said, a demo would be a great way to work things out. We once had a concept of column groupings that we eventually pulled out because it tended to complicate things. Trying out an implementation is the best way to judge the balance of additional complexity relative to its benefit.

Applying metadata to columns but embedding that data into the DataFrame structure has issues. For example, I may create a DataFrame column that points to a DataArray originally in a different DataFrame like: df1["colX"] = df2["colY"]. If df2 had column labels or other metadata, it would be lost because the DataArray df2["colY"] doesn't know about that. This type of column reuse is common in DataFrames.

It's easier to attach metadata to DataArrays or other column data. Then, the metadata goes with columns. Nothing really needs to change in the DataFrame structure.
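
To illustrate the column-attached alternative Tom describes, here is a hypothetical wrapper (not an actual DataFrames or DataArrays type) whose metadata travels with the column wherever it is reused:

```julia
# Hypothetical: a vector that carries its own metadata, so an assignment like
# df1[:colX] = df2[:colY] would preserve the label automatically.
struct LabeledVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

Base.size(v::LabeledVector) = size(v.data)
Base.getindex(v::LabeledVector, i::Int) = v.data[i]
Base.setindex!(v::LabeledVector, x, i::Int) = (v.data[i] = x)

height = LabeledVector([170.2, 165.5, 180.1],
                       Dict{Symbol,Any}(:label => "Height", :unit => "cm"))
height.meta[:label]  # => "Height", available wherever the column ends up
```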

nalimilan commented on May 11, 2024

I've never really programmed in Julia yet, so I cannot promise anything...

Tom's point about storing meta-data directly in DataArrays sounds interesting for attributes that make sense when columns are taken in isolation (i.e. label, unit...). It would not make much sense for column groupings, since a group index taken alone does not mean much. But that may not be an issue: if you take a column out of its original DataFrame, you know that you're breaking its grouping with other columns.

I kind of like this solution: it means the meta-data would be preserved when passing the DataArray directly to a function, which could happen in many cases.

johnmyleswhite commented on May 11, 2024

Here's the metadata I'm on board with adding permanently:

  • Nullable: Is this column a Vector or a DataVector? (Note that, if we make the changes described in a recent discussion regarding problems with PDAs never being able to capture all properties of categorical data, we'll only have Vector or DataVector going forward.)
  • Column label/description: An arbitrary-length string describing the contents of that column in natural language.

Here's the metadata I like, but don't feel comfortable committing to just yet:

  • Units of measurement: Saying whether a vector is measured in inches or feet or meters seems really awesome, but it seems like it might be done rarely enough that I'm not ready to commit to it just yet. Let's shoot for working this idea out for after the 0.3 release.

FWIW, I'm used to people storing a description of the levels of cryptic enums in the description field of column tables in RDBMS.

HarlanH commented on May 11, 2024

PDAs were originally intended to be a performance/memory optimization, not (just) a representation for categorical data. I missed the discussion of their limitations -- would you point me at that?

johnmyleswhite commented on May 11, 2024

Agreed that PDAs were an optimization, but they've gotten used as factors.

I wrote about the limitations of PDAs after "an epiphany" described in JuliaStats/DataArrays.jl#50.

Summary: R gets a lot of mileage out of storing information about factor levels in vectors, but that's because each subset (including singleton elements) retains information about the vector as a whole. Since Julia has proper scalars, factors need to be represented using a new scalar type, which will probably end up looking like Enums.
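
A toy illustration of the "proper scalar" point -- a purely hypothetical ordinal value type, not something from DataArrays or DataFrames:

```julia
# Each value carries a reference to its ordered level set, so a single element
# pulled out of a vector still knows how to compare itself to other values.
struct OrdinalValue
    level::String
    levels::Vector{String}  # shared, ordered level set
end

Base.isless(a::OrdinalValue, b::OrdinalValue) =
    findfirst(==(a.level), a.levels) < findfirst(==(b.level), b.levels)

lv = ["low", "medium", "high"]
x = [OrdinalValue("high", lv), OrdinalValue("low", lv)]
x[1] > x[2]  # => true, with no surrounding vector or attributes needed
```

This is roughly the direction later taken by CategoricalArrays.jl, whose scalar values keep a reference to their level pool.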

nalimilan commented on May 11, 2024

I don't get why we would need a Nullable attribute: shouldn't this be inferred from the type of the column vector (i.e. Array or DataArray)?

Starting with column labels and not supporting units is reasonable. The essential point is to make the system extendable so that new attributes can be added in the future (custom attributes too?).

Finally, there's the question of whether some meta-data should be stored in DataArrays directly. For factors, the levels will have to. Conceptually, a variable label is also attached to the column rather than to the DataFrame. The problem is that standard Arrays do not support meta-data.

tshort commented on May 11, 2024

Regarding "Nullable", can't you just use colwise and extract that from the column type? Arrays can't have missing data and DataArrays can. Actually Arrays could have missing data if the Arrays holds a type that can be an NA. In any case, you should still be able to tell by the type of Array{T,N} using T.

As far as what we store in metadata, maybe we can use a Dict for that to allow storing different fields, and standardize on a few common names.

Regarding where to store the metadata, in this thread above, I outlined adding metadata to the columns. That helps with df[:newcol] = df2[:othercol]. But, what do you do with df[:col] + 1?

So, if we stick metadata in the DataFrame (or Index), we might need a structure that carries the data and metadata to handle the df[:newcol] = df2[:othercol] case. Or, we can just require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].
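
For the "Nullable" part specifically, a small sketch of reading nullability off the column type rather than storing it as metadata; the DataArrays-era check is commented out, and the Missing-based version reflects current Julia rather than the API being discussed here:

```julia
# Nullability can be inferred from the column's type instead of being stored.
isnullable(col::AbstractVector) = Missing <: eltype(col)   # current Julia columns
# isnullable(col::AbstractVector) = isa(col, DataArray)    # DataArrays-era equivalent

isnullable([1, 2, 3])        # => false
isnullable([1, missing, 3])  # => true (eltype is Union{Missing, Int64})
```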

johnmyleswhite commented on May 11, 2024

We don't need to store a nullable attribute. We just need to expose that information through an interface. But it might be faster to check a BitVector than to check the type tag of each column. Let's worry about implementation later and focus on design first.

I'm not really ready to embrace custom attributes just yet, since it fragments the community if some people's DataFrames have properties that other DataFrames don't share. Let's think about whether we should have them later.

For factors, the levels of the factor will be stored in the type system, not in a DataArray.

johnmyleswhite commented on May 11, 2024

I don't think we should store metadata in columns. AFAIK, RDBMSs don't do that: these properties are attached to the specific table and don't come along for the ride with the values in that table. So I'd rather require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].

tshort commented on May 11, 2024

Sounds good, John.

nalimilan commented on May 11, 2024

I'm not very familiar with database management systems, but it seems to me it would be convenient and completely logical to preserve column labels if you copy a column to another DataFrame, which is what df[:newcol] = df2[:othercol] is about. That said, a special function to copy a column could also be added if needed, which would handle this special case.

A more general issue I'm thinking about is that if meta-data is attached to the DataFrame and not to the vector, then an (imaginary) call like plot(df[:col1], df[:col2]) will not be able to access the column labels to find meaningful default axis labels. An interface dedicated to DataFrames will have to be used, something like plot(~ col1 + col2, df). This sounds fine to me (and even better than the first form), but it's worth checking that it would work in all common cases.

johnmyleswhite commented on May 11, 2024

I think people should always specify their axes labels manually if they don't want defaults.

nalimilan commented on May 11, 2024

Of course, I don't deny that. I was talking about the impossibility for plot(df[:col1], df[:col2]) to offer reasonable defaults when the user does not set them explicitly.

johnmyleswhite commented on May 11, 2024

That's true. But we'll never offer anything as smooth as R's kind of defaults, where you have access to information about the calling context. I think people can get used to explicitness.

nalimilan commented on May 11, 2024

With column labels, we can actually offer something much more useful than R's defaults. In R most of the time the default axis label is ugly or useless, e.g. df[["datebrth"]] or even x[[3]]. With DataFrames it could be Date of birth instead.

johnmyleswhite commented on May 11, 2024

We can use column labels in interfaces that take DataFrames as arguments, in the way that Gadfly does. Things that work with plain vectors should not assume labels will exist; otherwise they're broken for normal Arrays.

pdeffebach commented on May 11, 2024

I agree with the above. If the goal is to have easy plotting (automatic labels) and easy table creation, forcing a long list of packages to interact with a separate labeling package would be far more difficult to maintain than incorporating metadata into DataFrames.

bkamins commented on May 11, 2024

Closed with #3055
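
For reference, the metadata API that eventually shipped (DataFrames.jl 1.4+, following the DataAPI.jl interface) looks roughly like the sketch below; consult the current DataFrames.jl documentation for the exact signatures and propagation rules:

```julia
using DataFrames

df = DataFrame(gdpg = [1.2, 0.8, 2.1])

# Table-level metadata
metadata!(df, "caption", "World Bank extract"; style = :note)

# Column-level metadata, e.g. a human-readable label
colmetadata!(df, :gdpg, "label", "Annual GDP growth"; style = :note)

metadata(df, "caption")          # => "World Bank extract"
colmetadata(df, :gdpg, "label")  # => "Annual GDP growth"
```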
