Comments (14)

mhogeweg commented on July 19, 2024

It's unclear to me what the purpose of the size field is. Especially when working with APIs and web services, 'size' depends on the specific request for (a subset of) the data and the format the data is returned in. Is it the size of a zipped file that's made available or the unzipped data? What is the size of the Landsat archive (millions of images collected over several decades) vs a picture of NDVI generated from this archive (on-the-fly as part of the web service request) for a small portion of the US?

from project-open-data.github.io.

seanherron commented on July 19, 2024

Good point - I thought about this myself. Government still distributes tons of data via raw files, probably way more often than via APIs or web services. In many circumstances, raw file access is probably the best way to do this, and accessURL (which size is linked to) is inclusive of direct download to raw files.

When accessURL is linked to a raw file, I would say that showing the size of the file is a good practice. If, as an extreme example, we linked to a gigantic zip file of the entire Landsat archive (but probably more realistically something like FDA SPL data, which is distributed in CSV), people should know the size of the file before they click on it, in particular if someone on a mobile connection wants to quickly check out some tabular data but doesn't realize the file is actually 300,000 records in a 40 MB file or something.

As a side note, this is only useful if size changes as the file itself changes, which would necessitate either human intervention or server-side automation by agencies to update on a regular basis.
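To illustrate the kind of server-side automation mentioned here, below is a minimal sketch that refreshes a `size` value from the file as it currently exists on disk. The record layout (a dict with `accessURL` and `size` keys), the `refresh_size` helper, and the example URL are all assumptions for illustration; nothing in this thread specifies how an agency would actually wire this up.

```python
import os
import tempfile


def refresh_size(record: dict, local_path: str) -> dict:
    """Return a copy of a catalog record with `size` set to the file's current byte count.

    `record` and its keys are hypothetical; adapt them to the real catalog format.
    """
    updated = dict(record)
    # os.path.getsize reports the size in bytes of the file as it exists right now.
    updated["size"] = str(os.path.getsize(local_path))
    return updated


# Demo with a throwaway file standing in for a published dataset.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as fh:
    fh.write(b"id,name\n1,example\n")
    demo_path = fh.name

record = {"accessURL": "https://example.gov/data.csv", "size": "stale"}
fresh = refresh_size(record, demo_path)
os.remove(demo_path)
print(fresh["size"])
```

Run on a schedule or as a publish hook, something like this would keep the value in step with the file without anyone editing metadata by hand.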

mhogeweg commented on July 19, 2024

Just your last point would make me concerned about relying on the currency/accuracy of the size attribute whenever I see it. People just don't (manually) update this type of metadata. This is speaking from over 10 years working with Geospatial One-Stop and Data.gov in the US and various National Spatial Data Infrastructures globally.

On the web, whenever I click a link to download a file, my browser tells me how many bytes I'm about to download. That's directly associated with the actual file/stream/thing I'm about to download. Isn't that enough information for someone to decide to continue or not?

You describe a use case where someone is on a mobile device wanting to get some data. Do you know if there's an activity related to Data.gov to collect/define/design the various use cases? What IS the expected use of Data.gov in that sense? Are there apps (mobile/web/desktop/...) that people are building using datasets/services found at Data.gov that would then be used for the things you describe? Would those apps be findable at Data.gov?

seanherron commented on July 19, 2024

I agree with your point that this is not something we can reasonably expect that people will manually update, hence my point about it (hopefully) being automated.

I'm not aware of an activity for data.gov to collect use cases, though I believe http://next.data.gov/ is hoping to achieve that to some extent.

Hopefully one of the authors of the schema can chime in here on why they felt size should be included. I'm with you on a lot of your points, and I admit that mobile download of data is a pretty edge use case, and most other use cases I can think of (either bandwidth-constrained, bandwidth-capped, or storage-constrained environments) would be negated by the fact that we don't really have a way of ensuring this value is correct in the first place.

MarionRoyal commented on July 19, 2024

The field SIZE was used in the standard Data.gov Metadata template in the
manner that you have presumed and was probably just carried over into this
schema. Originally, it was to provide the user an idea of the amount of
resources needed before making a choice to download a block of data (disk
space, time, ...). I could probably argue that this is good to know before
my browser informs me. It could be checked on a mobile app, before taking
some action. We (at data.gov) have never used "size" as a metric of our
progress in achieving open data and I don't believe it is a valid metric
going forward. Points well made on not being applicable to APIs and web
services. So "size" probably rightfully deserves to carry on in the
Required if Applicable section. However, it will be applicable to the vast
majority of records.

With regard to changing the name of the field: As I age, I am becoming
less concerned or at least ambivalent on the nouns chosen to express a
concept (object) as long as the word is easily understood within a context
(or namespace if you will) and mappable to others. I am confident that
"size" in the context of this schema will not be confused with "dimension".
Having said that, it would probably be an improvement to recognize
DCAT:byteSize in future revisions. That, of course, unless we invent a new
noun to represent mass on a storage device.

Marion A. Royal PMP
Program Director, DataGov
GSA Office of Citizen Services and Innovative Technologies
202.302.4634

seanherron commented on July 19, 2024

@MarionRoyal: Thanks for the background. With regard to converting size to recognize byteSize, I'm imagining that the schema ought to mandate that values be given in bytes rather than just allowing for byte values, otherwise we still have the issues I brought up in the original post, right?

MarionRoyal commented on July 19, 2024

@seanherron: If you are asking me if I think we also need a sizeUnit, I
would say no. I think the existing field is a text field rather than a
decimal field - which means that a valid entry could include the number of
bytes (if less than a kilobyte) or could include a set of alphanumeric
characters which would most likely include letters K, M, G, T, P and could
easily be grokked by an app (and maybe even a human). The problem with
having a sizeUnit for this purpose is that it would suggest a need for
controlled vocabulary for this new field, which I think we are trying to
avoid.

So, I would agree with changing the field name to byteSize (since it
matches DCAT) and would have no objection to fileSize (since it is a
recognized PHP term), but would leave sizeUnit to other more precise
domains.
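As a rough sketch of how an app might "grok" such a free-text value, the snippet below parses a number followed by an optional K, M, G, T, or P letter into a byte count. The `parse_size` name and the accepted spellings are assumptions, and so is the choice of powers of 1024: the thread does not say whether decimal (1000) or binary (1024) multiples are intended.

```python
import re

# Assumption: each letter is a power of 1024. Swap in powers of 1000 if the
# catalog means decimal units; the discussion above does not settle this.
_MULTIPLIERS = {
    "": 1,
    "K": 1024,
    "M": 1024**2,
    "G": 1024**3,
    "T": 1024**4,
    "P": 1024**5,
}

# A number, an optional unit letter, and an optional trailing "B" (e.g. "40mb", "1.5 G").
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)B?\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a textual size such as '512', '300K', or '40mb' to a whole number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, letter = match.groups()
    return int(float(number) * _MULTIPLIERS[letter.upper()])


print(parse_size("40mb"))
```

Anything the pattern does not recognize raises `ValueError`, so a consuming app can fall back to showing the raw text to a human.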

MarinaNitze commented on July 19, 2024

I like @MarionRoyal's idea to adopt DCAT:byteSize -- but flag that not everyone knows what a byte is, so we should link to some sort of basic calculator folks can use to convert from more-familiar KB/MB/GB.

This topic has come up a lot. Ultimately, the size field is not a deeply reliable measure if we are asking people to populate it by hand, because file size changes if so much as a punctuation mark is edited in the source file, and is largely meaningless when applied to APIs, as outlined above. I think those of us who are more technical appreciate this, but we could stand to be clearer to the less-technical folks that they should not be using this field for any sort of precise measurement or for compliance purposes.

Since it's not precise, I am less inclined to make it fully machine-readable. Separate size and sizeUnit fields are overkill, because if you're machine-reading you can probably also automatically calculate files' true sizes.

skybristol commented on July 19, 2024

I think the only thing that scales at the relatively crude level of discovery metadata currently being discussed is to do as @MarionRoyal suggests and leave it as a rough textual notification to downstream users. Best practice would be to include some type of units or explanation in the attribute so a human reading it might have a clue on what they are getting into. Otherwise, we'd need to look across various standards on how the magnitude of a given asset might be described and account for all the specifics.
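For the "rough textual notification with units" approach, here is a minimal sketch of how a catalog tool might turn a byte count into such a string. The `describe_size` helper, the unit labels, and the use of 1024-based steps are assumptions for illustration, not something the schema prescribes.

```python
def describe_size(num_bytes: int) -> str:
    """Render a byte count as a short human-readable string with units.

    Assumption: 1024-based steps and one decimal place are precise enough
    for discovery metadata.
    """
    if num_bytes < 0:
        raise ValueError("size cannot be negative")
    units = ["bytes", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    index = 0
    # Step up one unit at a time until the value drops below 1024
    # or the largest unit is reached.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    if index == 0:
        return f"{num_bytes} bytes"
    return f"{value:.1f} {units[index]}"


print(describe_size(41943040))
```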

gbinal commented on July 19, 2024

+1 for keeping this more textual and the use of letters K, M, G, T, P. I'm envisioning the spectrum of catalog creators and think that the low bar is appropriate here. I also don't think there'll be many use cases for machine-consumption of this field.

If so, wouldn't it then be best to stick with filesize so as to avoid the need for everyone to go to a filesize catalog each time?

seanherron commented on July 19, 2024

It seems like we're all in agreement that the field isn't particularly useful or relevant, so I'm going to go against my original idea and say we just leave it as is to prevent complication. Maybe in the future, if we look to pare down the schema, this would be a good field to deprecate.

jpmckinney commented on July 19, 2024

Is this a duplicate of #55?

seanherron commented on July 19, 2024

Yes, looks like it. I can close this and reference #55 if you'd like. I didn't come across it when I was posting.

jpmckinney commented on July 19, 2024

@seanherron I've only skimmed the discussion in this thread, but that makes sense!
