GithubHelp home page GithubHelp logo

Comments (13)

fossygirl avatar fossygirl commented on June 10, 2024

Sorry, I don't know much about casing. @MichaelTsengLZ @jeffmcaffer or @jeffmendoza may be able to help here.

from curated-data.

sschuberth avatar sschuberth commented on June 10, 2024

I've proposed a PR at #7838.

from curated-data.

jeffmcaffer avatar jeffmcaffer commented on June 10, 2024

Thanks @sschuberth for finding this. I vaguely recall doing a deep dive on case preservation and sensitivity and feel like there's a doc or issue around that talks about our handling strategy. Seem to recall that Dan did this work back in the day. Can't find it right now. On the surface it would appear that this curation is the right thing to do. I am however concerned about the root cause and whether this this change will either perpetuate or exacerbate the problem (the latter seems unlikely).

It is entirely possible that the actual package changed their casing at some point in time. That seems especially possible in the NuGet world (seen it before). In that case (hah! pun) we'd have to fall back on the aforementioned casing analysis. My vague recollection is that we ended up at case preserving but case insensitive. The basic assumption being that we did not find any providers that were case sensitive (There were some old cases in npm land I think but they were old and appeared to addressed).

/cc @nellshamrell

from curated-data.

jeffmcaffer avatar jeffmcaffer commented on June 10, 2024

clearlydefined/service#157 talks about this somewhat at the implementation level. Still feel like there was an analysis of all the supported (and some prime potential) package managers with some summarization / discussion.

from curated-data.

sschuberth avatar sschuberth commented on June 10, 2024

At least for NuGet, given that the API requires the package ID to be lower case, maybe it makes sense to align on all lower case file names?

Or actually, does the casing of the file name ever matter? I guess not, as the name from the coordinates in the YAML should be taken for any lookups. Which means we could always (and for any package manager) align on lower case file names.

from curated-data.

jeffmcaffer avatar jeffmcaffer commented on June 10, 2024

I'm hesitant to make any concrete/general call on this without consulting the prior research. Casing is fraught with oddities. For this particular PR it should be possible to validate that with the changes merged, the curations are still applied to the relevant component/version.

Stepping back a bit, @sschuberth, how did this curation come about with different casing? Is this coming from ORT tooling using lowercase? or some other workflow?

from curated-data.

sschuberth avatar sschuberth commented on June 10, 2024

how did this curation come about with different casing?

I have no idea. To the best of my knowledge none of these curations were initially created by ORT or by myself in person. My brief checks just showed that the initial commit each was done by @clearlydefinedbot.

from curated-data.

jeffmcaffer avatar jeffmcaffer commented on June 10, 2024

yeah, that likely means that the PR was entered via the website and we are unable to do proper attribution (there was a technical challenge IIRC)

Anyway, I believe that NuGet is case preserving but not case sensitive. The metadata for DiscUtils.Core has mixed case (see below) while the protocol/service, as you point out, lowercases everything. So the user view of the package is mixed case.

<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>DiscUtils.Core</id>
    <version>0.15.1-ci0007</version>

Reaching way back I believe we our goal was to present users with the most authentic naming so are attempting to preserve the casing. So while the protocol is all lower, the package itself and the user view is mixed.

That still leaves us with the challenge of what casing to use in the underlying implementation given that Git is case-sensitive (your point here). I would have said that if the system is case IN-sensitive, we should toLower() the coordinates for storage. Again, reaching back, I recall there may have been a few places where we look at the storage layer and reverse engineer coordinates. That's not a great and we may have stopped doing that.

HAH! I found the doc
https://github.com/clearlydefined/clearlydefined/blob/533afead4c95e3ace9dbd727e2a166d89d229af8/docs/providers.md
Unfortunately that doesn't quite help in as much it captures observed behavior, not quite the reasoning why.

Path forward.

  • This is an awesome discussion and it's likely a good thing to be revisited and properly documented (with reasoning). That should likely be done in either a service or clearlydefined (near the doc) repo issue. Might get lost here
  • The PR cited here IMO should NOT be merged as it would vary from the stated behavior. Better to be consistently wrong (if we are at all) than inconsistently right. While that may make it LOOK like we split the curation, if we go by the casing in the package itself, we'll always have curation casing that matches the targeted package's casing.

from curated-data.

sschuberth avatar sschuberth commented on June 10, 2024

I would have said that if the system is case IN-sensitive, we should toLower() the coordinates for storage. Again, reaching back, I recall there may have been a few places where we look at the storage layer and reverse engineer coordinates. That's not a great and we may have stopped doing that.

So, what's missing for me in the path forward is:

  1. Find out whether there still are place where you reverse engineer coordinates from file path in the storage layer.
  2. If there are such places, get rid of them by looking at the file contents instead of the file path to deduce coordinates.
  3. Align on all lower case paths, but keep file contents to be case preserving.
  4. Ensure that no conflicting casing can be added via the website. That is, existing curations must be amended even if the casing is different but the system is case-preserving and case-insensitive.

The PR cited here IMO should NOT be merged

Then please feel free to close my PR (with a rationale pointing back to here) so that it does not accidentally get merged.

from curated-data.

ariel11 avatar ariel11 commented on June 10, 2024

@nellshamrell - unclear if this Issue can be closed?

from curated-data.

sschuberth avatar sschuberth commented on June 10, 2024

It's easy to verify the problem persists. Just try to clone this repo on Windows and you'll still get

Updating files: 100% (6359/6359), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'curations/git/github/Azure/azure-sdk-for-go.yaml'
  'curations/git/github/azure/azure-sdk-for-go.yaml'
  'curations/nuget/nuget/-/7Zip4PowerShell.yaml'
  'curations/nuget/nuget/-/7Zip4Powershell.yaml'
  'curations/nuget/nuget/-/CodeProject.ObjectPool.yaml'
  'curations/nuget/nuget/-/codeproject.objectpool.yaml'
  'curations/nuget/nuget/-/DiscUtils.yaml'
  'curations/nuget/nuget/-/Discutils.yaml'
  'curations/nuget/nuget/-/Elmah.MVC.yaml'
  'curations/nuget/nuget/-/Elmah.Mvc.yaml'
  'curations/nuget/nuget/-/EMGU.CV.yaml'
  'curations/nuget/nuget/-/Emgu.CV.yaml'
  'curations/nuget/nuget/-/EnvDTE.yaml'
  'curations/nuget/nuget/-/envdte.yaml'
  'curations/nuget/nuget/-/EnvDTE100.yaml'
  'curations/nuget/nuget/-/envdte100.yaml'
  'curations/nuget/nuget/-/EnvDTE80.yaml'
  'curations/nuget/nuget/-/envdte80.yaml'
  'curations/nuget/nuget/-/EnvDTE90.yaml'
  'curations/nuget/nuget/-/envdte90.yaml'
  'curations/nuget/nuget/-/EnvDTE90a.yaml'
  'curations/nuget/nuget/-/envdte90a.yaml'
  'curations/nuget/nuget/-/GitLink.yaml'
  'curations/nuget/nuget/-/gitlink.yaml'
  'curations/nuget/nuget/-/Jint.yaml'
  'curations/nuget/nuget/-/jint.yaml'
  'curations/nuget/nuget/-/LINQKit.yaml'
  'curations/nuget/nuget/-/LinqKit.yaml'
  'curations/nuget/nuget/-/Microsoft.AspNetCore.Authentication.JwtBearer.yaml'
  'curations/nuget/nuget/-/microsoft.aspnetcore.authentication.jwtbearer.yaml'
  'curations/nuget/nuget/-/MongoDB.LibMongocrypt.yaml'
  'curations/nuget/nuget/-/MongoDB.Libmongocrypt.yaml'
  'curations/nuget/nuget/-/MSBuildTasks.yaml'
  'curations/nuget/nuget/-/MsBuildTasks.yaml'
  'curations/nuget/nuget/-/PDFsharp-MigraDoc-GDI.yaml'
  'curations/nuget/nuget/-/PDFsharp-MigraDoc-gdi.yaml'
  'curations/nuget/nuget/-/React.Core.yaml'
  'curations/nuget/nuget/-/react.core.yaml'
  'curations/nuget/nuget/-/Semver.yaml'
  'curations/nuget/nuget/-/semver.yaml'
  'curations/nuget/nuget/-/StructureMap.yaml'
  'curations/nuget/nuget/-/structuremap.yaml'
  'curations/nuget/nuget/-/System.Data.HashFunction.SpookyHash.yaml'
  'curations/nuget/nuget/-/system.data.hashfunction.spookyhash.yaml'
  'curations/nuget/nuget/-/System.Windows.Interactivity.WPF.yaml'
  'curations/nuget/nuget/-/system.windows.interactivity.wpf.yaml'
  'curations/nuget/nuget/-/TinyMCE.JQuery.yaml'
  'curations/nuget/nuget/-/TinyMCE.jQuery.yaml'
  'curations/nuget/nuget/-/ValueInjecter.yaml'
  'curations/nuget/nuget/-/valueinjecter.yaml'
  'curations/nuget/nuget/-/VsWhere.yaml'
  'curations/nuget/nuget/-/vswhere.yaml'
  'curations/nuget/nuget/-/Xamarin.GooglePlayServices.Gcm.yaml'
  'curations/nuget/nuget/-/xamarin.googleplayservices.gcm.yaml'
  'curations/pypi/pypi/-/Pillow.yaml'
  'curations/pypi/pypi/-/pillow.yaml'
  'curations/pypi/pypi/-/Resource.yaml'
  'curations/pypi/pypi/-/resource.yaml'
  'curations/pypi/pypi/-/wxPython.yaml'
  'curations/pypi/pypi/-/wxpython.yaml'

from curated-data.

nellshamrell avatar nellshamrell commented on June 10, 2024

@ariel11 let's keep it open. I don't know when I will have time to address this, but it's good to keep it on our backlog.

from curated-data.

Littlerock505 avatar Littlerock505 commented on June 10, 2024

From what I do I use lower case for my files one it's faster to write and to the mainframe will except lower case for filing. And it keeps it from working harder. You start doing fancy stuff then there become problems.as long as there is great security for the files

from curated-data.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.