GithubHelp home page GithubHelp logo

Comments (4)

shyuep avatar shyuep commented on July 24, 2024

Note that I don't think (1) is ultimately an issue. It is merely a scaling factor and has no effect on most trend analysis. That said, I agree it is usually normalized by number density.

More generally, I find this to be a very poor and overly complicated implementation of a RDF. You do not actually need to do the whole complicated dist_rdf[dist] += 1 (first off, a better implementation of that would use a Counter). You simply need to create a sequence of all distances with no counting at all, then use numpy.histogram to bin all the distances by dist, i.e., somehting like

# Use Structure.get_all_neighbors instead of get_neighbors
# Compile the distances into a list.
# Concatenate the distances into one long sequence.
# Create histogram of distances (numpy will even normalize it properly for you if you specify normed=True)

I think this will accomplish the RDF in less than 5 lines, and will probably be 10x or more faster because you use the inherent efficiencies in get_all_distances and numpy's histogramming (rather than the very inefficient manual historgramming).

from matminer.

computron avatar computron commented on July 24, 2024

Hi all,

Just catching up with this thread. I haven't been actively developing or reviewing matminer development because we're trying to push through "atomate" at the moment and my full focus (when it's possible for me to code...) is on atomate.

@WardLT - thanks for getting involved with matminer and for your suggestion about normalization. For now I have pulled your PR especially since it helps with setup. In the future it would suggest to separate the PR for setup.py vs the partial RDF but again I am glad that you are taking the time to contribute.

@shyuep - I agree with your implementation suggestions (actually, the get_all_neighbors + numpy histogram method was my suggestion to the original implementor - I guess it was ignored). If @WardLT wants to try that implementation I would be all for it. Otherwise, I will definitely make sure to clean up the code when I get around to making a push for matminer.

from matminer.

WardLT avatar WardLT commented on July 24, 2024

I agree, the scaling factor isn't important if you use the RDF as a descriptor for trend analysis/machine learning. However, in some cases (e.g., using the RDF to estimate coordination numbers) the scaling factor is important and I don't want matminer users to be surprised that the scaling is off from what they expected.

I also agree that the implementation leaves some room for improvement. Beyond the performance issues you point out, having a floating point as the key to the dictionary leads to problematic issues with floating point equality.

I'll go ahead and revamp both the RDF and PRDF code to use this histogram method (thanks for the advice). Any objections to me changing both methods to return a numpy array rather than a dictionary?

@computron - Sorry for conflating the edits to setup.py and the PRDF addition. I'll make sure to keep things separate in the future.

from matminer.

computron avatar computron commented on July 24, 2024

Hi all,

I'm closing this issue for now. Feel free to reopen if needed

from matminer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.