
Comments (5)

achaikou commented on August 17, 2024

The frame.curves() function is very slow to use in a for loop.

I think you shouldn't call frame.curves() inside the loop. Call it just once, after the user has indicated that they want to load any curve, cache the result (all_curves = frame.curves()), and then use the full all_curves ndarray as the source to populate your 1-D pandas frame if you need to.

A single call to frame.curves() loads all the curves in the frame. Extracting curves separately, channel by channel (1-D, 1-D, 1-D...), with frame.channels[i].curves() is very likely to be much slower:

Due to the memory-layout of dlis-files, reading a single channel from disk and reading the entire frame is almost equally fast. That means reading channels from the same frame one-by-one with this method is way slower than reading the entire frame with Frame.curves() and then indexing on the channels-of-interest.

I think we actually read all the curves together anyway, even if only one was requested; we just return the one requested curve.
So a single call to frame.curves() seems to be the only option in your case.
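
A minimal sketch of the "call once, then index" idea. Since a real DLIS file is not available here, a small hand-built NumPy structured array stands in for the result of frame.curves(); the field names (FRAMENO, TDEP, GR, IMG), their shapes, and the helper pick_channels are all assumptions made for illustration. With a real file, the stand-in would be replaced by a single all_curves = frame.curves().

```python
import numpy as np

# Stand-in for `all_curves = frame.curves()`: a structured array with one
# named field per channel. Field names and shapes are made up for this sketch.
dtype = [("FRAMENO", "i4"), ("TDEP", "f8"), ("GR", "f4"), ("IMG", "f4", (4,))]
all_curves = np.zeros(3, dtype=dtype)
all_curves["FRAMENO"] = [1, 2, 3]
all_curves["TDEP"] = [100.0, 100.5, 101.0]
all_curves["GR"] = [55.0, -999.25, 60.0]
all_curves["IMG"] = np.arange(12, dtype="f4").reshape(3, 4)


def pick_channels(curves, names):
    """Return {name: ndarray} for the requested channels (hypothetical helper).

    This only indexes the array that is already in memory; nothing is
    read from disk again.
    """
    return {name: curves[name] for name in names}


picked = pick_channels(all_curves, ["GR", "IMG"])
print(picked["GR"].shape)   # (3,)
print(picked["IMG"].shape)  # (3, 4)
```

Indexing a structured array by field name gives one array per channel, so the per-channel loop can run over data already in memory instead of triggering a new read each time.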

from dlisio.

lucasblanes commented on August 17, 2024

I tried it and it worked perfectly! I now recover all curves from an image-log DLIS file in 0.62 minutes, compared to 18 minutes previously. Thank you very much!


lucasblanes commented on August 17, 2024

Just for the record for other colleagues, the code I use to extract these curves is as follows:

import numpy as np

def summary_curve_all(df_in, frame_in, nan_value=-999.25):
    # Column names are Portuguese: 'Unidade' = unit, 'Curvas' = curves,
    # 'Mínimo' / 'Máximo' / 'Média' / 'Mediana' = min / max / mean / median.

    # Read every curve in the frame with a single call.
    curves = frame_in.curves()

    all_curves = []
    all_mins   = []
    all_maxs   = []
    all_means  = []
    all_median = []

    # Skip the first field, as the original loop did by starting at index 1;
    # row i of df_in describes field i + 1 of the frame.
    for i, name in enumerate(curves.dtype.names[1:]):
        # Work on a float64 copy so that NaN can be stored, also for integer channels.
        curve = curves[name].astype(np.float64)
        # Mask the null value before scaling; after scaling it would no longer match.
        curve[curve == nan_value] = np.nan
        if df_in.loc[i, 'Unidade'] == 'meters':
            curve = curve * 0.00254
        all_curves.append(curve)
        all_mins.append(  np.nanmin(   curve))
        all_maxs.append(  np.nanmax(   curve))
        all_means.append( np.nanmean(  curve))
        all_median.append(np.nanmedian(curve))
    df_in['Curvas']  = all_curves
    df_in['Mínimo']  = all_mins
    df_in['Máximo']  = all_maxs
    df_in['Média']   = all_means
    df_in['Mediana'] = all_median


achaikou commented on August 17, 2024

Hi!

No, I don't think this question has been asked before.

You are right, it seems impossible to easily read all multidimensional curves into the same pandas dataframe. From here:

Note that pandas (and CSV) only supports scalar sample values. I.e. frames containing one or more channels that have none-scalar sample values cannot be converted to pandas.DataFrame or CSV directly.

I am not a pandas specialist, so I don't know what the possible workarounds are.

However, you say that a NumPy array is enough.
frame.curves() already returns a numpy.ndarray. Is there something preventing you from using it directly, without any additional conversion?
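
One possible way to work with the returned array directly is to split its fields by sample shape, so that only scalar channels are ever handed to pandas while array-valued channels stay as ndarrays. This is a sketch under assumptions, not a dlisio feature: split_by_dimension is a hypothetical helper, and the structured array below is a hand-built stand-in for the result of frame.curves() with made-up channel names.

```python
import numpy as np


def split_by_dimension(curves):
    """Split a structured array into scalar fields and array-valued fields.

    Returns two dicts mapping field name -> ndarray (hypothetical helper).
    """
    scalars, arrays = {}, {}
    for name in curves.dtype.names:
        # A field's dtype has shape () when each sample is a single value.
        target = scalars if curves.dtype[name].shape == () else arrays
        target[name] = curves[name]
    return scalars, arrays


# Stand-in for the result of frame.curves(); names and shapes are made up.
dtype = [("TDEP", "f8"), ("GR", "f4"), ("IMG", "f4", (4,))]
curves = np.zeros(3, dtype=dtype)

scalars, arrays = split_by_dimension(curves)
print(sorted(scalars))      # ['GR', 'TDEP']
print(sorted(arrays))       # ['IMG']
print(arrays["IMG"].shape)  # (3, 4)
```

The scalars dict holds only 1-D arrays of equal length, so it can be passed to pd.DataFrame(scalars); the entries in arrays can be used as plain ndarrays.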


lucasblanes commented on August 17, 2024

Thanks for the answer, Alena. I am currently able to store the NumPy array that frame.curves() returns in a pandas DataFrame cell:

[Screenshot: DataFrame editor, 2024-06-25]

The problem is that I do this iteratively, which is quite time-consuming for DLIS files that contain many multidimensional arrays, such as image logs or wireline formation tests. This is my code:

import numpy as np

def summary_curve_values(df_in, curve_index='Curvas', unit_index='Unidade', nan_value=-999.25, verbose=True):
    # Column names are Portuguese: 'Unidade' = unit, 'Curvas' = curves,
    # 'Mínimo' / 'Máximo' / 'Média' / 'Mediana' = min / max / mean / median.
    values = []
    mins   = []
    maxs   = []
    means  = []
    median = []
    for i in range(len(df_in)):
        if verbose:
            print(f'Starting curve {i + 1} of {len(df_in)}.')
        # Each cell holds a callable; calling it loads the curve (the slow step).
        # Convert to float64 so that NaN can be stored, also for integer channels.
        curve = df_in.loc[i, curve_index]().astype(np.float64)
        # Mask the null value before scaling; after scaling it would no longer match.
        curve[curve == nan_value] = np.nan
        if df_in.loc[i, unit_index] == 'meters':
            curve = curve * 0.00254
        values.append(curve)
        mins.append(  np.nanmin(   curve))
        maxs.append(  np.nanmax(   curve))
        means.append( np.nanmean(  curve))
        median.append(np.nanmedian(curve))
    df_in['Curvas']  = values
    df_in['Mínimo']  = mins
    df_in['Máximo']  = maxs
    df_in['Média']   = means
    df_in['Mediana'] = median

A note: I first store the frame.curves object in the DataFrame cell and only call it, through frame.curves(), if the user asks for it. I do this because I first show the DLIS information to the user, and only if they want to load the curves does the function above run to extract them. That way the code spends no time on loading when the user does not want the curves.
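
The deferred loading described above can be combined with caching, so that the expensive call happens at most once no matter how many channels are requested afterwards. The following is only a sketch: LazyCurves is a hypothetical wrapper, and fake_loader is a stand-in for passing frame.curves as the loader.

```python
import numpy as np


class LazyCurves:
    """Call `loader` on first use and reuse the result afterwards (hypothetical)."""

    def __init__(self, loader):
        self._loader = loader  # e.g. frame.curves, passed uncalled
        self._cache = None
        self.calls = 0         # counts real loads, for demonstration only

    def get(self):
        if self._cache is None:
            self._cache = self._loader()
            self.calls += 1
        return self._cache


def fake_loader():
    # Stand-in for frame.curves(); channel names are made up.
    data = np.zeros(2, dtype=[("TDEP", "f8"), ("GR", "f4")])
    data["GR"] = [55.0, 60.0]
    return data


lazy = LazyCurves(fake_loader)   # nothing is loaded yet
gr = lazy.get()["GR"]            # first access triggers the single load
tdep = lazy.get()["TDEP"]        # served from the cache
print(lazy.calls)                # 1
```

Storing one such wrapper per frame, rather than one callable per channel, keeps the "load only on request" behaviour while avoiding a repeated read for every row.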

In short, my problem is now one of efficiency. The frame.curves() function is very slow to use in a for loop, and I need faster code. Does anyone have an idea for using pd.DataFrame(Frame.curves()) to extract the 1-D curves and then looping over only the N-D channels to get the multidimensional curves?

