
datawhale_ml-datamining's Introduction

datawhale project issue

Brief notes and key points worth writing down.

Task 2 Data EDA

target:

get familiar with the data set

understand the relationships between variables

prepare for feature engineering

main-content:

data importing and visualization

outline of the data (df.describe(), df.info())

NaN values and inconsistent values

distribution of the data
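
The EDA steps listed above can be sketched roughly as follows. The toy DataFrame, its column names, and the "-" placeholder are assumptions made for illustration; they are not taken from the actual task data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data set.
df = pd.DataFrame({
    "price": [1200.0, 3500.0, 800.0, np.nan, 15000.0, 2300.0],
    "power": [75, 110, 60, 90, 300, 0],
    "fuel": ["gas", "diesel", "gas", "-", "gas", "diesel"],
})

# Outline of the data: summary statistics and column dtypes.
summary = df.describe()
df.info()

# NaN values per column.
missing = df.isnull().sum()

# Inconsistent placeholder values (assumed here to be "-") are mapped to NaN
# so that they are counted as missing as well.
df["fuel"] = df["fuel"].replace("-", np.nan)
missing_after = df.isnull().sum()

# Skewness is a quick numeric check of the shape of each distribution.
skewness = df[["price", "power"]].skew()

print(missing_after)
print(skewness)
```

A histogram per column (for example with df.hist()) is the usual visual counterpart to the skewness check.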

Task 3 Feature engineering

main-content:

1.outlier handling

2.data normalization

3.data bucketing

4.NaN-value handling

5.feature construction

6.feature selection

7.dimensionality reduction (PCA, LDA, ICA)

useful functions:

df.groupby()

pd.cut() for bucketing

sklearn.preprocessing

pd.get_dummies()
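
A minimal sketch of how the functions named above might be combined. The column names, bin edges, and labels are hypothetical, and MinMaxScaler is just one option from sklearn.preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "brand": ["a", "b", "a", "c", "b", "a"],
    "power": [60, 110, 75, 300, 90, 150],
    "price": [800.0, 3500.0, 1200.0, 15000.0, 2300.0, 5000.0],
})

# Data bucketing: pd.cut() assigns each value to a labelled interval.
df["power_bin"] = pd.cut(
    df["power"], bins=[0, 100, 200, 400], labels=["low", "mid", "high"]
)

# Feature construction: df.groupby() builds a per-brand statistic.
df["brand_mean_price"] = df.groupby("brand")["price"].transform("mean")

# Normalization: rescale power into the range [0, 1].
df["power_scaled"] = MinMaxScaler().fit_transform(df[["power"]]).ravel()

# One-hot encoding of the categorical brand column.
encoded = pd.get_dummies(df, columns=["brand"])

print(encoded.head())
```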

Task 4 Model building and hyper-parameter tuning

main-content:

1.linear regression:

feature requirements:

convert object-type columns to numeric

features follow the same distribution

no NaN values

handle long-tail distributions
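
The requirements above could be met along these lines before fitting a linear model. The synthetic data, the column names, the median fill, and the log1p transform are assumptions chosen for illustration; other encodings and transforms are equally valid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a long-tailed target (illustrative only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "power": rng.uniform(50, 300, n),
    "gearbox": rng.choice(["manual", "auto"], n),
})
df["price"] = np.exp(0.01 * df["power"] + rng.normal(0, 0.1, n)) * 100
df.loc[0, "power"] = np.nan  # inject one missing value

# No NaN values: fill the gap with the column median.
df["power"] = df["power"].fillna(df["power"].median())

# Object-type columns: one-hot encode them into numeric columns.
X = pd.get_dummies(
    df[["power", "gearbox"]], columns=["gearbox"], drop_first=True, dtype=float
)

# Long-tail distribution: fit on log1p(price) instead of price.
y_log = np.log1p(df["price"])

model = LinearRegression().fit(X, y_log)

# Undo the transform to get predictions back on the original price scale.
pred_price = np.expm1(model.predict(X))
print(model.score(X, y_log))
```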

2.embedded feature selection

lasso regression (L1 regularization)

ridge regression (L2 regularization)

decision tree
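
A small sketch of embedded feature selection with the three model families listed above, on synthetic data where only the first two features carry signal. The data and the hyper-parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# L1 regularization can drive coefficients exactly to zero, so the
# features with non-zero coefficients are the ones it "selected".
selected_by_lasso = np.flatnonzero(lasso.coef_)

# L2 regularization shrinks coefficients but does not zero them out.
ridge_coefs = ridge.coef_

# Decision trees rank features through feature_importances_.
top_by_tree = np.argsort(tree.feature_importances_)[::-1][:2]

print(selected_by_lasso, ridge_coefs, top_by_tree)
```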

3.hyper-parameter tuning

Bayesian optimization

grid search

greedy
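
Of the three tuning strategies, grid search is the simplest to show with scikit-learn alone. The sketch below uses GridSearchCV on a ridge model; the synthetic data, the alpha grid, and the scoring metric are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Every value in the grid is evaluated with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_mean_absolute_error"
)
search.fit(X, y)

# The score is a negated error, so flip the sign to read it as MAE.
best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_
print(best_alpha, best_mae)
```

Greedy tuning would instead fix all parameters but one and optimize them one at a time, which is cheaper than a full grid but can miss interactions between parameters.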

useful function:

import numpy as np

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data
    type to reduce memory usage."""
    # memory_usage() returns bytes, so convert to MB before reporting
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            # non-object columns are assumed to be numeric (int or float)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the smallest integer type whose range holds the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

Task 5

main-content:

1.stacking

2.blending

3.bagging/boosting
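
A minimal stacking sketch using scikit-learn's StackingRegressor, with a bagging-style model (random forest) and a boosting model (gradient boosting) among the base learners. The synthetic data, the choice of base models, and the linear meta-model are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data with a linear and a non-linear component.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),  # bagging-style
    ("gbr", GradientBoostingRegressor(random_state=0)),              # boosting
]

# Stacking: the meta-model is trained on out-of-fold predictions
# of the base models (cv=5), then used to combine them.
stack = StackingRegressor(
    estimators=base_models, final_estimator=LinearRegression(), cv=5
)
stack.fit(X_train, y_train)
stack_r2 = stack.score(X_test, y_test)
print(stack_r2)
```

Blending follows the same idea but trains the meta-model on predictions for a single hold-out split instead of cross-validated folds.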

web crawler

http

request
