luthfianto / dmc-2016

Team: Uni Gadjah Mada 1. Our attempts and solutions for prudsys' Data Mining Cup 2016

Jupyter Notebook 96.17% Python 3.82% GCC Machine Description 0.01%

dmc-2016's People

Contributors: amirahff, luthfianto, meisyarahd

Forkers: amirahff

dmc-2016's Issues

voucherID: Missing value

Maybe we should just drop the ID column entirely.

If we do want to assign our own IDs, group them by value first (voucherAmount).

But dropping it still seems like the way to go.

n_estimators

The default is n_estimators = 10, which seems too few.
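For context, older scikit-learn releases defaulted random forests to 10 trees (newer ones default to 100). A minimal sketch of comparing larger values on synthetic data; the dataset and values here are illustrative, not the project's actual pipeline:

```python
# Sketch: raising n_estimators above the old scikit-learn default of 10
# and comparing cross-validated accuracy on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for n in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0, n_jobs=-1)
    score = cross_val_score(clf, X, y, cv=3).mean()
    print(f"n_estimators={n}: CV accuracy {score:.3f}")
```

More trees cost more training time but rarely hurt accuracy, so the main trade-off is compute.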

Polynomial Feature Selection

Select features from those generated by PolynomialFeatures. PF produces a huge number of features, so keep only the most influential ones.**

** On this point, I read (in An Introduction to Statistical Learning) that if an interaction feature X1 × X2 matters, we must not exclude X1 and X2 themselves.
This is the hierarchical principle: "if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant".
The reasoning is that if X1 × X2 is 'important', then whether X1 or X2 is individually 'important' matters little (the interaction is highly explanatory); also, X1 × X2 is correlated with X1 and X2, so dropping them could change the 'meaning of the interaction'.

Fix the probability so it's cumulative?

According to bahrun in #2 it should be cumulative. Why? Because, intuitively, we cannot predict the present with data from the future.
Say today is day N, tomorrow is N+1, the day after is N+2. We cannot predict N+1 with data from N+2; only data up to day N can be used.

Meanwhile, return_prob in the test data uses the final probability (equivalent to the current implementation, which sums over all rows).

The more accurate term is prior probability.

What do you all think? Discuss!

Example arguments on this issue:

  • pro: bahrun's argument above
  • con: can a machine learning model actually learn the meaning of a prior probability (a probability learned from the past, which changes over time)?

Steps to fix it?

  • change sum() to cumsum()
  • make sure the testing/validation data uses the last cumulative probability
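The two steps above can be sketched with pandas; the column names are illustrative and the rows are assumed to already be sorted by date:

```python
# Sketch of the sum() -> cumsum() fix: each row's return probability uses only
# earlier rows for that customer, so no future information leaks in.
import pandas as pd

df = pd.DataFrame({
    "customerID": [1, 1, 1, 2, 2],
    "returned":   [0, 1, 1, 1, 0],
})  # assumed sorted by orderDate within each customer

g = df.groupby("customerID")["returned"]
# cumsum minus the current label over cumcount = probability from history only;
# a customer's first order has no history, so it comes out as NaN
df["return_prob"] = (g.cumsum() - df["returned"]) / g.cumcount()

# for the test set, take each customer's *last* cumulative probability
last_prob = df.groupby("customerID")["return_prob"].last()
print(df)
```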

Feature Extraction

  • Return Spend
    • cumsum: return_spend. Total value of the customer's purchases that were returned
  • Non Return Spend
    • cumsum: nonreturn_spend. Total value of the customer's purchases that were kept
  • The customer's usual size.
    • customerID + productGroup + sizeCode --> compute its cumulative probability (already done by @amirahff).
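The two cumsum features above can be sketched like this (column names are illustrative, data synthetic):

```python
# Sketch of return_spend / nonreturn_spend: running totals of returned vs.
# kept purchase value per customer.
import pandas as pd

df = pd.DataFrame({
    "customerID": [1, 1, 2, 2],
    "price":      [40.0, 60.0, 10.0, 30.0],
    "returned":   [1, 0, 0, 1],
})

df["return_spend"] = (df["price"] * df["returned"]).groupby(df["customerID"]).cumsum()
df["nonreturn_spend"] = (df["price"] * (1 - df["returned"])).groupby(df["customerID"]).cumsum()
print(df)
```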

@amirahff's analysis of productGroup against its sizeCode distribution:

  • three groups whose sizes are all A: 43, 45, 90.
  • group 2's sizes are 24-34.
  • group 7 has S, M, L in addition to 34-44.
  • group 9: 32-44
  • group 17 is mixed: 42-44, 75-100, A, I, L, M, S, XL, XS
  • group 26 only has size 40
  • for null productGroup the only size is A

mean, variance, skewness, kurtosis


  • mean
  • variance
  • skewness
  • kurtosis

Try building these for feature combinations, e.g. articleID + price = mean_article_price, median_article_price, min_article_price, max_article_price, skew_article_price, kurtosis_article_price, var_article_price

Does this make sense, or is it pure nonsense?
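A sketch of the articleID + price combination using a pandas groupby-aggregate, then merging the statistics back as row-level features (names and data are illustrative):

```python
# Sketch of per-article price statistics; kurtosis could be added the same
# way via pd.Series.kurt.
import pandas as pd

df = pd.DataFrame({
    "articleID": ["a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "price":     [10.0, 12.0, 14.0, 5.0, 5.0, 9.0, 30.0],
})

stats = (df.groupby("articleID")["price"]
           .agg(["mean", "median", "min", "max", "var", "skew"])
           .add_suffix("_article_price"))

# merge the aggregates back onto each row
df = df.merge(stats, left_on="articleID", right_index=True, how="left")
print(df.head())
```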

Expensive products bought on Tuesday/Wednesday

  • Some people "buy" an expensive product just to wear it to a party, then refund it
  • assume an expensive product is one with a high rrp
  • combine a Tuesday/Wednesday feature with articleID/price/productGroup

Worth checking; who knows, it might be effective
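One way to sketch the flag: mark orders where a high-rrp item was ordered on a Tuesday or Wednesday. The 75th-percentile cutoff and column names are assumptions, not the project's actual choices:

```python
# Sketch of an "expensive item ordered on Tue/Wed" interaction feature.
import pandas as pd

df = pd.DataFrame({
    "orderDate": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
    "rrp":       [20.0, 150.0, 150.0],
})

expensive = df["rrp"] >= df["rrp"].quantile(0.75)      # assumed cutoff
tue_wed = df["orderDate"].dt.dayofweek.isin([1, 2])    # Monday = 0
df["expensive_tue_wed"] = (expensive & tue_wed).astype(int)
print(df)
```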

productGroup: keep, drop, probability? or... Impute!!!!

Not very significant. What should we do with it?

feature_importances_:
[('quantity', 0.007251995957035318),
 ('voucherAmount', 0.0084270098393323216),
 ('productGroup', 0.013338614492624339),
 ('voucherID', 0.016040803567248609),
 ('paymentMethod', 0.019941250968533431),
 ('deviceID', 0.020316516646024803),
 ('months', 0.026122648399999011),
 ('sizeCode', 0.026651668213207254),
 ('rrp', 0.031491169954646278),
 ('choice_order', 0.039189931417207585),
 ('price', 0.047673059358639566),
 ('order_order', 0.064799658865175275),
 ('colorCode', 0.068958628472941763),
 ('mmdd', 0.072462020638408065),
 ('articleID', 0.073478648965403653),
 ('orderDate', 0.073569468144725342),
 ('total_order', 0.079232118466960516),
 ('after_voucher', 0.082577205489615293),
 ('budget', 0.087331963723963874),
 ('customerID', 0.087644589636466194)]

cc: @amirahff @meisyarahd @rochanaph

Return Probabilities

Method: groupby a column, then take that group's proportion of returnQuantity/quantity

Todo:

  • article_ID_prob
  • colorCode_prob
  • sizeCode_prob
  • article_color_prob (thanks @amirahff!)
  • article_size_prob (thanks @amirahff!)

Not effective:

  • paymentMethod_prob
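The groupby-proportion method above can be sketched like this for colorCode_prob (synthetic data, illustrative names):

```python
# Sketch of a return-probability feature: per group, the ratio of returned
# quantity to ordered quantity, merged back onto each row.
import pandas as pd

df = pd.DataFrame({
    "colorCode":      [1, 1, 2, 2],
    "quantity":       [2, 1, 1, 1],
    "returnQuantity": [1, 1, 0, 1],
})

grp = df.groupby("colorCode")[["returnQuantity", "quantity"]].sum()
prob = (grp["returnQuantity"] / grp["quantity"]).rename("colorCode_prob")
df = df.merge(prob, left_on="colorCode", right_index=True, how="left")
print(df)
```

Note that when this is computed on training data, the probabilities leak label information unless they are computed cumulatively or on a held-out fold, which connects back to the cumulative-probability issue above.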

More feature extraction ideas?

Sumber: https://github.com/xydrolase/dmc-2014/blob/master/featgen%2Ffeat_gen.R

Notes:

  • cid = customer id
  • iid = item id

List:

  • mindisc.by.iid = min(disc , na.rm = T) ,
  • meandisc.by.iid = mean(disc , na.rm = T)
  • minprice.by.iid = min(price) ,
  • maxprice.by.iid = max(price) ,
  • meanprice.by.iid = mean(price)). ours: usual_unit_price
  • keverage #considering the local(the nearest 15 order) disc/price trend for items smooth with neighbor to be 15
  • localdisc
  • localprice
  • pricediff. ours: price_diff
  • discdiff
  • outday.by.iid
  • deal
  • nlowprice.by.cid = sum(price < 100) ,
  • nlowdisc.by.cid = sum(disc < 0.8) ,
  • ndeal.by.cid = sum(deal) ,
  • norder.by.cid = n(). ours: order_order
  • nreturn.by.cid = sum(return , na.rm = T) ,
  • totalspend.by.cid = sum(price) ,
  • meanspend.by.cid = mean(price). ours: customer_budget
  • nonreturnspend.by.cid = sum(price * (return==0) , na.rm = T) ,
  • returnspend.by.cid = sum(price * (return == 1) , na.rm = T))

some by.batch.cid features (xin has done this)

  • mbspend = mean(price) , bsize = n()) %.%
  • nbc = length(unique(date)) , noc = n() , bspendc = sum(price))
  • #outseason.by.iid
  • check_order_before
  • check_keep_future
  • check_return_future
  • check_order_future
  • #ob.by.cid.iid.price
  • #of.by.cid.iid.price
  • rb.by.cid.iid.price = check_return_before(date,return),
  • kb.by.cid.iid.price = check_keep_before(date,return),
  • rf.by.cid.iid.price = check_return_future(date,return),
  • kf.by.cid.iid.price = check_keep_future(date,return)) %.%

If a cid returned/kept/ordered an exactly same item before/in the future

  • #ob.by.cid.iid.color.size
  • #of.by.cid.iid.color.size
  • rb.by.cid.iid.color.size = check_return_before(date,return),
  • kb.by.cid.iid.color.size = check_keep_before(date,return),
  • rf.by.cid.iid.color.size = check_return_future(date,return),
  • kf.by.cid.iid.color.size = check_keep_future(date,return)) %.%

If a cid returned/kept/ordered a same iid before/in the future

  • rb.by.cid.iid = check_return_before(date,return),
  • ob.by.cid.iid = check_order_before(date,return),
  • kb.by.cid.iid = check_keep_before(date,return),
  • rf.by.cid.iid = check_return_future(date,return),
  • of.by.cid.iid = check_order_future(date,return),
  • kf.by.cid.iid = check_keep_future(date,return)) %.%

If a cid returned/kept/ordered a same item with same price before/in the future

  • #ob.by.cid.price
  • #of.by.cid.price
  • rb.by.cid.price = check_return_before(date,return),
  • kb.by.cid.price = check_keep_before(date,return),
  • rf.by.cid.price = check_return_future(date,return),
  • kf.by.cid.price = check_keep_future(date,return)
  • rankprice.by.cid.iid = rank(price)) %.% #the price rank of price of the iid ordered by a cid
  • ntotal = nrow(train)
  • #llr.by.price
  • ##return+1/#keep+1
  • llr.by.price = log((sum(return , na.rm = T)+0.5) / (0.5+length(return)-sum(return,na.rm=T)))) %.%
  • k = sum(return , rm.na = T)))$k
  • n = n()))$n
  • llr.by.cid.price = log((sum(return , na.rm = T)+0.5) / (0.5+length(return)-sum(return,na.rm=T)))) %.%
  • rm(train)

item freshness

  • raw.tr$f1w <- fan.feats$outday.by.iid <= 7
  • raw.tr$f2w <- fan.feats$outday.by.iid <= 14
  • raw.tr$f1m <- fan.feats$outday.by.iid <= 30
  • raw.tr$f3m <- fan.feats$outday.by.iid <= 90
  • raw.tr$f6m <- fan.feats$outday.by.iid <= 180
  • raw.tr$oseas <- fan.feats$outseason.by.iid
  • raw.tr$isdisc <- fan.feats$disc < 1
  • raw.tr$deal <- fan.feats$deal
  • raw.tr$lowdisc <- fan.feats$disc <= 0.8

price ranges

  • raw.tr$pb25 <- raw.tr$price < 25
  • raw.tr$pb50 <- raw.tr$price < 50
  • raw.tr$pb100 <- raw.tr$price < 100
  • raw.tr$pb200 <- raw.tr$price < 200

compute counts and LLRs for a given "feats", the combination of features.

  • counts.and.llrs <- function(df, feats, c1=0.5, c2=0.5) {

3way interaction: color_state_iid

  • .feats = counts.and.llrs(raw.tr, c("state", "iid", "color"))
  • names(.feats) <- c("all.cnt.state_iid_color", "all.llr.state_iid_color")
  • all.feats <- cbind(all.feats, .feats)
  • .feats = counts.and.llrs(raw.tr, c("state", "mid", "color"))
  • names(.feats) <- c("all.cnt.state_mid_color", "all.llr.state_mid_color")
  • all.feats <- cbind(all.feats, .feats)

ratio of low price / low discount

  • fan.feats$rlowprice.by.cid <- fan.feats$nlowprice.by.cid /
  • all.feats$all.cnt.cid
  • fan.feats$rlowprice.by.cid[all.feats$all.cnt.cid == 0] <- 0
  • fan.feats$rlowdisc.by.cid <- fan.feats$nlowdisc.by.cid /
  • all.feats$all.cnt.cid
  • fan.feats$rlowdisc.by.cid[all.feats$all.cnt.cid == 0] <- 0

batch features with some selected interactions

  • bat.n=length(oid),
  • bat.uniq.iid=length(unique(iid)),
  • bat.uniq.mid=length(unique(mid)),
  • bat.uniq.size=length(unique(size)),
  • bat.uniq.color=length(unique(color)),
  • bat.uniq.ztype=length(unique(ztype)),
  • bat.uniq.zsize=length(unique(zsize)),
  • bat.uniq.mid_zsize=nrow(unique(cbind(mid, zsize))),
  • bat.uniq.ztype_zsize=nrow(unique(cbind(ztype, zsize))),
  • bat.uniq.iid_color=nrow(unique(cbind(iid, color))),
  • bat.uniq.mid_color=nrow(unique(cbind(mid, color)))

batch counts / other counts / max counts

customer per batch features

  • rrate=sum(return)/length(return),
  • krate=1-sum(return)/length(return)) %.%

only set the first order of each batch to be the true rate, others set to be NA

  • srrate=c(sum(return, na.rm=T)/length(return), rep(NA, length(return)-1)),
  • skrate=c(1-sum(return, na.rm=T)/length(return), rep(NA, length(return)-1))) %.%
  • cb.ret.rates$srrate <- cb.ret.srates$srrate
  • cb.ret.rates$skrate <- cb.ret.srates$skrate

average return/keep rate, weighted and unweighted,

  • cbat.wavg.rrate=mean(rrate, na.rm=T),
  • cbat.wavg.krate=mean(krate, na.rm=T),

simple averages

  • cbat.avg.rrate=mean(srrate, na.rm=T),
  • cbat.avg.krate=mean(skrate, na.rm=T),
  • cbat.sum.rrate=sum(srrate, na.rm=T),
  • cbat.sum.krate=sum(skrate, na.rm=T))
  • # log-likelihood ratio of return over kept

Should we generate as many features as possible (and then reduce them)?

See the slides from Team 1 of Iowa State University. They applied generic methods across the available columns, producing 1000 features (slide 17). Those features were certainly reduced afterwards.

Should we do something like this? How would we do it?
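A sketch of the generate-then-reduce pattern, blowing up the feature space and then keeping the k best by a univariate score; this illustrates the idea only and is not the Iowa State pipeline:

```python
# Sketch: expand features with PolynomialFeatures, then reduce with SelectKBest.
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

Xp = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(Xp.shape)   # 20 main effects + 210 degree-2 terms = 230 columns

Xr = SelectKBest(f_classif, k=30).fit_transform(Xp, y)
print(Xr.shape)   # reduced back to 30 columns
```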

TPOT & a for-in over models

For model selection:

  • try TPOT; it will pick the best method for us and even generate the script
  • try a range of models with a for-in loop, then see which one yields the smallest error
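The second option can be sketched as a plain loop over candidate models with cross-validated scores; the models and metric here are illustrative choices:

```python
# Sketch: compare several scikit-learn models and pick the best by CV score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

TPOT automates roughly this search (plus preprocessing and hyperparameters) with a genetic algorithm, at the cost of much longer run times.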

rrp: missing values

A few quick-and-dirty options:

  • treat rrp as a label and predict it with regression (but tune the parameters first)
  • impute with the mean/median of its productGroup

Better suggestions welcome
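The second option is a one-liner with a groupby transform; a minimal sketch on synthetic data (column names follow the issue):

```python
# Sketch: impute missing rrp with the median rrp of its productGroup.
import pandas as pd

df = pd.DataFrame({
    "productGroup": [2, 2, 2, 7, 7],
    "rrp":          [10.0, None, 20.0, 40.0, None],
})

df["rrp"] = df["rrp"].fillna(df.groupby("productGroup")["rrp"].transform("median"))
print(df)
```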
