Implement and compare the Apriori and FP Growth frequent pattern mining algorithms. Introduce improvements to the Apriori algorithm and compare the results with the original algorithm.
Dataset used: UCI Adult Census Dataset (Source: https://archive.ics.uci.edu/ml/datasets/Adult)
Attributes: Total number of attributes – 15
• Age: continuous numeric data.
• Work-class: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• Fnlwgt: continuous data.
• Education: Level of education. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• Education-num: continuous data.
• Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Marriedspouse-absent, Married-AF-spouse.
• Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Privhouse-serv, Protective-serv, Armed-Forces.
• Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
• Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• Sex: Female, Male.
• Capital-gain: continuous data.
• Capital-loss: continuous data.
• Hours-per-week: continuous data.
• Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, OutlyingUS(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, ElSalvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
• Income: income categorized as greater than 50K or less than or equal to 50K (>50K, <=50K).
Missing Values – Yes, represented by “?”.
Irrelevant Attributes – Yes, there are two attributes (‘fnlwgt’ and ‘education-num’) which do not provide any significant information for pattern mining.
Number of rows: 32560
Data Head:
The data was pre-processed before frequent pattern mining. As noted above, the data contains two irrelevant attributes, fnlwgt and education-num, which were dropped from the dataset. Number of attributes after dropping irrelevant attributes = 13. The data also has missing values, represented by a question mark, so rows containing missing values were dropped. Number of rows after removing missing values = 30161. Finally, certain attributes are continuous numeric data and should be quantized into bins before applying the algorithms. Such attributes were converted to categorical data as follows:
• Age: 0–25 → Young; 25–40 → Middle-aged; 40 and above → Old
• Capital Gain: 0 → No Gain; more than 0 → Gain
• Capital Loss: 0 → No Loss; more than 0 → Loss
• Hours per Week: 0–5 → Less; 5–20 → Medium; 20–60 → Reasonable; 60 and above → High
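As a minimal sketch, the quantization above can be done with pandas `pd.cut` (the values here are hypothetical stand-ins; the real file would be read with `pd.read_csv`):

```python
import pandas as pd

# Tiny stand-in for the Adult dataset (hypothetical values).
df = pd.DataFrame({
    "age": [22, 30, 45],
    "capital-gain": [0, 0, 5000],
    "hours-per-week": [10, 40, 70],
})

# Quantize continuous attributes into the bins described above.
df["age"] = pd.cut(df["age"], bins=[0, 25, 40, 200],
                   labels=["Young", "Middle-aged", "Old"])
df["capital-gain"] = df["capital-gain"].apply(
    lambda g: "Gain" if g > 0 else "No Gain")
df["hours-per-week"] = pd.cut(df["hours-per-week"], bins=[0, 5, 20, 60, 200],
                              labels=["Less", "Medium", "Reasonable", "High"])
print(df["age"].tolist())  # ['Young', 'Middle-aged', 'Old']
```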
Data head after pre-processing:
The Apriori algorithm uses prior knowledge of frequent itemsets to generate frequent patterns in a dataset. It mainly consists of two steps: join and prune. In the join step, the candidate k-itemsets Ck are generated by joining the frequent (k-1)-itemsets Lk-1 with themselves. In the prune step, every candidate in Ck whose (k-1)-subsets are not all present in Lk-1 is removed, and the dataset is then scanned to determine the support count of each remaining candidate.
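The join and prune steps can be sketched in plain Python (a toy illustration on hypothetical transactions, not the project's actual implementation):

```python
from itertools import combinations

# Hypothetical transactions: each pre-processed census row becomes a
# set of categorical items.
transactions = [
    {"Young", "Private", "No Gain"},
    {"Young", "Private", "Gain"},
    {"Old", "Private", "No Gain"},
    {"Young", "Self-emp", "No Gain"},
]
min_support = 2  # absolute support count

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets from one scan of the data
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Join step: unions of two (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the data to count support of the surviving candidates
        L = {c for c in candidates
             if sum(c <= t for t in transactions) >= min_support}
        frequent |= L
        k += 1
    return frequent

result = apriori(transactions, min_support)
```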
FP Growth adopts a divide-and-conquer strategy to mine frequent patterns. It first compresses the transactions into an FP-tree, which represents the frequent itemsets, and then recursively mines this tree to generate the frequent patterns without candidate generation.
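A compact sketch of the FP-tree construction and the recursive mining of conditional pattern bases (toy code, not the project's implementation; the frequency ordering and tie-breaking are assumptions):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def fp_growth(transactions, min_support):
    """Mine frequent itemsets (with supports) via an FP-tree."""
    # First scan: count items and keep the frequent ones
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent_items = {i for i, c in counts.items() if c >= min_support}
    if not frequent_items:
        return {}
    # Second scan: build the FP-tree, inserting items by descending frequency
    root = Node(None, None)
    header = defaultdict(list)          # item -> tree nodes holding it
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent_items),
                           key=lambda i: (-counts[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    # Divide and conquer: mine each item's conditional pattern base
    result = {}
    for item in sorted(frequent_items, key=lambda i: (counts[i], i)):
        result[frozenset([item])] = sum(n.count for n in header[item])
        cond_base = []                  # prefix paths, weighted by count
        for n in header[item]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * n.count)
        for subset, sup in fp_growth(cond_base, min_support).items():
            result[subset | {item}] = sup
    return result

patterns = fp_growth([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2)
```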
The sampling improvement to the Apriori algorithm computes frequent patterns on a sample of the dataset instead of scanning the entire dataset repeatedly; the full dataset is scanned only once, when the sample is generated. Since the algorithm runs on a reduced dataset, computation time drops drastically. With an optimal sampling factor and support count, all global frequent patterns can still be generated, giving high efficiency as well as accuracy. Most of the time, however, sampling trades some accuracy for efficiency, since the algorithm sees only a subset of the larger dataset. To increase accuracy, a threshold lower than the minimum support is often used on the sample. This improvement is most beneficial in applications where efficiency is critical.
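The sampling idea can be sketched as follows. This is a toy stand-in: `frequent_items` mines only 1-itemsets in place of full Apriori, and the factor 0.9 used to lower the support threshold is an arbitrary assumption.

```python
import random

def frequent_items(transactions, min_support):
    """Plain one-pass frequent-item counting (stand-in for full Apriori)."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= min_support}

def sampled_frequent_items(transactions, support_pct, sampling_factor, seed=0):
    # Single scan of the full data: draw the sample
    random.seed(seed)
    n = max(1, int(len(transactions) * sampling_factor))
    sample = random.sample(transactions, n)
    # Lower the threshold (here to 90% of nominal) to reduce the chance
    # of missing globally frequent patterns in the sample
    min_support = int(n * support_pct * 0.9)
    return frequent_items(sample, min_support)

# Skewed toy data: "a" is globally frequent, "b" is not
data = [["a"]] * 80 + [["b"]] * 20
approx = sampled_frequent_items(data, support_pct=0.5, sampling_factor=0.5)
```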
A transaction that does not contain any frequent itemset is useless in subsequent scans, so its infrequent items can be removed, shrinking the transaction (and eliminating it entirely when nothing remains). This reduces the computational cost and improves the overall efficiency of the Apriori algorithm.
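A sketch of this transaction-reduction step (hypothetical helper; assumes the frequent 1-itemsets from the previous pass are known):

```python
def reduce_transactions(transactions, frequent_items):
    """Drop infrequent items from each transaction, and discard any
    transaction left too short to contain a frequent 2-itemset."""
    reduced = []
    for t in transactions:
        kept = {i for i in t if i in frequent_items}
        if len(kept) >= 2:   # shorter transactions cannot help later passes
            reduced.append(kept)
    return reduced

reduced = reduce_transactions(
    [{"a", "b"}, {"a", "x"}, {"x", "y"}],  # toy transactions
    {"a", "b"},                            # frequent items from pass 1
)
```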
Execution time comparison:

| Support (%) | Apriori | Sampled (SF = 0.3) | Sampled (SF = 0.4) | Sampled (SF = 0.5) | Sampled (SF = 0.6) | FP Growth |
|---|---|---|---|---|---|---|
| 80 | 47.006 | 1.508 | 2.008 | 2.518 | 3.026 | 31.290 |
| 70 | 115.461 | 1.525 | 2.021 | 39.082 | 42.076 | 43.117 |
| 60 | 429.883 | 1.513 | 2.073 | 97.932 | 98.143 | 45.932 |
| 50 | 833.279 | 95.084 | 97.54 | 549.787 | 748.642 | 47.364 |
| 40 | 1294.586 | 1013.177 | 1033.06 | 1041.036 | 1162.078 | 68.771 |
• It was observed that FP Growth was the fastest of the algorithms that scan the full dataset, and the fastest overall at lower support thresholds.
• It was also observed that the sampling variation improves the efficiency of the Apriori algorithm dramatically: its execution time is far lower than the original algorithm's at every support threshold.
• From the outputs shown in previous sections, it was observed that the sampling variation misses a few of the global frequent patterns, so its accuracy is lower than that of the original Apriori algorithm. Sampling should therefore be used where time is critical or where patterns must be mined frequently; accuracy is traded for efficiency.
• It was also observed that as the sampling factor increases, the time taken by the algorithm increases as well, since a larger sample must be scanned.