Implement and compare the Apriori and FP Growth frequent pattern mining algorithms. Introduce improvements to the Apriori algorithm and compare the results with the original algorithm.
Dataset used: UCI Adult Census Dataset (Source: https://archive.ics.uci.edu/ml/datasets/Adult)
Attributes: Total number of attributes – 15
• Age: continuous numeric data.
• Work-class: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• Fnlwgt: continuous data.
• Education: Level of education. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• Education-num: continuous data.
• Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Marriedspouse-absent, Married-AF-spouse.
• Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Privhouse-serv, Protective-serv, Armed-Forces.
• Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
• Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• Sex: Female, Male.
• Capital-gain: continuous data.
• Capital-loss: continuous data.
• Hours-per-week: continuous data.
• Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, OutlyingUS(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, ElSalvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
• Income: income categorized as greater than 50K or less than or equal to 50K (>50K, <=50K).
Missing Values – Yes, represented by “?”.
Irrelevant Attributes – Yes, there are two attributes (‘fnlwgt’ and ‘education-num’) which do not provide any significant information for pattern mining.
Number of rows: 32560
Data Head:
The data was pre-processed before frequent pattern mining. As noted above, the data contains two irrelevant attributes, fnlwgt and education-num, which were dropped from the dataset. Number of attributes after dropping irrelevant attributes = 13. The data also has missing values, represented by a question mark, so rows containing missing values were dropped. Number of rows after removing missing values = 30161. Finally, certain attributes are continuous numeric data and should be quantized into bins before applying the algorithms. Such attributes were converted to categorical data as follows:
• Age: 0–25 → Young; 25–40 → Middle-aged; 40 and above → Old
• Capital Gain: 0 → No Gain; more than 0 → Gain
• Capital Loss: 0 → No Loss; more than 0 → Loss
• Hours per Week: 0–5 → Less; 5–20 → Medium; 20–60 → Reasonable; 60 and above → High
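As a minimal sketch, the quantization above can be done with pandas `pd.cut` (the values here are hypothetical stand-ins; the real file would be read with `pd.read_csv`):

```python
import pandas as pd

# Tiny stand-in for the Adult dataset (hypothetical values).
df = pd.DataFrame({
    "age": [22, 30, 45],
    "capital-gain": [0, 0, 5000],
    "hours-per-week": [10, 40, 70],
})

# Quantize continuous attributes into the bins described above.
df["age"] = pd.cut(df["age"], bins=[0, 25, 40, 200],
                   labels=["Young", "Middle-aged", "Old"])
df["capital-gain"] = df["capital-gain"].apply(
    lambda g: "Gain" if g > 0 else "No Gain")
df["hours-per-week"] = pd.cut(df["hours-per-week"], bins=[0, 5, 20, 60, 200],
                              labels=["Less", "Medium", "Reasonable", "High"])
print(df["age"].tolist())  # ['Young', 'Middle-aged', 'Old']
```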
Data head after pre-processing:
The Apriori algorithm uses prior knowledge of frequent itemsets to generate frequent patterns in a dataset. It mainly consists of two steps: join and prune. In the join step, the candidate k-itemsets Ck are generated by joining the frequent (k-1)-itemsets Lk-1 with themselves. In the prune step, every candidate in Ck whose (k-1)-subsets are not all present in Lk-1 is removed, and the dataset is then scanned to determine the support count of each remaining candidate.
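The join and prune steps can be sketched in plain Python (a toy illustration on hypothetical transactions, not the project's actual implementation):

```python
from itertools import combinations

# Hypothetical transactions: each pre-processed census row becomes a
# set of categorical items.
transactions = [
    {"Young", "Private", "No Gain"},
    {"Young", "Private", "Gain"},
    {"Old", "Private", "No Gain"},
    {"Young", "Self-emp", "No Gain"},
]
min_support = 2  # absolute support count

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets from one scan of the data
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Join step: unions of two (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the data to count support of the surviving candidates
        L = {c for c in candidates
             if sum(c <= t for t in transactions) >= min_support}
        frequent |= L
        k += 1
    return frequent

result = apriori(transactions, min_support)
```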
FP Growth adopts a divide-and-conquer strategy to mine frequent patterns. It first compresses the transactions into an FP-tree, which represents the frequent itemsets, and then recursively mines this tree to generate the frequent patterns without candidate generation.
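A compact sketch of the FP-tree construction and the recursive mining of conditional pattern bases (toy code, not the project's implementation; the frequency ordering and tie-breaking are assumptions):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def fp_growth(transactions, min_support):
    """Mine frequent itemsets (with supports) via an FP-tree."""
    # First scan: count items and keep the frequent ones
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent_items = {i for i, c in counts.items() if c >= min_support}
    if not frequent_items:
        return {}
    # Second scan: build the FP-tree, inserting items by descending frequency
    root = Node(None, None)
    header = defaultdict(list)          # item -> tree nodes holding it
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent_items),
                           key=lambda i: (-counts[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    # Divide and conquer: mine each item's conditional pattern base
    result = {}
    for item in sorted(frequent_items, key=lambda i: (counts[i], i)):
        result[frozenset([item])] = sum(n.count for n in header[item])
        cond_base = []                  # prefix paths, weighted by count
        for n in header[item]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * n.count)
        for subset, sup in fp_growth(cond_base, min_support).items():
            result[subset | {item}] = sup
    return result

patterns = fp_growth([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2)
```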
The sampling improvement to the Apriori algorithm computes frequent patterns on a sample of the dataset instead of scanning the entire dataset repeatedly; the full dataset is scanned only once, when the sample is generated. Since the algorithm runs on a reduced dataset, computation time drops drastically. With an optimal sampling factor and support count, all global frequent patterns can still be generated, giving high efficiency as well as accuracy. Most of the time, however, sampling trades some accuracy for efficiency, since the algorithm sees only a subset of the larger dataset. To increase accuracy, a threshold lower than the minimum support is often used on the sample. This improvement is most beneficial in applications where efficiency is critical.
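The sampling idea can be sketched as follows. This is a toy stand-in: `frequent_items` mines only 1-itemsets in place of full Apriori, and the factor 0.9 used to lower the support threshold is an arbitrary assumption.

```python
import random

def frequent_items(transactions, min_support):
    """Plain one-pass frequent-item counting (stand-in for full Apriori)."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= min_support}

def sampled_frequent_items(transactions, support_pct, sampling_factor, seed=0):
    # Single scan of the full data: draw the sample
    random.seed(seed)
    n = max(1, int(len(transactions) * sampling_factor))
    sample = random.sample(transactions, n)
    # Lower the threshold (here to 90% of nominal) to reduce the chance
    # of missing globally frequent patterns in the sample
    min_support = int(n * support_pct * 0.9)
    return frequent_items(sample, min_support)

# Skewed toy data: "a" is globally frequent, "b" is not
data = [["a"]] * 80 + [["b"]] * 20
approx = sampled_frequent_items(data, support_pct=0.5, sampling_factor=0.5)
```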
A transaction that does not contain any frequent itemset is useless in subsequent scans, so its infrequent items can be removed, shrinking the transaction (and eliminating it entirely when nothing remains). This reduces the computational cost and improves the overall efficiency of the Apriori algorithm.
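A sketch of this transaction-reduction step (hypothetical helper; assumes the frequent 1-itemsets from the previous pass are known):

```python
def reduce_transactions(transactions, frequent_items):
    """Drop infrequent items from each transaction, and discard any
    transaction left too short to contain a frequent 2-itemset."""
    reduced = []
    for t in transactions:
        kept = {i for i in t if i in frequent_items}
        if len(kept) >= 2:   # shorter transactions cannot help later passes
            reduced.append(kept)
    return reduced

reduced = reduce_transactions(
    [{"a", "b"}, {"a", "x"}, {"x", "y"}],  # toy transactions
    {"a", "b"},                            # frequent items from pass 1
)
```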
Execution time comparison:

| Support (%) | Apriori | Sampled (SF = 0.3) | Sampled (SF = 0.4) | Sampled (SF = 0.5) | Sampled (SF = 0.6) | FP Growth |
|---|---|---|---|---|---|---|
| 80 | 47.006 | 1.508 | 2.008 | 2.518 | 3.026 | 31.290 |
| 70 | 115.461 | 1.525 | 2.021 | 39.082 | 42.076 | 43.117 |
| 60 | 429.883 | 1.513 | 2.073 | 97.932 | 98.143 | 45.932 |
| 50 | 833.279 | 95.084 | 97.54 | 549.787 | 748.642 | 47.364 |
| 40 | 1294.586 | 1013.177 | 1033.06 | 1041.036 | 1162.078 | 68.771 |
• It was observed that FP Growth was the fastest of the algorithms that scan the full dataset, and the fastest overall at lower support thresholds.
• It was also observed that the sampling variation improves the efficiency of the Apriori algorithm dramatically: its execution time is far lower than the original algorithm's at every support threshold.
• From the outputs shown in previous sections, it was observed that the sampling variation misses a few of the global frequent patterns, so its accuracy is lower than that of the original Apriori algorithm. Sampling should therefore be used where time is critical or where patterns must be mined frequently; accuracy is traded for efficiency.
• It was also observed that as the sampling factor increases, the time taken by the algorithm increases as well, since a larger sample must be scanned.