Blog for Data Mining Course @ NCU: Data Mining Chapter 6 Overview

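This chapter introduces another mining technique that is widely used in business: association mining. Association mining sets a support threshold to filter out unimportant itemsets, and a confidence threshold to keep only the association rules that are sufficiently reliable. The main research direction is how to find frequent itemsets and association rules efficiently. Frequent itemset generation in particular has attracted many papers that use special data structures or algorithms to speed things up, and the best-known algorithm is Apriori. Apriori exploits the anti-monotone property of support (every superset of an infrequent itemset is also infrequent) to avoid generating unnecessary candidate itemsets, which shrinks the search space. Another approach, the FP-Growth algorithm, does not enumerate candidate itemsets at all; it extracts frequent itemsets directly from an FP-tree. However, when the items differ widely from one transaction to the next, the transactions share few prefixes and the tree consumes a large amount of memory, so the choice of method should depend on the distribution of the data at hand.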
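The candidate pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's pseudocode: the function name `apriori`, the use of absolute support counts, and the brute-force join step are all assumptions made for brevity.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (anti-monotone property): a candidate survives only if
        # every one of its (k-1)-subsets is frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy market-basket data (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

On this toy data with `min_support=3`, the candidate {bread, diapers, beer} is discarded without a database scan because its subset {bread, beer} is infrequent. That is the saving the anti-monotone property buys.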

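In general, support and confidence alone still produce a large number of patterns, many of which are not what we want and may even mislead us, so another research topic is how to select the meaningful ones. Section 6.7 introduces many domain-independent measures, such as the Interest Factor and the IS measure, and classifies them by four properties: symmetric or asymmetric, the inversion property, the null addition property, and the scaling property. No single measure suits every situation, so we should understand the properties, strengths, and weaknesses of each measure in order to choose an appropriate one for selecting meaningful patterns.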
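To make the differences between measures concrete, the sketch below computes the Interest Factor, the IS measure, and confidence from a 2x2 contingency table, then adds transactions that contain neither item to check the null addition property. The function name `measures` and the counts are illustrative assumptions, not values taken from this post.

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Measures for items A and B from contingency counts.

    f11: both A and B, f10: A only, f01: B only, f00: neither.
    """
    n = f11 + f10 + f01 + f00
    s_ab = f11 / n            # support of {A, B}
    s_a = (f11 + f10) / n     # support of A
    s_b = (f11 + f01) / n     # support of B
    return {
        "interest": s_ab / (s_a * s_b),   # Interest Factor (lift), symmetric
        "IS": s_ab / sqrt(s_a * s_b),     # IS measure (cosine), symmetric
        "confidence": s_ab / s_a,         # confidence of A -> B, asymmetric
    }

base = measures(f11=150, f10=50, f01=650, f00=150)
# Same table with 1000 extra "null" transactions containing neither A nor B.
padded = measures(f11=150, f10=50, f01=650, f00=1150)
```

Here the rule A -> B has confidence 0.75 but an Interest Factor below 1, so a high confidence can coexist with a slightly negative association. Comparing `base` with `padded` shows that IS is unchanged by the null transactions while the Interest Factor is not, which is the kind of difference one would weigh when picking a measure.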

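The chapter closes with two problems that can arise in association mining: Simpson's Paradox and skewed support distribution. When Simpson's Paradox occurs, it means we have overlooked some factor, and the pattern we obtained misleads us as a result. Avoiding it may require enough domain knowledge to interpret the association rules, or further analysis of the data distribution underlying the rule in question. When the support distribution is skewed, the intuitive reaction is to lower the support threshold, but doing so not only increases computation time, it also produces a large number of cross-support patterns. Section 6.8 presents h-confidence, a measure with the anti-monotone property, to filter out cross-support patterns and cut down the flood of patterns caused by an overly low support threshold.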
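As a rough illustration of how h-confidence flags cross-support patterns, the sketch below uses the usual definition: the support of the itemset divided by the largest support of any single item in it. The helper names `support` and `h_confidence` and the toy transactions are assumptions made for this example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def h_confidence(itemset, transactions):
    """s(X) divided by the maximum single-item support within X.

    Equivalently, the lowest confidence of any rule {item} -> X minus {item}.
    """
    max_item_support = max(support([i], transactions) for i in itemset)
    return support(itemset, transactions) / max_item_support

# Toy data: "milk" is very common, "caviar" is rare (illustrative only).
transactions = (
    [{"milk", "bread"}] * 6
    + [{"milk"}, {"milk", "caviar"}, {"bread"}, {"eggs"}]
)

h_cross = h_confidence({"milk", "caviar"}, transactions)   # rare + common item
h_normal = h_confidence({"milk", "bread"}, transactions)   # two common items
```

In this data the rule caviar -> milk has confidence 1.0, yet the h-confidence of {milk, caviar} is only 0.125, so an h-confidence threshold of, say, 0.3 would drop it while keeping {milk, bread} at 0.75. The threshold value is a hypothetical choice for the example.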

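Learning Objectives

1. Understand the various methods for generating frequent itemsets
2. Understand the properties of the various measures
3. Understand what Simpson's Paradox, skewed support distribution, and cross-support patterns are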

Monday, May 7, 2007
