Learning Objectives
- Understand what data matrix, transactional data, sequential data, sequence data, and time series data are
- Distinguish attribute types based on four properties: distinctness, order, addition, and multiplication
- Data quality problems: outliers and noise, missing data, duplicates
- Data preprocessing tasks:
  - Data reduction by sampling
  - Dimension reduction via PCA
  - Feature selection by specific filtering algorithms
  - Feature creation by feature extraction, mapping to a new space, and feature construction
  - Normalization
  - Discretization by equal width, equal depth, or clustering
- Proximity measures (distance and similarity)
  - Similarity: Jaccard measure, cosine measure, Pearson's correlation
  - Distance: Euclidean distance, Mahalanobis distance
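- Other problems you may encounter, for example:
  - Too much or too little data
  - More attributes than the algorithm can handle
  - Regression accepts only numeric attributes as input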
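
As a rough illustration of the normalization and discretization items above, here is a minimal sketch in plain Python. The helper names (`min_max`, `z_score`, `equal_width_bins`, `equal_depth_bins`) are made up for this note and do not come from any library. The outline does not say which normalization method is meant, so min-max and z-score scaling are assumed here as two common choices.

```python
import math


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute maps to 0.0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_depth_bins(values, k):
    """Assign each value to one of k bins holding roughly the same count."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


data = [1, 2, 3, 4, 100, 200]
print(equal_width_bins(data, 3))
print(equal_depth_bins(data, 3))
```

With skewed data like `data`, equal-width binning puts most values into the first bin, while equal-depth binning spreads them evenly. That contrast is the usual reason to prefer one over the other.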
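
The proximity measures in the list can also be sketched in a few lines of plain Python. The function names are hypothetical, and the inputs are assumed to be equal-length numeric vectors (sets for Jaccard) that are non-zero and non-constant. Mahalanobis distance is left out because it also needs the inverse covariance matrix of the data set.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson's correlation: the cosine measure applied to mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [p - mean_x for p in x]
    centered_y = [q - mean_y for q in y]
    return cosine(centered_x, centered_y)


def euclidean(x, y):
    """Euclidean (straight-line) distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared items out of 4 distinct: 0.5
print(euclidean([0, 0], [3, 4]))      # 3-4-5 triangle: 5.0
```

Similarities grow as objects become more alike, while distances shrink, so the two kinds of measure should not be mixed without converting one into the other.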