- While in many cases Apriori has good performance, it can still suffer from nontrivial costs. What are those costs?
There are two nontrivial costs. First, Apriori may need to generate an enormous number of candidate sets: the more frequent 1-itemsets there are, the more candidate 2-itemsets it must generate and test (10^4 frequent 1-itemsets already produce roughly 5×10^7 candidate 2-itemsets). Second, it must repeatedly scan the whole database, once for each candidate length, checking every candidate against every transaction by pattern matching. These repeated scans can be very costly and hurt efficiency.
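As a rough illustration of both costs, here is a minimal, non-optimized Apriori sketch in Python (the toy database and all names are invented for this example): the candidate-join line shows the combinatorial blowup, and the counting line shows the one-full-scan-per-level pattern matching.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Toy Apriori: illustrates candidate generation and repeated scans."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: find the frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup}
    frequent, k = set(level), 2
    while level:
        # Cost 1: candidate generation -- joins every compatible pair, so
        # many frequent 1-itemsets yield O(n^2) candidate 2-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Cost 2: one full database scan per level, matching every
        # candidate against every transaction.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_sup}
        frequent |= level
        k += 1
    return frequent

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
print(sorted(sorted(s) for s in apriori(db, min_sup=2)))
```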
- What is the essence of the frequent pattern growth, or FP-growth method?
It uses a divide-and-conquer strategy to tackle the efficiency problems of the Apriori method. First, it compresses the database into a frequent pattern tree (FP-tree), which retains the itemset association information. It then divides this compressed database into a set of conditional databases, each associated with one frequent item, and mines each of them separately. For each frequent-item fragment, only its associated data sets are examined, so no candidate generation or repeated full database scans are needed.
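As a sketch of the first half of the method, here is a toy FP-tree construction in Python (the class and function names are our own; the recursive mining of the conditional databases is only indicated in the final comment):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    """Builds the compressed FP-tree plus a header table of node links."""
    # First scan: global item counts; discard infrequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root, header = Node(None, None), {}
    # Second scan: insert each transaction with its items in descending
    # frequency order, so shared prefixes collapse into shared tree paths.
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

db = [['a','b'], ['b','c','d'], ['a','c','d','e'], ['a','d','e'], ['a','b','c']]
root, header = build_fp_tree(db, min_sup=3)
# Each item's support equals the sum of counts along its node links.
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# FP-growth would now mine each item's conditional (projected) database,
# built from the prefix paths reachable via header[item], independently.
```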
- What is the difference between the horizontal data format and the vertical data format?
Consider, for example, a database of purchases and items. In the horizontal data format, each record is [purchaseID : itemset], where purchaseID is the key and itemset is the set of items bought in that purchase. In the vertical data format, each record is [item : purchaseID_set], where item is the key and purchaseID_set is the set of all purchases that include that item.
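A small Python sketch of the two layouts (the purchase IDs and items are invented), showing the conversion from one to the other and the vertical format's main payoff, namely that support counting becomes TID-set intersection:

```python
from collections import defaultdict

# Horizontal format: one row per purchase, keyed by purchaseID.
horizontal = {
    'T100': {'milk', 'bread'},
    'T200': {'bread', 'butter'},
    'T300': {'milk', 'bread', 'butter'},
}

# Vertical format: one row per item, mapping to its purchaseID_set.
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

print(dict(vertical))
# In the vertical format, the support of an itemset is just the size of
# the intersection of its items' TID sets, e.g. {milk, bread}:
print(len(vertical['milk'] & vertical['bread']))  # -> 2
```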
- Why, in practice, is it more desirable in most cases to mine the set of closed frequent itemsets rather than the set of all frequent itemsets?
Mining the set of all frequent itemsets over a large collection of items can generate an exponential number of itemsets (a single frequent itemset of length 100 implies 2^100 − 1 frequent sub-itemsets), which quickly becomes extremely expensive. The set of closed frequent itemsets is usually far smaller, yet it retains the complete support information: every frequent itemset and its support count can be derived from the closed ones. For this reason, mining the set of closed frequent itemsets is preferred.
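A brute-force sketch in Python makes the gap concrete (the database and function names are made up, and this enumeration is itself exponential, so it is only meant for tiny examples):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of every frequent itemset (exponential)."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            sup = sum(set(c) <= t for t in transactions)
            if sup >= min_sup:
                freq[frozenset(c)] = sup
    return freq

def closed_only(freq):
    """Keep an itemset only if no proper superset has the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == t for y, t in freq.items())}

# Two transactions sharing a long prefix: all 2^8 - 1 = 255 nonempty
# subsets of the first transaction are frequent, but only two itemsets
# are closed, and they carry the same support information.
db = [set('abcdefgh'), set('abcd')]
freq = frequent_itemsets(db, min_sup=1)
print(len(freq), len(closed_only(freq)))  # -> 255 2
```

Here 255 frequent itemsets collapse to 2 closed ones, and any frequent itemset's support can be recovered as the support of its smallest closed superset.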
- What methods can be used to mine closed frequent itemsets?
Item merging: if every transaction containing a frequent itemset X also contains an itemset Y, but no proper superset of Y, then X ∪ Y forms a frequent closed itemset, and there is no need to search for any itemset that contains X but not Y.
Sub-itemset pruning: if a frequent itemset X is a proper subset of an already discovered frequent closed itemset Y, and the support counts of X and Y are equal, then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and can be pruned (see the sketch after these methods).
Item skipping: when doing depth-first mining of closed itemsets, each level has a prefix itemset X associated with a header table and a projected database. If a locally frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at the higher levels.
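As a concrete illustration of the second method, here is a minimal sub-itemset pruning check in Python (the itemsets and support values are invented):

```python
def can_prune(X, sup_X, closed_found):
    """Sub-itemset pruning check: if X is a proper subset of an already
    discovered closed itemset Y with the same support, X and all of X's
    descendants in the set enumeration tree can be skipped."""
    return any(X < Y and sup_X == sup_Y for Y, sup_Y in closed_found.items())

# Suppose the miner has already discovered the closed itemset {a, c, d}
# with support 3 (toy values):
closed_found = {frozenset('acd'): 3}
print(can_prune(frozenset('ac'), 3, closed_found))  # True: skip the {a,c} subtree
print(can_prune(frozenset('ac'), 5, closed_found))  # False: supports differ
```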