Online Association Rule Mining

Background

Mining for association rules is a form of data mining.The prototypical example is based on a list of purchases in a store. An association rule for this list is a rule such as ``65% of all customers who buy pasta and tomato sauce also buy parmesan cheese and red wine" along with a statement such as "the product combination pasta, tomato, parmesan chees, red wine occurs in 0.7% of all purchases". The "65%" is called the confidence and the "0.7%" the support of the rule. Discovering such customer buying patterns is useful for customer segmentation, cross-marketing, catalog design and product placement.

Finding association rules is a very CPU and memory intensive task, since the list of purchases as well as the number of different products maybe huge. Most algorithms are related to the Apriori algorithm due to Agrawal & Srikant, c.f. [AS94]. All traditional algorithms operate offline: given minimal values for the support and the confidence the algorithms scan and rescan the list of purchases, often several times, and eventually produce all association rules. However in general, the user does not know the appropriate thresholds in advance. An inappropriate choice yields, after a long wait, either too many or too few association rules.

Carma

Inspired by online aggregation, we designed a 2-scan algorithm called Carma (Continuous Association Rule Mining Algorithm). Carma brings association rule mining online, in that it gives continuous feedback, is user controllable and yields deterministic and accurate results:

continuous feedback: It continuously produces association rules, while the list of purchases is scanned. For each rule it maintains shrinking, deterministic intervals on its support and confidence.

user controllable: During the first scan the user is free to change the support and confidence thresholds "on the fly". In combination with the continuous feedback, the user can therefore determine the "right" thresholds interactively.

deterministic and accurate results: Carma guarantees that it produces all association rules after at most 2 scans and for each rule its precise support and confidence value.

An interesting feature of Carma is that the second scan of the list is not needed, whenerver the shrinking, deterministic intervals on the produced rules support and confidence suffice. Thus Carma can be used to continuously produce association rules from a list read from a network. For a detailed description of Carma see [Hid99].

Performance

To compare the performance of Carma to traditional algorithms we implemented Carma along with Apriori and the DIC algorithm due to Brin et al, c.f. [BMUT97]. While not being faster in general Carma outperformed Apriori and DIC on low support thresholds.
Moreover, Carma is typically by an order of magnitude more memory efficient and readily computes association rules in cases which are intractable for Apriori or DIC.

References

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conf. on Very Large Databases, Santiago, Chile, Sept. 1994.

[BMUT97] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Manag ement of Data, volume 26,2 of SIGMOD Record, pages 255-264, New York, May 13th-15th 1997. ACM Press.

[Hid99] C. Hidber Online Association Rule Mining. In Proceedings of the ACM SIGMOD International Conference on Manag ement of Data, 1999 (to appear).