Online Association Rule Mining
Background
Mining for association rules is a form of data mining.The
prototypical example is based on a list of purchases in a store. An association
rule for this list is a rule such as ``65% of all customers who buy pasta
and tomato sauce also buy parmesan cheese and red wine" along with a statement
such as "the product combination pasta, tomato, parmesan chees, red wine
occurs in 0.7% of all purchases". The "65%" is called the confidence
and the "0.7%" the support of the rule. Discovering such customer
buying patterns is useful for customer segmentation, cross-marketing, catalog
design and product placement.
Finding association rules is a very CPU and memory intensive
task, since the list of purchases as well as the number of different products
maybe huge. Most algorithms are related to the Apriori algorithm due to
Agrawal & Srikant, c.f. [AS94]. All traditional
algorithms operate offline: given minimal values for the support and the
confidence the algorithms scan and rescan the list of purchases, often
several times, and eventually produce all association rules. However in
general, the user does not know the appropriate thresholds in advance.
An inappropriate choice yields, after a long wait, either too many or too
few association rules.
Carma
Inspired by online aggregation, we designed a 2-scan algorithm
called Carma (Continuous Association Rule Mining Algorithm). Carma brings
association rule mining online, in that it gives continuous feedback, is
user controllable and yields deterministic and accurate results:
-
continuous feedback: It continuously produces association
rules, while the list of purchases is scanned. For each rule it maintains
shrinking, deterministic intervals on its support and confidence.
-
user controllable: During the first scan the user
is free to change the support and confidence thresholds "on the fly". In
combination with the continuous feedback, the user can therefore determine
the "right" thresholds interactively.
-
deterministic and accurate results: Carma guarantees
that it produces all association rules after at most 2 scans and for each
rule its precise support and confidence value.
An interesting feature of Carma is that the second scan of
the list is not needed, whenerver the shrinking, deterministic intervals
on the produced rules support and confidence suffice. Thus Carma can be
used to continuously produce association rules from a list read from a
network. For a detailed description of Carma see [Hid99].
Performance
To compare the performance of Carma to traditional algorithms
we implemented Carma along with Apriori and the DIC algorithm due to Brin
et al, c.f. [BMUT97]. While not being faster in general
Carma outperformed Apriori and DIC on low support thresholds.
Moreover, Carma is typically by an order of magnitude
more memory efficient and readily computes association rules in cases which
are intractable for Apriori or DIC.
References
[AS94]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
In Proc. of the 20th Int'l Conf. on Very Large Databases, Santiago, Chile,
Sept. 1994.
[BMUT97]
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic
itemset counting and implication rules for market basket data. In Proceedings
of the ACM SIGMOD International Conference on Manag ement
of Data, volume 26,2 of SIGMOD Record, pages 255-264, New York, May 13th-15th
1997. ACM Press.
[Hid99]
C. Hidber Online Association Rule Mining. In Proceedings of the ACM SIGMOD
International Conference on Manag ement of Data, 1999
(to appear).