- Paik, M – A graphic representation of a three-way contingency table – Simpson’s Paradox and correlation – 1985
- Ghilagaber, G – On the problem of identification in multiplicative intensity-rate models with multiple interactions – 1999
- Wainer, H and Brown, LM – Two statistical paradoxes in the interpretation of group differences – 2004
- Beh, EJ – Simple correspondence analysis of nominal-ordinal contingency tables – 2008
- Fiedler, K – The ultimate sampling dilemma in experience-based decision making – 2008
The simplest association analysis is often referred to as market basket analysis. Within Rattle this is enabled when the Baskets check box is checked. In this case, the data is treated as a set of shopping baskets (or any other type of collection of items, such as a basket of medical tests, a basket of medicines prescribed to a patient, or a basket of stocks held by an investor). Each basket has a unique identifier: the variable given the Ident role in the Data tab is taken as the identifier of a shopping basket. The contents of the basket are then the items listed in the column of data identified as the Target variable. For market basket analysis, these are the only two variables used.
To illustrate market basket analysis with Rattle, we will use a very simple dataset consisting of the DVD movies purchased by customers. Suppose the data is stored in the file dvdtrans.csv and consists of the following:
ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile
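Outside Rattle, the same two-column layout can be turned into baskets with a few lines of base R. This is only a sketch: the listing is embedded as text so that it runs without the dvdtrans.csv file, and the names csv_text, dvd and baskets are chosen here for illustration.

```r
# The transactions from the listing, embedded so the sketch is self-contained.
# The item spellings are copied exactly as they appear in the data.
csv_text <- "ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile"

# One row per (basket identifier, item) pair.
dvd <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Group the Item column by ID: one character vector of items per basket.
baskets <- split(dvd$Item, dvd$ID)

length(baskets)   # number of baskets
baskets[["3"]]    # items in basket 3
```

With the real file on disk, read.csv("dvdtrans.csv") would take the place of the embedded text.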
We load this data into Rattle and choose the appropriate variable roles. In this case it is quite simple: ID is the Ident variable and Item is the Target variable.
On the Associate tab (of the Unsupervised paradigm) ensure the Baskets check box is checked. Click the Execute button to identify the associations:
Here we see a summary of the associations found. There were 38 association rules that met the criteria of having a minimum support of 0.1 and a minimum confidence of 0.1. Of these, 9 were of length 1 (i.e., a single item that has occurred frequently enough in the data), 20 were of length 2 and another 9 of length 3. Across the rules the support ranges from 0.11 up to 0.56. Confidence ranges from 0.11 up to 1.0, and lift from 0.9 up to 9.0.
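The three measures quoted in the summary can be made concrete with a short base-R sketch. It uses the standard definitions (support is the share of baskets containing all items of a rule, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side). The helper names support and rule_measures are hypothetical, and only baskets 2, 3, 4, 7 and 8 from the listing are used, so the numbers will not match the Rattle summary.

```r
# Baskets 2, 3, 4, 7 and 8 from the DVD listing (a subset, for illustration only).
baskets <- list(
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot")
)

# Share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift for the rule lhs => rhs.
rule_measures <- function(lhs, rhs, baskets) {
  s    <- support(c(lhs, rhs), baskets)
  conf <- s / support(lhs, baskets)
  lift <- conf / support(rhs, baskets)
  c(support = s, confidence = conf, lift = lift)
}

m <- rule_measures("Gladiator", "Patriot", baskets)
print(round(m, 3))
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.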
The lower part of the same textview contains information about the running of the algorithm:
We can see the parameter settings used, noting that Rattle only provides access to a subset of the settings (support and confidence). The output includes timing information for the various phases of the algorithm. For such a small dataset, the times are of course essentially 0!
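For access to settings beyond support and confidence, the rules can be mined directly from the R console. The sketch below assumes the arules package and its apriori() function, which to our understanding is what Rattle calls behind the scenes; the minlen and maxlen values and the five-basket subset of the listing are illustrative choices, not Rattle's own defaults.

```r
rules <- NULL  # stays NULL if arules is not installed

if (requireNamespace("arules", quietly = TRUE)) {
  suppressPackageStartupMessages(library(arules))

  # Baskets 2, 3, 4, 7 and 8 from the DVD listing (a subset, for illustration).
  baskets <- list(
    c("Gladiator", "Patriot", "Braveheart"),
    c("LOTR1", "LOTR2"),
    c("Gladiator", "Patriot", "Sixth Sense"),
    c("Harry Potter1", "Harry Potter2"),
    c("Gladiator", "Patriot")
  )

  # Convert the list of item vectors to the arules transactions class.
  trans <- as(baskets, "transactions")

  # Same thresholds as in the text, plus explicit rule-length limits.
  rules <- apriori(
    trans,
    parameter = list(support = 0.1, confidence = 0.1, minlen = 1, maxlen = 3),
    control   = list(verbose = FALSE)
  )

  # Show the five rules with the highest lift.
  inspect(head(sort(rules, by = "lift"), 5))
}
```

Setting minlen = 2 would drop the single-item rules (those with an empty left-hand side) from the result.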
- Download R 3.2.0 for Windows.
- Run the installer – accepting all default options.
- Download RStudio 0.98.1103 for Windows.
- Run the installer.
- Run R.
- To install Rattle 3.4.1 and all other packages that Rattle uses at once – enter install.packages("rattle", dep=c("Suggests")) at the command prompt > – (this is a rather long install!)
- To run Rattle, enter
> library(rattle)
> rattle()
- The RGtk2 package has yet to be installed, so you will get an error popup indicating that libatk-1.0-0.dll is missing. Click OK, and you will then be asked whether you would like to install GTK+. Click OK to download the appropriate GTK+ libraries for your computer. When this is done, exit and restart R so that it can find the newly installed libraries. (Note: GTK+ has been configured to use the Microsoft Windows theme engine, so Rattle will look like other Windows applications in terms of colour and style.)
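After restarting R, it can be useful to confirm which of the packages named in the steps above are actually present before launching Rattle again. The following is a minimal sketch; is_installed is a hypothetical helper defined here, and it only checks the library paths without loading anything.

```r
# TRUE if the package can be found in the current library paths.
# system.file() returns "" for a package that is not installed.
is_installed <- function(pkg) nzchar(system.file(package = pkg))

pkgs   <- c("rattle", "RGtk2")
status <- vapply(pkgs, is_installed, logical(1))
print(status)

# List anything still missing so it can be installed by hand.
missing_pkgs <- pkgs[!status]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
}
```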
If you have a specific R related question, here are some quick resources to help you find an answer:
- RStudio’s online training guide – http://www.rstudio.com/training/online.html
RStudio has curated links and sources for learning R and its extensions, including links to online and in-person courses for structured learning.
- RSeek meta search engine – http://www.rseek.org/
The RSeek meta search engine provides a unified interface for searching the various sources of online R information. If an answer to your question is already available online, RSeek can help you locate it.
- Stack Overflow – http://stackoverflow.com/questions/tagged/r
The R tag on Stack Overflow is an increasingly important resource for seeking answers to R-related questions. You can search the R tag in general, or narrow your search with an additional tag such as ggplot2 or sweave.
- R-help mailing list – https://stat.ethz.ch/mailman/listinfo/r-help
R-help list archives –
The R-help mailing list is a very active list with questions and answers about problems and solutions using R. Before posting to the list, it is recommended to search the list archives to see if an answer already exists.
- CrossValidated Q&A community – http://stats.stackexchange.com/
For more statistics related questions, the CrossValidated Q&A community is a great resource with lots of R users active on the site.
If you are experiencing difficulties using RStudio, the following articles describe how to troubleshoot common problems:

- Racine, JS – RStudio – A platform-independent IDE for R and Sweave – 2012
R Commander vs. RStudio
- Wilson, J – Statistical computing with R – Selecting the right tool for the job – R Commander or something else – 2012