Basket Analysis – Rattle

Basket Analysis

The simplest association analysis is often referred to as market basket analysis. Within Rattle this is enabled when the Baskets button is checked. In this case, the data is thought of as representing shopping baskets (or any other type of collection of items, such as a basket of medical tests, a basket of medicines prescribed to a patient, a basket of stocks held by an investor, and so on). Each basket has a unique identifier, and the variable specified as an Ident variable in the Data tab is taken as the identifier of a shopping basket. The contents of the basket are then the items contained in the column of data identified as the target variable. For market basket analysis, these are the only two variables used.

To illustrate market basket analysis with Rattle, we will use a very simple dataset consisting of the DVD movies purchased by customers. Suppose the data is stored in the file dvdtrans.csv and consists of the following:

ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Gladiator
10,Green Mile
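
As a rough sketch of how this two-column layout maps onto baskets, the listing can be read in base R and grouped by ID. The data is embedded as text so the example runs without dvdtrans.csv on disk. The object names (dvd_csv, dvd, baskets, item_support) are our own and are not objects that Rattle creates.

```r
# The ten baskets from the listing, embedded as text.
# (Basket 10's third item is spelled Gladiator here.)
dvd_csv <- "ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Gladiator
10,Green Mile"

dvd <- read.csv(text = dvd_csv, stringsAsFactors = FALSE)

# ID plays the role of the Ident variable, Item the role of the target:
# one character vector of items per basket.
baskets <- split(dvd$Item, dvd$ID)

length(baskets)   # 10 baskets
baskets[["3"]]    # "LOTR1" "LOTR2"

# Fraction of baskets containing each item (no item repeats within a basket).
item_support <- sort(c(table(dvd$Item)) / length(baskets), decreasing = TRUE)
item_support[["Patriot"]]   # 0.6
```

With the real file, the same result should follow from read.csv("dvdtrans.csv") in place of the embedded text.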

We load this data into Rattle and choose the appropriate variable roles. In this case it is quite simple:

[Screenshot: rattle-dvd-variables]

On the Associate tab (of the Unsupervised paradigm) ensure the Baskets check box is checked. Click the Execute button to identify the associations:

[Screenshot: rattle-dvd-associate-top]

Here we see a summary of the associations found. There were 38 association rules that met the criteria of having a minimum support of 0.1 and a minimum confidence of 0.1. Of these, 9 were of length 1 (i.e., a single item that has occurred frequently enough in the data), 20 were of length 2 and another 9 of length 3. Across the rules the support ranges from 0.11 up to 0.56. Confidence ranges from 0.11 up to 1.0, and lift from 0.9 up to 9.0.
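
To make the three measures concrete, here is a small worked example in base R for one rule, {Patriot} => {Sixth Sense}. It uses the standard definitions: support is the fraction of baskets containing every item in the rule, confidence is that support divided by the support of the left-hand side, and lift is the confidence divided by the support of the right-hand side. The helper support() and the variable names are our own. The figures are computed from the ten baskets as listed, so they are not meant to reproduce the ranges in Rattle's summary.

```r
# The ten baskets from the listing, one character vector per basket.
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "LOTR", "Gladiator", "Green Mile")
)

# Hypothetical helper: fraction of baskets that contain all the given items.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "Patriot"
rhs <- "Sixth Sense"

rule_support    <- support(c(lhs, rhs))            # 4 of 10 baskets: 0.4
rule_confidence <- rule_support / support(lhs)     # 0.4 / 0.6, about 0.667
rule_lift       <- rule_confidence / support(rhs)  # 0.667 / 0.6, about 1.11

round(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift), 3)
```

A lift slightly above 1 says that customers who bought Patriot were a little more likely than the average customer to buy Sixth Sense.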

The lower part of the same textview contains information about the running of the algorithm:

[Screenshot: rattle-dvd-associate-bot]

We can see the settings used, noting that Rattle provides access to only a subset of them (support and confidence). The output includes timing information for the various phases of the algorithm. For such a small dataset, the times are of course essentially 0!
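
Outside the GUI, a similar run can be sketched directly from the R prompt with the arules package. This assumes arules is installed, and it assumes (the text above does not say so) that Rattle's Associate tab is a front end to the apriori algorithm in arules. The thresholds mirror the 0.1 support and 0.1 confidence mentioned earlier. The object names are our own.

```r
library(arules)

# The ten baskets from the listing, one character vector per basket.
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "LOTR", "Gladiator", "Green Mile")
)

# Convert the list of baskets into an arules transactions object.
# With the file on disk, read.transactions("dvdtrans.csv", format = "single",
# sep = ",", header = TRUE, cols = c("ID", "Item")) is an alternative.
trans <- as(baskets, "transactions")

# Mine rules with minimum support and confidence both set to 0.1.
rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.1))

summary(rules)                               # rule counts by length, measure ranges
inspect(head(sort(rules, by = "lift"), 5))   # the five rules with highest lift
```

The rule counts and ranges printed here may differ from those in the screenshots, depending on the exact data file and the package versions used.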

Installing R, RStudio, and Rattle for Windows

Installing R

  1. Download R 3.2.0 for Windows.
  2. Run the installer – accepting all default options.

Installing RStudio

  1. Download RStudio 0.98.1103 for Windows.
  2. Run the installer.

Installing Rattle

  1. Run R.
  2. To install Rattle 3.4.1 and all the other packages that Rattle uses in one go, enter install.packages("rattle", dep=c("Suggests")) at the > command prompt. This is a rather long install!
  3. To run Rattle, enter
  • > library(rattle)
    > rattle()
  • If the RGtk2 package has not yet been installed, an error popup will report that libatk-1.0-0.dll is missing. Click OK; you will then be asked whether you would like to install GTK+. Click OK again to download the appropriate GTK+ libraries for your computer. When this is done, exit and restart R so that it can find the newly installed libraries. (Note: GTK+ has been configured to use the Microsoft Windows theme engine, so Rattle will look like other Windows applications in terms of colour and style.)
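
Before launching Rattle, it can help to check which of the packages mentioned above are already present. The following is a minimal sketch; missing_packages() is a hypothetical helper of our own, not part of R or Rattle. It only looks at the installed-package list, so it does not load RGtk2 or trigger the GTK+ prompt.

```r
# Hypothetical helper: return the names in pkgs that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("rattle", "RGtk2")
todo <- missing_packages(needed)

if (length(todo) > 0) {
  message("Not installed yet: ", paste(todo, collapse = ", "))
  # install.packages(todo) would fetch them from CRAN.
} else {
  message("All required packages are installed.")
}
```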

Resources

R

R Homepage
The Comprehensive R Archive Network (CRAN)
R-Windows FAQS
R Manuals
Contributed Documentation
The R Journal

If you have a specific R related question, here are some quick resources to help you find an answer:

  • RStudio’s online training guide – http://www.rstudio.com/training/online.html
    RStudio has curated links and sources for learning R and its extensions, including links to online and in-person courses for structured learning.
  • RSeek meta search engine – http://www.rseek.org/
    The RSeek meta search engine provides a unified interface for searching the various sources of online R information. If an answer to your question is already available online, RSeek can help you locate it.
  • Stack Overflow – http://stackoverflow.com/questions/tagged/r
    The R tag on Stack Overflow is becoming an increasingly important resource for seeking answers to R-related questions. You can search the R tag in general, or refine your search to another tag such as ggplot2 or sweave.
  • R-help mailing list – https://stat.ethz.ch/mailman/listinfo/r-help
    R-help list archives
    The R-help mailing list is a very active list with questions and answers about problems and solutions using R. Before posting to the list, it is recommended to search the list archives to see if an answer already exists.
  • CrossValidated Q&A community – http://stats.stackexchange.com/
    For more statistics-related questions, the CrossValidated Q&A community is a great resource with lots of R users active on the site.

RStudio

If you are experiencing difficulties using RStudio, the following articles describe how to troubleshoot common problems:

RStudio Will Not Start
RStudio Crashed
Problem Installing Packages
Problem with Plots or Graphics Device
Problems with TeX or Sweave
R Code is Not Working

Racine, JS – RStudio – A platform-independent IDE for R and Sweave – 2012

R Commander vs. RStudio

Wilson, J – Statistical computing with R – Selecting the right tool for the job – R Commander or something else – 2012

Rattle

Rattle: A Graphical User Interface for Data Mining using R