# A department store manager

Q1:
A department store manager (e.g. Carrefour, Loulou, etc.) heard about data mining and was excited about applying it to boost the store's sales. Initially, the store collects only transactions. Each transaction consists of the date of the transaction, the customer ID, and the IDs of all items bought (the same transaction can have more than one item).
As a data-mining expert, propose two data mining tasks to the store manager and illustrate how they are relevant and can benefit the department store.
After careful inspection, the manager believes better prediction can be achieved if the data is augmented with information about items and customers. So the dataset will have three tables: customers, items, and transactions. Each customer record consists of the customer ID, age, nationality, and occupation. Each item record consists of the item ID, the price of the item, and the item category.
We studied different types of datasets depending on the relationships between the objects in the dataset: datasets with independent objects, sequential datasets, spatial datasets, and graph datasets. Which type is the department store’s new dataset? Explain.
Solution
1 Many of the data mining tasks we studied can be applied to this domain, including:
rule discovery and pattern recognition: data mining can show the manager which items are usually bought together, which helps in optimizing discounts and sales. A suitable algorithm is Apriori.
prediction: by analyzing the customer data set with prediction techniques (such as k-nearest neighbors or decision trees), data mining can recommend a new product to a customer.
visualization: the data set can be viewed as a complex network connecting customers to items. Network visualization software, such as JUNG or NWB, can help in understanding such complex networks.
2 This is a graph dataset: every customer and every item becomes a node, a link between customer i and item j represents a transaction (customer i bought item j), and the link weight can represent the quantity purchased (the strength of the link).
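The first task in part 1, finding items that are usually bought together, can be sketched in a few lines. This is only a minimal illustration of the first two passes of an Apriori-style search (frequent single items, then frequent pairs); the item names, the transactions, and the `min_support` value are all made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of item IDs bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs bought together often enough."""
    n = len(transactions)
    # Pass 1: keep only items that are frequent on their own (Apriori pruning).
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count only pairs made of frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(t & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.4)
for pair, support in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, support)
```

The pruning in pass 1 is what makes the approach scale: a pair can only be frequent if both of its items are frequent.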
Q2:
A software company is trying to cut its expenses. One of its major expenses comes from developing prototypes for customers who are not serious and never commit to a full project after the prototype. The company maintains a dataset of customers. Each customer object has a list of variables specifying the size of the customer’s company, the annual company earnings, the age of the company, etc. Furthermore, each customer object has a (class) variable indicating whether the customer, after approving the prototype project, committed to a full project.
(a) What visualization techniques would you suggest? Briefly describe two of them.
Solution
(a) Scatter plots and histograms are useful for a small number of variables. Possible visualization techniques for a large number of variables (more than 4) include parallel coordinates, projection using either principal component analysis or multi-dimensional scaling, and scatter plot arrays. Explaining any two of them would do.
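As an illustration of how parallel coordinates handle many variables, the sketch below prepares the data for such a plot: each variable is rescaled to [0, 1] so that every axis has the same height, and each customer then becomes one polyline across the axes. The three customer records are invented, and min-max scaling is one common choice rather than the only one.

```python
# Hypothetical customer objects: (company size, annual earnings, company age).
customers = [
    (120, 4.5e6, 12),
    (15, 0.3e6, 2),
    (600, 30e6, 25),
]

def min_max_scale(rows):
    """Rescale every variable (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Each scaled row is one polyline: the i-th value is its height on the i-th axis.
polylines = min_max_scale(customers)
```

Coloring each polyline by the class variable would then show whether committed and non-committed customers follow visibly different paths.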
Tutorial 1
Data Preprocessing in RapidMiner[1]
This tutorial illustrates some of the basic data preprocessing operations that can be performed using RapidMiner. The sample data set used for this example is the “bank data” available in comma-separated format on the blackboard.
Preparation
1- Download and install RapidMiner 5 (no need to do so in the lab) from here:
http://sourceforge.net/projects/rapidminer/
Note that there is a more recent version of RapidMiner (Studio 6) but it has usage restrictions.
There is a lot of documentation and many video tutorials available for RapidMiner. A good starting point is the official user manual:

rapidminer-5.0-manual-english_v1.0.pdf

2- Run RapidMiner and, if it is not already selected, choose the design perspective. You can do this simply by clicking on the pad icon at the middle top.
3- We will be using some bank data. The data file is available on the blackboard (along with the tutorial). Download the file and save it on your computer. The data is in CSV format. Open the file using Notepad or any text editor. The first row of a typical CSV file contains the attribute names (separated by commas), followed by the data rows, each listing its attribute values in the same order (also separated by commas). The data file contains the following fields. Note that the attribute “pep” is the target or label attribute (the one we want to predict).
Id: a unique identification number
Age: age of customer in years (numeric)
Sex: MALE / FEMALE
Region: inner_city/rural/suburban/town
Income: income of customer (numeric)
Married: is the customer married (YES/NO)
Children: number of children (numeric)
Car: does the customer own a car (YES/NO)
save_acct: does the customer have a saving account (YES/NO)
current_acct: does the customer have a current account (YES/NO)
Mortgage: does the customer have a mortgage (YES/NO)
Pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
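For readers who want to see the same file layout outside RapidMiner, here is a minimal Python sketch of reading a CSV whose first row holds the attribute names. The two data rows are invented, and the exact spelling of the header names (all lowercase here) is an assumption; check the actual file.

```python
import csv
import io

# Two made-up rows in the layout described above (not real bank data).
sample = io.StringIO(
    "id,age,sex,region,income,married,children,car,"
    "save_acct,current_acct,mortgage,pep\n"
    "ID00001,48,FEMALE,inner_city,17546.0,NO,1,NO,NO,NO,NO,YES\n"
    "ID00002,40,MALE,town,30085.1,YES,3,YES,NO,YES,YES,NO\n"
)

reader = csv.DictReader(sample)   # the first row supplies the attribute names
rows = list(reader)

# CSV values arrive as text, so the numeric attributes are converted by hand.
numeric = ("age", "income", "children")
for row in rows:
    for name in numeric:
        row[name] = float(row[name])

labels = [row["pep"] for row in rows]   # the target (label) attribute
```

To read a real file, replace `sample` with `open("your_file.csv", newline="")`, where the file name is whatever you saved the download as.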
Go to Operators->import->data->Read CSV. Add the operator to the “process” tab (drag & drop).
You will notice a small red circle with a warning sign in the operator. This means there is an error. Look at the “problems” tab at the bottom. We need to specify the CSV file.
Click on the operator and look at the “parameters” tab at the right.
Click on the folder button to specify the CSV file.
Specify the column separator to be “,”.
The error should now disappear.

Connect the output of the Read CSV operator to “res” (the results connection).
Click the play button on the top. This should run your first RapidMiner program and automatically switch to the “results” perspective.
Click on the “example set” tab to view the data read.
Click on the “Meta Data View” option. This displays all the data fields and their types (automatically detected by RapidMiner) along with some statistics.
Click on the “Plot View”. Choose “age” and “income” as x and y Axes, respectively. What do you see?

Selecting attributes manually
In our sample data file, each record is uniquely identified by a customer id (the “id” attribute). We need to remove this attribute before mining the data. We can do this using the attribute filters.
Go back to the “design” perspective.
Type “select attributes” in the operators search field. Drag and drop the operator.
Connect the output of the Read CSV operator to the input of the select attributes operator. Connect the output of the select attributes operator to the results connection (the “res” connection on the top right).
Click on the select attributes operator. Under type, select “single”; in the attribute field, select id; then check “invert selection”. This selects all attributes except id.
Click play and check that ID is now removed.
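The effect of selecting a single attribute with “invert selection” checked can be pictured with the following sketch. The function name `select_attributes` and the sample records are hypothetical; the code only mirrors the behaviour described above and is not RapidMiner's implementation.

```python
# Hypothetical records; only the attribute names come from the tutorial.
records = [
    {"id": "ID00001", "age": 48, "income": 17546.0, "pep": "YES"},
    {"id": "ID00002", "age": 40, "income": 30085.1, "pep": "NO"},
]

def select_attributes(rows, names, invert=False):
    """Keep the named attributes, or everything except them when invert=True."""
    names = set(names)
    return [
        {k: v for k, v in row.items() if (k in names) != invert}
        for row in rows
    ]

# "single" attribute = id, with "invert selection" checked.
without_id = select_attributes(records, ["id"], invert=True)
```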
Discretization
Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. There are 3 such attributes in this data set: “age”, “income”, and “children”.
Use the “discretize by frequency” operator.
Use the “nominal to binominal” operator after the discretize operator.
Connect both “exa” and “ori” to “res”.
Click play. What is the result?
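To make the two operators less of a black box, the sketch below shows the general idea behind them: equal-frequency binning puts roughly the same number of examples in each bin, and the nominal-to-binominal step turns one attribute with several values into one true/false attribute per value. The function names and the ages are made up, and details such as how ties are handled or how the new attributes are named will differ from RapidMiner.

```python
def discretize_by_frequency(values, n_bins):
    """Assign each value a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def nominal_to_binominal(labels):
    """Turn one nominal attribute into one true/false attribute per value."""
    categories = sorted(set(labels))
    return [{f"bin={c}": (label == c) for c in categories} for label in labels]

ages = [23, 67, 35, 41, 58, 19]          # made-up ages
age_bins = discretize_by_frequency(ages, n_bins=3)
flags = nominal_to_binominal(age_bins)
```

Note that this simple version can place equal values in different bins when they straddle a bin boundary.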

Tutorial 2: Visualization & statistics
We will use sample data collected from BUiD students. The dataset covers 86 students. A CSV file is available on the blackboard, where the first row shows self-explanatory attribute names.
Tutorial steps
1- Open the csv file (using the operator we learned in previous tutorial, and connect out to res). Run the process and explore the data statistics (meta view in version 5.3).
What is the most common age group?
What is the least common programme?
2- Click on Charts (plot view in version 5.3). Choose histogram, then choose the “nationality” attribute. What is the second most-common nationality?
3- Two attributes allow multiple values to be entered by the student, separated by commas (which ones?). This is hard to analyze using RapidMiner. You can use the “Split” operator to convert each of these two attributes into multiple attributes. Make sure to select “unordered split”. Rerun the process.
Look at statistics. What is the most common source of knowing about BUiD?
What are the top 3 reasons for joining BUiD?
4- Use a scatter plot (under plot view or charts) to plot programme name vs knowing about BUiD from Alumni. You may need a little “jitter”.
What are the two programs that rely on this source the most?
5- Feel free to explore further with RapidMiner.
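The idea behind the unordered split in step 3 can be sketched as follows: a cell holding several comma-separated values becomes one true/false column per distinct value, regardless of the order in which the student typed them. The answers below are invented, and the exact column names and value encoding RapidMiner produces are not reproduced here.

```python
# Made-up answers; each cell may hold several comma-separated values.
answers = ["Website, Alumni", "Alumni", "Newspaper, Website", ""]

def unordered_split(cells, sep=","):
    """One true/false column per distinct value, ignoring the order of values."""
    parsed = [{v.strip() for v in cell.split(sep) if v.strip()} for cell in cells]
    values = sorted(set().union(*parsed))
    return [{v: (v in chosen) for v in values} for chosen in parsed]

columns = unordered_split(answers)

# Counting the True entries per column answers "what is the most common source?"
counts = {v: sum(row[v] for row in columns) for v in columns[0]}
```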
Clustering
Introduction
In this tutorial we will take a more practical look at the clustering analysis of a real data set. Download the data file ad_text_short.csv from the blackboard. This data file describes real estate classifieds in the UAE. The data consists of 1000 classifieds as pure text (one attribute, the description).
0- Make sure the text mining extension is installed: click extensions->market place, then search for the “text processing” extension (for RapidMiner 5, click on “update extensions” under the “help” menu item to install it instead).
1- Open the file in RapidMiner using the usual CSV operator. Make sure you unselect “use quotes” and that the type of the “Description” attribute is polynominal. Now add the operator “nominal to text”.
2- Search the operators for “process documents”; you will find multiple options. Choose “… from data”. Connect the output of the “nominal to text” operator to the “exa” input of this operator.
3- The “process doc..” operator computes the TF-IDF for the passed data. First we need to tokenize text (generate keywords).
Double-click on the “process doc…” operator. This will open the subprocess.
Add the “tokenize” operator and connect it. Follow it with the “Transform Cases” operator to ensure all words are lowercase.
This may generate lots of meaningless words. Add the Filter Stopwords operator (after the tokenizer), which removes trivial words (make sure English is selected).
The subprocess should look like the figure below.
Go one level up. Connect the output of the process doc process to res. Run and observe the results.
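What step 3 computes can be sketched in plain Python: tokenize, lowercase, drop stopwords, then weight each remaining term by TF-IDF. This is a minimal sketch under stated assumptions: the stopword list is a tiny stand-in, the classifieds are invented, and the weighting used here (relative term frequency times the natural log of N over document frequency) is one common variant that may not match RapidMiner's exact formula or normalization.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "in", "with", "for", "the", "and", "is"}  # tiny illustrative list

def tokens(text):
    """Tokenize on non-letters, lowercase, then drop stopwords."""
    words = re.split(r"[^A-Za-z]+", text.lower())
    return [w for w in words if w and w not in STOPWORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    bags = [Counter(tokens(d)) for d in documents]
    n = len(bags)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors

ads = [  # made-up classifieds
    "Villa for rent in Dubai with a garden",
    "Studio for rent in Sharjah",
    "Villa for sale in Dubai",
]
vectors = tf_idf(ads)
```

Terms that appear in few documents (such as "garden" here) receive a higher weight than terms shared by many documents, which is what makes them useful for telling classifieds apart.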
4- Now we add the “k-means” operator. Choose K to equal 10 and max iterations to be 10 as well. Connect it (both outputs to the results). Also make sure the “wor” output from the “process doc…” operator is connected to a “res” connection.
5- Run the process. It may take some time (monitor the progress at the bottom of the window).
6- Look at the output of the clustering. You can see the centroids and the grouping of the documents/classifieds. Sort the attributes of a centroid by clicking on the cluster name. Do the clusters make sense?
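For intuition about what the k-means operator does with `k` and `max iterations`, here is a minimal sketch of the standard algorithm (Lloyd's iteration) on a handful of invented 2-D points. It is not RapidMiner's implementation; the random initialization, the `seed` parameter, and the squared Euclidean distance are assumptions of this sketch.

```python
import random

def k_means(points, k, max_iterations=10, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Made-up points forming two obvious groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, assignment = k_means(points, k=2)
```

In the tutorial each point is a TF-IDF vector with one dimension per keyword, so a centroid's largest attributes are the keywords that characterize its cluster.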
7- It is helpful to get rid of some of the (keyword) attributes. Add the “remove correlated attributes” operator between the “process doc…” and “clustering …” operators. Your final process should look like this
8- Run again and repeat step 6. Which clusters can you recognize? Now change k to only 5. Which clusters can you recognize?
9- Exercise: try other clustering algorithms (e.g. hierarchical clustering).
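The attribute reduction in step 7 can be pictured as a simple filter: walk through the attributes and drop any whose Pearson correlation with an already kept attribute is too high. The sketch below is only an illustration of that idea; the `threshold` value, the keyword weights, and the keep-the-first-one policy are assumptions and may differ from what the RapidMiner operator does.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def remove_correlated(columns, threshold=0.95):
    """Keep an attribute only if it is not too correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Made-up keyword weights for four documents.
columns = {
    "villa":  [0.9, 0.0, 0.8, 0.1],
    "villas": [0.8, 0.1, 0.7, 0.2],   # tracks "villa" almost exactly
    "garden": [0.5, 0.6, 0.0, 0.1],
}
reduced = remove_correlated(columns)
```

Fewer, less redundant keyword attributes usually make the centroids easier to read and the clustering faster.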
Classification
Introduction
In this tutorial we will start the fun part of the course. As we learn about different data mining algorithms, we will try them using different data sets.