
1

CHAPTER 7

CLASSIFICATIONS

2

Topics

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian Classification

Classification by Neural Networks

Classification by Support Vector Machines (SVM)

Classification based on concepts from association rule

mining

Other Classification Methods

Prediction

Classification accuracy

Summary

3

Classification:

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training

set and the values (class labels) in a classifying attribute

and uses it in classifying new data

Prediction:

models continuous-valued functions, i.e., predicts

unknown or missing values

Typical Applications

credit approval

target marketing

medical diagnosis

treatment effectiveness analysis

Classification vs. Prediction

4

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class,

as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules, decision trees,

or mathematical formulae

Model usage: for classifying future or unknown objects

Estimate accuracy of the model

The known label of test sample is compared with the

classified result from the model

Accuracy rate is the percentage of test set samples that are

correctly classified by the model

Test set is independent of training set, otherwise over-fitting

will occur

If the accuracy is acceptable, use the model to classify data

tuples whose class labels are not known

5

Classification Process (1): Model

Construction

Training

Data

NAME RANK YEARS TENURED

Mike Assistant Prof 3 no

Mary Assistant Prof 7 yes

Bill Professor 2 yes

Jim Associate Prof 7 yes

Dave Assistant Prof 6 no

Anne Associate Prof 3 no

Classification

Algorithms

IF rank = ‘professor’

OR years > 6

THEN tenured = ‘yes’

Classifier

(Model)

6

Classification Process (2): Use the

Model in Prediction

Classifier

Testing

Data

NAME RANK YEARS TENURED

Tom Assistant Prof 2 no

Merlisa Associate Prof 7 no

George Professor 5 yes

Joseph Assistant Prof 7 yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
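As an illustration of the two-step process, the sketch below (a minimal, hypothetical Python example; the rule and the tuples are the ones shown on these slides) applies the learned rule to the test set and to the unseen tuple (Jeff, Professor, 4).

# Apply the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [("Tom", "Assistant Prof", 2, "no"),
             ("Merlisa", "Associate Prof", 7, "no"),
             ("George", "Professor", 5, "yes"),
             ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(classify(rank, years) == label for _, rank, years, label in test_data)
print("accuracy on the test set:", correct / len(test_data))   # 0.75: the rule misclassifies Merlisa

print("Jeff:", classify("Professor", 4))                        # -> 'yes'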

7

Supervised vs. Unsupervised

Learning

Supervised learning (classification)

Supervision: The training data (observations,

measurements, etc.) are accompanied by labels

indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data are unknown

Given a set of measurements, observations, etc. with

the aim of establishing the existence of classes or

clusters in the data

8

ISSUES REGARDING

CLASSIFICATION AND

PREDICTION

9

Issues Regarding Classification and Prediction

(1): Data Preparation

Data cleaning

Preprocess data in order to reduce noise and handle

missing values

Relevance analysis (feature selection)

Remove the irrelevant or redundant attributes

Data transformation

Generalize and/or normalize data
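For example, one common normalization step is min-max scaling of a numeric attribute to [0, 1]. The sketch below is a generic Python illustration and is not tied to any particular dataset in these slides.

# Min-max normalization: rescale a numeric attribute to the range [0, 1]
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 32, 41, 58, 63]
print(min_max_normalize(ages))   # [0.0, 0.18, 0.42, 0.87, 1.0] (approximately)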

10

Issues regarding classification and prediction

(2): Evaluating Classification Methods

Predictive accuracy

Speed and scalability

time to construct the model

time to use the model

Robustness

handling noise and missing values

Scalability

efficiency in disk-resident databases

Interpretability:

understanding and insight provided by the model

Goodness of rules

decision tree size

compactness of classification rules

11

CLASSIFICATION BY DECISION

TREE INDUCTION

12

Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

This follows an example from Quinlan's ID3.

13

Output: A Decision Tree for "buys_computer"

(Decision tree: the root tests age?. The <=30 branch tests student? (no -> no, yes -> yes); the 31…40 branch is the leaf yes; the >40 branch tests credit rating? (excellent -> no, fair -> yes).)

14

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer

manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are

discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical

measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf

There are no samples left
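A compact way to see this algorithm in action is to induce a tree on the buys_computer data above. The sketch below uses scikit-learn and pandas, which are assumptions of this illustration (the slides do not prescribe any library); entropy is chosen as the selection measure to match the discussion that follows.

# Sketch: inducing a decision tree on the buys_computer data with scikit-learn
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":           ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
                      "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy")       # information gain as the selection measure
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # prints the induced tree as nested tests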

15

Attribute Selection Measure:

Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

S contains si tuples of class Ci for i = 1, …, m

Expected information (entropy) needed to classify an arbitrary tuple:

I(s1, s2, …, sm) = - Σ(i=1..m) (si / s) log2 (si / s)

Entropy of attribute A with values a1, a2, …, av (the expected information after partitioning on A):

E(A) = Σ(j=1..v) ((s1j + … + smj) / s) · I(s1j, …, smj)

Information gained by branching on attribute A:

Gain(A) = I(s1, s2, …, sm) - E(A)
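These formulas are short enough to check directly. The sketch below is a plain-Python illustration written for this chapter's buys_computer example; the class counts per age group are the ones worked out on the next slide.

from math import log2

def info(counts):
    # Expected information I(s1, ..., sm) = -sum (si/s) * log2(si/s)
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

# buys_computer example: 9 'yes' and 5 'no' tuples overall
print(f"I(9,5) = {info([9, 5]):.3f}")                      # 0.940

# Branching on age gives the partitions (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)
partitions = [[2, 3], [4, 0], [3, 2]]
e_age = sum(sum(p) / 14 * info(p) for p in partitions)
print(f"E(age) = {e_age:.3f}")                             # 0.694
print(f"Gain(age) = {info([9, 5]) - e_age:.3f}")           # 0.247 (the slide's 0.246 rounds I and E first)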

16

Attribute Selection by Information

Gain Computation

Class P: buys_computer = “yes”

Class N: buys_computer = “no”

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

The term (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

17

Other Attribute Selection Measures

Gini index (CART, IBM IntelligentMiner)

All attributes are assumed continuous-valued

Assume there exist several possible split values for each attribute

May need other tools, such as clustering, to get the possible split values

Can be modified for categorical attributes

18

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - Σ(j=1..n) pj^2

where pj is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

19

Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules

One rule is created for each path from the root to a leaf

Each attribute-value pair along a path forms a conjunction

The leaf node holds the class prediction

Rules are easier for humans to understand

Example

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
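Before moving on from attribute selection, the gini index defined above can also be checked with a few lines of plain Python. The sketch below evaluates the split of the 14 buys_computer labels on the student attribute (a small illustration, not part of the original slides).

def gini(labels):
    # gini(T) = 1 - sum_j pj^2, where pj is the relative frequency of class j in T
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(part1, part2):
    # gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

# Splitting the 14 buys_computer labels on student = yes / no
student_yes = ["yes", "yes", "yes", "yes", "yes", "yes", "no"]   # 6 yes, 1 no
student_no  = ["yes", "yes", "yes", "no", "no", "no", "no"]      # 3 yes, 4 no
print(round(gini(student_yes + student_no), 3))                  # 0.459 for the whole set
print(round(gini_split(student_yes, student_no), 3))             # 0.367 after the split; smaller is better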

20

Avoid Overfitting in Classification

Overfitting: An induced tree may overfit the training data

Too many branches, some may reflect anomalies due to noise or outliers

Poor accuracy for unseen samples

Two approaches to avoid overfitting

Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold

Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees

Use a set of data different from the training data to decide which is the "best pruned tree"

21

Approaches to Determine the Final Tree Size

Separate training (2/3) and testing (1/3) sets

Use cross-validation, e.g., 10-fold cross-validation

Use all the data for training but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

22

Enhancements to basic decision tree induction

Allow for continuous-valued attributes

Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals

Handle missing attribute values

Assign the most common value of the attribute

Assign probability to each of the possible values

Attribute construction

Create new attributes based on existing ones that are sparsely represented

This reduces fragmentation, repetition, and replication

23

Classification in Large Databases

Classification is a classical problem extensively studied by statisticians and machine learning researchers

Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed

Why decision tree induction in data mining?

relatively faster learning speed (than other classification methods)

convertible to simple and easy to understand classification rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods

24

Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure

PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier

RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

25

Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al.'97)

Classification at primitive concept levels

E.g., precise temperature, humidity, outlook, etc.

Low-level concepts, scattered classes, bushy classification trees

Semantic interpretation problems

Cube-based multi-level classification

Relevance analysis at multiple levels

Information-gain analysis with dimension + level

26

Presentation of Classification Results

27

Visualization of a Decision Tree in SGI/MineSet 3.0

28

Interactive Visual Mining by Perception-Based Classification (PBC)

29

BAYESIAN CLASSIFICATION

30

Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

31

Bayesian Theorem: Basics

Let X be a data sample whose class label is unknown

Let H be a hypothesis that X belongs to class C

For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X

P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)

P(X): probability that the sample data is observed

P(X|H): probability of observing the sample X, given that the hypothesis holds

32

Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:

P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

hMAP = argmax(h in H) P(h|D) = argmax(h in H) P(D|h) P(h)

Practical difficulty: requires initial knowledge of many probabilities; significant computational cost

33

Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent:

P(X|Ci) = Π(k=1..n) P(xk|Ci)

The probability of observing, say, two attribute values y1 and y2 together, given the current class C, is the product of the probabilities of each value taken separately, given the same class: P([y1, y2]|C) = P(y1|C) * P(y2|C)

No dependence relation between attributes

Greatly reduces the computation cost: only the class distributions need to be counted

Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) P(Ci)
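A minimal sketch of this computation, written in plain Python for the buys_computer data and the sample X used on the next two slides (no smoothing; it only mirrors the hand calculation):

from collections import Counter

# (age, income, student, credit_rating, buys_computer) tuples from the training data
rows = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")            # the unknown sample
class_counts = Counter(r[-1] for r in rows)

for c in class_counts:
    members = [r for r in rows if r[-1] == c]
    # P(Ci) * product over attributes of P(xk | Ci)
    p = class_counts[c] / len(rows)
    for k, value in enumerate(X):
        p *= sum(1 for r in members if r[k] == value) / len(members)
    print(c, round(p, 3))                        # yes: 0.028, no: 0.007 -> predict 'yes'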
34

Training dataset

The same buys_computer training data as before (age, income, student, credit_rating, buys_computer)

Classes: C1: buys_computer = 'yes', C2: buys_computer = 'no'

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

35

Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:

P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):

P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) P(Ci):

P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007

X belongs to class "buys_computer = yes"

36

Naïve Bayesian Classifier: Comments

Advantages:

Easy to implement

Good results obtained in most of the cases

Disadvantages:

Assumption: class conditional independence, therefore loss of accuracy

Practically, dependencies exist among variables

E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.

Dependencies among these cannot be modeled by the Naïve Bayesian Classifier

How to deal with these dependencies? Bayesian Belief Networks

37

Bayesian Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent

A graphical model of causal relationships

Represents dependency among the variables

Gives a specification of the joint probability distribution

Nodes: random variables; links: dependency

(Example graph: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles.)

38

Bayesian Belief Network: An Example

(Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.)

The conditional probability table for the variable LungCancer (LC) shows the conditional probability for each possible combination of its parents, FamilyHistory (FH) and Smoker (S):

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

P(z1, …, zn) = Π(i=1..n) P(zi | Parents(Zi))

39

Learning Bayesian Networks

Several cases

Given both the network structure and all variables observable: learn only the CPTs

Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning

Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology

Unknown structure, all hidden variables: no good algorithms known for this purpose

D. Heckerman, Bayesian networks for data mining

40

CLASSIFICATION BY NEURAL NETWORKS

41

Classification:

predicts categorical class labels

Typical Applications

credit history, salary -> credit approval (Yes/No)

Temp, Humidity –> Rain (Yes/No)

Classification, mathematically: learn a function

h : X → Y

where x ∈ X = {0,1}^n (the attribute vectors) and y ∈ Y = {0,1} (the class label); a new example x is classified as y = h(x).

42

Linear Classification

Binary Classification

problem

The data above the red

line belongs to class ‘x’

The data below red line

belongs to class ‘o’

Examples – SVM,

Perceptron, Probabilistic

Classifiers


43

Discriminative Classifiers

Advantages

prediction accuracy is generally high

(as compared to Bayesian methods – in general)

robust, works when training examples contain errors

fast evaluation of the learned target function

(Bayesian networks are normally slow)

Criticism

long training time

difficult to understand the learned function (weights)

(Bayesian networks can be used easily for pattern discovery)

not easy to incorporate domain knowledge

(easy in the form of priors on the data or distributions)

44

Neural Networks

Analogy to Biological Systems (Indeed a great example

of a good learning system)

Massive Parallelism allowing for computational

efficiency

The first learning algorithm came in 1959 (Rosenblatt)

who suggested that if a target output value is provided

for a single neuron with fixed inputs, one can

incrementally change weights to learn to produce these

outputs using the perceptron learning rule

45

A Neuron

The n-dimensional input vector x is mapped into

variable y by means of the scalar product and a

nonlinear function mapping

(Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum, which passes through an activation function f to produce the output y.)

46

A Neuron

(Figure: the same neuron as above: inputs x0…xn, weights w0…wn, weighted sum, activation function f, output y.)

For example,

y = sign( Σ(i=0..n) wi xi - μk )

where μk is the threshold of the unit.
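For instance, a tiny sketch of this neuron output in Python (the weights, input, and threshold below are made up for illustration):

import numpy as np

x = np.array([1.0, 0.5, -0.3])      # input vector
w = np.array([0.4, -0.2, 0.7])      # weight vector
mu = 0.1                            # threshold mu_k

y = np.sign(np.dot(w, x) - mu)      # y = sign(sum_i w_i x_i - mu_k)
print(y)                            # -1.0 for these particular numbers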

Multi-Layer Perceptron

(Figure: input nodes receive the input vector xi; hidden nodes and output nodes are connected by weights wij; the network produces an output vector.)

For a unit j, the net input and output are

Ij = Σi wij Oi + θj

Oj = 1 / (1 + e^(-Ij))

The error is propagated backwards:

Errj = Oj (1 - Oj)(Tj - Oj)           for an output unit

Errj = Oj (1 - Oj) Σk Errk wjk        for a hidden unit

and the weights and biases are updated with learning rate (l):

wij = wij + (l) Errj Oi

θj = θj + (l) Errj

Network Training

The ultimate objective of training

obtain a set of weights that makes almost all the

tuples in the training data classified correctly

Steps

Initialize weights with random values

Feed the input tuples into the network one by one

For each unit

Compute the net input to the unit as a linear combination

of all the inputs to the unit

Compute the output value using the activation function

Compute the error

Update the weights and the bias
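The sketch below runs one backpropagation update for a tiny network with two inputs, two hidden units, and one output, following the formulas above (a minimal NumPy illustration with made-up weights and a single tuple; not a full training loop):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])                              # one input tuple
target = np.array([1.0])                              # desired output T_j
l = 0.5                                               # learning rate

W1 = rng.normal(size=(2, 2)); theta1 = np.zeros(2)    # input -> hidden weights and biases
W2 = rng.normal(size=(2, 1)); theta2 = np.zeros(1)    # hidden -> output weights and biases

# Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = 1 / (1 + e^(-I_j))
O_h = sigmoid(x @ W1 + theta1)
O_out = sigmoid(O_h @ W2 + theta2)

# Backward pass: Err_j = O_j (1 - O_j)(T_j - O_j) at the output layer,
# Err_j = O_j (1 - O_j) sum_k Err_k w_jk at the hidden layer
err_out = O_out * (1 - O_out) * (target - O_out)
err_h = O_h * (1 - O_h) * (W2 @ err_out)

# Updates: w_ij += (l) Err_j O_i, theta_j += (l) Err_j
W2 += l * np.outer(O_h, err_out); theta2 += l * err_out
W1 += l * np.outer(x, err_h);     theta1 += l * err_h

print("output before the update:", O_out)
print("output after the update: ", sigmoid(sigmoid(x @ W1 + theta1) @ W2 + theta2))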

50

CLASSIFICATION BY

SUPPORT VECTOR MACHINES (SVM)

SVM – Support Vector Machines

(Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the points lying on the margin are the support vectors.)

52

SVM – Cont.

Linear Support Vector Machine

Given a set of points xi ∈ R^n with labels yi ∈ {-1, +1},

the SVM finds a hyperplane defined by the pair (w, b)

(where w is the normal to the plane and b is the offset from the origin)

s.t.

yi (xi · w + b) >= 1,   i = 1, …, N

x is the feature vector, b the bias, y the class label, and the margin is 2 / ||w||.
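As a small illustration, the sketch below fits a linear SVM on a tiny separable 2-D data set and reads off the hyperplane (w, b), the support vectors, and the margin width. scikit-learn is an assumption here; the slides do not name a library, and the data points are made up.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [2.0, 3.0], [3.0, 2.5],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-2.0, -3.0], [-3.0, -2.5]]) # class -1
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3)           # a large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]      # the hyperplane (w, b)
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))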

53

SVM – Cont.

What if the data is not linearly separable?

Project the data to high dimensional space where it is

linearly separable and then we can use linear SVM –

(Using Kernels)

(Figure: a small one-dimensional data set that is not linearly separable becomes separable after being mapped into a higher-dimensional feature space.)

54

Non-Linear SVM

Classification using an SVM (w, b): test whether

xi · w + b >= 0

In the non-linear case this becomes

K(xi, w) + b >= 0

where the kernel K can be thought of as computing a dot product in some high-dimensional feature space.
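A commonly used illustration of this idea (a generic sketch, not the exact figure from the slide): one-dimensional points that no single threshold can separate become linearly separable after the explicit feature map phi(x) = (x, x^2); a kernel SVM achieves the same effect implicitly.

# Sketch of the "kernel trick" idea via an explicit feature map phi(x) = (x, x^2)
# (a standard textbook illustration; the specific points are hypothetical)
points = [(-1.0, "+"), (0.0, "-"), (1.0, "+")]    # not separable by any single threshold on x

def phi(x):
    return (x, x * x)

for x, label in points:
    print(label, phi(x))

# In the mapped space the '+' points have x^2 = 1 and the '-' point has x^2 = 0,
# so the line x^2 = 0.5 separates them without any error.
print([phi(x)[1] > 0.5 for x, _ in points])       # [True, False, True]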

55

Example of Non-linear SVM

56

Results

57

SVM vs. Neural Network

SVM

Relatively new concept

Nice Generalization

properties

Hard to learn – learned

in batch mode using

quadratic programming

techniques

Using kernels can learn

very complex functions

Neural Network

Quite old

Generalizes well but

doesn’t have strong

mathematical foundation

Can easily be learned in

incremental fashion

To learn complex

functions – use

multilayer perceptron

(not that trivial)

58

SVM Related Links

http://svm.dcs.rhbnc.ac.uk/

http://www.kernel-machines.org/

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

SVMlight – Software (in C)

http://ais.gmd.de/~thorsten/svm_light

BOOK: An Introduction to Support Vector Machines

N. Cristianini and J. Shawe-Taylor

Cambridge University Press

59

CLASSIFICATION BASED ON

CONCEPTS FROM

ASSOCIATION RULE MINING

60

Association-Based Classification

Several methods for association-based classification

ARCS: Quantitative association mining and clustering

of association rules (Lent et al’97)

It beats C4.5 in (mainly) scalability and also accuracy

Associative classification: (Liu et al’98)

It mines high support and high confidence rules in the form of

“cond_set => y”, where y is a class label

CAEP (Classification by aggregating emerging patterns)

(Dong et al’99)

Emerging patterns (EPs): the itemsets whose support

increases significantly from one class to another

Mine EPs based on minimum support and growth rate

61

OTHER CLASSIFICATION METHODS

62

Other Classification Methods

k-nearest neighbor classifier

case-based reasoning

Genetic algorithm

Rough set approach

Fuzzy set approaches

63

Instance-Based Methods

Instance-based learning:

Store training examples and delay the processing

(“lazy evaluation”) until a new instance must be

classified

Typical approaches

k-nearest neighbor approach

Instances represented as points in a Euclidean

space.

Locally weighted regression

Constructs local approximation

Case-based reasoning

Uses symbolic representations and knowledge-based inference

64

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space.

The nearest neighbors are defined in terms of

Euclidean distance.

The target function could be discrete- or real- valued.

For discrete-valued, the k-NN returns the most

common value among the k training examples nearest

to xq.

Voronoi diagram: the decision surface induced by 1-

NN for a typical set of training examples.

(Figure: positive and negative training examples around a query point xq, and the piecewise-linear decision surface induced by 1-NN.)

65

Discussion on the k-NN Algorithm

The k-NN algorithm for continuous-valued target functions

Calculate the mean values of the k nearest neighbors

Distance-weighted nearest neighbor algorithm

Weight the contribution of each of the k neighbors

according to their distance to the query point xq

giving greater weight to closer neighbors

Similarly, for real-valued target functions

Robust to noisy data by averaging k-nearest neighbors

Curse of dimensionality: distance between neighbors could

be dominated by irrelevant attributes.

To overcome it, axes stretch or elimination of the least

relevant attributes.

e.g., w = 1 / d(xq, xi)^2
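A short sketch of distance-weighted k-NN for a discrete-valued target (plain Python; the training points and the query are hypothetical):

from collections import defaultdict

# (point, label) training examples in 2-D
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((3.0, 3.5), "-"),
         ((4.0, 4.0), "-"), ((2.0, 1.0), "+"), ((3.5, 2.5), "-")]

def dist2(a, b):
    # squared Euclidean distance
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(xq, k=3):
    neighbors = sorted(train, key=lambda ex: dist2(ex[0], xq))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1.0 / (dist2(point, xq) + 1e-9)   # weight w = 1 / d(xq, xi)^2
    return max(votes, key=votes.get)

print(knn_predict((2.0, 2.0)))   # -> '+' : the three closest neighbors, weighted by 1/d^2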

66

Case-Based Reasoning

Also uses: lazy evaluation + analyze similar instances

Difference: Instances are not “points in a Euclidean space”

Example: Water faucet problem in CADET (Sycara et al’92)

Methodology

Instances represented by rich symbolic descriptions

(e.g., function graphs)

Multiple retrieved cases may be combined

Tight coupling between case retrieval, knowledge-based

reasoning, and problem solving

Research issues

Indexing based on syntactic similarity measures and, when retrieval fails, backtracking and adapting to additional cases

67

Remarks on Lazy vs. Eager Learning

Instance-based learning: lazy evaluation

Decision-tree and Bayesian classification: eager evaluation

Key differences

Lazy method may consider query instance xq when deciding how to

generalize beyond the training data D

Eager methods cannot, since they have already committed to a global

approximation before seeing the query

Efficiency: Lazy – less time training but more time predicting

Accuracy

Lazy method effectively uses a richer hypothesis space since it uses

many local linear functions to form its implicit global approximation

to the target function

Eager: must commit to a single hypothesis that covers the entire

instance space

68

Genetic Algorithms

GA: based on an analogy to biological evolution

Each rule is represented by a string of bits

An initial population is created consisting of randomly

generated rules

e.g., IF A1 and Not A2 then C2 can be encoded as 100

Based on the notion of survival of the fittest, a new

population is formed to consist of the fittest rules and

their offspring

The fitness of a rule is represented by its classification

accuracy on a set of training examples

Offspring are generated by crossover and mutation

69

Rough Set Approach

Rough sets are used to approximately or "roughly"

define equivalence classes

A rough set for a given class C is approximated by two

sets: a lower approximation (certain to be in C) and an

upper approximation (cannot be described as not

belonging to C)

Finding the minimal subsets (reducts) of attributes (for

feature reduction) is NP-hard but a discernibility matrix

is used to reduce the computation intensity

70

Fuzzy Set

Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to

represent the degree of membership (such as using fuzzy

membership graph)

Attribute values are converted to fuzzy values

e.g., income is mapped into the discrete categories

low, medium, high with fuzzy values calculated

For a given new sample, more than one fuzzy value may

apply

Each applicable rule contributes a vote for membership in

the categories

Typically, the truth values for each predicted category are

summed

71

PREDICTION

72

What Is Prediction?

Prediction is similar to classification

First, construct a model

Second, use model to predict unknown value

Major method for prediction is regression

Linear and multiple regression

Non-linear regression

Prediction is different from classification

Classification refers to predicting categorical class labels

Prediction models continuous-valued functions

73

Predictive modeling: Predict data values or construct

generalized linear models based on the database data.

One can only predict value ranges or category distributions

Method outline:

Minimal generalization

Attribute relevance analysis

Generalized linear model construction

Prediction

Determine the major factors which influence the prediction

Data relevance analysis: uncertainty measurement,

entropy analysis, expert judgement, etc.

Multi-level prediction: drill-down and roll-up analysis

Predictive Modeling in Databases

74

Linear regression: Y = α + β X

The two parameters α and β specify the line and are to be estimated from the data at hand, using the least squares criterion on the known values Y1, Y2, …, and X1, X2, ….

Multiple regression: Y = b0 + b1 X1 + b2 X2

Many nonlinear functions can be transformed into the above.

Log-linear models:

The multi-way table of joint probabilities is approximated by a product of lower-order tables.

Probability: p(a, b, c, d) = αab βac χad δbcd

Regression Analysis and Log-Linear

Models in Prediction
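As a small illustration of the least squares criterion, the sketch below estimates α and β for Y = α + β X in plain Python (the data points are made up):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 7.9, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Least squares: beta = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2, alpha = y_bar - beta * x_bar
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar
print("estimated line: Y = %.2f + %.2f X" % (alpha, beta))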

75

Locally Weighted Regression

Construct an explicit approximation to f over a local region

surrounding query instance xq.

Locally weighted linear regression:

The target function f is approximated near xq using the

linear function:

minimize the squared error: distance-decreasing weight K

the gradient descent training rule:

In most cases, the target function is approximated by a

constant, linear, or quadratic function.

f̂(x) = w0 + w1 a1(x) + … + wn an(x)

E(xq) = 1/2 Σ(x ∈ k nearest neighbors of xq) (f(x) - f̂(x))^2 K(d(xq, x))

Δwj = η Σ(x ∈ k nearest neighbors of xq) K(d(xq, x)) (f(x) - f̂(x)) aj(x)
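A compact NumPy sketch of locally weighted linear regression around a query point. This is a generic illustration: the Gaussian kernel and the data are assumptions, and the weights are fit in closed form (weighted least squares) rather than by the gradient descent rule above.

import numpy as np

def lwr_predict(xq, X, y, tau=1.0):
    # Fit f_hat(x) = w0 + w1*x near xq, weighting each example by K(d(xq, x))
    K = np.exp(-((X - xq) ** 2) / (2 * tau ** 2))        # distance-decreasing kernel weights
    A = np.column_stack([np.ones_like(X), X])            # design matrix [1, x]
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)        # weighted least squares solution
    return w[0] + w[1] * xq

X = np.linspace(0, 6, 13)
y = np.sin(X)                                            # a nonlinear target function
print(lwr_predict(1.5, X, y), np.sin(1.5))               # local linear estimate vs. the true value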

76

Prediction: Numerical Data

77

Prediction: Categorical Data

78

CLASSIFICATION ACCURACY

79

Classification Accuracy: Estimating Error

Rates

Partition: Training-and-testing

use two independent data sets, e.g., training set

(2/3), test set (1/3)

used for data set with large number of samples

Cross-validation

divide the data set into k subsamples

use k-1 subsamples as training data and one subsample

as test data—k-fold cross-validation

for data set with moderate size

Bootstrapping (leave-one-out)

for small size data
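A small sketch of k-fold cross-validation in plain Python (the "classifier" is just a stand-in majority-class predictor so the sketch stays self-contained, and the labeled data is synthetic):

from collections import Counter

def k_fold_accuracy(examples, k=10):
    # Split the data into k folds; train on k-1 folds, test on the held-out fold, average the accuracy
    folds = [examples[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        majority = Counter(label for _, label in train).most_common(1)[0][0]   # the "trained" model
        accs.append(sum(label == majority for _, label in test) / len(test))
    return sum(accs) / k

data = [(x, "yes" if x % 3 else "no") for x in range(60)]   # toy labeled data
print(k_fold_accuracy(data, k=10))                          # about 0.67 for this toy data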

80

Bagging and Boosting

General idea

(Figure: a classification method CM is applied to the original training data and to several altered versions of it, producing classifiers C, C1, C2, …; their predictions are aggregated into a combined classifier C*.)

81

Bagging

Given a set S of s samples

Generate a bootstrap sample T from S. Cases in S may not

appear in T or may appear more than once.

Repeat this sampling procedure, getting a sequence of k

independent training sets

A corresponding sequence of classifiers C1,C2,…,Ck is

constructed for each of these training sets, by using the

same classification algorithm

To classify an unknown sample X, let each classifier predict

or vote

The Bagged Classifier C* counts the votes and assigns X to

the class with the “most” votes
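A minimal bagging sketch following these steps. NumPy and scikit-learn decision trees as the base classifier are assumptions of this illustration, and the two-class data set is synthetic.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic two-class problem

k = 15
classifiers = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample T drawn from S with replacement
    clf = DecisionTreeClassifier(max_depth=2).fit(X[idx], y[idx])
    classifiers.append(clf)

def bagged_predict(x):
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return max(set(votes), key=votes.count)      # the class with the most votes

x_new = np.array([0.3, -0.1])
print(bagged_predict(x_new))                     # majority-vote prediction for a new point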

82

Boosting Technique — Algorithm

Assign every example an equal weight 1/N

For t = 1, 2, …, T Do

Obtain a hypothesis (classifier) h(t) under w(t)

Calculate the error of h(t) and re-weight the examples

based on the error. Each classifier is dependent on the

previous ones. Samples that are incorrectly predicted

are weighted more heavily

Normalize w(t+1) to sum to 1 (weights assigned to

different classifiers sum to 1)

Output a weighted sum of all the hypotheses, with each

hypothesis weighted according to its accuracy on the

training set
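A sketch of this reweighting loop in the style of AdaBoost. Using scikit-learn decision stumps as the weak learner is an assumption here, and the data set is synthetic; the loop follows the steps listed above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] - 0.5 * X[:, 1] > 0, 1, -1)     # labels in {-1, +1}

w = np.full(len(X), 1.0 / len(X))                    # every example starts with weight 1/N
stumps, alphas = [], []
for t in range(10):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = np.sum(w * (pred != y))                    # weighted error of h(t)
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)            # classifier weight based on its accuracy
    w *= np.exp(-alpha * y * pred)                   # incorrectly predicted samples are weighted more heavily
    w /= w.sum()                                     # normalize w(t+1) to sum to 1
    stumps.append(h); alphas.append(alpha)

def boosted_predict(x):
    score = sum(a * h.predict(x.reshape(1, -1))[0] for h, a in zip(stumps, alphas))
    return 1 if score > 0 else -1                    # weighted sum of all the hypotheses

print(boosted_predict(np.array([1.0, 0.0])))         # predicted class in {-1, +1} for a new point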

83

Bagging and Boosting

Experiments with a new boosting algorithm,

Freund et al. (AdaBoost)

Bagging Predictors, Breiman

Boosting Naïve Bayesian Learning on large subset

of MEDLINE, W. Wilbur

84

Summary

Classification is an extensively studied problem (mainly in

statistics, machine learning & neural networks)

Classification is probably one of the most widely used

data mining techniques with a lot of extensions

Scalability is still an important issue for database

applications: thus combining classification with database

techniques should be a promising topic

Research directions: classification of non-relational data,

e.g., text, spatial, multimedia, etc.

85

Thank you !!!

Questions

86

Based on the figure below, what is the valid production rule for the decision tree?

(Decision tree: the root asks "Teaching Job Application?". If Yes, test "Temp above 90?": No -> Decision = wear slacks; Yes -> Decision = wear jeans. If No, Decision = wear formal dress.)

87

Make a decision tree with root node Type from the data in the table below. The

first row contains attribute names. Each row after the first represents the values for

one data instance. The output attribute is Class.

Scale Type Color Texture Class

One One Light slick A

Two One Dark slick A

Two Two Dark slick B

Two One Light slick B

One Two Light slick C

88

Unsupervised evaluation can be internal or external. Why is comparing the sum of squared error differences between instances and their corresponding cluster centers, for each alternative clustering, an internal method for evaluating alternative clusterings produced by the K-Means algorithm? Explain your answer.

Why is sensitivity analysis a neural network explanation technique used to determine the relative importance of individual input attributes?

Explain why a feed-forward neural network is said to be fully connected when all nodes at one layer are connected to all nodes in the next higher layer.

The probability that a person owns a sports car, given that they subscribe to at least one automotive magazine, is 60%. 5% of the adult population subscribes to at least one automotive magazine. The probability that a person owns a sports car, given that they do not subscribe to any automotive magazine, is 35%. Use this information together with Bayes' theorem to compute the probability that a person subscribes to at least one automotive magazine, given that they own a sports car.
