
**What is the essence of density-based methods?**

Density-based clustering methods are data-driven: they partition the set of data objects and adapt to the distribution of those objects in the embedding space. To find clusters of arbitrary shape, clusters are modeled as dense regions in the data space, separated by sparse regions. The density of an object can be measured by the number of objects close to it.
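
As a sketch of this density notion, one can count how many points fall within an assumed radius `eps` of each object (the toy data and `eps` below are illustrative assumptions, not from the text):

```python
# Sketch: the density of a point = number of other points within radius eps.
from math import dist

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
eps = 1.5  # assumed neighborhood radius

def density(p, data, eps):
    """Count the other points within distance eps of p."""
    return sum(1 for q in data if q != p and dist(p, q) <= eps)

densities = [density(p, points, eps) for p in points]
print(densities)  # -> [3, 3, 3, 3, 0]; the isolated point (8, 8) is sparse
```

A density-based method would treat the four mutually close points as one dense region and the isolated point as noise.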

**What is the essence of grid-based methods?**

Grid-based clustering methods take a space-driven approach by partitioning the embedding space into cells which are independent of the distribution of the input objects. These methods divide the object space into a finite number of cells that form a grid structure. All of the operations for clustering are performed on this grid structure, resulting in a fast processing time.

**What is clustering tendency assessment?**

Applying a clustering method to a data set will return clusters, but these clusters may be meaningless and random. With clustering tendency assessment we can determine if a data set has a non-random structure, which may lead to meaningful clusters.

**How can the number of clusters be determined?**

Determining the number of clusters is not easy: it depends on the shape and scale of the distribution in the data set, and on the clustering resolution the user requires. A simple estimate is to set the number of clusters to about sqrt(n/2) for a data set of n points.
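
A minimal sketch of this rule of thumb:

```python
# Rule-of-thumb estimate: number of clusters ≈ sqrt(n/2) for n points.
from math import sqrt

def estimate_k(n):
    return round(sqrt(n / 2))

print(estimate_k(200))  # -> 10
print(estimate_k(50))   # -> 5
```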



In hierarchical clustering, the data is not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters that each contain a single object. Hierarchical clustering is subdivided into agglomerative methods and divisive methods. Agglomerative techniques are more commonly used. Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of analysis.

*Advantages:*

- Easy to implement

- Hierarchical clustering outputs a hierarchy, i.e., a structure that is more informative than the unstructured set of flat clusters returned by k-means. Therefore, it is easier to decide on the number of clusters by looking at the dendrogram.

*Disadvantages:*

- Very sensitive to outliers.

- The order of the data has an impact on the final results.

- Time complexity: not suitable for large datasets.

- It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.


**Contrast/compare agglomerative hierarchical clustering methods vs. divisive hierarchical clustering methods.**

In agglomerative hierarchical clustering, the hierarchy is read from the bottom up: the algorithm starts from the sub-components and moves toward the parent. Divisive clustering takes the opposite, top-down approach, in which the parent is visited first and then the children.

In the agglomerative hierarchical method, each object starts in its own cluster, and these clusters are grouped together into larger ones. This merging process carries on until all the singleton clusters are combined into one big cluster containing all the objects. In the divisive method, by contrast, the parent cluster is divided into smaller clusters, and the splitting continues until each cluster contains a single object.


**What is Cluster Analysis?**

• Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

• Cluster analysis

– Grouping a set of data objects into clusters

• Clustering is unsupervised classification: no predefined classes

• Typical applications

– As a stand-alone tool to get insight into data distribution

– As a pre-processing step for other algorithms

**What are some of the typical requirements of clustering in data mining?**

The following points describe typical requirements of clustering in data mining −

- Scalability − We need highly scalable clustering algorithms to deal with large databases.
- Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
- Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
- High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
- Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
- Interpretability − The clustering results should be interpretable, comprehensible, and usable.

**How is clustering used in applications? (discuss one application in your post)**

Clustering can be used in a shopping application to group customers based on their shopping habits. Once the clusters are formed we can observe what are the most often shopped items for said group and we can supply them with coupons and special offers for those items. This can result in higher sale rates of the items and increased customer retention rate.


Most partitioning methods are distance-based. The clusters are formed in an optimized way such that the objects within a cluster are “close”, meaning that they are related to each other, while objects in different clusters are “far apart”, meaning that they are very different.

**What is the k-medoids method?**

It is a modification of the k-means algorithm that diminishes sensitivity to outliers. Instead of using the mean value of the objects in a cluster as a reference point, it picks actual objects to represent the clusters. Each remaining object gets placed in the cluster with the representative object that is most similar to it. The result is that it groups n objects into k clusters by minimizing the absolute error.

**Which method is more robust (k-means or k-medoids) and why?**

When there are noise and outliers, k-medoids is more robust than k-means because a medoid is less influenced by them. However, each iteration of the k-medoids algorithm has complexity O(k(n-k)^2), which becomes very costly for large values of n and k. In such situations, it is better to use the k-means method.


Ensemble methods are used for increasing classification accuracy. An ensemble combines a series of k learned models, with the goal of creating an improved classification model. The individual classifiers vote and the ensemble returns a class label prediction based on the votes. Boosting, bagging, and random forests are popular ensemble methods.

**What is a neural network?**

A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights in order to be able to predict the correct class label for the input tuples. Neural networks can take a long time to train and therefore should be left for applications where this isn’t a concern.

**What are the advantages and disadvantages of neural networks?**

Advantages: High tolerance to noisy data, the ability to classify patterns on which they have not been trained, suitability for continuous-valued inputs and outputs, and success on real-world data such as pathology and handwritten character recognition.

Disadvantages: Long training times, poor interpretability by humans, and a number of parameters that are typically best determined empirically, such as the network structure.


**What are Bayesian classifiers?**

Bayesian classifiers are statistically based classifiers that can predict the probability that a given tuple belongs to a particular class. They are based on Bayes' theorem, and these algorithms are comparable in performance with decision tree and neural network classifiers. They have high accuracy and speed on large data sets.

**How does the naïve Bayesian classifier work?**

The naïve Bayesian classifier represents each data point as an attribute vector. If there are m classes C_1, ..., C_m, then given a vector X the classifier predicts the class C_i with the highest posterior probability: it computes the probability of every class label and chooses the highest one.
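
A minimal sketch of that prediction step, on an assumed toy training set (the attribute values and labels below are illustrative, not from any real data):

```python
# Naive Bayes sketch: pick the class maximizing P(C) * prod_i P(x_i | C),
# with probabilities estimated by counting in a toy training set.
train = [  # (attribute vector, class label) -- assumed toy data
    (("youth", "yes"), "buys"),
    (("youth", "no"), "not"),
    (("senior", "yes"), "buys"),
    (("senior", "no"), "buys"),
]

def predict(x):
    best, best_p = None, -1.0
    for c in {label for _, label in train}:
        rows = [f for f, label in train if label == c]
        p = len(rows) / len(train)          # prior P(C)
        for i, v in enumerate(x):           # naive independence assumption
            p *= sum(1 for f in rows if f[i] == v) / len(rows)
        if p > best_p:
            best, best_p = c, p
    return best

print(predict(("youth", "yes")))  # -> buys
```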

**How effective are Bayesian classifiers?**

As mentioned earlier, Bayesian classifiers are comparable to decision trees and neural networks. This varies by domain, but in theory Bayesian classifiers have the lowest error rate of all classification approaches. In practice, of course, this depends on the quality of the given data set.

**What is rule-based classification?**

Rule-based classification uses a classifier represented by IF-THEN rules, which are a good way to represent information or bits of knowledge. An example of such a rule goes as follows:

R1: IF age = youth AND student = yes THEN buys_computer = yes.

The IF part of the rule is called the rule antecedent, also known as the precondition. The THEN part is the rule consequent.
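
Rule R1 can be sketched directly as code; the default label returned when no rule fires is an assumption for illustration:

```python
# Sketch of rule-based classification: the antecedent is the IF condition,
# the consequent assigns the class label.
def classify(t):
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if t["age"] == "youth" and t["student"] == "yes":
        return "yes"
    return "unknown"  # no rule fired; default handling is an assumption

print(classify({"age": "youth", "student": "yes"}))  # -> yes
```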


Decision trees are used by providing a test data set for which we are trying to predict the class labels. Each tuple is tested at the non-leaf nodes, tracing a path from the root to a leaf, which determines the class label. Decision trees are popular because they do not require any domain knowledge about the data set or any parameter setting, which makes them very useful for discovering more information about your data. They are also popular because they can handle multidimensional data sets and are very easy to understand compared to other models.

**What do we understand by "decision tree induction"?**

Decision tree induction refers to the learning of decision trees from class-labeled training sets. A decision tree is a flowchart-like tree structure in which each non-leaf node performs a test on an attribute, each branch represents an outcome of the test, and the leaf nodes contain the class labels.

**What is "tree pruning" and how does it work?**

Tree pruning addresses the problem of overfitting the data. Some methods use statistical measures to remove the least reliable branches. There are two common approaches to tree pruning. The first is prepruning, which halts the construction of the tree early by deciding not to further split the current node, making it a leaf. Statistical measures such as information gain or the Gini index can be used to make this decision. Postpruning, the more common approach, removes subtrees from the fully grown tree and replaces them with leaves.

**What is the Gini index and how is it calculated?**

The Gini index is used in the CART algorithm to measure the impurity of the training data. To calculate the Gini index, subtract from 1 the sum of the squared probabilities that a tuple belongs to each class.
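
A minimal sketch of that calculation (the 9/5 class split below is a toy example):

```python
# Gini index sketch: Gini(D) = 1 - sum over classes of p_i^2,
# where p_i is the fraction of tuples in D belonging to class i.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["yes"] * 9 + ["no"] * 5))  # 1 - (9/14)^2 - (5/14)^2 ≈ 0.459
print(gini(["yes"] * 4))               # pure node -> 0.0
```

A Gini index of 0 means the node is pure; larger values mean more class mixing.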


]]>Classification is a data analysis task in which a classifier (a model) is created to predict categorical labels (class labels), such as “popular” or “unpopular”, “safe” or “risky”.

**How does classification work?**

Classification is a two-step process. First, the classification model is built, which describes a predetermined set of concepts or data classes (called the learning step). Secondly, the model is applied to the given data to predict class labels for it (called the classification step).

**Compare supervised learning vs. unsupervised learning.**

In supervised learning, the classifier learns by being told which class each training tuple belongs to. In unsupervised learning, the class label of each training tuple is not known. Instead, clustering is used to determine groups of similar tuples.

**What is a training set and how is it used?**

A training set is a set of training tuples which is made up of database tuples and their class labels. The classification algorithm analyzes it during the learning step to “learn from” it and build the classifier.

**What is a test set and how is it used?**

A test set consists of test tuples and their class labels, which are independent of the training tuples. It is used to check the accuracy of a classifier, which is the percentage of test set tuples that are correctly classified by it.


An uninteresting association rule is a misleading “strong” association rule. Just because an association rule meets the minimum support and confidence thresholds with high values doesn’t necessarily mean the rule is interesting. In fact, it might be misleading due to the individual occurrence frequency of each item.

**Why can the confidence of a rule A -> B be sometimes deceiving?**

The confidence of A -> B doesn’t measure the real strength of the correlation and implication between A and B: it ignores the baseline probability of B, so a rule can have high confidence even when A and B are negatively correlated. Such a rule is an uninteresting association rule.


**What is a correlation rule?**

A correlation rule is a strong association rule that has been augmented with a correlation measure: the rule is measured not only by its support and confidence but also by the correlation between the items. We can do correlation analysis using the χ² measure or the lift measure. Under the lift measure, the occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated.
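
A hedged sketch of the lift calculation over assumed toy transactions:

```python
# Lift sketch: lift(A, B) = P(A ∪ B) / (P(A) P(B)).
# > 1: positive correlation, < 1: negative correlation, = 1: independence.
transactions = [{"game", "video"}, {"game"}, {"video"}, {"game", "video"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a, b):
    return support(a | b) / (support(a) * support(b))

print(round(lift({"game"}, {"video"}), 3))  # -> 0.889, slightly negative
```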

**Discuss any one of the pattern evaluation measures.**

Max_Confidence measure: Given two itemsets (A and B), the max_confidence measure of A and B is max_conf(A, B) = max{P(A|B), P(B|A)}. The max_conf measure refers to the maximum confidence of the two association rules, A->B and B->A.
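
A minimal sketch of the max_confidence calculation, using assumed toy transactions:

```python
# max_conf(A, B) = max{P(A|B), P(B|A)}, i.e. the larger of the confidences
# of the two directional rules A -> B and B -> A.
transactions = [{"milk", "bread"}, {"milk"}, {"milk", "bread"}, {"bread"}]

def support(s):
    return sum(1 for t in transactions if s <= t) / len(transactions)

def max_conf(a, b):
    both = support(a | b)
    return max(both / support(a), both / support(b))

print(max_conf({"milk"}, {"bread"}))  # 2/3 in either direction here
```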


There are two nontrivial costs that the Apriori method can suffer from. First, it may need to generate an enormous number of candidate sets. The more frequent 1-itemsets there are, the more candidate 2-itemsets it will need to generate. Second, every time it generates candidate itemsets it needs to scan the database and check them via pattern matching. This can lead to a huge number of scans, which is very costly and reduces efficiency.

**What is the essence of the frequent pattern growth, or FP-growth method?**

It uses a divide-and-conquer strategy to tackle the efficiency problems of the Apriori method. First, it builds a frequent pattern tree (FP-tree), which retains the itemset association information in compressed form. Then it divides this compressed database into a set of conditional databases, each associated with one frequent item, and mines each of them separately. For each frequent item, only its associated data sets are inspected.

**What is the difference between the horizontal data format and the vertical data format?**

For example, we have a database of purchases and items. In a horizontal data format, we would have [purchaseID : itemset] where purchaseID is the key and itemset is the set of items bought in that purchase. In a vertical data format, we would have [item : purchaseID_set] where item is the key and purchaseID_set is a set of all the purchases that include that item.
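
The conversion from the horizontal to the vertical format can be sketched as follows (the purchase data below is a toy assumption):

```python
# Convert horizontal format [purchaseID : itemset] into
# vertical format [item : purchaseID_set].
horizontal = {1: {"milk", "bread"}, 2: {"milk"}, 3: {"bread", "eggs"}}

vertical = {}
for pid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(pid)

print(vertical["milk"])  # -> {1, 2}: the purchases that include milk
```

In the vertical format, the support of an itemset is just the size of the intersection of its items' purchaseID sets.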

**Why, in practice, is it more desirable in most cases to mine the set of closed frequent itemsets rather than the set of all frequent itemsets?**

Mining the set of all frequent itemsets in a large collection of items can lead to an absurdly high amount of derivations, becoming extremely expensive. For this reason, mining the set of closed frequent itemsets is preferred.

**What methods can be used to mine closed frequent itemsets?**

Item merging: If every transaction that contains a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X U Y results in a frequent closed itemset, therefore there is no need to look for any itemset containing X but no Y.

Sub-itemset pruning: If a frequent itemset X is a proper subset of previously discovered frequent closed itemset Y and the support count for both X and Y are equal, then X and all of X’s descendants in the set enumeration tree can’t be frequent closed itemsets and can be removed.

Item skipping: When doing depth-first mining of closed itemsets, on each level there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables and on different levels, we can remove p from the header tables on the higher levels.


**What is the Apriori property and how is it employed to improve the efficiency of the algorithm?**

The Apriori property states that "all nonempty subsets of a frequent itemset must also be frequent". To elaborate, if an itemset I does not satisfy the minimum support threshold, then I is not frequent. If an item A is added to I, the resulting itemset cannot occur more frequently than I, so I ∪ A must not be frequent either. This property belongs to the antimonotonicity category of properties, which means that if a set cannot pass a test, then all of its supersets will also fail that test. The Apriori algorithm employs this property in its two-step candidate generation, using join and prune actions to discard any candidate that has an infrequent subset.

**What are some of the techniques that are used to improve the efficiency of the Apriori algorithm?**

Some techniques used to improve efficiency include the hash-based technique and transaction reduction. In the hash-based technique, while each transaction in the database is scanned to create the frequent 1-itemsets, a hash table is formed from the 2-itemsets of each transaction, incrementing the corresponding bucket counts. A 2-itemset whose bucket count is below the support threshold cannot be frequent and should therefore be removed from the candidate set.

Transaction reduction is the removal of transactions that contain no frequent k-itemsets, because such transactions cannot contain any frequent (k+1)-itemsets either. This removes unrelated transactions from consideration in subsequent scans.


**What do we understand by “frequent patterns”? How are they used in data mining? Please provide examples.**

Frequent patterns are relationships within a given data set that repeatedly show up. They are used in data mining to discover associations and correlations between items in data sets. The market basket analysis is an example of frequent pattern mining. For example, if a customer buys milk, how often do they buy cereal, and if they do, what kind of cereal do they buy? The discovery of this pattern can be used to increase sales by creating special promotions for the purchase of such products together. If a customer buys ice cream and cookies, placing a stand with cookies near the ice cream will increase the sale of both products.

**What is an association rule? What do the concepts of support and confidence associated to association rules mean? Please provide examples.**

An association rule is a way to represent the relation between items that are frequently associated or purchased together, with both support and confidence measures (which are measures of interestingness). For example: bananas => ice cream [support = 10%, confidence 20%]. This support value means that 10% of the time, ice cream and bananas are bought together. This confidence value means that 20% of the time, customers who buy bananas also buy ice cream.
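
These two measures can be sketched in a few lines over toy transactions; the data below echoes the bananas/ice cream example but does not reproduce its exact 10%/20% figures:

```python
# support(A => B) = fraction of transactions containing both A and B;
# confidence(A => B) = support(A ∪ B) / support(A).
transactions = [
    {"bananas", "ice cream"}, {"bananas"}, {"bananas", "ice cream"},
    {"ice cream"}, {"bananas"}, {"cookies"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    return support(a | b) / support(a)

print(support({"bananas", "ice cream"}))       # 2 of 6 transactions
print(confidence({"bananas"}, {"ice cream"}))  # 2 of the 4 banana purchases
```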

**What is P(A|B) -probability of A given B-?**

It is equivalent to confidence(B=>A), meaning what is the percentage of transactions containing B that also contain A.

**Explain using examples the definitions of closed itemset, closed frequent itemset, and maximal frequent itemset.**

Closed itemset: An itemset X is closed in data set D if there is no proper superset Y of X in D such that Y has the same support count as X. For example, if {milk, bread} and its superset {milk, bread, eggs} have the same support count, then {milk, bread} is not closed.

Closed frequent itemset: An itemset X is a closed frequent itemset in data set D if it’s both closed and frequent in D.

Maximal frequent itemset: An itemset X is a maximal frequent itemset in data set D if X is frequent and there is no proper superset Y of X that is frequent in D.

**What are the steps of association rule mining?**

- Find all frequent itemsets, each of these itemsets will show up at least as frequently as the minimum support count (min_sup).
- Generate strong association rules from the frequent itemsets; these rules must satisfy the minimum support and minimum confidence thresholds.
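
The first step can be sketched by brute force over toy transactions; real miners use Apriori or FP-growth instead, but the definitions of frequency and minimum support (`min_sup` below is an assumed threshold) are the same:

```python
# Brute-force frequent itemset mining: count every candidate itemset and
# keep those meeting the minimum support count.
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
min_sup = 2  # assumed minimum support count

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_sup:
            frequent[cand] = count

print(frequent)  # ('a','b','c') is absent: it appears only once
```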


**Discuss the steps associated to the design of a data warehouse.**

The first step is to choose a business process to model. If the business process is organizational and involves multiple complex object collections, a data warehouse model is chosen. If it’s departmental and analyzes only one kind of business process, then a data mart model is chosen instead.

The next step is to choose the business process grain, which is essentially the fundamental, atomic level of data to be represented in the fact table.

Then it’s time to choose the dimensions that will apply to each fact table record.

Lastly, you choose the measures that will populate each fact table record. Often they are numeric additive quantities.

**Compare the waterfall and the spiral methods as methodologies to develop a data warehouse.**

The waterfall method used to design a data warehouse performs structured and systematic analysis at each step before proceeding to the next, which is how it gets the name, “waterfall.” In comparison, the spiral method involves the rapid generation of increasingly functional systems with short intervals between successive releases.

**Compare/contrast the three main types of data warehouse usage: information processing, analytical processing, and data mining.**

All three of the main types of data warehouse usage have a common factor: they all analyze data in some way. For example, information processing supports the use of querying, statistical analysis, and reports using tables, charts, and/or graphs. On the other hand, analytical processing generally operates on historic data. The benefit it has over information processing is the multidimensional data analysis of data warehouse data. Lastly, data mining is different from the two in that it supports knowledge discovery by finding hidden patterns and associations by constructing analytical models.

**Please discuss the following statement given on page 155 of our textbook: “among the many different paradigms and architectures of data mining systems, multidimensional data mining is particularly important”.**

The statement alludes to the explanation given as to why multidimensional data mining is so important. The book begins by explaining how high-quality data is stored within data warehouses because it has already gone through preprocessing steps to ensure quality. Another important part of multidimensional data mining is how data analysis infrastructures have been (or will be) systematically constructed surrounding data warehouses. It also provides an online selection of data mining functions. Because users may not always know the specific kinds of knowledge they want to mine, integrating OLAP with various mining functions provides users with the flexibility to select the desired mining functions and swap data mining tasks dynamically.


**What do we understand by “multidimensional data model”? What is a “data cube”?**

A multidimensional data model is a model for databases which are generally organized around a theme (e.g., sales) and have multiple dimensions of study, such as *time*, *item*, and *location*. A data cube allows such data to be modeled and viewed in these multiple dimensions; it is defined by dimensions (the perspectives of analysis) and facts (the numeric measures being studied).

**Explain in your own words the following concepts and use an example to illustrate your explanations: snowflake schema, fact constellation, and star schema.**

Star schema: This schema consists of a large central fact table, which contains the bulk of the data with no redundancy, and a set of smaller dimension tables. Pictured, the schema resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

Snowflake schema: A snowflake schema is a variant of the star schema. Some dimension tables are normalized in this schema, which splits the data into additional tables. The resulting graph ends up forming a shape like that of a snowflake.

Fact constellation: A fact constellation lets you share dimension tables between fact tables. This schema can be imagined as a collection of stars (hence the name fact *constellation*).

**What is a data cube measure? Any examples?**

A data cube measure is a numeric function that can be evaluated at each point in the data cube space. A value is computed for a given point by aggregating the data corresponding to the dimension-value pairs defining that point. Examples are distributive measures such as count(), sum(), min(), and max().

**Explain and provide an example of an OLAP operation for multidimensional data.**

OLAP operations exist to materialize the different views available in a multidimensional model. This allows for interactive querying and analysis of the data at hand. An example of an OLAP operation would be the roll-up operation. The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension, or by dimension reduction.
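
A minimal sketch of a roll-up that climbs a location hierarchy from city to country (the sales figures below are toy assumptions):

```python
# Roll-up sketch: aggregate the sales measure from the city level
# up to the country level of the location dimension.
sales = [("Vancouver", "Canada", 100), ("Toronto", "Canada", 150),
         ("Chicago", "USA", 200)]

rolled_up = {}
for city, country, amount in sales:
    rolled_up[country] = rolled_up.get(country, 0) + amount

print(rolled_up)  # -> {'Canada': 250, 'USA': 200}
```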
