1. What’s an attribute? What’s a data instance?

- What’s noise? How can noise be reduced in a dataset?
- Define outlier. Describe 2 different approaches to detect outliers in a dataset.
- Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.
- Given a sample dataset with missing values, apply an appropriate technique to deal with them.
- Give 2 examples in which aggregation is useful.
- Given a sample dataset, apply aggregation of data values.
- What’s sampling?
- What’s simple random sampling? Is it possible to sample data instances using a distribution other than the uniform distribution? If so, give an example of a non-uniform probability distribution over the data instances (uniform meaning that every instance is equally likely to be chosen).
- What’s stratified sampling?
- What’s “the curse of dimensionality”?
- Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix A and your lecture notes.] State what the input and the output of PCA are.
- What’s the difference between dimensionality reduction and feature selection?
- Describe in detail 2 different techniques for feature selection.
- Given a sample dataset (represented by a set of attributes, a correlation matrix, a covariance matrix, …), apply feature selection techniques to select the best attributes to keep (or equivalently, the attributes to remove).
- What’s the difference between feature selection and feature extraction?
- Give two examples of data in which feature extraction would be useful.
- Given a sample dataset, apply feature extraction.
- What’s data discretization and when is it needed?
- What’s the difference between supervised and unsupervised discretization?
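
As a warm-up for the discretization questions, here is a minimal sketch of the two common unsupervised schemes, equal width and equal frequency. The function names and the `ages` values are made up for illustration, and other binning conventions (for example, different handling of ties or of the interval boundaries) are equally valid.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may end up in
    different bins (a simplification of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical sample attribute.
ages = [22, 25, 27, 31, 38, 45, 52, 60, 64]
print(equal_width_bins(ages, 3))      # [0, 0, 0, 0, 1, 1, 2, 2, 2]
print(equal_frequency_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Neither function looks at a class label, which is what makes both schemes unsupervised. Note how equal width leaves the bins unevenly populated (4, 2, and 3 values), while equal frequency puts 3 values in each.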

- Given a sample dataset, apply unsupervised (e.g., equal width, equal frequency) discretization, or supervised discretization (e.g., using entropy).
- Describe 2 approaches to handle nominal attributes with too many values.
- Given a dataset, apply a variable transformation: either a given simple function, normalization, or standardization.
- Define correlation and covariance, and explain how to use them in data pre-processing (see pp. 76-78).
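
To practice the last two items, the following sketch computes min-max normalization, z-score standardization, covariance, and Pearson correlation using only the standard library. It assumes the sample (n - 1) versions of the variance and covariance; your textbook or lecture notes may use a different convention. The function names and the height/weight values are hypothetical.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def min_max_normalize(xs):
    """Rescale values linearly to the range [0, 1] (assumes max > min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def sample_std(xs):
    """Sample standard deviation (divides by n - 1; an assumption here)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def standardize(xs):
    """Z-scores: subtract the mean, divide by the standard deviation."""
    m, s = mean(xs), sample_std(xs)
    return [(x - m) / s for x in xs]


def covariance(xs, ys):
    """Sample covariance of two equally long attribute columns."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Made-up data in which weight is an exact linear function of height.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 66, 74, 82]

print(min_max_normalize(height_cm))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(covariance(height_cm, weight_kg))  # 200.0
print(correlation(height_cm, weight_kg)) # very close to 1.0
```

Covariance depends on the units of the attributes, while correlation is unit-free and always lies in [-1, 1]. In pre-processing, a pair of attributes whose correlation is close to 1 or -1 carries largely redundant information, so one of the two is a candidate for removal during feature selection.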