Topic outline
Exploratory Analysis and Clustering
Attribute-based data sets. Preparing and loading the data into Orange Data Mining software. Data analysis workflows. Scatterplot and box plot. Hierarchical clustering: distances between data items, distances between clusters, agglomerative approach to data clustering. Cluster explanation.
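The agglomerative approach named above can be sketched in a few lines of plain Python. This is only an illustration, not how Orange implements it: the function names (`average_linkage`, `agglomerative`), the choice of Euclidean distance and average linkage, and the five toy points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage(a, b):
    """Distance between two clusters: mean of all pairwise item distances."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerative(points, k, linkage=average_linkage):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]  # start with every item in its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical toy data: three points near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerative(points, k=2)
print(clusters)
```

Swapping `average_linkage` for a function that returns the minimum (single linkage) or maximum (complete linkage) pairwise distance changes how distances between clusters are measured without touching the merging loop.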
Video lectures: Orange workflows, data exploration, workflow management, your own data, clustering-theory, clustering in 2d, clustering of multi-dimensional data, and clustering of zoo data set.
Regression Models and Regularization
Linear regression. The shape of the model. Optimization function. Polynomial expansion. Overfitting. Regularization. Accuracy on training and test set. Evaluating the accuracy of regression models. Feature scoring and selection.
Video lectures: introduction to regression, linear regression, overfitting, regularization, training and test sets, L1 and L2 regularization, and model scoring with RMSE and R2.
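To make the link between regularization and train/test accuracy concrete, here is a minimal sketch in plain Python, assuming a single input feature and an L2 (ridge) penalty on the slope only. The function names, the penalty strength `lam`, and the tiny data set are hypothetical and chosen for illustration; Orange itself is not used.

```python
from math import sqrt


def fit_ridge_1d(xs, ys, lam=0.0):
    """Fit y = w*x + b by minimizing squared error + lam * w**2.

    The intercept b is not penalized; lam = 0 gives ordinary least squares.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / (sxx + lam)  # lam > 0 shrinks the slope toward zero
    b = my - w * mx
    return w, b


def rmse(ys, preds):
    """Root mean squared error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def r2(ys, preds):
    """Coefficient of determination: 1 - residual SS / total SS."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data generated from y = 2x + 1, split into train and test.
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
test_x, test_y = [4.0, 5.0], [9.0, 11.0]

for lam in (0.0, 5.0):
    w, b = fit_ridge_1d(train_x, train_y, lam)
    preds = [w * x + b for x in test_x]
    print(f"lam={lam}: w={w:.2f}, b={b:.2f}, test RMSE={rmse(test_y, preds):.3f}")
```

On this noise-free data the penalty can only hurt; regularization pays off when the model is flexible enough (for example after polynomial expansion) to fit noise in the training set.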
Classification Models
Prediction models and how they differ from clusterings. Classification trees as an example of an intuitive, early prediction model. The naive Bayesian model as an efficient, yet limited, model. Linear models, e.g. logistic regression.
- (In the lecture, we used the term "classification tree" to avoid confusion with other, unrelated trees of the same name.)
- Quinlan is the author of one of the first and most influential algorithms for induction of classification trees. The article is mostly of historical interest, but it shows the thinking of the pioneers of AI. After some philosophy in the first two sections, it explains the reasoning behind the tree-induction algorithms.
- A more mathematical (compared to our lecture), but still friendly, explanation of logistic regression. I recommend reading the first 6 pages, that is, Section 12.1 and the (complete) Section 12.2.
(This is Chapter 12 from Advanced Data Analysis from an Elementary Point of View. You can download the draft from the author's site.)
- A quick derivation of the Naive Bayesian classifier, and derivation and explanation of nomograms.
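As a companion to the naive Bayesian classifier mentioned in this section, the following is a minimal sketch for categorical features, assuming Laplace smoothing of the conditional probabilities. It is not the derivation from the reading; the function names, the two features (has feathers, lays eggs), and the five training rows are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values


def predict_proba(model, row):
    """P(class | row), treating features as conditionally independent."""
    class_counts, value_counts, values = model
    n = sum(class_counts.values())
    scores = {}
    for c, nc in class_counts.items():
        score = nc / n  # prior P(c)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of P(feature_i = v | c)
            score *= (value_counts[(i, c)][v] + 1) / (nc + len(values[i]))
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical toy data: (has_feathers, lays_eggs) -> class
rows = [("yes", "yes"), ("yes", "yes"), ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["bird", "bird", "mammal", "mammal", "mammal"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("yes", "yes"))
print(probs)
```

The model is efficient because training is a single counting pass; it is limited because the independence assumption ignores interactions between features.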
Model Evaluation
Performance scores: classification accuracy, sensitivity and specificity, precision and recall, ... Performance curves and related score(s). Cross validation.
- I recommend reading (at least) the introduction.
- A Wikipedia page with a list of scores. Useful for lookup.
- A very accessible paper about ROC curves. You'll need to read this paper (I recommend the first seven sections) to solve your homework.
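The performance scores listed in this section all follow from the four cells of a confusion matrix, and the area under the ROC curve (AUC) can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below assumes binary labels coded as 1 (positive) and 0 (negative) and a decision threshold of 0.5; the function names and the eight predictions are hypothetical.

```python
def classification_scores(actual, predicted):
    """Accuracy, sensitivity (recall), specificity and precision for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def auc(actual, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    pos = [p for a, p in zip(actual, probs) if a == 1]
    neg = [p for a, p in zip(actual, probs) if a == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of the positive class.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
predicted = [int(p >= 0.5) for p in probs]  # threshold at 0.5

scores = classification_scores(actual, predicted)
area = auc(actual, probs)
print(scores, area)
```

Changing the threshold moves sensitivity and specificity in opposite directions, which is what the ROC curve traces; the AUC does not depend on any single threshold.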
Data Projection and Embedding. Image Analytics.
A short and simple read about image analytics, covering the construction of image maps (using MDS; t-SNE could be used instead) and the classification of images. Image maps, classification, and all other applications of machine learning to images are made possible by embedding images into vector spaces.