Tutorials

Data mining tutorials across interdisciplinary topics. We add to and revise these tutorials from time to time, so please be sure to read or download their most recent versions.

Data Mining

A Unicode Mnemonic for Vowels with Diacritics

A mnemonic for easily recalling the Unicode entities of vowels with grave accents, acute accents, and circumflexes is described.

Building Curated Collections

A brief illustrated guide to building curated collections with Minerazzi. The Panama Papers serve as an application example.

Cosine Similarity Tutorial

This is a tutorial on the cosine similarity measure. Its meaning in the context of uncorrelated and orthogonal variables is examined.
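
As a minimal sketch of the measure (the function names `cosine_similarity` and `center` are illustrative and not taken from the tutorial), the following computes the cosine of the angle between two vectors. It also illustrates one way to read the orthogonal versus uncorrelated distinction: orthogonal vectors have a zero cosine as given, while uncorrelated variables have a zero cosine after mean-centering.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def center(v):
    """Subtract the mean from every component (hypothetical helper)."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

# Orthogonal but correlated: the raw cosine is 0, yet the centered
# cosine (Pearson's r) is close to -1.
x, y = [1, 0], [0, 1]
raw = cosine_similarity(x, y)
centered = cosine_similarity(center(x), center(y))

# Uncorrelated but not orthogonal: the centered cosine is 0, yet the
# raw cosine is not.
p, q = [1, 2, 3], [1, 0, 1]
raw_pq = cosine_similarity(p, q)
centered_pq = cosine_similarity(center(p), center(q))
```

The two pairs are toy data chosen only to separate the two notions; they are not examples from the tutorial.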

Levenshtein Distance Tutorial

Did you know that Levenshtein Distance is at the heart of sequence analysis and text mining-based technologies? It is simple, elegant, and relevant to many research fields.
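
For readers who want to see the idea in code, here is a minimal sketch of the standard dynamic-programming formulation, assuming unit costs for insertion, deletion, and substitution. The function name `levenshtein` is illustrative and the tutorial may present the computation differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

distance = levenshtein("kitten", "sitting")  # 3
```

Only two rows of the table are kept in memory, which is enough when just the distance, and not the edit script, is needed.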

Distance and Similarity Tutorial

Learn the difference between these two association measures as used in data mining and information retrieval.

Statistics

On the Nonadditivity of Correlation Coefficients
Part 3: On the Bias & Nature of Correlation Coefficients

This is Part 3 of a tutorial series on the nonadditivity of correlation coefficients. The bias & nature of correlation coefficients, their transformations, and assumptions of normality are discussed. Did you know that score-to-rank transformations can change the sampling distribution of a statistic like a correlation coefficient, and that Fisher transformations are sensitive to normality violations? Combining both types of transformations is a recipe for a statistical disaster.

On the Nonadditivity of Correlation Coefficients
Part 2: Fisher Transformations

This is Part 2 of a tutorial series on the nonadditivity of correlation coefficients. We discuss Fisher r-to-Z and Z-to-r transformations and the risks of applying them arbitrarily. These types of transformations are only valid if the x, y variates from which the correlations are computed are normally distributed (bivariate normality).
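
The two transformations are the inverse hyperbolic tangent and the hyperbolic tangent. A minimal sketch follows; the function names are illustrative, and the code does nothing to check the bivariate normality condition stated above, which remains the caller's responsibility.

```python
import math

def fisher_r_to_z(r: float) -> float:
    """Fisher r-to-Z transformation: Z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)

def fisher_z_to_r(z: float) -> float:
    """Inverse (Z-to-r) transformation: r = tanh(Z)."""
    return math.tanh(z)

z = fisher_r_to_z(0.5)
r_back = fisher_z_to_r(z)  # recovers 0.5 up to floating-point error
```

The guard on `r` exists because the transformation diverges as r approaches 1 or -1.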

On the Nonadditivity of Correlation Coefficients
Part 1: Pearson's r and Spearman's rs

This is Part 1 of a tutorial series on the nonadditivity of correlation coefficients. We demonstrate why it is not possible to arithmetically add, subtract, or average Pearson's r or Spearman's rs. We show that since these are cosines they are not additive.
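
A small numeric illustration of the cosine argument, offered as a sketch rather than as the tutorial's own demonstration: treat two correlation values as cosines of angles and compare their arithmetic sum with the cosine of the summed angles. The values 0.8 and 0.6 are arbitrary, and `compare_sum_with_cosine_of_sum` is a hypothetical name.

```python
import math

def compare_sum_with_cosine_of_sum(r1: float, r2: float):
    """Return (r1 + r2, cos(theta1 + theta2)) where r_i = cos(theta_i)."""
    theta1 = math.acos(r1)
    theta2 = math.acos(r2)
    naive_sum = r1 + r2                     # arithmetic on the cosines
    cos_of_sum = math.cos(theta1 + theta2)  # arithmetic on the angles
    return naive_sum, cos_of_sum

naive_sum, cos_of_sum = compare_sum_with_cosine_of_sum(0.8, 0.6)
# naive_sum is about 1.4, which is not even a valid correlation value,
# while cos_of_sum is about 0, since 0.8 * 0.6 - 0.6 * 0.8 = 0.
```

Angles add; their cosines do not, which is the sense in which quantities of this kind are nonadditive.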

A Tutorial on Polynomial Regression through Linear Algebra

This is a tutorial on polynomial regression. Three different methods for fitting paired data to a polynomial are presented. The first two are based on linear algebra while the last one is a graphic solution. These methods can be easily implemented with Excel, by writing a computer program, or with a programmable calculator.

A Quantile-Quantile Plots Tutorial

Quantile analysis by means of constructing quantile-quantile plots (qq plots) is a technique for determining if different data sets originate from populations with a common distribution. It also works as a Normality Test. The technique is applicable to a wide range of data mining and engineering problems.

A Tutorial on Standard Errors

This is an introductory tutorial on standard errors. Every statistical estimate has its own Standard Error. Using an incorrect definition for a standard error invalidates the results of any study.

The Self-Weighting Model: Part 2

This is the second of a two-part tutorial on the Self-Weighting Model (SWM). In this part, we derive the model and show how it could be used for a broad range of engineering, science, information retrieval, and data mining problems where conditional weighted means must be computed.

The Self-Weighting Model: Part 1

This is the first of a two-part tutorial on the Self-Weighting Model (SWM), a framework for computing weighted means. In this part, we show how the model provides a solution to the problem of computing valid averages from nonadditive quantities.

Chemistry

Mnemonic and Heuristic for Estimating Spin-only Magnetic Moments

A mnemonic and heuristic, based on a continued fractions algorithm, for estimating spin-only magnetic moments of free atoms and ions are presented. Chemistry teachers and students might find these useful as problem-solving strategies for lectures and test sessions where calculators might not be permitted or available.

A Novel Mnemonic for the Rydberg Rule

This tutorial presents a new mnemonic describing the filling order of atomic orbitals according to the Rydberg Rule. The mnemonic accounts for the reordering of atomic orbitals and the large orbital energy gaps responsible for the periodicity of elements. The tutorial concludes by suggesting the possibility of using the mnemonic for chemical properties classification; that is, for studying similarities and differences between elements located on either side of the energy gaps.

Best Match Models

Development of BM25IR: A Best Match Model based on Inverse Regression

Power transformations can be used as a common framework for the derivation of local term weights. As a practical application, a Best Match model based on inverse regression (BM25IR) is derived from these transformations. Simulations suggest that BM25IR works fairly well for different BM25 parametric conditions and document lengths.

A Tutorial on the BM25F Model

This is a tutorial on the Best Match 25 Model with Extension to Multiple Weighted Fields, also known as the BM25F Model. Unlike BM25, the model is applicable to structured documents consisting of multiple fields. The model preserves term frequency nonlinearity and removes the independence assumption between same term occurrences.

A Tutorial on OKAPI BM25 Model

This is a tutorial on OKAPI BM25, a Best Match model where local weights are computed as parameterized frequencies and global weights as RSJ weights. Local weights are based on a 2-Poisson model and the verbosity and scope hypotheses, and global weights on the Robertson-Spärck-Jones Probabilistic Model.
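
To make the split into local and global weights concrete, here is a minimal sketch of one commonly cited form of BM25. The function names, the defaults `k1 = 1.2` and `b = 0.75`, and the exact variant of the formula are assumptions for illustration; the tutorial may use a different parameterization.

```python
import math

def bm25_local_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Parameterized term frequency: saturates in tf, normalized by document length."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

def rsj_weight_no_relevance(num_docs, doc_freq):
    """RSJ weight when no relevance information is available (an IDF-like form)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term: local weight times global weight."""
    local = bm25_local_weight(tf, doc_len, avg_doc_len, k1, b)
    return local * rsj_weight_no_relevance(num_docs, doc_freq)

# A term occurring 3 times in an average-length document, in a collection
# of 1000 documents of which 10 contain the term (made-up numbers).
score = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=10)
```

In this form the local weight never exceeds `k1 + 1`, and the global weight turns negative for terms present in more than half of the documents.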

Robertson-Spärck-Jones Probabilistic Model Tutorial

A tutorial on the Robertson-Spärck-Jones Probabilistic Model. This model computes global weights, known as RSJ weights, based on Independence Assumptions and Ordering Principles for probable relevance. The model subsumes IDF and IDFP as RSJ weights in the absence of relevance information.

Vector Space Models

The Extended Boolean Model

This is Part 6 of a tutorial series on Term Vector Theory. The Extended Boolean Model is discussed. By varying the model p-norm, from p = 1 to p = ∞, its ranking behavior can go from that of a vector space-like model to that of a strict Boolean-like model.
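
The effect of p can be sketched with the commonly cited p-norm similarity formulas for unweighted OR and AND queries over term weights in [0, 1]. The function names and the choice of these particular formulas are assumptions for illustration, not necessarily the notation used in the tutorial.

```python
import math

def pnorm_or(weights, p):
    """p-norm similarity of a document to an OR query over its term weights."""
    if p == math.inf:
        return max(weights)  # limiting case: strict Boolean-like OR
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def pnorm_and(weights, p):
    """p-norm similarity of a document to an AND query over its term weights."""
    if p == math.inf:
        return min(weights)  # limiting case: strict Boolean-like AND
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

weights = [0.2, 0.8]  # made-up term weights for one document

# p = 1: OR and AND coincide (both reduce to the mean of the weights).
or_p1, and_p1 = pnorm_or(weights, 1), pnorm_and(weights, 1)

# p = infinity: OR picks the largest weight, AND the smallest.
or_inf, and_inf = pnorm_or(weights, math.inf), pnorm_and(weights, math.inf)
```

Intermediate values of p give rankings between these two extremes, with OR moving toward the maximum and AND toward the minimum as p grows.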

An Introduction to Global Weight Models and MySQL Implementation

This is Part 5 of a tutorial series on Term Vector Theory. Several global weight models are discussed and a brief introduction to a MySQL implementation of the Vector Space Model is presented.

An Introduction to Local Weight Models

This is Part 4 of a tutorial series on Term Vector Theory. An introduction to several local weight models is presented.

The Classic TF-IDF Vector Space Model

This is Part 3 of an introductory tutorial series on Term Vector Theory. The classic term frequency-inverse document frequency (TF-IDF) model is discussed.

The Binary and Term Count Models

This is Part 2 of an introductory tutorial series on Term Vector Theory as used in Information Retrieval and Data Mining. The Binary (BNRY) and Term Count (FREQ) models are discussed.

Term Vector Theory and Keyword Weights

This is Part 1 of an introductory tutorial series on Term Vector Theory as used in Information Retrieval and Data Mining. The concept of local and global term weights is briefly presented, and the idea of keyword density as a useful weighting scheme for ranking documents is debunked.

A Linear Algebra Approach to the Vector Space Model

This is a fast track tutorial on vector space calculations. A linear algebra approach is used. The tutorial covers term-document and term-query matrices, matrix transposition, dot products, cosine similarities, and local and global weights.

Vector Space Calculations without Linear Algebra

This is an introductory tutorial for those who are interested in vector space models but lack a linear algebra background. The calculations can be easily replicated with a spreadsheet, an online calculator, or by hand.

SVD and PCA

Singular Value Decomposition (SVD)

This fast track tutorial provides instructions for decomposing a matrix using the singular value decomposition (SVD) algorithm. The tutorial covers singular values, right and left singular vectors (eigenvectors), and a shortcut for computing the full SVD of a matrix.

PCA and SPCA Tutorial

This is a tutorial on Principal Component Analysis (PCA) and one of its variants, Standardized PCA (SPCA). Both are techniques for identifying unknown trends in multidimensional data sets.

Internet Engineering

MTU and MSS Tutorial

This tutorial covers maximum transmission unit (MTU), maximum segment size (MSS), PING, NETSTAT, and fragmentation. It is aimed at those mining information/network security.

IP Packet Fragmentation Tutorial

This tutorial covers IP fragmentation, data payloads, IP packet and header lengths, maximum transmission unit (MTU), and fragmentation offset (FO). It is aimed at those mining information/network security.
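
How these quantities fit together can be sketched as follows, assuming IPv4 with a 20-byte header and no options, and a header of the same length on every fragment. The function name `fragment_datagram` and the dictionary keys are illustrative, and the 4000-byte datagram is a made-up example.

```python
def fragment_datagram(total_length, mtu, header_len=20):
    """Split an IPv4 datagram into fragments that fit within the MTU.

    Offsets are reported in 8-byte units, as stored in the fragment
    offset (FO) header field.
    """
    payload = total_length - header_len
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    if payload <= 0 or max_data <= 0:
        raise ValueError("nothing to fragment or MTU too small")
    fragments = []
    sent = 0  # payload bytes already assigned to fragments
    while sent < payload:
        data_len = min(max_data, payload - sent)
        fragments.append({
            "offset_units": sent // 8,
            "data_len": data_len,
            "total_len": data_len + header_len,
            "more_fragments": sent + data_len < payload,
        })
        sent += data_len
    return fragments

# A 4000-byte datagram (3980 data bytes) over a 1500-byte MTU link gives
# data lengths 1480, 1480, 1020 at offsets 0, 185, 370.
frags = fragment_datagram(4000, 1500)
```

A datagram that already fits within the MTU comes back as a single entry with offset 0 and `more_fragments` set to `False`.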