Vector Space Explorer (VSE)
 The Vector Space Explorer (VSE) was developed to introduce computer science students and those interested in information retrieval (IR) to vector space models.
 To prevent your browser from crashing, enter only a few records. For instance, use VSE with the top titles or text snippets from search engine result pages.
For experimentation, a test collection need not consist of full-text articles. It can be a set of titles, abstracts, paragraphs, sentences, phrases, and so on. In fact, some early IR implementations used collections of titles and abstracts (Harman, 2011), and for illustrative purposes some modern IR textbooks treat short sentences as documents (Grossman & Frieder, 2004; Manning, Raghavan, & Schütze, 2008).
 To use this tool, enter the "documents" and query in textareas D and q, respectively.
Note: You must enter one text stream per line in textarea D, pressing the Enter key of your keyboard after each document. This instructs the tool to recognize each text stream as a single entry. See the examples provided.

The default settings consist of accepting the default parameter values and removing stopwords. The default weighting scheme corresponds to the FREQ Model, also known as the Term Count Model. This is one of the simplest vector space implementations, where term weights are just raw frequencies (term counts); i.e., no global weights are assigned.

First-Time Users
First-time users may want to try the examples provided by pressing the Try This Example button. It is possible to cycle through the examples by repeatedly pressing this button. Some of the examples list top titles obtained by querying commercial search engines. Accepting the default settings instructs VSE to remove stopwords and implement the FREQ Model.
Brief Introduction to Vector Space Models

In vector space models, documents and queries are represented as vectors of weights, where w_{i,j} and w_{i,q} are the weights of term i in document j and query q, respectively.
As mentioned in the Introduction section, in IR a document can be a text file or any sort of text stream. How we define a document is irrelevant, as all documents are eventually reduced to a sequence of terms.
Documents can be ranked in decreasing order of dot products by defining a simple similarity coefficient coeff(d_{j}, q):

coeff(d_{j}, q) = Σ_{i} w_{i,j} w_{i,q}    (1)
To correct for the length of the documents, (1) is divided by the vector lengths.
sim(d_{j}, q) = Σ_{i} w_{i,j} w_{i,q} / [ sqrt(Σ_{i} w_{i,j}^{2}) sqrt(Σ_{i} w_{i,q}^{2}) ]    (2)
These scores are called cosine similarities sim(d_{j}, q) because they are the cosine of the angle formed between the document and query vectors. A geometric analysis of (1) and (2) is available (Jones & Furnas, 1987). As (2) is nowadays preferred over (1), these are the scores computed by our tool.
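In code, (1) and (2) can be sketched as follows. This is an illustrative sketch only, not VSE's actual implementation; the document and query are assumed to be already reduced to term-weight dictionaries.

```python
import math

def coeff(d, q):
    """Dot-product score of Eq. (1): sum of w_{i,j} * w_{i,q} over shared terms."""
    return sum(w * q[t] for t, w in d.items() if t in q)

def sim(d, q):
    """Cosine similarity of Eq. (2): Eq. (1) divided by the two vector lengths."""
    length_d = math.sqrt(sum(w * w for w in d.values()))
    length_q = math.sqrt(sum(w * w for w in q.values()))
    return coeff(d, q) / (length_d * length_q) if length_d and length_q else 0.0

# FREQ (term count) weights for a toy document and query
d1 = {"gold": 1, "silver": 1, "truck": 1}
q = {"gold": 1, "silver": 1}
```

Because the score is length-normalized, scaling all of a document's weights by a constant (e.g., duplicating its text under FREQ weighting) leaves sim(d_{j}, q) unchanged.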
In (1) and (2), w_{i,j} and w_{i,q} are combinations of local L and global G weighting schemes.
w_{i,j} = L_{i,j} G_{i}    (3)
w_{i,q} = L_{i,q} G_{i}    (4)
Local weights are computed within a given document, whereas global weights are computed across the collection of documents D
D = {d_{1}, d_{2}, ..., d_{N}}    (5)
where N is the number of documents in the collection.

Prior to computing cosine similarities, documents are reduced to a stream of unique terms through a transformation process that includes linearization (removal of tag elements, comments, scripts, style instructions, ...), tokenization (lowercasing and removal of full/partial stops or unwanted characters), and filtration (stopword removal and deduplication).
Surviving terms are then merged with surviving terms from other documents, and a vocabulary of terms is generated. Documents and the query are then represented by sparse vectors, as only a few of the vocabulary terms occur in any one document or query.
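The transformation and merging steps can be sketched as follows. This is a minimal illustration; the stopword list and tokenizer below are stand-ins, not the ones VSE actually uses.

```python
import re

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative subset only

def transform(text):
    """Reduce a text stream to its surviving unique terms:
    tokenization (lowercasing, stripping punctuation and unwanted characters)
    followed by filtration (stopword removal and deduplication)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return set(tokens) - STOPWORDS

def vocabulary(docs):
    """Merge the surviving terms of all documents into one vocabulary."""
    vocab = set()
    for d in docs:
        vocab |= transform(d)
    return sorted(vocab)
```

Each document (and the query) is then represented as a sparse vector of weights over this vocabulary.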
As a result, combinations of local and global weighting schemes, for documents and queries, give rise to different cosine similarity-based ranking functions. Our VSE tool helps precisely with the exploration of different combinations of weighting schemes.
It should be stressed that vector space models are not the only way of ranking documents. For instance, documents can be ranked using Best Match Models (i.e., BM25, BM15, BM11, BM1, and BM0). These models rank documents in the presence or absence of relevance information.
 Before proceeding any further, please read the What is computed? section, which describes the several vector space implementations supported by the tool. For those interested in the theory behind vector space and best match models, useful tutorials are available in the Tutorials section of Minerazzi.

VSE implements many of the local and global weighting schemes used in the IR literature (Chisholm & Kolda, 1999; Salton & Buckley, 1987; Crestani, Sanderson, Ruthven, & Rijsbergen, 1995; Sanderson & Ruthven, 1996). These are given in Tables 1 and 2.
Parametric correction values k and K are given by default, but can be overwritten for exploratory and optimization purposes.

In Table 1, some local weights are normalized with document d_{j}'s sum of frequencies, its average term frequency a_{j}, or its maximum term frequency x_{j}.

Table 1. Local Weighting Schemes

FREQ (Frequency, Term Count). Favors long documents as these tend to repeat terms. Assumes that a term repeated x times weighs x times more.

FREQA (Average-normalized FREQ). Favors long documents by normalizing term frequencies with d_{j}'s average term frequency.

FREQX (Maximum frequency-normalized FREQ). Favors documents that mention a given query term more often by normalizing term frequencies with d_{j}'s maximum term frequency.

BNRY (Binary). Favors vocabulary-rich documents as their retrieval possibilities increase. Insensitive to term repetition.

SQRT (Square Root; default k = 0.5). Resembles the LOGA curve, but gives more weight to terms appearing only a few times in a document.

ATF1 (Augmented Normalized Term Frequency; default k = 0.5). Awards weight to a term for appearing in a document, then awards additional weight for appearing frequently. Weight varies between 0.5 and 1. Reduces to FREQX for k = 0 and to BNRY for k = 1.

ATFC (Changed-Coefficient ATF1; default k = 0.2). Of the same form as ATF1, but even more weight is given if a term appears frequently in a document, and less weight is given just for appearing. The maximum ATFC value is one. Reduces to FREQX for k = 0 and to BNRY for k = 1.

ATFA (Augmented Average Term Frequency; default k = 0.9). Like ATF1, gives weight to a term for just appearing in a document, but even less weight if a term appears frequently. It is also normalized differently, with d_{j}'s average term frequency. The average ATFA value is one. Reduces to FREQA for k = 0 and to BNRY for k = 1.

ALOG (Log Attenuated). Shifts the scale of frequencies by 1 and then log-transforms it, so the minimum local weight is 0.3. Attenuates weights up to one decade more than LOGA.

LOGA (Log). Shifts a log-transformed scale of frequencies so the minimum local weight is 1.

LOGN (Normalized Log). As it is normalized by the 1 + log(a_{j}) term, where a_{j} is d_{j}'s average term frequency, the weight given by LOGN is always lower than that of LOGA. Normalization ensures that the scale of local weights is upper bounded.

LOGG (Augmented Log; default k = 0.2). A variation of ATFC. Local weight can be greater than 1. Works well as a local query weight. Reduces to ALOG for k = 0 and to BNRY for k = 1.

LOGLN (Log-Length Normalized). Similar to ALOG, but attenuates local weights even more by normalizing values with the log of the length of d_{j}, taken as the number of unique terms.

Computing term weights by considering only term frequencies is not an effective strategy. For instance, if high-frequency terms are prevalent across a collection of documents, then all documents are likely to be retrieved, affecting the precision of a search.
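A few of the local schemes in Table 1 can be sketched in code. The formulas below follow the standard definitions in the cited literature (e.g., Chisholm & Kolda, 1999) and should be treated as a sketch rather than VSE's exact implementation; the base-10 logarithm is an assumption consistent with ALOG's stated minimum weight of 0.3.

```python
import math

def freq(f):
    """FREQ: raw term count; no normalization."""
    return f

def bnry(f):
    """BNRY: 1 if the term appears, 0 otherwise; insensitive to repetition."""
    return 1 if f > 0 else 0

def freqx(f, x):
    """FREQX: frequency normalized by the document's maximum term frequency x_j."""
    return f / x

def atf1(f, x, k=0.5):
    """ATF1: k + (1 - k) * f / x_j for f > 0.
    Reduces to FREQX at k = 0 and to BNRY at k = 1."""
    return k + (1 - k) * f / x if f > 0 else 0.0

def alog(f):
    """ALOG: log10(1 + f); minimum local weight log10(2) = 0.3 at f = 1."""
    return math.log10(1 + f)

def loga(f):
    """LOGA: 1 + log10(f) for f > 0; minimum local weight is 1."""
    return 1 + math.log10(f) if f > 0 else 0.0
```

With k = 0.5, a term at the document's maximum frequency gets an ATF1 weight of 1, while a term occurring once in a long document still gets just over 0.5, matching the 0.5-to-1 range stated in Table 1.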
This can be avoided by incorporating global weighting schemes that account for terms concentrated in a few documents. These schemes are based on the notion of inverse document frequency (IDF).
Table 2 lists some of the best known global weighting schemes.
Table 2. Global Weighting Schemes

IDF (Inverse Document Frequency). Favors terms that are mentioned in a small number of documents. Assigns low weight to terms prevalent across a collection, such as stopwords and frequently used terms.

IDFP (Probabilistic IDF). Assigns a weight of zero to terms mentioned in 50% of the collection of documents and negative weights if they appear in more than 50% of the collection. Does not avoid negative weights.

IDFPS (Smoothed IDFP; default K = 0.5). IDFP with a smoothing correction. Resets the scale of weights to avoid a mathematical error when N = n_{i}. However, it still assigns a weight of zero to terms mentioned in 50% of the collection of documents. Does not avoid negative weights.

IDFPA (Adaptable IDFP; default K = 2). An adaptable weighting scheme that we derived purely from power transformation theory, initially referred to as IDFP+ (Garcia, 2009; 2011). Reduces to IDFP for K = 0, to the classic IDF for K = 1, and for K = 2 to Lee's lift function, which she derived using several layers of abstraction (Lee, 2007).

IGFF (Global Frequency IDF). Assigns a weight of 1 if a term appears once in every document or once in one document. Works best when combined with a different global weight on the query. Favors terms repeated often over a small set of documents.

IGFI (Incremental Global Frequency IDF; default K = 1). Like IGFF, but adding K improves its performance and favors even more those terms repeated often over a small set of documents. Reduces to IGFF for K = 0.

IGFL (Log-Global Frequency IDF; default K = 1). Logarithmic version of IGFI.

IGFS (Square Root Global Frequency; default K = 0.9). Subtracting a value of K less than 1 was found to improve performance. Subtracting 1 could cause a global weight of zero for some terms.

CH2 (Croft-Harper Assumption 2; default K = 0.5). The assumption here is that all query terms share the same probability of occurring in a relevant document, so K is interpreted as this probability. CH2 reduces to IDFP for K = 0.5, whereas for 0 < K < 1 the curve of IDFP values is lifted.
Avoiding Unrealistic Weighting Scenarios
 Sometimes, the selection of a global weighting scheme can result in an unrealistic scenario where infinite and/or nonpositive weights are obtained:
 When 100% of the documents mention a query term, i.e., n_{i} = N, IDF returns zero weights, IDFP returns infinite weights, and IDFPS returns negative weights.
 When 50% of the documents mention a query term, i.e., n_{i} = 0.5N, IDFP and IDFPS return zero weights.
 When more than 50% of the documents mention a query term, i.e., n_{i} > 0.5N, IDFP and IDFPS return negative weights.
Those scenarios can be avoided altogether with the IDFPA weighting scheme by setting K > 1; the default value is K = 2.
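These edge cases are easy to verify numerically. In the sketch below, IDF and IDFP follow their standard definitions; the IDFPA form log((N + (K - 1) n_{i}) / n_{i}) is a reconstruction inferred from the stated limits (IDFP at K = 0, the classic IDF at K = 1, and Lee's lift log(1 + N/n_{i}) at K = 2), and base-10 logarithms are assumed throughout.

```python
import math

def idf(N, n):
    """Classic IDF: log10(N / n_i). Zero when all N documents mention the term."""
    return math.log10(N / n)

def idfp(N, n):
    """Probabilistic IDF: log10((N - n_i) / n_i).
    Undefined at n_i = N, zero at n_i = N/2, negative for n_i > N/2."""
    return math.log10((N - n) / n)

def idfpa(N, n, K=2):
    """Adaptable IDFP (reconstructed form): log10((N + (K - 1) * n_i) / n_i).
    Strictly positive for K > 1, since the argument is then always > 1."""
    return math.log10((N + (K - 1) * n) / n)

N = 10
assert idf(N, 10) == 0.0      # term in 100% of documents -> zero IDF weight
assert idfp(N, 5) == 0.0      # term in 50% of documents -> zero IDFP weight
assert idfp(N, 8) < 0         # term in more than 50% -> negative IDFP weight
assert idfpa(N, 10, K=2) > 0  # IDFPA stays positive even at n_i = N
```

At K = 2 and n_{i} = N, IDFPA returns log10(2) ≈ 0.301 instead of IDF's zero, so no query term is ever silenced or given a negative weight.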
Using VSE as a Plagiarism Checker
 Plagiarism checkers search for similarities across documents using different types of similarity measures, one being cosine similarity.
To use VSE as a plagiarism checker for any two pieces of text (e.g., two abstracts, two paragraphs,...), enter one piece of text in textarea D and the other in textarea q. In this case, do not enter blank lines in the textareas.
Take the computed cosine similarity as a measure of plagiarism.
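That workflow can be sketched end to end: reduce each text to FREQ (term count) weights and take the cosine of the two vectors. The tokenizer here is a stand-in; VSE's own preprocessing (stopword removal, etc.) may yield somewhat different scores.

```python
import math
import re

def freq_weights(text):
    """FREQ weights: raw term counts after a simple tokenization."""
    weights = {}
    for t in re.findall(r"[a-z0-9]+", text.lower()):
        weights[t] = weights.get(t, 0) + 1
    return weights

def cosine(a, b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

text_d = "The tool computes cosine similarities between documents."
text_q = "The tool computes cosine similarities between two documents."
score = cosine(freq_weights(text_d), freq_weights(text_q))
```

Identical texts score exactly 1; a score close to 1, as in this near-duplicate pair, flags possible plagiarism.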
Intended Users
 Researchers, teachers, students, or anyone interested in exploring vector space models.
 Users with a basic knowledge of information retrieval systems.

Combining FREQ with inverse document frequency (IDF) as a global weight defines TFIDF, one of the best known weighting schemes, where

w_{i,j} = f_{i,j} log(N/n_{i})    (6)

and f_{i,j} is the frequency of term i in document j, while n_{i} is the number of documents mentioning term i.
Rank documents with FREQ only and then with TFIDF. In both cases, use FREQ for the query. Compare results.
 As noted by Baeza-Yates & Ribeiro-Neto (1999), Salton and Buckley (1987) suggested weighting query terms with (7), which is an ATF1-IDF weighting scheme:

w_{i,q} = (0.5 + 0.5 f_{i,q}/x_{q}) log(N/n_{i})    (7)

where x_{q} is the query's maximum term frequency.
Repeat previous exercise, this time using TFIDF for the documents and ATF1IDF for the query. Compare results.
References

 Baeza-Yates, R., & Ribeiro-Neto, B. A. (1999). Modern Information Retrieval. Addison-Wesley, N.Y. Chapter 2, p. 30.
 Chisholm, E., & Kolda, T. G. (1999). New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Oak Ridge National Laboratory.
 Crestani, F., Sanderson, M., Ruthven, M. I., & Rijsbergen, C. J. (1995). The troubles with using a logical model of IR on a large collection of documents. Proceedings of the 4th TREC Conference (TREC-4). NIST, pages 509-526.
 Garcia, E. (2009). Robertson-Spärck Jones Probabilistic Model Tutorial. Minerazzi.com.
 Garcia, E. (2011). A Tutorial on the BM25F Model. Minerazzi.com.
 Grossman, D. A., & Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer.
 Harman, D. (2011). Document Retrieval Evaluation. Chapter 1, pp. 16-24. Morgan & Claypool Publishers.
 Jones, W. P., & Furnas, G. W. (1987). Pictures of Relevance: A Geometric Analysis of Similarity Measures. JASIS, 38(6), 420-442.
 Lee, L. (2007). IDF Revisited: A Simple New Derivation within the Robertson-Spärck Jones Probabilistic Model. SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM.
 Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
 Salton, G., & Buckley, C. (1987). Term Weighting Approaches in Automatic Text Retrieval. Technical Report 87-881, Cornell University.
 Sanderson, M., & Ruthven, I. (1996). Report on the Glasgow IR group (glair4) submission. Proceedings of the 5th TREC Conference (TREC-5). NIST, pages 517-520.
Feedback
Contact us with any suggestions or questions regarding this tool.