Vector Space Explorer (VSE)
 The Vector Space Explorer (VSE) was developed to introduce computer science students and those interested in information retrieval (IR) to vector space models.
 To prevent your browser from crashing, enter only a few records. For instance, use VSE with the top titles or text snippets from search engine result pages.
For experimentation, a test collection need not consist of full-text articles. It can be a set of titles, abstracts, paragraphs, sentences, phrases, and so on. In fact, some early IR implementations used collections of titles and abstracts (Harman, 2011), and for illustrative purposes some modern IR textbooks treat short sentences as documents (Grossman & Frieder, 2004; Manning, Raghavan, & Schütze, 2008).
 To use this tool, enter the "documents" and query in textareas D and q, respectively.
Note: You must enter one text stream per line in textarea D, pressing the Enter key of your keyboard after each document. This instructs the tool to recognize each text stream as a single entry. See the examples provided.

The default settings consist of accepting the default parameter values and removing stopwords. The default weighting scheme corresponds to the FREQ Model, also known as the Term Count Model. This is one of the simplest vector space implementations, where term weights are just raw frequencies (term counts); i.e., no global weights are assigned.

First-Time Users
First-time users may want to try the examples provided by pressing the Try This Example button. It is possible to cycle through the examples by repeatedly pressing this button. Some of the examples list top titles obtained by querying commercial search engines. Accepting the default settings instructs VSE to remove stopwords and implement the FREQ Model.
Brief Introduction to Vector Space Models

In vector space models, documents and queries are represented as vectors of weights, where w_{i,j} and w_{i,q} are the weights of term i in document j and query q, respectively.
As mentioned in the Introduction section, in IR a document can be a text file or any sort of text stream. How we define a document is irrelevant, as all documents are eventually reduced to a sequence of terms.
Documents can be ranked in decreasing order of dot products by defining a simple similarity coefficient coeff(d_{j}, q):

coeff(d_{j}, q) = Σ_{i} w_{i,j} w_{i,q}    (1)
To correct for the length of the documents, (1) is divided by the vector lengths.
sim(d_{j}, q) = Σ_{i} w_{i,j} w_{i,q} / [ sqrt(Σ_{i} w_{i,j}^{2}) sqrt(Σ_{i} w_{i,q}^{2}) ]    (2)
These scores are called cosine similarities sim(d_{j}, q) because they are the cosine of the angle formed between the document and query vectors. A geometric analysis of (1) and (2) is available (Jones & Furnas, 1987). As (2) is nowadays preferred over (1), these are the scores computed by our tool.
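In code, (1) and (2) can be sketched as follows. This is an illustrative sketch only, not VSE's actual implementation; the document and query are assumed to be already reduced to term-weight dictionaries.

```python
import math

def coeff(d, q):
    """Dot-product score of Eq. (1): sum of w_{i,j} * w_{i,q} over shared terms."""
    return sum(w * q[t] for t, w in d.items() if t in q)

def sim(d, q):
    """Cosine similarity of Eq. (2): Eq. (1) divided by the two vector lengths."""
    length_d = math.sqrt(sum(w * w for w in d.values()))
    length_q = math.sqrt(sum(w * w for w in q.values()))
    return coeff(d, q) / (length_d * length_q) if length_d and length_q else 0.0

# FREQ (term count) weights for a toy document and query
d1 = {"gold": 1, "silver": 1, "truck": 1}
q = {"gold": 1, "silver": 1}
```

Because the score is length-normalized, scaling all of a document's weights by a constant (e.g., duplicating its text under FREQ weighting) leaves sim(d_{j}, q) unchanged.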
In (1) and (2), w_{i,j} and w_{i,q} are combinations of local L and global G weighting schemes.
w_{i,j} = L_{i,j} G_{i}    (3)
w_{i,q} = L_{i,q} G_{i}    (4)
Local weights are computed within a given document, whereas global weights are computed across the collection of documents D
D = {d_{1}, d_{2}, ..., d_{N}}    (5)
where N is the number of documents in the collection.

Prior to computing cosine similarities, documents are reduced to a stream of unique terms through a transformation process that includes linearization (removal of tag elements, comments, scripts, style instructions, ...), tokenization (lowercasing and removal of full/partial stops or unwanted characters), and filtration (stopword removal and deduplication).
Surviving terms are then merged with surviving terms from other documents, and a vocabulary of terms is generated. Documents and the query are then represented by sparse vectors, as only a few of the vocabulary terms occur in any one document or query.
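The transformation and merging steps can be sketched as follows. This is a minimal illustration; the stopword list and tokenizer below are stand-ins, not the ones VSE actually uses.

```python
import re

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative subset only

def transform(text):
    """Reduce a text stream to its surviving unique terms:
    tokenization (lowercasing, stripping punctuation and unwanted characters)
    followed by filtration (stopword removal and deduplication)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return set(tokens) - STOPWORDS

def vocabulary(docs):
    """Merge the surviving terms of all documents into one vocabulary."""
    vocab = set()
    for d in docs:
        vocab |= transform(d)
    return sorted(vocab)
```

Each document (and the query) is then represented as a sparse vector of weights over this vocabulary.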
As a result, combinations of local and global weighting schemes, for documents and queries, give rise to different cosine similarity-based ranking functions. Our VSE tool helps precisely with the exploration of different combinations of weighting schemes.
It should be stressed that vector space models are not the only way of ranking documents. For instance, documents can be ranked using Best Match Models (i.e., BM25, BM15, BM11, BM1, and BM0). These models rank documents in the presence or absence of relevance information.
 Before proceeding any further, please read the What is computed? section, which describes the several vector space implementations supported by the tool. For those interested in the theory behind vector space and best match models, useful tutorials are available in the Tutorials section of Minerazzi.

VSE implements many of the local and global weighting schemes used in the IR literature (Chisholm & Kolda, 1999; Salton & Buckley, 1987; Crestani, Sanderson, Ruthven, & Rijsbergen, 1995; Sanderson & Ruthven, 1996). These are given in Tables 1 and 2.
Parametric correction values k and K are given by default, but can be overwritten for exploratory and optimization purposes.

In Table 1, some local weights are normalized with document d_{j}'s sum of frequencies, its average term frequency a_{j}, or its maximum term frequency x_{j}.

Table 1. Local Weighting Schemes

FREQ (Frequency, Term Count). Favors long documents as these tend to repeat terms. Assumes that a term repeated x times weighs x times more.

FREQA (Average-normalized FREQ). Favors long documents by normalizing term frequencies with d_{j}'s average term frequency.

FREQX (Maximum frequency-normalized FREQ). Favors documents that mention a given query term more often by normalizing term frequencies with d_{j}'s maximum term frequency.

BNRY (Binary). Favors vocabulary-rich documents as their retrieval possibilities increase. Insensitive to term repetition.

SQRT (Square Root; default k = 0.5). Resembles the LOGA curve, but gives more weight to terms appearing only a few times in a document.

ATF1 (Augmented Normalized Term Frequency; default k = 0.5). Awards weight to a term for appearing in a document, then awards additional weight for appearing frequently. Weight varies between 0.5 and 1. Reduces to FREQX for k = 0 and to BNRY for k = 1.

ATFC (Changed-Coefficient ATF1; default k = 0.2). Of the same form as ATF1, but even more weight is given if a term appears frequently in a document, and less weight is given just for appearing. The maximum ATFC value is one. Reduces to FREQX for k = 0 and to BNRY for k = 1.

ATFA (Augmented Average Term Frequency; default k = 0.9). Like ATF1, gives weight to a term for just appearing in a document, but even less weight if a term appears frequently. It is also normalized differently, with d_{j}'s average term frequency. The average ATFA value is one. Reduces to FREQA for k = 0 and to BNRY for k = 1.

ALOG (Log Attenuated). Shifts the scale of frequencies by 1 and then log-transforms it, so the minimum local weight is 0.3. Attenuates weights up to one decade more than LOGA.

LOGA (Log). Shifts a log-transformed scale of frequencies so the minimum local weight is 1.

LOGN (Normalized Log). As it is normalized by the 1 + log(a_{j}) term, where a_{j} is d_{j}'s average term frequency, the weight given by LOGN is always lower than that of LOGA. Normalization ensures that the scale of local weights is upper bounded.

LOGG (Augmented Log; default k = 0.2). A variation of ATFC. Local weight can be greater than 1. Works well as a local query weight. Reduces to ALOG for k = 0 and to BNRY for k = 1.

LOGLN (Log-Length Normalized). Similar to ALOG, but attenuates local weights even more by normalizing values with the log of the length of d_{j}, taken as the number of unique terms.

Computing term weights by considering only term frequencies is not an effective strategy. For instance, if high-frequency terms are prevalent across a collection of documents, then all documents are likely to be retrieved, affecting the precision of a search.
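A few of the local schemes in Table 1 can be sketched in code. The formulas below follow the standard definitions in the cited literature (e.g., Chisholm & Kolda, 1999) and should be treated as a sketch rather than VSE's exact implementation; the base-10 logarithm is an assumption consistent with ALOG's stated minimum weight of 0.3.

```python
import math

def freq(f):
    """FREQ: raw term count; no normalization."""
    return f

def bnry(f):
    """BNRY: 1 if the term appears, 0 otherwise; insensitive to repetition."""
    return 1 if f > 0 else 0

def freqx(f, x):
    """FREQX: frequency normalized by the document's maximum term frequency x_j."""
    return f / x

def atf1(f, x, k=0.5):
    """ATF1: k + (1 - k) * f / x_j for f > 0.
    Reduces to FREQX at k = 0 and to BNRY at k = 1."""
    return k + (1 - k) * f / x if f > 0 else 0.0

def alog(f):
    """ALOG: log10(1 + f); minimum local weight log10(2) = 0.3 at f = 1."""
    return math.log10(1 + f)

def loga(f):
    """LOGA: 1 + log10(f) for f > 0; minimum local weight is 1."""
    return 1 + math.log10(f) if f > 0 else 0.0
```

With k = 0.5, a term at the document's maximum frequency gets an ATF1 weight of 1, while a term occurring once in a long document still gets just over 0.5, matching the 0.5-to-1 range stated in Table 1.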
This can be avoided by incorporating global weighting schemes that account for terms concentrated in a few documents. These schemes are based on the notion of inverse document frequency (IDF).
Table 2 lists some of the best known global weighting schemes.
Table 2. Global Weighting Schemes

IDF (Inverse Document Frequency). Favors terms that are mentioned in a small number of documents. Assigns low weight to terms prevalent across a collection, such as stopwords and frequently used terms.

IDFP (Probabilistic IDF). Assigns a weight of zero to terms mentioned in 50% of the collection of documents and negative weights if they appear in more than 50% of the collection. Does not avoid negative weights.

IDFPS (Smoothed IDFP; default K = 0.5). IDFP with a smoothing correction. Resets the scale of weights to avoid a mathematical error when N = n_{i}. However, it still assigns a weight of zero to terms mentioned in 50% of the collection of documents. Does not avoid negative weights.

IDFPA (Adaptable IDFP; default K = 2). An adaptable weighting scheme that we derived purely from power transformation theory, initially referred to as IDFP+ (Garcia, 2009; 2011). Reduces to IDFP for K = 0, to the classic IDF for K = 1, and for K = 2 to Lee's lift function, which she derived using several layers of abstraction (Lee, 2007).

IGFF (Global Frequency IDF). Assigns a weight of 1 if a term appears once in every document or once in one document. Works best when combined with a different global weight on the query. Favors terms repeated often over a small set of documents.

IGFI (Incremental Global Frequency IDF; default K = 1). Like IGFF, but adding K improves its performance and favors even more those terms repeated often over a small set of documents. Reduces to IGFF for K = 0.

IGFL (Log-Global Frequency IDF; default K = 1). Logarithmic version of IGFI.

IGFS (Square Root Global Frequency; default K = 0.9). Subtracting a value of K less than 1 was found to improve performance. Subtracting 1 could cause a global weight of zero for some terms.

CH2 (Croft-Harper Assumption 2; default K = 0.5). The assumption here is that all query terms share the same probability of occurring in a relevant document, so K is interpreted as this probability. CH2 reduces to IDFP for K = 0.5, whereas for 0 < K < 1 the curve of IDFP values is lifted.
Avoiding Unrealistic Weighting Scenarios
 Sometimes, the selection of a global weighting scheme can result in an unrealistic scenario where infinite and/or nonpositive weights are obtained:
 When 100% of the documents mention a query term, i.e., n_{i} = N, IDF returns zero weights, IDFP returns infinite weights, and IDFPS returns negative weights.
 When 50% of the documents mention a query term, i.e., n_{i} = 0.5N, IDFP and IDFPS return zero weights.
 When more than 50% of the documents mention a query term, i.e., n_{i} > 0.5N, IDFP and IDFPS return negative weights.
Those scenarios can be avoided altogether with the IDFPA weighting scheme by setting K > 1; the default value is K = 2.
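These edge cases are easy to verify numerically. In the sketch below, IDF and IDFP follow their standard definitions; the IDFPA form log((N + (K - 1) n_{i}) / n_{i}) is a reconstruction inferred from the stated limits (IDFP at K = 0, the classic IDF at K = 1, and Lee's lift log(1 + N/n_{i}) at K = 2), and base-10 logarithms are assumed throughout.

```python
import math

def idf(N, n):
    """Classic IDF: log10(N / n_i). Zero when all N documents mention the term."""
    return math.log10(N / n)

def idfp(N, n):
    """Probabilistic IDF: log10((N - n_i) / n_i).
    Undefined at n_i = N, zero at n_i = N/2, negative for n_i > N/2."""
    return math.log10((N - n) / n)

def idfpa(N, n, K=2):
    """Adaptable IDFP (reconstructed form): log10((N + (K - 1) * n_i) / n_i).
    Strictly positive for K > 1, since the argument is then always > 1."""
    return math.log10((N + (K - 1) * n) / n)

N = 10
assert idf(N, 10) == 0.0      # term in 100% of documents -> zero IDF weight
assert idfp(N, 5) == 0.0      # term in 50% of documents -> zero IDFP weight
assert idfp(N, 8) < 0         # term in more than 50% -> negative IDFP weight
assert idfpa(N, 10, K=2) > 0  # IDFPA stays positive even at n_i = N
```

At K = 2 and n_{i} = N, IDFPA returns log10(2) ≈ 0.301 instead of IDF's zero, so no query term is ever silenced or given a negative weight.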
Using VSE as a Plagiarism Checker
 Plagiarism checkers search for similarities across documents using different types of similarity measures, one being cosine similarity.
To use VSE as a plagiarism checker for any two pieces of text (e.g., two abstracts, two paragraphs,...), enter one piece of text in textarea D and the other in textarea q. In this case, do not enter blank lines in the textareas.
Take the computed cosine similarity as a measure of plagiarism.
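That workflow can be sketched end to end: reduce each text to FREQ (term count) weights and take the cosine of the two vectors. The tokenizer here is a stand-in; VSE's own preprocessing (stopword removal, etc.) may yield somewhat different scores.

```python
import math
import re

def freq_weights(text):
    """FREQ weights: raw term counts after a simple tokenization."""
    weights = {}
    for t in re.findall(r"[a-z0-9]+", text.lower()):
        weights[t] = weights.get(t, 0) + 1
    return weights

def cosine(a, b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

text_d = "The tool computes cosine similarities between documents."
text_q = "The tool computes cosine similarities between two documents."
score = cosine(freq_weights(text_d), freq_weights(text_q))
```

Identical texts score exactly 1; a score close to 1, as in this near-duplicate pair, flags possible plagiarism.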
Intended Users
 Researchers, teachers, students, or anyone interested in exploring vector space models.
 Users with a basic knowledge of information retrieval systems.

Combining FREQ with inverse document frequency (IDF) as a global weight defines TFIDF, one of the best known weighting schemes, where

w_{i,j} = f_{i,j} log(N/n_{i})    (6)

and f_{i,j} is the frequency of term i in document j, while n_{i} is the number of documents mentioning term i.
Rank documents with FREQ only and then with TFIDF. In both cases, use FREQ for the query. Compare results.
 As noted by Baeza-Yates & Ribeiro-Neto (1999), Salton and Buckley (1987) suggested weighting query terms with (7), which is an ATF1-IDF weighting scheme:

w_{i,q} = (0.5 + 0.5 f_{i,q}/x_{q}) log(N/n_{i})    (7)

where x_{q} is the query's maximum term frequency.
Repeat previous exercise, this time using TFIDF for the documents and ATF1IDF for the query. Compare results.
References

 Baeza-Yates, R., & Ribeiro-Neto, B. A. (1999). Modern Information Retrieval. Addison-Wesley, N.Y. Chapter 2, p. 30.
 Chisholm, E., & Kolda, T. G. (1999). New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Oak Ridge National Laboratory.
 Crestani, F., Sanderson, M., Ruthven, M. I., & Rijsbergen, C. J. (1995). The troubles with using a logical model of IR on a large collection of documents. Proceedings of the 4th TREC Conference (TREC-4). NIST, pages 509-526.
 Garcia, E. (2009). Robertson-Spärck Jones Probabilistic Model Tutorial. Minerazzi.com.
 Garcia, E. (2011). A Tutorial on the BM25F Model. Minerazzi.com.
 Grossman, D. A., & Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer.
 Harman, D. (2011). Document Retrieval Evaluation. Chapter 1, pp. 16-24. Morgan & Claypool Publishers.
 Jones, W. P., & Furnas, G. W. (1987). Pictures of Relevance: A Geometric Analysis of Similarity Measures. JASIS, 38(6), 420-442.
 Lee, L. (2007). IDF Revisited: A Simple New Derivation within the Robertson-Spärck Jones Probabilistic Model. SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM.
 Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
 Salton, G., & Buckley, C. (1987). Term Weighting Approaches in Automatic Text Retrieval. Technical Report 87-881, Cornell University.
 Sanderson, M., & Ruthven, I. (1996). Report on the Glasgow IR group (glair4) submission. Proceedings of the 5th TREC Conference (TREC-5). NIST, pages 517-520.
Feedback
Contact us with any suggestions or questions regarding this tool.