XOR-XNOR

xor, xnor modes

XOR and XNOR Searches

XOR is a logical operation that outputs true whenever an uneven number of inputs is true (1). Its inverse operation, XNOR, outputs true whenever an even number of inputs is true.

Adapting these operations to IR, documents retrieved with the XOR mode form a superset of documents that match an uneven number of search terms (1, 3, 5, 7,...). By contrast, documents retrieved with the XNOR mode form a superset of documents that match an even number of search terms (0, 2, 4, 6,...).

We may refer to these as coordination levels between search terms and document terms. Each coordination level consists of several subsets.

Figure 1 illustrates these coordination levels and subsets for a query consisting of three terms w1, w2, and w3. The figure was shaped as a drum to indicate that a collection of documents was queried. The colored regions represent XOR matches and the white regions represent XNOR matches. In other words, XOR nonmatches are XNOR matches.

XOR and XNOR

Figure 1. Venn Diagram representation of XOR and XNOR searches.

When we extend this interpretation to queries consisting of more than 3 terms, a clearer picture emerges, showing the beauty of these search modes to their full potential.

The Beauty of XOR and XNOR: Co-Occurrence and LSI

A 5-term XOR search is capable of retrieving documents matching the subsets of terms given in the following table:

Table 1. Levels and subsets of a 5-term XOR search
Levels Subsets
1 {w1}, {w2}, {w3}, {w4}, {w5}
3 {w1, w2, w3}, {w1, w2, w4}, {w1, w2, w5}, {w1, w3, w4}, {w1, w3, w5}, {w1, w4, w5}, {w2, w3, w4}, {w2, w3, w5}, {w2, w4, w5}, {w3, w4, w5}
5 {w1, w2, w3, w4, w5}

By contrast, a 5-term XNOR search is capable of retrieving the following subsets:

Table 2. Levels and subsets of a 5-term XNOR search
Levels Subsets
0 { }
2 {w1, w2}, {w1, w3}, {w1, w4}, {w1, w5}, {w2, w3}, {w2, w4}, {w2, w5}, {w3, w4}, {w3, w5}, {w4, w5}
4 {w1, w2, w3, w4}, {w1, w3, w4, w5}, {w1, w2, w3, w5}, {w1, w2, w4, w5}, {w2, w3, w4, w5}

where { } represents the subset of documents matching none of the search terms. Wikipedia has an equivalent Venn Diagram representation of the nonempty subsets given in these tables.

Tables 1 and 2 reveal that any two terms have a first-order co-occurrence relationship when they belong to the same subset, but can exhibit a second-order co-occurrence when they belong to different subsets. This is illustrated in the following figure.

first-order co-occurrence second-order co-occurrence

Figure 2. First- and second-order co-occurrence scenarios.

The overlapping region at the left represents the {w1, w4} subset. This subset consists of documents where w1 co-occurs with w4. This is a first-order co-occurrence scenario.

The two overlapping regions at the right correspond to the {w1, w2} and {w2, w4} subsets. Notice that w1 co-occurs with w2 and w2 co-occurs with w4. Now w1 and w4 do not co-occur, but are related through the occurrence of another term, w2.

We may refer to w2 as an intermediary term. This is a second-order co-occurrence scenario between w1 and w4. Evidently, second-order co-occurrence between search terms is not defined for XOR searches with less than 4 terms or XNOR searches with less than 3 terms. Thus, the beauty of these search modes is best appreciated with longer queries.

As more intermediary terms are involved, co-occurrences of higher orders emerge. This is demonstrated in Figure 3, where now w1 and w4 exhibit a third-order co-occurrence relationship.

third-order co-occurrence

Figure 3. Third-order co-occurrence scenario between terms w1 and w4.

Certainly, w1 and w4, or any two terms, can exhibit a second-order or higher co-occurrence relationship through other search terms or through intermediary terms that are not part of the query, but that are present in the documents.

For instance in the following figure, colored circles represent document sets matching query terms and white circles represent document sets mentioning specific terms not present in the query. Notice that w1 and w4 exhibit a second-order co-occurrence through an intermediary term w* that is not in the query.


third-order co-occurrence

Figure 4. Co-occurrence through a term not present in a query.

Additional high-order co-occurrence cases can be derived from Figure 4. In other words, co-occurrence paths already exist in a collection before submitting any query. XOR and XNOR searches simply identify specific paths and the "actors" involved.

Co-occurrence is an important term-term relationship. In a series of articles, Kontostathis et al. have demonstrated that latent semantic indexing (LSI) information is mainly related to first- and second-order co-occurrences between terms (2 - 4). Thus, XOR and XNOR searches seem a good fit for latent semantic indexing studies and search engines based on LSI.

Possible Applications

An IR researcher can test the performance of an LSI algorithm with a sample of documents retrieved through XOR and XNOR searches. Said sample should be rich in co-occurrence cases. Using a similar procedure, search marketers or Web intelligence specialists can identify sets of documents that emphasize keywords somehow related through different co-occurrence paths.

An interesting application consists in extracting all the unique terms (or just the high frequency ones) from a text source and constructing an XOR query with these. We may refer to this as XORing a text source. This should help one identify a network of co-occurrence paths over a collection and which documents might be relevant to specific combination of terms from the original source.

The text source can be a title, description, abstract, or paragraph of a document, or even an entire document. However, XORing a large document might be computer-intensive.

A similar exercise can be done by XNORing a text source. In both cases, the resultant output can be used to identify prospective competitors; i.e., documents relevant to similar concepts or belonging to companies within the same business space.

Emulating XOR and XNOR Searches

If a search engine lacks of the XOR and XNOR search modes, but supports AND, OR, and NOT combinations, the following combination

w1 OR w2 OR w3 OR (w1 AND w2 AND w3) NOT (w1 AND w2) NOT (w1 AND w3) NOT (w2 AND w3)

can be used to emulate a 3-term XOR search. Similarly,

(w1 AND w2) OR (w1 AND w3) OR (w2 AND w3) NOT (w1 OR w2 OR w3) NOT (w1 AND w2 AND w3)

can be used to emulate a 3-term XNOR search.

These are not user-centered solutions, but a cumbersome way of working around a machine-centered platform. As the number of search terms increases, emulating XOR and XNOR searches in that way is not a practical retrieval technique.

A Better Way of Searching

With Minerazzi, all that hassle is avoided by submitting an XOR search like this

XOR:w1 w2 w3

and an XNOR search like this

XNOR:w1 w2 w3

Moreover, thanks to its Match Previews search interface, switching between search modes, or any mode for that matter, boils down to a one-click action from the end user.

Although we have limited Minerazzi searches to a maximum of five terms per query, we can remove this limit to run XOR and XNOR queries consisting of a large number of terms.

Applications to Clustering and Disambiguation

This section was added on 1-9-2014 and is the result of valuable feedback and discussion with Michael Berry (University of Tennessee, Knoxville). Berry is a world-class leading expert on many computer engineering topics, particularly LSI. During the '90s, he along with Dumais, Landauer, and others (5 - 8), developed the mathematical foundations of latent semantic analysis (5 - 11). We met back in 2006 during the Document Space Workshop held at the Institute of Pure and Applied Mathematics (IPAM), University of California, Los Angeles.

We discussed the following scenario. Some terms have too many contexts to go with. For instance, 'shot' might go with 'flu', 'drinking', or 'guns'. Similarly, 'bar' can go with 'establishment', 'counter', 'unit', 'music', 'law', 'landform', 'metal', 'sport', and 'solid' (12, 13), or even with 'deodorant' and 'shot'.

As he pointed out, an LSI search with multi-context terms might still struggle retrieving 'this, but not that' or in properly clustering documents as in 'include this topic, but nothing about that topic'.

These are open problems in computational linguistic (12) that cannot be solved with a single strategy or algorithm. No 'silver bullet' approaches are possible.

Because of that, diverse combinations of strategies have been used. Some of these are based on combining search operators, on analyzing word proximities and senses, on using query expansion methods, and so forth. PCA, SVD, LSI, LDA, Vector Space Models and other algorithms have been used with some of these techniques.

Frequently, the size and nature of a collection limit the success of those strategies and algorithms, not to mention that in most of these the AND search mode is commonly used. To the best of our knowledge, the use of XOR and XNOR searches with these strategies and algorithms is not documented.

So how we could use XOR and XNOR searches as part of a disambiguation and clustering strategy?

A closer look at Figure 1 and Tables 1 and 2 provides some insights. Notice that

  • the coordination levels from the XOR and XNOR search modes are by definition disambiguated from one another, forming specific clusters.
  • coordination levels and subsets within these levels are inherently disambiguated from each others.

The traditional AND and OR modes do not have these remarkable functionalities. So it seems reasonable to take advantages of these functionalities for the disambiguation of terms and clustering of documents.

True that XOR and XNOR results are supersets, but these can be further down resolved into individual subsets. To illustrate, consider the terms flu, shot, drinking, and guns.

First we construct a text source from these terms. We can expand the text source by adding synonyms, but for simplicity sake let assume that we don't do that. Since we have 4 terms, we can look at the coordination levels 0, 2, 4 of a XNOR search. Essentially we are XNORing the text source.

Level 0 has the { } subset and Level 4 the {w1, w2, w3, w4} subset. Results from { } can be subtracted by running a NOR search and results from {w1, w2, w3, w4} can be removed by running an AND search.

We can also write conditional if-else statements to selectively remove these subsets from XNOR results, so we end up with subsets only from level 2. These can be further down discriminated with a ranking function or the results post-processed by running specific AND searches.

Although a straightforward strategy, this 'brute force' solution is not easy to implement with long text sources. A far better approach consists in decomposing an XNOR superset with an algorithm similar to the one used by the Match Previews interface. This should provide a landscape of subsets with the corresponding match counts.

Responsive Searches and Answer Set Expansions

This section was added on 1-10-2014 and essentially consists of afterthoughts on the many possibilities of the search modes herein discussed.

As mentioned before, AND is the default mode of most search engines. This mode is also used in a vast portion of the IR literature under the assumption that documents matching all terms of a query tend to be relevant to said query and should be retrieved. Conversely, documents matching some terms of a query are not retrieved even if these are relevant.

To overcome this limitation of AND searches, a search system can be designed to expand in the background a query or answer set or both, impacting precision and recall. This can be done in many different ways, for instance through lookup lists of related terms, a thesaurus, by mining co-occurrence information (e.g., using LSI), or through other auxiliary algorithms.

Expanding a query and answer sets is another area that seems a good fit for XOR and XNOR searches.

Instead of using the default AND mode, a retrieval system can be designed to responsively switch between the XOR and XNOR modes based on the number of search terms submitted. If these amount to an odd number, the system can be instructed to switch to XOR; otherwise to XNOR (probably with the { } subset removed). The result is a responsive search experience.

For all practical purposes, in each case the result will be a superset of documents consisting of AND results expanded with partial matches. This can be realized from Figure 1 and Tables 1 and 2. Effectively, we are expanding the answer set of AND results, at query time, and without invoking additional resources from the information retrieval system.

Still, one might elect to use auxiliary query and answer set expansion methods, but that will be the icing on the cake.

Big Data: The XOR/XNOR Next Frontier

This section was added on 1-11-2014 to explore possible applications with big data sets. Big data (14) is considered the next frontier for all research disciplines. It has an increasing demand in developed economies to the point that governments, companies, and research centers worldwide are dedicating big budgets to its study and deployment.

The IR challenges faced when dealing with big data are far different from those found in traditional collections. It does not seem reasonable to query big data collections in the same traditional AND-way we query small collections. This is an area where implementation of the search modes herein discussed are promising.

As can be realized from the previous section, XOR and XNOR queries essentially are a composite of queries submitted at once to retrieve a superset.

So, instead of querying a big data collection with a batch of individual AND queries (inputs), it seems reasonable to submit their XOR/XNORed responsive versions, reducing the number of inputs and associated query costs. More data is accessible to the end user with less input efforts, increasing the 'velocity' of the search experience. And we must not forget that XOR and XNOR searches can also be used for clustering data sets.

At the time of writing, the use of these search modes with Big Data remains an open question that deserves to be explored.

Yes! More Comments

This section was added on 1-13-2014.

As ACM Fellow and IEEE Fellow and author of Modern Information Retrieval (15), Ricardo Baeza-Yates suggested us, the ideas herein expressed are interesting, but should be followed by validation through conferences like ICTIR. ICTIR is aimed at the dissemination of work "...that demonstrate a high level of research adventure or which break out of the traditional IR paradigms". Similar conferences, like ECIR, are available elsewhere.

While others might feel that the above are 'minor tweaks', and we respectfully disagree with that perception, sometimes a 'minor tweak' is precisely what it takes to improve big machines, great theories, and beautiful frameworks. And we must stop short before delving into Grover's quantum searches (16 - 19), the not-so-distant IR next frontier.

This section was added on 12-22-2016.

The beauty of XOR/XNOR searches as the main points of this article are still relevant, particularly as more researchers are paying attention to quantum searches in spite of some of its drawbacks (20 - 27), some of which are based on mere OR and AND searches—not on the XOR/XNOR search modes.

Final Remarks

Minerazzi supports the XOR and XNOR search modes natively. These modes allow users to run first-, second-, and high-order co-occurrence searches and provide functionalities not available through other search modes. At the time of writing, commercial search engines including some that claim to be based on LSI do not provide full native support for these search modes.

XOR and XNOR searches ('X Searches', for short) present new information retrieval paradigms. It will be interesting to see implementations in diverse areas of information retrieval and data mining, for instance with quantum searches.

Acknowledgements

We would like to thank April Kontostathis (Ursinus College) for providing valuable comments on an early draft of this article and to Mike Berry (University of Tennessee, Knoxville) for helping us to explore possible applications in the area of query disambiguation and clustering.

References

  1. Davies, R. B. (2002). Exclusive OR (XOR) and hardware random number generators.
  2. Kontostathis, A., Pottenger. W. M. (2002). Detecting Patterns in the LSI Term-Term Matrix. IEEE ICDM02 Workshop Proceedings, The Foundation of Data Mining and Knowledge Discovery (FDM02). pp. 243-248. Maebashi, Japan. December 2002.
  3. Kontostathis, A., Pottenger. W. M. (2003). A framework for understanding LSI Performance. Proceedings of ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval-ACMSIGIR MF/IR '03.
  4. Mill, W., Kontostathis, K. (2004). Analysis of the values in the LSI term-term matrix. Technical Report.
  5. Berry, M. (Accessed on 1-8-2014). LSI-Related Publications.
  6. Berry, M. (Accessed on 1-8-2014). Publications - Information Retrieval.
  7. Using latent semantic analysis to improve access to textual information; Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286, Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. & Harshman, R. (1988).
  8. Improving information retrieval using Latent Semantic Indexing. Proceedings of the 1988 annual meeting of the American Society for Information Science. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988).
  9. Wikipedia. (Accessed on 1-8-2014). Latent semantic analysis.
  10. Lemaire, B. Dessus, P. (Accessed on 1-8-2014). Readings in Latent Semantic Analysis for Cognitive Science and Education.
  11. Introduction to Latent Semantic Analysis, Simon Dennis, Tom Landauer, Walter Kintsch, Jose Quesada; University of Colorado.
  12. Wikipedia. (Accessed on 1-8-2014). Disambiguation.
  13. Mihalcea, R. Using (2007) Wikipedia for AutomaticWord Sense Disambiguation. HLT-NAACL 2007: 196-203. Rochester, New York.
  14. Wikipedia. (Accessed on 1-11-2014). Big Data.
  15. Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval Addison Wesley, 2nd Edition.
  16. Grover, L. K. (1996). A fast quantum mechanical algorithm for database search.
  17. Grover, L. K. (1997). Quantum Mechanics helps in searching for a needle in a haystack Phys. Rev. Lett. 79 (1997) 325.
  18. Lavor, C., Manssur, L.R.U., and Portugal, R. (2008). Grover's Algorithm: Quantum Database Search.
  19. Wikipedia. (Accessed on 1-13-2014). Grover's algorithm.
  20. Phys.org (2016). NIST asks public to help future-proof electronic information.
  21. Viamontes, G. F., Markov, I. L., & Hayes, P. (2005). Is Quantum Search Practical?
  22. Phys.org (2005). Data structures influence speed of quantum search in unexpected ways.
  23. Quora (2014). How do you use the Grover quantum search algorithm to find all the solutions to some search query?
  24. Paparo, G. D. & Martin-Delgado, M. A. (2012). Google in a Quantum Network.
  25. Wang, H., Wu, J., Yang, X., Chen, P., & Yi, X. (2014). An enhanced quantum PageRank algorithm integrated with quantum search.
  26. Lu, S., Zhang, Y., & Liu, F. (2013). An efficient quantum search engine on unsorted database.
  27. MIT Technology Review (2011). Quantum PageRank algorithm outperforms classical version.

E. Garcia, PhD
Updated: December 22, 2016
Published: December 21, 2013
Status: Under permanent revisions and improvements.

Know of relevant research links that might be included in this article? Let us know.