Posting Lists Generator
This tool generates posting lists from unformatted, plain text.
Instructions
- This tool generates posting lists from unformatted, plain text instead of from a document collection.
- Posting lists are data structures that store term information.
- The posting lists generated by our tool are of the form term: {frequency value, array of positions} where:
- term = unique term from a piece of text.
- frequency value = number of occurrences of a term.
- array of positions = array of positions occupied by a term in the piece of text.
- Term positions start at 0; i.e. the first term occupies position 0, the second position 1, and so forth.
- Input text should be in English; only the first 100,000 characters are processed.
- Tokenization (text fragmentation, punctuation removal, and filtering) is limited to the following sequence of steps:
- Terms joined by hyphens, underscores, pipes, or periods (full stops) are split into separate terms.
- Contractions are truncated.
- Non-letter characters are removed.
- Stopwords are retained.
- Surviving terms are lowercased.
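The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: the names tokenize, posting_lists, and MAX_CHARS are hypothetical, "truncated" is assumed to mean that an apostrophe and the letters after it are dropped (so "don't" becomes "don"), and "letters" is assumed to mean the ASCII letters A-Z and a-z.

```python
import re

MAX_CHARS = 100_000  # assumed cutoff: only the first 100,000 characters are read


def tokenize(text):
    """Return the list of terms in reading order (position 0 first)."""
    text = text[:MAX_CHARS]
    # Step 1: split terms joined by hyphens, underscores, pipes, or periods.
    text = re.sub(r"[-_|.]", " ", text)
    tokens = []
    for raw in text.split():
        # Step 2 (assumed meaning of "truncated"): drop an apostrophe that
        # follows a letter, together with the letters after it.
        raw = re.sub(r"(?<=[A-Za-z])['’][A-Za-z]*", "", raw)
        # Step 3: remove every remaining non-letter character.
        raw = re.sub(r"[^A-Za-z]", "", raw)
        # Step 4: stopwords are retained, so nothing is filtered here.
        # Step 5: lowercase whatever survived.
        if raw:
            tokens.append(raw.lower())
    return tokens


def posting_lists(text):
    """Map each unique term to its frequency and its array of positions."""
    postings = {}
    for position, term in enumerate(tokenize(text)):
        entry = postings.setdefault(term, {"frequency": 0, "positions": []})
        entry["frequency"] += 1
        entry["positions"].append(position)
    return postings
```

For example, posting_lists("To be, or not to be") maps "to" to frequency 2 with positions [0, 4], "be" to frequency 2 with positions [1, 5], "or" to frequency 1 with position [2], and "not" to frequency 1 with position [3].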
Who can use it?
- Data miners, teachers, students, or anyone interested in constructing posting lists and inverted indexes.
Suggested Exercises
- Positional inverted indexes include document ID information in their posting lists. Suggest a routine that converts this tool into an inverted index generator.
- In Robertson's Okapi Best Match 25 Model (BM25), term weights do not grow linearly with term frequency but saturate after a few occurrences. Suggest a routine that would allow our tool to compute BM25 weights.
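As one possible starting point for both exercises (not a reference answer), the sketch below assumes each document has already been reduced to a list of terms. The function names build_positional_index and bm25_weight are hypothetical, the defaults k1 = 1.2 and b = 0.75 are common choices rather than values taken from this tool, and the IDF uses the non-negative variant ln((N - df + 0.5) / (df + 0.5) + 1), which is one of several in use.

```python
import math


def build_positional_index(docs):
    """docs: {doc_id: [term, ...]} -> {term: {doc_id: [positions]}}."""
    index = {}
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            # Each posting now records the document ID as well as positions.
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight of one term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: collection size; doc_len and avg_doc_len: lengths in
    terms. k1 and b are assumed defaults.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The ratio tf / (tf + length_norm) flattens out as tf grows, so the
    # weight approaches idf * (k1 + 1) instead of growing linearly.
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```

In this sketch, tf is the length of a term's position list for a document and df is the number of document IDs stored under that term, so both can be read directly from the index.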