What is Minerazzi?
Minerazzi is a platform for building miners.
- Mission Statement: To turn searchers into data miners, big data predators, and collection curators.
- Vision Statement: To develop a new class of productivity search engines.
What is a miner?
A miner is a topic-specific search engine with a novel search paradigm: letting users recrawl and mine search results while they search. The result is a new generation of more efficient and productive search engines that go beyond lists of links and text snippets.
Miners can be built from individual web pages (bottom-up) or from a pre-existing directory or sitemap (top-down), turning that directory or sitemap into a data mining machine.
Because each miner is a microindex loaded with data extraction tools, users can deploy these to enhance or build their own curated collections.
What is a microindex?
We define a microindex as a small collection of primary URLs about a given topic or knowledge domain. Recrawling these allows users to discover new, secondary URLs with fresh and related content.
In many cases, a single primary record lets users discover dozens or hundreds of secondary records with front- and back-end information waiting to be consumed. The total number of primary and secondary URLs defines the reach of a microindex.
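As a rough illustration of the reach definition above, the following Python sketch counts distinct primary and secondary URLs. The function name `microindex_reach`, the dictionary layout, and the choice to count each URL only once are assumptions made for this example, not a description of how Minerazzi computes reach.

```python
def microindex_reach(recrawl_results):
    """Count distinct primary and secondary URLs.

    recrawl_results maps each primary URL to the secondary URLs found
    by recrawling it (a hypothetical data layout for this sketch).
    """
    primary = set(recrawl_results)
    secondary = set()
    for found in recrawl_results.values():
        secondary.update(found)
    # Assumption: a URL that is already primary is not counted again
    # as secondary, and duplicates are counted once.
    secondary -= primary
    return {
        "primary": len(primary),
        "secondary": len(secondary),
        "reach": len(primary) + len(secondary),
    }


sample = {
    "https://example.edu/": [
        "https://example.edu/labs",
        "https://example.edu/news",
    ],
    "https://example.org/": [
        "https://example.org/about",
        "https://example.edu/news",  # duplicate secondary URL
        "https://example.edu/",      # already a primary URL
    ],
}
print(microindex_reach(sample))
```

With this sample, two primary URLs lead to three distinct secondary URLs, for a reach of five.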
Since December 1, 2014, we have been the first search solution that allows users to recrawl their own search results. This is done with several complementary tools:
- A URL crawler. This tool allows users to qualitatively extract resources from multiple markup fields and elements.
- A link crawler. This tool allows users to quantitatively extract resources by walking the link structure of sites.
- Dozens of element-specific crawling tools for discovering front- and back-end information.
- Dozens of extraction/mining tools and tutorials.
To do the above, users only need to access one or a few links from a microindex or while recrawling. To illustrate, submit a query and then run the tool located within a search result. Try any of the following queries and miners:
[ mit ]
[ cornell ], etc
[ indymedia ]
[ investigative reporting ], etc
[ background checks ]
[ fbi ], etc
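To make the idea of extracting resources from multiple markup fields and elements more concrete, here is a minimal Python sketch that uses only the standard library. The class `UrlCollector`, the function `extract_urls`, and the attribute list `URL_ATTRS` are hypothetical names chosen for this example; the actual Minerazzi URL crawler may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes assumed to carry URLs. This list is illustrative only.
URL_ATTRS = {"href", "src", "action", "data-src", "poster"}


class UrlCollector(HTMLParser):
    """Collect URLs from any tag, not just from <a> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []  # (tag, attribute, absolute URL) tuples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the page URL.
                self.urls.append((tag, name, urljoin(self.base_url, value)))


def extract_urls(html, base_url):
    collector = UrlCollector(base_url)
    collector.feed(html)
    return collector.urls


page = """
<html><head><link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script></head>
<body><a href="about.html">About</a>
<img src="https://cdn.example.com/logo.png">
<form action="/search"></form></body></html>
"""
for tag, attr, url in extract_urls(page, "https://example.com/index.html"):
    print(tag, attr, url)
```

Keeping the tag and attribute next to each URL lets a user see where a resource came from, for example a stylesheet link versus a form action.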
Why is recrawling so important?
Recrawling lets users walk the link graph of a site and discover hidden or fresh goldmines that search engines might not have discovered. In addition, recrawling exposes users to new content and involves them in learning through discovery.
By recrawling search results, users can build curated collections, self-guide investigative work, or gather link intelligence from sites, directories, blogs, forums, or social networks.
At this time, Minerazzi recrawls files with the most common formats (.php, .asp, .aspx, .html, .htm, .js, etc.). For recrawling to be useful, however, the content of a file should not be obfuscated or blocked.
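Walking the link graph of a site can be sketched as a breadth-first traversal. The Python example below is a simplified illustration under several assumptions: pages are served from an in-memory dictionary instead of the network, only same-host links are followed, paths without an extension are treated as pages, and the names `walk_links`, `LinkParser`, and `is_recrawlable` are invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Extensions taken from the list in the text above.
RECRAWLABLE = {".php", ".asp", ".aspx", ".html", ".htm", ".js"}


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_recrawlable(url):
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    # Assumption: extensionless paths such as "/" are treated as pages.
    return ext == "" or ext in RECRAWLABLE


def walk_links(start_url, fetch, max_pages=50):
    """Breadth-first walk of same-host links.

    `fetch` is any callable that returns the HTML for a URL, or None.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # blocked or missing content cannot be recrawled
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            same_host = urlparse(target).netloc == host
            if target not in seen and same_host and is_recrawlable(target):
                seen.add(target)
                queue.append(target)
    return visited


# In-memory stand-in for a small site (hypothetical content).
SITE = {
    "https://example.com/": '<a href="a.html">A</a> <a href="b.php">B</a> '
                            '<a href="report.pdf">PDF</a>',
    "https://example.com/a.html": '<a href="/c.htm#top">C</a> '
                                  '<a href="https://other.example.net/x.html">ext</a>',
    "https://example.com/b.php": '<a href="a.html">A again</a>',
    "https://example.com/c.htm": "no links here",
}
print(walk_links("https://example.com/", SITE.get))
```

In this sample, the PDF and the external link are skipped, and the page `c.htm` is only found by recrawling `a.html`, which is the kind of secondary discovery described above.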
Why microindexing?
Don't underestimate or misread the term micro in microindexing. Microindexing is a curation strategy that simplifies the building of large third-party collections. If you already have a curated collection, building a miner out of it allows you to mine your carefully collected resources. This can be done without investing in expensive web scraping services or human resources.
When querying a microindex, users can find relevant results with fewer keywords. For instance, if a microindex is about a particular viral disease, searching for [ vaccine ], [ treatment ], or [ remedies ] should return relevant results with less typing. If you are a researcher or librarian, you will love microindexing.
In general, Minerazzi turns searching into a data mining activity. This makes more sense than limiting the search experience to browsing through zillions of cached records or staring at a list of links. The problem with the latter is that those records are frequently either outdated or irrelevant, not to mention that users essentially become passive spectators.
What can you do with Minerazzi?
Build a search engine about a given topic like news, music, health, legal, human resources...
Index hard-to-find documents. Help others to find what they want. Be a leader instead of a follower.
Extract contact information and Web Intelligence from search result pages. Be a data miner.
Teachers: Build a search engine about disciplines, journals, lecture materials...
Researchers: Deploy a search engine about company resources, projects, tools...
Business Intelligence: Collect network and user contact information, keywords...
Anyone: Build search engines about popular topics like sports, shopping, social styles, recipes, games...
The Minerazzi Difference
A User-Centered Experience
Minerazzi places users at the center of the action. Instead of reducing their search experience to staring at a list of results, Minerazzi allows them to interact with those results. Users become paparazzi of information, chasing down and mining data, hence the name of our platform.
While searching with Minerazzi, users can extract all sorts of contact information (phone numbers, email addresses, ...), query-driven data (keywords, tracking codes, ...), and server configuration records. The data gathered can then be used for any marketing or research purpose.
From Searching to Mining
This approach, wherein users are engaged in learning through discovery, data gathering, and analysis, is a natural evolution of the traditional concept of searching.
Doing More with Less Effort
Moreover, with Minerazzi users can:
- Reformulate queries by clicking on keywords. No need to waste time with keyword brainstorming sessions.
- Modify search modes by clicking on match counts. No need to memorize or manipulate search commands.
- Accept URL submissions by regular email, thus from practically anywhere.
- Submit queries that require diacritics like tildes and accents.
As an added feature, Minerazzi allows you to follow query-relevant sites across the top social networks and search engines.
Because Minerazzi is loaded with about 40 extraction tools, users can extract and mine all sorts of data while searching. A sample of these tools is given below.
- News - Access news headlines relevant to a miner.
- Word Statistics - Extract statistics and candidate keywords.
- Directory Exploits - Find exploits through robots.txt files.
- Configuration Exploits - Test possible misconfigurations.
- Email Tool - Extract email addresses.
- Phone Tool - Extract phone numbers.
- Geolocation Tool - Extract geolocation data.
- URL Tool - Extract all kinds of URLs, not just from links.
- Image Tool - View images from web sites.
- CSS Tool - Get external and internal .css files.
- Colors - Extract colors from external and internal .css files.
- HTTP Headers - Examine server configuration headers.
- Mail Exchangers - Spot remote email servers.
- DNS Checks - Check available DNS services.
- DNS Records - Determine DNS records.
- Web Plugins - Identify third-party tracking codes.
- Source Codes - Read file source codes.
- Meta Tags - Get OpenGraph, Twitter, DC, traditional tags.
- and a lot more.
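As an illustration of what tools like the Email Tool and Phone Tool do, the following Python sketch pulls email addresses and phone numbers out of page text with regular expressions. The patterns and the function name `extract_contacts` are assumptions made for this example: the email pattern is deliberately loose, the phone pattern only covers ten-digit North American style numbers, and neither is claimed to match what Minerazzi uses.

```python
import re

# Loose email pattern; real-world address syntax is broader than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: ten-digit numbers such as (787) 555-0100 or 787.555.0199.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def extract_contacts(text):
    """Return de-duplicated emails and phone numbers in order of appearance."""
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    phones = list(dict.fromkeys(PHONE_RE.findall(text)))
    return {"emails": emails, "phones": phones}


text = (
    "Contact: info@example.org or sales@example.com. "
    "Call (787) 555-0100 or 787.555.0199. Duplicate: info@example.org"
)
print(extract_contacts(text))
```

Using `dict.fromkeys` removes repeated matches while keeping the order in which they were found on the page.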
Without a doubt, our platform induces users to spend more time researching and mining instead of merely searching.
New Search Paradigms
Minerazzi is the first search engine that allows users to preview the total number of matches and nonmatches from all available search modes.
This is achieved through our unique Match Previews interface. The interface makes multimodal searches possible, reduces query costs, and helps users adopt a search strategy based on self-generated feedback.
Our Match Previews interface also provides full native support for X Searches. These are searches based on the XOR and XNOR search modes.
When combined with other IR algorithms and techniques (LSI, LDA, semantics, ...), these search modes provide new information retrieval paradigms. If you are into research and data mining, you will love X Searches.
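The search modes and match previews described above can be sketched for a two-term query as follows. This Python example assumes the standard Boolean definitions (XOR matches documents containing exactly one of the two terms, XNOR matches documents containing both or neither) and a naive whitespace tokenizer. The function names `search_modes` and `match_preview`, the sample documents, and the restriction to two terms are all assumptions of this sketch, not a description of the Match Previews interface.

```python
def search_modes(docs, term_a, term_b):
    """Return the matching document ids for each search mode."""
    modes = {"AND": [], "OR": [], "XOR": [], "XNOR": []}
    for doc_id, text in docs.items():
        words = set(text.lower().split())  # naive tokenizer
        a, b = term_a in words, term_b in words
        if a and b:
            modes["AND"].append(doc_id)
        if a or b:
            modes["OR"].append(doc_id)
        if a != b:   # exactly one of the two terms
            modes["XOR"].append(doc_id)
        if a == b:   # both terms or neither term
            modes["XNOR"].append(doc_id)
    return modes


def match_preview(docs, term_a, term_b):
    """Return (matches, nonmatches) per mode, without listing results."""
    total = len(docs)
    hits = search_modes(docs, term_a, term_b)
    return {mode: (len(ids), total - len(ids)) for mode, ids in hits.items()}


docs = {
    1: "flu vaccine trial results",
    2: "antiviral treatment guidelines",
    3: "vaccine and treatment overview",
    4: "outbreak map",
}
print(match_preview(docs, "vaccine", "treatment"))
# {'AND': (1, 3), 'OR': (3, 1), 'XOR': (2, 2), 'XNOR': (2, 2)}
```

Seeing the counts for every mode before choosing one is what lets a user pick a search strategy from self-generated feedback, for example switching from OR to XOR to drop documents that mention both terms.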
Goodbye One-Way Searching
Did you realize that the days of one-way, machine-centered searching and staring at lists of search results ended with the last century? Two-way, user-centered searches are here to stay.
Why keep sleeping with the past or teaching outdated textbook stuff? Be part of new information retrieval paradigms.
You are in the right place at the right time. As we make the platform widely available, there will surely be bumps along the way and things to improve.
Let's grow and learn together.
You may contact us for any business inquiries or general questions at firstname.lastname@example.org.