XML & Feed URLs Extractor
Extract URLs from XML files and Web feeds, the easy way.
Instructions
- This tool was originally designed to extract URLs from Web feeds. However, as of 06-23-2018, the tool also extracts URLs from XML files so we changed its name to indicate this. The content of this page is still valid.
- The above change broadens the scope of the tool. For instance, you can now extract URLs from XML site maps (e.g., sitemaps.xml) and similar files.
- To use it, submit the full URL of the file or feed you are interested in, including its http(s) scheme.
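As a rough illustration of the last instruction, the following sketch checks whether a submitted address is a full URL that includes its http(s) scheme. The helper name `is_full_url` is hypothetical and is not part of the tool; it only shows one way such a check could be written with the Python standard library.

```python
from urllib.parse import urlsplit

def is_full_url(candidate):
    """Return True if candidate has an http(s) scheme and a host (hypothetical helper)."""
    parts = urlsplit(candidate.strip())
    # urlsplit lowercases the scheme, so "HTTPS://..." is accepted as well.
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_full_url("https://www.google.com/sitemap.xml"))  # True
print(is_full_url("www.google.com/sitemap.xml"))          # False: scheme is missing
```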
What is computed?
- Because this tool parses documents in the XML, RSS, Atom, and RDF formats, it is suitable for extracting URLs from files other than Web feeds. Candidate files include site maps, inventory files, and similar files, as long as they have the .xml, .rss, .atom, or .rdf extension.
- A web feed is a data format used for providing users with frequently updated content. Blog feeds aimed at online communities impart a social component to the technology. In general, web syndication technology is a form of social communication (Wikipedia, 2017a; 2017b) that is rich in URLs, waiting to be extracted and mined.
- This tool is derived from a previous one: The XML & Web Feed Flattener.
- Once a document with the above extensions is flattened, the tool extracts all of its URLs, effectively working as a URL extractor.
- As of 05-20-2018, the tool converts URLs to links, removes duplicate URLs that differ only by a hash (#) fragment (e.g., links to a comments section), and lets users exclude URLs that point to images.
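The steps described above (flatten the document, extract its URLs, deduplicate those that differ only by a hash fragment, and optionally exclude images) can be sketched as follows. This is a minimal illustration and not the tool's actual code: the function name `extract_urls`, the URL pattern, and the list of image extensions are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Assumed pattern and extension list; the tool's real rules may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def extract_urls(xml_text, exclude_images=False):
    """Flatten an XML document and return its unique URLs in the order encountered."""
    root = ET.fromstring(xml_text)
    # Flatten: gather every text node and attribute value into one list of strings.
    chunks = []
    for element in root.iter():
        if element.text:
            chunks.append(element.text)
        chunks.extend(element.attrib.values())
        if element.tail:
            chunks.append(element.tail)
    seen = set()
    urls = []
    for chunk in chunks:
        for match in URL_PATTERN.findall(chunk):
            url, _fragment = urldefrag(match)  # drop any "#..." suffix
            if exclude_images and url.lower().endswith(IMAGE_EXTENSIONS):
                continue
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

# Made-up feed used only for this example.
SAMPLE_FEED = """<rss version="2.0"><channel>
<link>https://example.com/</link>
<item><link>https://example.com/post-1</link>
<comments>https://example.com/post-1#comments</comments>
<enclosure url="https://example.com/img/cover.png" type="image/png"/></item>
</channel></rss>"""

print(extract_urls(SAMPLE_FEED, exclude_images=True))
# ['https://example.com/', 'https://example.com/post-1']
```

In this sample, the comments link collapses into the post URL once its fragment is removed, and the enclosure is skipped because it ends in an image extension.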
Who can use this tool?
- Anyone who needs to extract URLs from XML files or Web feeds.
Suggested Exercises
- Extract URLs from the site maps of Google at
- https://www.google.com/sitemap.xml (returns 21 records, all .xml site maps).
- http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml (warning: returns 50,000 records, all .gz compressed files).
These two URLs were obtained from https://www.google.com/robots.txt. Repeat this exercise with other websites.
- Extract feed URLs from a news source like Google, Bing, MIT News, and similar sources.
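For readers who want to repeat the first exercise programmatically, the sketch below shows one possible way to find site map addresses in a robots.txt file and then list the entries of a site map. It runs on made-up sample text rather than live downloads, and the function names `sitemaps_from_robots` and `locs_from_sitemap` are hypothetical; they do not describe how the tool itself works.

```python
import xml.etree.ElementTree as ET

def sitemaps_from_robots(robots_text):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

def locs_from_sitemap(xml_text):
    """Return the text of every <loc> element, whatever its XML namespace."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]

# Made-up inputs used only for this example.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /search
Sitemap: https://www.example.com/sitemap.xml
"""

SAMPLE_SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.example.com/a/sitemap.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/b/sitemap.xml</loc></sitemap>
</sitemapindex>"""

print(sitemaps_from_robots(SAMPLE_ROBOTS))
# ['https://www.example.com/sitemap.xml']
print(locs_from_sitemap(SAMPLE_SITEMAP_INDEX))
# ['https://www.example.com/a/sitemap.xml', 'https://www.example.com/b/sitemap.xml']
```

Site maps served as .gz files, like those in the second exercise, would need to be decompressed before parsing; that step is left out of this sketch.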
References
Feedback
Contact us with any suggestions or questions regarding this tool.