Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce


In recent years, the publication of structured data inside the HTML content of Web sites has become a mainstream feature of commercial Web sites. In particular, e-commerce sites have started to add RDFa or Microdata markup based on the schema.org and GoodRelations vocabularies. For many potential uses of this huge body of data, we need to crawl the sites and extract the data from the markup. Unfortunately, a lot of markup resides in very deep branches of the sites, namely in the product detail pages. Such pages are difficult to crawl because of their sheer number and because they often lack links pointing to them. In this paper, we present a small-scale experiment in which we compare the Web pages from a popular Web crawl, Common Crawl, with the URLs in the sitemap files of the respective Web sites. We show that Common Crawl lacks a major share of the product detail pages that hold a significant part of the data, and that an approach as simple as a sitemap crawl yields many more product pages. Based on our findings, we conclude that state-of-the-art crawling strategies need to be rethought in order to cater for e-commerce scenarios.
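
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes that the URLs listed in a site's sitemap file are set against the URLs a crawl holds for that site. The function names, the example shop URLs, and the coverage measure (share of sitemap URLs found in the crawl) are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> set:
    """Return the set of page URLs listed in a sitemap <urlset> document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def crawl_coverage(sitemap: set, crawled: set) -> float:
    """Share of sitemap URLs that also occur in the crawl (0.0 if sitemap is empty)."""
    if not sitemap:
        return 0.0
    return len(sitemap & crawled) / len(sitemap)


# Hypothetical sitemap of a small shop: one category page, three product pages.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://shop.example.com/category/shoes</loc></url>
  <url><loc>http://shop.example.com/product/1001</loc></url>
  <url><loc>http://shop.example.com/product/1002</loc></url>
  <url><loc>http://shop.example.com/product/1003</loc></url>
</urlset>
"""

# Hypothetical URLs that a crawl holds for the same site.
EXAMPLE_CRAWLED = {
    "http://shop.example.com/",
    "http://shop.example.com/category/shoes",
    "http://shop.example.com/product/1001",
}

listed = sitemap_urls(EXAMPLE_SITEMAP)
coverage = crawl_coverage(listed, EXAMPLE_CRAWLED)
missing = sorted(listed - EXAMPLE_CRAWLED)
```

In this toy example the crawl contains the category page but only one of the three product detail pages, so `missing` lists the two deep product URLs that only the sitemap reveals.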

Proceedings of the 6th International Workshop on Consuming Linked Data, in conjunction with the 14th International Semantic Web Conference, CEUR Workshop Proceedings Vol. 1426, ISSN 1613-0073, October 12, 2015, Bethlehem, PA, USA