Product Classification and Categorization Datasets

This repository includes three datasets for evaluating classification and categorization algorithms. All datasets contain e-commerce data: product IDs, product titles, and the category of each product. They can also be applied to other problems that involve text or short-text mining.

The data originates from three real online electronic stores and product comparison platforms. It was collected by a focused Web crawler developed for this purpose. Each of the three datasets is available in two versions, CSV and XML, and researchers may use whichever version they prefer. The following list provides additional information about each dataset:

  • The first dataset originates from ShopMania, a popular online product comparison platform. It lists tens of millions of products organized in a three-level hierarchy of 230 categories. The two higher levels of the hierarchy include 39 categories, whereas the third, lowest level accommodates the remaining 191 leaf categories. Each product is mapped to exactly one leaf category of this tree. Some of these 191 leaf categories contain millions of products; however, shopmania.com allows only the first 10,000 products of each category to be retrieved. Under this restriction, our crawler collected 313,706 products.
  • The second dataset was collected from PriceRunner, another popular product comparison platform. It includes 35,311 products from 10 categories, provided by 306 different vendors.
  • The third dataset was acquired by crawling the products of 12 categories of Skroutz. It includes 238,170 products supplied by 652 electronic stores.
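
As a quick sanity check after downloading a CSV version, the number of products per category can be tallied and compared against the figures listed above. The following is only a minimal sketch: the column names (product_id, title, category) and the sample rows are assumptions made for illustration, not the actual headers or contents of the files, so adjust them to match the file you downloaded.

```python
import csv
import io
from collections import Counter


def count_products_per_category(csv_file, category_field="category"):
    """Count rows per category label in a CSV file-like object.

    `category_field` is a hypothetical column name; change it to the
    header actually used in the downloaded file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[category_field] for row in reader)


# Tiny in-memory stand-in for one of the CSV files (made-up rows,
# assumed header names).
sample = io.StringIO(
    "product_id,title,category\n"
    "1,Acme 15-inch laptop 8GB,Laptops\n"
    "2,Acme wireless mouse,Mice\n"
    "3,Zeta 13-inch laptop 16GB,Laptops\n"
)

counts = count_products_per_category(sample)
print(counts.most_common())
```

For a real file, pass an open file handle instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where both the path and the encoding are assumptions. The total of all counts should match the number of products stated for that dataset, and for the ShopMania data one would expect no leaf category to exceed the 10,000-product retrieval limit.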

The first dataset was used in [1] to evaluate the proposed classifier. Researchers who use this dataset are kindly asked to cite [1] in their published articles. The other two datasets were employed in [2] for entity matching and clustering tasks; they can also be used in classification and categorization problems. Researchers who use either of these two datasets are kindly asked to cite [2] in their published articles.

[1] L. Akritidis, A. Fevgas, P. Bozanis, "Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles", In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.
[2] L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, "A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles", Artificial Intelligence Review (Springer), pp. 1-44, 2020.

You may download the datasets and access some related kernels by visiting this Kaggle repository.