Identifying the Influential Bloggers in a Community: The TUAW dataset

In this paper we introduced MEIBI and MEIBIX, two metrics used to identify the most influential bloggers of a community blog.

The dataset we used is a crawl of The Unofficial Apple Weblog (TUAW) performed on December 7th, 2008. Here we provide a MySQL dump of that dataset. To use it, create a MySQL database (e.g. tuaw_data) and import the SQL dump into it.
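The import can also be scripted. The following is a minimal sketch in Python that drives the standard mysql command-line client; it is not part of the dataset distribution. The dump file name (tuaw.sql) and the MySQL user (root) are assumptions, so replace them with your own values.

```python
import subprocess


def build_import_commands(database="tuaw_data", user="root"):
    """Build the two mysql client invocations: create the database, then load the dump."""
    create = ["mysql", "-u", user, "-p", "-e",
              f"CREATE DATABASE IF NOT EXISTS `{database}`"]
    load = ["mysql", "-u", user, "-p", database]
    return create, load


def import_dump(dump_path, database="tuaw_data", user="root"):
    """Create the database and feed the SQL dump to the mysql client on stdin."""
    create, load = build_import_commands(database, user)
    subprocess.run(create, check=True)
    with open(dump_path, "rb") as dump:
        # Equivalent to: mysql -u root -p tuaw_data < dump_path
        subprocess.run(load, stdin=dump, check=True)


# Example call (hypothetical file name):
# import_dump("tuaw.sql")
```

Because of the -p flag, the client asks for the password once per command, i.e. twice in total.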

The dump consists of four tables:

  • bloggers: contains the 51 bloggers of TUAW.
  • linkage: contains linking information (i.e. the incoming links) of the TUAW posts. The table stores 53,575 records.
  • posts: contains the 17,831 TUAW posts accompanied by their metadata (blogger ID, publication date, number of comments, etc.).
  • visited: stores the URLs of all the TUAW pages visited by our crawler. In total, our crawler visited 162,012 Web pages.
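After importing, the row counts listed above offer a quick sanity check. The sketch below is one possible way to run it in Python through the mysql command-line client. It relies only on the four table names; the database name, the MySQL user, and the expectation that visited holds exactly one row per crawled page are assumptions.

```python
import subprocess

# Row counts stated in the table descriptions above. The figure for `visited`
# assumes one row per crawled page, which the text implies but does not state.
EXPECTED_ROWS = {
    "bloggers": 51,
    "linkage": 53575,
    "posts": 17831,
    "visited": 162012,
}


def count_rows_cli(table, database="tuaw_data", user="root"):
    """Return the row count of `table` using the mysql command-line client."""
    result = subprocess.run(
        ["mysql", "-u", user, "-p", "-N", "-B", "-e",
         f"SELECT COUNT(*) FROM `{table}`", database],
        stdout=subprocess.PIPE, text=True, check=True,
    )
    return int(result.stdout.strip())


def check_row_counts(count_rows=count_rows_cli, expected=EXPECTED_ROWS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    mismatches = {}
    for table, want in expected.items():
        got = count_rows(table)
        if got != want:
            mismatches[table] = (want, got)
    return mismatches


# Example call: an empty dict means every table matched.
# print(check_row_counts())
```

The count_rows parameter accepts any callable that maps a table name to a row count, so the same check can be run through a Python MySQL driver instead of the command-line client.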

You can download the dataset by clicking here (8.26 MB).

You may also want to check the CSV version of the dataset on Kaggle.

If you need the dataset in another format (e.g. XML or JSON), please feel free to contact me.

Note: Researchers who use this dataset are kindly asked to cite the following article in their work:
L. Akritidis, D. Katsaros, P. Bozanis, "Identifying Influential Bloggers: Time Does Matter", In Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 76-83, 2009.