Identifying the Productive and Influential Bloggers in a Community: The Techcrunch dataset

In this paper we introduced BP-Index and BI-Index, two metrics used to identify the most productive and influential bloggers in a community blog.

One of the datasets we used is a crawl of the Techcrunch blog community performed on April 2, 2010. Here we provide copies of the MySQL data files for this dataset. Download the compressed database files and extract them into the data/ directory of your MySQL installation. You should then be able to access the Techcrunch database immediately through your MySQL management interface.

The database consists of four tables:

  • authors: contains the 107 bloggers of Techcrunch.
  • inlinks: contains linking information (i.e. the incoming links) of the Techcrunch posts. The table stores 193,808 records.
  • posts: contains the 19,464 Techcrunch posts accompanied by their metadata (blogger ID, publication date, number of comments, etc.).
  • comments: This table stores all the comments made to the Techcrunch posts (746,561 records).
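
Once the database is installed, a quick way to confirm that the extraction worked is to compare the row counts of the four tables with the figures listed above. The following is a minimal sketch, not part of the original dataset tooling. It assumes a Python DB-API 2.0 connection object (for example one returned by a MySQL driver such as PyMySQL); the helper names count_rows and report_mismatches are hypothetical and introduced here only for illustration.

```python
# Row counts quoted in the table descriptions above.
EXPECTED_ROWS = {
    "authors": 107,
    "inlinks": 193808,
    "posts": 19464,
    "comments": 746561,
}


def count_rows(conn, tables=EXPECTED_ROWS):
    """Return {table: row count} using any DB-API 2.0 connection (hypothetical helper)."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed dict above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def report_mismatches(counts, expected=EXPECTED_ROWS):
    """Return {table: (found, expected)} for every table whose count differs."""
    return {
        table: (counts.get(table), n)
        for table, n in expected.items()
        if counts.get(table) != n
    }
```

With PyMySQL, for instance, the connection could be opened with pymysql.connect(host=..., user=..., password=..., database=...), where the database name depends on how the extracted files appear in your installation. An empty dictionary returned by report_mismatches(count_rows(conn)) indicates that all four tables hold the expected number of records.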

You can download the dataset by clicking here (126.4 MB).

You may also want to check a CSV version of the dataset on Kaggle.

If you need the dataset in another format (e.g. XML or JSON), please feel free to contact me.

Note: Researchers who use this dataset are kindly asked to cite the following article in their work.
L. Akritidis, D. Katsaros, P. Bozanis, "Identifying the Productive and Influential Bloggers in a Community", IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 41, no. 5, pp. 759-764, 2011.