RASDaGen: Synthetic dataset generator for rank aggregation

RASDaGen is a generator of synthetic datasets that can be used in rank aggregation research. The user determines at least four parameters:

  • the number of queries to be created,
  • the number of voters who submit a preference list for each query,
  • the length of the preference lists (i.e. how many elements they include), and
  • the size of the pool of available elements (i.e. how many elements exist).

Depending on these parameters, note that:

  • if the size of the pool of available elements is equal to the preference list length, then RASDaGen generates full lists (also called permutations of the available elements), and
  • if the size of the pool of available elements is larger than the preference list length, then RASDaGen generates partial lists (i.e. lists that do not include all the available elements; some of them will be missing).
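
The distinction between full and partial lists can be illustrated with a minimal sketch. It is not taken from RASDaGen's source: the function name make_preference_list, the use of integer element identifiers 0..pool_size-1, and the shuffle-then-truncate strategy are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not RASDaGen's API): draws one preference list of
// `list_length` distinct element ids from a pool of ids 0..pool_size-1.
std::vector<int> make_preference_list(std::size_t pool_size,
                                      std::size_t list_length,
                                      std::mt19937& rng) {
    assert(list_length <= pool_size);  // a list cannot be longer than the pool

    std::vector<int> pool(pool_size);
    std::iota(pool.begin(), pool.end(), 0);       // fill with 0, 1, ..., pool_size-1
    std::shuffle(pool.begin(), pool.end(), rng);  // uniformly random order

    // Keep only the first list_length elements. When list_length equals
    // pool_size this is a no-op and the result is a full list (a permutation);
    // otherwise the result is a partial list with some elements missing.
    pool.resize(list_length);
    return pool;
}
```

Under these assumptions, make_preference_list(5, 5, rng) yields a permutation of all five elements, while make_preference_list(10, 4, rng) yields four distinct elements out of ten.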

Output files

RASDaGen generates one output file per voter. Each file contains the voter’s preference lists for all queries, placed one after the other. The output files use a TSV (tab-separated values) format that is identical to the one used by the Web Tracks of TREC. More specifically, each row has six columns:

  • the first column is the topic number,
  • the second column is unused and is always "Q0",
  • the third column is the element identifier,
  • the fourth and fifth columns contain the rank (integer) and the score (floating point) of each element, respectively. The scores are in descending (non-increasing) order, and
  • the sixth column shows the dataset name.
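
As a rough illustration of the six-column layout, the sketch below formats one preference list as tab-separated rows. The Entry struct, the format_list function, the sample identifiers, and the choice of 1-based ranks are assumptions for this example and are not taken from RASDaGen's code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one ranked element (names are illustrative only).
struct Entry {
    std::string element_id;
    double score;
};

// Formats one preference list as six tab-separated columns per row:
// topic, "Q0", element id, rank, score, dataset name.
std::string format_list(int topic, const std::vector<Entry>& list,
                        const std::string& dataset) {
    std::ostringstream out;
    for (std::size_t i = 0; i < list.size(); ++i) {
        // Scores must be non-increasing from the top of the list downwards.
        assert(i == 0 || list[i].score <= list[i - 1].score);
        out << topic << '\t' << "Q0" << '\t' << list[i].element_id << '\t'
            << (i + 1) << '\t'  // assumption: ranks start at 1
            << list[i].score << '\t' << dataset << '\n';
    }
    return out.str();
}
```

For a two-element list with scores 0.75 and 0.5 on topic 1, this produces two rows whose fourth column holds ranks 1 and 2 and whose fifth column holds the scores.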

Relevance judgments

RASDaGen also creates one more output file that contains, for each query, the relevance judgments for every element in the pool. An element may be assigned a completely random relevance score. Alternatively, the user may specify a bias parameter that takes into account the rankings of the elements in the input lists and adjusts the probability that each element is relevant to a query accordingly. For instance, with biased assignment, a highly ranked element is more likely to be relevant to a query than a lower-ranked one. The bias factor controls how large the differences among these probabilities are.
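
The text above does not state the formula RASDaGen uses, so the following is only one possible way to model biased assignment. It assumes binary relevance, a bias value in [0, 1], and a linear blend between a rank-independent probability of 0.5 and a rank-based one; the function names and the use of an element's mean rank across the voters' lists are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical model (not RASDaGen's actual formula): probability that an
// element is relevant, given its mean rank over the voters' lists.
//   bias = 0 -> rank is ignored, every element has probability 0.5
//   bias = 1 -> probability falls linearly from 1 (top rank) to 0 (last rank)
double relevance_probability(double mean_rank, std::size_t list_length,
                             double bias) {
    const double n = static_cast<double>(list_length);
    assert(bias >= 0.0 && bias <= 1.0);
    assert(n >= 1.0 && mean_rank >= 1.0 && mean_rank <= n);

    // 1.0 for the top-ranked element, 0.0 for the bottom-ranked one.
    const double rank_quality = (n > 1.0) ? (n - mean_rank) / (n - 1.0) : 1.0;

    // Linear blend: a larger bias widens the gap between high and low ranks.
    return (1.0 - bias) * 0.5 + bias * rank_quality;
}

// Draws a binary relevance judgment with the given probability.
bool draw_relevance(double probability, std::mt19937& rng) {
    return std::bernoulli_distribution(probability)(rng);
}
```

In this sketch, a bias of 0 reproduces purely random assignment, while larger values increasingly favour highly ranked elements. Elements that appear in no partial list have no mean rank and would need separate handling, which is omitted here.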

Downloads

RASDaGen is written in C++. The library is available for download through its official GitHub repository.