Large real-world datasets pose greater challenges to the methods under study and are more likely to cover a variety of distributions and patterns. While the most interesting datasets are not available to the public because of commercial interests and ethical considerations, the following list compiles some large, publicly available datasets that are appropriate for research purposes.

Tabular Data

Gaia spacecraft, European Space Agency | map of stars (and more) | tens | billions
OpenStreetMap.org | crowdsourced GPS coordinates | 3 | 2.7 billion
Columbia University & Facebook Connectivity Lab | high-resolution population grids based on satellite imagery | millions | millions
University of California, Irvine | simulated particle detector events related to Higgs bosons | tens | millions
ETH Zurich | point cloud data | 3 | millions

Network Data

arnetminer.org | DBLP citation network data | millions | millions
KAIST university, South Korea | Twitter network data | millions | millions

Relational Data

TPC | TPC-H benchmark (synthetic data)
TPC | TPC-DS benchmark (synthetic data)
paper | Join Order Benchmark
IMDb | movie/series metadata


University of California, Irvine | UC Irvine Machine Learning Repository