Haakon's weblog

Generating Anscombe's data

Anscombe's Quartet refers to four 2D data sets that share almost exactly the same descriptive statistics despite looking completely different when plotted. They all have the same means, variances, correlations, linear regression lines and \(R^2\). I came upon them again and started wondering: how many such data sets are there?
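To make the claim concrete, here is a small sketch that computes these statistics for the first two data sets of the quartet, using only the Python standard library. The `summary` helper is my own name for illustration, and the numbers are the published values from Anscombe's paper as I remember them, so double-check them against a trusted source before relying on them.

```python
from statistics import fmean

# First two data sets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summary(x, y):
    """Return means, sample variances, correlation and the least-squares line."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }

s1, s2 = summary(x, y1), summary(x, y2)
for key in s1:
    print(f"{key:>9}: {s1[key]:.3f}  {s2[key]:.3f}")
```

The two columns agree to about two decimal places, even though the first set looks like a noisy line and the second like a smooth curve.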

This is a classic inverse problem: given the descriptive statistics, generate data that reproduces them. The same problem shows up when anonymizing data. If you give me an anonymized data set, it may be possible to de-anonymize it through optimization, especially if a few other data points can be found.
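As a rough illustration of the inverse direction, the sketch below builds a data set that hits a given mean, variance, regression line and correlation. Note that it swaps in a direct linear-algebra construction rather than optimization: it draws random x values and random residuals, then rescales them so the statistics come out right. The function name `generate`, its parameters, and the target values (the rounded statistics usually quoted for the quartet) are all my own assumptions.

```python
import random
from statistics import fmean

def center(v):
    """Subtract the mean from every element."""
    m = fmean(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate(n, mean_x, var_x, slope, intercept, r, seed=0):
    """Return (x, y) with n >= 3 points matching the given sample statistics."""
    rng = random.Random(seed)
    # Random x, rescaled to the target sample variance (still centered at 0).
    x0 = center([rng.gauss(0, 1) for _ in range(n)])
    scale = (var_x * (n - 1) / dot(x0, x0)) ** 0.5
    xc = [a * scale for a in x0]
    # Random residuals, made orthogonal to the constant vector and to x,
    # so adding them does not move the least-squares line.
    e0 = center([rng.gauss(0, 1) for _ in range(n)])
    proj = dot(e0, xc) / dot(xc, xc)
    e = [a - proj * b for a, b in zip(e0, xc)]
    # r^2 = SSR / (SSR + SSE), so the residuals must have
    # SSE = SSR * (1 - r^2) / r^2.
    ssr = slope ** 2 * dot(xc, xc)
    sse = ssr * (1 - r ** 2) / r ** 2
    k = (sse / dot(e, e)) ** 0.5
    x = [a + mean_x for a in xc]
    y = [intercept + slope * a + k * b for a, b in zip(x, e)]
    return x, y

# Targets: the rounded statistics usually quoted for Anscombe's quartet.
x, y = generate(11, mean_x=9.0, var_x=11.0, slope=0.5, intercept=3.0, r=0.816, seed=1)
```

Every seed gives a different data set with the same statistics, so without further constraints (integer values, a bounded range, a particular shape) there is no shortage of solutions. Those extra constraints are where optimization becomes useful.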

It is trickier than solving it straightforwardly, since most solutions are symmetric: there are 10! permutations of each solution that look different but describe the same data. We can add symmetry-breaking constraints that say that x should be descending.
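The idea behind the symmetry-breaking constraint can be sketched outside any solver: reordering the points does not change the data set, so we pick one canonical ordering (descending x) and only accept that one. The `canonical` helper and the example points below are made up for illustration.

```python
import math
import random

def canonical(points):
    """Sort (x, y) pairs by descending x, breaking ties by descending y."""
    return sorted(points, key=lambda p: (p[0], p[1]), reverse=True)

# Made-up example points, not taken from Anscombe's data.
points = [(4, 1.5), (9, 2.0), (7, 3.5), (12, 0.5), (5, 2.5)]
shuffled = points[:]
random.Random(0).shuffle(shuffled)

# Two orderings of the same points collapse to one canonical form.
print(canonical(points) == canonical(shuffled))  # True

# Number of orderings of 10 distinct points that the constraint rules out
# (all but one of them).
print(math.factorial(10))  # 3628800
```

In a constraint model the same effect comes from requiring each x to be at least as large as the next one, which leaves the solver a single representative per data set instead of millions of reorderings.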