Having standard benchmarks for a field is great because everyone gets a common metric to compare their algorithms on. But by focusing effort on a few common tasks, benchmarks can also heavily bias the direction of progress in a field. Given this, is there some way to assess whether the benchmarks a field chooses are actually the right ones?

Well, generally we want our algorithms to perform well on some distribution of scenarios in the real world, so we ought to select benchmarks that reflect that distribution. For example, suppose you want to create algorithms for a household robot. The goal is to optimize the robot’s performance over the distribution of household environments in the real world.* So your benchmarks should ideally be environments representative of the real-world structure of household environments, rather than, say, environments of randomly generated blocks. Of course, the downside is that constructing more representative benchmarks takes extra work, because it requires collecting data about what the distribution of household environments actually looks like.
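To make this concrete, here’s a minimal sketch of the framing I have in mind (the names `sample_env`, `evaluate`, and the household example are hypothetical placeholders, not any particular benchmark’s API): what we actually care about is expected performance over the real-world distribution of environments, and a fixed benchmark is just one frozen, finite sample of environments whose average score approximates that expectation — well or badly, depending on how representative the sample is.

```python
def expected_performance(algorithm, sample_env, evaluate, n_samples=100):
    """Estimate expected real-world performance by averaging the algorithm's
    score over environments drawn from the distribution we care about.

    `sample_env` draws one environment (e.g. a household layout) from that
    distribution; `evaluate` scores the algorithm in a single environment.
    Both are hypothetical stand-ins for illustration.
    """
    scores = [evaluate(algorithm, sample_env()) for _ in range(n_samples)]
    return sum(scores) / len(scores)


def benchmark_score(algorithm, benchmark_envs, evaluate):
    """A benchmark is a fixed, finite list of environments. If that list isn't
    representative of the distribution above, optimizing this number can
    diverge from optimizing the quantity we actually care about."""
    scores = [evaluate(algorithm, env) for env in benchmark_envs]
    return sum(scores) / len(scores)
```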

But given the potential for such benchmarks to shape the direction of a field, in many cases that extra work seems worthwhile to me. Unfortunately, benchmarks often seem to be chosen without much scrutiny. The people who initially propose a task have disproportionate power in instituting benchmarks, because it’s much easier for everyone else to simply reuse a benchmark established in a prior paper than to justify proposing a novel one. This is a disappointing anchoring effect, because the original benchmarks may not have been that well justified in the first place.

Perhaps there ought to be a stronger burden of proof expected for the benchmarks that become important to a field.

* This example came up in Anca’s AHRI class when I brought up this point.