"But in the last few years, we’ve gotten:

- Powerful Agents (Agent57, GATO, Dreamer V3)

...

- AIs that are superhuman at just about any task we can (or simply bother to) define a benchmark for"

I disagree with these two points. Progress in getting AIs to play games has been very slow, and the headlines are wildly overhyped. Look past any impressive AI headline, especially one about AIs playing games, and you will find it was either misleading or outright false.

For example, the DreamerV3 paper claims to be the first agent to learn to get diamonds in Minecraft without training on human examples. That's not true: they increased the block-breaking speed to 100x and gave their agent direct information about the game state.

If we are using games as a benchmark for AI progress, then the AI should compete on a level playing field with humans. That means it only has access to the information a human has access to (pixels and audio), and, obviously, that you don't drastically alter the rules of the game to make the objective easier while failing to mention that fact in your abstract.

Under these conditions, AI still hasn't managed to progress past Atari games, where the screen can be compressed to 84x84 grayscale images without losing any relevant information, and where there are at most 18 possible actions: one joystick with 9 possible positions and one button that is either pressed or not, so 9 x 2 = 18 combinations.
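To make that concrete, here is a minimal sketch of the standard DQN-style Atari setup using Gymnasium's wrappers (Breakout is just an arbitrary example game, and this assumes `gymnasium` and `ale-py` are installed):

```python
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing

# Base env; frameskip=1 here because AtariPreprocessing applies its own frame skip.
# full_action_space=True exposes all 18 legal Atari actions rather than a per-game subset.
env = gym.make("ALE/Breakout-v5", frameskip=1, full_action_space=True)

# Standard preprocessing: downsample to 84x84 and convert to grayscale.
env = AtariPreprocessing(env, screen_size=84, grayscale_obs=True)

print(env.observation_space)  # Box(0, 255, (84, 84), uint8)
print(env.action_space)       # Discrete(18): 9 joystick positions x 2 button states
```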

In fact, not only has AI failed to reach superhuman performance on any videogame bigger than Atari without the unfair advantage of being given information the human doesn't have access to, it hasn't reached average-human performance either. The closest was VPT for Minecraft, which obtained diamond pickaxes 2.5% of the time under these conditions, compared to 12% for human testers (and of course a very good human would succeed close to 100% of the time).

GATO also performed very poorly: it was trained to imitate the play of an agent, Muesli, whose median score was over 1000% of the human tester's, yet it ended up with a score of only around 200% of the human's. In other words, it was literally shown exactly how to get a very high score on Atari and was unable to replicate that. It and VPT were both reassuring negative results; that is, they showed that the extremely impressive performance of large transformer models in language and image generation does not easily cross over to the much scarier realm of agentic behaviour.
