Covariant Announces a Universal AI Platform for Robots

When IEEE Spectrum first wrote about Covariant in 2020, it was a new-ish robotics startup looking to apply robotics to warehouse picking at scale through the magic of a single end-to-end neural network. At the time, Covariant was focused on this picking use case, because it represents an application that could provide immediate value—warehouse companies pay Covariant for its robots to pick items in their warehouses. But for Covariant, the exciting part was that picking items in warehouses has, over the last four years, yielded a massive amount of real-world manipulation data—and you can probably guess where this is going.

Today, Covariant is announcing RFM-1, which the company describes as a robotics foundation model that gives robots the “human-like ability to reason.” That’s from the press release, and while I wouldn’t necessarily read too much into “human-like” or “reason,” what Covariant has going on here is pretty cool.

“Foundation model” means that RFM-1 can be trained on more data to do more things. At the moment, it’s all about warehouse manipulation because that’s what it’s been trained on, but its capabilities can be expanded by feeding it more data. “Our existing system is already good enough to do very fast, very variable pick and place,” says Covariant co-founder Pieter Abbeel. “But we’re now taking it quite a bit further. Any task, any embodiment: that’s the long-term vision. Robotics foundation models powering billions of robots across the world.” From the sound of things, Covariant’s business of deploying a large fleet of warehouse automation robots was the fastest way for them to collect the tens of millions of trajectories (how a robot moves during a task) that they needed to train the 8-billion-parameter RFM-1 model.



“The only way you can do what we’re doing is by having robots deployed in the world collecting a ton of data,” says Abbeel. “Which is what allows us to train a robotics foundation model that’s uniquely capable.”

There have been other attempts at this sort of thing: The RT-X project is one recent example. But while RT-X depends on research labs sharing what data they have to create a dataset that’s large enough to be useful, Covariant is doing it alone, thanks to its fleet of warehouse robots. “RT-X is about a million trajectories of data,” Abbeel says, “but we’re able to surpass it because we’re getting a million trajectories every few weeks.”
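As a rough sanity check on these figures, the sketch below estimates how many trajectories a fleet would accumulate at the quoted rate. The three-week interval standing in for “every few weeks” and the constant collection rate over four years are assumptions made here for illustration, not numbers from Covariant. The real fleet presumably grew over time, so this reads best as an upper-end estimate.

```python
# Back-of-the-envelope estimate. Inputs marked "assumed" are illustrative
# guesses, not figures reported by Covariant.
TRAJECTORIES_PER_BATCH = 1_000_000  # "a million trajectories" (from the quote)
WEEKS_PER_BATCH = 3                 # assumed reading of "every few weeks"
YEARS_DEPLOYED = 4                  # "over the last four years"
RTX_TRAJECTORIES = 1_000_000        # "about a million trajectories" (from the quote)


def estimate_trajectories(years: float, weeks_per_batch: float, per_batch: int) -> float:
    """Total trajectories collected at a constant rate (a simplifying assumption)."""
    weeks = years * 52
    return weeks / weeks_per_batch * per_batch


total = estimate_trajectories(YEARS_DEPLOYED, WEEKS_PER_BATCH, TRAJECTORIES_PER_BATCH)
print(f"Estimated trajectories: {total / 1e6:.0f} million")
print(f"Multiple of RT-X dataset size: {total / RTX_TRAJECTORIES:.0f}x")
```

Under these assumptions the total lands in the tens of millions, the same order of magnitude as the training set size Covariant cites for RFM-1.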

“By building a valuable picking robot that’s deployed across 15 countries with dozens of customers, we essentially have a data collection machine.” —Pieter Abbeel, Covariant

You can think of the current execution of RFM-1 as a prediction engine for suction-based object manipulation in warehouse environments. The model incorporates still images, video, joint angles, force readings, and suction cup strength: everything involved in the kind of robotic manipulation that Covariant does. All of these things are interconnected within RFM-1, which means that you can put any of those things into one end of RFM-1, and out of the other end of the model will come a prediction. That prediction can be in the form of an image, a video, or a series of commands for a robot.
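To make the “anything in, prediction out” idea concrete, here is a minimal sketch of what such an interface could look like. The article doesn’t describe RFM-1’s internals, so everything here is an assumption for illustration: the idea of flattening every modality into one shared token sequence, the names `Token`, `build_prompt`, and `predict`, and the placeholder `dummy_model` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical modality names, loosely based on the list in the text.
MODALITIES = ("image", "video", "joint_angles", "force", "suction", "robot_commands")


@dataclass(frozen=True)
class Token:
    modality: str  # which kind of data this token came from
    value: int     # discrete code; a real system would use a learned tokenizer


def build_prompt(inputs: Dict[str, Sequence[int]], target: str) -> List[Token]:
    """Flatten any mix of input modalities into one sequence, ending with a
    marker token that says which modality the model should predict."""
    if target not in MODALITIES:
        raise ValueError(f"unknown target modality: {target}")
    prompt: List[Token] = []
    for modality, values in inputs.items():
        if modality not in MODALITIES:
            raise ValueError(f"unknown input modality: {modality}")
        prompt.extend(Token(modality, v) for v in values)
    prompt.append(Token("request", MODALITIES.index(target)))
    return prompt


def predict(prompt: List[Token], model: Callable[[List[Token]], List[int]]) -> List[Token]:
    """Run a stand-in model and tag its output with the requested modality."""
    target = MODALITIES[prompt[-1].value]
    return [Token(target, v) for v in model(prompt)]


def dummy_model(prompt: List[Token]) -> List[int]:
    """Placeholder "model" that only reports how many tokens it received."""
    return [len(prompt)]


prompt = build_prompt({"image": [12, 7, 99], "joint_angles": [3, 4]}, target="robot_commands")
output = predict(prompt, dummy_model)
print(output)
```

The point of the sketch is only the shape of the interface: the same entry point accepts any combination of inputs, and the caller chooses which kind of output to ask for.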

What’s important to understand about all of this is that RFM-1 isn’t restricted to picking only things it’s seen before, or only working on robots it has direct experience with. This is what’s nice about foundation models—they can generalize within the domain of their training data, and it’s how Covariant has been able to scale their business as successfully as they have, by not having to retrain for every new picking robot or every new item. What’s counter-intuitive about these large models is that they’re actually better at dealing with new situations than models that are trained specifically for those situations.

For example, let’s say you want to train a model to drive a car on a highway. The question, Abbeel says, is whether it would be worth your time to train on other kinds of driving anyway. The answer is yes, because highway driving is sometimes not highway driving. There will be accidents or rush hour traffic that will require you to drive differently. If you’ve also trained on driving on city streets, you’re effectively training on highway edge cases, which will come in handy at some point and improve performance overall. With RFM-1, it’s the same idea: Training on lots of different kinds of manipulation—different robots, different objects, and so on—means that any single kind of manipulation will be that much more capable.

In the context of generalization, Covariant talks about RFM-1’s ability to “understand” its environment. This can be a tricky word with AI, but what matters is grounding the meaning of “understand” in what RFM-1 is actually capable of. For example, you don’t need to understand physics to be able to catch a baseball; you just need to have a lot of experience catching baseballs, and that’s where RFM-1 is at. You could also reason out how to catch a baseball with no experience but an understanding of physics, and RFM-1 is not doing this, which is why I hesitate to use the word “understand” in this context.

But this brings us to another interesting capability of RFM-1: it operates as a very effective, if constrained, simulation tool. As a prediction engine that outputs video, you can ask it to generate what the next couple seconds of an action sequence will look like, and it’ll give you a result that’s both realistic and accurate, being grounded in all of its data. The key here is that RFM-1 can effectively simulate objects that are challenging to simulate traditionally, like floppy things.

Covariant’s Abbeel explains that the “world model” that RFM-1 bases its predictions on is effectively a learned physics engine. “Building physics engines turns out to be a very daunting task to really cover every possible thing that can happen in the world,” Abbeel says. “Once you get complicated scenarios, it becomes very inaccurate, very quickly, because people have to make all kinds of approximations to make the physics engine run on a computer. We’re just doing the large-scale data version of this with a world model, and it’s showing really good results.”

Abbeel gives an example of asking a robot to simulate (or predict) what would happen if a cylinder is placed vertically on a conveyor belt. The prediction accurately shows the cylinder falling over and rolling when the belt starts to move—not because the cylinder is being simulated, but because RFM-1 has seen a lot of things being placed on a lot of conveyor belts.
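The difference between a hand-built physics engine and a learned one can be illustrated with a toy example. The sketch below is in no way RFM-1’s method: it swaps in a one-variable linear model fit by ordinary least squares, and the falling-object setup, constants, and function names are all assumptions chosen for illustration. What it shows is the general idea of predicting what happens next purely from observed examples, without ever being given the governing equations.

```python
import random

# "True" physics, used only to generate observations: an object falling with
# linear drag. The learner below never sees these constants.
G, DRAG, DT = 9.81, 0.5, 0.05


def true_step(v: float) -> float:
    """One time step of the hand-written physics: v' = v + (g - k*v) * dt."""
    return v + (G - DRAG * v) * DT


def collect_data(n: int, seed: int = 0):
    """Observe n (velocity, next velocity) pairs with a little sensor noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        v = rng.uniform(0.0, 20.0)
        pairs.append((v, true_step(v) + rng.gauss(0.0, 0.01)))
    return pairs


def fit_linear(pairs):
    """Closed-form ordinary least squares for the model v' = a*v + b."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    a = cov / var
    return a, mean_y - a * mean_x


def rollout(step, v0: float, steps: int):
    """Predict a velocity trajectory by applying a step function repeatedly."""
    vs = [v0]
    for _ in range(steps):
        vs.append(step(vs[-1]))
    return vs


a, b = fit_linear(collect_data(2000))
learned = rollout(lambda v: a * v + b, v0=0.0, steps=40)
actual = rollout(true_step, v0=0.0, steps=40)
max_err = max(abs(p - q) for p, q in zip(learned, actual))
print(f"learned a={a:.4f}, b={b:.4f}")
print(f"max rollout error: {max_err:.4f}")
```

The fitted model tracks the hand-written physics closely inside the range of velocities it observed, which is also the limitation: like any learned model, it has no basis for predictions far outside its training data.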


This only works if there’s the right kind of data for RFM-1 to train on, so unlike most simulation environments, it can’t currently generalize to completely new objects or situations. But Abbeel believes that with enough data, useful world simulation will be possible. “Five years from now, it’s not unlikely that what we are building here will be the only type of simulator anyone will ever use. It’s a more capable simulator than one built from the ground up with collision checking and finite elements and all that stuff. All those things are so hard to build into your physics engine in any kind of way, not to mention the renderer to make things look like they look in the real world—in some sense, we’re taking a shortcut.”


RFM-1 also incorporates language data to be able to communicate more effectively with humans.

For Covariant to expand the capabilities of RFM-1 towards that long-term vision of foundation models powering “billions of robots across the world,” the next step is to feed it more data from a wider variety of robots doing a wider variety of tasks. “We’ve built essentially a data ingestion engine,” Abbeel says. “If you’re willing to give us data of a different type, we’ll ingest that too.”

“We have a lot of confidence that this kind of model could power all kinds of robots—maybe with more data for the types of robots and types of situations it could be used in.” —Pieter Abbeel, Covariant

One way or another, that path is going to involve a heck of a lot of data, and it’s going to be data that Covariant is not currently collecting with its own fleet of warehouse manipulation robots. So if you’re, say, a humanoid robotics company, what’s your incentive to share all the data you’ve been collecting with Covariant? “The pitch is that we’ll help them get to the real world,” Covariant co-founder Peter Chen says. “I don’t think there are really that many companies that have AI to make their robots truly autonomous in a production environment. If they want AI that’s robust and powerful and can actually help them enter the real world, we are really their best bet.”

Covariant’s core argument here is that while it’s certainly possible for every robotics company to train up their own models individually, the performance (for anybody trying to do manipulation, at least) would not be nearly as good as using a model that incorporates all of the manipulation data that Covariant already has within RFM-1. “It has always been our long-term plan to be a robotics foundation model company,” says Chen. “There was just not sufficient data and compute and algorithms to get to this point, but building a universal AI platform for robots, that’s what Covariant has been about from the very beginning.”