Following the success of DALL-E and ChatGPT, AI research firm OpenAI has announced Point-E, an AI system that can produce 3D models from text prompts by way of an intermediate generated image.
The tool allows users to create 3D objects simply by entering a short text description. Technically, this is accomplished by chaining a text-to-image AI model with an image-to-3D model: when a user types a prompt, the text-to-image model generates a synthetic rendered view of the object, and the image-to-3D model then produces a 3D object conditioned on that sampled image.
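OpenAI has open-sourced the code for Point-E on GitHub (github.com/openai/point-e). As a rough illustration of the pipeline, the sketch below is condensed from the repository's example notebooks; the module paths and configuration names are as of the initial release and may have changed since. For brevity it uses the repo's text-conditioned base model ('base40M-textvec'), which skips the intermediate image and maps text straight to points, whereas the full pipeline described above inserts a GLIDE-generated image between the two stages.

```python
# Condensed from the example notebooks in OpenAI's open-source point-e
# repository (github.com/openai/point-e); names reflect the initial release.
import torch
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Base model: maps the prompt to a coarse 1,024-point cloud.
# ('base40M-textvec' conditions on text directly; the full pipeline
# instead conditions an image-to-3D model on a GLIDE render.)
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler: densifies the coarse cloud to 4,096 points.
up_model = model_from_config(MODEL_CONFIGS['upsample'], device)
up_model.eval()
up_model.load_state_dict(load_checkpoint('upsample', device))
up_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, up_model],
    diffusions=[base_diffusion, up_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''),  # only the base model sees the prompt
)

samples = None
for x in sampler.sample_batch_progressive(
        batch_size=1, model_kwargs=dict(texts=['a traffic cone'])):
    samples = x  # keep the output of the final denoising step

point_cloud = sampler.output_to_point_clouds(samples)[0]
```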
One of the most significant advantages of this approach is that it is very fast and requires comparatively little hardware: OpenAI says Point-E can produce a point cloud in one to two minutes on a single GPU. The trade-off is fidelity; the results are rough and nowhere near the quality of a commercial 3D rendering in a film or video game.
Under the hood, Point-E's text-to-image stage is based on a different machine learning model from DALL-E, known as GLIDE, and the system is not nearly as capable right now. Given a text prompt such as "a traffic cone," Point-E generates a low-resolution point cloud – a collection of points in space – that looks like a traffic cone.
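For readers unfamiliar with the format, a point cloud is just an array of XYZ coordinates. The toy snippet below builds a crude cone shape with NumPy, with no machine learning involved, purely to show the kind of data a model like Point-E emits; the 1,024-point count matches the coarse output of Point-E's base model.

```python
# Illustration only: a point cloud is just an array of XYZ coordinates.
# This builds a crude cone shape with NumPy (no ML involved) to show the
# (N, 3) format that models like Point-E output.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1024                                 # coarse resolution of Point-E's base model
height = rng.uniform(0.0, 1.0, n)        # position along the cone's axis
angle = rng.uniform(0.0, 2 * np.pi, n)   # angle around the axis
radius = 0.5 * (1.0 - height)            # radius tapers toward the tip
points = np.stack(
    [radius * np.cos(angle), radius * np.sin(angle), height], axis=1
)
print(points.shape)  # (1024, 3): one row of XYZ coordinates per point
```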
According to OpenAI, the text-to-image model was trained on a large database of text-image pairs and can handle a wide range of complex prompts. The image-to-3D model, on the other hand, was trained on a smaller dataset pairing images with 3D models: millions of 3D objects and their associated metadata.
The sources for this piece include an article in The Register.