A Transformer paper author starts another company, and Tesla's second-generation humanoid robot Optimus is unveiled
1. Li Feifei and Google's groundbreaking work! Using a Transformer to generate photorealistic videos
Li Feifei's Stanford team, working with Google, has developed a model called W.A.L.T. It integrates the Transformer architecture into a video diffusion model to create photorealistic videos, achieving a high level of coherence and detail in video generation.
At the core of W.A.L.T are two design choices: a causal encoder that jointly compresses images and videos into a shared latent space, and a Transformer built on windowed attention that improves memory and training efficiency. This structure enables the model to generate realistic, temporally consistent video from natural-language prompts.
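To make the windowed-attention idea concrete, here is a minimal, illustrative PyTorch sketch of self-attention restricted to non-overlapping windows over a flattened sequence of video latents. The shapes, window size, and the omission of learned projections are simplifying assumptions; this is not the W.A.L.T implementation, only the general pattern that keeps attention cost tied to the window size rather than the full video length.

```python
# Illustrative sketch of windowed self-attention over video tokens.
# Not the W.A.L.T code; shapes and window size are assumptions.
import torch
import torch.nn.functional as F

def windowed_self_attention(tokens: torch.Tensor, window: int) -> torch.Tensor:
    """tokens: (batch, seq_len, dim); attention is computed within non-overlapping windows."""
    b, n, d = tokens.shape
    assert n % window == 0, "sequence length must be divisible by the window size"
    # Split the sequence into windows so attention cost scales with the window,
    # not with the full spatiotemporal sequence.
    x = tokens.reshape(b * n // window, window, d)
    q, k, v = x, x, x  # a real block would apply learned Q/K/V projections here
    attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    out = attn @ v
    return out.reshape(b, n, d)

# Toy usage: 2 clips, 8 latent "frames" of 16 tokens each, 64-dim latents.
latents = torch.randn(2, 8 * 16, 64)
print(windowed_self_attention(latents, window=16).shape)  # torch.Size([2, 128, 64])
```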
In experiments, the researchers evaluated W.A.L.T on a variety of tasks, including class-conditional image and video generation, frame prediction, and text-to-video generation. The results show that W.A.L.T performs well across multiple benchmarks; in particular, its zero-shot FVD score on the UCF-101 benchmark reaches the current state of the art.
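FVD (Fréchet Video Distance) compares the statistics of features extracted from real and generated videos; lower is better. The sketch below computes the underlying Fréchet distance between two sets of feature vectors. In the actual metric those features come from a pretrained I3D video network, which is not reproduced here, and the random inputs are purely illustrative.

```python
# Hedged sketch of the Fréchet distance underlying FVD/FID-style scores.
# The real FVD uses I3D video features; random arrays stand in for them here.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """feats_*: (num_samples, feature_dim) arrays of video features."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Toy usage with random "features": closer distributions give a lower score.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 32)), rng.normal(size=(256, 32))))
```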
2. Google DeepMind releases Imagen 2, its most advanced image-generation model, with support for inpainting and outpainting
Google DeepMind has released its latest large-scale visual model, Imagen 2. Its core strength is generating high-quality, realistic images that closely follow the user's prompt.
To achieve this, Google DeepMind optimized Imagen 2's training dataset, adding more detailed image descriptions so the model can respond to prompts more accurately. These richer image-description pairs help Imagen 2 better learn the relationship between images and text, improving its grasp of context and nuance.
Imagen 2 also makes significant progress on common failure modes of text-to-image tools, such as rendering realistic hands and faces and keeping images free of distracting visual artifacts.
In addition to generating high-quality images, Imagen 2 supports image-editing functions such as inpainting and outpainting, giving users more creative room (a conceptual sketch of the mask-based workflow follows below). At the same time, to reduce the potential risks of text-to-image generation, the Google team has put strict safeguards in place across design, development, and deployment to avoid generating problematic content.
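For readers unfamiliar with inpainting and outpainting, the sketch below shows the general mask-based workflow using the open-source diffusers library. It is only a conceptual stand-in: Imagen 2 is served through Google's own products rather than this API, and the checkpoint, prompt, and file names here are assumptions for illustration.

```python
# Conceptual illustration of mask-based inpainting with the diffusers library.
# Not Imagen 2's API; the checkpoint and file names are hypothetical stand-ins.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed public inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("room.png").convert("RGB")      # original image (hypothetical file)
mask = Image.open("room_mask.png").convert("RGB")  # white pixels mark the region to repaint

# Inpainting regenerates only the masked region so it blends with the untouched
# pixels; outpainting uses the same mechanism with a mask covering a border
# extended beyond the original canvas.
result = pipe(prompt="a wooden armchair by the window",
              image=image, mask_image=mask).images[0]
result.save("room_inpainted.png")
```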
3. With mobility approaching that of humans, Tesla's second-generation humanoid robot Optimus is unveiled
A little over a year after the first reveal, Tesla's second-generation humanoid robot Optimus has been unveiled. Compared with its predecessor, its highlights are flexibility and practicality: it can perform complex movements such as squatting and dancing, showing mobility very close to that of a human.
Optimus made its debut in October 2022. At the time, its hands had 27 degrees of freedom, but it could not yet perform complex movements such as dancing. By May 2023, Optimus could walk smoothly and grasp objects, and by September it had further evolved to sort objects autonomously.
The latest Optimus II improves significantly on the original. It stands about 1.72 meters tall and can move at roughly 8 kilometers per hour; its walking speed is 30% faster and it is 10 kilograms lighter than before. Its feet are designed to mimic the human foot, with articulated toes and foot force/torque sensing for a more human-like gait. The hands of Optimus II are also highly advanced, with 11 degrees of freedom, allowing it to manipulate delicate objects such as eggs.
As the technology continues to advance, the second-generation Optimus and its successors may play an important role in many fields, including home services, industrial manufacturing, and even entertainment.