Text-to-Video Model Launches, Intensifying Commercial Competition

At the recent Yunqi Conference, Alibaba's Tongyi Wanxiang unveiled its self-developed AI video generation model, launching text-to-video and image-to-video features first. Both are completely free on the Tongyi app, with no daily usage limit.

With the text-to-video feature, entering a prompt in Chinese or English generates a high-definition, realistic video. The model supports clips up to 5 seconds long at 30 frames per second and 720p resolution, and, even more impressively, can generate sound effects that match the visuals.

Jiang Han, a senior researcher at the Pangu Think Tank, stated in an interview with journalists: "Firstly, Alibaba's video generation model is a fully self-developed visual generation model, utilizing the industry-leading Diffusion+Transformer architecture. Secondly, the model has been launched on the mobile app and PC official website, supporting the generation of 5-second videos at 30 frames per second with a resolution of 720P, and it can generate sound effects that match the visuals. In terms of progress, Alibaba has successfully implemented the text-to-video and image-to-video functions, demonstrating good picture quality, semantic understanding, and style generalization capabilities during trials."
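The "Diffusion + Transformer" architecture mentioned above broadly refers to a transformer that predicts the noise in latent video frames, with the final clip recovered by progressive denoising. The sketch below is a deliberately tiny, generic illustration of that loop, not Alibaba's actual model: the layer sizes, noise schedule, frame count (150 latent frames standing in for 5 seconds at 30 fps), and update rule are all placeholder assumptions.

```python
# Minimal, generic sketch of a Diffusion + Transformer denoising loop for video.
# NOT Alibaba's implementation; all shapes, schedules, and sizes are placeholder assumptions.
import torch
import torch.nn as nn

class TinyVideoDenoiser(nn.Module):
    """Toy transformer that predicts the noise in a sequence of latent video frames."""
    def __init__(self, latent_dim=64, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents, t):
        # A real model would inject the timestep and text-prompt embedding
        # (e.g. via adaptive layer norm or cross-attention); here t is added as a scalar.
        h = latents + t.view(-1, 1, 1)
        return self.out(self.backbone(h))

@torch.no_grad()
def generate_video_latents(model, num_frames=150, latent_dim=64, steps=20):
    """Progressive denoising: start from pure noise and iteratively remove predicted noise."""
    x = torch.randn(1, num_frames, latent_dim)   # 150 latent frames ~ 5 s at 30 fps (assumed)
    for step in reversed(range(steps)):
        t = torch.full((1,), float(step) / steps)
        predicted_noise = model(x, t)
        x = x - predicted_noise / steps           # crude update; real samplers use DDPM/DDIM math
    return x                                      # a real pipeline would decode these to 720p frames

model = TinyVideoDenoiser()
latents = generate_video_latents(model)
print(latents.shape)  # torch.Size([1, 150, 64])
```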


A Concentrated Burst of Launches

This September, domestic video generation models entered a new round of explosive growth. On August 31, MiniMax officially released its video model video-01, kicking off the wave. On September 19, Alibaba Cloud CTO Zhou Jingren announced a new video generation model at the Yunqi Conference; on the same day, Kuaishou released the Keling 1.5 model, with internal evaluations showing a 95% improvement in overall performance over Keling 1.0. On September 23, Meitu announced that its Qixiang model had completed an upgrade of its video generation capabilities. On September 24, ByteDance's Volcano Engine released two video generation models, PixelDance and Seaweed. On September 26, Meitu opened its AI short-film creation tool MOKI to all users. On September 30, Keling launched a "lip-syncing" feature that lets users upload audio to drive generated characters, announced the full public opening of its API (Application Programming Interface), and launched the AI creation community "Creative Circle."

Alibaba describes Tongyi Wanxiang as the "most obedient" AI video model: it can understand complex semantics, combine concepts in generation, and accurately render the creative intent of the text.

For users short on inspiration, clicking "Inspiration Expansion" in the text-to-video interface automatically expands a simple prompt into a longer one that stays true to the original meaning, significantly improving generation results.

The image-to-video feature can turn any image into a dynamic video, generated at the uploaded image's aspect ratio or a preset ratio, with video motion controllable through prompts.

Tongyi Wanxiang's audio-visual synchronization feature not only greatly improves the quality of the output but also spares creators the trouble of hunting for background music and tweaking sound effects. Now, simply by entering text or uploading an image on a computer, Tongyi Wanxiang provides a one-stop "visuals + sound" service.

Jiang Han believes that, compared with other video generation models, the advantages of Alibaba's model are: "Firstly, it has a better grasp of Chinese aesthetics and the Chinese language, and can better understand and generate video content related to Chinese culture and language; secondly, it has advantages in computational efficiency, producing the final footage through progressive denoising, which reduces computational load and speeds up generation; finally, it supports a wide variety of scenarios, providing more sources of inspiration for e-commerce, advertising creative, self-media, film and animation production, and other fields. As for disadvantages, compared with other models there may still be specific technical limitations and room for optimization that call for continued research and improvement."

Seeking Scalable Implementation Scenarios

As the players assemble, competition among large video generation models is entering the stage of finding implementation scenarios that can scale. Application scenarios abound, from the consumer (C) end to the business (B) end: social media content, AI short dramas, video advertisements, voice-over content, promotional videos, programme production, film post-production, and more.

For video platforms, individual content creators who form the foundation of the content creation ecosystem are the most important service targets. Lower barriers to video creation and more diverse expressions of inspiration mean a more prosperous video content ecosystem. Platforms like Jiyan, Ji Meng AI, Kuaiying, YouTube, Instagram, and even Meitu's MOKI are integrating video generation large model capabilities, focusing on serving these creators.

Regarding the future development prospects of Alibaba's video generation large model, Jiang Han said: "Firstly, I am optimistic about the future of Alibaba's video generation large model. Secondly, Alibaba has a deep technical accumulation in the field of artificial intelligence, and its self-developed visual generation large model is technically leading. At the same time, Alibaba also has strong capabilities in market promotion and expansion of application scenarios, which can provide users with better usage experience and services. In addition, as artificial intelligence technology continues to develop and application scenarios continue to expand, the market demand for video generation large models will also increase, providing broad space and opportunities for the development of Alibaba's video generation large models."

For large model startups, as MiniMax founder Yan Junjie said, most of the content consumed by humans every day is text and video, and the proportion of text is not high. Large models capable of outputting multimodal content can achieve higher user coverage and usage.

For mature companies with existing video businesses and accumulated users, large models may mean an opportunity to redivide the pie, as well as a chance to tap the potential of their existing user base. At the very least, investing in large models can help reduce the risk of being squeezed out of the market.

In a research report, Dongwu Securities argues that the core driver of rising AI penetration is enterprises' demand for cost savings and efficiency gains. By Dongwu Securities' calculations, under a fully AI-driven model, the production costs of a movie, a long-form drama, an animated film, and a short drama would be roughly 25,000, 93,000, 37,000, and 4,000 yuan respectively, a reduction of more than 95% from traditional production; under a human-machine collaborative model, film production costs are expected to fall by 43%.

The cooperation between Runway and Lionsgate shows that enterprises' willingness to pair up with large video generation models is growing. Under the agreement, Runway will use Lionsgate's film catalogue to train custom video models capable of generating cinematic footage and augmenting creators' work. Of course, this exploration will take more time and carries considerable uncertainty.

Another, more templated path is to work with top creators in the industry to establish best practices. Kuaishou recently announced the "Keling AI" director co-creation plan, joining forces with nine directors, including Li Shaohong, Jia Zhangke, Ye Jintian, Xue Xiaolu, Yu Baimei, Dong Runian, Zhang Chiyu, Wang Zichuan, and Wang Maomao, to produce and release nine AIGC (Artificial Intelligence Generated Content) short films built on Keling's technical capabilities. The best practices that emerge can in turn serve as references for more content creators using Keling.

We can also see more and more large video generation models opening APIs to enterprises, drawing on outside partners to jointly develop scenario-specific templates for video generation. For example, Runway has opened an invite-only API for its video generation model Gen-3 Alpha Turbo, allowing invited partners to build video generation features into their own applications; Luma and Vidu have likewise announced their own API access plans.
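As an illustration of what building on such an API might look like, the sketch below submits a text prompt and polls for the finished clip. The endpoint URL, request fields, and response format are hypothetical placeholders rather than the actual Runway, Luma, or Vidu interfaces; each vendor's own documentation defines the real contract.

```python
# Hypothetical sketch of calling a video-generation API from an application.
# Endpoint, payload fields, and response shape are assumptions, not any vendor's real API.
import time
import requests

API_BASE = "https://api.example-video-vendor.com/v1"   # hypothetical base URL
API_KEY = "YOUR_API_KEY"                                # issued to invited/approved enterprises

def submit_generation(prompt: str, duration_s: int = 5, resolution: str = "720p") -> str:
    """Submit a text-to-video job and return a job ID (payload shape is assumed)."""
    resp = requests.post(
        f"{API_BASE}/video/generations",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "duration": duration_s, "resolution": resolution},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

def wait_for_video(job_id: str, poll_s: float = 5.0) -> str:
    """Poll until the job finishes and return the URL of the generated clip (response shape assumed)."""
    while True:
        resp = requests.get(
            f"{API_BASE}/video/generations/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        if body["status"] == "succeeded":
            return body["video_url"]
        if body["status"] == "failed":
            raise RuntimeError(body.get("error", "generation failed"))
        time.sleep(poll_s)

if __name__ == "__main__":
    job = submit_generation("A lantern festival on a rainy night, cinematic, slow pan")
    print(wait_for_video(job))
```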