A big week for AI! - MetaVisions #18

Hi all, hope you’re doing well! This week my feed has been filled with news around AI, which is a welcome change after two weeks of Vision Pro domination! This is a topic that fascinates me, and I still think many people don’t comprehend how quickly this space is evolving.

OpenAI Sora - mindblowing!

One of the biggest memes in the AI community is the Will Smith video created in Q1 ’23 by an open-source ‘text2video’ model. Honestly, it is terrifying, but what it showed at the time was that although GenAI with ChatGPT is incredible, there are things that AI still sucks at…

As that meme showed, we thought AI was still quite a long way from being able to produce high-quality text-to-video content. However, the OpenAI team completely flipped this by announcing Sora - an AI model built for text to video.
At the moment it can produce incredibly detailed videos of up to one minute in length. Many of the results posted by OpenAI are mind-blowing.

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

Prompt: Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.

Prompt: Historical footage of California during the gold rush.

Of course, we would be foolish to say those videos are perfect; I can tell that they are AI generated. But the reality is that if you had no clue AI could produce this sort of content, the flaws would be hard to spot!
Remember, that Will Smith pasta video is only a year old - look how fast this space has improved…

There will be a great need for care around the content users can produce with the model, from explicit 18+ content all the way to videos that could fuel fake news and sway an election.

The OpenAI team is already looking into this: ‘We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.

We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.’

Content creation is about to get crazy, we are only at the beginning of it all.
Question: do you think it will take more or less than five years for the first AI-only movie to come out and do over $50M at the box office?

Google Gemini 1.5 Pro - token overload!

OpenAI may have had the flashiest news of the week, but Google has released an update that is probably more impressive and relevant to those looking to create real value through GenAI. Its newly announced Gemini 1.5 Pro model will have a 1 million token context window!

What is a token? A token is a unit of information that the model processes. It could represent a word, a few letters, a punctuation mark, etc., depending on the model’s tokenizer. For example, the sentence “Nemo is a fish and likes to wander around” converts into roughly a dozen tokens.
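If you want to see this for yourself, here is a tiny Python snippet using OpenAI’s tiktoken tokenizer. Gemini uses its own tokenizer, so its counts will differ slightly - this is purely to show how a sentence breaks apart into tokens:

```python
# pip install tiktoken -- OpenAI's tokenizer, used purely for illustration;
# Gemini's own tokenizer will produce slightly different counts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "Nemo is a fish and likes to wander around"
token_ids = enc.encode(sentence)

print(token_ids)                              # the integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids])   # the text piece behind each ID
print(f"{len(token_ids)} tokens")             # note: token count != word count
```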

What is a context window? The context window of an AI model is the amount of information it can consider at any one time. It can be thought of as the model’s memory for a particular conversation.
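To make that concrete, here is a minimal sketch of what “memory for a particular conversation” means in practice: once the running token count would exceed the window, the oldest messages simply fall out. The count_tokens helper below is a rough stand-in I made up, not any model’s real tokenizer:

```python
# Minimal sketch of keeping a chat history inside a fixed context window.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)      # rough rule of thumb: ~4 characters per token

def trim_to_window(messages: list[str], window: int) -> list[str]:
    """Drop the oldest messages until the whole conversation fits in `window` tokens."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > window:
        kept.pop(0)                    # whatever is dropped here, the model 'forgets'
    return kept

history = ["Hi!", "Tell me about fish.", "Nemo is a clownfish that lives in anemones.", "And sharks?"]
print(trim_to_window(history, window=15))   # the earliest turns get trimmed away
```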

To understand the sheer scale of this context window, have a look at the comparative graph below:

Want to ask Gemini to summarize a 1400 page book at once? No problem!
Asking Gemini to review over 30K lines of code? Not a challenge!
Querying Gemini on a one hour video? Done!
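For a bit of perspective, here is some back-of-envelope arithmetic on those examples. The ratios are rough rules of thumb I am assuming (not Gemini’s exact tokenizer behaviour), but they show how roomy a 1 million token window really is:

```python
# Rough arithmetic only - assumed ratios, not Gemini's actual tokenizer behaviour.
CONTEXT_WINDOW = 1_000_000

book_tokens = 1_400 * 400 * 1.3    # 1,400 pages * ~400 words/page * ~1.3 tokens/word
code_tokens = 30_000 * 12          # 30K lines of code * ~12 tokens per line, on average

for label, tokens in [("1,400-page book", book_tokens), ("30K lines of code", code_tokens)]:
    share = tokens / CONTEXT_WINDOW
    print(f"{label}: ~{tokens:,.0f} tokens ({share:.0%} of the 1M-token window)")
```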

This opens a new door for what is possible for GenAI in the workplace… As a reminder, ChatGPT only blew up a year ago!!

Elon’s xAI ‘Grok’ model will get some playtime!

In November last year, Elon Musk unveiled ‘Grok’, a GenAI model created by his startup xAI Corp. Its biggest differentiator is that it was trained to answer queries with a sense of humor and a rebellious nature.

Reputable leaker @swak_12 announced that Grok will be used to summarise trending topics on X, allowing users to understand what is going on without having to read through loads of posts!

Interesting to see GenAI being implemented on a platform used by millions of people every day!

Meta’s V-JEPA model!

Of course Meta did not want to stay out of this week’s AI party!
Meta announced V-JEPA, a non-generative AI model that predicts missing or masked parts of a video. The model takes a new approach to video comprehension, prioritizing holistic understanding over pixel-level detail. It’s a bit like how babies learn just by watching, without needing someone to tell them what’s happening, which makes learning faster and more efficient. Instead of trying to fill in every detail, it focuses on figuring out the missing parts of a video in a smarter, more abstract way.

Self-supervised Learning
V-JEPA employs self-supervised learning techniques, enhancing its adaptability and versatility across various tasks without necessitating labeled data during the training phase.

Feature Prediction Objective
Instead of reconstructing images or relying on pixel-level predictions, V-JEPA prioritizes video feature prediction. This approach leads to more efficient training and superior performance in downstream tasks.
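For the more technical readers, here is a toy PyTorch sketch of that idea: hide some video patches, encode the visible ones, and predict the features of the hidden ones rather than their pixels, with the loss computed in feature space. The sizes and modules are placeholders I made up - the real V-JEPA architecture is far more sophisticated (vision transformers, an EMA target encoder, positional masks, etc.):

```python
import torch
import torch.nn as nn

# Toy sketch of a JEPA-style objective: predict *features* of masked video patches,
# not their pixels. Every size and module here is illustrative, not Meta's actual model.
D = 256                                        # feature dimension
encoder = nn.Linear(3 * 16 * 16, D)            # stand-in for the video patch encoder
target_encoder = nn.Linear(3 * 16 * 16, D)     # in practice an EMA copy of the encoder
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

patches = torch.randn(8, 100, 3 * 16 * 16)     # 8 clips, each flattened into 100 patches
mask = torch.rand(8, 100) < 0.5                # which patches are hidden from the encoder

context = encoder(patches * (~mask).unsqueeze(-1))   # only visible patches contribute
with torch.no_grad():
    targets = target_encoder(patches)                # target features for every patch

pred = predictor(context)                            # guess the hidden features from the context
loss = nn.functional.l1_loss(pred[mask], targets[mask])  # loss in feature space, not pixel space
loss.backward()
print(float(loss))
```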

Efficiency
With V-JEPA, Meta has achieved significant efficiency gains, requiring shorter training schedules compared to traditional pixel prediction methods while maintaining high performance levels.

Versatile Representations
V-JEPA produces versatile visual representations that excel in both motion and appearance-based tasks, showcasing its effectiveness in capturing complex interactions within video data.

I will be honest, this was the hardest piece of news to digest and get my head around, but here are some real-world use cases that could come out of this model:
Video Understanding
V-JEPA excels in understanding the content of various video streams, making it invaluable for computer vision tasks such as video classification, action recognition, and spatio-temporal action detection. Its ability to capture detailed object interactions and distinguish fine-grained actions sets it apart in the field of video understanding.
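If you are wondering how a model like this actually gets used downstream, a common recipe is to freeze the pretrained video encoder and train only a small classification head on top of its features. The sketch below is generic and uses made-up names and sizes - it is not Meta’s exact evaluation setup:

```python
import torch
import torch.nn as nn

# Generic "frozen backbone + small head" recipe; all names and sizes are illustrative.
class FrozenBackbone(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Linear(3 * 16 * 16 * 8, feat_dim)   # stand-in for a real video encoder
    def forward(self, x):
        with torch.no_grad():                              # the backbone stays frozen
            return self.net(x)

backbone = FrozenBackbone()
head = nn.Linear(256, 10)                                  # 10 hypothetical action classes
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

clips = torch.randn(32, 3 * 16 * 16 * 8)                   # 32 tiny fake "clips"
labels = torch.randint(0, 10, (32,))

logits = head(backbone(clips))                             # features -> class scores
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                            # only the head gets gradients
optimizer.step()
print(float(loss))
```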

Contextual AI Assistance
The contextual understanding provided by V-JEPA lays the groundwork for developing AI assistants with a deeper understanding of their surroundings. Whether it's providing context-aware recommendations or assisting users in navigating complex environments, V-JEPA can enhance the capabilities of AI assistants in diverse scenarios.

Augmented Reality (AR) Experiences
V-JEPA's contextual understanding of video content can enrich AR experiences by providing relevant contextual information overlaid on the user's surroundings. Whether it's enhancing gaming experiences or providing real-time information overlays, V-JEPA can contribute to the development of immersive AR applications.

It is incredible to see the advancements in the AI space! I think all organisations should be trying to understand how to use all of this to improve their product, processes, productivity and user experience!

See you next week,
Davi, MetaVisions
