

When ChatGPT emerged in late 2022, the potential seemed limitless, but few consumers knew exactly how to use the new AI tool in their day-to-day lives. Businesses responded in much the same way at first, both to Large Language Models (LLMs) such as ChatGPT and to tools for creating images from text: these seemed like an obviously powerful set of solutions, but one that was perhaps still in search of a specific problem to solve.
In the 18 months since, however, businesses from Google to Adobe to Duolingo have been steadily building both LLMs and text-to-image tools into their flagship products, and the usefulness of these tools no longer seems so remote. In this blog post we will explore the emerging implications of generative AI technology for the video surveillance industry, including the most promising use cases and some recent developments in the underlying technology.
Use Case #1: Synthetic Image Data Creation:
Synthetic data, algorithmically generated to approximate real data, is a potentially valuable resource for training AI models in environments where real data is unavailable, insufficient, or too sensitive to use. For video surveillance use cases, image-generation technology could offer a way to create vast amounts of training data while avoiding many of the ethical and privacy concerns inherent in using real-world footage. Synthetic image data also makes it possible to create varied scenarios that are rare or difficult to capture in real life but are essential for comprehensive AI training, so that systems are better prepared for the range of situations they might encounter.
This is especially important when a surveillance system needs to learn from rare events, such as unusual methods of theft or specific types of intrusion that seldom appear in off-the-shelf datasets.
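As a rough illustration of how varied scenarios might be specified, the sketch below enumerates combinations of scene attributes and turns each one into a text prompt that could be handed to an image-generation tool. The attribute lists, the prompt wording, and the build_prompts helper are all hypothetical and chosen only for illustration; the sketch does not call any image generator.

```python
import itertools
import random

# Hypothetical scene attributes; a real project would choose its own.
LOCATIONS = ["store entrance", "loading dock", "parking lot"]
LIGHTING = ["daylight", "dim evening light", "night with infrared"]
EVENTS = [
    "a person concealing an item in a bag",
    "a person forcing open a door",
    "a person walking past normally",  # benign counterexample
]


def build_prompts(sample_size=None, seed=0):
    """Return text prompts covering combinations of the attributes above."""
    combos = list(itertools.product(LOCATIONS, LIGHTING, EVENTS))
    if sample_size is not None:
        rng = random.Random(seed)  # fixed seed keeps the sample reproducible
        combos = rng.sample(combos, sample_size)
    return [
        f"Security camera view of {event} at the {location}, {lighting}"
        for location, lighting, event in combos
    ]


prompts = build_prompts()
print(len(prompts))  # 3 x 3 x 3 = 27 combinations
print(prompts[0])
```

Including benign scenes alongside the rare events matters here: a model trained only on suspicious examples has nothing to contrast them with.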
Use Case #2: Improved Natural Language Interaction:
Another application of generative AI involves the way users interact with and gain insights from their video systems. One side benefit of the rapidly improving language understanding behind LLMs and text-to-image prompting is that it opens the door for security professionals to query their video monitoring installations using natural language prompts instead of icons or other traditional UI elements. For instance, suppose a store manager suspects that a theft occurred sometime last week but isn’t sure of the exact day. Instead of manually searching through hours of video, or clicking on a series of icons and entering various parameters, they could ask the system: “Show me instances of suspected theft near the electronics aisle between 3 PM and 5 PM last week.” The AI would interpret the query and sift through the data to present relevant footage, saving time and making the process far more efficient. With natural language as the primary interface, users would likely need much less training and make fewer errors.
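To make the idea concrete, here is a minimal sketch of what the system has to produce from such a request: a structured search filter with an event type, a camera zone, a time window, and a date range. In a real product a language model would presumably do the interpretation; simple keyword rules stand in for it here. The VideoQuery fields, the keyword tables, the parse_query function, and the reference date are all assumptions made for this example, not part of any actual product.

```python
import re
from dataclasses import dataclass
from datetime import date, time, timedelta
from typing import Optional

# Hypothetical vocabulary; a deployed system would use a language model here.
EVENT_KEYWORDS = {"theft": "suspected_theft", "intrusion": "intrusion"}
ZONE_KEYWORDS = {"electronics aisle": "electronics_aisle", "entrance": "entrance"}


@dataclass
class VideoQuery:
    event_type: Optional[str]
    zone: Optional[str]
    start_time: Optional[time]
    end_time: Optional[time]
    first_day: Optional[date]
    last_day: Optional[date]


def _to_time(hour_text, meridiem):
    """Convert an hour such as '3' plus 'pm' into a 24-hour time object."""
    hour = int(hour_text) % 12
    if meridiem.lower() == "pm":
        hour += 12
    return time(hour, 0)


def parse_query(text, today):
    """Map a free-text request onto a structured VideoQuery filter."""
    lowered = text.lower()
    event = next((v for k, v in EVENT_KEYWORDS.items() if k in lowered), None)
    zone = next((v for k, v in ZONE_KEYWORDS.items() if k in lowered), None)

    start = end = None
    match = re.search(r"between (\d{1,2}) ?(am|pm) and (\d{1,2}) ?(am|pm)", lowered)
    if match:
        start = _to_time(match.group(1), match.group(2))
        end = _to_time(match.group(3), match.group(4))

    first_day = last_day = None
    if "last week" in lowered:
        # Previous Monday-to-Sunday week relative to `today`.
        this_monday = today - timedelta(days=today.weekday())
        first_day = this_monday - timedelta(days=7)
        last_day = this_monday - timedelta(days=1)

    return VideoQuery(event, zone, start, end, first_day, last_day)


query = parse_query(
    "Show me instances of suspected theft near the electronics aisle "
    "between 3 PM and 5 PM last week.",
    today=date(2024, 5, 15),  # arbitrary reference date for the example
)
print(query)
```

Whatever does the interpreting, the design point is the same: the free-text request ends up as an explicit filter that the video system can run against its stored event metadata.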
Use Case #3: Heightened Accuracy and Scene Detection:
Generative AI, with newly emerging technologies described below, can also improve the accuracy of video surveillance analytics by reducing false positives through better understanding of image context. In retail, AI-enhanced video analytics can reduce false positives in theft detection by more accurately distinguishing between suspicious behavior and normal customer interactions. For example, differentiating between a customer picking up an item for closer inspection and one attempting to conceal the item for theft becomes more manageable.
The newest technology — Vision Transformers:
For analyzing images and video, most commercial AI technology today relies on Convolutional Neural Networks, or CNNs. The Generative AI era, on the other hand, has introduced Vision Transformers (ViTs), a newer image-analysis architecture that applies the transformer design behind language-only LLMs such as ChatGPT to images.
To understand why ViTs stand to do better at video analysis tasks than CNNs, consider a crowded city park scene where we want to identify and track every person’s movement over time. A traditional CNN might focus on local patterns, such as recognizing parts of people (for example, heads or shoulders), but might struggle to keep track of each person among many others, especially as they move behind trees or other people.
A Vision Transformer, however, treats the scene as a big puzzle, looking at all parts of the image at once. It’s better at understanding that the person who walked behind a tree is the same one who comes out the other side, even if partially obscured for a moment. This global view makes ViTs especially good at tracking movements in busy or complicated scenes, improving how we monitor and analyze movements in public spaces for safety and security.
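The "all parts at once" idea can be sketched in a few lines. A ViT cuts the image into small patches and lets every patch compare itself with every other patch through self-attention. The toy example below is a simplified illustration under stated assumptions: it uses a single attention head with no learned weights and no position information, all of which a real ViT has, and the function names and 4x4 "image" are made up for this sketch.

```python
import math


def split_into_patches(image, patch):
    """Cut a 2D grid of pixel values into flattened patch x patch tiles.

    Assumes the image height and width are multiples of `patch`.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Every patch is scored against every patch, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    mixed = [[sum(w * tok[i] for w, tok in zip(row, tokens)) for i in range(dim)]
             for row in weights]
    return weights, mixed


# Toy 4x4 "image": bright patches top-left and bottom-right, dark elsewhere.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
patches = split_into_patches(image, patch=2)
weights, mixed = self_attention(patches)
print(len(patches))  # 4 patches
print([round(w, 3) for w in weights[0]])  # attention paid by the top-left patch
```

In this toy case the top-left patch gives as much weight to the matching bright patch in the opposite corner as it gives to itself, and more than it gives to its dark neighbors. A convolutional filter, by contrast, only sees a small local window at each step.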
Beyond Vision Transformers: what’s next?
Vision Transformers are revolutionizing how we process still images by treating them like interconnected puzzles, but the ever-evolving field of video analytics demands even more contextual understanding across timestamps. Recognizing the dynamic nature of video, researchers have begun developing innovations that extend beyond static images to understand the flow and narrative of scenes over time. These advancements consider how objects and people move, interact, and change from one frame to the next, providing a deeper understanding of video content. This evolution from static to dynamic analysis is crucial for video surveillance, as it will enhance our ability to interpret complex activities and behaviors as the technology matures.
Conclusion:
As we stand on the brink of these exciting technological advancements, it’s clear that the integration of Vision Transformers and new innovations tailored for dynamic video analysis into video analytics products is imminent. In the coming months, we can expect these developments to significantly enhance the capabilities of video surveillance systems. The implications are profound: by enabling more accurate identification of activities, behaviors, and trends, these advancements will greatly influence the effectiveness and success of video security installations.
Donald Lyman, President, Peregrine Security, Inc.