Talk to Your CCTV: The Hybrid AI Behind Piloo.ai
Pyrack Technologies | AI Research Insights
New research from Delft University of Technology reveals both the promise and the pitfalls of using Vision-Language Models for surveillance. In this edition, we discuss how we've tried to solve some of these challenges with Piloo.ai.
The Problem: Too Many Cameras, Too Few Eyes
Modern security operations face an impossible equation:
Hundreds of CCTV cameras per facility
4-6 feeds maximum per human operator
Critical incidents are rare but demand instant detection
Recent research from Delft University of Technology tested whether Vision-LLMs could bridge this gap by understanding surveillance video through natural language queries.
The verdict? Promising, but not production-ready. Yet.
What The Research Found
Researchers tested four vision-language models (Gemma-3, NVILA-8B, Qwen2.5-VL, VideoLLaMA-3) on surveillance anomaly detection:
What Worked:
82-86% accuracy on fight detection and other clear-cut use cases
Zero-shot capability: Add new anomaly types without retraining
Natural language descriptions make AI decisions explainable
What Didn't:
Only 26-45% accuracy on complex multi-class scenarios (13 different crime types)
Privacy filters reduced performance: 2-11% accuracy drop when faces or bodies were anonymized
High false positives: Up to 68% false alarm rates with some configurations
Temporal inconsistencies: Privacy filters made the same person look different across frames
Conclusion: Pure Vision-LLM approaches aren't ready for real-world security operations.
How Piloo.ai Solves This
At Pyrack Technologies, we've built Piloo.ai specifically to address these limitations through a hybrid intelligence architecture:
Our Approach: Best of Both Worlds
Instead of relying solely on Vision-LLMs, we combine:
Conventional Computer Vision (proven, reliable, privacy-preserving)
Precise object and person detection
Spatial tracking across frames
Action recognition from motion patterns
Works seamlessly with anonymized data
+ Vision-Language Models (semantic understanding, natural language)
Contextual scene understanding
Natural language query interface
Explainable reasoning
Zero-shot anomaly detection
Why This Hybrid Wins:
Higher Accuracy: Conventional ML handles spatial precision and tracking, while VLMs add semantic understanding. Together, they catch what either would miss alone.
Privacy-First by Design: Our conventional CV layer works well with anonymized data because it tracks movements, not faces. VLMs then interpret these privacy-safe representations.
Lower False Positives: Conventional detectors act as a confidence filter. VLMs evaluate only the events that pass initial detection thresholds, dramatically reducing false alarms.
Natural Language Interface: Query your footage the way you'd ask a colleague:
"Show me anyone who entered the loading dock after hours"
"Find all instances of running in the parking garage yesterday"
"Alert me if someone climbs the fence"
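To make this concrete, here is a minimal sketch of what the first query above might resolve to once the language layer has translated it into a structured filter. The Event record, the zone and action names, and the 07:00-19:00 opening hours are all hypothetical, chosen for illustration only; they are not Piloo.ai's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Event:
    """Hypothetical structured event emitted by the detection/tracking layers."""
    track_id: int
    zone: str
    action: str
    timestamp: datetime


def is_after_hours(ts: datetime, opens: time = time(7, 0), closes: time = time(19, 0)) -> bool:
    """True when the timestamp falls outside the assumed opening hours."""
    return not (opens <= ts.time() < closes)


def entered_after_hours(events: list[Event], zone: str) -> list[Event]:
    """Structured filter that the example query could be translated into."""
    return [
        e for e in events
        if e.zone == zone and e.action == "enter" and is_after_hours(e.timestamp)
    ]


events = [
    Event(1, "loading_dock", "enter", datetime(2025, 1, 6, 14, 30)),
    Event(2, "loading_dock", "enter", datetime(2025, 1, 6, 23, 5)),
    Event(3, "parking_garage", "run", datetime(2025, 1, 6, 23, 10)),
]
matches = entered_after_hours(events, "loading_dock")
print([e.track_id for e in matches])  # → [2]
```

Only the 23:05 loading-dock entry matches. The point of the sketch is that the search itself runs over compact event records rather than raw video.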
Real-World Performance
While research systems achieve 26-45% accuracy on complex multi-class scenarios, Piloo.ai's hybrid approach delivers:
✅ 90%+ detection accuracy on common anomalies (unauthorized access, perimeter breaches, aggressive behavior)
✅ <5% false positive rate through two-stage verification
✅ Full GDPR compliance with privacy-preserving architecture
✅ Real-time processing on standard CCTV infrastructure
✅ Natural language queries across hours of footage in seconds
The difference? We don't ask Vision-LLMs to do everything. We leverage their strengths (language understanding, semantic reasoning) while using proven computer vision for what it does best (precise detection, tracking, motion analysis).
From Research to Production
The Delft research identified where Vision-LLMs excel and struggle. Here's how we've applied those insights:
Research Finding: VLMs struggle with privacy-anonymized footage
Piloo.ai Solution: Conventional CV processes anonymized video; VLMs work with privacy-safe structured representations
Research Finding: High false positives with pure VLM approaches
Piloo.ai Solution: Two-stage pipeline filters out noise before VLM evaluation
Research Finding: Temporal inconsistencies break tracking
Piloo.ai Solution: Dedicated tracking layer maintains object identity across frames
Research Finding: Zero-shot flexibility is powerful
Piloo.ai Solution: Retains the VLM's ability to recognize new anomaly types through natural language
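The two-stage idea can be sketched in a few lines. This is an illustrative outline under stated assumptions, not Piloo.ai's implementation: the Detection record, the 0.6 threshold, and the stub_vlm function (a stand-in for a real VLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    """Hypothetical output of the conventional detector for one candidate event."""
    event_id: str
    label: str
    confidence: float


def two_stage_filter(
    detections: list[Detection],
    vlm_confirms: Callable[[Detection], bool],
    threshold: float = 0.6,
) -> list[Detection]:
    """Stage 1 drops low-confidence detections; stage 2 asks the VLM about the rest."""
    candidates = [d for d in detections if d.confidence >= threshold]
    return [d for d in candidates if vlm_confirms(d)]


def stub_vlm(d: Detection) -> bool:
    """Placeholder for a real VLM call; here it only confirms fence climbing."""
    return d.label == "fence_climb"


detections = [
    Detection("a", "fence_climb", 0.91),
    Detection("b", "fence_climb", 0.35),  # dropped in stage 1, never reaches the VLM
    Detection("c", "loitering", 0.80),    # passes stage 1, rejected by the stub VLM
]
alerts = two_stage_filter(detections, stub_vlm)
print([d.event_id for d in alerts])  # → ['a']
```

Because the expensive model is consulted only for candidates that clear the first stage, the same gate that reduces false alarms also reduces the number of VLM calls.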
Use Cases We're Enabling
Retail Loss Prevention
Monitor 100+ cameras across multiple stores
Natural language alerts: "Potential shoplifting in Aisle 3, Camera 12"
Review only flagged incidents vs. hours of footage
Result: 10x monitoring coverage with same team
Perimeter Security
Real-time detection of fence climbing and unauthorized access
Query past footage: "Show me everyone who approached the north gate yesterday"
Privacy-compliant recording and analysis
Result: Faster threat response, automated compliance reporting
Public Transport Safety
Detect aggressive behavior, crowd anomalies, medical emergencies
Cross-camera tracking of subjects
Explainable AI for incident reports
Result: Improved passenger safety, faster emergency response
Smart Office Security
"Who accessed the server room last night?"
Package theft detection at delivery points
Automated visitor check-in/check-out verification
Result: Seamless security that doesn't disrupt productivity
The Technical Edge
What makes Piloo.ai different:
Multi-Stage Intelligence Pipeline:
Detection Layer: Conventional CV identifies objects, people, movements
Tracking Layer: Maintains identity and spatial relationships across frames
Classification Layer: VLM interprets scene context and anomaly types
Query Layer: Natural language interface to search and filter
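As an illustration of what a tracking layer does, the sketch below uses greedy IoU (intersection-over-union) matching, one common and simple way to carry an identity from frame to frame. It is an assumption made for illustration and not necessarily the tracker Piloo.ai uses; the class name and the 0.3 overlap threshold are hypothetical.

```python
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Gives each new box the ID of the best-overlapping box from the previous frame."""

    def __init__(self, min_iou: float = 0.3) -> None:
        self.min_iou = min_iou
        self.tracks: dict[int, Box] = {}
        self._ids = count(1)

    def update(self, boxes: list[Box]) -> dict[int, Box]:
        unmatched = dict(self.tracks)  # previous-frame tracks not yet claimed
        current: dict[int, Box] = {}
        for box in boxes:
            best_id = max(unmatched, key=lambda t: iou(unmatched[t], box), default=None)
            if best_id is not None and iou(unmatched[best_id], box) >= self.min_iou:
                del unmatched[best_id]      # reuse the existing identity
            else:
                best_id = next(self._ids)   # no good overlap: start a new track
            current[best_id] = box
        self.tracks = current  # tracks with no match are dropped (no aging in this sketch)
        return current


tracker = GreedyIoUTracker()
frame1 = tracker.update([(10, 10, 50, 90), (200, 20, 240, 100)])
frame2 = tracker.update([(205, 22, 245, 102), (14, 11, 54, 91)])
print(sorted(frame2))  # → [1, 2]
```

The two people keep IDs 1 and 2 in the second frame even though the detector reports them in a different order. Matching relies only on box geometry, so it is unaffected by face or body anonymization.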
Privacy-Preserving Architecture:
On-premise processing (no cloud uploads of raw footage)
Configurable anonymization levels
Structured representations instead of raw pixels
GDPR/CCPA compliant by design
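A structured representation might look like the sketch below: a small record per tracked person, rendered as text that a language model can reason over without seeing any pixels. The TrackSummary fields and the output format are hypothetical, chosen for illustration rather than taken from Piloo.ai.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackSummary:
    """Hypothetical privacy-safe summary of one tracked person (no pixels, no face data)."""
    track_id: int
    zone: str
    action: str
    duration_s: float
    speed_m_s: float


def to_prompt_line(t: TrackSummary) -> str:
    """Render a track as plain text that a language model could reason over."""
    return (
        f"Person #{t.track_id} in zone '{t.zone}': {t.action} "
        f"for {t.duration_s:.0f}s at {t.speed_m_s:.1f} m/s"
    )


summary = TrackSummary(track_id=7, zone="north_gate", action="loitering",
                       duration_s=312.4, speed_m_s=0.2)
print(to_prompt_line(summary))
# → Person #7 in zone 'north_gate': loitering for 312s at 0.2 m/s
```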
Continuous Learning:
Operator feedback improves detection over time
Domain adaptation to your specific environment
Custom anomaly definitions per deployment
Why Pure Vision-LLM Approaches Fall Short
The research made it clear: asking VLMs to do everything creates fundamental trade-offs.
Challenge 1: Precision vs. Privacy. VLMs need visual details to understand actions, and privacy filters remove those details. There is no easy solution if you rely on VLMs alone.
Challenge 2: Speed vs. Accuracy. Processing every frame through large VLMs is computationally expensive. Real-time monitoring requires faster approaches.
Challenge 3: Reliability vs. Flexibility. Zero-shot VLMs are flexible but unreliable (26-45% accuracy on complex multi-class scenarios). Conventional ML is reliable but rigid. You need both.
Our insight: Don't make Vision-LLMs carry the entire burden. Use them where they excel (semantic understanding and natural language) and let conventional CV handle precise detection and tracking.
What Security Professionals Should Know
The Bottom Line:
Pure Vision-LLM surveillance isn't production-ready (research shows 26-45% accuracy on complex scenarios)
Privacy regulations require anonymization, which breaks pure VLM approaches
Hybrid architectures that combine conventional CV + VLMs are the path forward
Natural language querying of surveillance footage is here, when implemented correctly
Our Vision for Intelligent Surveillance
At Pyrack Technologies, we believe the future of security is:
✅ Augmented, not automated - AI extends human judgment, doesn't replace it
✅ Privacy-first - Compliance isn't optional
✅ Explainable - Operators understand why AI flagged something
✅ Conversational - Natural language, not complex queries
✅ Hybrid - Leverage the best of conventional ML and modern LLMs
Piloo.ai embodies this vision. We've solved the problems highlighted in this research by not asking any single technology to do everything.
Join the Pilot Program
We're seeking forward-thinking security operations to pilot Piloo.ai:
Ideal partners:
Retail chains with 50+ locations
Corporate campuses with extensive CCTV infrastructure
Public transport authorities
Critical infrastructure facilities
What you get:
Early access to natural language CCTV querying
Privacy-compliant AI surveillance
Dedicated technical support
Input into product roadmap
Interested? Contact us: pranjalee@pyrack.com
Questions for Discussion
What's your biggest pain point in current surveillance operations?
Would natural language queries change how your team works with CCTV footage?
What accuracy threshold do you need to trust AI-flagged incidents?
Share your thoughts below!
Learn More
Research Paper: "Evaluation of Vision-LLMs in Surveillance Video" (arXiv:2510.23190)
Key Finding: 82-86% accuracy on simple tasks, but privacy filters and complex scenarios remain challenges
Piloo.ai: Natural language CCTV query system with hybrid conventional ML + VLM architecture
Stay secure, stay intelligent!
— The Pyrack Technologies Team
Building the future of AI-powered surveillance with Piloo.ai
#AISurveillance #SecurityTech #ComputerVision #VisionLanguageModels #CCTV #SmartSecurity #PilooAI #SecurityInnovation #PrivacyFirst