Despite major advancements in artificial intelligence (AI), a new study reveals that AI models still fall short when it comes to understanding the complexities of human social interactions. While AI has made significant strides in interpreting static images, it struggles to accurately analyze dynamic social scenes, a skill critical for technologies like autonomous vehicles, assistive robots, and human-centric AI systems.
In a recent study led by researchers at Johns Hopkins University, human participants significantly outperformed more than 350 AI models at judging short social video clips. The finding highlights a limitation of current AI, which still lags behind humans in processing dynamic, real-life social contexts.
The study, conducted by cognitive scientists at Johns Hopkins, assessed how well humans and AI models could interpret social scenes. Participants were shown three-second video clips depicting people either interacting with one another, working side by side, or performing independent activities. Viewers were asked to rate key features of the interactions on a scale of one to five.
Although they were asked to evaluate similar social cues, the AI models struggled to match human judgment in interpreting the videos. Regardless of their size or the data they had been trained on, none of the more than 350 language, video, and image models matched the accuracy of human assessments. Even video models, which are designed to analyze motion, failed to reliably predict the actions of people in the clips.
AI's struggles were especially evident when video models were tasked with predicting human behavior. Image-based models, which analyzed still frames, were unable to determine whether two people were communicating or simply engaged in side-by-side activities. Language models, which worked from written captions, fared slightly better at predicting human behavior. However, even these models couldn't capture the dynamic nature of social interactions in real-world situations.
As Leyla Isik, lead author and assistant professor of cognitive science at Johns Hopkins University, explains: “For a self-driving car, for example, you need the AI to understand human intentions—whether a pedestrian is about to walk or two people are having a conversation. But current AI just isn’t able to predict these things in the real world with any accuracy.”
Isik attributes the gap between AI and human understanding to the architecture of AI systems, which rely on neural networks inspired by the brain regions that process static images. Such models are not equipped to handle the dynamic nature of human interactions in real-life scenarios.
The central issue, according to the researchers, lies in the design of AI systems. AI neural networks have traditionally been modeled after brain areas responsible for processing static images, such as faces and objects. Understanding dynamic social interactions, however, draws on a different mechanism in the brain, one that interprets the flow of events, emotions, and intentions as they unfold in real time.
“Understanding the relationships, context, and dynamics of social interactions is the next step for AI,” said Kathy Garcia, a doctoral student involved in the research. “This study shows that there’s a blind spot in AI model development when it comes to processing complex, dynamic scenes.”
Garcia further emphasizes that AI’s reliance on static image processing is one of the main factors preventing it from comprehending real-world social dynamics. The challenge lies not just in identifying individuals or objects but in understanding the subtleties of human behavior, including intentions, emotions, and context—elements that are essential for effective social interaction.
The study underscores a significant contrast in AI's ability to analyze still images versus dynamic, moving scenes. While AI has made considerable progress in recognizing objects and faces in images, understanding what’s happening within a sequence of actions or interpreting the narrative unfolding in a social interaction is much more challenging.
“The real problem is that AI can recognize static objects and faces, but it struggles to understand the story unfolding in a dynamic scene,” Garcia added. “The dynamics of social interactions are incredibly complex, and AI has yet to crack that code.”
While AI's shortcomings in understanding human social dynamics are clear, the research also points to where AI development might go next. The findings suggest that for AI to become adept at human interaction, researchers need to rethink how models are designed and trained. Rather than focusing solely on static image recognition, AI systems would need to process dynamic scenes in real time.
According to Isik, “There's something fundamental about the way humans process social scenes that these models are missing. The next generation of AI needs to focus on understanding dynamic, real-time human interactions to truly integrate into human-centric technologies like self-driving cars and assistive robots.”
As AI continues to evolve, the gap between human and machine understanding of social interactions remains a significant hurdle. While AI excels at reading still images and recognizing objects, its inability to comprehend the flow of dynamic social scenes poses a major challenge for applications such as autonomous vehicles and assistive robots.
The study from Johns Hopkins University underscores the need for a shift in AI model design, one that moves beyond static image processing and accounts for the complexities of human social interactions. For now, humans remain far better than AI at interpreting the nuanced, dynamic behaviors that define our social world. As researchers continue to explore the issue, the findings suggest that understanding how the brain processes social scenes in real time will be key to developing AI that can understand and interact with humans in a meaningful way.