Abstract
As public anxiety about violent crime has grown with the frequent occurrence of violent incidents, so has the need for intelligent video analysis in CCTV systems to support crime prevention and rapid incident response. One approach to detecting violent behavior in video is action-based detection using pose estimation. However, relying solely on the joint angles and their changes obtained from pose estimation produces false positives: non-violent actions such as patting someone's head or hugging are mistakenly classified as violent. This study aims to reduce the frequency of such false positives in action-based violence detection methods that use only pose estimation. We propose a new violence detection method that combines the facial emotion recognition result (anger, disgust, fear, sadness, surprise, happiness, or neutrality) of the expected victim with the existing pose estimation-based violence detection method. On a video dataset consisting of YouTube videos and self-made videos, combining pose estimation with facial emotion recognition achieved an accuracy of 92.5%, higher than that of the conventional method that relies solely on pose estimation. Future research will study violence detection in actual CCTV scenarios to improve the reliability of the results.
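The abstract does not state how the two signals are combined, so the following is only a minimal sketch of one possible decision-level fusion, not the method used in this study. It assumes a hypothetical pose-based violence score in [0, 1] and a hypothetical probability distribution over the seven emotion classes for the expected victim. The function name, both thresholds, and the choice of which emotions count as negative (surprise is left out here) are all illustrative assumptions.

```python
from typing import Dict

# Assumption: these four classes are treated as evidence of distress.
# The source lists seven emotions but does not say how each is weighted.
NEGATIVE_EMOTIONS = ("anger", "disgust", "fear", "sadness")


def fuse_violence_decision(
    pose_score: float,
    emotion_probs: Dict[str, float],
    pose_threshold: float = 0.5,      # hypothetical value
    emotion_threshold: float = 0.5,   # hypothetical value
) -> bool:
    """Flag violence only if the pose cue and the victim's emotion agree."""
    # Step 1: the pose-based detector must fire first.
    if pose_score < pose_threshold:
        return False
    # Step 2: total probability assigned to the negative emotion classes.
    negative_mass = sum(emotion_probs.get(e, 0.0) for e in NEGATIVE_EMOTIONS)
    # Step 3: keep the detection only when the victim appears distressed.
    return negative_mass >= emotion_threshold


# A hug: the pose cue alone looks violent, but the face is happy.
hug = fuse_violence_decision(0.8, {"happiness": 0.9, "neutrality": 0.1})

# Same pose score, but the expected victim shows fear and anger.
assault = fuse_violence_decision(
    0.8, {"fear": 0.6, "anger": 0.2, "neutrality": 0.2}
)
```

Under these assumptions, the hug is rejected and the second case is kept, which illustrates how an emotion cue could suppress the kind of false positive described above.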