I developed this project while participating in the AI-Challenge competition in 2023, where I utilized cutting-edge AI technologies, including zero-shot learning with the CLIP model and cosine similarity, to retrieve event-specific videos from visual data based on Textual KIS queries.
Summary of the CLIP model from the paper *Learning Transferable Visual Models From Natural Language Supervision*
Model Framework for Zero-Shot Video Event Retrieval:
The framework relies on the pre-trained CLIP model (`openai/clip-vit-base-patch32`) for extracting features from both textual queries and video keyframes.
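As a minimal sketch, assuming the Hugging Face `transformers` implementation of CLIP, loading the checkpoint and encoding a query and a single keyframe might look like the following (the query text and image path are illustrative):

```python
# Load the pre-trained CLIP checkpoint and encode one text query and one
# keyframe image. The concrete query and file name below are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode a textual KIS query (queries are translated to English beforehand).
text_inputs = processor(
    text=["a reporter standing in front of a flooded street"],
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Encode a single keyframe image.
image = Image.open("keyframe_0001.jpg").convert("RGB")  # hypothetical path
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
```

Both encoders project their inputs into the same embedding space, which is what makes the zero-shot comparison between query text and keyframes possible.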
The dataset used in the Video Event Retrieval application is provided by the competition organizers, comprising videos related to news bulletins produced by various Vietnamese broadcasters and extracted from YouTube.
The dataset consists of 20 subdirectories corresponding to news sources, including the 'VTV 60 Seconds' program and several others.
The varying number of videos in each subdirectory reflects the diversity of the data sources, with each video running between 15 and 30 minutes.
Each subdirectory is accompanied by a set of keyframes, representative images extracted from the corresponding videos.
Folder | # Videos | # Keyframes |
---|---|---|
L01 | 31 | 7,658 |
L02 | 30 | 8,820 |
L03 | 31 | 8,554 |
L04 | 31 | 9,908 |
L05 | 28 | 7,781 |
L06 | 28 | 8,980 |
L07 | 31 | 8,694 |
L08 | 29 | 9,779 |
L09 | 30 | 8,074 |
L10 | 30 | 9,058 |
L11 | 31 | 8,631 |
L12 | 31 | 8,294 |
L13 | 30 | 7,865 |
L14 | 30 | 8,835 |
L15 | 31 | 8,404 |
L16 | 31 | 8,914 |
L17 | 33 | 9,260 |
L18 | 23 | 4,961 |
L19 | 99 | 24,245 |
L20 | 100 | 25,030 |
Advantages | Impact |
---|---|
Diversity | The dataset is diverse in content, covering domestic and international events, cultural activities, sports, travel, cuisine, fashion, and news alerts. |
Data Size | The dataset is large, with a significant number of videos, giving the model a wide range of material to draw on and improving its ability to understand and differentiate between contexts. |
Accompanying Keyframes | The keyframes supplied by the organizers save the time otherwise spent extracting representative images from the videos and simplify content processing. |
Disadvantages | Impact |
---|---|
Image Quality | Some videos from security cameras have low image quality and lack color information, which makes keyframe extraction and CLIP-based matching harder. |
Content Diversity | The diversity of content in news videos demands diverse training data and a model capable of understanding and differentiating between many content types. |
Data Size | The large dataset is both an advantage and a disadvantage, since it requires significant computational resources to process. |
Audio and Text Conversion | News videos carry a large amount of accompanying information summarized in the presenter's speech; exploiting it requires speech-to-text conversion. |
Keyframe feature extraction:
Keyframe feature extraction is a crucial step in video processing and can facilitate concise and effective representations for image-based tasks within videos.
The process of extracting keyframe features involves the following steps:
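In outline, each keyframe image is loaded, encoded with the CLIP image encoder, L2-normalized, and stored for later comparison. A rough sketch of such a loop (not the exact pipeline used in the competition; the directory layout, output path, and batch size are assumptions):

```python
# Walk the keyframe folders, encode each image with the CLIP image encoder,
# L2-normalize the features, and dump them to a JSON file for later retrieval.
import json
import os

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_keyframe_features(keyframe_root, output_path, batch_size=32):
    paths = sorted(
        os.path.join(root, name)
        for root, _, files in os.walk(keyframe_root)
        for name in files
        if name.lower().endswith((".jpg", ".png"))
    )
    features = {}
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit length for cosine similarity
        for path, vec in zip(batch, feats):
            features[path] = vec.tolist()
    with open(output_path, "w") as fp:
        json.dump(features, fp)

# extract_keyframe_features("Keyframes/L01", "features_L01.json")  # hypothetical paths
```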
After obtaining both the text features from the query and the keyframe features from the video, the comparison process is performed by calculating the cosine similarity.
Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. It is commonly used in statistics, machine learning, natural language processing, and other fields to compare data points.
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
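A minimal sketch of this comparison step, computing the cosine similarity between the query's text feature and every keyframe feature and keeping the top-10 matches (variable and function names are illustrative):

```python
# Rank keyframes by cosine similarity to a text query feature.
import numpy as np

def rank_keyframes(text_feature, keyframe_matrix, keyframe_ids, top_k=10):
    # Normalize both sides so the dot product equals the cosine of the angle.
    q = text_feature / np.linalg.norm(text_feature)
    K = keyframe_matrix / np.linalg.norm(keyframe_matrix, axis=1, keepdims=True)
    scores = K @ q                        # one similarity score per keyframe
    order = np.argsort(-scores)[:top_k]   # indices of the most similar keyframes
    return [(keyframe_ids[i], float(scores[i])) for i in order]
```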
The Video Event Retrieval application has made progress in searching and retrieving events in videos based on text queries.
However, it faces several limitations and unachieved capabilities due to computational resource constraints and specific characteristics of the CLIP model:
This application interacts with Hadoop to access keyframe and video files from HDFS.
The keyframe features are stored in a JSON file on HDFS and accessed via HTTP requests.
Videos are also accessed from HDFS by downloading the video content locally to extract video clips around similar keyframes.
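A sketch of this HDFS access, assuming the WebHDFS REST API is exposed by the namenode; the host, port, and file paths below are illustrative, not the actual cluster configuration:

```python
# Read the precomputed feature JSON and download a video from HDFS over HTTP.
import json

import requests

WEBHDFS = "http://namenode:9870/webhdfs/v1"   # port 9870 on Hadoop 3.x (50070 on 2.x)

def read_hdfs_json(hdfs_path):
    # op=OPEN streams the file; requests follows the redirect to a datanode.
    resp = requests.get(f"{WEBHDFS}{hdfs_path}", params={"op": "OPEN"})
    resp.raise_for_status()
    return json.loads(resp.text)

def download_hdfs_file(hdfs_path, local_path):
    resp = requests.get(f"{WEBHDFS}{hdfs_path}", params={"op": "OPEN"}, stream=True)
    resp.raise_for_status()
    with open(local_path, "wb") as fp:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fp.write(chunk)

# features = read_hdfs_json("/dataset/keyframe_features.json")         # hypothetical path
# download_hdfs_file("/dataset/videos/L01_V001.mp4", "L01_V001.mp4")   # hypothetical path
```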
The application interface includes a text box for users to input queries.
Due to resource limitations and the CLIP model's limited ability to process Vietnamese, queries need to be translated into English before being entered into the application for optimal results.
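One possible way to pre-translate a Vietnamese query before pasting it into the application, assuming the `deep-translator` package; the package choice and the example query are assumptions, not part of the application itself:

```python
# Translate a Vietnamese query to English before submitting it to the app.
from deep_translator import GoogleTranslator

query_vi = "người dẫn chương trình đứng trước bản đồ thời tiết"  # hypothetical query:
                                                                  # "the presenter standing in front of a weather map"
query_en = GoogleTranslator(source="vi", target="en").translate(query_vi)
print(query_en)
```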
Besides retrieving the top 10 keyframes with the highest similarity to the query, together with a 10-second clip extracted from the video containing the most similar keyframe,
the application also features a sidebar providing basic information about the extracted video, such as: