Development of a Deep Learning-Based Sentiment Analysis System for Tigrigna YouTube Comments
Publisher
Mekelle University
Abstract
As social media grows, millions of Tigrigna-speaking users share their thoughts on platforms like YouTube. However, these platforms offer no built-in tools for real-time sentiment analysis, and for low-resource languages like Tigrigna, even basic sentiment analysis datasets are unavailable. This research addresses the problem by developing a real-time sentiment analysis system specifically for Tigrigna YouTube comments, following the Design Science Research Methodology (DSRM). A dual-purpose browser extension was developed to support both real-time data collection and live sentiment prediction directly on the YouTube interface. The tool incorporates a Multi-Stage Linguistic Pre-processing Pipeline that distinguishes Tigrigna from Amharic, resulting in a gold-standard dataset of 30,353 comments. Because the initial data contained very few neutral samples, a human-in-the-loop (HITL) strategy was used: 1,500 model-predicted neutral samples were manually verified and added. This enlarged the minority class and improved the system’s ability to recognize neutral comments. To handle the informal nature of social media text, the preprocessing pipeline applied its steps in a fixed order: repeated character reduction, abbreviation expansion, character normalization, removal of URLs, punctuation, numbers, and non-Tigrigna text, and finally stop word removal. The final dataset was split into 70% for training, 15% for validation, and 15% for testing to ensure a rigorous evaluation, and the text was then tokenized with a sequence length of 32. The study compared nine experimental setups, pairing CNN, Bi-LSTM, and Hybrid architectures with Word2Vec, FastText, and Hybrid embeddings. The results show that the Bi-LSTM model with FastText embeddings performed best, achieving an accuracy of 82% and a Macro F1-score of 78%. The system showed a major improvement in the neutral class while maintaining high performance for positive sentiment.
The final system provides users with instant sentiment breakdowns of live YouTube comments, offering a practical tool for real-time monitoring and a significant step forward for Tigrigna natural language processing. This methodology provides a framework that can be used for other low-resource languages. Future work should focus on improving the detection of sarcasm and more complex language patterns in Tigrigna.
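
The preprocessing order described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abbreviation table, character map, stop word list, padding token, and the function names `preprocess` and `pad_or_truncate` are hypothetical placeholders, and the Ethiopic-script filter only approximates "non-Tigrigna text" removal, since it cannot separate Tigrigna from Amharic.

```python
import re

# Hypothetical placeholder resources; the abstract does not list the real tables.
ABBREVIATIONS = {"ዶ/ር": "ዶክተር"}                 # abbreviation -> expanded form
CHAR_MAP = str.maketrans({"ሠ": "ሰ", "ፀ": "ጸ"})  # variant letter -> canonical letter
STOP_WORDS = {"ናይ", "እዩ"}

REPEAT_RE = re.compile(r"(.)\1{2,}")                 # runs of 3+ identical characters
URL_RE = re.compile(r"https?://\S+")
NON_ETHIOPIC_RE = re.compile(r"[^\u1200-\u135A\s]")  # anything but Ethiopic letters/whitespace


def preprocess(comment):
    """Apply the cleaning steps in the order given in the abstract."""
    text = REPEAT_RE.sub(r"\1", comment)        # 1. repeated character reduction
    for abbr, full in ABBREVIATIONS.items():    # 2. abbreviation expansion
        text = text.replace(abbr, full)         #    (before "/" is stripped as punctuation)
    text = text.translate(CHAR_MAP)             # 3. character normalization
    text = URL_RE.sub(" ", text)                # 4a. URL removal
    text = NON_ETHIOPIC_RE.sub(" ", text)       # 4b. punctuation, numbers, other scripts
    return [t for t in text.split() if t not in STOP_WORDS]  # 5. stop word removal


def pad_or_truncate(tokens, length=32, pad="<pad>"):
    """Force a token list to a fixed sequence length (32 in the abstract)."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


if __name__ == "__main__":
    tokens = preprocess("ዶ/ር ሠላምምም!!! https://youtu.be/x 123 nice ናይ")
    print(pad_or_truncate(tokens)[:4])
```

Running abbreviation expansion before punctuation removal matters in this sketch, because forms such as "ዶ/ር" can only be matched while the slash is still present.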