On social media, toxic comments can spread like wildfire, targeting individuals and marginalized groups. While overt hate speech is relatively easy to flag, implicit toxicity - which relies on stereotypes and coded language rather than outright insults - poses a more challenging problem. How can we train artificial intelligence systems to not only detect this hidden toxicity but also explain why it is harmful?
Researchers from Nanyang Technological University, the National University of Singapore, and the Institute for Infocomm Research in Singapore are tackling this challenge head-on with a novel framework called ToXCL, outlined in Figure 2. Unlike previous systems that integrate detection and explanation into a single text generation task, ToXCL takes a multi-module approach, breaking the problem down into several steps.
First, there is a target group generator: a text generation model that identifies the specific minority group(s) a post may be targeting. Next comes an encoder-decoder model. Its encoder classifies the post as toxic or non-toxic; if the post is labeled toxic, its decoder, drawing on the predicted target group, generates an explanation of why the post is harmful.
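To make the flow concrete, here is a minimal sketch of these modules at inference time, assuming off-the-shelf T5-style checkpoints from Hugging Face Transformers. The model name (google/flan-t5-base), prompt templates, pooling, and classification head are illustrative assumptions rather than the authors' released code, and in practice each module would be fine-tuned before use.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
# Module 1: target group generator (a separate seq2seq model in ToXCL).
tg_generator = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").to(device)
# Module 2: encoder-decoder whose encoder feeds a toxicity classifier
# and whose decoder writes the explanation.
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").to(device)
cls_head = torch.nn.Linear(enc_dec.config.d_model, 2).to(device)  # non-toxic / toxic

@torch.no_grad()
def generate_target_group(post: str) -> str:
    """Predict which minority group(s) the post targets."""
    ids = tok(f"Which group does this post target? {post}", return_tensors="pt").to(device)
    out = tg_generator.generate(**ids, max_new_tokens=16)
    return tok.decode(out[0], skip_special_tokens=True)

@torch.no_grad()
def classify_toxicity(post: str, target_group: str) -> bool:
    """Encoder side: mean-pool the encoder states and apply the classification head."""
    ids = tok(f"{target_group} </s> {post}", return_tensors="pt").to(device)
    hidden = enc_dec.get_encoder()(**ids).last_hidden_state   # (1, seq_len, d_model)
    logits = cls_head(hidden.mean(dim=1))                      # (1, 2)
    return logits.argmax(dim=-1).item() == 1                   # assume index 1 = toxic

@torch.no_grad()
def generate_explanation(post: str, target_group: str) -> str:
    """Decoder side: explain why the post harms the predicted target group."""
    ids = tok(f"Explain why this post is harmful to {target_group}: {post}",
              return_tensors="pt").to(device)
    out = enc_dec.generate(**ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```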
But here's the clever part: to strengthen the encoder's detection ability, the researchers introduced a powerful teacher classifier. Through knowledge distillation, this teacher imparts its expertise to the encoder during training, sharpening its toxic/non-toxic predictions.
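For readers curious what that looks like in practice, here is a minimal sketch of a standard knowledge-distillation objective for the toxic/non-toxic classifier; the temperature, loss weighting, and exact formulation used in ToXCL may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft teacher targets with hard gold labels (illustrative weights)."""
    # Soft targets: KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the gold toxic/non-toxic labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```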
The researchers also added a conditional decoding constraint, which ensures the decoder generates explanations only for posts classified as toxic, eliminating outputs that contradict the detection verdict.
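Continuing the hypothetical sketch above, this constraint amounts to gating the decoder call on the classifier's verdict:

```python
# The decoder is invoked only when the encoder-side classifier predicts "toxic",
# so the system never produces an explanation that contradicts its own label.
def detect_and_explain(post: str) -> dict:
    target_group = generate_target_group(post)            # module 1
    if not classify_toxicity(post, target_group):          # encoder verdict
        return {"label": "non-toxic", "explanation": None}
    return {"label": "toxic",
            "explanation": generate_explanation(post, target_group)}  # decoder runs only here
```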
So how does it perform? On two major benchmarks for implicit toxicity, ToXCL outperforms state-of-the-art baselines, including models that focus solely on detection or solely on explanation. Human evaluators also rated its outputs higher for correctness, fluency, and reduced harmfulness than those of other leading systems.
Of course, there is still room for improvement. The model sometimes stumbles on symbols or abbreviations that require external knowledge to decode. And the subjectivity of implicit toxicity means that "correct" explanations are often multifaceted. Overall, though, ToXCL represents an impressive step forward for artificial intelligence systems in identifying hidden hate and elucidating its harmful effects. As this technology continues to evolve, we must also address potential risks such as reinforcing biases or generating toxic language itself. But with caution and care, it can pave the way for amplifying the voices of marginalized groups and curbing oppressive speech on the internet.