Deep Learning Approaches for Detecting Adversarial Cyberbullying and Hate Speech in Social Networks (2024)

Sylvia Worlali Azumah
School of Information Technology
University of Cincinnati
Cincinnati, Ohio, USA
azumahsw@mail.uc.edu

Nelly Elsayed
School of Information Technology
University of Cincinnati
Cincinnati, Ohio, USA
elsayeny@ucmail.uc.edu

Zag ElSayed
School of Information Technology
University of Cincinnati
Cincinnati, Ohio, USA
elsayezs@ucmail.uc.edu

Murat Ozer
School of Information Technology
University of Cincinnati
Cincinnati, Ohio, USA
ozermm@ucmail.uc.edu

Amanda La Guardia
School of Human Services
University of Cincinnati
Cincinnati, Ohio, USA
laguaraa@ucmail.uc.edu

Abstract

Cyberbullying is a significant concern intricately linked to technology, yet technology also provides the means to mitigate it. To address growing concerns regarding the adverse impact of cyberbullying on individuals’ online experiences, various online platforms and researchers are actively adopting measures to enhance the safety of digital environments. While researchers persist in crafting detection models to counteract or minimize cyberbullying, malicious actors are deploying adversarial techniques to circumvent these detection methods. This paper focuses on detecting cyberbullying in adversarial attack content within social networking site text data, with a specific emphasis on hate speech. Utilizing a deep learning-based approach with a correction algorithm, this paper yielded significant results. An LSTM model with a fixed epoch of 100 demonstrated remarkable performance, achieving accuracy, precision, recall, F1-score, and AUC-ROC scores of 87.57%, 88.73%, 87.57%, 88.17%, and 91%, respectively. Additionally, the LSTM model’s performance surpassed that of previous studies.

Index Terms:

Cyberbullying, hate speech, adversarial attacks, deep learning

I Introduction

Social media platforms allow individuals to stay informed about trends and news and freely express their opinions, fostering user discussions. However, the absence of effective moderation on these platforms has led to various issues, including the rampant dissemination of false information, online harassment, and cyberbullying[1]. Cyberbullying is a prevalent issue across various online platforms, particularly among young users who share their thoughts, interests, and challenges. Unfortunately, some individuals resort to harmful tactics like harassment, threats, intimidation, and mocking to purposefully inflict emotional harm. The intention behind such behavior is to make others feel worse, erode their self-esteem, and discourage them from engaging in online discussions, posting questions, messages, or sharing personal images[2]. Anonymity presents a significant appeal to cyberbullies, as it shields them from social repercussions and negative consequences associated with their improper behavior. This problem has been on the rise since the advent of social networks. A notable incident occurred with the closure of Formspring, likely due to the suicides linked to cyberbullying messages exchanged on that platform[2]. The anonymity feature of the service, enabling individuals who know each other offline to message each other anonymously, played a crucial role.

Cyberbullying is a pressing issue deeply rooted in technology. However, technology can also serve as a solution to reduce or even eliminate it [3]. In response to the increasing worry surrounding the negative effects of cyberbullying on individuals’ Internet experiences, numerous online platforms and researchers are implementing measures to improve the safety of their online environments [4][5]. These online platforms utilize various strategies to address the issue, including refining content through crowdsourcing via upvotes and downvotes, disabling comments, or implementing manual moderation to minimize the impact of inappropriate content [6]. However, these approaches have proven to be inefficient and lacking in scalability. Consequently, there has been significant demand for methods that automatically detect abusive or toxic content in real time [7]. Developing Natural Language Processing (NLP) algorithms specifically designed to detect cyberbullying is one approach toward this goal. Such algorithms rely on annotated data to measure their performance effectively. Popular Machine Learning (ML) algorithms, particularly Deep Neural Networks (DNN), require large annotated corpora to attain high-quality classification results. While researchers continue developing detection models to curb or reduce cyberbullying, attackers use adversarial techniques to bypass detection methods. Machine learning models are typically optimized to perform well when operating with clean data and in benign conditions [8]. However, as highlighted by Hosseini et al. [1], they are susceptible to adversarial attacks. These attacks exploit vulnerabilities in the machine learning algorithms, wherein an adversary can manipulate the algorithm’s prediction scores by making slight, often imperceptible perturbations to the input data [9][10]. Such inputs are commonly referred to as adversarial examples [11], and they have demonstrated their effectiveness in bypassing various machine learning algorithms, even when the adversary has limited knowledge and can only access the target model as a black box [12]. This paper aims to improve the user experience on different social media platforms by investigating and implementing cyberbullying content detection on adversarial text data, utilizing a simple deep learning approach that can be efficiently implemented across platforms.

I-A Adversarial Attacks

Adversarial attacks involve creating examples with the intent to deceive a system or model. According to Goodfellow et al. [13], adversarial examples are inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, aiming to make the model output an incorrect answer with high confidence. These attacks can involve subtle or significant modifications to previously correctly classified examples of various types to cause the model to misclassify them while maintaining readability for human users [13]. In the field of Natural Language Processing (NLP), various types of adversarial attacks exist, categorized by Alsmadi et al. [14] into four groups:

i) Character-level attacks: substitutions, additions, or deletions of characters within a word (typos, added white spaces, etc.). Example: "I want to k ! l l you."

ii) Word-level attacks: replacement of words with adversarial synonyms that are more challenging to classify. Example: "I want to end you."

iii) Sentence-level attacks: additions, deletions, or paraphrasing of entire sentences. Example: "I would love to send you to heaven today."

iv) Inter-level attacks: a combination of character-, word-, and sentence-level attacks.

Semantic changes, particularly in word- and sentence-level attacks, are complex because they require an understanding of grammar and sentence structure. In many instances, human users might not discern these changes as attempts to evade detection [15]. Character-level attacks, on the other hand, are more straightforward, altering only individual letters; a minimal illustrative sketch of such a perturbation follows.
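The snippet below is a toy, hedged illustration (not taken from the paper) of how a character-level perturbation might be generated by inserting a separator inside a targeted word, mirroring the "k ! l l" example above. The target word list and separator characters are arbitrary choices made for the example.

```python
# Toy character-level perturbation: insert a random separator inside a word an
# attacker wants to hide from a text classifier. Illustrative only.
import random

def perturb_word(word, separators="!.*-"):
    """Insert a random separator between two characters of the word."""
    if len(word) < 2:
        return word
    pos = random.randint(1, len(word) - 1)
    return word[:pos] + " " + random.choice(separators) + " " + word[pos:]

def character_level_attack(text, target_words):
    """Perturb only the words the attacker wants to slip past the classifier."""
    return " ".join(
        perturb_word(tok) if tok.lower() in target_words else tok
        for tok in text.split()
    )

print(character_level_attack("I want to kill you", {"kill"}))
# e.g. "I want to ki ! ll you"
```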

Adversarial attacks exploit models’ weaknesses, and knowledge of the model architecture, functions, and parameters provides an advantage [14]. In practical scenarios, however, attackers often lack insight into the specific model they aim to deceive. Consequently, the discovery and exploitation of model-specific weaknesses usually occur through trial and error. Adversarial attacks can be initiated with various motives, such as influencing a classifier’s decision or compromising security. An illustrative case involves the exploitation of vulnerabilities in Microsoft’s Tay chatbot, resulting in its shutdown due to the generation of racist tweets [16].

It is worth noting that while each system may have individual weak points, many systems share vulnerabilities to the same types of attacks, as observed by Goodfellow et al.[13]. These weaknesses can be consistent across various models and training datasets.

II Related Work

Deep learning falls under the umbrella of machine learning and is characterized by computational models with multiple layers, offering a high level of abstraction. These models learn from experience, perceiving the world through a hierarchy of concepts. Utilizing the backpropagation algorithm, deep learning explores intricate details in extensive datasets to compute data representation in each layer based on the preceding layer’s representation[17].

Its significance lies in providing solutions to previously insurmountable problems with conventional machine learning techniques. Advancements in deep neural network models, coupled with high performance hardware, have driven progress in traditional areas like image classification, speech recognition, and language translation, as well as more sophisticated domains such as drug molecule analysis[18], brain circuit reconstruction[19], particle accelerator data analysis[20], and studying DNA mutation effects[21].

The unprecedented accuracy of deep learning networks has revolutionized AI-based services on the Internet. Major players like Google, Alibaba, Intel, Amazon, and Nvidia have leveraged deep learning for cloud computing AI-based services. Its applications extend to safety and security critical environments, including self-driving cars, malware detection, drones, and robotics. Recent advancements in face recognition have led to biometric authentication in ATMs and mobile phones. Products like Apple Siri, Amazon Alexa, and Microsoft Cortana have become possible through Automatic Speech Recognition (ASR) models and Voice Controllable Systems (VCS).

However, as deep neural networks transition from labs to the real world, concerns arise regarding the security and integrity of applications. Adversaries can manipulate legitimate inputs, which are imperceptible to humans but capable of forcing a trained model to produce incorrect outputs. Szegedy et al.[11] first identified the susceptibility of well performing deep neural networks to adversarial attacks. Carlini et al.[22] and Zhang et al.[23] highlighted vulnerabilities in automatic speech recognition and voice controllable systems. Kurakin et al.[24] demonstrated attacks on autonomous vehicles by manipulating traffic signs. Several countermeasures have been proposed to address adversarial attacks, including adversarial training, distillation, and Generative Adversarial Networks (GANs). However, no single solution has proven effective against all types of attacks, and implementing these defenses may lead to performance degradation and reduced model efficiency.

Hate speech, characterized by abusive or threatening expressions directed against a particular group based on attributes like color, race, religious beliefs, gender, or sexual orientation, is recognized as a significant contributor to escalating global violence[25]. The prevalence of social networks such as Facebook, Instagram, and Twitter, coupled with increased Internet accessibility, has exacerbated the issue by enabling individuals to freely and effectively express their opinions to a global audience[26].

In response to growing concerns about global violence, various initiatives have been undertaken to identify potential sources and mitigate the spread of hate speech on social networks. The AI, ML, data science, and NLP communities have contributed by proposing innovative hate speech detection techniques[27]. For example, Mozafari et al.[28] introduced a hate speech detection and classification framework for Twitter text streams based on BERT. However, like other NLP applications, hate speech detection methods are susceptible to adversarial attacks, as demonstrated in a study where state-of-the-art NLP adversarial attacks modified text to deceive hate speech recognition models[29]. The objective of these attacks is to disrupt the classification capabilities of the models, resulting in the misclassification of abusive and toxic content.

The literature provides various examples of adversarial attacks on hate speech detection techniques. Gröndahl et al.[30] employed three types of adversarial attacks: word changes, word-boundary changes, and appending unrelated innocuous words. Similarly, in another study, hate speech detection models were fooled using typos, removing white spaces, inserting benign words, and appending character boundaries[31]. Combining individual attack types resulted in more effective adversarial attacks.

To counter attacks on hate speech detection models, several defense strategies have been introduced. Many proposed methods rely on adversarial training to handle perturbations[30][32][33]. For instance, adversarial training in[32] extends the fast gradient method (FGM) attack by incorporating a learnable and fine-grained noise magnitude, adding noise to misleading samples. Additionally, some solutions utilize preprocessing to defend against adversarial attacks on hate speech detection models[34]. Moh et al.[31] introduced four preprocessing defense techniques for space removal, typos, benign word insertion, and character boundary appending attacks.

Similar to other social media applications, a significant portion of hate speech detection solutions rely on graph-based approaches for robust detection. Beatty et al.[35] demonstrated the robustness of graph-based models against adversarial attacks, surpassing the performance of text-based solutions in detecting hate or toxic speech[35]. In[36], a graph-based solution was proposed to protect against adversarial attacks by incorporating the concept of latent neighborhood and systematic sampling of neighborhood nodes.

The research of Hosseini et al.[1] proposes and presents an attack on the Perspective toxic detection system using adversarial examples. The Perspective API detection system is a recent project developed by Google and Jigsaw that utilizes machine learning to automatically identify online insults, harassment, and abusive speech[37]. In their paper, they demonstrated that an adversary can make subtle modifications to highly toxic words, causing the system to assign a significantly lower toxicity score to it. By applying this attack to the sample words provided on the Perspective website, the researchers consistently reduced the toxicity scores to match those of non-toxic words. By conducting various experiments, they demonstrated that an adversary can manipulate the system by intentionally misspelling abusive words or inserting punctuation marks between the letters.

After experimentation, the researchers noticed that the adversarial perturbations can transfer across different phrases: if a specific modification to a word decreases the toxicity score of one phrase, applying the same modification to that word is likely to reduce the toxicity score of another phrase as well. Exploiting this characteristic, an adversary can create a dictionary of adversarial perturbations for each word, greatly streamlining the attack process [1]. Furthermore, it was observed that the Perspective detection system occasionally assigns high toxicity scores to seemingly harmless words, incorrectly assigning a 34% toxicity score to a majority of misspelled and randomly generated words. The Perspective interface also enables users to provide feedback on the toxicity scores of words, indicating that the learning algorithm incorporates new data to update itself. However, this functionality also opens the system to potential poisoning attacks, in which an adversary manipulates the training data, specifically the labels, to trick the model into assigning low toxicity scores to specific words.

These tactics enable the adversary to deceive the Perspective toxic detection system. Such adversarial examples pose a significant threat to toxic detection and cyberbullying detection systems, severely undermining their effectiveness and usability. Similar to [1], Li et al. [38] discuss the security vulnerabilities of Deep Learning based Text Understanding (DLTU) and propose a general attack framework called TEXTBUGGER for generating adversarial texts. The problem is that DLTU is vulnerable to adversarial text attacks, where maliciously crafted texts trigger target DLTU systems and services to misbehave [39]. This is concerning given the increasing use of DLTU in security-sensitive applications such as sentiment analysis and toxic content detection. The proposed solution, TEXTBUGGER, is a general attack framework that outperforms state-of-the-art attacks in terms of attack success rate, preserves the utility of benign text, and generates adversarial text with computational complexity sub-linear in the text length. The authors empirically evaluated TEXTBUGGER on a set of real-world DLTU systems and services used for sentiment analysis and toxic content detection, demonstrating its effectiveness, evasiveness, and efficiency. For example, TEXTBUGGER achieves a 100% success rate on the IMDB dataset against Amazon AWS Comprehend in just 4.61 seconds while still maintaining 97% semantic similarity [38].

The authors concluded that TEXTBUGGER is effective and efficient for generating targeted adversarial NLP examples. The transferability of such examples hints at potential vulnerabilities in many real applications, including text filtering systems and online recommendation systems. The authors suggested possible defense mechanisms to mitigate such attacks, such as spelling checks and adversarial training, and proposed exploring an ensemble of linguistically aware or structurally aware defense systems to improve robustness.

In one of the initial attempts to deceive deep neural text classifiers, Papernot et al.[40] introduced a white box adversarial attack. They iteratively applied the attack to modify an input text until the resulting sequence was misclassified. Although successful in fooling the classifier, their approach resulted in significant alterations at the word level, significantly impacting the original meaning.

In another study by Ebrahimi et al.[41], they proposed a gradient-based optimization method. This technique involved changing one token to another using the gradients of the model with respect to the one-hot vector input. Additionally, Samanta et al.[42] utilized embedding gradients to identify crucial words and devised heuristic rules combined with hand-crafted synonyms and typos. These various approaches aimed to manipulate the text inputs to evade classification models while preserving some semantic relevance.

Other prior works also focused on generating adversarial examples for text by substituting a word with another legible but out of vocabulary word[1][43][44]. For example, Belinkov et al.[43] demonstrated that character-level machine translation systems exhibit heightened sensitivity to random character manipulations, such as keyboard typos. Similarly, Gao et al.[44] introduced DeepWordBug, which utilizes character perturbations to create adversarial texts specifically targeting deep learning classifiers. However, it is worth noting that DeepWordBug is not computationally efficient and is not suitable for practical applications[38].

Hartvigsen et al. [45] introduced TOXIGEN, a novel dataset comprising 274k toxic and benign statements related to 13 minority groups. This dataset was created to address cases where toxic language detection systems tend to incorrectly flag text as toxic. To achieve this, they devised a demonstration-based prompting framework and employed an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text using a large pre-trained language model. By controlling machine generation in this manner, TOXIGEN extends its coverage to implicitly toxic text on a larger scale and across more demographic groups than previous resources relying on human-written text. They conducted a human evaluation on a challenging subset of TOXIGEN and observed that annotators face difficulty distinguishing machine-generated text from human-written language. Their findings also indicate that human annotators label 94.5% of the toxic examples as hate speech. Leveraging three publicly available datasets, they demonstrated that fine-tuning a toxicity classifier on their data significantly enhances its performance on human-written data. Moreover, they showed that TOXIGEN is effective in combating machine-generated toxicity, as fine-tuning substantially improves the classifier’s performance on their evaluation subset.

Another study by Grolman et al.[46] introduced an innovative algorithm designed to generate adversarial examples targeting hate speech detection models in both white box and black box scenarios. The algorithm was specifically tailored to address the unique characteristics of tabular datasets, including immutable features and diverse feature types. By combining these enhancements, this approach achieved significant improvements. In white box attacks, the enhanced method attained nearly 90% transferability, a substantial enhancement over the baseline attack’s 52% transferability. Similarly, the black box attacks approach achieved 75% transferability, outperforming the baseline attack’s result of 60%.

Grolman et al.[46] highlighted a departure from traditional adversarial attack strategies, which primarily rely on altering features based on the substitute model’s gradient. Instead, they demonstrated that employing alternative machine learning methods, such as feature importance and correlation-based techniques, yields superior results. Furthermore, they illustrated that successful adversarial attacks can be achieved by modifying user behavior patterns rather than altering the content of their tweets.

Continual challenges persist in hate speech detection due to adversarial attacks, which involve modifying correctly classified examples to induce misclassification, often with the aim of evading detection. These alterations are intended to confuse a model while remaining easily readable for humans. Another study assessed the robustness of two German hate speech detection models against six distinct character-level adversarial attacks, following the methodology outlined by Gröndahl et al. [30] in their "All You Need Is Love" paper. The research aimed to evaluate German hate speech detection systems against character-level adversarial attacks and identify their vulnerabilities. One might have expected substantial improvements in hate speech detection models since Gröndahl et al.’s [30] experiments; however, while improvements have occurred, the models still exhibit susceptibility, with even simplistic attacks strongly impacting them.

III Methodology

This section outlines the proposed approach and the different stages involved in the cyberbullying detection process. Figure 1 shows the proposed adversarial cyberbullying and hate speech detection methodology for online social networking sites (SNS).

[Figure 1: The proposed adversarial cyberbullying and hate speech detection methodology for online social networking sites.]

III-A Dataset

This dataset, named hate speech offensive and developed by Davidson et al. [47], has been meticulously assembled and comprises annotated tweets curated explicitly for the purpose of identifying hate speech and offensive language. Data gathering involved crowdsourcing, using Twitter’s public API to extract tweets matching predefined search terms associated with hate speech and offensive language. Subsequently, these tweets underwent manual labeling by multiple annotators who examined them and assigned appropriate classifications. The dataset predominantly contains English tweets and serves as a resource for training machine learning models or algorithms to detect hate speech.

The dataset provides several columns containing valuable information for understanding the classification of each tweet. The count column represents the total number of annotations for each tweet, while the hate speech count indicates how many annotations classified a particular tweet as hate speech. Conversely, the offensive language count denotes the number of annotations categorizing a tweet as containing offensive language. Finally, the neither count reveals how many annotations identified a tweet as neither hate speech nor offensive language.
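The snippet below is a minimal loading sketch, assuming the commonly distributed CSV release of the Davidson et al. dataset (with columns count, hate_speech, offensive_language, neither, class, and tweet); the file name is hypothetical, and column names may differ in other distributions.

```python
# Minimal loading sketch for the Davidson et al. hate speech / offensive
# language dataset. The file path "labeled_data.csv" is hypothetical.
import pandas as pd

df = pd.read_csv("labeled_data.csv")

# In the commonly distributed release, class 0 = hate speech,
# 1 = offensive language, 2 = neither; the count columns match the
# annotation description above.
print(df["class"].value_counts())
print(df[["count", "hate_speech", "offensive_language", "neither"]].head())
```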

III-B Data Preprocessing

The first phase of this research entails the creation of an adversarial attack model meticulously designed to identify and correct adversarial examples or attacks within publicly available cyberbullying datasets. This is a critical step in the preprocessing phase of the cyberbullying detection system, aimed at enhancing classification accuracy. This endeavor draws inspiration from existing research on adversarial attacks and aims to replicate the methodologies advanced by pioneering researchers in this domain. To start the first phase, the dataset undergoes several data processing techniques, including normalization, noise removal, stop-word removal, and tokenization. These preprocessing methods are discussed extensively in the subsections below.

Numerous research efforts have emerged to identify adversarial attacks in cyberbullying or hate speech datasets, employing natural language processing (NLP) and machine learning (ML) methods such as permutation recovery and obfuscation. This study employs a spell checker to rectify adversarial input within the dataset before conducting classification, given the textual nature of the data. Spell checkers are essential components of word processors, capable of identifying and highlighting incorrect words in text. They offer the functionality to correct these errors by suggesting suitable alternatives from a predefined list of words [48]. Typically, spell checkers examine each word in a text individually, comparing it against a stored lexicon to determine its correctness. If a word is found in the lexicon, it is deemed appropriate regardless of its context [48]. Additionally, several studies have explored the effectiveness of spelling and grammatical checks in defending against character- or word-level attacks in text data, which involve modifying characters and words [16]. To assess the resilience of the adversary detection and correction phase, the spell checker was incorporated into the data preprocessing stage; specifically, Python’s SpellChecker library was employed for this purpose.
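As a concrete illustration, the sketch below uses the pyspellchecker package, one common implementation behind "Python’s SpellChecker library" (an assumption about the exact package); the token handling is deliberately simplified relative to whatever rules the full pipeline applies.

```python
# A minimal sketch of the adversary-correction step with pyspellchecker
# (pip install pyspellchecker). Tokenization here is naive whitespace splitting.
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_adversarial_text(text):
    tokens = text.split()
    misspelled = spell.unknown(tokens)   # words not found in the lexicon
    corrected = []
    for tok in tokens:
        if tok.lower() in misspelled:
            fix = spell.correction(tok)  # may return None if no candidate exists
            corrected.append(fix if fix else tok)
        else:
            corrected.append(tok)
    return " ".join(corrected)

print(correct_adversarial_text("you are an idiiot"))  # likely -> "you are an idiot"
```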

III-B1 Noise Removal

Noise removal strips URLs, multiple spaces, punctuation, hashtags, usernames, and emojis from the text data before processing. This step is essential because it reduces noise in the dataset and focuses the analysis on the words that carry the most meaning.
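A hedged sketch of this step is shown below; the paper does not specify the exact patterns it used, so these regular expressions are illustrative.

```python
# Illustrative noise-removal step: strip URLs, mentions, hashtags, punctuation,
# emojis, and extra whitespace.
import re

def remove_noise(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # usernames and hashtags
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation and emojis
    return re.sub(r"\s+", " ", text).strip()             # collapse multiple spaces

print(remove_noise("@user OMG!! check https://t.co/xyz #bullying"))
# -> "OMG check"
```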

III-B2 Stop Words

Stop words are function words such as articles, conjunctions, and prepositions, and they make text higher-dimensional without adding much meaning. Removing stop words reduces the dimensionality of the term space, making text data easier to analyze and interpret. It can also improve the accuracy of NLP models, because stop words can mislead models and cause incorrect predictions.
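For illustration, the sketch below filters tokens against NLTK’s English stop-word list; the paper does not name the specific list it used, so this choice is an assumption.

```python
# Stop-word removal sketch using NLTK's English stop-word list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["you", "are", "such", "an", "idiot"]))
# -> ["idiot"]
```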

III-B3 Tokenization

Tokenization breaks down text into smaller units called tokens by splitting the text into paragraphs, sentences, and words. Tokenization is used because it is impossible to feed the whole text sample to a model at once. By breaking the text down into smaller units, the model can process the text more easily[49]. Additionally, the meaning of the text is preserved when it is tokenized, so the model can still understand the text. This method is essential because many NLP algorithms rely on individual words as the basis of analysis, so breaking down text into its individual components is necessary.
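The sketch below shows sentence- and word-level tokenization with NLTK’s tokenizers, one common choice; the paper does not specify which tokenizer was used.

```python
# Tokenization sketch with NLTK (newer NLTK versions may also require the
# "punkt_tab" resource).
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Nobody likes you. Just log off already."
sentences = sent_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]
print(tokens)
# [['Nobody', 'likes', 'you', '.'], ['Just', 'log', 'off', 'already', '.']]
```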

III-B4 Normalization

Normalization converts all the tokens in a text to a standard form. This can involve lower-casing all letters, removing stop words, and stemming or lemmatizing the words. These steps are undertaken to preprocess the data for both the adversary and cyberbullying detection models. The methodology is divided into two key phases: the adversarial example detection phase, designed to identify and rectify adversarial examples, and the second phase, which introduces a cyberbullying detection model utilizing the corrected data from the initial phase.
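The sketch below ties the first-phase preprocessing steps together, combining lower-casing with NLTK’s WordNet lemmatizer (one reasonable normalization choice; the paper says only "stemming or lemmatizing") and reusing the helper functions from the earlier sketches.

```python
# Combined preprocessing sketch for the first phase. WordNet lemmatization is
# an assumed choice of normalizer.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize(tokens):
    """Lower-case and lemmatize each token."""
    return [lemmatizer.lemmatize(t.lower()) for t in tokens]

def preprocess(text):
    # correct_adversarial_text, remove_noise, and remove_stop_words are the
    # sketches defined in the preceding subsections.
    cleaned = remove_noise(correct_adversarial_text(text))
    return normalize(remove_stop_words(cleaned.split()))

print(preprocess("You are such an idiiot, nobody likes you!!"))
```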

III-C Deep Learning Model Architecture

The second phase introduces an innovative cyberbullying detection model, leveraging deep learning techniques specifically tailored for text detection. The model will harness publicly available data sourced from social networking sites, as mentioned in the first phase. This phase represents a forward-looking stride in the ongoing pursuit of advancing cyberbullying detection, striving to enhance the effectiveness and accuracy of identifying and mitigating this pervasive online issue.

III-C1 Long Short Term Memory (LSTM)

The Long Short-Term Memory (LSTM) is a category of recurrent neural network (RNN) designed for classifying and predicting temporally dependent data, such as time series and signal datasets [50][51][52]. Initially applied in natural language processing (NLP) and speech recognition, LSTM has demonstrated superior learning precision when compared to other algorithms such as recurrent cascade correlation and neural sequence chunking [53][54].

Research findings indicate that LSTM effectively handles tasks that were challenging for traditional recurrent network algorithms in the past [55, 56]. Introduced by Hochreiter and Schmidhuber [55], the LSTM architecture comprises an input gate, a memory cell, and an output gate [55]. Within an LSTM network, memory cells store features and consist of a forget gate ($f_t$), an input gate ($i_t$), and an output gate ($o_t$) [55].

[Figure 2: The LSTM memory cell architecture.]

Figure 2 illustrates the design of the LSTM cell’s architecture, whose gates operate as follows:

  • Forget gate ($f$): specifies which information is removed from the cell state [52].

  • Input gate ($i$): determines the inclusion of information into the cell state [52].

  • Input update ($g$): modifies the memory content according to the present input to the LSTM [52].

  • Output gate ($o$): establishes and outlines the information to be presented as output from the cell state [52].

At each time-step $t$, the computation of each component in the standard LSTM is determined as follows:

$$i_{(t)} = \sigma(W_{xi}\,x_{(t)} + U_{hi}\,h_{(t-1)} + b_i) \qquad (1)$$
$$g_{(t)} = \tanh(W_{xg}\,x_{(t)} + U_{hg}\,h_{(t-1)} + b_g) \qquad (2)$$
$$f_{(t)} = \sigma(W_{xf}\,x_{(t)} + U_{hf}\,h_{(t-1)} + b_f) \qquad (3)$$
$$o_{(t)} = \sigma(W_{xo}\,x_{(t)} + U_{ho}\,h_{(t-1)} + b_o) \qquad (4)$$
$$s_{(t)} = f_{(t)} \odot s_{(t-1)} + i_{(t)} \odot g_{(t)} \qquad (5)$$
$$h_{(t)} = \tanh(s_{(t)}) \odot o_{(t)} \qquad (6)$$

where $x_{(t)}$ represents the input at time step $t$, and $h_{(t-1)}$ denotes the output of the memory cells from the previous time-step $t-1$ [57]. The symbols $\sigma$ and $\odot$ correspond to the logistic sigmoid function and element-wise (Hadamard) multiplication, respectively. The model incorporates two activation units, input-update and output activation, for which the $\tanh$ activation function is recommended [58].

The memory cell state at time $t$ is denoted as $s_{(t)}$, and the output of the LSTM unit at time $t$ is $h_{(t)}$. The biases for each gate are represented by $b_i$, $b_g$, $b_f$, and $b_o$. The feedforward weights and recurrent weights are denoted by $W$ and $U$, respectively.
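For concreteness, the following is a hedged architecture sketch in Keras: an embedding layer feeding an LSTM and dense layers ending in a 3-way softmax (hate speech, offensive language, neither). The vocabulary size, embedding dimension, and layer widths are assumptions; the paper reports only the parameter counts, not the full layer configuration.

```python
# Hedged Keras sketch of an LSTM text classifier; hyperparameters are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 100      # assumed embedding dimension
MAX_LEN = 100        # assumed padded sequence length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(128),                       # recurrent feature extractor
    layers.Dense(64, activation="relu"),    # additional dense layer(s)
    layers.Dropout(0.3),
    layers.Dense(3, activation="softmax"),  # hate speech / offensive / neither
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```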

IV Experiment, Results And Analysis

IV-A Training And Testing

In this study, the Long Short-Term Memory (LSTM) model uses a 60-20-20 split ratio for training, validation, and testing: the model is trained on 60% of the dataset, validated on 20%, and tested on the remaining 20%.
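One way to realize this split is with two calls to scikit-learn’s train_test_split, as sketched below; the text and label columns are assumed to come from the loading sketch in Section III-A, and the random seed is arbitrary.

```python
# 60/20/20 split sketch with scikit-learn; `df` is the DataFrame from the
# dataset-loading sketch. Stratification keeps class proportions comparable.
from sklearn.model_selection import train_test_split

texts, labels = df["tweet"].tolist(), df["class"].tolist()

X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.4, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# -> 60% training, 20% validation, 20% test
```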

IV-B Performance Metrics

The LSTM model’s performance in classifying attacks on cyberbullying data is assessed using standard metrics: Accuracy, Recall, Precision, F1-score, and AUC-ROC. These metrics are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (8)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)$$
$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (10)$$

where TP, TN, FN, and FP represent true positives, true negatives, false negatives, and false positives, respectively. Additionally, to assess the model’s classification performance, we calculate the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve.
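These metrics can be computed with scikit-learn as sketched below. For the multi-class setting, weighted averaging and a one-vs-rest AUC-ROC are one common choice; the paper does not state which averaging scheme it used.

```python
# Evaluation sketch with scikit-learn; averaging choices are assumptions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_pred: predicted class labels; y_prob: per-class probabilities (n, 3)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall":    recall_score(y_true, y_pred, average="weighted"),
        "f1":        f1_score(y_true, y_pred, average="weighted"),
        "auc_roc":   roc_auc_score(y_true, y_prob, multi_class="ovr"),
    }
```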

IV-C Results

The outcomes of the experiment are displayed in Tables I and II, where Table I presents the experimental results and Table II presents the comparative study. The experiment was conducted after training and testing the model on the hate speech dataset. The trial commenced and concluded with a fixed epoch count of 100, with a progressive augmentation of the model’s hidden layers and density. Figures 3 and 4 depict the training and validation accuracy and the training and validation loss of the model, respectively.

Table I: Experimental results of the LSTM model.

Metrics                  Training Results
Accuracy                 87.57%
Precision                88.73%
Recall                   87.57%
F1-Score                 88.17%
AUC-ROC                  91%
Trainable parameters     100,227
All parameters           2,144,127
[Figure 3: Training versus validation accuracy of the LSTM model.]
[Figure 4: Training versus validation loss of the LSTM model.]

Table I illustrates the outcomes of the trial experiment conducted using the LSTM model with a fixed epoch of 100, yielding accuracy, precision, recall, F1-score, and AUC-ROC scores of 87.57%, 88.73%, 87.57%, 88.17%, and 91%, respectively. Model parameters are characteristics learned from the training data during the training process; in the context of deep learning, these parameters typically refer to weights and biases, and the parameter count gives a sense of the model’s size. Specifically, the LSTM model has 100,227 trainable parameters and a total of 2,144,127 parameters. The total parameter count represents the sum of all the weights and biases in the LSTM model, while the trainable parameters are those that can be learned and adjusted throughout the training process. Meanwhile, the results depicted in Figure 3 and Figure 4 present the average training and validation loss, along with accuracy. Specifically, the training loss is observed to be 0.0654, with an approximate accuracy of 98%, while the validation accuracy stands at 88% with a corresponding validation loss of 0.0652.

IV-D Comparative Study

In this section, we conduct a comparative analysis of our research findings in relation to recent studies. Our investigation focuses on the performance of the LSTM model with multiple dense layers in the trial phase. The outcomes reveal significant success, with recorded metrics including an accuracy of 87.57%, precision of 88.73%, recall of 87.57%, F1-score of 88.17%, and AUC-ROC of 91%. To assess the effectiveness of the LSTM model, we compare its results with those of Davidson et al. [59], Del Vigna et al. [60], and Li et al. [38], as these studies used the same data described in Subsection III-A. Remarkably, our LSTM model surpasses the state-of-the-art Support Vector Machine (SVM) and LSTM classifiers used in those studies, as illustrated in Table II.

Davidson et al. [59] conducted research utilizing an SVM model in conjunction with lexicons for the classification of hate speech, cyberbullying, and offensive language. Their model achieved an accuracy of 86%. The researchers further assessed their model’s performance using precision, recall, and F1-score as effective metrics. Despite achieving an overall precision of 91%, recall of 90%, and F1-score of 90%, they observed that their model misclassified nearly 40% of hate speech instances. They concluded that this discrepancy stemmed from the model’s bias toward categorizing tweets as less hateful or offensive, leading to a significant under-representation of tweets classified as more offensive or hateful than their actual category.

Table II: Comparative study with prior work on the same dataset.

Model                    Method    Accuracy (%)
Davidson et al. [59]     SVM       86
Del Vigna et al. [60]    SVM       75.23
Our Model                LSTM      87.57

Del Vigna et al. [60] conducted experiments on textual data aimed at classifying hate speech, achieving an accuracy of 75.23% when utilizing both SVM and LSTM models. However, neither SVM nor LSTM could effectively distinguish between the three classes (strong hate, weak hate, and no hate), especially struggling with the classification of strong hate instances. The researchers suggested that this difficulty might stem from the limited number of strong hate documents, which constituted the class with the fewest documents, as well as the low level of agreement among annotators. Consequently, the authors proceeded to conduct an additional experiment focusing on a two-class classification, yielding significantly improved accuracy results compared to their initial trial.

When comparing our model to those employed in the aforementioned studies for hate speech classification, it became apparent that none of the prior works integrated an adversary correction model to rectify inputs or examples that might lead the model to misclassify outputs. In contrast, our study not only assessed the model’s performance based on accuracy, precision, recall, and f1-score but also included an evaluation of AUC-ROC. AUC-ROC is regarded as one of the most crucial evaluation metrics for assessing classification models’ performance[61]. It provides a comprehensive measure of overall accuracy during model testing.

The AUC-ROC curve ranges from 0 to 1, where 0 signifies a perfectly inaccurate test and 1 signifies a perfectly accurate test [62]. Typically, an AUC of 0.5 indicates no discrimination (i.e., no ability to distinguish between the classes), while values between 0.7 and 0.8 are considered acceptable, 0.8 to 0.9 are deemed excellent, and above 0.9 are regarded as outstanding [62].

In our study, the AUC-ROC score achieved using the LSTM model is 91%, indicating a 91% probability that the model accurately distinguished among the three classes: hate speech, offensive language, and neither. This underscores the effectiveness of our model in tackling hate speech classification.

[Figure 5: Training versus validation accuracy of the 1D-CNN model.]
[Figure 6: Training versus validation loss of the 1D-CNN model.]

Table III: Comparative results of the 1D-CNN, GRU, and proposed LSTM models.

Model        Method    Accuracy (%)
Model 1      1D-CNN    55.27
Model 2      GRU       84.30
Our Model    LSTM      87.57

[Figure 7: Training versus validation accuracy of the GRU model.]
[Figure 8: Training versus validation loss of the GRU model.]

In this comparative study, we also applied both the GRU and 1D-CNN methods to our dataset. Employing a fixed epoch of 100 and conducting a single trial experiment, we observed the following performance metrics: for the 1D-CNN model, accuracy was 55.27%, precision was 55.98%, recall was 55.27%, F1-score was 55.62%, and AUC-ROC was 66.77%. The GRU model performed considerably better, with an accuracy of 84.30%, precision of 85.81%, recall of 84.30%, F1-score of 85.05%, and an AUC-ROC score of 88.66%. Overall, the proposed LSTM model outperformed both the GRU and the 1D-CNN. The results are illustrated in Table III. Visual representations of the training versus validation accuracy and loss are provided in Figures 5 and 6 for the 1D-CNN, and Figures 7 and 8 for the GRU model.
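For reference, hedged sketches of the two comparison models are shown below: the same embedding front-end with the recurrent layer swapped for a GRU or replaced by a 1D convolution. Layer widths reuse the assumed values from the LSTM sketch (VOCAB_SIZE, EMBED_DIM) and are not the paper’s exact settings.

```python
# Hedged sketches of the GRU and 1D-CNN comparison models; widths are assumed.
from tensorflow.keras import layers, models

def build_gru():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.GRU(128),
        layers.Dense(64, activation="relu"),
        layers.Dense(3, activation="softmax"),
    ])

def build_1d_cnn():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(3, activation="softmax"),
    ])
```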

V Conclusion

This paper focused on detecting adversarial attacks within social networking site text data, with a particular emphasis on identifying hate speech. It involved an experiment employing a deep learning-based approach combined with a correction algorithm aimed at rectifying adversarial inputs or attacks, ultimately leading to the classification of the corrected text. The trial experiment utilized an LSTM model with a fixed epoch of 100, yielding accuracy, precision, recall, F1-score, and AUC-ROC scores of 87.57%, 88.73%, 87.57%, 88.17%, and 91%, respectively.

Upon comparing our model with those employed in prior studies for hate speech classification, it was evident that none of the previous works integrated an adversary correction model to rectify inputs that might lead to misclassification. In contrast, our study not only assessed the model’s performance based on standard metrics like accuracy, precision, recall, and F1-score but also included an evaluation of AUC-ROC, a crucial metric for assessing classification models’ performance. The achieved AUC-ROC score of 91% using the LSTM model indicated a 91% probability of accurately distinguishing among the three classes (hate speech, offensive language, and neither), showcasing the effectiveness of our model in hate speech classification.

References

  • [1]H.Hosseini, S.Kannan, B.Zhang, and R.Poovendran, “Deceiving google’sperspective api built for detecting toxic comments,” arXiv preprintarXiv:1702.08138, 2017.
  • [2]M.Ptaszyński, G.Leliwa, M.Piech, and A.Smywiński-Pohl,“Cyberbullying detection–technical report 2/2018, department of computerscience agh, university of science and technology,” arXiv preprintarXiv:1808.00926, 2018.
  • [3]S.Farley, I.Coyne, and P.D’Cruz, “Cyberbullying at work: Understandingthe influence of technology,” Concepts, Approaches and Methods,pp.233–263, 2021.
  • [4]Meta.
  • [5]B.Etim, “Approve or reject: Can you moderate five new york times comments?,”Sep 2016.
  • [6]A.Greenberg, “Now anyone can deploy google’s troll-fighting ai,” Feb 2017.
  • [7]E.Wulczyn, N.Thain, and L.Dixon, “Ex machina: Personal attacks seen atscale,” in Proceedings of the 26th international conference on worldwide web, pp.1391–1399, 2017.
  • [8]M.Barreno, B.Nelson, R.Sears, A.D. Joseph, and J.D. Tygar, “Can machinelearning be secure?,” in Proceedings of the 2006 ACM Symposium onInformation, computer and communications security, pp.16–25, 2006.
  • [9]M.Barreno, B.Nelson, A.D. Joseph, and J.D. Tygar, “The security of machinelearning,” Machine Learning, vol.81, pp.121–148, 2010.
  • [10]L.Huang, A.Joseph, B.Nelson, B.Rubinstein, and J.Tygar, “Proceedings ofthe 4th acm workshop on security and artificial intelligence,” 2011.
  • [11]C.Szegedy, W.Zaremba, I.Sutskever, J.Bruna, D.Erhan, I.Goodfellow, andR.Fergus, “Intriguing properties of neural networks,” arXiv preprintarXiv:1312.6199, 2013.
  • [12]N.Papernot, P.McDaniel, I.Goodfellow, S.Jha, Z.B. Celik, and A.Swami,“Practical black-box attacks against machine learning,” in Proceedingsof the 2017 ACM on Asia conference on computer and communications security,pp.506–519, 2017.
  • [13]I.J. Goodfellow, J.Shlens, and C.Szegedy, “Explaining and harnessingadversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [14]I.Alsmadi, K.Ahmad, M.Nazzal, F.Alam, A.Al-Fuqaha, A.Khreishah, andA.Algosaibi, “Adversarial attacks and defenses for social network textprocessing applications: Techniques, challenges and future researchdirections,” arXiv preprint arXiv:2110.13980, 2021.
  • [15]J.Bitton, M.Pavlova, and I.Evtimov, “Adversarial text normalization,” arXiv preprint arXiv:2206.04137, 2022.
  • [16]I.Alsmadi, K.Ahmad, M.Nazzal, F.Alam, A.Al-Fuqaha, A.Khreishah, andA.Algosaibi, “Adversarial nlp for social network applications: Attacks,defenses, and research directions,” IEEE Transactions on ComputationalSocial Systems, 2022.
  • [17]Y.LeCun, Y.Bengio, and G.Hinton, “Deep learning,” nature, vol.521,no.7553, pp.436–444, 2015.
  • [18]J.Ma, R.P. Sheridan, A.Liaw, G.E. Dahl, and V.Svetnik, “Deep neural netsas a method for quantitative structure–activity relationships,” Journal of chemical information and modeling, vol.55, no.2, pp.263–274,2015.
  • [19]M.Helmstaedter, K.L. Briggman, S.C. Turaga, V.Jain, H.S. Seung, andW.Denk, “Connectomic reconstruction of the inner plexiform layer in themouse retina,” Nature, vol.500, no.7461, pp.168–174, 2013.
  • [20]T.Ciodaro, D.Deva, J.DeSeixas, and D.Damazio, “Online particle detectionwith neural networks based on topological calorimetry information,” in Journal of physics: conference series, vol.368, p.012030, IOP Publishing,2012.
  • [21]H.Y. Xiong, B.Alipanahi, L.J. Lee, H.Bretschneider, D.Merico, R.K. Yuen,Y.Hua, S.Gueroussov, H.S. Najafabadi, T.R. Hughes, etal., “Thehuman splicing code reveals new insights into the genetic determinants ofdisease,” Science, vol.347, no.6218, p.1254806, 2015.
  • [22]N.Carlini, P.Mishra, T.Vaidya, Y.Zhang, M.Sherr, C.Shields, D.Wagner,and W.Zhou, “Hidden voice commands,” in 25th USENIX securitysymposium (USENIX security 16), pp.513–530, 2016.
  • [23]G.Zhang, C.Yan, X.Ji, T.Zhang, T.Zhang, and W.Xu, “Dolphinattack:Inaudible voice commands,” in Proceedings of the 2017 ACM SIGSACconference on computer and communications security, pp.103–117, 2017.
  • [24]A.Kurakin, I.Goodfellow, and S.Bengio, “Adversarial machine learning atscale,” arXiv preprint arXiv:1611.01236, 2016.
  • [25]Z.Laub, “Hate speech on social media: Global comparisons,” Council onforeign relations, vol.7, 2019.
  • [26]A.Matamoros-Fernández and J.Farkas, “Racism, hate speech, and socialmedia: A systematic review and critique,” Television & New Media,vol.22, no.2, pp.205–224, 2021.
  • [27]A.Schmidt and M.Wiegand, “A survey on hate speech detection using naturallanguage processing,” in Proceedings of the fifth internationalworkshop on natural language processing for social media, pp.1–10, 2017.
  • [28]M.Mozafari, R.Farahbakhsh, and N.Crespi, “A bert-based transfer learningapproach for hate speech detection in online social media,” in ComplexNetworks and Their Applications VIII: Volume 1 Proceedings of the EighthInternational Conference on Complex Networks and Their Applications COMPLEXNETWORKS 2019 8, pp.928–940, Springer, 2020.
  • [29]R.Oak, “Poster: Adversarial examples for hate speech classifiers,” in Proceedings of the 2019 ACM SIGSAC Conference on Computer and CommunicationsSecurity, pp.2621–2623, 2019.
  • [30]T.Gröndahl, L.Pajola, M.Juuti, M.Conti, and N.Asokan, “All you needis” love” evading hate speech detection,” in Proceedings of the 11thACM workshop on artificial intelligence and security, pp.2–12, 2018.
  • [31]M.Moh, T.-S. Moh, and B.Khieu, “No” love” lost: Defending hate speechdetection models against adversaries,” in 2020 14th InternationalConference on Ubiquitous Information Management and Communication (IMCOM),pp.1–6, IEEE, 2020.
  • [32]T.Tran, Y.Hu, C.Hu, K.Yen, F.Tan, K.Lee, and S.Park, “Habertor: Anefficient and effective deep hatespeech detector,” arXiv preprintarXiv:2010.08865, 2020.
  • [33]M.Xia, A.Field, and Y.Tsvetkov, “Demoting racial bias in hate speechdetection,” arXiv preprint arXiv:2005.12246, 2020.
  • [34]B.T. Khieu, “Tsar: A system for defending hate speech detection modelsagainst adversaries,” 2019.
  • [35]M.Beatty, “Graph-based methods to detect hate speech diffusion on twitter,”in 2020 IEEE/ACM International Conference on Advances in Social NetworksAnalysis and Mining (ASONAM), pp.502–506, IEEE, 2020.
  • [36]S.Li, N.A. Zaidi, Q.Liu, and G.Li, “Neighbours and kinsmen: hateful usersdetection with graph neural network,” in Pacific-Asia Conference onKnowledge Discovery and Data Mining, pp.434–446, Springer, 2021.
  • [37]Perspective, 2023.
  • [38]J.Li, S.Ji, T.Du, B.Li, and T.Wang, “Textbugger: Generating adversarialtext against real-world applications,” arXiv preprintarXiv:1812.05271, 2018.
  • [39]N.Carlini and D.Wagner, “Towards evaluating the robustness of neuralnetworks,” in 2017 ieee symposium on security and privacy (sp),pp.39–57, Ieee, 2017.
  • [40]N.Papernot, P.McDaniel, A.Swami, and R.Harang, “Crafting adversarial inputsequences for recurrent neural networks,” in MILCOM 2016-2016 IEEEMilitary Communications Conference, pp.49–54, IEEE, 2016.
  • [41]J.Ebrahimi, A.Rao, D.Lowd, and D.Dou, “Hotflip: White-box adversarialexamples for text classification,” arXiv preprint arXiv:1712.06751,2017.
  • [42]S.Samanta and S.Mehta, “Towards crafting text adversarial samples,” arXiv preprint arXiv:1707.02812, 2017.
  • [43]Y.Belinkov and Y.Bisk, “Synthetic and natural noise both break neuralmachine translation,” arXiv preprint arXiv:1711.02173, 2017.
  • [44]J.Gao, J.Lanchantin, M.L. Soffa, and Y.Qi, “Black-box generation ofadversarial text sequences to evade deep learning classifiers,” in 2018IEEE Security and Privacy Workshops (SPW), pp.50–56, IEEE, 2018.
  • [45]T.Hartvigsen, S.Gabriel, H.Palangi, M.Sap, D.Ray, and E.Kamar, “Toxigen:A large-scale machine-generated dataset for adversarial and implicit hatespeech detection,” arXiv preprint arXiv:2203.09509, 2022.
  • [46]E.Grolman, H.Binyamini, A.Shabtai, Y.Elovici, I.Morikawa, and T.Shimizu,“Hateversarial: Adversarial attack against hate speech detection algorithmson twitter,” in Proceedings of the 30th ACM Conference on UserModeling, Adaptation and Personalization, pp.143–152, 2022.
  • [47]T.Davidson, D.Warmsley, M.Macy, and I.Weber, “Automated hate speechdetection and the problem of offensive language,” in Proceedings of the11th International AAAI Conference on Web and Social Media, ICWSM ’17,pp.512–515, 2017.
  • [48]S.Singh and S.Singh, “Review of real-word error detection and correctionmethods in text documents,” in 2018 second international conference onelectronics, communication and aerospace technology (ICECA), pp.1076–1081,IEEE, 2018.
  • [49]R.Kumar and A.Bhat, “A study of machine learning-based models for detection,control, and mitigation of cyberbullying in online social media,” International Journal of Information Security, vol.21, no.6,pp.1409–1431, 2022.
  • [50]M.Roopak, G.Y. Tian, and J.Chambers, “Deep learning models for cybersecurity in iot networks,” in 2019 IEEE 9th annual computing andcommunication workshop and conference (CCWC), pp.0452–0457, IEEE, 2019.
  • [51]N.Elsayed, Gated Convolutional Recurrent Neural Networks for PredictiveCoding.PhD thesis, University of Louisiana at Lafayette, 2019.
  • [52]S.W. Azumah, N.Elsayed, V.Adewopo, Z.S. Zaghloul, and C.Li, “A deep lstmbased approach for intrusion detection iot devices network in smart home,”in 2021 IEEE 7th World Forum on Internet of Things (WF-IoT),pp.836–841, IEEE, 2021.
  • [53]H.Sak, A.W. Senior, and F.Beaufays, “Long short-term memory recurrentneural network architectures for large scale acoustic modeling,” 2014.
  • [54]I.Sutskever, O.Vinyals, and Q.V. Le, “Sequence to sequence learning withneural networks,” arXiv preprint arXiv:1409.3215, 2014.
  • [55]S.Hochreiter and J.Schmidhuber, “Long short-term memory,” Neuralcomputation, vol.9, no.8, pp.1735–1780, 1997.
  • [56]N.Elsayed, Z.ElSayed, and A.S. Maida, “Litelstm architecture for deeprecurrent neural networks,” in 2022 IEEE International Symposium onCircuits and Systems (ISCAS), pp.1304–1308, IEEE, 2022.
  • [57]T.Fischer and C.Krauss, “Deep learning with long short-term memory networksfor financial market predictions,” European Journal of OperationalResearch, vol.270, no.2, pp.654–669, 2018.
  • [58]N.Elsayed, A.S. Maida, and M.Bayoumi, “Empirical activation functioneffects on unsupervised convolutional LSTM learning,” in 2018 IEEE30th International Conference on Tools with Artificial Intelligence (ICTAI),pp.336–343, IEEE, 2018.
  • [59]T.Davidson, D.Warmsley, M.Macy, and I.Weber, “Automated hate speechdetection and the problem of offensive language,” in Proceedings of theinternational AAAI conference on web and social media, vol.11,pp.512–515, 2017.
  • [60]F.DelVigna12, A.Cimino23, F.Dell’Orletta, M.Petrocchi, and M.Tesconi,“Hate me, hate me not: Hate speech detection on facebook,” in Proceedings of the first Italian conference on cybersecurity (ITASEC17),pp.86–95, 2017.
  • [61]S.Narkhede, “Understanding auc-roc curve,” Towards data science,vol.26, no.1, pp.220–227, 2018.
  • [62]J.N. Mandrekar, “Receiver operating characteristic curve in diagnostic testassessment,” Journal of Thoracic Oncology, vol.5, no.9,pp.1315–1316, 2010.