I am a Ph.D. student at the Faculté des sciences, where I am working on my thesis, "An Automatic Software Vulnerability Classification for Critical Infrastructure Systems", as a member of the Computer Networks and Systems (CNS) team, under the supervision of Professor Abdelbaki El Belrhiti El Alaoui.
My Ph.D. research focuses on using deep learning and Natural Language Processing (NLP) techniques for vulnerability classification and detection. Recently, I have been exploring the application of transformers in this domain. My broader interests include cybersecurity, deep learning for NLP, and the use of advanced NLP models such as transformers for various tasks.
You can download my CV here.
Ph.D. thesis: "An Automatic Software Vulnerability Classification for Critical Infrastructure Systems"
Université Moulay Ismail, Faculté des sciences, Meknès
Master in Computer Networks and Embedded Systems
Université Moulay Ismail, Faculté des sciences, Meknès
Bachelor in Mathematical Sciences, Computer Science and Applications (4-year curriculum)
Faculté Polydisciplinaire, Ouarzazate
© Aissa Ben Yahya, powered by Bootstrap, last updated:
My main research interests and activities.
My published and submitted work.
The growing prevalence of vulnerabilities in embedded devices poses a significant risk to critical infrastructure. While deep learning has advanced vulnerability classification, its effectiveness is often hindered by limitations in word representation. Traditional word embeddings struggle with out-of-vocabulary (OOV) words common in domain-specific reports, while pre-trained language models (PLMs), despite their contextual power, may lack specialized domain knowledge. To address these challenges, we propose a novel Two-Stream hybrid embedding architecture that combines Vuln2Vec, a custom domain-specific word embedding, with a large pre-trained language model (PLM) using a learnable weighted feature fusion. Our approach leverages the rich domain-specific vocabulary of Vuln2Vec to understand specialized terminology, while the PLM captures broader contextual relationships and effectively handles OOV words. We validate our method through rigorous experiments, including ablation studies and comparative analyses on vulnerability databases such as the National Vulnerability Database (NVD), the Chinese Vulnerability Database (CNNVD), and a challenging manually collected dataset. Our experiments demonstrate that the proposed hybrid embedding method achieves a state-of-the-art F1-score of 94.25% and an accuracy of 94.88% on the challenging test dataset, validating the superiority of fusing specialized and general-purpose knowledge for this critical task.
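A minimal sketch of what a learnable weighted fusion of two embedding streams can look like. It is an illustration under assumptions, not the architecture from the paper: both streams are assumed to be already projected to the same dimension, the fusion weights are assumed to be two scalars normalised with a softmax, and the names `softmax`, `fuse`, and `fusion_logits` are hypothetical. In a real model the logits would be trained together with the classifier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(domain_vec, plm_vec, fusion_logits):
    """Weighted sum of a domain-specific vector and a PLM vector.

    fusion_logits holds two scalars (hypothetical trainable parameters);
    the softmax turns them into weights that sum to 1.
    """
    if len(domain_vec) != len(plm_vec):
        raise ValueError("both streams must share the same dimension")
    w_domain, w_plm = softmax(fusion_logits)
    return [w_domain * d + w_plm * p for d, p in zip(domain_vec, plm_vec)]


# Equal logits give equal weight to both streams.
balanced = fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
# A large first logit makes the domain-specific stream dominate.
domain_heavy = fuse([1.0, 0.0], [0.0, 1.0], [10.0, -10.0])
```

Keeping the weights on a softmax means the fused vector stays on the same scale as its inputs, which is one common reason to prefer it over unconstrained weights.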
The increasing number of vulnerabilities in embedded devices poses a significant threat to the security of the critical infrastructure in which these devices are used. While deep learning approaches have advanced software vulnerability classification, they exhibit critical limitations regarding word weighting. Conventional methods like term frequency-inverse document frequency (TF-IDF) prioritize global term distributions but overlook intra-class distinctions. While improved variants of this technique have been proposed, they often fail to consider that a word’s importance can vary across categories, and they struggle to prioritize rare but distinctive words adequately. Additionally, high inter-class semantic overlap and terminological ambiguity in vulnerability descriptions hinder model performance, making it hard to separate intra-class keywords from background noise. To address these gaps, we propose a novel vulnerability classification and word vector weighting approach based on Bayes' theorem. Our method dynamically adjusts term relevance by calculating posterior probabilities of word-category associations, emphasizing rare tokens with high intra-class specificity. We validate the approach on four test datasets derived from databases such as the National Vulnerability Database (NVD) and the Chinese Vulnerability Database (CNNVD). Rigorous ablation and comparative studies demonstrate that Bayes-based word weighting outperforms other methods, achieving 97.63% accuracy and a 97.60% F1-score on the most challenging test data. All our models and the code to reproduce our results are open-sourced.
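The idea of weighting a word by its posterior probability of belonging to a category can be sketched as follows. This is a simplified illustration, not the weighting scheme from the paper: it assumes document-level presence counts, Laplace smoothing with a hypothetical `alpha` parameter, and a toy four-document corpus invented for the example.

```python
from collections import Counter, defaultdict


def posterior_weights(docs, alpha=1.0):
    """Return {word: {category: P(category | word)}} via Bayes' theorem.

    docs is a list of (tokens, category) pairs. P(word | category) is the
    smoothed share of the category's documents that contain the word.
    """
    docs_per_cat = Counter(cat for _, cat in docs)
    doc_freq = defaultdict(Counter)  # doc_freq[word][cat] = docs in cat containing word
    for tokens, cat in docs:
        for word in set(tokens):
            doc_freq[word][cat] += 1

    total_docs = len(docs)
    weights = {}
    for word, per_cat in doc_freq.items():
        joint = {}
        for cat, n_cat in docs_per_cat.items():
            likelihood = (per_cat[cat] + alpha) / (n_cat + 2 * alpha)
            prior = n_cat / total_docs
            joint[cat] = likelihood * prior
        evidence = sum(joint.values())
        weights[word] = {cat: p / evidence for cat, p in joint.items()}
    return weights


# Toy corpus (invented for illustration only).
toy_docs = [
    (["buffer", "overflow", "in", "parser"], "memory"),
    (["heap", "overflow", "in", "decoder"], "memory"),
    (["sql", "injection", "in", "login"], "injection"),
    (["command", "injection", "in", "shell"], "injection"),
]
weights = posterior_weights(toy_docs)
```

In this toy corpus, "overflow" appears only in the "memory" documents and receives a posterior of 0.75 for that category, while "in" appears everywhere and stays at 0.5, so it carries no category signal.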
Critical infrastructure increasingly relies on embedded systems, making them particularly vulnerable to cyber attacks due to their complexity and interconnectivity. Unlike general-purpose systems, embedded systems need specialized security solutions tailored to their unique vulnerabilities. Accurate classification of embedded system vulnerabilities is essential for targeted analysis and mitigation. Traditional methods using pre-trained embeddings like Word2Vec, GloVe, and FastText often struggle with Out-of-Vocabulary (OOV) words, reducing their effectiveness. We address this with a novel ensemble embedding technique that combines multiple pre-trained embeddings, enhancing the classification of embedded system vulnerabilities. Our BiLSTM-based model, tested on datasets such as NVD and CNNVD, achieved 82.61% accuracy on unseen data, outperforming traditional embeddings.
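One simple way to combine several pre-trained embeddings so that a word missing from one table can still be represented by another is concatenation with zero-filling. The sketch below assumes that combination rule, which the abstract does not specify; the function name `ensemble_vector` and the tiny lookup tables are hypothetical stand-ins for real Word2Vec, GloVe, or FastText vectors.

```python
def ensemble_vector(word, tables, dims):
    """Concatenate the word's vector from each table.

    A table that does not contain the word contributes a zero vector of
    its own dimension, so the output length is always sum(dims).
    """
    parts = []
    for table, dim in zip(tables, dims):
        parts.extend(table.get(word, [0.0] * dim))
    return parts


# Hypothetical 2-dimensional tables; real embeddings have hundreds of dimensions.
table_a = {"overflow": [0.1, 0.2]}
table_b = {"overflow": [0.5, 0.6], "firmware": [0.3, 0.4]}

seen_by_both = ensemble_vector("overflow", [table_a, table_b], [2, 2])
seen_by_one = ensemble_vector("firmware", [table_a, table_b], [2, 2])
seen_by_none = ensemble_vector("bootloader", [table_a, table_b], [2, 2])
```

A word is fully out-of-vocabulary only if every table misses it, which is why the ensemble shrinks the OOV set compared with any single embedding.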
Deep learning models have achieved remarkable success in various tasks, especially in classification. This success is particularly evident in the precise classification of plant diseases, which is crucial for effective agricultural management. However, accurate classification faces challenges, particularly in data collection, where certain classes (the minority classes) are underrepresented. This issue can significantly impact model performance. To tackle this challenge, this paper introduces a novel methodology that differs from existing approaches. We focus on addressing the issue of minority classes in image-based classification tasks, particularly for olive diseases. We employ data generation methods, including basic transformations, to produce augmented data, and use a Deep Convolutional Generative Adversarial Network (DCGAN) to produce generated data. Next, we apply the Fréchet Inception Distance (FID) to the generated dataset to select the highest-quality images. We then distribute varying percentages (25%, 50%, 75%, 100%) of this new data into the minority classes of the original dataset. Our data distribution strategies involve incorporating specific amounts of (1) augmented data, (2) generated data, and (3) a combination of both augmented and generated data to achieve target percentages (T.P) in the resulting dataset. Our experiments focus on classifying olive diseases into seven distinct categories using a pre-trained Convolutional Neural Network (CNN) architecture. We observe significant improvements in the model’s performance, particularly in the accurate classification of minority classes. This approach enhances diagnostic accuracy and optimizes data distribution, which is crucial for effectively addressing the challenges posed by minority classes.
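For reference, the Fréchet Inception Distance mentioned above is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols are notation introduced here rather than taken from the paper: \mu_r, \Sigma_r are the mean and covariance of the real-image features, and \mu_g, \Sigma_g those of the generated-image features. Lower values indicate that the two feature distributions are closer.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\]
```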
The security of embedded systems is deteriorating in comparison to conventional systems due to resource limitations in memory, processing, and power. Daily publications highlight various vulnerabilities associated with these systems. While significant efforts have been made to systematize and analyze these vulnerabilities, most studies focus on specific areas within embedded systems and do not apply artificial intelligence (AI). This research aims to address these gaps by using a support vector machine (SVM) to classify vulnerabilities sourced from the National Vulnerability Database (NVD), specifically targeting embedded system vulnerabilities. Results indicate that seven of the top 10 Common Weakness Enumeration (CWE) vulnerabilities in embedded systems are also present in the 2022 CWE Top 25 Most Dangerous Software Weaknesses. The findings of this study will help security researchers and companies comprehensively analyze embedded system vulnerabilities and develop tailored solutions.
The olive tree is affected by a variety of diseases. To identify these diseases, many farmers typically use traditional methods that require a lot of effort and specialization. These methods include visually observing the tree or conducting tests in a laboratory. Fortunately, recent progress in machine learning (ML) and deep learning (DL) has demonstrated promising potential to automatically classify diseases with both high accuracy and speed. However, as indicated by the literature, only a few studies use ML and DL techniques for identifying and categorizing diseases that affect olive trees. Therefore, in this study, we collected a dataset containing 4138 images of olive leaves from various sources. The dataset comprises four categories: three representing diseases and one denoting a healthy category. We also introduced an innovative approach to classify olive leaf diseases by combining deep learning architectures, specifically convolutional neural networks (CNNs), with machine learning classifiers. In this approach, we developed a total of 30 distinct deep hybrid models (DHMs), using six pre-trained convolutional neural network architectures (VGG19, ResNet50, MobileNetV2, InceptionV3, DenseNet201, and EfficientNetB0) as feature extractors, along with five machine learning classifiers (MLP, LR, RF, SVM, and DT). To assess the performance of the DHMs, we used standard evaluation metrics (Accuracy, Precision, Recall, F1-score) and validated the reliability of the DHMs using cross-validation. Additionally, we employed the Non-Parametric Scott-Knott ESD (NPSK) test to rank the best DHMs. The study’s findings revealed that the most efficient deep hybrid model combined the EfficientNetB0 model with a logistic regression classifier, achieving an accuracy of 96.14%.
Our approach has the potential to significantly assist olive farmers in rapidly and accurately identifying diseases, thereby potentially reducing economic losses.
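The count of 30 deep hybrid models follows from pairing each of the six feature extractors with each of the five classifiers. The short sketch below only enumerates those pairings using the names from the abstract; the function name `hybrid_model_grid` and the "extractor+classifier" label format are hypothetical, and no model is trained here.

```python
from itertools import product

FEATURE_EXTRACTORS = [
    "VGG19", "ResNet50", "MobileNetV2",
    "InceptionV3", "DenseNet201", "EfficientNetB0",
]
CLASSIFIERS = ["MLP", "LR", "RF", "SVM", "DT"]


def hybrid_model_grid(extractors, classifiers):
    """Return one label per (feature extractor, classifier) pairing."""
    return [f"{cnn}+{clf}" for cnn, clf in product(extractors, classifiers)]


grid = hybrid_model_grid(FEATURE_EXTRACTORS, CLASSIFIERS)
print(len(grid))  # 6 extractors x 5 classifiers = 30
```

In such a setup, each label would correspond to one experiment: features are extracted once per CNN and then reused across the five classifiers.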
Feel free to send me an email to discuss research or to arrange a meetup.