Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/139386
Type: Thesis
Title: Data Quality for Data-Driven Software Vulnerability Analysis
Author: Croft, Roland Lloyd
Issue Date: 2023
School/Discipline: School of Computer and Mathematical Sciences
Abstract: Software vulnerabilities enable malicious actors to exploit security weaknesses of a software system, potentially causing enormous damages for organisations. Detection and prevention of software vulnerabilities is vital to achieve software security. However, software vulnerability discovery is a difficult task due to the high expertise and labor requirements. Consequently, many developers rely on tool support for achieving software security. However, traditional tools that rely on rule-based techniques fail to make inroads in software security, due to their high false-positive rates and lack of scalability. Data-driven methods that utilise artificial intelligence have promising capabilities for automated software vulnerability discovery. A properly trained model can effectively learn the underlying patterns of software vulnerabilities and provide classifications efficiently on incoming source code modules. However, such models require sufficiently large and high-quality datasets to learn from. Data preparation for software vulnerability datasets is not a trivial task, due to the scarcity, lacking documentation, and sensitivity of associated software vulnerability data. Consequently, we observe that data preparation challenges are currently ill-considered and overlooked for data-driven software vulnerability analysis, and current datasets are usually of poor quality. These data challenges prevent software vulnerability prediction models from satisfying industrial applications. We have made the following contributions towards improving data quality for data-driven software vulnerability analysis. Firstly, we have benchmarked data-driven software vulnerability analysis approaches in comparison to traditional rule-based tools. This investigation yielded insights into the relative strengths and weaknesses of each approach, but we particularly observed that the promising capabilities of learning-based models were inhibited by their data requirements.
Secondly, we provided a systematized view of software vulnerability data preparation practices for software vulnerability prediction. Through a systematic literature review, we uncovered a taxonomy of 16 data preparation challenges which act as obstacles towards achieving practical software vulnerability prediction. Thirdly, we conducted formal assessment of the data quality for state-of-the-art vulnerability datasets using five inherent data quality attributes. This research provided measurement of data insufficiencies and demonstrated their impact to inspire the need for data improvement. Furthermore, we also investigated inconsistency stemming from original vulnerability data sources. Finally, we proposed a technique for training robust vulnerability prediction models that can leverage noisy training datasets to still provide effective predictions. Our proposed method can circumvent some potentially unsolvable issues of software vulnerability datasets. We expect our contributions to help unlock the potential of software vulnerability data by improving dataset quality and use. These efforts in turn enable effective and practical applications for data-driven software vulnerability analysis.
Advisor: Babar, M. Ali
Jayatilaka, Asangi
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2023
Keywords: Cybersecurity; Machine Learning; Software Vulnerability; Data Quality
Provenance: This thesis is currently under Embargo and not available.
Appears in Collections: Research Theses

Files in This Item:
File: Croft2023_PhD.pdf (Restricted Access)
Description: Library staff access only.
Size: 10.77 MB
Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.