ieeexplore.ieee.org/abstract/document/10386516
Search Engine Appearance
Evaluating Representativeness in PDF Malware Datasets: A Comparative Study and a New Dataset
The widespread use of the Portable Document Format (PDF) has made it an increasingly frequent target for malware, highlighting the need for effective detection solutions. In recent years, machine learning-based methods for PDF malware detection have grown in popularity. However, the effectiveness of ML models depends closely on the quality of their training datasets. In this research, we investigated two widely used PDF malware datasets: Contagio and CIC. We found biases and representativeness issues that could affect the reliability and applicability of models built on them. Our statistical analysis revealed marked differences between these datasets and PDF malware samples from VirusTotal, as well as benign PDFs from Govdocs, pointing to the need for more representative datasets in PDF malware research. To address this gap, we introduce a novel dataset: PdfRep. Our findings demonstrate that PdfRep outperforms both CIC and Contagio across various evaluation metrics. The main contribution of this paper is the introduction of PdfRep, a new PDF malware dataset that overcomes the limitations of representativeness in existing datasets. This enhancement substantially increases the accuracy of PDF malware detection models and holds promise for advancing the field of PDF malware detection research.
General Meta Tags (12)
- title: Evaluating Representativeness in PDF Malware Datasets: A Comparative Study and a New Dataset | IEEE Conference Publication | IEEE Xplore
- google-site-verification: qibYCgIKpiVF_VVjPYutgStwKn-0-KBB6Gw4Fc57FZg
- Description: With the widespread use of the Portable Document Format (PDF), it’s increasingly becoming a target for malware, highlighting the need for effective detection so
- Content-Type: text/html; charset=utf-8
- viewport: width=device-width, initial-scale=1.0
Open Graph Meta Tags (3)
- og:image: https://ieeexplore.ieee.org/assets/img/ieee_logo_smedia_200X200.png
- og:title: Evaluating Representativeness in PDF Malware Datasets: A Comparative Study and a New Dataset
Twitter Meta Tags (1)
- twitter:card: summary
Link Tags (9)
- canonical: https://ieeexplore.ieee.org/abstract/document/10386516
- icon: /assets/img/favicon.ico
- stylesheet: https://ieeexplore.ieee.org/assets/css/osano-cookie-consent-xplore.css
- stylesheet: /assets/css/simplePassMeter.min.css?cv=20250812_00000
- stylesheet: /assets/dist/ng-new/styles.css?cv=20250812_00000
Links (17)
- http://www.ieee.org/about/help/security_privacy.html
- http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html
- https://ieeexplore.ieee.org/Xplorehelp
- https://ieeexplore.ieee.org/Xplorehelp/overview-of-ieee-xplore/about-ieee-xplore
- https://ieeexplore.ieee.org/Xplorehelp/overview-of-ieee-xplore/accessibility-statement