A Longitudinal Study of Vulnerable Client-side Resources and Web Developers' Updating Behaviors

Kyungchan Lim*, Yonghwi Kwon, and Doowon Kim*

University of Tennessee, Knoxville* and University of Maryland


Abstract

Modern websites rely on various client-side web resources, such as JavaScript libraries, to provide end-users with rich and interactive web experiences. Unfortunately, anecdotal evidence shows that improperly managed client-side resources can open up attack surfaces that adversaries can exploit. However, there is still a lack of comprehensive understanding of web developers' updating practices and of the potential impact that inaccuracies in Common Vulnerabilities and Exposures (CVE) information have on the security of the web ecosystem. In this paper, we conduct a longitudinal (four-year) measurement study of the security practices and implications of client-side resources (e.g., JavaScript libraries and Adobe Flash) across the Web. Specifically, we first collect a large-scale dataset of 157.2M webpages of Alexa Top 1M websites over four years in the wild. Analyzing the dataset, we find that an average of 41.2% of websites (in each of the four years) carry at least one vulnerable client-side resource (e.g., JavaScript or Adobe Flash). We also reveal that vulnerable JavaScript library versions are frequently observed in the wild, suggesting a concerning level of lagging update practices. On average, we observe a window of vulnerability of 531.2 days, affecting 25,337 websites, between the release of a security patch and the update of the unpatched client-side resource. Furthermore, we manually investigate the fidelity of CVE reports on client-side resources, leveraging Proof of Concept (PoC) code. We find that 13 out of 27 CVE reports contain incorrect vulnerable version information, which may undermine security-related tasks such as security updates.
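
To make the window-of-vulnerability metric concrete, below is a minimal sketch (not from the paper; the function name, date handling, and example dates are our own illustration) that computes how many days a site keeps serving a vulnerable library version after the corresponding security patch is released.

package main

import (
	"fmt"
	"time"
)

// windowOfVulnerability returns how many days a website kept serving a
// vulnerable client-side resource after its security patch was released.
// patchRelease is the patch's release date; firstPatchedCrawl is the first
// weekly crawl at which the site no longer serves the vulnerable version.
func windowOfVulnerability(patchRelease, firstPatchedCrawl time.Time) int {
	return int(firstPatchedCrawl.Sub(patchRelease).Hours() / 24)
}

func main() {
	// Hypothetical dates for illustration only.
	patch := time.Date(2019, time.March, 1, 0, 0, 0, 0, time.UTC)
	updated := time.Date(2020, time.August, 14, 0, 0, 0, 0, time.UTC)
	fmt.Println(windowOfVulnerability(patch, updated), "days") // prints: 532 days
}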


Dataset and Source Code

Source code is publicly available in this GitHub repository. We also share our dataset of the landing pages (e.g., index.html) of the Alexa Top 1M domains, collected on a weekly basis for four years (Mar. 2018 – Feb. 2022). Specifically, our Web crawler (implemented in Go using the net/http library) visits each Alexa Top 1M domain over HTTPS and collects the landing page of each domain every week. The dataset consists of 157,242,243 (157.2M) HTML files spanning the 201 weeks of the four years. On average, we collect the index pages of 782,300 domains every week. If you would like to download the dataset, please contact us through this Google form.
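
For reference, here is a minimal sketch of such a weekly crawler. It is not the released source code; the input file name, output layout, and timeout value are our assumptions.

package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Hypothetical input: one Alexa Top 1M domain per line.
	f, err := os.Open("alexa-top-1m.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Store this week's snapshots under a date-stamped directory.
	outDir := filepath.Join("snapshots", time.Now().Format("2006-01-02"))
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		panic(err)
	}

	client := &http.Client{Timeout: 30 * time.Second}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		domain := scanner.Text()

		// Fetch the landing page over HTTPS.
		resp, err := client.Get("https://" + domain + "/")
		if err != nil {
			continue // skip unreachable domains
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}

		// Save the landing page (index.html) for this domain.
		out := filepath.Join(outDir, domain+".html")
		if err := os.WriteFile(out, body, 0o644); err != nil {
			fmt.Fprintln(os.Stderr, "write failed:", domain, err)
		}
	}
}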

