URLs often use query strings (i.e., key-value pairs appended to the URL path) to pass session parameters and form data. Often these arguments are not privacy-sensitive and are necessary to render the web page. However, query strings may also contain tracking mechanisms, usernames, email addresses, and other information that users may not wish to reveal. In isolation such URLs are not particularly problematic, but the growth of Web 2.0 platforms such as social networks and micro-blogging means that URLs (often copy-pasted from web browsers) are increasingly being broadcast publicly.
This position paper argues that the threat posed by such privacy disclosures is significant and prevalent. It demonstrates this by analyzing 892 million user-submitted URLs, many disseminated in (semi-)public forums. Within this corpus, the case study identifies troves of personal data, including 1.7 million email addresses. In the most egregious examples, the query string contains plaintext usernames and passwords for administrative and extremely sensitive accounts. With this as motivation, the authors propose a privacy-aware service they name "CleanURL". CleanURL's goal is to transform addresses by stripping non-essential key-value pairs and/or notifying users when sensitive data is critical to proper page rendering. This logic is based on difference algorithms, mining of URL corpora, and human feedback loops. Though realized as a link shortener in its prototype implementation, CleanURL could be leveraged on any platform to scan URLs before they are published or to retroactively sanitize existing links.
- query strings
- clean URL
- link shortening
- social networks
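
The stripping step described in the abstract can be illustrated with a minimal Python sketch. It is not the CleanURL implementation: the function name `strip_query_keys` and the fixed key list `NON_ESSENTIAL_KEYS` are hypothetical, chosen only for illustration. The abstract states that the real decision about which pairs are non-essential comes from difference algorithms, mining of URL corpora, and human feedback, none of which is modeled here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list for illustration only. The paper derives this
# decision from diffing, corpus mining, and human feedback instead.
NON_ESSENTIAL_KEYS = {"utm_source", "utm_medium", "utm_campaign",
                      "email", "user", "password"}


def strip_query_keys(url, drop_keys=NON_ESSENTIAL_KEYS):
    """Return `url` with every query pair whose key is in `drop_keys` removed."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves pairs such as "flag=" that may matter
    # for page rendering.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs
            if key.lower() not in drop_keys]
    # Rebuild the URL, leaving scheme, host, path, and fragment untouched.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    dirty = "http://example.com/page?id=42&utm_source=feed&email=a%40b.com"
    print(strip_query_keys(dirty))
```

Under these assumptions, the example URL keeps only the `id=42` pair. A static key list like this one cannot tell whether a removed pair was needed to render the page, which is the gap the paper's difference-based approach is meant to address.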