AI Advances Threaten Online Anonymity with High Accuracy in User Identification

Technology Source: arstechnica.com

Burner accounts on social media are increasingly vulnerable to identification through AI, according to recent research with significant implications for internet privacy. Researchers have demonstrated that AI can correlate specific individuals with pseudonymous accounts or posts across multiple platforms, achieving a recall rate of up to 68% at a precision rate as high as 90%. This surpasses traditional deanonymization methods, which required structured data sets or manual investigation.

The findings challenge the effectiveness of pseudonymity, a common privacy measure used by individuals to engage in sensitive discussions online without revealing their identities. The ability to identify users behind pseudonymous accounts could lead to doxxing, stalking, and the creation of detailed marketing profiles, thereby undermining privacy.

Researchers collected datasets from public social media sites, including posts from Hacker News and LinkedIn linked through cross-platform references. They stripped identifying information from the posts and applied a large language model (LLM) to analyze them. Another dataset consisted of Netflix movie ratings, which had previously been used to re-identify users and expose their personal information. A third technique involved analyzing a single user's Reddit history.

Simon Lermen, a co-author of the study, noted that AI can now identify individuals from free text, a capability previously requiring structured data. In one experiment, AI identified 7% of participants from a questionnaire about AI usage. Although this recall rate is low, it highlights AI's growing ability to identify people from general information.

In another experiment, researchers analyzed comments from the r/movies subreddit and other related communities. They found that the more movies a user discussed, the easier it was to identify them. Users discussing a single movie were identified 3.1% of the time at 90% precision, while those discussing more than ten movies were identified 48.1% of the time at 90% precision and 17% of the time at 99% precision.
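Metrics like "48.1% at 90% precision" report how many users a system can link while keeping its confident guesses mostly right. A minimal sketch of how such a recall-at-fixed-precision number is typically computed (the scores and labels below are invented, not from the study):

```python
def recall_at_precision(scored, min_precision):
    """Sort candidate identity matches by model confidence, then find the
    largest prefix whose precision stays >= min_precision; report recall
    over that prefix."""
    scored = sorted(scored, key=lambda x: x[0], reverse=True)
    total_true = sum(label for _, label in scored)
    best_recall = 0.0
    hits = 0
    for i, (_, label) in enumerate(scored, start=1):
        hits += label
        if hits / i >= min_precision:           # precision of the top-i guesses
            best_recall = max(best_recall, hits / total_true)
    return best_recall

# Toy data: (model confidence, 1 if the linked identity was actually correct).
matches = [(0.99, 1), (0.97, 1), (0.91, 1), (0.88, 0), (0.80, 1),
           (0.70, 0), (0.60, 1), (0.40, 0), (0.30, 0), (0.20, 1)]

print(recall_at_precision(matches, 0.90))  # → 0.5
```

Raising the precision bar (say, to 99%) forces the system to keep only its most confident guesses, which is why recall drops from 48.1% to 17% in the Reddit experiment.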

In a third experiment, researchers compared their method to the older Netflix prize attack using a set of 5,000 Reddit users and 5,000 distractor identities. The LLM-based method significantly outperformed the classical approach, maintaining higher precision even as more guesses were made.

The researchers suggested several mitigations, such as platforms enforcing rate limits on API access, detecting automated scraping, and restricting bulk data exports. LLM providers could monitor for misuse and implement guardrails against deanonymization requests. Alternatively, users could reduce their social media usage or regularly delete posts.
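The first mitigation, rate-limiting API access, is commonly implemented with a token bucket, which allows short bursts while capping sustained scraping throughput. A minimal sketch (the limits and class name are illustrative, not from the article):

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter: allow roughly `rate` requests per
    second with bursts up to `capacity`; excess requests are rejected."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)        # ~2 requests/sec, burst of 5
results = [bucket.allow() for _ in range(8)]    # burst exceeds the capacity
print(results)  # → first 5 allowed, remaining 3 rejected
```

Scraping for bulk deanonymization needs millions of posts, so even generous per-client limits like this raise the cost of the attack substantially.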

If LLMs continue to improve in deanonymizing users, the researchers warn of potential misuse by governments and other entities.

