DeepSeek Data Leak – 12,000 Hardcoded Live API Keys & Passwords Exposed

A recent analysis uncovered 11,908 live API keys, passwords, and authentication tokens embedded in publicly scraped web data of the kind used to train LLMs such as DeepSeek.

According to cybersecurity firm Truffle Security, the study highlights how AI models trained on unfiltered internet snapshots risk internalizing and potentially reproducing insecure coding patterns.

The findings follow earlier revelations that LLMs frequently suggest hardcoding credentials in codebases, raising questions about the role of training data in reinforcing these behaviors.

DeepSeek Data Exposed

Truffle Security scanned 400 terabytes of Common Crawl’s December 2024 dataset, comprising 2.67 billion web pages from 47.5 million hosts. Using their open-source tool TruffleHog, researchers identified:

  • 11,908 verified live secrets that authenticate to services like AWS, Slack, and Mailchimp.
  • 2.76 million web pages containing exposed credentials, with 63% of keys reused across multiple domains.
  • A single WalkScore API key recurring 57,029 times across 1,871 subdomains, illustrating widespread credential reuse.
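
As a rough illustration of how such exposures surface in raw page content, the sketch below runs a few credential-shaped regexes over HTML/JavaScript text. The patterns and function name are illustrative assumptions, not TruffleHog's actual detectors, which are far more extensive and include live verification.

```python
import re

# Illustrative patterns only -- production scanners like TruffleHog ship hundreds
# of service-specific detectors plus live verification.
CANDIDATE_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def find_candidate_secrets(page_text: str) -> list[tuple[str, str]]:
    """Return (detector_name, matched_string) pairs found in raw HTML/JS."""
    hits = []
    for name, pattern in CANDIDATE_PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(page_text))
    return hits

# Example: a front-end snippet with an embedded key of the AWS access-key shape.
sample = '<script>var awsKey = "AKIAABCDEFGHIJKLMNOP";</script>'
print(find_candidate_secrets(sample))  # [('aws_access_key_id', 'AKIAABCDEFGHIJKLMNOP')]
```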

Notably, the dataset included high-risk exposures like AWS root keys in front-end HTML and 17 unique Slack webhooks hardcoded into a single webpage’s chat feature.

Mailchimp API keys dominated the leaks (1,500+ instances). They were often embedded directly in client-side JavaScript, a practice that enabled phishing campaigns and data exfiltration.
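
The usual remedy is to keep the key server-side and have the browser call a thin backend endpoint instead. Below is a minimal Flask sketch of that pattern; the route, environment variable name, and LIST_ID placeholder are assumptions for illustration, not code from any affected site.

```python
import os
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
MAILCHIMP_KEY = os.environ["MAILCHIMP_API_KEY"]   # read from the server environment, never shipped to the browser
MAILCHIMP_DC = MAILCHIMP_KEY.rsplit("-", 1)[-1]   # Mailchimp keys end in a data-center suffix such as "us21"

@app.route("/subscribe", methods=["POST"])
def subscribe():
    email = request.json.get("email", "")
    # The server makes the authenticated call; the client only ever sees /subscribe.
    resp = requests.post(
        f"https://{MAILCHIMP_DC}.api.mailchimp.com/3.0/lists/LIST_ID/members",
        auth=("anystring", MAILCHIMP_KEY),        # Mailchimp uses HTTP basic auth with the key as password
        json={"email_address": email, "status": "subscribed"},
        timeout=10,
    )
    return jsonify(ok=resp.ok), resp.status_code
```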

Common Crawl’s dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses from crawled sites.
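
For readers who want a sense of what processing those archives involves, here is a minimal sketch that walks one WARC file and yields response bodies for scanning. It assumes the third-party warcio package and a placeholder file name; Truffle Security's actual pipeline has not been published in this form.

```python
from warcio.archiveiterator import ArchiveIterator

def iter_page_text(warc_path: str):
    """Yield (url, body_text) for every HTTP response record in one WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):        # handles .warc.gz transparently
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read().decode("utf-8", errors="replace")
            yield url, body

# Placeholder file name; real Common Crawl WARC paths are much longer.
for url, text in iter_page_text("CC-MAIN-2024-51-sample.warc.gz"):
    for detector, match in find_candidate_secrets(text):   # detection sketch from earlier
        print(url, detector, match)
```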

Truffle Security deployed a 20-node AWS cluster to process the archive, splitting files using awk and scanning each segment with TruffleHog’s verification engine.
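
A simplified version of that fan-out might look like the following, which maps extracted segments across worker processes and shells out to the TruffleHog CLI. The `filesystem` subcommand and `--json` flag exist in TruffleHog v3, but the output field names and segment paths shown here are assumptions and may vary by version.

```python
import json
import subprocess
from concurrent.futures import ProcessPoolExecutor

def scan_segment(segment_path: str) -> list[dict]:
    """Run the TruffleHog CLI over one extracted segment and parse its JSON findings."""
    proc = subprocess.run(
        ["trufflehog", "filesystem", segment_path, "--json"],
        capture_output=True, text=True,
    )
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]

# Placeholder segment paths; the real job split 90,000 WARC files across 20 nodes.
segments = [f"/data/segments/part-{i:05d}" for i in range(64)]

with ProcessPoolExecutor(max_workers=8) as pool:
    for findings in pool.map(scan_segment, segments):
        for finding in findings:
            print(finding.get("DetectorName"), finding.get("Verified"))
```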

The tool differentiated live secrets (authenticated against their services) from inert strings—a critical step given that LLMs cannot discern valid credentials during training.
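
The idea behind that verification step is to attempt a harmless authenticated call and see whether it succeeds. For AWS keys, a read-only STS GetCallerIdentity request is enough; the boto3 sketch below illustrates the concept but is not TruffleHog's own verifier code.

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_access_key: str) -> bool:
    """Return True if the key pair authenticates; GetCallerIdentity needs no extra permissions."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    try:
        sts.get_caller_identity()   # succeeds only for valid, unrevoked credentials
        return True
    except ClientError:
        return False
```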

Researchers faced infrastructural hurdles: WARC’s streaming inefficiencies initially slowed processing, while AWS optimizations reduced download times by 5–6x.
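
The article does not detail those optimizations, but one common approach is simply to pull many archive files concurrently from Common Crawl's public endpoint. The sketch below illustrates that idea with placeholder paths; it is not the team's actual setup, which ran inside AWS.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://data.commoncrawl.org/"   # Common Crawl's public HTTP endpoint

def download(path: str) -> str:
    """Stream one archive file to disk and return its local name."""
    local_name = path.rsplit("/", 1)[-1]
    with requests.get(BASE + path, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local_name, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)
    return local_name

# Placeholder paths; real ones are listed in each crawl's warc.paths.gz index.
paths = ["crawl-data/CC-MAIN-2024-51/segments/example/warc/example.warc.gz"]

with ThreadPoolExecutor(max_workers=16) as pool:
    for name in pool.map(download, paths):
        print("fetched", name)
```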

WARC File (Source: Truffle Security)

Despite these challenges, the team prioritized ethical disclosure by collaborating with vendors like Mailchimp to revoke thousands of keys, avoiding spam-like outreach to individual website owners.

The study underscores a growing dilemma: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek employ additional safeguards such as fine-tuning, alignment techniques, and prompt constraints, the prevalence of hardcoded secrets in training corpora risks normalizing unsafe practices.

Non-functional credentials (e.g., placeholder tokens) contribute to this issue, as LLMs cannot contextually evaluate their validity during code generation.

Truffle Security warns that developers who reuse API keys across client projects face heightened risks. In one case, a software firm’s shared Mailchimp key exposed all client domains linked to its account, a goldmine for attackers.

Mitigations

To curb AI-generated vulnerabilities, Truffle Security recommends:

  1. Integrating security guardrails into AI coding tools via platforms like GitHub, which can enforce policies against hardcoding secrets (a minimal local sketch follows this list).
  2. Expanding secret-scanning programs to include archived web data, as historical leaks resurface in training datasets.
  3. Adopting Constitutional AI techniques to align models with security best practices, reducing inadvertent exposure of sensitive patterns.
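
As a concrete (if simplified) example of the first recommendation, the pre-commit style check below refuses to commit staged lines that match credential-shaped patterns. It is a local stand-in for illustration only; GitHub's secret scanning and push protection run server-side, and the patterns are the same illustrative ones used earlier.

```python
import re
import subprocess
import sys

# Same illustrative patterns as the detection sketch above.
BLOCKLIST = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                    # AWS access key id shape
    re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),           # Mailchimp key shape
    re.compile(r"https://hooks\.slack\.com/services/\S+"),  # Slack webhook URL
]

def staged_added_lines() -> list[str]:
    """Return the lines added in the staged diff, without the leading '+'."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [l[1:] for l in diff.splitlines() if l.startswith("+") and not l.startswith("+++")]

def main() -> int:
    for line in staged_added_lines():
        for pattern in BLOCKLIST:
            if pattern.search(line):
                print(f"Refusing to commit: possible hardcoded secret -> {line.strip()}")
                return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```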

With LLMs increasingly shaping software development, securing their training data is no longer optional—it’s foundational to building a safer digital future.
