CeWL Is Dead. Here's What Replaces It.

Valentin Lobstein

CeWL is not CeWL anymore. I had to make the pun, no offense Robin.

TL;DR

CeWL has been the go-to custom wordlist generator since 2012. It spiders a site, pulls out words, and that’s it. The problem: it only sees what’s literally on the page.

CeWL AI goes further. It crawls HTTP, FTP, SFTP, SMB, and S3 targets, feeds the extracted context to an LLM, and gets back words that are contextually related but never appear on the site - likely passwords, department names, industry jargon, product codenames. It scans for secrets with 800+ trufflehog detectors, dumps all crawled files to disk, and runs CUPP-style mutations on top. Single Go binary, 6 AI providers (4 free), local model support via any OpenAI-compatible endpoint.

The idea came from @stlthr4k3r - CeWL is kind of old and probably pre-AI, so someone could make a more accurate version using AI to generate “like words” and “industry similar terms”. We figured it probably already existed. It didn’t.


Why CeWL Falls Short

CeWL visits pages and extracts words. If the target says “Acme Corporation” and “Paris”, you get Acme, Corporation, Paris. Useful, but a pentester would also think of:

  • Acme2026! - company name + year + symbol
  • AcmeParis - company + city
  • finance, hr, devops - departments that almost certainly exist
  • HIPAA, GDPR - compliance terms based on the industry

That mental leap from extracted words to contextual guesses is what separates a generic wordlist from one that actually cracks accounts. CeWL can’t do it. A human can, but it takes time. An LLM can do it in seconds.
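That mental leap is mechanical once you name it. A minimal sketch (illustrative only, not CeWL AI's code) of the combinatorial step a pentester performs in their head:

```python
from itertools import product

# Hypothetical inputs: two words CeWL would extract from the target site.
extracted = ["Acme", "Paris"]
years = ["2025", "2026"]
symbols = ["!", "@"]

candidates = set()
for word in extracted:
    for year, sym in product(years, symbols):
        candidates.add(f"{word}{year}{sym}")   # e.g. Acme2026!
for a, b in product(extracted, repeat=2):
    if a != b:
        candidates.add(a + b)                  # e.g. AcmeParis

print(sorted(candidates))
```

The departments and compliance terms are the part no loop can produce: they require knowing what a company like this one probably has, which is exactly where the LLM comes in.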


Architecture

flowchart TD
    A[Target URL] --> B{Protocol}
    B -->|http/https| C[colly + goquery]
    B -->|ftp| D[FTP client]
    B -->|sftp| E[SSH + SFTP]
    B -->|smb| F[SMB2/3 client]
    B -->|s3| G[AWS SDK v2]
    C --> H[HTML + JS + CSS + XML + JSON]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Words + Context]
    H --> J[Emails + Metadata]
    H --> K[Secrets - 800+ trufflehog detectors]
    H --> L[--dump: files to disk]
    I --> M[LLM - Groq / Anthropic / OpenAI / Local]
    M --> N[AI-generated words]
    I --> O[Merge + Dedup]
    N --> O
    O --> P{--mutate?}
    P -->|Yes| Q[CUPP-like mutations]
    P -->|No| R[Final wordlist]
    Q --> R

Better Than CeWL Even Without AI

Strip the --ai flag entirely. The crawler alone is already a significant upgrade.

DOM parsing vs regex

CeWL runs regex on raw HTML and hopes for the best. CeWL AI uses goquery to build a proper DOM tree, removes <script> and <style> nodes before extraction, and injects whitespace between elements to prevent word concatenation. The output is clean.
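The same strip-then-inject idea can be sketched with Python's stdlib parser (CeWL AI itself uses goquery in Go; this is an analogy, not its code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies and
    injecting whitespace between elements so words don't concatenate."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        self.parts.append(" ")   # whitespace injection between elements
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

p = TextExtractor()
p.feed("<p>Acme</p><p>Paris</p><script>var k='x'</script>")
print(" ".join("".join(p.parts).split()))  # "Acme Paris"
```

A naive regex strip of the same input yields either `AcmeParis` as one token or the JavaScript variable leaking into the wordlist; the DOM walk avoids both.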

JavaScript awareness

CeWL skips JavaScript completely. Any content loaded by JS, any API endpoints hardcoded in .js files, any secrets in client-side code - all invisible to CeWL.

CeWL AI integrates jsluice (the same library Katana uses) to parse both inline <script> blocks and external .js files. It extracts URLs, API paths, and secrets like hardcoded keys.

Email deobfuscation

CeWL catches mailto: links and runs a basic email regex. CeWL AI does the same, plus it deobfuscates common anti-scraping patterns:

| Pattern                    | Detected by  |
|----------------------------|--------------|
| user@domain.com            | Both         |
| user [at] domain [dot] com | CeWL AI only |
| user (at) domain (dot) com | CeWL AI only |
| user {at} domain {dot} net | CeWL AI only |
| user <at> domain <dot> org | CeWL AI only |
| user AT domain DOT com     | CeWL AI only |
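The normalization step can be sketched as two substitutions followed by a standard email regex. The separator patterns below are assumptions modeled on the table above, not CeWL AI's actual expressions:

```python
import re

# Rewrite whitespace-delimited "at"/"dot" tokens (bare or bracketed) into
# "@" and ".", then match with an ordinary email regex.
AT    = re.compile(r"\s+(?:[\[\(\{<]\s*at\s*[\]\)\}>]|at)\s+", re.IGNORECASE)
DOT   = re.compile(r"\s+(?:[\[\(\{<]\s*dot\s*[\]\)\}>]|dot)\s+", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def deobfuscate(text: str) -> list[str]:
    text = AT.sub("@", text)
    text = DOT.sub(".", text)
    return EMAIL.findall(text)

print(deobfuscate("contact: user [at] domain [dot] com"))  # ['user@domain.com']
```

Requiring whitespace around the bare `at`/`dot` forms keeps words like "data" or "category" from being mangled.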

Native metadata

CeWL shells out to exiftool for PDF and Office document metadata - if it’s installed. CeWL AI extracts Author, Creator, and LastModifiedBy from PDFs (via regex on raw bytes) and Office files (via archive/zip + XML parsing) natively. No external tools needed.
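For PDFs, the regex-on-raw-bytes approach is simpler than it sounds, because classic PDF metadata is stored as literal `/Author (...)` entries in the info dictionary. A simplified sketch (the actual patterns CeWL AI uses are not published here):

```python
import re

# Match /Author and /Creator entries written as literal strings in the
# PDF info dictionary. Real PDFs can also use hex or encrypted strings,
# which this simplified pattern ignores.
PDF_FIELD = re.compile(rb"/(Author|Creator)\s*\(([^)]*)\)")

raw = b"... /Author (Alice Smith) /Creator (LibreOffice 7.4) ..."
meta = {k.decode(): v.decode() for k, v in PDF_FIELD.findall(raw)}
print(meta)  # {'Author': 'Alice Smith', 'Creator': 'LibreOffice 7.4'}
```

Office files store the same fields as XML (`docProps/core.xml`) inside the zip container, which is why `archive/zip` plus an XML parser covers them without external tools.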

Distribution

CeWL: Ruby, 10+ gems, bundle install, system dependencies. CeWL AI: single binary, go install, done.


Multi-Protocol Crawling

CeWL only does HTTP. Real pentests involve file shares, FTP servers, cloud storage. CeWL AI crawls all of them with the same interface:

# FTP (anonymous or authenticated)
cewlai -u ftp://anonymous@ftp.example.com --secrets

# SMB file shares
cewlai -u smb://user:pass@192.168.1.10/data --secrets

# SFTP
cewlai -u sftp://user:pass@host/path

# S3 buckets (AWS, MinIO, any S3-compatible)
cewlai -u s3://bucket-name/prefix?region=eu-west-1
cewlai -u 's3://bucket?endpoint=http://minio:9000' --auth-user KEY --auth-pass SECRET

Every protocol goes through the same pipeline: list files, extract words from filenames, download, parse by content type, scan for secrets, dump to disk if requested.
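The shared-pipeline shape is easy to sketch: dispatch on the URL scheme, then run identical post-processing on whatever the handler lists. The handlers below are stand-ins, not CeWL AI's implementations:

```python
from urllib.parse import urlparse

# Stand-in listers: each scheme returns remote paths; the pipeline
# after dispatch is the same for all of them.
def list_http(url): return ["index.html"]
def list_smb(url):  return ["data/config/.env"]

HANDLERS = {"http": list_http, "https": list_http, "smb": list_smb}

def crawl(target: str) -> list[str]:
    scheme = urlparse(target).scheme
    if scheme not in HANDLERS:
        raise ValueError(f"unsupported scheme: {scheme}")
    files = HANDLERS[scheme](target)
    # Shared step: harvest words from filenames before any download.
    words = {w for f in files for w in f.replace("/", ".").split(".") if w}
    return sorted(words)

print(crawl("smb://user:pass@192.168.1.10/data"))  # ['config', 'data', 'env']
```

Download, content-type parsing, secret scanning, and dumping slot in after the listing step, identically for every scheme.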


Secret Scanning

The --secrets flag runs every downloaded file and HTTP response through 800+ trufflehog detectors. API keys, tokens, connection strings, private keys - all matched by regex without making any external API calls.

cewlai -u smb://user:pass@192.168.1.10/data --secrets --secrets-file findings.txt
[+] Found 1 secrets
[Postgres] postgres://admin:s3cretP4ss@db.acme.local:5432 (source: config/.env)

Point it at a file share during an internal pentest and let it find what people left behind. Combined with the wordlist generation, you get both the leaked credentials and a targeted password list from a single command.
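The detector model itself is plain regex matching, which is why it needs no network access. A two-detector sketch (trufflehog bundles 800+ of these; both patterns here are simplified illustrations):

```python
import re

# Each detector is a name plus a compiled pattern; scanning is just
# running every pattern over every downloaded byte stream.
DETECTORS = {
    "Postgres URI": re.compile(r"postgres://[^:\s]+:[^@\s]+@[^\s]+"),
    "AWS access key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan(text: str) -> list[tuple[str, str]]:
    return [(name, m.group(0))
            for name, rx in DETECTORS.items()
            for m in rx.finditer(text)]

blob = "DATABASE_URL=postgres://admin:s3cretP4ss@db.acme.local:5432/app"
print(scan(blob))
```

Real detectors add entropy checks and keyword pre-filters to cut false positives, but the offline-regex core is the same.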


File Dump

The --dump flag mirrors all crawled files to a local directory. Structure is preserved.

cewlai -u s3://bucket-name --dump /tmp/loot --secrets
/tmp/loot/
  config/.env
  config/app.json
  config/nginx.conf
  docs/index.html

Works on every protocol. On HTTP it saves every response body. On FTP/SFTP/SMB/S3 it saves every file. The wordlist, secrets, and emails are still generated in parallel - the dump is just a bonus.
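Structure-preserving mirroring reduces to joining the remote path onto a local root and creating parents on demand. A sketch of the idea (paths illustrative, not CeWL AI's code):

```python
import tempfile
from pathlib import Path

def dump(root: Path, remote_path: str, data: bytes) -> Path:
    """Mirror one remote file under root, preserving its directory tree."""
    local = root / remote_path.lstrip("/")
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(data)
    return local

root = Path(tempfile.mkdtemp())
p = dump(root, "config/.env", b"SECRET=1")
print(p.relative_to(root))  # config/.env
```

Note the `lstrip("/")`: joining an absolute remote path onto a local root without it would escape the dump directory.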

No tool does recursive multi-protocol dump + wordlist + secrets in one pass. curl downloads single files. wget -r does recursive HTTP only. rclone mirrors remote storage but extracts no words and scans for nothing. CeWL AI does all of it in one run.


What AI Brings to the Table

After the crawl finishes, CeWL AI sends a context summary (up to 4000 characters of extracted text) to an LLM with a specialized system prompt. The model reads the context, understands the domain, and generates words that a human pentester would think of but that never appear on the site.

Real output from chocapikk.com

Crawl only:

$ cewlai -u https://chocapikk.com | wc -l
5522

With AI enrichment (Groq, 100 words):

$ cewlai -u https://chocapikk.com --ai -p groq --ai-words 100 -v
[+] Crawled 15 pages, extracted 5808 raw words
[*] Attempt 1: got 82/100 words (+82 new)
[*] Attempt 2: got 100/100 words (+18 new)
[+] AI generated 100 words

Sample of AI-generated words (not on the site):

firmware
credential
lateral
implant
perimeter
segmentation
remediation
triage

Every word comes from the site context. The LLM reads about exploit development, CVEs, and pentesting, then generates related industry terms that never appear on any page. CeWL would never produce these.

The tool retries automatically until the requested word count is reached, deduplicating across attempts. If the LLM runs out of unique contextual words, it stops gracefully.

Five prompt modes

The same site produces very different outputs depending on the mode:

# General contextual words (default)
cewlai -u https://example.com --ai -p groq

# Likely passwords
cewlai -u https://example.com --ai -p groq --mode passwords

# Hidden directories and endpoints
cewlai -u https://example.com --ai -p groq --mode dirs

# Probable subdomains
cewlai -u https://example.com --ai -p groq --mode subdomains

# Geographic password patterns
cewlai -u https://example.com --ai -p groq --mode geo

Need something specific? Pass your own system prompt:

cewlai -u https://example.com --ai -p groq \
  --prompt "Generate API endpoint paths for a healthcare SaaS platform"

CUPP-Style Mutations

The --mutate flag applies password transformations to every word in the list - crawled and AI-generated:

  • Lowercase, uppercase, capitalized
  • Reversed (admin -> nimda)
  • Leet speak (admin -> 4dm1n)
  • Common suffixes: 123, !, 2026, @, $
  • Prefix support via config

From 15 base words, mutations produce 500+. Combined with AI enrichment, a single run against a medium-sized site can produce tens of thousands of targeted, contextual password candidates.

Mutation rules are fully customizable via JSON:

{
  "leet": {"a": "@", "e": "3", "s": "$"},
  "suffixes": ["2026", "!", "123", "#"],
  "prefixes": ["pre_"],
  "capitalize": true,
  "reverse": true,
  "leet_enabled": true,
  "min_length": 6,
  "max_length": 20
}
cewlai -u https://example.com --mutate --mutate-config rules.json
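The rule semantics can be sketched in a few lines. This interprets the JSON fields above the way the feature list describes them; the exact order and interaction of rules in CeWL AI are assumptions here:

```python
# Rules mirroring the JSON config above.
LEET = {"a": "@", "e": "3", "s": "$"}
SUFFIXES = ["2026", "!", "123"]

def mutate(word: str) -> set[str]:
    base = {word.lower(), word.upper(), word.capitalize(), word[::-1]}
    base.add("".join(LEET.get(c, c) for c in word.lower()))  # leet speak
    # Cross every base form with every suffix (plus the bare form).
    return {w + s for w in base for s in [""] + SUFFIXES}

out = mutate("admin")
print(len(out))  # 20 candidates from a single word
```

Five base forms times four suffix variants turns one word into twenty candidates, which is how 15 base words balloon into 500+.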

Providers

CeWL AI supports 6 cloud providers and any OpenAI-compatible local endpoint:

| Provider    | Free tier | Env var            | Signup                |
|-------------|-----------|--------------------|-----------------------|
| Groq        | Yes       | GROQ_API_KEY       | console.groq.com      |
| OpenRouter  | Yes       | OPENROUTER_API_KEY | openrouter.ai         |
| Cerebras    | Yes       | CEREBRAS_API_KEY   | cloud.cerebras.ai     |
| HuggingFace | Yes       | HF_TOKEN           | huggingface.co        |
| Anthropic   | No        | ANTHROPIC_API_KEY  | console.anthropic.com |
| OpenAI      | No        | OPENAI_API_KEY     | platform.openai.com   |

For local models (Ollama, LM Studio, vLLM), point --base-url at your endpoint:

ollama pull llama3
cewlai -u https://example.com --ai -p openai -m llama3 \
  --base-url http://localhost:11434/v1 --api-key dummy

No internet required. No API key cost. Full AI enrichment running on your own hardware.


Putting It All Together

A realistic engagement command:

cewlai -u https://example.com \
  --ai -p groq \
  --mode passwords \
  --mutate \
  --email \
  --meta \
  --secrets \
  --ai-words 500 \
  -d 3 \
  -t 5 \
  -o wordlist.txt \
  --email-file emails.txt \
  --meta-file authors.txt \
  --secrets-file secrets.txt

What happens:

  1. Crawls 3 levels deep with 5 threads
  2. Parses all HTML (goquery), JavaScript (jsluice), PDFs and Office docs
  3. Extracts emails including obfuscated ones
  4. Scans every response for secrets with 800+ trufflehog detectors
  5. Sends context to Groq and generates exactly 500 contextual password candidates
  6. Mutates everything with leet speak, case variations, and suffix patterns
  7. Outputs the wordlist, emails, secrets, and document authors to separate files

Internal pentest with a file share:

cewlai -u smb://user:pass@192.168.1.10/data \
  --secrets --secrets-file findings.txt \
  --dump /tmp/loot \
  -o wordlist.txt

One command: dump every file, scan for leaked credentials, build a targeted wordlist from the content. Feed wordlist.txt to Hydra. Feed findings.txt to your report. Feed /tmp/loot to manual review.


Under the Hood

A few design decisions that make the output quality better than “just calling an API”.

Why CeWL can’t do this

CeWL treats every page as a bag of words. It doesn’t build context, it doesn’t track relationships between pages, it doesn’t understand what the site is about. It extracts “Acme” from one page and “Paris” from another but never connects them into “AcmeParis”. It sees “hospital” but never infers “HIPAA”. There’s no intelligence layer - it’s a scraper with a word splitter.

CeWL was written in 2012, before LLMs existed. It was the best approach at the time and it served the community well for over a decade. But the tooling landscape has moved on. LLMs are accessible, fast, and cheap (or free). Go produces single binaries that run everywhere without a runtime. Someone just had to sit down and connect the pieces. That’s all this is.

Context depth matters

LLMs are next-token predictors. The quality of their output is directly proportional to the quality of the input context. This is why the crawl step isn’t just about extracting words - it’s about building the richest possible context for the model. A depth-1 crawl on a landing page gives the LLM almost nothing to work with. A depth-3 crawl across 50 pages gives it company names, product names, employee roles, tech stack hints, geographic clues, and industry terminology. The more signal the model sees, the more relevant its predictions become.

This is also why the tool sends a 4000-character context summary rather than just a list of extracted words. The raw text preserves sentence structure and relationships between concepts that a flat word list would lose.

Token efficiency

LLM APIs charge per token. Asking the model to output one word per line wastes a newline token on every single word. CeWL AI asks for comma-separated output instead, cutting token usage by roughly 30-40% for the same number of words. That’s the difference between hitting Groq’s free tier limit at 300 words or at 500.

Retry with deduplication

LLMs don’t reliably produce an exact word count. Ask for 500, you might get 300. CeWL AI handles this with a retry loop: it keeps calling the model with the remaining count until the target is reached. Each batch is deduplicated against previous results, so the model is forced to produce new words on each attempt. If a retry adds zero new words, the loop stops - the model has exhausted its contextual knowledge for that target.
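The loop can be sketched as follows, with the model call stubbed by canned batches (the `generate` stub and its batches are illustrative, not CeWL AI's internals):

```python
# Canned model responses: note the overlaps, which the dedup must absorb.
batches = [["acme", "paris", "hipaa"], ["hipaa", "gdpr"], ["gdpr"]]

def generate(n, _it=iter(batches)):
    """Stub for the LLM call: returns the next canned batch, or [] when
    the 'model' has nothing left."""
    return next(_it, [])

def fill(target: int) -> list[str]:
    words, seen = [], set()
    while len(words) < target:
        new = [w for w in generate(target - len(words)) if w not in seen]
        if not new:   # a retry added zero new words: knowledge exhausted
            break
        seen.update(new)
        words.extend(new)
    return words

result = fill(10)
print(result)  # ['acme', 'paris', 'hipaa', 'gdpr']
```

Asked for 10, the loop collects 4 unique words across two useful batches, then stops gracefully when a retry contributes nothing new, exactly the behavior described above.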

Mutations as a multiplier

AI calls are the expensive step (in time and tokens). Mutations are free and local. By applying CUPP-style transformations after the AI step, a small set of high-quality contextual words becomes a large set of password candidates without any additional API calls. 50 AI words with mutations can produce 800+ candidates. The AI provides the semantic intelligence, the mutations provide the coverage.


Install

go install github.com/Chocapikk/cewlai/cmd/cewlai@latest

Or download a prebuilt binary from releases (Linux, macOS, Windows).


Personal note

CeWL and CUPP were part of my workflow when I started doing CTFs years ago. They were some of the first tools I learned. Coming back to this space and realizing nobody had combined them with AI felt like a gap that shouldn’t still be there. What started as a CeWL replacement grew into something broader - once you’re crawling targets across 5 protocols and parsing every format, adding secret scanning and file dumps is a natural extension. I wish I’d had it when I was starting out.


Source: github.com/Chocapikk/cewlai