CeWL Is Dead. Here's What Replaces It.
CeWL is not CeWL anymore. I had to make the pun, no offense Robin.
TL;DR
CeWL has been the go-to custom wordlist generator since 2012. It spiders a site, pulls out words, and that’s it. The problem: it only sees what’s literally on the page.
CeWL AI goes further. It crawls HTTP, FTP, SFTP, SMB, and S3 targets, feeds the extracted context to an LLM, and gets back words that are contextually related but never appear on the site - likely passwords, department names, industry jargon, product codenames. It scans for secrets with 800+ trufflehog detectors, dumps all crawled files to disk, and runs CUPP-style mutations on top. Single Go binary, 6 AI providers (4 free), local model support via any OpenAI-compatible endpoint.
The idea came from @stlthr4k3r: CeWL is kind of old and predates AI, so someone could build a more accurate version that uses an LLM to generate "like words" and industry-similar terms. We figured it probably already existed. It didn't.
Why CeWL Falls Short
CeWL visits pages and extracts words. If the target says “Acme Corporation” and “Paris”, you get Acme, Corporation, Paris. Useful, but a pentester would also think of:
- Acme2026! - company name + year + symbol
- AcmeParis - company + city
- finance, hr, devops - departments that almost certainly exist
- HIPAA, GDPR - compliance terms based on the industry
That mental leap from extracted words to contextual guesses is what separates a generic wordlist from one that actually cracks accounts. CeWL can’t do it. A human can, but it takes time. An LLM can do it in seconds.
Architecture
flowchart TD
A[Target URL] --> B{Protocol}
B -->|http/https| C[colly + goquery]
B -->|ftp| D[FTP client]
B -->|sftp| E[SSH + SFTP]
B -->|smb| F[SMB2/3 client]
B -->|s3| G[AWS SDK v2]
C --> H[HTML + JS + CSS + XML + JSON]
D --> H
E --> H
F --> H
G --> H
H --> I[Words + Context]
H --> J[Emails + Metadata]
H --> K[Secrets - 800+ trufflehog detectors]
H --> L[--dump: files to disk]
I --> M[LLM - Groq / Anthropic / OpenAI / Local]
M --> N[AI-generated words]
I --> O[Merge + Dedup]
N --> O
O --> P{--mutate?}
P -->|Yes| Q[CUPP-like mutations]
P -->|No| R[Final wordlist]
Q --> R
Better Than CeWL Even Without AI
Strip the --ai flag entirely. The crawler alone is already a significant upgrade.
DOM parsing vs regex
CeWL runs regex on raw HTML and hopes for the best. CeWL AI uses goquery to build a proper DOM tree, removes <script> and <style> nodes before extraction, and injects whitespace between elements to prevent word concatenation. The output is clean.
JavaScript awareness
CeWL skips JavaScript completely. Any content loaded by JS, any API endpoints hardcoded in .js files, any secrets in client-side code - all invisible to CeWL.
CeWL AI integrates jsluice (the same library Katana uses) to parse both inline <script> blocks and external .js files. It extracts URLs, API paths, and secrets like hardcoded keys.
Email deobfuscation
CeWL catches mailto: links and runs a basic email regex. CeWL AI does the same, plus it deobfuscates common anti-scraping patterns:
| Pattern | Detected |
|---|---|
| user@domain.com | Both |
| user [at] domain [dot] com | CeWL AI only |
| user (at) domain (dot) com | CeWL AI only |
| user {at} domain {dot} net | CeWL AI only |
| user <at> domain <dot> org | CeWL AI only |
| user AT domain DOT com | CeWL AI only |
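A minimal sketch of how that deobfuscation could work (the regexes here are illustrative, not the tool's actual patterns): normalize the bracketed `at`/`dot` variants back into `@` and `.`, then run a standard email regex over the result.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns: match "[at]", "(at)", "{at}", "<at>" and bare
// " AT " (word-bounded), then the equivalent "dot" forms.
var (
	atRe    = regexp.MustCompile(`(?i)\s*(?:[\[\({<]\s*at\s*[\]\)}>]|\bat\b)\s*`)
	dotRe   = regexp.MustCompile(`(?i)\s*(?:[\[\({<]\s*dot\s*[\]\)}>]|\bdot\b)\s*`)
	emailRe = regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)
)

// deobfuscateEmails rewrites obfuscated separators, then extracts
// anything that now looks like a plain address.
func deobfuscateEmails(text string) []string {
	norm := atRe.ReplaceAllString(text, "@")
	norm = dotRe.ReplaceAllString(norm, ".")
	return emailRe.FindAllString(norm, -1)
}

func main() {
	fmt.Println(deobfuscateEmails("contact: user [at] domain [dot] com"))
	// prints: [user@domain.com]
}
```

Bare `at`/`dot` replacement is deliberately greedy for the sketch; a production version would want tighter context checks to avoid mangling ordinary prose.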
Native metadata
CeWL shells out to exiftool for PDF and Office document metadata - if it’s installed. CeWL AI extracts Author, Creator, and LastModifiedBy from PDFs (via regex on raw bytes) and Office files (via archive/zip + XML parsing) natively. No external tools needed.
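The regex-on-raw-bytes approach for PDFs can be sketched in a few lines: PDF info dictionaries store entries like `/Author (Jane Doe)`, so a byte-level pattern pulls names without a full PDF parser. The exact regex below is an assumption; only the field names come from the article.

```go
package main

import (
	"fmt"
	"regexp"
)

// pdfMetaRe matches literal-string metadata entries in a PDF info
// dictionary, e.g. "/Author (Jane Doe)".
var pdfMetaRe = regexp.MustCompile(`/(Author|Creator|Producer)\s*\(([^)]*)\)`)

// extractPDFMeta scans raw PDF bytes and returns key -> value pairs.
func extractPDFMeta(raw []byte) map[string]string {
	meta := map[string]string{}
	for _, m := range pdfMetaRe.FindAllSubmatch(raw, -1) {
		meta[string(m[1])] = string(m[2])
	}
	return meta
}

func main() {
	// A minimal stand-in for part of a real PDF info dictionary.
	raw := []byte(`<< /Title (Report) /Author (Jane Doe) /Creator (LibreOffice) >>`)
	fmt.Println(extractPDFMeta(raw)["Author"]) // prints: Jane Doe
}
```

Office documents are just ZIP archives, so the equivalent there is `archive/zip` plus `encoding/xml` over `docProps/core.xml`, where `LastModifiedBy` lives.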
Distribution
CeWL: Ruby, 10+ gems, bundle install, system dependencies.
CeWL AI: single binary, go install, done.
Multi-Protocol Crawling
CeWL only does HTTP. Real pentests involve file shares, FTP servers, cloud storage. CeWL AI crawls all of them with the same interface:
# FTP (anonymous or authenticated)
cewlai -u ftp://anonymous@ftp.example.com --secrets
# SMB file shares
cewlai -u smb://user:pass@192.168.1.10/data --secrets
# SFTP
cewlai -u sftp://user:pass@host/path
# S3 buckets (AWS, MinIO, any S3-compatible)
cewlai -u s3://bucket-name/prefix?region=eu-west-1
cewlai -u 's3://bucket?endpoint=http://minio:9000' --auth-user KEY --auth-pass SECRET
Every protocol goes through the same pipeline: list files, extract words from filenames, download, parse by content type, scan for secrets, dump to disk if requested.
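A hypothetical sketch of why one pipeline can serve every protocol: each backend satisfies a tiny interface, and the downstream stages (wordlist, secrets, dump) never care where the bytes came from. Type and method names here are illustrative, not the tool's real API.

```go
package main

import (
	"fmt"
	"sort"
)

type RemoteFile struct {
	Path string
	Data []byte
}

// Crawler is the one contract every protocol backend would implement.
type Crawler interface {
	List() ([]string, error)               // enumerate remote paths
	Fetch(path string) (RemoteFile, error) // download one file
}

// memCrawler is an in-memory stand-in for an FTP/SMB/S3 backend.
type memCrawler struct{ files map[string][]byte }

func (m memCrawler) List() ([]string, error) {
	var paths []string
	for p := range m.files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic order for the example
	return paths, nil
}

func (m memCrawler) Fetch(p string) (RemoteFile, error) {
	return RemoteFile{Path: p, Data: m.files[p]}, nil
}

// runPipeline mirrors the shared flow: list, mine filenames, download,
// then hand each file to the parsing/secret/dump stages.
func runPipeline(c Crawler) ([]string, error) {
	paths, err := c.List()
	if err != nil {
		return nil, err
	}
	var words []string
	for _, p := range paths {
		words = append(words, p) // filenames feed the wordlist too
		if f, err := c.Fetch(p); err == nil {
			_ = f.Data // parse by content type, scan secrets, dump...
		}
	}
	return words, nil
}

func main() {
	words, _ := runPipeline(memCrawler{files: map[string][]byte{
		"config/.env":     []byte("DB_PASS=x"),
		"docs/index.html": []byte("<html>hello</html>"),
	}})
	fmt.Println(words) // prints: [config/.env docs/index.html]
}
```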
Secret Scanning
The --secrets flag runs every downloaded file and HTTP response through 800+ trufflehog detectors. API keys, tokens, connection strings, private keys - all matched by regex without making any external API calls.
cewlai -u smb://user:pass@192.168.1.10/data --secrets --secrets-file findings.txt
[+] Found 1 secrets
[Postgres] postgres://admin:s3cretP4ss@db.acme.local:5432 (source: config/.env)
Point it at a file share during an internal pentest and let it find what people left behind. Combined with the wordlist generation, you get both the leaked credentials and a targeted password list from a single command.
File Dump
The --dump flag mirrors all crawled files to a local directory. Structure is preserved.
cewlai -u s3://bucket-name --dump /tmp/loot --secrets
/tmp/loot/
config/.env
config/app.json
config/nginx.conf
docs/index.html
Works on every protocol. On HTTP it saves every response body. On FTP/SFTP/SMB/S3 it saves every file. The wordlist, secrets, and emails are still generated in parallel - the dump is just a bonus.
No tool does recursive multi-protocol dump + wordlist + secrets in one pass. curl downloads single files. wget -r does recursive HTTP only. rclone mirrors remote storage but extracts nothing from it. CeWL AI does all of that in a single run.
What AI Brings to the Table
After the crawl finishes, CeWL AI sends a context summary (up to 4000 characters of extracted text) to an LLM with a specialized system prompt. The model reads the context, understands the domain, and generates words that a human pentester would think of but that never appear on the site.
Real output from chocapikk.com
Crawl only:
$ cewlai -u https://chocapikk.com | wc -l
5522
With AI enrichment (Groq, 100 words):
$ cewlai -u https://chocapikk.com --ai -p groq --ai-words 100 -v
[+] Crawled 15 pages, extracted 5808 raw words
[*] Attempt 1: got 82/100 words (+82 new)
[*] Attempt 2: got 100/100 words (+18 new)
[+] AI generated 100 words
Sample of AI-generated words (not on the site):
firmware
credential
lateral
implant
perimeter
segmentation
remediation
triage
Every word comes from the site context. The LLM reads about exploit development, CVEs, and pentesting, then generates related industry terms that never appear on any page. CeWL would never produce these.
The tool retries automatically until the requested word count is reached, deduplicating across attempts. If the LLM runs out of unique contextual words, it stops gracefully.
Five prompt modes
The same site produces very different outputs depending on the mode:
# General contextual words (default)
cewlai -u https://example.com --ai -p groq
# Likely passwords
cewlai -u https://example.com --ai -p groq --mode passwords
# Hidden directories and endpoints
cewlai -u https://example.com --ai -p groq --mode dirs
# Probable subdomains
cewlai -u https://example.com --ai -p groq --mode subdomains
# Geographic password patterns
cewlai -u https://example.com --ai -p groq --mode geo
Need something specific? Pass your own system prompt:
cewlai -u https://example.com --ai -p groq \
--prompt "Generate API endpoint paths for a healthcare SaaS platform"
CUPP-Style Mutations
The --mutate flag applies password transformations to every word in the list - crawled and AI-generated:
- Lowercase, uppercase, capitalized
- Reversed (admin -> nimda)
- Leet speak (admin -> 4dm1n)
- Common suffixes: 123, !, 2026, @, $
- Prefix support via config
From 15 base words, mutations produce 500+. Combined with AI enrichment, a single run against a medium-sized site can produce tens of thousands of targeted, contextual password candidates.
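The transforms above can be sketched in a few lines of Go. The rule set here (leet map, suffixes) is illustrative; the tool reads its own from a JSON config.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative leet map; the real one is configurable.
var leet = strings.NewReplacer("a", "4", "e", "3", "i", "1", "o", "0")

func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// mutate applies case variants, reversal, and leet substitution, then
// crosses every variant with a set of common suffixes.
func mutate(word string) []string {
	w := strings.ToLower(word)
	base := []string{
		w,
		strings.ToUpper(w),
		strings.ToUpper(w[:1]) + w[1:], // capitalized (ASCII assumption)
		reverse(w),
		leet.Replace(w),
	}
	var out []string
	for _, b := range base {
		out = append(out, b)
		for _, suf := range []string{"123", "!", "2026"} {
			out = append(out, b+suf)
		}
	}
	return out
}

func main() {
	fmt.Println(mutate("admin")[:5])
	// prints: [admin admin123 admin! admin2026 ADMIN]
}
```

Even this toy version turns one word into 20 candidates, which is where the 15-words-to-500+ multiplier comes from.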
Mutation rules are fully customizable via JSON:
{
"leet": {"a": "@", "e": "3", "s": "$"},
"suffixes": ["2026", "!", "123", "#"],
"prefixes": ["pre_"],
"capitalize": true,
"reverse": true,
"leet_enabled": true,
"min_length": 6,
"max_length": 20
}
cewlai -u https://example.com --mutate --mutate-config rules.json
Providers
CeWL AI supports 6 cloud providers and any OpenAI-compatible local endpoint:
| Provider | Free tier | Env var | Signup |
|---|---|---|---|
| Groq | Yes | GROQ_API_KEY | console.groq.com |
| OpenRouter | Yes | OPENROUTER_API_KEY | openrouter.ai |
| Cerebras | Yes | CEREBRAS_API_KEY | cloud.cerebras.ai |
| HuggingFace | Yes | HF_TOKEN | huggingface.co |
| Anthropic | No | ANTHROPIC_API_KEY | console.anthropic.com |
| OpenAI | No | OPENAI_API_KEY | platform.openai.com |
For local models (Ollama, LM Studio, vLLM), point --base-url at your endpoint:
ollama pull llama3
cewlai -u https://example.com --ai -p openai -m llama3 \
--base-url http://localhost:11434/v1 --api-key dummy
No internet required. No API key cost. Full AI enrichment running on your own hardware.
Putting It All Together
A realistic engagement command:
cewlai -u https://example.com \
--ai -p groq \
--mode passwords \
--mutate \
--email \
--meta \
--secrets \
--ai-words 500 \
-d 3 \
-t 5 \
-o wordlist.txt \
--email-file emails.txt \
--meta-file authors.txt \
--secrets-file secrets.txt
What happens:
- Crawls 3 levels deep with 5 threads
- Parses all HTML (goquery), JavaScript (jsluice), PDFs and Office docs
- Extracts emails including obfuscated ones
- Scans every response for secrets with 800+ trufflehog detectors
- Sends context to Groq and generates exactly 500 contextual password candidates
- Mutates everything with leet speak, case variations, and suffix patterns
- Outputs the wordlist, emails, secrets, and document authors to separate files
Internal pentest with a file share:
cewlai -u smb://user:pass@192.168.1.10/data \
--secrets --secrets-file findings.txt \
--dump /tmp/loot \
-o wordlist.txt
One command: dump every file, scan for leaked credentials, build a targeted wordlist from the content. Feed wordlist.txt to Hydra. Feed findings.txt to your report. Feed /tmp/loot to manual review.
Under the Hood
A few design decisions that make the output quality better than “just calling an API”.
Why CeWL can’t do this
CeWL treats every page as a bag of words. It doesn’t build context, it doesn’t track relationships between pages, it doesn’t understand what the site is about. It extracts “Acme” from one page and “Paris” from another but never connects them into “AcmeParis”. It sees “hospital” but never infers “HIPAA”. There’s no intelligence layer - it’s a scraper with a word splitter.
CeWL was written in 2012, before LLMs existed. It was the best approach at the time and it served the community well for over a decade. But the tooling landscape has moved on. LLMs are accessible, fast, and cheap (or free). Go produces single binaries that run everywhere without a runtime. Someone just had to sit down and connect the pieces. That’s all this is.
Context depth matters
LLMs are next-token predictors. The quality of their output is directly proportional to the quality of the input context. This is why the crawl step isn’t just about extracting words - it’s about building the richest possible context for the model. A depth-1 crawl on a landing page gives the LLM almost nothing to work with. A depth-3 crawl across 50 pages gives it company names, product names, employee roles, tech stack hints, geographic clues, and industry terminology. The more signal the model sees, the more relevant its predictions become.
This is also why the tool sends a 4000-character context summary rather than just a list of extracted words. The raw text preserves sentence structure and relationships between concepts that a flat word list would lose.
Token efficiency
LLM APIs charge per token. Asking the model to output one word per line wastes a newline token on every single word. CeWL AI asks for comma-separated output instead, cutting token usage by roughly 30-40% for the same number of words. That’s the difference between hitting Groq’s free tier limit at 300 words or at 500.
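Parsing that comma-separated output is straightforward. A sketch (the normalization rules are assumptions): split on commas, trim whitespace and stray newlines, drop empties, lowercase.

```go
package main

import (
	"fmt"
	"strings"
)

// parseWords turns a comma-separated model response into a clean,
// lowercased word slice, tolerating ragged whitespace and newlines.
func parseWords(resp string) []string {
	var words []string
	for _, w := range strings.Split(resp, ",") {
		if w = strings.TrimSpace(w); w != "" {
			words = append(words, strings.ToLower(w))
		}
	}
	return words
}

func main() {
	fmt.Println(parseWords("Firmware, credential,\nlateral, implant"))
	// prints: [firmware credential lateral implant]
}
```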
Retry with deduplication
LLMs don’t reliably produce an exact word count. Ask for 500, you might get 300. CeWL AI handles this with a retry loop: it keeps calling the model with the remaining count until the target is reached. Each batch is deduplicated against previous results, so the model is forced to produce new words on each attempt. If a retry adds zero new words, the loop stops - the model has exhausted its contextual knowledge for that target.
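The loop can be sketched like this, with an `ask` callback standing in for a real LLM call (names and structure are illustrative):

```go
package main

import "fmt"

// generateWithRetry keeps asking for the remaining count, keeps only
// unseen words, and stops when the target is hit or an attempt
// contributes nothing new.
func generateWithRetry(target int, ask func(n int) []string) []string {
	seen := map[string]bool{}
	var out []string
	for len(out) < target {
		added := 0
		for _, w := range ask(target - len(out)) {
			if !seen[w] {
				seen[w] = true
				out = append(out, w)
				added++
			}
		}
		if added == 0 {
			break // model exhausted its contextual vocabulary
		}
	}
	return out
}

func main() {
	// Simulated model responses: duplicates within and across batches,
	// then an empty batch that triggers the graceful stop.
	batches := [][]string{
		{"alpha", "beta", "alpha"},
		{"beta", "gamma", "delta"},
		{},
	}
	i := 0
	ask := func(n int) []string {
		if i >= len(batches) {
			return nil
		}
		b := batches[i]
		i++
		return b
	}
	fmt.Println(generateWithRetry(10, ask))
	// prints: [alpha beta gamma delta]
}
```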
Mutations as a multiplier
AI calls are the expensive step (in time and tokens). Mutations are free and local. By applying CUPP-style transformations after the AI step, a small set of high-quality contextual words becomes a large set of password candidates without any additional API calls. 50 AI words with mutations can produce 800+ candidates. The AI provides the semantic intelligence, the mutations provide the coverage.
Install
go install github.com/Chocapikk/cewlai/cmd/cewlai@latest
Or download a prebuilt binary from releases (Linux, macOS, Windows).
Personal note
CeWL and CUPP were part of my workflow when I started doing CTFs years ago. They were some of the first tools I learned. Coming back to this space and realizing nobody had combined them with AI felt like a gap that shouldn’t still be there. What started as a CeWL replacement grew into something broader - once you’re crawling targets across 5 protocols and parsing every format, adding secret scanning and file dumps is a natural extension. I wish I’d had it when I was starting out.
Source: github.com/Chocapikk/cewlai