Building a Competitor Discovery System: LLMs, Classification, and Guardrails
Building a domain classification system with LLM waterfall prompting, then creating a dual-path approach that discovers a website's industry and competitors in seconds, whether it's already in the dataset or being analyzed for the first time.
When I set out to build automatic competitor discovery for UX Bench, I thought the problem was simple: classify websites by industry, find 5 similar competitors. I was wrong. What started as a straightforward classification task became a journey through LLM-guided classification, NAICS code limitations, and ultimately a clever shortcut that makes the whole system work in production. Here's how I classified 5,500 domains at 100% 6-digit NAICS granularity (4,500 made it into production after filtering), then layered on custom business verticals to build a system that handles both known and unknown domains intelligently.
The Challenge: Automatically Finding 5 Relevant Competitors
UX Bench helps users benchmark their site's Core Web Vitals against competitors. For that to work, I needed to automatically identify 5 relevant competitors for any website. A user analyzing walmart.com should see target.com and costco.com, not random e-commerce sites. The system needed to work for thousands of known domains AND handle unknown startups users might analyze.
The Requirements
• Accurate: Walmart's competitors should be Target and Costco, not Etsy and Wayfair
• Handle known domains: Fast lookup for 4,000+ classified domains
• Handle unknown domains: Real-time classification when users analyze new sites
• Nuanced for newer industries: NAICS code 541511 lumps Google, Shopify, and Oracle together even though they aren't competitors; despite updates every 5 years, NAICS still lacks sufficient granularity for newer digital business models
• Intelligent fallback: Roll up to broader categories if specific ones are too small
This article tells two connected stories: Part 1 covers how I built the classification database (5,500 domains with increasingly granular NAICS codes and custom business verticals). Part 2 shows how I designed an intelligent system that handles both known domains (fast lookup) and unknown domains (brand similarity matching), with automatic rollup strategies when granular matches are sparse.
Final Results
• ~5,500 domains classified
• 100% 6-digit NAICS codes
• 100% high confidence
• ~4,500 in production
Why 5,500 became 4,500: After achieving 100% 6-digit NAICS classification with waterfall prompting, we filtered to domains with CrUX data (real user performance metrics), USA-focused English-language sites, and professional-appropriate content. Then we layered on custom business verticals (like "Traditional Banking" vs "Digital Banking") for even more nuanced competitor matching.
Why Domain Classification Is Harder Than It Looks
Before diving into the methods, it's important to understand why using NAICS (North American Industry Classification System) codes for domain classification is tricky.
NAICS Updates Every 5 Years
The system changes regularly (2022, 2017, 2012...). Codes get retired or merged. LLMs trained on older data may use outdated classifications. Example: Code 541513 "Computer Facilities Management Services" was retired in 2017 and merged into 541519.
Regional Variations
While standardized across the US, Canada, and Mexico, some codes are unique to each country. LLMs may conflate different regional versions, leading to classification errors for international companies.
Not Designed for Digital Businesses
NAICS code 541511 "Custom Computer Programming Services" includes Google (search), Shopify (e-commerce), Salesforce (CRM), and Oracle (databases). They're not competitors, yet NAICS sees them as identical.
The solution? Layer semantic "business verticals" on top of NAICS codes. More on that later.
How Text Classification Evolved (Brief History)
Before showing what I actually built, here's a quick look at how classification approaches evolved over the past 30 years. The first three are educational references showing historical context for the waterfall approach I developed.
Historic Alternative
Keyword/Regex Matching (1990s-2000s)
Simple substring matching: if text contains "restaurant" + "menu" → food service. Lightning fast (microseconds) but brittle and requires manual keyword curation for 1,000+ NAICS codes.
⚠️ Initial attempt: Achieved only 4.4% success rate. No semantic understanding, misses synonyms, breaks easily with small text changes.
Historic Alternative
TF-IDF + Similarity (2000s-2010s)
Statistical word importance scoring. Later evolved into supervised ML (Naive Bayes, SVM) but still lacked semantic understanding.
⚠️ Not pursued: No semantic understanding, struggles with context and homonyms.
Tried Initially
Zero-Shot LLM (2020s)
LLMs classify text using pre-trained semantic understanding without examples. Understands that "plumber" relates to "plumbing" and "We help businesses grow online" signals SaaS/marketing.
⚠️ Better but still limited: Can drift without guidance when choosing from 1,000+ categories. Needs structure.
What I Built
Waterfall LLM Classification
Guide the LLM step-by-step: 2-digit sector (20 choices) → 4-digit group (5-10 choices) → 6-digit code (2-5 choices) → business vertical. Prevents drift, improves accuracy.
✅ Used Chat OSS 20B (LM Studio) for 5,500 domains. Slower (25-45s/domain, CPU/RAM-intensive) but free and achieved 100% 6-digit granularity.
Part 1: Building the Classification Database
The first challenge: classify 5,500 domains with high accuracy and granularity. I needed 6-digit NAICS codes (not just broad 2-digit sectors) to enable meaningful competitor matching, then layer on custom business verticals for even more nuance.
I tried keyword/regex matching first, hoping it would be "good enough" to let me focus on building the rest of the competitor discovery tool. It wasn't. With only a 4.4% success rate and no semantic understanding, I needed a better approach. After experimenting with zero-shot LLM classification (which got to 79%), I developed a waterfall approach that achieved 100% 6-digit granularity.
Technology Stack
First attempt: Keyword/regex matching - Fast but only 4.4% success rate, no semantic understanding
Next attempts: Grok API (xAI) and ChatGPT - Fast, accurate, but costs add up for 5,000+ domains
Production method: Chat OSS 20B via LM Studio on local CPU
Speed: 25-45 seconds per domain (CPU/RAM-intensive; a non-NVIDIA GPU couldn't be used for acceleration)
Total time: ~53 hours for 5,500 domains
Benefits: Free, full control, easy monitoring
Input Data
Crawled: Homepage, about page, products overview
Extracted: Title, meta description, H1 tags, body text
Average: 918 words per domain
Uploaded: Official NAICS 2022 reference docs to help LLM
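To make the extraction step concrete, here's a minimal sketch of pulling those fields from a page, assuming requests and BeautifulSoup; the helper name and URLs are illustrative, not the production crawler.

# Sketch of the crawl/extract step (illustrative, not the production crawler)
import requests
from bs4 import BeautifulSoup

def extract_page_signals(url):
    """Fetch a page and pull the fields used as classification input."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "meta_description": meta["content"] if meta and meta.has_attr("content") else "",
        "h1_tags": [h1.get_text(strip=True) for h1 in soup.find_all("h1")],
        "body_text": soup.get_text(separator=" ", strip=True),
    }

# Combine homepage, about page, and products overview into one input
pages = ["https://example.com/", "https://example.com/about", "https://example.com/products"]
signals = [extract_page_signals(u) for u in pages]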
Each Method Brought Significant Improvements to Classification Accuracy
1. Keyword/Regex Matching: 4.4% reached 6-digit specific codes; 10.6% stopped at 4-digit industry groups; 85% stopped at broad 2-digit sectors
2. Zero-Shot LLM: 79% reached 6-digit specific codes; 16% stopped at broader levels
3. Waterfall LLM Classification: 100% reached 6-digit specific codes ✅
Bottom line: 5,500 domains classified, 100% 6-digit codes (waterfall approach), 100% high confidence (all 4,500 in production).
Why the LLM Waterfall Approach
Instead of asking the LLM to pick from 1,030 NAICS codes at once, I guided it step-by-step through narrowing choices. This prevents the model from drifting between unrelated categories.
Waterfall Approach to Avoid Model Drift
1. Assign 2-Digit Sector (20 choices). Example: "51 - Information"
2. Assign 4-Digit Industry Group within that sector (5-10 choices). Example: "5112 - Software Publishers"
3. Assign 6-Digit Industry Code within that group (2-5 choices). Example: "511210 - Software Publishers"
4. Assign Business Vertical (for crowded codes like 541511). Example: "E-commerce Platforms" vs "Enterprise Cloud/SaaS" vs "Developer Tools"
Why Waterfall Works
Narrowing choices at each step prevents the LLM from jumping between unrelated categories. It's easier for the model to pick "Information" vs "Manufacturing" (Step 1), then "Software Publishers" vs "Data Processing" within Information (Step 2), than to choose correctly from 1,030 options at once.
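Here's a minimal sketch of what that stepwise loop can look like in practice. `llm_complete`, `naics_children`, and `verticals_for` are hypothetical stand-ins (any chat-completion client, plus candidate tables loaded from the NAICS 2022 reference docs), not the exact production prompts.

# Sketch of the waterfall loop: each step constrains the LLM to a short list.
# llm_complete() stands in for any chat-completion call (e.g., LM Studio's
# OpenAI-compatible endpoint); naics_children() and verticals_for() are
# hypothetical lookups over the NAICS 2022 reference and the vertical list.

def pick_one(site_text, level_name, candidates):
    options = "\n".join(f"- {code}: {title}" for code, title in candidates.items())
    prompt = (
        f"Classify this website into exactly one NAICS {level_name}.\n"
        f"Options:\n{options}\n\n"
        f"Website content:\n{site_text[:2000]}\n\n"
        "Answer with the code only."
    )
    return llm_complete(prompt).strip()

def waterfall_classify(site_text):
    sector = pick_one(site_text, "2-digit sector", naics_children(None))           # ~20 choices
    group = pick_one(site_text, "4-digit industry group", naics_children(sector))  # 5-10 choices
    code = pick_one(site_text, "6-digit industry code", naics_children(group))     # 2-5 choices
    vertical = pick_one(site_text, "business vertical", verticals_for(code))       # crowded codes only
    return {"sector": sector, "group": group, "code": code, "vertical": vertical}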
Why Business Verticals Matter
Even with 6-digit NAICS codes, some industries contain hundreds of domains that aren't true competitors. These crowded codes lack the granularity needed for meaningful competitor discovery. For example, NAICS 541511 "Custom Computer Programming Services" included over 500 domains ranging from Google to Shopify to Oracle. By adding a semantic "business vertical" layer on top of NAICS, we can finally distinguish true competitors.
Before: All Lumped Together
NAICS 541511: Custom Computer Programming
Google (Search)
Shopify (E-commerce)
Salesforce (CRM)
Oracle (Databases)
HubSpot (Marketing)
Atlassian (Dev Tools)
+ 494 more domains
❌ These are NOT competitors
After: Broken Into Verticals
E-commerce Platforms
Shopify, WooCommerce, BigCommerce, Wix...
Enterprise Cloud/SaaS
Salesforce, Oracle, SAP, ServiceNow...
Developer Tools
Atlassian, GitHub, GitLab, JetBrains...
Search Engines
Google, Bing, DuckDuckGo, Brave...
Marketing Automation
HubSpot, Marketo, ActiveCampaign...
+ 35 more verticals
✅ Now we can find real competitors
The vertical layer was essential: Without it, competitor discovery would suggest Oracle as a Shopify competitor, which makes no sense despite sharing NAICS code 541511. Business verticals bring semantic meaning to rigid classification codes.
Method 2: TF-IDF + Similarity
TF-IDF (Term Frequency-Inverse Document Frequency) weighs words by how important they are to a specific document. "Plumbing" appears often on Home Depot's site but rarely across all sites, making it a strong signal. Combine with cosine similarity to find the most similar NAICS category.
# TF-IDF classification algorithm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pre-compute NAICS code descriptions as TF-IDF vectors
naics_descriptions = {
    "444110": "Home centers, retail hardware stores...",
    "722511": "Full-service restaurants, cafeterias...",
    # ... 1,030 more codes
}
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
naics_vectors = vectorizer.fit_transform(naics_descriptions.values())

# For each website, compute similarity to all NAICS codes
website_vector = vectorizer.transform([homepage_text])
similarities = cosine_similarity(website_vector, naics_vectors)[0]

# Return top K most similar codes (typically K=40-80)
top_candidates = similarities.argsort()[-40:][::-1]
What Works
• Captures word importance, not just presence
• Excellent for candidate filtering (1,030 codes → 40 candidates)
• Handles synonyms better than pure keyword matching
• Computationally efficient with pre-indexed vectors
What Fails
• Can't distinguish "bank" (financial) from "bank" (riverbank)
How I used it: TF-IDF became my primary candidate filtering step. Instead of running expensive LLM inference on all 1,030 NAICS codes, I use TF-IDF to narrow down to 40-80 candidates, then apply smarter methods.
Part 2: Designing Intelligent Competitor Selection
After building the classification database, the next challenge was using it in production. UX Bench needed to handle two scenarios: known domains (fast lookup) and unknown domains (requires classification). Here's where I built a clever shortcut.
UX Bench Competitor Discovery Flow
Known Domain Path: User enters "walmart.com" → Database lookup → NAICS 455110 "Department Stores" → Return 5 competitors (⚡ sub-second)
Unknown Domain Path: User enters "new-startup.com" → Crawl + ask LLM "Which major brands is this similar to?" → Return matched competitors (⚡ 3-5 seconds)
The two-path system leverages pre-classified data for known domains and uses intelligent LLM shortcuts for unknown domains, avoiding expensive re-classification.
Why This Shortcut Works
Instead of running full classification (25-45s, NAICS versioning issues), asking "which major brand is this similar to?" takes 3-5 seconds and sidesteps NAICS versioning entirely while remaining accurate. LLMs excel at brand similarity matching.
Faster: 3-5 seconds vs 25-45 seconds
Simpler: No NAICS versioning or documentation issues
Leverage existing work: Reuses the 5,500 classified domains
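A minimal sketch of the two-path flow, assuming a `classified` dict keyed by domain and built from the production dataset; `llm_complete` and `top_competitors` are hypothetical stand-ins rather than the exact production code.

# Sketch of the dual-path flow. `classified` maps domain -> {naics, vertical};
# llm_complete() and top_competitors() are hypothetical stand-ins.

def find_competitors(domain, site_text=""):
    # Known-domain path: sub-second lookup against pre-classified data
    if domain in classified:
        entry = classified[domain]
        return top_competitors(entry["naics"], entry["vertical"], exclude=domain)

    # Unknown-domain path: ask the LLM for similar major brands (3-5 seconds)
    prompt = (
        "Which major, well-known brands is this website most similar to? "
        "Answer with up to 5 domains, comma-separated.\n\n"
        f"Content:\n{site_text[:2000]}"
    )
    brands = [b.strip().lower() for b in llm_complete(prompt).split(",")]
    # Map any matched brand back onto the classified dataset and reuse its class
    for brand in brands:
        if brand in classified:
            entry = classified[brand]
            return top_competitors(entry["naics"], entry["vertical"], exclude=domain)
    return []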
Sourcing Competitors with an Intelligent Rollup Strategy
If not enough competitors are found at the most specific level, the system automatically broadens the search (see the sketch below):
1. Start: 6-digit NAICS + Business Vertical (most specific)
2. Roll up: drop the vertical and match on the 6-digit NAICS code alone, then the 4-digit industry group, then the 2-digit sector
3. When many competitors exist, use Tranco rankings (global site popularity) to find similar-sized companies, using logarithmic distance to match Walmart (rank 150) with Target (rank 180), not a small local retailer (rank 500,000)
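Here's the rollup and rank-matching logic as a minimal sketch; `pool_for` is a hypothetical lookup over the classified dataset, and the exact level names and thresholds are illustrative.

# Sketch of the rollup + size matching. pool_for() is a hypothetical lookup
# returning candidate dicts ({"domain", "tranco_rank"}) at a given level.
import math

def log_rank_distance(rank_a, rank_b):
    # Walmart (150) vs Target (180): |log10(150) - log10(180)| ~= 0.08 -> close
    # Walmart (150) vs rank 500,000: ~= 3.5 -> filtered out
    return abs(math.log10(rank_a) - math.log10(rank_b))

def pick_competitors(domain, naics, vertical, rank, n=5):
    levels = [
        ("naics6+vertical", (naics, vertical)),  # most specific
        ("naics6", (naics, None)),               # drop the vertical
        ("naics4", (naics[:4], None)),           # industry group
        ("naics2", (naics[:2], None)),           # broad sector
    ]
    # Broaden the pool step by step until enough candidates exist
    for level_name, key in levels:
        pool = [c for c in pool_for(level_name, key) if c["domain"] != domain]
        if len(pool) >= n:
            # Prefer similar-sized sites via logarithmic Tranco distance
            pool.sort(key=lambda c: log_rank_distance(rank, c["tranco_rank"]))
            return pool[:n]
    return []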
Reflections & Lessons Learned
The system works: sub-second lookups for known domains, 3-5 second classification via brand similarity for unknowns, and intelligent rollup when needed. But competitor discovery has no "end state." Modern business complicates classification:
•
Cross-vertical operations: Amazon operates in retail, cloud, streaming, and logistics. Where does Shopify end and Square begin?
•
Market fluidity: Today's SaaS competitor might be tomorrow's platform partner.
•
Industry convergence: Financial services build payment processors, retailers launch ad platforms.
If you're building a similar system, don't chase perfect classification. Recognize when improvements plateau (keyword matching achieved 4.4%, zero-shot LLM reached 79%, waterfall pushed to 100%), build practical fallbacks (brand similarity for edge cases), and know when to stop refining. The hardest lesson: deciding when continued iteration delivers diminishing returns versus when to move on to higher-impact work. Sometimes "good enough for 90% of cases" is the right answer.
Method 3: Zero-Shot LLM
Modern language models like BART, GPT, and Gemini can classify text without explicit training on your specific categories. Feed them website content and NAICS descriptions, and they'll predict the best match using their pre-trained semantic understanding.
# Zero-shot classification with BART
import numpy as np
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# Chunk long homepage text (models have token limits);
# split_text_into_chunks is a small helper that splits on a token budget
chunks = split_text_into_chunks(homepage_text, max_tokens=500)

# Classify each chunk against candidate NAICS codes
candidate_labels = [
    "Home improvement retail stores",
    "Full-service restaurants",
    "Software publishing and SaaS"
]
scores = []
for chunk in chunks:
    result = classifier(chunk, candidate_labels)
    # The pipeline returns labels sorted by score, so realign
    # scores to the original candidate order before pooling
    by_label = dict(zip(result['labels'], result['scores']))
    scores.append([by_label[label] for label in candidate_labels])

# Max-pool across chunks to get final prediction
final_scores = np.max(scores, axis=0)
best_match = candidate_labels[np.argmax(final_scores)]
What Works
• True semantic understanding (knows "plumber" relates to "plumbing")
• No manual keyword curation needed
• Generalizes well to new domains and industries
• Can handle ambiguous or nuanced content
What Fails
• Slower than keyword or TF-IDF methods
• Requires chunking for long documents
• Can still misclassify edge cases without examples
• Token limits prevent analyzing full websites
Breakthrough moment: When I switched from keyword matching to zero-shot LLM, accuracy jumped from 68% to 81%. The model understood that "We help businesses grow online" signals SaaS/marketing, not just generic consulting.
Method 4: Few-Shot LLM with Examples
Era: 2020s • Accuracy: 80-88% • Speed: ⚡ Moderate to Slow (2-5s per domain)
Few-shot learning provides the LLM with examples of correctly classified domains before asking it to classify a new one. This teaches the model your specific classification standards and edge case handling.
# Few-shot prompting with GPT-4 / Gemini
prompt = """Classify the following website into a NAICS code.
Examples:
- walmart.com → 455110 (Department Stores)
- shopify.com → 541511 (Custom Computer Programming Services)
- mcdonalds.com → 722513 (Limited-Service Restaurants)
Now classify this website:
Domain: {domain}
Content: {homepage_excerpt}
NAICS Code:"""
# llm is a stand-in for any chat-completion client; extract_naics_code
# parses the 6-digit code out of the model's reply
response = llm.generate(
    prompt.format(domain=domain, homepage_excerpt=text[:1000])
)
predicted_code = extract_naics_code(response)
What Works
• Significantly more accurate than zero-shot
• Teaches model your specific classification standards
• Can include edge case examples to handle tricky domains
• Works with any capable LLM (GPT-4, Gemini, Claude)
What Fails
• Slower and more expensive (longer prompts, API costs)
• Quality depends on example selection
• Random examples may not cover the relevant edge cases
• Still limited by context window for very long content
The problem: Random examples help, but what if your test domain is a B2B SaaS company and all your examples are retail stores? You need *relevant* examples.
Method 5: Embedding Retrieval + Few-Shot
The breakthrough: use embeddings to retrieve the most semantically similar already-classified domains, then use those as few-shot examples. Instead of random examples, you get examples that are actually relevant to the domain you're classifying.
# Embedding-based example retrieval
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Pre-compute embeddings for all classified domains
model = SentenceTransformer('all-MiniLM-L6-v2')
classified_domains = load_classified_domains() # 4,790 domains
domain_embeddings = model.encode([d.text for d in classified_domains])
# For new domain, find most similar examples
new_domain_embedding = model.encode(new_domain_text)
similarities = cosine_similarity([new_domain_embedding], domain_embeddings)[0]
top_5_examples = similarities.argsort()[-5:][::-1]
# Use those examples in few-shot prompt
examples_text = "\n".join([
f"- {classified_domains[i].domain} → {classified_domains[i].naics}"
for i in top_5_examples
])
prompt = f"""Classify this website using these similar examples:
{examples_text}
Domain: {new_domain}
Content: {new_domain_text}
NAICS Code:"""
What Works
• Consistently high accuracy (85-92% in my tests)
• Examples are always relevant to the domain being classified
• Handles edge cases better because examples guide the model
• Embedding search is fast (sub-second with FAISS index)
What Fails
• More complex infrastructure (embeddings + vector search + LLM)
• Still slower than keyword or TF-IDF methods
• Embedding quality matters (bad embeddings = bad retrieval)
Real impact: Adding embedding-based retrieval boosted my accuracy from 81% (zero-shot) to 89% (retrieval + few-shot). The LLM stopped making category errors because the examples showed it exactly what "SaaS for healthcare" vs "healthcare services" looks like.
The Missing Layer: Business Verticals
The Problem NAICS Doesn't Solve
After classifying 4,790 companies with 89% accuracy, I discovered a fatal flaw: NAICS works great for distinguishing restaurants from law firms, but completely fails when everyone in tech gets code 541511 ("Custom Computer Programming Services").
All Classified as NAICS 541511:
• Google (Search Engine)
• Shopify (E-commerce Platform)
• Salesforce (CRM SaaS)
• Oracle (Enterprise Software)
• HubSpot (Marketing Automation)
• Atlassian (Developer Tools)
These companies are not competitors, yet NAICS sees them as identical.
The solution? Add a second classification layer: Business Verticals. These are human-readable, semantic groupings that sit between NAICS codes and individual companies.
• Manually curated: I defined ~40 verticals based on patterns in the classified data
• LLM-assigned: After NAICS classification, a second LLM pass assigns the vertical
• Human-readable: "Enterprise Cloud/SaaS" is clearer than "541511"
• Competitor-focused: Verticals group actual competitors, not just similar business models
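As a rough sketch of that second pass, the vertical prompt can be as simple as a constrained pick-one question; the vertical subset and `llm_complete` helper below are illustrative, not the production prompt.

# Sketch of the second LLM pass: assign a business vertical within a crowded
# NAICS code. VERTICALS_541511 is a trimmed, illustrative subset of the ~40
# manually curated verticals; llm_complete() is a stand-in chat call.

VERTICALS_541511 = [
    "E-commerce Platforms",
    "Enterprise Cloud/SaaS",
    "Developer Tools",
    "Search Engines",
    "Marketing Automation",
]

def assign_vertical(domain, site_text, verticals):
    options = "\n".join(f"- {v}" for v in verticals)
    prompt = (
        "This website is classified under NAICS 541511. "
        f"Pick the single best business vertical from this list:\n{options}\n\n"
        f"Domain: {domain}\nContent:\n{site_text[:2000]}\n\n"
        "Answer with the vertical name only."
    )
    return llm_complete(prompt).strip()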
Method 6: Fine-Tuned Model (Not Pursued)
Era: 2020s • Accuracy: 90-95%+ • Speed: ⚡⚡ Fast (once trained) • Cost: High (training, maintenance)
The theoretical best approach: fine-tune a model specifically for NAICS classification using thousands of labeled examples. This would likely achieve 90-95%+ accuracy and fast inference.
Why It Would Work
• Model learns exact NAICS classification patterns
• Fast inference (no few-shot prompting overhead)
• Can be optimized for specific accuracy/speed tradeoffs
• Best theoretical accuracy ceiling
Why I Didn't Do It
• Requires significant labeled training data (1,000s of examples)
• Training and infrastructure costs
• Model maintenance as NAICS codes evolve
• 89% accuracy with embeddings + LLM was "good enough"
Practical reality: Fine-tuning would push accuracy from 89% to maybe 93%, but the engineering effort wasn't justified for a feature that was already working well. Sometimes "good enough" beats "theoretically optimal."
The Final Architecture: Hybrid Ensemble
After testing all six methods, I didn't pick just one. I built an ensemble that combines the best of each approach, weighted by their strengths.
UX Bench Classification Pipeline
1. TF-IDF Candidate Filtering: narrow 1,030 NAICS codes down to 40-80 candidates (fast, eliminates obviously wrong categories)
2. Zero-Shot LLM Classification (55% weight): primary signal, using facebook/bart-large-mnli for semantic understanding
3. NAICSkit Library (25% weight): rule-based NAICS tool providing an additional signal
4. Keyword/Regex Matching (20% weight): fast baseline signal for unambiguous cases
5. Vertical Classification (LLM): second pass assigning a semantic vertical for competitor matching
A sketch of the weighted vote follows below.
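Here's a minimal sketch of how such a weighted vote can be combined; the score-dict shapes and the simple weighted-sum combiner are assumptions rather than the exact production code, though the weights mirror the pipeline above.

# Sketch of the weighted ensemble vote. Each score dict maps NAICS code ->
# confidence from one method; the weighted-sum combiner is illustrative.

WEIGHTS = {"zero_shot": 0.55, "naicskit": 0.25, "keyword": 0.20}

def ensemble_classify(candidates, zero_shot, naicskit, keyword):
    signals = {"zero_shot": zero_shot, "naicskit": naicskit, "keyword": keyword}
    totals = {}
    for code in candidates:  # the 40-80 TF-IDF-filtered candidates, not all 1,030
        totals[code] = sum(
            WEIGHTS[name] * scores.get(code, 0.0)
            for name, scores in signals.items()
        )
    # Highest combined score wins
    return max(totals, key=totals.get)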
Evaluate if your classification framework has sufficient depth. If your framework lumps competitors together (like NAICS 541511 grouping Google, Shopify, and Salesforce), you may need to augment it with semantic layers that better match your domain.
Guide LLMs, don't let them freewheel. Waterfall classification (20 → 5 → 2 choices per step) beats "pick from 1,030" approaches. Narrowing choices prevents the model from drifting between unrelated categories.
Consider local LLMs for prototyping and batch work. If you have a strong GPU, local models let you validate your approach without API costs. Once proven, evaluate whether you need more powerful API models or if your local setup meets production needs.
Step back when you hit an impasse. I spent days trying to optimize the unknown domain classification, even brainstorming with several LLMs. Nothing worked. Taking a step back, I rethought the challenge entirely and found a completely different approach (brand similarity matching). The new solution worked beautifully and became the production implementation.
Know when to stop refining. The hardest decision isn't improving your system, it's recognizing when continued iteration delivers diminishing returns. Keyword matching achieved 4.4%, zero-shot LLM reached 79%, waterfall pushed to 100%. Each improvement had clear value, but chasing perfection beyond "good enough for 90% of cases" often means sacrificing higher-impact work elsewhere.
See It In Action
The classification system and intelligent competitor discovery described in this article power UX Bench. Enter any domain to see the system in action: known domains get instant results via database lookup, unknown domains use the LLM brand similarity shortcut.