Building a Competitor Discovery System: LLMs, Classification, and Guardrails
Building a domain classification system with LLM waterfall prompting, then creating a dual-path approach that discovers a website's industry and competitors in seconds, whether it's already in the dataset or being analyzed for the first time.
When I set out to build automatic competitor discovery for UX Bench, I thought the problem was simple: classify websites by industry, find 5 similar competitors. I was wrong. What started as a straightforward classification task became a journey through LLM-guided classification, NAICS code limitations, and ultimately a clever shortcut that makes the whole system work in production. Here's how I classified 5,500 domains at 100% 6-digit NAICS granularity (4,500 made it into production after filtering), then layered on custom business verticals to build a system that handles both known and unknown domains intelligently.
The Challenge: Automatically Finding 5 Relevant Competitors
UX Bench helps users benchmark their site's Core Web Vitals against competitors. For that to work, I needed to automatically identify 5 relevant competitors for any website. A user analyzing walmart.com should see target.com and costco.com, not random e-commerce sites. The system needed to work for thousands of known domains AND handle unknown startups users might analyze.
The Requirements
• Accurate: Walmart's competitors should be Target and Costco, not Etsy and Wayfair
• Handle known domains: Fast lookup for 4,000+ classified domains
• Handle unknown domains: Real-time classification when users analyze new sites
• Nuanced for newer industries: NAICS code 541511 lumps Google, Shopify, and Oracle together even though they aren't competitors; despite updates every 5 years, NAICS still lacks sufficient granularity for newer digital business models
• Intelligent fallback: Roll up to broader categories if specific ones are too small
This article tells two connected stories: Part 1 covers how I built the classification database (5,500 domains with increasingly granular NAICS codes and custom business verticals). Part 2 shows how I designed an intelligent system that handles both known domains (fast lookup) and unknown domains (brand similarity matching), with automatic rollup strategies when granular matches are sparse.
Final Results
• ~5,500 domains classified
• 100% 6-digit NAICS codes
• 100% high confidence
• ~4,500 in production
Why 5,500 became 4,500: After achieving 100% 6-digit NAICS classification with waterfall prompting, we filtered to domains with CrUX data (real user performance metrics), USA-focused English-language sites, and professional-appropriate content. Then we layered on custom business verticals (like "Traditional Banking" vs "Digital Banking") for even more nuanced competitor matching.
Why Domain Classification Is Harder Than It Looks
Before diving into the methods, it's important to understand why using NAICS (North American Industry Classification System) codes for domain classification is tricky.
NAICS Updates Every 5 Years
The system changes regularly (2022, 2017, 2012...). Codes get retired or merged. LLMs trained on older data may use outdated classifications. Example: Code 541513 "Computer Facilities Management Services" was retired in 2017 and merged into 541519.
Regional Variations
While standardized across the US, Canada, and Mexico, some codes are unique to each country. LLMs may conflate different regional versions, leading to classification errors for international companies.
Not Designed for Digital Businesses
NAICS code 541511 "Custom Computer Programming Services" includes Google (search), Shopify (e-commerce), Salesforce (CRM), and Oracle (databases). They're not competitors, yet NAICS sees them as identical.
The solution? Layer semantic "business verticals" on top of NAICS codes. More on that later.
How Text Classification Evolved (Brief History)
Before showing what I actually built, here's a quick look at how classification approaches evolved over the past 30 years. The first three are educational references showing historical context for the waterfall approach I developed.
Historic Alternative
Keyword/Regex Matching (1990s-2000s)
Simple substring matching: if text contains "restaurant" + "menu" → food service. Lightning fast (microseconds) but brittle and requires manual keyword curation for 1,000+ NAICS codes.
⚠️ Initial attempt: Achieved only 4.4% success rate. No semantic understanding, misses synonyms, breaks easily with small text changes.
Historic Alternative
TF-IDF + Similarity (2000s-2010s)
Statistical word importance scoring. Later evolved into supervised ML (Naive Bayes, SVM) but still lacked semantic understanding.
⚠️ Not pursued: No semantic understanding, struggles with context and homonyms.
Tried Initially
Zero-Shot LLM (2020s)
LLMs classify text using pre-trained semantic understanding without examples. Understands that "plumber" relates to "plumbing" and "We help businesses grow online" signals SaaS/marketing.
⚠️ Better but still limited: Can drift without guidance when choosing from 1,000+ categories. Needs structure.
What I Built
Waterfall LLM Classification
Guide the LLM step-by-step: 2-digit sector (20 choices) → 4-digit group (5-10 choices) → 6-digit code (2-5 choices) → business vertical. Prevents drift, improves accuracy.
✅ Used Chat OSS 20B (LM Studio) for 5,500 domains. Slower (25-45s/domain, CPU/RAM-intensive) but free and achieved 100% 6-digit granularity.
Part 1: Building the Classification Database
The first challenge: classify 5,500 domains with high accuracy and granularity. I needed 6-digit NAICS codes (not just broad 2-digit sectors) to enable meaningful competitor matching, then layer on custom business verticals for even more nuance.
I tried keyword/regex matching first, hoping it would be "good enough" to let me focus on building the rest of the competitor discovery tool. It wasn't. With only a 4.4% success rate and no semantic understanding, I needed a better approach. After experimenting with zero-shot LLM classification (which got to 79%), I developed a waterfall approach that achieved 100% 6-digit granularity.
Technology Stack
First attempt: Keyword/regex matching - Fast but only 4.4% success rate, no semantic understanding
Next attempts: Grok API (xAI) and ChatGPT - Fast, accurate, but costs add up for 5,000+ domains
Production method: Chat OSS 20B via LM Studio on local CPU
Speed: 25-45 seconds per domain (CPU/RAM-intensive; a non-NVIDIA GPU couldn't be used for acceleration)
Total time: ~53 hours for 5,500 domains
Benefits: Free, full control, easy monitoring
Input Data
Crawled: Homepage, about page, products overview
Extracted: Title, meta description, H1 tags, body text
Average: 918 words per domain
Uploaded: Official NAICS 2022 reference docs to help LLM
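To make the extraction step concrete, here's a minimal sketch of pulling those fields from a page, assuming requests and BeautifulSoup; the helper name and URLs are illustrative, not the production crawler.

# Sketch of the crawl/extract step (illustrative, not the production crawler)
import requests
from bs4 import BeautifulSoup

def extract_page_signals(url):
    """Fetch a page and pull the fields used as classification input."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "meta_description": meta["content"] if meta and meta.has_attr("content") else "",
        "h1_tags": [h1.get_text(strip=True) for h1 in soup.find_all("h1")],
        "body_text": soup.get_text(separator=" ", strip=True),
    }

# Combine homepage, about page, and products overview into one input
pages = ["https://example.com/", "https://example.com/about", "https://example.com/products"]
signals = [extract_page_signals(u) for u in pages]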
Each Method Brought Significant Improvements to Classification Accuracy
1. Keyword/Regex Matching: 4.4% reached 6-digit specific codes; 10.6% stopped at 4-digit industry groups; 85% stopped at broad 2-digit sectors
2. Zero-Shot LLM: 79% reached 6-digit specific codes; 16% stopped at broader levels
3. Waterfall LLM Classification: 100% reached 6-digit specific codes ✅
Bottom line: 5,500 domains classified, 100% 6-digit codes (waterfall approach), 100% high confidence (all 4,500 in production).
Why the LLM Waterfall Approach
Instead of asking the LLM to pick from 1,030 NAICS codes at once, I guided it step-by-step through narrowing choices. This prevents the model from drifting between unrelated categories.
Waterfall Approach to Avoid Model Drift
1. Assign 2-Digit Sector (20 choices). Example: "51 - Information"
2. Assign 4-Digit Industry Group within that sector (5-10 choices). Example: "5112 - Software Publishers"
3. Assign 6-Digit Industry Code within that group (2-5 choices). Example: "511210 - Software Publishers"
4. Assign Business Vertical (for crowded codes like 541511). Example: "E-commerce Platforms" vs "Enterprise Cloud/SaaS" vs "Developer Tools"
Why Waterfall Works
Narrowing choices at each step prevents the LLM from jumping between unrelated categories. It's easier for the model to pick "Information" vs "Manufacturing" (Step 1), then "Software Publishers" vs "Data Processing" within Information (Step 2), than to choose correctly from 1,030 options at once.
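Here's a minimal sketch of what that stepwise loop can look like in practice. `llm_complete`, `naics_children`, and `verticals_for` are hypothetical stand-ins (any chat-completion client, plus candidate tables loaded from the NAICS 2022 reference docs), not the exact production prompts.

# Sketch of the waterfall loop: each step constrains the LLM to a short list.
# llm_complete() stands in for any chat-completion call (e.g., LM Studio's
# OpenAI-compatible endpoint); naics_children() and verticals_for() are
# hypothetical lookups over the NAICS 2022 reference and the vertical list.

def pick_one(site_text, level_name, candidates):
    options = "\n".join(f"- {code}: {title}" for code, title in candidates.items())
    prompt = (
        f"Classify this website into exactly one NAICS {level_name}.\n"
        f"Options:\n{options}\n\n"
        f"Website content:\n{site_text[:2000]}\n\n"
        "Answer with the code only."
    )
    return llm_complete(prompt).strip()

def waterfall_classify(site_text):
    sector = pick_one(site_text, "2-digit sector", naics_children(None))           # ~20 choices
    group = pick_one(site_text, "4-digit industry group", naics_children(sector))  # 5-10 choices
    code = pick_one(site_text, "6-digit industry code", naics_children(group))     # 2-5 choices
    vertical = pick_one(site_text, "business vertical", verticals_for(code))       # crowded codes only
    return {"sector": sector, "group": group, "code": code, "vertical": vertical}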
Why Business Verticals Matter
Even with 6-digit NAICS codes, some industries contain hundreds of domains that aren't true competitors. These crowded codes lack the granularity needed for meaningful competitor discovery. For example, NAICS 541511 "Custom Computer Programming Services" included over 500 domains ranging from Google to Shopify to Oracle. By adding a semantic "business vertical" layer on top of NAICS, we can finally distinguish true competitors.
Before: All Lumped Together
NAICS 541511: Custom Computer Programming
Google (Search)
Shopify (E-commerce)
Salesforce (CRM)
Oracle (Databases)
HubSpot (Marketing)
Atlassian (Dev Tools)
+ 494 more domains
❌ These are NOT competitors
After: Broken Into Verticals
E-commerce Platforms
Shopify, WooCommerce, BigCommerce, Wix...
Enterprise Cloud/SaaS
Salesforce, Oracle, SAP, ServiceNow...
Developer Tools
Atlassian, GitHub, GitLab, JetBrains...
Search Engines
Google, Bing, DuckDuckGo, Brave...
Marketing Automation
HubSpot, Marketo, ActiveCampaign...
+ 35 more verticals
✅ Now we can find real competitors
The vertical layer was essential: Without it, competitor discovery would suggest Oracle as a Shopify competitor, which makes no sense despite sharing NAICS code 541511. Business verticals bring semantic meaning to rigid classification codes.
Method 2: TF-IDF + Similarity
TF-IDF (Term Frequency-Inverse Document Frequency) weighs words by how important they are to a specific document. "Plumbing" appears often on Home Depot's site but rarely across all sites, making it a strong signal. Combine with cosine similarity to find the most similar NAICS category.
# TF-IDF classification algorithm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pre-compute NAICS code descriptions as TF-IDF vectors
naics_descriptions = {
    "444110": "Home centers, retail hardware stores...",
    "722511": "Full-service restaurants, cafeterias...",
    # ... 1,030 more codes
}
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
naics_vectors = vectorizer.fit_transform(naics_descriptions.values())

# For each website, compute similarity to all NAICS codes
website_vector = vectorizer.transform([homepage_text])
similarities = cosine_similarity(website_vector, naics_vectors)[0]

# Return top K most similar codes (typically K=40-80)
top_candidates = similarities.argsort()[-40:][::-1]
What Works
• Captures word importance, not just presence
• Excellent for candidate filtering (1,030 codes → 40 candidates)
• Handles synonyms better than pure keyword matching
• Computationally efficient with pre-indexed vectors
What Fails
• Can't distinguish "bank" (financial) from "bank" (riverbank)
How I used it: TF-IDF became my primary candidate filtering step. Instead of running expensive LLM inference on all 1,030 NAICS codes, I use TF-IDF to narrow down to 40-80 candidates, then apply smarter methods.
Part 2: Designing Intelligent Competitor Selection
After building the classification database, the next challenge was using it in production. UX Bench needed to handle two scenarios: known domains (fast lookup) and unknown domains (requires classification). Here's where I built a clever shortcut.
UX Bench Competitor Discovery Flow
Known Domain Path: User enters "walmart.com" → Database lookup → NAICS 455110 "Department Stores" → Return 5 competitors (⚡ sub-second)
Unknown Domain Path: User enters "new-startup.com" → Crawl + ask LLM "Which major brands is this similar to?" → Return matched competitors (⚡ 3-5 seconds)
The two-path system leverages pre-classified data for known domains and uses intelligent LLM shortcuts for unknown domains, avoiding expensive re-classification.
Why This Shortcut Works
Instead of running full classification (25-45s, NAICS versioning issues), asking "which major brand is this similar to?" takes 3-5 seconds and sidesteps NAICS versioning entirely while remaining accurate. LLMs excel at brand similarity matching.
Faster: 3-5 seconds vs 25-45 seconds
Simpler: No NAICS versioning or documentation issues
Leverage existing work: Reuses the 5,500 classified domains
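A minimal sketch of the two-path flow, assuming a `classified` dict keyed by domain and built from the production dataset; `llm_complete` and `top_competitors` are hypothetical stand-ins rather than the exact production code.

# Sketch of the dual-path flow. `classified` maps domain -> {naics, vertical};
# llm_complete() and top_competitors() are hypothetical stand-ins.

def find_competitors(domain, site_text=""):
    # Known-domain path: sub-second lookup against pre-classified data
    if domain in classified:
        entry = classified[domain]
        return top_competitors(entry["naics"], entry["vertical"], exclude=domain)

    # Unknown-domain path: ask the LLM for similar major brands (3-5 seconds)
    prompt = (
        "Which major, well-known brands is this website most similar to? "
        "Answer with up to 5 domains, comma-separated.\n\n"
        f"Content:\n{site_text[:2000]}"
    )
    brands = [b.strip().lower() for b in llm_complete(prompt).split(",")]
    # Map any matched brand back onto the classified dataset and reuse its class
    for brand in brands:
        if brand in classified:
            entry = classified[brand]
            return top_competitors(entry["naics"], entry["vertical"], exclude=domain)
    return []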
Sourcing Competitors with an Intelligent Rollup Strategy
If not enough competitors are found at the most specific level, the system automatically broadens the search (see the sketch below):
1. Start: 6-digit NAICS + Business Vertical (most specific)
2. Roll up: drop the vertical and match on the 6-digit NAICS code alone, then the 4-digit industry group, then the 2-digit sector
3. When many competitors exist, use Tranco rankings (global site popularity) to find similar-sized companies, using logarithmic distance to match Walmart (rank 150) with Target (rank 180), not a small local retailer (rank 500,000)
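Here's the rollup and rank-matching logic as a minimal sketch; `pool_for` is a hypothetical lookup over the classified dataset, and the exact level names and thresholds are illustrative.

# Sketch of the rollup + size matching. pool_for() is a hypothetical lookup
# returning candidate dicts ({"domain", "tranco_rank"}) at a given level.
import math

def log_rank_distance(rank_a, rank_b):
    # Walmart (150) vs Target (180): |log10(150) - log10(180)| ~= 0.08 -> close
    # Walmart (150) vs rank 500,000: ~= 3.5 -> filtered out
    return abs(math.log10(rank_a) - math.log10(rank_b))

def pick_competitors(domain, naics, vertical, rank, n=5):
    levels = [
        ("naics6+vertical", (naics, vertical)),  # most specific
        ("naics6", (naics, None)),               # drop the vertical
        ("naics4", (naics[:4], None)),           # industry group
        ("naics2", (naics[:2], None)),           # broad sector
    ]
    # Broaden the pool step by step until enough candidates exist
    for level_name, key in levels:
        pool = [c for c in pool_for(level_name, key) if c["domain"] != domain]
        if len(pool) >= n:
            # Prefer similar-sized sites via logarithmic Tranco distance
            pool.sort(key=lambda c: log_rank_distance(rank, c["tranco_rank"]))
            return pool[:n]
    return []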
Reflections & Lessons Learned
The system works: sub-second lookups for known domains, 3-5 second classification via brand similarity for unknowns, and intelligent rollup when needed. But competitor discovery has no "end state." Modern business complicates classification:
•
Cross-vertical operations: Amazon operates in retail, cloud, streaming, and logistics. Where does Shopify end and Square begin?
•
Market fluidity: Today's SaaS competitor might be tomorrow's platform partner.
•
Industry convergence: Financial services build payment processors, retailers launch ad platforms.
If you're building a similar system, don't chase perfect classification. Recognize when improvements plateau (keyword matching achieved 4.4%, zero-shot LLM reached 79%, waterfall pushed to 100%), build practical fallbacks (brand similarity for edge cases), and know when to stop refining. The hardest lesson: deciding when continued iteration delivers diminishing returns versus when to move on to higher-impact work. Sometimes "good enough for 90% of cases" is the right answer.
Method 3: Zero-Shot LLM
Modern language models like BART, GPT, and Gemini can classify text without explicit training on your specific categories. Feed them website content and NAICS descriptions, and they'll predict the best match using their pre-trained semantic understanding.
# Zero-shot classification with BART
import numpy as np
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# Chunk long homepage text (models have token limits);
# split_text_into_chunks is a small helper that splits on a token budget
chunks = split_text_into_chunks(homepage_text, max_tokens=500)

# Classify each chunk against candidate NAICS codes
candidate_labels = [
    "Home improvement retail stores",
    "Full-service restaurants",
    "Software publishing and SaaS"
]
scores = []
for chunk in chunks:
    result = classifier(chunk, candidate_labels)
    # The pipeline returns labels sorted by score, so realign
    # scores to the original candidate order before pooling
    by_label = dict(zip(result['labels'], result['scores']))
    scores.append([by_label[label] for label in candidate_labels])

# Max-pool across chunks to get final prediction
final_scores = np.max(scores, axis=0)
best_match = candidate_labels[np.argmax(final_scores)]
What Works
• True semantic understanding (knows "plumber" relates to "plumbing")
• No manual keyword curation needed
• Generalizes well to new domains and industries
• Can handle ambiguous or nuanced content
What Fails
• Slower than keyword or TF-IDF methods
• Requires chunking for long documents
• Can still misclassify edge cases without examples
• Token limits prevent analyzing full websites
Breakthrough moment: When I switched from keyword matching to zero-shot LLM, accuracy jumped from 68% to 81%. The model understood that "We help businesses grow online" signals SaaS/marketing, not just generic consulting.
Method 4: Few-Shot LLM with Examples
Era: 2020s • Accuracy: 80-88% • Speed: ⚡ Moderate to Slow (2-5s per domain)
Few-shot learning provides the LLM with examples of correctly classified domains before asking it to classify a new one. This teaches the model your specific classification standards and edge case handling.
# Few-shot prompting with GPT-4 / Gemini
prompt = """Classify the following website into a NAICS code.
Examples:
- walmart.com → 455110 (Department Stores)
- shopify.com → 541511 (Custom Computer Programming Services)
- mcdonalds.com → 722513 (Limited-Service Restaurants)
Now classify this website:
Domain: {domain}
Content: {homepage_excerpt}
NAICS Code:"""
# llm is a stand-in for any chat-completion client; extract_naics_code
# parses the 6-digit code out of the model's reply
response = llm.generate(
    prompt.format(domain=domain, homepage_excerpt=text[:1000])
)
predicted_code = extract_naics_code(response)
What Works
• Significantly more accurate than zero-shot
• Teaches model your specific classification standards
• Can include edge case examples to handle tricky domains
• Works with any capable LLM (GPT-4, Gemini, Claude)
What Fails
• Slower and more expensive (longer prompts, API costs)
• Quality depends on example selection
• Random examples may not cover the relevant edge cases
• Still limited by context window for very long content
The problem: Random examples help, but what if your test domain is a B2B SaaS company and all your examples are retail stores? You need *relevant* examples.
Method 5: Embedding Retrieval + Few-Shot
The breakthrough: use embeddings to retrieve the most semantically similar already-classified domains, then use those as few-shot examples. Instead of random examples, you get examples that are actually relevant to the domain you're classifying.
# Embedding-based example retrieval
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Pre-compute embeddings for all classified domains
model = SentenceTransformer('all-MiniLM-L6-v2')
classified_domains = load_classified_domains() # 4,790 domains
domain_embeddings = model.encode([d.text for d in classified_domains])
# For new domain, find most similar examples
new_domain_embedding = model.encode(new_domain_text)
similarities = cosine_similarity([new_domain_embedding], domain_embeddings)[0]
top_5_examples = similarities.argsort()[-5:][::-1]
# Use those examples in few-shot prompt
examples_text = "\n".join([
f"- {classified_domains[i].domain} → {classified_domains[i].naics}"
for i in top_5_examples
])
prompt = f"""Classify this website using these similar examples:
{examples_text}
Domain: {new_domain}
Content: {new_domain_text}
NAICS Code:"""
What Works
• Consistently high accuracy (85-92% in my tests)
• Examples are always relevant to the domain being classified
• Handles edge cases better because examples guide the model
• Embedding search is fast (sub-second with FAISS index)
What Fails
• More complex infrastructure (embeddings + vector search + LLM)
• Still slower than keyword or TF-IDF methods
• Embedding quality matters (bad embeddings = bad retrieval)
Real impact: Adding embedding-based retrieval boosted my accuracy from 81% (zero-shot) to 89% (retrieval + few-shot). The LLM stopped making category errors because the examples showed it exactly what "SaaS for healthcare" vs "healthcare services" looks like.
The Missing Layer: Business Verticals
The Problem NAICS Doesn't Solve
After classifying 4,790 companies with 89% accuracy, I discovered a fatal flaw: NAICS works great for distinguishing restaurants from law firms, but completely fails when everyone in tech gets code 541511 ("Custom Computer Programming Services").
All Classified as NAICS 541511:
• Google (Search Engine)
• Shopify (E-commerce Platform)
• Salesforce (CRM SaaS)
• Oracle (Enterprise Software)
• HubSpot (Marketing Automation)
• Atlassian (Developer Tools)
These companies are not competitors, yet NAICS sees them as identical.
The solution? Add a second classification layer: Business Verticals. These are human-readable, semantic groupings that sit between NAICS codes and individual companies.
• Manually curated: I defined ~40 verticals based on patterns in the classified data
• LLM-assigned: After NAICS classification, a second LLM pass assigns the vertical
• Human-readable: "Enterprise Cloud/SaaS" is clearer than "541511"
• Competitor-focused: Verticals group actual competitors, not just similar business models
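As a rough sketch of that second pass, the vertical prompt can be as simple as a constrained pick-one question; the vertical subset and `llm_complete` helper below are illustrative, not the production prompt.

# Sketch of the second LLM pass: assign a business vertical within a crowded
# NAICS code. VERTICALS_541511 is a trimmed, illustrative subset of the ~40
# manually curated verticals; llm_complete() is a stand-in chat call.

VERTICALS_541511 = [
    "E-commerce Platforms",
    "Enterprise Cloud/SaaS",
    "Developer Tools",
    "Search Engines",
    "Marketing Automation",
]

def assign_vertical(domain, site_text, verticals):
    options = "\n".join(f"- {v}" for v in verticals)
    prompt = (
        "This website is classified under NAICS 541511. "
        f"Pick the single best business vertical from this list:\n{options}\n\n"
        f"Domain: {domain}\nContent:\n{site_text[:2000]}\n\n"
        "Answer with the vertical name only."
    )
    return llm_complete(prompt).strip()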
Method 6: Fine-Tuned Model (Not Pursued)
Era: 2020s • Accuracy: 90-95%+ • Speed: ⚡⚡ Fast (once trained) • Cost: High (training, maintenance)
The theoretical best approach: fine-tune a model specifically for NAICS classification using thousands of labeled examples. This would likely achieve 90-95%+ accuracy and fast inference.
Why It Would Work
• Model learns exact NAICS classification patterns
• Fast inference (no few-shot prompting overhead)
• Can be optimized for specific accuracy/speed tradeoffs
• Best theoretical accuracy ceiling
Why I Didn't Do It
• Requires significant labeled training data (1,000s of examples)
• Training and infrastructure costs
• Model maintenance as NAICS codes evolve
• 89% accuracy with embeddings + LLM was "good enough"
Practical reality: Fine-tuning would push accuracy from 89% to maybe 93%, but the engineering effort wasn't justified for a feature that was already working well. Sometimes "good enough" beats "theoretically optimal."
The Final Architecture: Hybrid Ensemble
After testing all six methods, I didn't pick just one. I built an ensemble that combines the best of each approach, weighted by their strengths.
UX Bench Classification Pipeline
1. TF-IDF Candidate Filtering: narrow 1,030 NAICS codes down to 40-80 candidates (fast, eliminates obviously wrong categories)
2. Zero-Shot LLM Classification (55% weight): primary signal, using facebook/bart-large-mnli for semantic understanding
3. NAICSkit Library (25% weight): rule-based NAICS tool providing an additional signal
4. Keyword/Regex Matching (20% weight): fast baseline signal for unambiguous cases
5. Vertical Classification (LLM): second pass assigning a semantic vertical for competitor matching
A sketch of the weighted vote follows below.
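Here's a minimal sketch of how such a weighted vote can be combined; the score-dict shapes and the simple weighted-sum combiner are assumptions rather than the exact production code, though the weights mirror the pipeline above.

# Sketch of the weighted ensemble vote. Each score dict maps NAICS code ->
# confidence from one method; the weighted-sum combiner is illustrative.

WEIGHTS = {"zero_shot": 0.55, "naicskit": 0.25, "keyword": 0.20}

def ensemble_classify(candidates, zero_shot, naicskit, keyword):
    signals = {"zero_shot": zero_shot, "naicskit": naicskit, "keyword": keyword}
    totals = {}
    for code in candidates:  # the 40-80 TF-IDF-filtered candidates, not all 1,030
        totals[code] = sum(
            WEIGHTS[name] * scores.get(code, 0.0)
            for name, scores in signals.items()
        )
    # Highest combined score wins
    return max(totals, key=totals.get)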
Evaluate if your classification framework has sufficient depth. If your framework lumps competitors together (like NAICS 541511 grouping Google, Shopify, and Salesforce), you may need to augment it with semantic layers that better match your domain.
Guide LLMs, don't let them freewheel. Waterfall classification (20 → 5 → 2 choices per step) beats "pick from 1,030" approaches. Narrowing choices prevents the model from drifting between unrelated categories.
Consider local LLMs for prototyping and batch work. If you have a strong GPU, local models let you validate your approach without API costs. Once proven, evaluate whether you need more powerful API models or if your local setup meets production needs.
Step back when you hit an impasse. I spent days trying to optimize the unknown domain classification, even brainstorming with several LLMs. Nothing worked. Taking a step back, I rethought the challenge entirely and found a completely different approach (brand similarity matching). The new solution worked beautifully and became the production implementation.
Know when to stop refining. The hardest decision isn't improving your system, it's recognizing when continued iteration delivers diminishing returns. Keyword matching achieved 4.4%, zero-shot LLM reached 79%, waterfall pushed to 100%. Each improvement had clear value, but chasing perfection beyond "good enough for 90% of cases" often means sacrificing higher-impact work elsewhere.
See It In Action
The classification system and intelligent competitor discovery described in this article power UX Bench. Enter any domain to see the system in action: known domains get instant results via database lookup, unknown domains use the LLM brand similarity shortcut.