Ad Creative Clustering: Ward Hierarchical and HDBSCAN

When a marketing team runs hundreds of creatives across campaigns, the useful question is not "which creatives are performing" but "which creative approaches are performing."

For example: "lifestyle imagery with testimonial copy" versus "product-only with discount callout" versus "UGC-style with problem framing."

Manual tagging at scale is not viable.

Rule-based grouping misses semantic nuance.

Here is the clustering pipeline I use to identify and track creative approaches automatically.

The four phases

Embed - convert each creative into a vector using Gemini's embedding model
Cluster - group creatives by semantic similarity (Ward or HDBSCAN, selected by silhouette score)
Name - send cluster members to Gemini to generate a descriptive approach name
Track - maintain approach identity across analysis periods using member overlap

Phase 1: Embedding

Each creative produces an embedding from its combined signals:

def build_creative_text(creative: Creative) -> str:
    parts = [
        creative.headline or "",
        creative.description or "",
        " ".join(creative.tags or []),
        creative.format or "",
        creative.placement or "",
    ]
    return " | ".join(p for p in parts if p)
 
embedding = gemini_client.embed_content(
    model="models/text-embedding-004",
    content=build_creative_text(creative),
    task_type="CLUSTERING",
)

Because tags, format, and placement are embedded alongside the copy, two creatives with different headlines but the same structural approach should end up close in embedding space.
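A quick way to spot-check that closeness is to compare cosine similarity for a few pairs you already know belong together. This is a minimal sketch using plain Python lists; the helper name is illustrative and it is not part of the pipeline above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Pairs that share an approach should score noticeably higher than random pairs from the same cohort. If they do not, the text fed into the embedding probably needs more structural signal.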

Phase 2: Algorithm selection

Neither Ward hierarchical nor HDBSCAN is universally better. The right one depends on cohort size and structure.

Ward works well for small cohorts (under ~200 creatives). It forces every creative into a cluster, produces balanced groups, and handles high-dimensional embeddings cleanly. It requires specifying cluster count, which I solve with a silhouette score sweep.
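The silhouette sweep can be sketched roughly as follows. This assumes scikit-learn's AgglomerativeClustering with Ward linkage; the function name and the k_min / k_max bounds are illustrative choices, not the values used in the pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def run_ward_sweep(embeddings: np.ndarray, k_min: int = 2, k_max: int = 15):
    """Fit Ward for each candidate k and keep the labels with the best silhouette."""
    n = len(embeddings)
    # silhouette_score needs between 2 and n - 1 distinct labels
    upper = min(k_max, n - 1)
    if upper < k_min:
        raise ValueError("not enough creatives for a silhouette sweep")

    best_labels, best_k, best_score = None, None, -np.inf
    for k in range(k_min, upper + 1):
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_labels, best_k, best_score = labels, k, score
    return best_labels, best_k, best_score
```

Ward linkage works on Euclidean distance, so if cosine geometry matters it may help to L2-normalize the embeddings before fitting.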

HDBSCAN works better for large, irregular cohorts. It identifies noise points (creatives that do not belong to any coherent approach), produces variable-density clusters, and does not require specifying cluster count. Trade-off: on small cohorts it can over-fragment, splitting creatives into many tiny clusters or labeling most of them as noise.
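One way to check for that fragmentation is to summarize the label array after a fit. A minimal sketch, assuming labels follow the HDBSCAN convention of -1 for noise; the function name and the small_cluster_size cutoff are illustrative, not part of the pipeline.

```python
from collections import Counter


def summarize_labels(labels, small_cluster_size: int = 3) -> dict:
    """Report cluster count, noise share, and how many clusters are tiny."""
    labels = [int(label) for label in labels]
    cluster_sizes = Counter(label for label in labels if label != -1)
    noise_count = sum(1 for label in labels if label == -1)
    return {
        "n_clusters": len(cluster_sizes),
        "noise_fraction": noise_count / len(labels) if labels else 0.0,
        "small_clusters": sum(
            1 for size in cluster_sizes.values() if size <= small_cluster_size
        ),
    }
```

A high noise fraction or a large share of tiny clusters is a hint that the cohort is too small or too uniform for HDBSCAN.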

The selection logic:

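import numpy as np
from sklearn.metrics import silhouette_score

def select_clustering_algorithm(
    embeddings: np.ndarray,
    cohort_size: int
) -> tuple[str, ClusterLabels]:
    # Small cohorts always use Ward; HDBSCAN is not tried at all.
    if cohort_size < 50:
        labels = run_ward(embeddings)
        return "ward", labels

    ward_labels = run_ward(embeddings)
    hdbscan_labels = run_hdbscan(embeddings)

    ward_score = silhouette_score(embeddings, ward_labels)

    # Score HDBSCAN on clustered points only (label -1 is noise).
    # silhouette_score raises unless there are at least 2 clusters,
    # so fall back to Ward when HDBSCAN finds fewer.
    clustered = hdbscan_labels != -1
    if np.unique(hdbscan_labels[clustered]).size < 2:
        return "ward", ward_labels
    hdbscan_score = silhouette_score(
        embeddings[clustered],
        hdbscan_labels[clustered]
    )

    # Switch to HDBSCAN only when it wins by a clear margin.
    if hdbscan_score > ward_score + 0.05:
        return "hdbscan", hdbscan_labels
    return "ward", ward_labels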

The 0.05 buffer prevents switching to HDBSCAN for trivial score differences. Ward is preferred when scores are comparable because its deterministic output makes lineage tracking simpler.

Phase 3: Naming

Once clusters form, each one is named by Gemini.

The prompt sends up to 8 member creatives (those nearest the cluster centroid by cosine distance) and asks for a short descriptive name.

def name_cluster(members: list[Creative]) -> str:
    sample = select_centroid_nearest(members, k=8)
    prompt = NAMING_PROMPT.format(
        creatives=format_for_prompt(sample)
    )
    response = gemini.generate(
        prompt=prompt,
        response_schema=ApproachName,
    )
    return response.name
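The centroid-nearest selection could look something like this. It is a sketch that works on a NumPy array of member embeddings and returns row indices, which differs from the list[Creative] signature above; the function name is illustrative and it assumes no embedding is the zero vector.

```python
import numpy as np


def nearest_to_centroid(embeddings: np.ndarray, k: int = 8) -> list[int]:
    """Indices of the k rows closest to the cluster centroid by cosine distance."""
    # Normalize rows so dot products equal cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    cosine_distance = 1.0 - unit @ centroid
    # argsort is ascending, so the smallest distances come first
    return np.argsort(cosine_distance)[:k].tolist()
```

Sampling from the center of the cluster keeps borderline members out of the prompt, which should make the generated name describe the core of the approach rather than its edges.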

Example outputs:

"Lifestyle product in use"
"Problem-agitation copy"
"Social proof with offer"
"Direct response with urgency"

The names label the structural approach, not the creative concept.

Phase 4: Lineage tracking

The hardest part is not the clustering. It is tracking cluster identity across periods.

If an approach exists in period N-1 and the same creatives cluster together in period N, it should carry the same ID. Performance deltas, trend signals, and reporting all depend on stable identity.

I use a 40% member overlap threshold:

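def match_approach(
    new_cluster: set[str],
    previous_approaches: list[Approach],
    overlap_threshold: float = 0.40,
) -> str | None:
    # Return the best-overlapping previous approach, not just the first
    # one that clears the threshold, so list order cannot change the match.
    best_id, best_overlap = None, 0.0
    for prev in previous_approaches:
        union = new_cluster | prev.member_ids
        if not union:
            continue  # both sets empty: nothing to compare
        overlap = len(new_cluster & prev.member_ids) / len(union)
        if overlap >= overlap_threshold and overlap > best_overlap:
            best_id, best_overlap = prev.approach_id, overlap
    return best_id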

This is a Jaccard similarity check: the size of the intersection divided by the size of the union of the two member sets. A new cluster inherits a previous approach's ID when that ratio is at least 0.40.

New creatives joining an established approach do not break lineage as long as the core cluster identity holds.

When no match is found: new UUID, marked as NEW.

When a previous approach has no match: marked as INACTIVE.
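Putting the matching rules together for a whole period might look like the sketch below. It assumes a greedy best-overlap-first assignment so that each previous ID is reused at most once. The Approach dataclass, the assign_lineage name, and the CONTINUED status label are illustrative; only NEW and INACTIVE come from the rules above.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass
class Approach:
    approach_id: str
    member_ids: set[str]


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_lineage(
    new_clusters: list[set[str]],
    previous: list[Approach],
    threshold: float = 0.40,
) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (approach_id, status) per new cluster, plus IDs that went inactive."""
    # Score every (new, previous) pair, highest overlap first
    pairs = sorted(
        (
            (jaccard(cluster, prev.member_ids), i, j)
            for i, cluster in enumerate(new_clusters)
            for j, prev in enumerate(previous)
        ),
        reverse=True,
    )
    matched: dict[int, str] = {}
    used_previous: set[int] = set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are all below the threshold
        if i in matched or j in used_previous:
            continue
        matched[i] = previous[j].approach_id
        used_previous.add(j)

    results = []
    for i in range(len(new_clusters)):
        if i in matched:
            results.append((matched[i], "CONTINUED"))
        else:
            results.append((str(uuid.uuid4()), "NEW"))
    inactive = [
        prev.approach_id for j, prev in enumerate(previous) if j not in used_previous
    ]
    return results, inactive
```

Matching all clusters in one pass avoids two new clusters claiming the same previous ID, which can happen when an approach splits between periods.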

Trend analysis across 4 periods

The pipeline keeps 4 rolling periods of approach snapshots.

Three delta sets:

N vs N-1  →  most recent change
N vs N-2  →  medium-term trend
N vs N-3  →  structural trend

Per-period metrics (CTR, conversion rate, spend efficiency, frequency) are computed from member creatives in each period.

An approach improving over 3 periods is a different recommendation than one that peaked and is now declining.
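The three delta sets can be computed from a short metric history per approach. A minimal sketch, assuming the history is ordered oldest to newest with period N last; the function name and the output keys are illustrative.

```python
from __future__ import annotations


def period_deltas(history: list[float]) -> dict[str, float | None]:
    """Deltas of the latest period against up to three earlier periods."""
    if not history:
        raise ValueError("history must contain at least the current period")
    current = history[-1]
    deltas: dict[str, float | None] = {}
    for lag in (1, 2, 3):
        key = f"N_vs_N-{lag}"
        # An approach that is newer than the lag has no value to compare against
        deltas[key] = current - history[-1 - lag] if len(history) > lag else None
    return deltas
```

Returning None for missing periods keeps newly created approaches from looking like flat trends.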

FDR-corrected tag discrimination

After clustering, I run a Fisher's exact test per tag to identify which creative tags are discriminating features for each approach.

Benjamini-Hochberg FDR correction controls for multiple comparisons across dozens of tags:

from statsmodels.stats.multitest import multipletests
 
pvalues = [
    fisher_exact_test(tag, cluster_members, all_creatives)
    for tag in all_tags
]
_, corrected_pvalues, _, _ = multipletests(pvalues, method="fdr_bh")
 
discriminating_tags = [
    tag for tag, p in zip(all_tags, corrected_pvalues)
    if p < 0.05
]

These discriminating tags feed into the creative brief generation pipeline, giving the LLM grounded evidence about what makes each approach distinctive.
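The fisher_exact_test helper is not shown above. As one possible shape for it, here is a standard-library sketch of a one-sided Fisher's exact test that asks whether a tag is over-represented in a cluster. The one-sided choice, the function name, and the tags_by_creative mapping are assumptions; scipy.stats.fisher_exact is the more usual route, and the original helper may be two-sided.

```python
from math import comb


def fisher_exact_enrichment(
    tag: str,
    cluster_ids: set[str],
    tags_by_creative: dict[str, set[str]],
) -> float:
    """One-sided p-value that `tag` is over-represented in the cluster.

    Assumes every ID in cluster_ids is a key of tags_by_creative.
    """
    population = len(tags_by_creative)
    tagged_total = sum(1 for tags in tags_by_creative.values() if tag in tags)
    cluster_size = len(cluster_ids)
    tagged_in_cluster = sum(1 for cid in cluster_ids if tag in tags_by_creative[cid])

    # Hypergeometric upper tail: P(X >= tagged_in_cluster)
    upper = min(tagged_total, cluster_size)
    tail = sum(
        comb(tagged_total, i) * comb(population - tagged_total, cluster_size - i)
        for i in range(tagged_in_cluster, upper + 1)
    )
    return tail / comb(population, cluster_size)
```

The resulting p-values can be passed to multipletests in the same way as in the snippet above.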

What runs where

BigQuery:

Raw creative ingestion and deduplication
Feature extraction and creative-level metric aggregation
Approach-level performance rollups per period

Python:

Embedding generation (Gemini API)
Ward / HDBSCAN fit
Approach ID matching
Tag discrimination tests
Gemini naming calls

Results write back to BigQuery. BigQuery stays the source of truth for inputs and outputs. Python handles the compute BigQuery cannot do natively.

Stack

Embeddings: Gemini text-embedding-004 (768-dim)
Clustering: scikit-learn (Ward), hdbscan (HDBSCAN)
Algorithm selection: silhouette_score from scikit-learn
Tag stats: Fisher's exact test, with Benjamini-Hochberg FDR correction from statsmodels
Naming: Gemini 2.5 Pro with structured output schema
Storage: BigQuery
Orchestration: Celery


If you are building creative analytics infrastructure and want to discuss the approach, get in touch.
