When a marketing team runs hundreds of creatives across campaigns, the useful question is not "which creatives are performing" but "which creative approaches are performing."
For example: "lifestyle imagery with testimonial copy" vs. "product-only with discount callout" vs. "UGC-style with problem framing."
Manual tagging at scale is not viable.
Rule-based grouping misses semantic nuance.
Here is the clustering pipeline I use to identify and track creative approaches automatically.
The four phases
- Embed - convert each creative into a vector using Gemini's embedding model
- Cluster - group creatives by semantic similarity (Ward or HDBSCAN, selected by silhouette score)
- Name - send cluster members to Gemini to generate a descriptive approach name
- Track - maintain approach identity across analysis periods using member overlap
Phase 1: Embedding
Each creative produces an embedding from its combined signals:
```python
def build_creative_text(creative: Creative) -> str:
    parts = [
        creative.headline or "",
        creative.description or "",
        " ".join(creative.tags or []),
        creative.format or "",
        creative.placement or "",
    ]
    return " | ".join(p for p in parts if p)


embedding = gemini_client.embed_content(
    model="models/text-embedding-004",
    content=build_creative_text(creative),
    task_type="CLUSTERING",
)
```
Two creatives with different headlines but the same structural approach end up close in embedding space.
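To make "close in embedding space" concrete, here is a minimal sketch of a cosine-similarity check. The `cosine_similarity` helper and the toy 4-dimensional vectors are illustrative assumptions, not part of the pipeline; real `text-embedding-004` vectors have 768 dimensions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real embeddings (hypothetical values).
lifestyle_a = np.array([0.9, 0.1, 0.0, 0.2])
lifestyle_b = np.array([0.8, 0.2, 0.1, 0.3])
discount = np.array([0.0, 0.9, 0.4, 0.0])

same_approach = cosine_similarity(lifestyle_a, lifestyle_b)
different_approach = cosine_similarity(lifestyle_a, discount)
```

In this toy setup, the two "lifestyle" vectors score much higher against each other than either does against the "discount" vector, which is the property the clustering step relies on.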
Phase 2: Algorithm selection
Neither Ward hierarchical nor HDBSCAN is universally better. The right one depends on cohort size and structure.
Ward works well for small cohorts (under ~200 creatives). It forces every creative into a cluster, produces balanced groups, and handles high-dimensional embeddings cleanly. It requires specifying cluster count, which I solve with a silhouette score sweep.
HDBSCAN works better for large, irregular cohorts. It identifies noise points (creatives that do not belong to any coherent approach), produces variable-density clusters, and does not require specifying cluster count. Trade-off: it can over-fragment small cohorts into single-creative clusters.
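The silhouette sweep mentioned for Ward could look roughly like the sketch below. The function name `run_ward_sweep` and the `k_min`/`k_max` bounds are assumptions for illustration; the actual `run_ward` helper is not shown in this post. It uses scikit-learn's `AgglomerativeClustering` with Ward linkage, which operates on Euclidean distances.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def run_ward_sweep(embeddings: np.ndarray, k_min: int = 2, k_max: int = 10) -> np.ndarray:
    """Fit Ward clustering for each k and keep the labels with the best silhouette."""
    best_score, best_labels = -1.0, None
    # silhouette_score needs between 2 and n_samples - 1 clusters.
    upper = min(k_max, len(embeddings) - 1)
    for k in range(k_min, upper + 1):
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    if best_labels is None:
        raise ValueError("need at least 3 samples to sweep cluster counts")
    return best_labels
```

The sweep trades extra fits for not having to guess the cluster count up front. For cohorts under a few hundred creatives, that cost is usually small.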
The selection logic:
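```python
import numpy as np
from sklearn.metrics import silhouette_score


def select_clustering_algorithm(
    embeddings: np.ndarray,
    cohort_size: int,
) -> tuple[str, ClusterLabels]:
    # Very small cohorts: skip the comparison and always use Ward.
    if cohort_size < 50:
        labels = run_ward(embeddings)
        return "ward", labels

    ward_labels = run_ward(embeddings)
    hdbscan_labels = run_hdbscan(embeddings)
    ward_score = silhouette_score(embeddings, ward_labels)

    # Score HDBSCAN on clustered points only; the label -1 marks noise.
    clustered = hdbscan_labels != -1
    # silhouette_score needs at least 2 clusters, so fall back to Ward otherwise.
    if len(np.unique(hdbscan_labels[clustered])) < 2:
        return "ward", ward_labels
    hdbscan_score = silhouette_score(embeddings[clustered], hdbscan_labels[clustered])

    # Switch to HDBSCAN only when it wins by a clear margin.
    if hdbscan_score > ward_score + 0.05:
        return "hdbscan", hdbscan_labels
    return "ward", ward_labels
```
The 0.05 buffer prevents switching to HDBSCAN for trivial score differences. Ward is preferred when scores are comparable because its deterministic output makes lineage tracking simpler.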
Phase 3: Naming
Once clusters form, each one is named by Gemini.
The prompt sends up to 8 member creatives (those nearest the cluster centroid by cosine distance) and asks for a short descriptive name.
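The centroid-nearest selection can be sketched as follows. This is an assumption about how such a helper might work, not the actual `select_centroid_nearest` implementation: the hypothetical `centroid_nearest_indices` takes the cluster's embedding matrix (one row per creative, no all-zero rows) and returns row indices instead of `Creative` objects.

```python
import numpy as np


def centroid_nearest_indices(embeddings: np.ndarray, k: int = 8) -> list[int]:
    """Return indices of the k rows closest to the centroid by cosine distance."""
    # L2-normalize rows so dot products equal cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    cosine_distance = 1.0 - unit @ centroid
    # argsort is ascending, so the smallest distances come first.
    return np.argsort(cosine_distance)[:k].tolist()
```

Sampling the members nearest the centroid keeps outliers out of the naming prompt, so the generated name describes the core of the cluster and not its edges.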
```python
def name_cluster(members: list[Creative]) -> str:
    sample = select_centroid_nearest(members, k=8)
    prompt = NAMING_PROMPT.format(
        creatives=format_for_prompt(sample)
    )
    response = gemini.generate(
        prompt=prompt,
        response_schema=ApproachName,
    )
    return response.name
```
Example outputs:
- "Lifestyle product in use"
- "Problem-agitation copy"
- "Social proof with offer"
- "Direct response with urgency"
The names label the structural approach, not the creative concept.
Phase 4: Lineage tracking
The hardest part is not the clustering. It is tracking cluster identity across periods.
If an approach exists in period N-1 and the same creatives cluster together in period N, it should carry the same ID. Performance deltas, trend signals, and reporting all depend on stable identity.
I use a 40% member overlap threshold:
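```python
def match_approach(
    new_cluster: set[str],
    previous_approaches: list[Approach],
    overlap_threshold: float = 0.40,
) -> str | None:
    for prev in previous_approaches:
        # Jaccard similarity: shared members / members in either cluster.
        overlap = len(new_cluster & prev.member_ids) / len(new_cluster | prev.member_ids)
        if overlap >= overlap_threshold:
            return prev.approach_id
    return None
```
This is a Jaccard similarity check: the number of shared members divided by the size of the union of both member sets. A new cluster inherits a previous approach's ID when that ratio is at least 0.40. The function returns the first previous approach that clears the threshold, not necessarily the best-scoring one.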
New creatives joining an established approach do not break lineage as long as the core cluster identity holds.
When no match is found: new UUID, marked as NEW.
When a previous approach has no match: marked as INACTIVE.
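The matching and status rules above can be combined into one reconciliation pass. The sketch below is a variant under stated assumptions, not the pipeline's code: it picks the best-scoring previous approach instead of the first one over the threshold, lets each previous approach be claimed only once, and uses a hypothetical `CONTINUING` status for matched clusters (only `NEW` and `INACTIVE` are named in this post).

```python
import uuid


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reconcile(
    new_clusters: list[set[str]],
    previous: dict[str, set[str]],  # approach_id -> member creative IDs
    threshold: float = 0.40,
) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (approach_id, status) per new cluster, plus the INACTIVE IDs."""
    assignments: list[tuple[str, str]] = []
    matched: set[str] = set()
    for cluster in new_clusters:
        # Score every previous approach that has not been claimed yet.
        candidates = [
            (jaccard(cluster, members), approach_id)
            for approach_id, members in previous.items()
            if approach_id not in matched
        ]
        best_score, best_id = max(candidates, default=(0.0, None))
        if best_id is not None and best_score >= threshold:
            matched.add(best_id)
            assignments.append((best_id, "CONTINUING"))
        else:
            # No previous approach is similar enough: mint a fresh ID.
            assignments.append((str(uuid.uuid4()), "NEW"))
    inactive = [approach_id for approach_id in previous if approach_id not in matched]
    return assignments, inactive
```

As a worked example, suppose an approach had 10 members last period, and this period's cluster keeps 8 of them and gains 4 new creatives. The intersection is 8 and the union is 14, so the Jaccard score is 8/14 ≈ 0.57, which clears the 0.40 threshold and preserves the ID.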
Trend analysis across 4 periods
The pipeline keeps 4 rolling periods of approach snapshots.
Three delta sets:
- N vs N-1 → most recent change
- N vs N-2 → medium-term trend
- N vs N-3 → structural trend
Per-period metrics (CTR, conversion rate, spend efficiency, frequency) are computed from member creatives in each period.
An approach improving over 3 periods is a different recommendation than one that peaked and is now declining.
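A minimal sketch of the three delta sets for a single metric follows. The function names, the dictionary keys, and the trend labels (`improving`, `peaked_declining`, `declining`, `mixed`) are hypothetical; the classification rules the pipeline actually uses are not specified in this post.

```python
def period_deltas(metric_by_period: list[float]) -> dict[str, float]:
    """Compare the latest period N against N-1, N-2, and N-3.

    metric_by_period is ordered oldest to newest and must hold 4 values.
    """
    if len(metric_by_period) != 4:
        raise ValueError("expected exactly 4 rolling periods")
    latest = metric_by_period[-1]
    return {
        "n_vs_n1": latest - metric_by_period[-2],
        "n_vs_n2": latest - metric_by_period[-3],
        "n_vs_n3": latest - metric_by_period[-4],
    }


def classify_trend(deltas: dict[str, float]) -> str:
    """Coarse label from the sign pattern of the three deltas."""
    if all(d > 0 for d in deltas.values()):
        # Latest period beats every earlier period in the window.
        return "improving"
    if deltas["n_vs_n1"] < 0 and deltas["n_vs_n3"] > 0:
        # Above the oldest period but below the previous one.
        return "peaked_declining"
    if all(d < 0 for d in deltas.values()):
        return "declining"
    return "mixed"
```

For instance, a CTR series of `[0.010, 0.018, 0.020, 0.015]` is still above where it started but below the previous period, so this sketch labels it `peaked_declining`.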
FDR-corrected tag discrimination
After clustering, I run a Fisher's exact test per tag to identify which creative tags are discriminating features for each approach.
Benjamini-Hochberg FDR correction controls for multiple comparisons across dozens of tags:
```python
from statsmodels.stats.multitest import multipletests

pvalues = [
    fisher_exact_test(tag, cluster_members, all_creatives)
    for tag in all_tags
]
_, corrected_pvalues, _, _ = multipletests(pvalues, method="fdr_bh")
discriminating_tags = [
    tag for tag, p in zip(all_tags, corrected_pvalues)
    if p < 0.05
]
```
These discriminating tags feed into the creative brief generation pipeline, giving the LLM grounded evidence about what makes each approach distinctive.
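The per-tag test reduces to a 2x2 contingency table: tag present or absent, crossed with inside or outside the cluster. Here is one way such a helper might be written with `scipy.stats.fisher_exact`. The name `tag_pvalue`, its signature, and the two-sided alternative are assumptions; the `fisher_exact_test` helper used above is not shown in this post.

```python
from scipy.stats import fisher_exact


def tag_pvalue(
    tag: str,
    cluster_ids: set[str],
    tags_by_creative: dict[str, set[str]],  # creative ID -> its tags
) -> float:
    """Two-sided Fisher's exact p-value for tag presence vs cluster membership."""
    in_with = in_without = out_with = out_without = 0
    for creative_id, tags in tags_by_creative.items():
        has_tag = tag in tags
        if creative_id in cluster_ids:
            in_with += has_tag
            in_without += not has_tag
        else:
            out_with += has_tag
            out_without += not has_tag
    # Rows: inside / outside the cluster. Columns: has tag / lacks tag.
    table = [[in_with, in_without], [out_with, out_without]]
    _, pvalue = fisher_exact(table, alternative="two-sided")
    return float(pvalue)
```

A two-sided test flags tags that are either over- or under-represented in a cluster. If only over-representation matters for briefs, `alternative="greater"` would be the narrower choice.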
What runs where
BigQuery:
- Raw creative ingestion and deduplication
- Feature extraction and creative-level metric aggregation
- Approach-level performance rollups per period
Python:
- Embedding generation (Gemini API)
- Ward / HDBSCAN fit
- Approach ID matching
- Tag discrimination tests
- Gemini naming calls
Results write back to BigQuery. BigQuery stays the source of truth for inputs and outputs. Python handles the compute BigQuery cannot do natively.
Stack
- Embeddings: Gemini text-embedding-004 (768-dim)
- Clustering: scikit-learn (Ward), hdbscan (HDBSCAN)
- Algorithm selection: silhouette_score from scikit-learn
- Tag stats: statsmodels (Fisher's exact + FDR)
- Naming: Gemini 2.5 Pro with structured output schema
- Storage: BigQuery
- Orchestration: Celery