SEO Title Generator
A single, self-contained Python module that generates AI-powered SEO titles for parts-catalog data stored in Databricks Delta Lake tables.
DeltaTableConfigDomainConfigPipelineConfigOverview
seo_title_generator_v3.py is a 2,159-line standalone Python module designed to run on Databricks. Every table name, column name, schema reference, and domain-knowledge structure is a configurable parameter — nothing is hardcoded.
The module reads upstream Delta Lake tables (product exports, acronym expansions, manual SEO gold examples), performs rule-based acronym disambiguation, retrieves similar examples via RAG (embedding similarity + LLM reranking), constructs a structured LLM prompt, and generates a concise SEO title capped at 65 characters.
A cache-first strategy checks the stored results table before invoking the LLM pipeline, ensuring efficiency for previously processed parts.
What Changed from v2
| Area | v2 (previous) | v3 (this version) |
|---|---|---|
| Table names | Hardcoded in SQL strings | Configurable via DeltaTableConfig |
| Column names | Hardcoded (Part_Number, OCC_Item_Number, etc.) | Configurable via nested column-mapping dataclasses |
| Schema / database | Read from db_config dict | Explicit DeltaTableConfig fields |
| Manufacturer sets | Hardcoded module-level constants | DomainConfig.mixer_manufacturers / .refuse_manufacturers |
| Measurement tokens | Hardcoded frozenset | DomainConfig.measurement_tokens |
| Disambiguation rules | Hardcoded class-level dict | DomainConfig.disambig_rules |
| Domain clusters | Hardcoded class-level dict | DomainConfig.domain_clusters |
| Entry point signature | generate_seo_title(pn, spark, db_config, pipe_cfg) | generate_seo_title(pn, spark, delta_cfg, domain_cfg, pipe_cfg) |
| Cache key prefix | "pass_a_v2" | "pass_a_v3" |
Configuration Hierarchy
The module is controlled by three top-level configuration dataclasses. Each has sensible defaults matching the McNeilus project, so you only override what differs in your environment.
DeltaTableConfig
| Field | Type | Default |
|---|---|---|
| project_database | str | "hive_metastore.seg_env_project" |
| data_warehouse_database | str | "hive_metastore.seg_env_com_dw" |
| cache_table | str | "mcneilus_500_ai_seo_title" |
| acronym_expansions_table | str | "mcneilus_acronym_expansions" |
| product_export_table | str | "mcneilus_product_export" |
| seo_metadata_table | str | "mcneilus_seo_metadata" |
| dim_item_table | str | "dim_item" |
| cache_columns | CacheTableColumns | Nested dataclass |
| acronym_columns | AcronymExpansionsColumns | Nested dataclass |
| product_columns | ProductExportColumns | Nested dataclass |
| dim_item_columns | DimItemColumns | Nested dataclass |
| seo_metadata_columns | SeoMetadataColumns | Nested dataclass |
DomainConfig
| Field | Type | Description |
|---|---|---|
| mixer_manufacturers | Set[str] | Valid mixer manufacturer names for bar segment |
| refuse_manufacturers | Set[str] | Valid refuse manufacturer names for bar segment |
| measurement_tokens | frozenset | 30+ tokens (MM, IN, FT, PSI, NPT, JIC, etc.) never expanded as acronyms |
| disambig_rules | Dict[str, Tuple] | 70+ hand-tuned rules mapping acronym → (expansion, reason) |
| domain_clusters | Dict[str, List] | 7 semantic clusters (hydraulic, electrical, structural, etc.) |
| obsolete_keywords | List[str] | Keywords that flag a part as obsolete |
| mixer_division_key | str | Division key for mixer products (default: "mixer") |
| refuse_division_key | str | Division key for refuse products (default: "refuse") |
PipelineConfig
| Field | Type | Default |
|---|---|---|
| llm | Any (LlamaIndex-style) | REQUIRED |
| embed_model | Any | REQUIRED |
| rerank_llm | Any | None (optional) |
| max_title_length | int | 65 |
| top_k_examples | int | 5 |
| embed_top_k | int | 20 |
| enable_cache | bool | True |
| cache_dir | str | "/dbfs/FileStore/.../cache_v8" |
| high_confidence_threshold | float | 0.85 |
| medium_confidence_threshold | float | 0.65 |
| rate_limit_delay_s | float | 0.0 |
Quick Start (Databricks)
Minimal Usage (all defaults = McNeilus project)
from seo_title_generator_v3 import (
generate_seo_title, DeltaTableConfig, DomainConfig, PipelineConfig,
)
# Only schemas and LLM are required — everything else has defaults
delta_cfg = DeltaTableConfig(
project_database="hive_metastore.mcn_prod_project",
data_warehouse_database="hive_metastore.mcn_prod_com_dw",
)
pipeline_cfg = PipelineConfig(
llm=llm_4o,
...Full Customization (different project)
from seo_title_generator_v3 import (
generate_seo_title,
DeltaTableConfig, CacheTableColumns, AcronymExpansionsColumns,
ProductExportColumns, DimItemColumns, SeoMetadataColumns,
DomainConfig, PipelineConfig,
)
# Point at completely different tables and columns
delta_cfg = DeltaTableConfig(
project_database="catalog.my_schema",
data_warehouse_database="catalog.my_dw",
cache_table="my_seo_cache",
...Batch Processing
part_numbers = ["1234567", "2345678", "3456789"]
results = []
for pn in part_numbers:
r = generate_seo_title(pn, spark, delta_cfg, pipeline_cfg=pipeline_cfg)
results.append(r)
import pandas as pd
df = pd.DataFrame(results)
print(df[["Part_Number", "AI_SEO_Title", "source", "llm_confidence"]])Module Architecture
The pipeline flows through 10 stages, from the synchronous entry point through cache lookup, context assembly, RAG retrieval, LLM extraction, and post-processing:
generate_seo_title() ← sync entry point
└── _generate_seo_title_async() ← async core
├── DataLoader.lookup_cached_title() [cache check]
├── DataLoader.assemble_part_context() [upstream reads]
├── DataLoader.build_parts_df_for_manual_seo()
├── DataLoader.load_manual_seo_with_context()
├── SEOTitlePipelineV3.load_examples()
│ └── RAGExampleSelector.build_index()
├── SEOTitlePipelineV3.process_single()
│ ├── AcronymDisambiguator.disambiguate() [rule-based]
│ ├── RAGExampleSelector.get_fewshot_examples()
│ ├── LLMClient.extract_fields() [Pass A]
│ └── DimensionNormalizer.normalize_title()
├── post_process_title() [manufacturer bar]
└── shorten_ai_seo_title() [≤65 chars]Return Value Schema
| Key | Type | Description |
|---|---|---|
| Part_Number | str | Input part number |
| AI_SEO_Title | str | Final shortened SEO title |
| source | str | "cache_table" or "generated" or "error" |
| cache_hit | bool | Whether result came from cache |
| llm_confidence | float | LLM self-reported confidence (0–1) |
| confidence_level | str | "high", "medium", or "low" |
| AI_SEO_Title_raw | str | Title before post-processing (generated only) |
| AI_SEO_Title_postprocessed | str | Title after manufacturer bar logic (generated only) |
| processing_time_ms | float | Pipeline processing time (generated only) |
| extracted_fields | dict | Full LLM extraction output (generated only) |
| retrieved_examples | list | RAG examples used (generated only) |
| error | str | Error message (error only) |
Dependencies
PySparkDatabricks runtimeLlamaIndex LLM.complete(prompt).textLlamaIndex Embeddings.get_text_embedding()pandasDataFrame operationsnumpyEmbedding mathnest_asyncioSync wrapper for Databricksseo_title_generator_v3.py — 2,159 lines — Fully parameterized standalone module for Databricks Delta Lake