**v3.0 · Fully Parameterized**

# SEO Title Generator

A single, self-contained Python module that generates AI-powered SEO titles for parts-catalog data stored in Databricks Delta Lake tables.

`DeltaTableConfig` · `DomainConfig` · `PipelineConfig`

## Overview

seo_title_generator_v3.py is a 2,159-line standalone Python module designed to run on Databricks. Every table name, column name, schema reference, and domain-knowledge structure is a configurable parameter — nothing is hardcoded.

The module reads upstream Delta Lake tables (product exports, acronym expansions, manual SEO gold examples), performs rule-based acronym disambiguation, retrieves similar examples via RAG (embedding similarity + LLM reranking), constructs a structured LLM prompt, and generates a concise SEO title capped at 65 characters.
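The 65-character cap is typically enforced at a word boundary so titles never end mid-token. The following is a hedged sketch of that behavior, not the module's actual `shorten_ai_seo_title()` implementation, which may apply additional rules:

```python
# Illustrative sketch only: the real shorten_ai_seo_title() may differ.
def shorten_title(title: str, max_len: int = 65) -> str:
    """Truncate a title to max_len characters without ending mid-word."""
    if len(title) <= max_len:
        return title
    cut = title[:max_len]
    if " " in cut:
        cut = cut[: cut.rfind(" ")]  # back off to the last complete word
    return cut.rstrip()
```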

A cache-first strategy checks the stored results table before invoking the LLM pipeline, ensuring efficiency for previously processed parts.
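In outline, the cache-first strategy behaves like the sketch below. The function and parameter names here are illustrative, not the module's internals:

```python
# Hypothetical sketch of the cache-first control flow described above.
def resolve_title(part_number, lookup_cached, run_pipeline):
    """Return a cached result when available; otherwise run the full pipeline."""
    cached = lookup_cached(part_number)  # e.g. a read from the results Delta table
    if cached is not None:
        return {"Part_Number": part_number, "AI_SEO_Title": cached,
                "source": "cache_table", "cache_hit": True}
    title = run_pipeline(part_number)    # acronyms -> RAG -> LLM -> shorten
    return {"Part_Number": part_number, "AI_SEO_Title": title,
            "source": "generated", "cache_hit": False}
```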

## What Changed from v2

| Area | v2 (previous) | v3 (this version) |
| --- | --- | --- |
| Table names | Hardcoded in SQL strings | Configurable via `DeltaTableConfig` |
| Column names | Hardcoded (`Part_Number`, `OCC_Item_Number`, etc.) | Configurable via nested column-mapping dataclasses |
| Schema / database | Read from `db_config` dict | Explicit `DeltaTableConfig` fields |
| Manufacturer sets | Hardcoded module-level constants | `DomainConfig.mixer_manufacturers` / `.refuse_manufacturers` |
| Measurement tokens | Hardcoded frozenset | `DomainConfig.measurement_tokens` |
| Disambiguation rules | Hardcoded class-level dict | `DomainConfig.disambig_rules` |
| Domain clusters | Hardcoded class-level dict | `DomainConfig.domain_clusters` |
| Entry point signature | `generate_seo_title(pn, spark, db_config, pipe_cfg)` | `generate_seo_title(pn, spark, delta_cfg, domain_cfg, pipe_cfg)` |
| Cache key prefix | `"pass_a_v2"` | `"pass_a_v3"` |

## Configuration Hierarchy

The module is controlled by three top-level configuration dataclasses. Each has sensible defaults matching the McNeilus project, so you only override what differs in your environment.

### DeltaTableConfig

| Field | Type | Default |
| --- | --- | --- |
| `project_database` | `str` | `"hive_metastore.seg_env_project"` |
| `data_warehouse_database` | `str` | `"hive_metastore.seg_env_com_dw"` |
| `cache_table` | `str` | `"mcneilus_500_ai_seo_title"` |
| `acronym_expansions_table` | `str` | `"mcneilus_acronym_expansions"` |
| `product_export_table` | `str` | `"mcneilus_product_export"` |
| `seo_metadata_table` | `str` | `"mcneilus_seo_metadata"` |
| `dim_item_table` | `str` | `"dim_item"` |
| `cache_columns` | `CacheTableColumns` | Nested dataclass |
| `acronym_columns` | `AcronymExpansionsColumns` | Nested dataclass |
| `product_columns` | `ProductExportColumns` | Nested dataclass |
| `dim_item_columns` | `DimItemColumns` | Nested dataclass |
| `seo_metadata_columns` | `SeoMetadataColumns` | Nested dataclass |
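A minimal sketch of the dataclass shape the table implies, trimmed to two fields per nested class; the real module defines many more fields, and a `qualified_cache_table()` helper is an assumption added here for illustration:

```python
from dataclasses import dataclass, field

# Illustrative sketch only; the shipped DeltaTableConfig has more fields.
@dataclass
class CacheTableColumns:
    part_number: str = "Part_Number"
    seo_title: str = "AI_SEO_Title"

@dataclass
class DeltaTableConfig:
    project_database: str = "hive_metastore.seg_env_project"
    data_warehouse_database: str = "hive_metastore.seg_env_com_dw"
    cache_table: str = "mcneilus_500_ai_seo_title"
    cache_columns: CacheTableColumns = field(default_factory=CacheTableColumns)

    def qualified_cache_table(self) -> str:
        # Hypothetical helper: builds "<database>.<table>" for SQL strings.
        return f"{self.project_database}.{self.cache_table}"
```

Because every field has a default, a different environment only overrides what differs, e.g. `DeltaTableConfig(project_database="catalog.my_schema")`.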

### DomainConfig

| Field | Type | Description |
| --- | --- | --- |
| `mixer_manufacturers` | `Set[str]` | Valid mixer manufacturer names for the bar segment |
| `refuse_manufacturers` | `Set[str]` | Valid refuse manufacturer names for the bar segment |
| `measurement_tokens` | `frozenset` | 30+ tokens (`MM`, `IN`, `FT`, `PSI`, `NPT`, `JIC`, etc.) never expanded as acronyms |
| `disambig_rules` | `Dict[str, Tuple]` | 70+ hand-tuned rules mapping acronym → (expansion, reason) |
| `domain_clusters` | `Dict[str, List]` | 7 semantic clusters (hydraulic, electrical, structural, etc.) |
| `obsolete_keywords` | `List[str]` | Keywords that flag a part as obsolete |
| `mixer_division_key` | `str` | Division key for mixer products (default: `"mixer"`) |
| `refuse_division_key` | `str` | Division key for refuse products (default: `"refuse"`) |
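The sketch below shows how `measurement_tokens` and `disambig_rules` interact during rule-based disambiguation. The rule content and the `expand_acronym` helper are invented for illustration; the shipped `AcronymDisambiguator` logic is richer:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set, Tuple

# Trimmed, illustrative DomainConfig; the real class carries 70+ rules.
@dataclass
class DomainConfig:
    mixer_manufacturers: Set[str] = field(default_factory=lambda: {"McNeilus"})
    measurement_tokens: FrozenSet[str] = frozenset(
        {"MM", "IN", "FT", "PSI", "NPT", "JIC"})
    disambig_rules: Dict[str, Tuple[str, str]] = field(default_factory=dict)

def expand_acronym(token: str, cfg: DomainConfig) -> Optional[str]:
    """Measurement tokens are never expanded; otherwise consult the rules."""
    if token.upper() in cfg.measurement_tokens:
        return None
    rule = cfg.disambig_rules.get(token.upper())  # (expansion, reason)
    return rule[0] if rule else None

cfg = DomainConfig(disambig_rules={"HYD": ("Hydraulic", "common abbreviation")})
```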

### PipelineConfig

| Field | Type | Default |
| --- | --- | --- |
| `llm` | `Any` (LlamaIndex-style) | **Required** |
| `embed_model` | `Any` | **Required** |
| `rerank_llm` | `Any` | `None` (optional) |
| `max_title_length` | `int` | `65` |
| `top_k_examples` | `int` | `5` |
| `embed_top_k` | `int` | `20` |
| `enable_cache` | `bool` | `True` |
| `cache_dir` | `str` | `"/dbfs/FileStore/.../cache_v8"` |
| `high_confidence_threshold` | `float` | `0.85` |
| `medium_confidence_threshold` | `float` | `0.65` |
| `rate_limit_delay_s` | `float` | `0.0` |
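The two thresholds presumably partition the LLM's self-reported score into the `"high"` / `"medium"` / `"low"` levels seen in the return schema. A sketch, assuming scores at the boundary round up (the module's exact boundary handling is not documented here):

```python
# Assumed mapping from llm_confidence to confidence_level.
def confidence_level(score: float, high: float = 0.85, medium: float = 0.65) -> str:
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```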

## Quick Start (Databricks)

### Minimal Usage (all defaults = McNeilus project)

```python
# databricks_notebook.py
from seo_title_generator_v3 import (
    generate_seo_title, DeltaTableConfig, DomainConfig, PipelineConfig,
)

# Only schemas and LLM are required — everything else has defaults
delta_cfg = DeltaTableConfig(
    project_database="hive_metastore.mcn_prod_project",
    data_warehouse_database="hive_metastore.mcn_prod_com_dw",
)

pipeline_cfg = PipelineConfig(
    llm=llm_4o,
...
```

### Full Customization (different project)

```python
# custom_project.py
from seo_title_generator_v3 import (
    generate_seo_title,
    DeltaTableConfig, CacheTableColumns, AcronymExpansionsColumns,
    ProductExportColumns, DimItemColumns, SeoMetadataColumns,
    DomainConfig, PipelineConfig,
)

# Point at completely different tables and columns
delta_cfg = DeltaTableConfig(
    project_database="catalog.my_schema",
    data_warehouse_database="catalog.my_dw",
    cache_table="my_seo_cache",
...
```

### Batch Processing

```python
# batch_example.py
import pandas as pd

part_numbers = ["1234567", "2345678", "3456789"]
results = []
for pn in part_numbers:
    r = generate_seo_title(pn, spark, delta_cfg, pipeline_cfg=pipeline_cfg)
    results.append(r)

df = pd.DataFrame(results)
print(df[["Part_Number", "AI_SEO_Title", "source", "llm_confidence"]])
```

## Module Architecture

The pipeline flows through 10 stages, from the synchronous entry point through cache lookup, context assembly, RAG retrieval, LLM extraction, and post-processing:

1. `generate_seo_title()`: sync entry point
2. Cache Lookup: `DataLoader.lookup_cached_title()`
3. Context Assembly: `DataLoader.assemble_part_context()`
4. RAG Index: `RAGExampleSelector.build_index()`
5. Disambiguate: `AcronymDisambiguator.disambiguate()`
6. Retrieve Examples: `RAGExampleSelector.get_fewshot_examples()`
7. Pass A (LLM): `LLMClient.extract_fields()`
8. Dimension Normalize: `DimensionNormalizer.normalize_title()`
9. Post-Process: `post_process_title()`
10. Shorten: `shorten_ai_seo_title()` (≤65 chars)
**Call hierarchy**

```text
generate_seo_title()                     ← sync entry point
  └── _generate_seo_title_async()        ← async core
        ├── DataLoader.lookup_cached_title()     [cache check]
        ├── DataLoader.assemble_part_context()   [upstream reads]
        ├── DataLoader.build_parts_df_for_manual_seo()
        ├── DataLoader.load_manual_seo_with_context()
        ├── SEOTitlePipelineV3.load_examples()
        │     └── RAGExampleSelector.build_index()
        ├── SEOTitlePipelineV3.process_single()
        │     ├── AcronymDisambiguator.disambiguate()  [rule-based]
        │     ├── RAGExampleSelector.get_fewshot_examples()
        │     ├── LLMClient.extract_fields()   [Pass A]
        │     └── DimensionNormalizer.normalize_title()
        ├── post_process_title()          [manufacturer bar]
        └── shorten_ai_seo_title()        [≤65 chars]
```
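The sync entry point wrapping an async core is a common pattern in Databricks notebooks, where `nest_asyncio.apply()` lets `asyncio.run()` execute inside the notebook's already-running event loop. A self-contained sketch of the pattern (stub bodies, not the module's real code):

```python
import asyncio

# Stub of the async core; the real one performs cache lookup, context
# assembly, RAG retrieval, and LLM extraction.
async def _generate_seo_title_async(pn: str) -> dict:
    return {"Part_Number": pn, "AI_SEO_Title": f"Example Title for {pn}"}

def generate_seo_title(pn: str) -> dict:
    # In the module, nest_asyncio.apply() is called first so this works
    # even when an event loop is already running (as in Databricks).
    return asyncio.run(_generate_seo_title_async(pn))
```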

## Return Value Schema

| Key | Type | Description |
| --- | --- | --- |
| `Part_Number` | `str` | Input part number |
| `AI_SEO_Title` | `str` | Final shortened SEO title |
| `source` | `str` | `"cache_table"`, `"generated"`, or `"error"` |
| `cache_hit` | `bool` | Whether the result came from cache |
| `llm_confidence` | `float` | LLM self-reported confidence (0–1) |
| `confidence_level` | `str` | `"high"`, `"medium"`, or `"low"` |
| `AI_SEO_Title_raw` | `str` | Title before post-processing (generated only) |
| `AI_SEO_Title_postprocessed` | `str` | Title after manufacturer-bar logic (generated only) |
| `processing_time_ms` | `float` | Pipeline processing time (generated only) |
| `extracted_fields` | `dict` | Full LLM extraction output (generated only) |
| `retrieved_examples` | `list` | RAG examples used (generated only) |
| `error` | `str` | Error message (error only) |

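A hypothetical consumer of this schema might route on `source` and gate low-confidence titles for manual review; `triage` below is illustrative, not part of the module:

```python
# Example downstream handling of the return dict (names are hypothetical).
def triage(result: dict) -> str:
    if result.get("source") == "error":
        return f"FAILED: {result.get('error')}"
    if result.get("confidence_level") == "low":
        return f"REVIEW: {result['AI_SEO_Title']}"
    return f"OK: {result['AI_SEO_Title']}"
```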
## Dependencies

| Dependency | Used for |
| --- | --- |
| PySpark | Databricks runtime |
| LlamaIndex LLM | `.complete(prompt).text` |
| LlamaIndex Embeddings | `.get_text_embedding()` |
| pandas | DataFrame operations |
| numpy | Embedding math |
| nest_asyncio | Sync wrapper for Databricks |

seo_title_generator_v3.py — 2,159 lines — Fully parameterized standalone module for Databricks Delta Lake