FLP Wiki

Semantic Search

Model Card — Case Law Semantic Search Encoder

System / Product: CourtListener Semantic Search (Case Law)
Model Type: Encoder — Retrieval (SentenceTransformer)
Version: v1.0
Date: 2025-11-05
Owner: Rachel Gao, AI Team
Status: [x] Production   [ ] Deprecated


1. Purpose

1.1 What this model does

This model generates dense vector embeddings (768 dimensions) for legal opinions and search queries. Given a text input, it produces a fixed-length vector that captures the semantic meaning of the text. These embeddings are compared using cosine similarity to find semantically related legal opinions for a given query.

The model uses instruction prefixes to distinguish between document and query inputs (see the usage sketch after this list):

  • search_document: prefix for opinion text
  • search_query: prefix for user queries
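
A minimal usage sketch, assuming the sentence-transformers library. The model id shown is the base model from Section 2; the production checkpoint is the finetuned model published by Free Law Project on HuggingFace. The texts are invented.

```python
# Minimal sketch: encode a query and opinion chunks with instruction prefixes,
# then rank opinion chunks by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # or the FLP finetuned checkpoint

documents = [
    "search_document: The court held that the negligence claim was barred by the statute of limitations...",
    "search_document: Defendant appeals the denial of his motion to suppress evidence seized during a traffic stop...",
]
query = "search_query: when does the statute of limitations bar a negligence claim"

doc_embeddings = model.encode(documents, normalize_embeddings=True)   # 768-dimensional vectors
query_embedding = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, doc_embeddings)  # higher = more semantically related
print(scores)
```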

1.2 Intended users

  • CourtListener's search infrastructure (embedding generation and retrieval)
  • External developers and researchers via the published model on HuggingFace and downloadable embeddings

1.3 Scope and limitations

  • In scope: U.S. case law opinions
  • Out of scope: Non-U.S. jurisdictions, non-case-law documents (filings, oral arguments, financial disclosures), non-English text
  • The model produces embeddings — it does not generate text or classify documents

2. Base Model

  • Base model name: nomic-ai/modernbert-embed-base
  • Version / checkpoint: ModernBERT-base
  • Provider: Nomic AI / Answer.AI
  • License: Apache 2.0
  • Link to production model location: HuggingFace

2.1 Model Modifications

  • [x] Finetuning: Supervised finetuning using triplet loss (MultipleNegativesRankingLoss) on legal opinion chunks paired with synthetic queries. Full finetuning of all weights, 1 epoch on ~2,800 training samples. See Section 3.2.
  • [ ] Continued pretraining
  • [ ] Prompting only
  • [ ] Other

3. Data

3.1 Pretraining Dataset

N/A — no continued pretraining was performed.

3.2 Finetuning Dataset

Dataset: Free-Law-Project/opinions-synthetic-query-512

Source: ~1,000 case law opinions sampled from CourtListener's database, spanning 184 courts (~8% of ~2,000 courts), with the top 20 courts (including SCOTUS) covering ~50% of sampled opinions. The sample distribution was visually verified against the population distribution to ensure representativeness.

Chunking: Opinions were split into segments of up to 512 tokens using the bert-base-cased tokenizer (capped at 480 tokens to leave a buffer), with a 2-sentence overlap between consecutive chunks for context continuity. Opinions under 50 words were excluded as unlikely to contain meaningful substance.
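
A rough sketch of this chunking scheme (not the production code); sentence splitting is assumed to happen upstream, and token counts use the bert-base-cased tokenizer:

```python
# Pack sentences into chunks of at most ~480 tokens (buffer under the 512 limit),
# carrying a 2-sentence overlap between consecutive chunks. Illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
MAX_TOKENS = 480
OVERLAP_SENTENCES = 2
MIN_WORDS = 50

def chunk_opinion(sentences: list[str]) -> list[str]:
    if len(" ".join(sentences).split()) < MIN_WORDS:
        return []  # opinions under 50 words are excluded
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(tokenizer.encode(candidate)) > MAX_TOKENS:
            chunks.append(" ".join(current))
            current = current[-OVERLAP_SENTENCES:]  # keep the last 2 sentences for continuity
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```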

Format: Triplet format (anchor, positive, negative), illustrated after this list:

  • Anchor: Chunked opinion text (prepended with search_document:)
  • Positive: A synthetically generated relevant query
  • Negative: A synthetically generated irrelevant query that appears similar but does not match
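
An invented record illustrating this layout (whether the stored queries carry the search_query: prefix is not specified here, so they are shown without it):

```python
# Illustrative triplet; the opinion text and queries are made up.
triplet = {
    "anchor": "search_document: The trial court granted summary judgment, holding that "
              "the plaintiff's claim was barred by the two-year statute of limitations...",
    "positive": "when does the statute of limitations bar a personal injury claim",
    "negative": "filing requirements for forming a limited liability company",
}
```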

Statistics:

  • Training samples: 2,828 chunks (from 315 unique opinions)
  • Dev samples: 489 chunks (from 93 unique opinions)
  • Test opinions: 362 unique opinions (held out, same test set used across all model comparisons)
  • Anchor token length: 33–487 tokens (mean: 407)
  • Query token length: 14–34 tokens (mean: ~20)

Synthetic query generation: GPT-4o generated relevant and irrelevant query pairs for each opinion chunk, following the query-focused triplet approach described in Google's research on synthetic query generation. A slightly different prompt was used for the finetuning dataset than for the evaluation dataset to further ensure robustness.

Data split integrity: Train/val/test splits ensured no overlap in opinion_id, cluster_id, docket_id, or docket_number to prevent data leakage.
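
A sketch of this kind of leakage check, assuming each split is a pandas DataFrame with the identifier columns named below:

```python
# Verify that no identifier value is shared between the training split and
# another split; any overlap would indicate potential data leakage.
import pandas as pd

ID_COLUMNS = ["opinion_id", "cluster_id", "docket_id", "docket_number"]

def assert_no_overlap(train: pd.DataFrame, other: pd.DataFrame, split_name: str) -> None:
    for col in ID_COLUMNS:
        shared = set(train[col].dropna()) & set(other[col].dropna())
        assert not shared, f"{split_name} shares {len(shared)} {col} values with train"

# assert_no_overlap(train_df, dev_df, "dev")
# assert_no_overlap(train_df, test_df, "test")
```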

Related datasets:

3.3 Prompt Design and Versioning

N/A — this is an encoder model, not a generative model.

3.4 Validation and Test Dataset

  • Dev set: 489 chunks from 93 unique opinions, used during training for early stopping and hyperparameter tuning.
  • Test set: 362 unique opinions held out from training. The same test set was used across all model comparisons to ensure results are directly comparable.
  • Both sets were screened for leakage by ensuring no shared opinion_id, cluster_id, docket_id, or docket_number with the training set.

Evaluation data generation: GPT-4o-mini generated queries for the evaluation dataset (vs GPT-4o for the training set). Relevance was verified using TF-IDF cosine similarity as a first pass, then legal-BERT for edge cases, with manual review of sampled disagreements.
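
A rough sketch of the TF-IDF first pass; the threshold and example texts are illustrative, not the values actually used:

```python
# Screen query/chunk pairs by lexical (TF-IDF) cosine similarity; pairs whose
# similarity disagrees with the intended label go on for further review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(opinion_chunk: str, query: str) -> float:
    matrix = TfidfVectorizer().fit_transform([opinion_chunk, query])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

score = tfidf_similarity(
    "The court addressed whether qualified immunity shields the officers from suit...",
    "qualified immunity standard for police officers",
)
needs_review = score < 0.1  # illustrative threshold; edge cases went to legal-BERT and manual review
```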

3.5 Label Documentation

Synthetic labels generated by GPT-4o (training) and GPT-4o-mini (evaluation). The labeling schema is binary relevance: each query is either relevant (positive) or irrelevant (negative) to the anchor opinion chunk. No human annotation was performed on the finetuning data.


4. Training & Evaluation

4.1 Design Decisions and Rationale

Model selection process: A systematic comparison was conducted across 15+ models in three categories:

  1. Pretrained SentenceTransformers (4 models): multi-qa-mpnet-base-dot-v1 was the best performer, consistent with prior FLP experiments.
  2. Other pretrained open-source encoders (4 models): nomic-ai/modernbert-embed-base was the best overall, with thenlper/gte-large as runner-up at 512 tokens.
  3. ModernBERT-based models (3 models): nomic-ai/modernbert-embed-base was the clear winner in both 512 and 8192 chunk sizes.

Final model selection: nomic-ai/modernbert-embed-base was selected as the base model for finetuning because:

  • ~5 percentage point advantage in Hit Rate and MRR over multi-qa-mpnet-base-dot-v1, without a substantial increase in latency
  • Built on ModernBERT, a recent state-of-the-art encoder architecture
  • Supports an 8,192-token context window, better suited to long legal opinions
  • Best performance on high-volume courts (e.g., nyappdiv, scotus), which make up a large portion of the corpus

Finetuning approach: Full finetuning with MultipleNegativesRankingLoss on triplet data. Nine models were finetuned for comparison, including foundation models (bert-base-cased, roberta-base, ModernBERT-base, KL3M variants) and already-finetuned embedding models (mpnet-base, modernbert-embed-base). The FLP model (finetuned modernbert-embed-base) achieved the best performance.
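
A sketch of the finetuning setup, assuming sentence-transformers 3.x and that the dataset exposes (anchor, positive, negative) columns in that order; hyperparameters mirror Section 4.2, and the split names are assumptions:

```python
# Finetune the base encoder with MultipleNegativesRankingLoss on the triplet data.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
dataset = load_dataset("Free-Law-Project/opinions-synthetic-query-512")

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-embed-base_finetune_512",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate anchors in a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],  # split name assumed; may differ in the published dataset
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```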

Chunk size decision: 512-token chunks were chosen for the production model because:

  • The 512-chunk finetuned model outperformed the 8192-chunk finetuned model, likely due to substantially more training datapoints (an opinion with 2,000 tokens produces ~3 chunks at 512 vs 1 chunk at 8,192)
  • The 512-chunk model also performed comparably when evaluated against 8192-chunk test data
  • Smaller chunks enable more granular retrieval

Task formation: Both QA (question-answering) and IR (information retrieval) task formations were tested. QA generally performed slightly better, but both were viable. The finetuning data uses query-focused triplets.

Open-source only: Closed-source models (OpenAI, Voyage AI, etc.) were excluded per FLP's mission. Only models that can run locally without external API dependencies were considered.

4.2 Metrics and Evaluation

Metrics (a computation sketch for Hit Rate and MRR follows this list):

  • Hit Rate: Whether the correct opinion appears in the top-k retrieved results
  • MRR (Mean Reciprocal Rank): The mean, over queries, of the reciprocal rank of the first correct result
  • Cosine Triplet Accuracy: Whether the model ranks the positive query closer to the anchor than the negative query (used during finetuning)
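
A minimal sketch of how Hit Rate@k and MRR can be computed, assuming one relevant opinion per query:

```python
# `rankings` maps each query id to the ordered list of retrieved opinion ids;
# `relevant` maps each query id to the single correct opinion id.
def hit_rate_and_mrr(rankings: dict[str, list[str]], relevant: dict[str, str], k: int = 10):
    hits = 0
    reciprocal_ranks = []
    for query_id, retrieved in rankings.items():
        target = relevant[query_id]
        if target in retrieved[:k]:
            hits += 1
        if target in retrieved:
            reciprocal_ranks.append(1.0 / (retrieved.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(rankings), sum(reciprocal_ranks) / len(rankings)
```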

Finetuning results (cosine triplet accuracy):

  • Before finetuning (baseline): 96.3%
  • After finetuning (dev set): 99.6%–99.8%

Cross-model comparison (test set, 512 chunk size, QA task):

The finetuned FLP model (modernbert-embed-base_finetune_512) achieved the best Hit Rate and MRR across all models tested, outperforming both the base nomic-ai/modernbert-embed-base and multi-qa-mpnet-base-dot-v1.

Per-stratum observations:

  • Performed better on opinions from recent years (attributed to ModernBERT's more recent training data)
  • Performed better on high-volume courts (e.g., nyappdiv, scotus)
  • No notable difference between opinion sources (opinion_xml_harvard vs opinion_html_with_citations)
  • No notable difference across opinion types or court jurisdictions

Training hyperparameters:

  • Epochs: 1
  • Training time: ~3 minutes (T4 GPU)
  • Training loss: 0.669
  • Learning rate: 2e-5
  • Warmup ratio: 0.1
  • Batch size: 16
  • Precision: FP16
  • Optimizer: AdamW
  • Batch sampler: no_duplicates

Architecture:

  • Embedding dimensions: 768
  • Max sequence length: 8,192 tokens
  • Pooling: Mean tokens
  • Normalization: Yes
  • Similarity function: Cosine

4.3 Failure Analysis

  • Synthetic vs real queries: The model was trained and evaluated on synthetic queries generated by GPT-4o/4o-mini. Real user queries may differ in style, specificity, or vocabulary. Once deployed, real user queries will be collected to create a more representative evaluation dataset.

5. Known Limitations

  • Training data size: ~2,800 training samples is relatively small. The 8,192-chunk model showed less improvement from finetuning due to fewer training examples, suggesting more data would help.
  • Synthetic training data: Training queries were generated by GPT-4o, not real users. The evaluation queries were generated by GPT-4o-mini. Actual user behavior may differ.
  • English only: Trained on English-language U.S. case law only.
  • No reranking: The current pipeline does not include a reranking step, which could improve retrieval precision.
  • Context limited to opinions: Only the opinion text is embedded. Other aspects of filings (headmatter, posture, syllabus) are not included but could provide additional context.
  • Development vs production data: The finetuning data was extracted from the development database. Given the rate of production updates, the dataset may not be fully representative of production.

6. Deployment and Monitoring

6.1 Deployment Setup and Data Dependencies

Embedding generation: The Inception microservice wraps the model for inference. It runs as a separate service — CPU for query encoding, GPU for bulk opinion indexing.

Opinion embeddings: Pre-computed from the html_with_citations field and stored in Elasticsearch; they are re-generated whenever html_with_citations changes.
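
A hedged sketch of query-time retrieval against the pre-computed embeddings; the index name, vector field, and endpoint are illustrative, not the production configuration:

```python
# Encode the user query (handled by the Inception service in production) and
# run a kNN search over opinion embeddings stored in Elasticsearch.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # or the finetuned checkpoint
es = Elasticsearch("http://localhost:9200")

query_vector = model.encode(
    "search_query: qualified immunity for police officers",
    normalize_embeddings=True,
)

response = es.search(
    index="opinion-embeddings",      # illustrative index name
    knn={
        "field": "embedding",        # illustrative dense_vector field
        "query_vector": query_vector.tolist(),
        "k": 10,
        "num_candidates": 100,
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("opinion_id"))
```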

6.2 Monitoring Plan

  • Error monitoring: Sentry is configured on both the Inception microservice and CourtListener to track errors, exceptions, and service health.
  • Retrieval quality and latency: No automated monitoring dashboard currently in place. Quality and latency are evaluated manually on a quarterly basis (see Section 6.3).

6.3 Re-evaluation Criteria and Process

Scheduled cadence: Quarterly evaluation of retrieval quality and latency. The next scheduled evaluation is July 2026.

Triggers for unscheduled review:

  • Significant changes to the case law corpus (e.g., major new jurisdiction or data source added)
  • Changes to the Inception microservice or the underlying model
  • User feedback indicating retrieval quality degradation

7. Version History

  • v1.0 (2025-11-05), Rachel Gao: Initial release

8. Contacts

  • Model owner: Rachel Gao, rachel@free.law
  • AI team contact: Rachel Gao, rachel@free.law