FLP Wiki

Semantic Search

Model Card — Case Law Semantic Search Encoder

System / Product: CourtListener Semantic Search (Case Law)
Model Type: Encoder — Retrieval (SentenceTransformer)
Version: v1.0
Date: 2025-11-05
Owner: Rachel Gao, AI Team
Status: [x] Production   [ ] Deprecated


1. Purpose

1.1 What this model does

This model generates dense vector embeddings (768 dimensions) for legal opinions and search queries. Given a text input, it produces a fixed-length vector that captures the semantic meaning of the text. These embeddings are compared using cosine similarity to find semantically related legal opinions for a given query.

The model uses instruction prefixes to distinguish between document and query inputs (see the usage sketch after this list):

  • search_document: prefix for opinion text
  • search_query: prefix for user queries
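
A minimal usage sketch, assuming the sentence-transformers library. The model id shown is the base model from Section 2; the production checkpoint is the finetuned model published by Free Law Project on HuggingFace. The texts are invented.

```python
# Minimal sketch: encode a query and opinion chunks with instruction prefixes,
# then rank opinion chunks by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # or the FLP finetuned checkpoint

documents = [
    "search_document: The court held that the negligence claim was barred by the statute of limitations...",
    "search_document: Defendant appeals the denial of his motion to suppress evidence seized during a traffic stop...",
]
query = "search_query: when does the statute of limitations bar a negligence claim"

doc_embeddings = model.encode(documents, normalize_embeddings=True)   # 768-dimensional vectors
query_embedding = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, doc_embeddings)  # higher = more semantically related
print(scores)
```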

1.2 Intended users

  • CourtListener's search infrastructure (embedding generation and retrieval)
  • External developers and researchers via the published model on HuggingFace and downloadable embeddings

1.3 Scope and limitations

  • In scope: U.S. case law opinions
  • Out of scope: Non-U.S. jurisdictions, non-case-law documents (filings, oral arguments, financial disclosures), non-English text
  • The model produces embeddings — it does not generate text or classify documents

2. Base Model

  • Base model name: nomic-ai/modernbert-embed-base
  • Version / checkpoint: ModernBERT-base
  • Provider: Nomic AI / Answer.AI
  • License: Apache 2.0
  • Link to production model location: HuggingFace

2.1 Model Modifications

  • [x] Finetuning: Supervised finetuning using triplet loss (MultipleNegativesRankingLoss) on legal opinion chunks paired with synthetic queries. Full finetuning of all weights, 1 epoch on ~2,800 training samples. See Section 3.2.
  • [ ] Continued pretraining
  • [ ] Prompting only
  • [ ] Other

3. Data

3.1 Pretraining Dataset

N/A — no continued pretraining was performed.

3.2 Finetuning Dataset

Dataset: Free-Law-Project/opinions-synthetic-query-512

Source: ~1,000 case law opinions sampled from CourtListener's database, spanning 184 courts (~8% of ~2,000 courts), with the top 20 courts (including SCOTUS) covering ~50% of sampled opinions. The sample distribution was visually verified against the population distribution to ensure representativeness.

Chunking: Opinions were split into segments of up to 512 tokens using the bert-base-cased tokenizer (capped at 480 tokens to leave a buffer), with a 2-sentence overlap between consecutive chunks for context continuity. Opinions under 50 words were excluded as unlikely to contain meaningful substance.
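
A rough sketch of this chunking scheme (not the production code); sentence splitting is assumed to happen upstream, and token counts use the bert-base-cased tokenizer:

```python
# Pack sentences into chunks of at most ~480 tokens (buffer under the 512 limit),
# carrying a 2-sentence overlap between consecutive chunks. Illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
MAX_TOKENS = 480
OVERLAP_SENTENCES = 2
MIN_WORDS = 50

def chunk_opinion(sentences: list[str]) -> list[str]:
    if len(" ".join(sentences).split()) < MIN_WORDS:
        return []  # opinions under 50 words are excluded
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(tokenizer.encode(candidate)) > MAX_TOKENS:
            chunks.append(" ".join(current))
            current = current[-OVERLAP_SENTENCES:]  # keep the last 2 sentences for continuity
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```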

Format: Triplet format (anchor, positive, negative), illustrated after this list:

  • Anchor: Chunked opinion text (prepended with search_document:)
  • Positive: A synthetically generated relevant query
  • Negative: A synthetically generated irrelevant query that appears similar but does not match
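
An invented record illustrating this layout (whether the stored queries carry the search_query: prefix is not specified here, so they are shown without it):

```python
# Illustrative triplet; the opinion text and queries are made up.
triplet = {
    "anchor": "search_document: The trial court granted summary judgment, holding that "
              "the plaintiff's claim was barred by the two-year statute of limitations...",
    "positive": "when does the statute of limitations bar a personal injury claim",
    "negative": "filing requirements for forming a limited liability company",
}
```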

Statistics:

  • Training samples: 2,828 chunks (from 315 unique opinions)
  • Dev samples: 489 chunks (from 93 unique opinions)
  • Test opinions: 362 unique opinions (held out, same test set used across all model comparisons)
  • Anchor token length: 33–487 tokens (mean: 407)
  • Query token length: 14–34 tokens (mean: ~20)

Synthetic query generation: GPT-4o generated relevant and irrelevant query pairs for each opinion chunk, following the query-focused triplet approach described in Google's research on synthetic query generation. A slightly different prompt was used for the finetuning dataset than for the evaluation dataset to further ensure robustness.

Data split integrity: Train/val/test splits ensured no overlap in opinion_id, cluster_id, docket_id, or docket_number to prevent data leakage.
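
A sketch of this kind of leakage check, assuming each split is a pandas DataFrame with the identifier columns named below:

```python
# Verify that no identifier value is shared between the training split and
# another split; any overlap would indicate potential data leakage.
import pandas as pd

ID_COLUMNS = ["opinion_id", "cluster_id", "docket_id", "docket_number"]

def assert_no_overlap(train: pd.DataFrame, other: pd.DataFrame, split_name: str) -> None:
    for col in ID_COLUMNS:
        shared = set(train[col].dropna()) & set(other[col].dropna())
        assert not shared, f"{split_name} shares {len(shared)} {col} values with train"

# assert_no_overlap(train_df, dev_df, "dev")
# assert_no_overlap(train_df, test_df, "test")
```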

Related datasets:

3.3 Prompt Design and Versioning

N/A — this is an encoder model, not a generative model.

3.4 Validation and Test Dataset

  • Dev set: 489 chunks from 93 unique opinions, used during training for early stopping and hyperparameter tuning.
  • Test set: 362 unique opinions held out from training. The same test set was used across all model comparisons to ensure results are directly comparable.
  • Both sets were screened for leakage by ensuring no shared opinion_id, cluster_id, docket_id, or docket_number with the training set.

Evaluation data generation: GPT-4o-mini generated queries for the evaluation dataset (vs GPT-4o for the training set). Relevance was verified using TF-IDF cosine similarity as a first pass, then legal-BERT for edge cases, with manual review of sampled disagreements.
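
A rough sketch of the TF-IDF first pass; the threshold and example texts are illustrative, not the values actually used:

```python
# Screen query/chunk pairs by lexical (TF-IDF) cosine similarity; pairs whose
# similarity disagrees with the intended label go on for further review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(opinion_chunk: str, query: str) -> float:
    matrix = TfidfVectorizer().fit_transform([opinion_chunk, query])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

score = tfidf_similarity(
    "The court addressed whether qualified immunity shields the officers from suit...",
    "qualified immunity standard for police officers",
)
needs_review = score < 0.1  # illustrative threshold; edge cases went to legal-BERT and manual review
```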

3.5 Label Documentation

Synthetic labels generated by GPT-4o (training) and GPT-4o-mini (evaluation). The labeling schema is binary relevance: each query is either relevant (positive) or irrelevant (negative) to the anchor opinion chunk. No human annotation was performed on the finetuning data.


4. Training & Evaluation

4.1 Design Decisions and Rationale

Model selection process: A systematic comparison was conducted across 15+ models in three categories:

  1. Pretrained SentenceTransformers (4 models): multi-qa-mpnet-base-dot-v1 was the best performer, consistent with prior FLP experiments.
  2. Other pretrained open-source encoders (4 models): nomic-ai/modernbert-embed-base was the best overall, with thenlper/gte-large as runner-up at 512 tokens.
  3. ModernBERT-based models (3 models): nomic-ai/modernbert-embed-base was the clear winner in both 512 and 8192 chunk sizes.

Final model selection: nomic-ai/modernbert-embed-base was selected as the base model for finetuning because:

  • ~5 percentage point advantage in Hit Rate and MRR over multi-qa-mpnet-base-dot-v1, without a substantial increase in latency
  • Built on ModernBERT, a recent state-of-the-art encoder architecture
  • Supports an 8,192-token context window, better suited to long legal opinions
  • Best performance on high-volume courts (e.g., nyappdiv, scotus), which make up a large portion of the corpus

Finetuning approach: Full finetuning with MultipleNegativesRankingLoss on triplet data. Nine models were finetuned for comparison, including foundation models (bert-base-cased, roberta-base, ModernBERT-base, KL3M variants) and already-finetuned embedding models (mpnet-base, modernbert-embed-base). The FLP model (finetuned modernbert-embed-base) achieved the best performance.
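
A sketch of the finetuning setup, assuming sentence-transformers 3.x and that the dataset exposes (anchor, positive, negative) columns in that order; hyperparameters mirror Section 4.2, and the split names are assumptions:

```python
# Finetune the base encoder with MultipleNegativesRankingLoss on the triplet data.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
dataset = load_dataset("Free-Law-Project/opinions-synthetic-query-512")

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-embed-base_finetune_512",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate anchors in a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],  # split name assumed; may differ in the published dataset
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```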

Chunk size decision: 512-token chunks were chosen for the production model because:

  • The 512-chunk finetuned model outperformed the 8192-chunk finetuned model, likely due to substantially more training datapoints (an opinion with 2,000 tokens produces ~3 chunks at 512 vs 1 chunk at 8,192)
  • The 512-chunk model also performed comparably when evaluated against 8192-chunk test data
  • Smaller chunks enable more granular retrieval

Task formation: Both QA (question-answering) and IR (information retrieval) task formations were tested. QA generally performed slightly better, but both were viable. The finetuning data uses query-focused triplets.

Open-source only: Closed-source models (OpenAI, Voyage AI, etc.) were excluded per FLP's mission. Only models that can run locally without external API dependencies were considered.

4.2 Metrics and Evaluation

Metrics (a computation sketch for Hit Rate and MRR follows this list):

  • Hit Rate: Whether the correct opinion appears in the top-k retrieved results
  • MRR (Mean Reciprocal Rank): The mean, over queries, of the reciprocal rank of the first correct result
  • Cosine Triplet Accuracy: Whether the model ranks the positive query closer to the anchor than the negative query (used during finetuning)
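
A minimal sketch of how Hit Rate@k and MRR can be computed, assuming one relevant opinion per query:

```python
# `rankings` maps each query id to the ordered list of retrieved opinion ids;
# `relevant` maps each query id to the single correct opinion id.
def hit_rate_and_mrr(rankings: dict[str, list[str]], relevant: dict[str, str], k: int = 10):
    hits = 0
    reciprocal_ranks = []
    for query_id, retrieved in rankings.items():
        target = relevant[query_id]
        if target in retrieved[:k]:
            hits += 1
        if target in retrieved:
            reciprocal_ranks.append(1.0 / (retrieved.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(rankings), sum(reciprocal_ranks) / len(rankings)
```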

Finetuning results (cosine triplet accuracy):

  • Before finetuning (baseline): 96.3%
  • After finetuning (dev set): 99.6%–99.8%

Cross-model comparison (test set, 512 chunk size, QA task):

The finetuned FLP model (modernbert-embed-base_finetune_512) achieved the best Hit Rate and MRR across all models tested, outperforming both the base nomic-ai/modernbert-embed-base and multi-qa-mpnet-base-dot-v1.

Per-stratum observations:

  • Performed better on opinions from recent years (attributed to ModernBERT's more recent training data)
  • Performed better on high-volume courts (e.g., nyappdiv, scotus)
  • No notable difference between opinion sources (opinion_xml_harvard vs opinion_html_with_citations)
  • No notable difference across opinion types or court jurisdictions

Training hyperparameters:

  • Epochs: 1
  • Training time: ~3 minutes (T4 GPU)
  • Training loss: 0.669
  • Learning rate: 2e-5
  • Warmup ratio: 0.1
  • Batch size: 16
  • Precision: FP16
  • Optimizer: AdamW
  • Batch sampler: no_duplicates

Architecture:

  • Embedding dimensions: 768
  • Max sequence length: 8,192 tokens
  • Pooling: Mean tokens
  • Normalization: Yes
  • Similarity function: Cosine

4.3 Failure Analysis

  • Synthetic vs real queries: The model was trained and evaluated on synthetic queries generated by GPT-4o/4o-mini. Real user queries may differ in style, specificity, or vocabulary. Once deployed, real user queries will be collected to create a more representative evaluation dataset.

5. Known Limitations

  • Training data size: ~2,800 training samples is relatively small. The 8,192-chunk model showed less improvement from finetuning due to fewer training examples, suggesting more data would help.
  • Synthetic training data: Training queries were generated by GPT-4o, not real users. The evaluation queries were generated by GPT-4o-mini. Actual user behavior may differ.
  • English only: Trained on English-language U.S. case law only.
  • No reranking: The current pipeline does not include a reranking step, which could improve retrieval precision.
  • Context limited to opinions: Only the opinion text is embedded. Other aspects of filings (headmatter, posture, syllabus) are not included but could provide additional context.
  • Development vs production data: The finetuning data was extracted from the development database. Given the rate of production updates, the dataset may not be fully representative of production.

6. Deployment and Monitoring

6.1 Deployment Setup and Data Dependencies

Embedding generation: The Inception microservice wraps the model for inference. It runs as a separate service — CPU for query encoding, GPU for bulk opinion indexing.

Opinion embeddings: Pre-computed from the html_with_citations field and stored in Elasticsearch; they are re-generated whenever html_with_citations changes.
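
A hedged sketch of query-time retrieval against the pre-computed embeddings; the index name, vector field, and endpoint are illustrative, not the production configuration:

```python
# Encode the user query (handled by the Inception service in production) and
# run a kNN search over opinion embeddings stored in Elasticsearch.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # or the finetuned checkpoint
es = Elasticsearch("http://localhost:9200")

query_vector = model.encode(
    "search_query: qualified immunity for police officers",
    normalize_embeddings=True,
)

response = es.search(
    index="opinion-embeddings",      # illustrative index name
    knn={
        "field": "embedding",        # illustrative dense_vector field
        "query_vector": query_vector.tolist(),
        "k": 10,
        "num_candidates": 100,
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("opinion_id"))
```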

6.2 Monitoring Plan

  • Error monitoring: Sentry is configured on both the Inception microservice and CourtListener to track errors, exceptions, and service health.
  • Retrieval quality and latency: No automated monitoring dashboard currently in place. Quality and latency are evaluated manually on a quarterly basis (see Section 6.3).

6.3 Re-evaluation Criteria and Process

Scheduled cadence: Quarterly evaluation of retrieval quality and latency. The next scheduled evaluation is July 2026.

Triggers for unscheduled review:

  • Significant changes to the case law corpus (e.g., major new jurisdiction or data source added)
  • Changes to the Inception microservice or the underlying model
  • User feedback indicating retrieval quality degradation

7. Version History

  • v1.0 (2025-11-05), Rachel Gao: Initial release

8. Contacts

  • Model owner: Rachel Gao, rachel@free.law
  • AI team contact: Rachel Gao, rachel@free.law