Methodology | Juryvine | Litigation Intelligence

Data Collection: Three Authoritative Sources

1. PACER (Public Access to Court Electronic Records)

We continuously poll PACER RSS feeds for all 94 federal district courts and all appellate circuits. Each feed is checked every 30 minutes, capturing new filings, docket entries, and activity from the moment they are docketed. PACER data includes case captions, docket numbers, filing dates, judge assignments, document titles, and filing parties. This is the authoritative source for all federal case metadata.
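The polling step can be sketched as follows: parse one RSS payload and keep only the items not seen on an earlier poll. This is a minimal illustration, not Juryvine's actual code. The sample XML, the function names `parse_feed` and `new_items`, and the choice of the item link as the "already seen" key are all assumptions, and real court feeds may expose different fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload; real court feeds may carry different fields.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example District Court</title>
<item><title>1:24-cv-00001 Doe v. Acme Corp.</title>
<link>https://example.invalid/docket/1</link>
<pubDate>Mon, 01 Jan 2024 15:00:00 GMT</pubDate>
<description>[Complaint]</description></item>
<item><title>1:24-cv-00002 Roe v. Widget LLC</title>
<link>https://example.invalid/docket/2</link>
<pubDate>Mon, 01 Jan 2024 15:10:00 GMT</pubDate>
<description>[Motion to Dismiss]</description></item>
</channel></rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def new_items(items, seen_links):
    """Keep items whose link was not seen before, and remember them."""
    fresh = [it for it in items if it["link"] not in seen_links]
    seen_links.update(it["link"] for it in fresh)
    return fresh


seen = set()
first_poll = new_items(parse_feed(SAMPLE_RSS), seen)
second_poll = new_items(parse_feed(SAMPLE_RSS), seen)  # same payload again
print(len(first_poll), len(second_poll))
```

Polling the same payload twice yields two new items the first time and none the second, which is the behavior a 30-minute poller needs to avoid re-ingesting filings.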

2. CourtListener API

We query CourtListener's API hourly to retrieve full court opinions, detailed docket information, and historical case data. CourtListener aggregates opinions from federal courts, state supreme courts, and many state appellate courts. For Juryvine, we focus on federal opinions and use CourtListener's docket snapshots to backfill case history and enrich metadata beyond what PACER provides directly. This source is essential for appellate records and judicial decision text.

3. GDELT Project (Global Database of Events, Language and Tone)

We poll GDELT every 15 minutes for news articles mentioning federal litigation, specific judges, legal parties, and litigation-related keywords. GDELT draws on more than 100,000 online news outlets worldwide. This gives us media coverage and public discussion of cases, which we integrate into case narratives and use to identify significant litigation. Every article we store carries its source and publication date.

Data Processing Pipeline

Ingestion & Normalization

Raw data from PACER, CourtListener, and GDELT arrives in different formats and schedules. We normalize all incoming data into a common schema: case metadata (court, docket number, caption, dates), docket entries (event type, description, filing party), judge assignments, and media references. Each piece of data is tagged with its source and ingestion timestamp.
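A common schema of this kind might look like the sketch below. The field list follows the paragraph above, but the class names, the raw input keys (`court_id`, `case_no`, `case_title`, `date_filed`), and the specific cleanup rules are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DocketEntry:
    event_type: str
    description: str
    filing_party: str


@dataclass
class NormalizedCase:
    court: str
    docket_number: str
    caption: str
    filed_on: str
    source: str            # which upstream source produced this record
    ingested_at: datetime  # when we ingested it
    entries: list = field(default_factory=list)


def normalize_pacer_item(raw, now=None):
    """Map a hypothetical PACER-style dict onto the common schema."""
    now = now or datetime.now(timezone.utc)
    return NormalizedCase(
        court=raw["court_id"].lower(),
        docket_number=raw["case_no"].strip(),
        caption=" ".join(raw["case_title"].split()),  # collapse whitespace
        filed_on=raw["date_filed"],
        source="pacer",
        ingested_at=now,
    )


raw_item = {
    "court_id": "NYSD",
    "case_no": " 1:24-cv-00001 ",
    "case_title": "Doe  v.   Acme Corp.",
    "date_filed": "2024-01-01",
}
fixed_now = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)
case = normalize_pacer_item(raw_item, now=fixed_now)
case.entries.append(DocketEntry("complaint", "Complaint filed", "plaintiff"))
print(case.court, case.docket_number, case.caption)
```

Keeping `source` and `ingested_at` on every record is what makes later source attribution possible without a separate lookup.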

Case Identity Resolution

Federal litigation often spans multiple courts. A district court case may be appealed to a circuit court and then to the Supreme Court, and different sources (PACER, CourtListener) may assign it different internal IDs. Our deduplication system identifies the same case across all sources using three strategies: exact docket number matching within the same court, external ID matching (e.g., a PACER case number that maps to a CourtListener docket ID), and fuzzy title similarity. Once deduplicated, a single case record unifies filings from district and appellate courts.
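The three matching strategies can be sketched as a single function that tries them in order. This is an illustration under stated assumptions: the record layout, the `external_ids` field, the use of `difflib.SequenceMatcher` for title similarity, and the 0.9 threshold are all hypothetical choices, not a description of the production matcher.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity ratio between two captions (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_case(a, b, threshold=0.9):
    """Return the name of the first strategy that matches, or None."""
    # Strategy 1: exact docket number within the same court.
    if a["court"] == b["court"] and a["docket_number"] == b["docket_number"]:
        return "docket"
    # Strategy 2: the two records share an external ID.
    if set(a.get("external_ids", ())) & set(b.get("external_ids", ())):
        return "external_id"
    # Strategy 3: fuzzy caption similarity (threshold is an assumption).
    if title_similarity(a["caption"], b["caption"]) >= threshold:
        return "fuzzy_title"
    return None


district = {"court": "nysd", "docket_number": "1:24-cv-00001",
            "caption": "Doe v. Acme Corp.", "external_ids": ["cl-123"]}
appeal = {"court": "ca2", "docket_number": "24-1234",
          "caption": "DOE v. ACME CORP.", "external_ids": []}
linked = {"court": "ca2", "docket_number": "24-1234",
          "caption": "In re Acme", "external_ids": ["cl-123"]}
unrelated = {"court": "cand", "docket_number": "3:24-cv-00077",
             "caption": "Roe v. Widget LLC", "external_ids": []}

print(same_case(district, district))
print(same_case(district, linked))
print(same_case(district, appeal))
print(same_case(district, unrelated))
```

Ordering the checks from strictest to loosest means a fuzzy title match is only relied on when no stronger identifier is available, which is where manual review is most useful.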

Case Creation & Updates

For each newly identified case, we create a record in our database. As PACER RSS feeds report new filings, we append TimelineEvent records (each filing becomes a timestamped event). If the case is appealed and a new docket opens in the appellate court, our deduplication system merges it with the existing case record. This ensures a user can see the full litigation history across courts in a single case view.
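As a rough sketch of how appended events and a merged appellate docket could end up in one case view: the `TimelineEvent` name comes from the text above, while its fields, the `CaseRecord` class, and the `merge_cases` helper are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class TimelineEvent:
    occurred_on: date
    court: str
    description: str


@dataclass
class CaseRecord:
    caption: str
    dockets: set = field(default_factory=set)    # (court, docket_number) pairs
    timeline: list = field(default_factory=list)

    def add_event(self, event):
        """Append an event once, keeping the timeline in date order."""
        if event not in self.timeline:
            self.timeline.append(event)
            self.timeline.sort(key=lambda e: e.occurred_on)


def merge_cases(primary, other):
    """Fold another record's dockets and events into the primary record."""
    primary.dockets |= other.dockets
    for ev in other.timeline:
        primary.add_event(ev)
    return primary


district = CaseRecord("Doe v. Acme Corp.", {("nysd", "1:24-cv-00001")})
district.add_event(TimelineEvent(date(2024, 1, 1), "nysd", "Complaint filed"))
district.add_event(TimelineEvent(date(2024, 6, 1), "nysd", "Judgment entered"))

appeal = CaseRecord("Doe v. Acme Corp.", {("ca2", "24-1234")})
appeal.add_event(TimelineEvent(date(2024, 6, 20), "ca2", "Notice of appeal docketed"))

merged = merge_cases(district, appeal)
for ev in merged.timeline:
    print(ev.occurred_on, ev.court, ev.description)
```

Because events are frozen dataclasses, equality is by value, so re-ingesting the same filing does not create a duplicate timeline entry.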

Batch Processing (Every 15 Minutes)

After data ingestion, a batch process runs every 15 minutes to trigger case updates, generate AI content (summaries, timeline narratives), update judicial analytics, and refresh clustering results. This keeps Juryvine current: new cases typically appear in search results within 30-60 minutes of filing.

AI Methodology: What AI Does and Doesn't Do

What AI Does: Readability & Synthesis

Juryvine uses Claude (Anthropic) and GPT models for three purposes:

• Case Summaries: Given a case caption and docket entries, generate a plain-English summary explaining the legal dispute, key parties, and current status. The summary is derived from court filings, not invented.
• Timeline Narration: Convert a list of legal docket entries (e.g., "Motion for Preliminary Injunction filed", "Order denying Motion for Preliminary Injunction") into readable English sentences explaining what happened and why it matters. The events are real; the narration is AI-generated prose.
• Article Synthesis: Combine multiple news articles from GDELT about the same case into a cohesive original analysis. The articles are all sourced; the synthesis is AI-generated.

What AI Does NOT Do: No Fact Invention

Our AI systems are explicitly constrained to avoid fabrication:

• AI cannot and does not generate docket numbers, filing dates, judge names, or court assignments. All factual case metadata comes directly from PACER or CourtListener.
• AI cannot invent case outcomes or verdict amounts. All case disposition data is sourced from court records.
• AI cannot hallucinate press coverage. All media citations are real articles from GDELT with source attribution.
• AI is used for readability, explanation, and synthesis, not for fact creation. If AI generates incorrect prose about a case, that is an error we work to fix; if AI invents a fact, that is a fundamental violation of our methodology.
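One way a constraint like this could be checked after generation is to scan AI prose for docket-number-like strings and flag any that are absent from the sourced metadata. The sketch below is an assumption about how such a guard might work, not a description of Juryvine's safeguards. The regular expression covers only one common district-court docket format, and the function name is hypothetical.

```python
import re

# Matches strings shaped like "1:24-cv-00001" (an assumed, simplified format).
DOCKET_RE = re.compile(r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{3,6}\b", re.IGNORECASE)


def unsupported_docket_numbers(summary, known_dockets):
    """Return docket numbers cited in the summary but missing from the source data."""
    found = set(DOCKET_RE.findall(summary))
    return sorted(found - set(known_dockets))


summary = ("In 1:24-cv-00001, the court denied the motion; "
           "see also 1:24-cv-99999.")
known = {"1:24-cv-00001"}
print(unsupported_docket_numbers(summary, known))
```

A non-empty result would be a signal to hold the text for editorial review rather than publish it.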

Judicial Analytics: Computed from Actual Outcomes

For each judge in our database, we compute the following metrics directly from case outcomes:

• Total Cases: Count of all cases to which this judge is assigned, drawn from PACER records.
• Average Time to Ruling: Reported as the median (not the mean) number of days from case filing to disposition, computed from docket dates.
• Reversal Rate: For appellate judges, percentage of their decisions that are overturned on further appeal, computed from case outcomes.
• Plaintiff Win Rate: In cases with a clear plaintiff/defendant designation, percentage of cases decided in favor of the plaintiff, computed from dispositions.
• Case Type Distribution: Breakdown of the judge's docket by case type (IP, securities fraud, employment, etc.), drawn from case type classifications.

These are not estimates, crowd-sourced opinions, or machine learning inferences. They are computed directly from data we have verified from court records. They can change as new cases are resolved or when we discover errors in our underlying data.
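To make the definitions concrete, here is a small sketch that computes four of these metrics from a list of case records. The record layout (`filed`, `disposed`, `winner`, `type`) and the sample data are invented for illustration; Reversal Rate is omitted because it needs appellate outcome links that this toy data does not model.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median

FILED = date(2023, 1, 1)

# Invented sample docket for one judge; "disposed" is None for open cases.
cases = [
    {"filed": FILED, "disposed": FILED + timedelta(days=100),
     "winner": "plaintiff", "type": "IP"},
    {"filed": FILED, "disposed": FILED + timedelta(days=200),
     "winner": "defendant", "type": "employment"},
    {"filed": FILED, "disposed": FILED + timedelta(days=30),
     "winner": "plaintiff", "type": "IP"},
    {"filed": FILED, "disposed": None, "winner": None, "type": "securities"},
]


def judge_metrics(cases):
    closed = [c for c in cases if c["disposed"] is not None]
    days = [(c["disposed"] - c["filed"]).days for c in closed]
    # Only cases with a clear winner count toward the plaintiff win rate.
    decided = [c for c in closed if c["winner"] in ("plaintiff", "defendant")]
    wins = sum(1 for c in decided if c["winner"] == "plaintiff")
    return {
        "total_cases": len(cases),
        "median_days_to_disposition": median(days) if days else None,
        "plaintiff_win_rate": wins / len(decided) if decided else None,
        "case_types": dict(Counter(c["type"] for c in cases)),
    }


metrics = judge_metrics(cases)
print(metrics)
```

With this sample, the open case counts toward the total and the type breakdown but is excluded from the median and the win rate, which is why those figures can shift as cases resolve.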

Data Freshness & Latency

• PACER RSS: Polled every 30 minutes from all federal courts.
• CourtListener: Polled hourly for new opinions and docket updates.
• GDELT: Polled every 15 minutes for media coverage.
• Batch Processing: Case updates and AI content generation run every 15 minutes, picking up anything ingested since the previous run.

In practice, new cases typically appear in Juryvine search results 30-60 minutes after being docketed in PACER. Media coverage is refreshed 15-30 minutes after publication. Judicial analytics are updated whenever new case outcomes are recorded.
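A back-of-the-envelope way to relate these figures to the polling schedule: if an item lands just after a poll and then just misses a batch run, the scheduling delay is bounded by one poll interval plus one batch interval. The sketch below encodes only that arithmetic. It assumes processing time and any delay in the upstream feed itself are zero, which is a simplification, so observed latency can be higher.

```python
# Intervals in minutes, taken from the schedule listed above.
POLL_MINUTES = {"pacer_rss": 30, "courtlistener": 60, "gdelt": 15}
BATCH_MINUTES = 15


def worst_case_delay(source):
    """Upper bound on scheduling delay: one poll interval plus one batch interval."""
    return POLL_MINUTES[source] + BATCH_MINUTES


for name in POLL_MINUTES:
    print(name, worst_case_delay(name), "minutes")
```

Under these assumptions the bound is 45 minutes for PACER RSS and 30 minutes for GDELT, in line with the typical ranges quoted above.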

Quality Controls & Editorial Review

• Multi-Strategy Deduplication: Cases are matched across sources using docket number, external ID, and fuzzy title matching. Manual review flags are generated for edge cases.
• AI Content Review: AI-generated summaries, timelines, and articles undergo editorial review before publication to flag errors, contradictions, or unclear passages.
• Source Attribution: Every article, case event, and judicial metric is tagged with its source. Users can trace any fact back to the original court record or news article.
• Error Correction: We maintain a feedback channel for users to report errors. Corrections are made and logged.

Coverage & Scale

Current coverage:

• All 94 federal district courts
• All 13 federal circuit courts of appeals
• The U.S. Supreme Court
• Approximately 946 new federal cases filed per day on average, all tracked
• 29+ federal court locations currently indexed with PACER feeds

Limitations & Transparency

We acknowledge the following limitations:

• AI errors: AI-generated summaries and timelines can contain errors. We encourage users to verify critical details by consulting the primary court records (PACER filings, opinions).
• Incomplete case histories: Historical cases may not have complete timeline coverage if older events are not available in PACER or CourtListener.
• Media bias: Our press coverage is sourced from news articles, which reflect editorial choices and media bias. We do not attempt to filter or correct for bias.
• State court cases: Juryvine focuses exclusively on federal courts. State litigation is out of scope.

Our commitment is to transparency. When you use Juryvine, you are working with a system that synthesizes data from verified sources and is explicit about where AI is used, where data comes from, and what limitations exist.