DocTruth

Open-source, auditable LLM extraction for Java enterprise systems. Every extracted field carries its source page and line, a confidence score, and bi-temporal provenance.

Java 25+ · Maven 0.2.0-alpha · License Apache 2.0
[Figure: Source contract fields connected to a Java record result with page, line, confidence, and provenance evidence.]
[Figure: DocTruth source-cited extraction overview showing source document citations, confidence, provenance, and audit-ready export.]

Enterprise AI dies at one question.

When a Java backend extracts a value from a document, DocTruth keeps the source location, match quality, confidence, and provenance attached.

[Figure: Audit gate comparison showing extraction output without source citations blocked and evidence-grounded output marked auditable.]

Prompt for page numbers

LLMs hallucinate citations when the contract is not enforced.

String-match output back to PDFs

Brittle on tables, columns, scans, and formatting drift.

Glue callbacks onto LangChain4j

The app inherits framework coupling without evidence guarantees.

Run a Python service beside Java

Java enterprise teams reject the extra runtime topology.

Small primitives, auditable output.

Not a LangChain clone. Not Spring AI. A focused evidence layer that drops into any Java backend already calling OpenAI-compatible endpoints, Anthropic, or Gemini.

[Figure: DocTruth capability flow from parsing PDF, DOCX, XLSX, and CSV to schema validation, evidence attachment, confidence scores, provenance, and audit JSON.]

01

Document parsing

PDF/DOCX/XLSX/CSV into structured sections with page, line, and offset preserved.

02

Evidence-attributed extraction

LLM calls are wrapped by a citation contract; missing evidence triggers validation and retry.
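
To make the validate-and-retry idea concrete, here is a minimal sketch in plain Java. It is not DocTruth's implementation: `FieldResult`, `CitationContract`, and the verbatim-substring evidence check are hypothetical stand-ins chosen for illustration, and the extractor is any function you supply.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.IntFunction;

// Hypothetical types for illustration; these are not DocTruth's actual classes.
record FieldResult(String name, String value, String quote) {}

class CitationContract {
    /** A field passes when its quote is non-blank and occurs verbatim in the source text. */
    static boolean hasEvidence(FieldResult field, String sourceText) {
        return field.quote() != null
                && !field.quote().isBlank()
                && sourceText.contains(field.quote());
    }

    /**
     * Calls the extractor up to maxAttempts times (passing the 1-based attempt number)
     * and returns the first result in which every field has evidence.
     * Returns empty if no attempt satisfies the contract.
     */
    static Optional<List<FieldResult>> extractWithRetry(
            IntFunction<List<FieldResult>> extractor, String sourceText, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<FieldResult> fields = extractor.apply(attempt);
            boolean allCited = fields.stream().allMatch(f -> hasEvidence(f, sourceText));
            if (allCited) {
                return Optional.of(fields);
            }
        }
        return Optional.empty();
    }
}
```

Returning `Optional.empty()` rather than the last uncited result keeps the failure visible to the caller, which is the point of enforcing the contract in code instead of in the prompt.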

03

Smart context assembly

Priority truncation, sliding windows, and hierarchical context for over-window documents.
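
As an illustration of the sliding-window strategy only, the sketch below splits a document's lines into overlapping windows while keeping the 1-based number of each window's first line, so a citation found inside a window can be mapped back to the source. `SlidingWindow` and its line-based sizing are assumptions made for this example; a real implementation would more likely budget by tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of DocTruth's documented API.
class SlidingWindow {
    /** A window over source lines, with the 1-based number of its first line. */
    record Window(int firstLine, List<String> lines) {}

    /** Splits lines into windows of at most `size` lines that share `overlap` lines. */
    static List<Window> split(List<String> lines, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<Window> windows = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < lines.size(); start += step) {
            int end = Math.min(start + size, lines.size());
            windows.add(new Window(start + 1, List.copyOf(lines.subList(start, end))));
            if (end == lines.size()) {
                break; // the last window already reaches the end of the document
            }
        }
        return windows;
    }
}
```

The overlap exists so that a value whose evidence straddles a window boundary still appears whole in at least one window.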

04

Fluent Java API

DocTruth.from(provider).extract(...).withProvenance().run(doc).

05

Audit export

PROV-O JSON-LD, confidence, retry count, model version, extracted_at, source_published_at.
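
For readers unfamiliar with PROV-O, the sketch below shows roughly what a JSON-LD record for one extracted field could look like. The document shape, the `urn:` identifiers, and the mapping of `extracted_at` to `prov:generatedAtTime` are assumptions for illustration; DocTruth's actual audit schema may differ. Only the `prov:` terms themselves come from the W3C vocabulary.

```java
import java.time.Instant;

// Hypothetical shape for illustration; DocTruth's actual audit schema may differ.
class AuditJsonSketch {
    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    static String q(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Renders one extracted field as a PROV-O entity generated by an extraction activity. */
    static String toJsonLd(String field, String value, String model,
                           Instant sourcePublishedAt, Instant extractedAt) {
        return """
                {
                  "@context": {
                    "prov": "http://www.w3.org/ns/prov#",
                    "source_published_at": "urn:example:source_published_at"
                  },
                  "@id": %s,
                  "@type": "prov:Entity",
                  "prov:value": %s,
                  "prov:generatedAtTime": %s,
                  "source_published_at": %s,
                  "prov:wasGeneratedBy": {
                    "@type": "prov:Activity",
                    "prov:wasAssociatedWith": {"@type": "prov:SoftwareAgent", "@id": %s}
                  }
                }
                """.formatted(
                q("urn:field:" + field), q(value), q(extractedAt.toString()),
                q(sourcePublishedAt.toString()), q("urn:model:" + model));
    }
}
```

In production code a JSON library should do the serialization; the hand-rolled `q` helper is here only to keep the sketch dependency-free.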

Install in one coordinate.

Published on Maven Central. No framework runtime, no Python service, no extra deployment topology.

Maven
<dependency>
  <groupId>ai.doctruth</groupId>
  <artifactId>doctruth-java</artifactId>
  <version>0.2.0-alpha</version>
</dependency>
Gradle
implementation "ai.doctruth:doctruth-java:0.2.0-alpha"
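Java
import java.math.BigDecimal;
import java.nio.file.Path;
import java.time.Instant;
// DocTruth imports omitted: package names are not shown on this page.

// Target schema for the extraction.
record Contract(String partyA, String partyB, BigDecimal totalValue) {}

var doc = PdfDocumentParser.parse(Path.of("contract.pdf"));
var result = DocTruth.from(new OpenAiProvider(System.getenv("OPENAI_API_KEY")))
    .extract("Extract the contract terms", Contract.class)
    .withProvenance()
    .withSourcePublishedAt(Instant.parse("2026-01-01T00:00:00Z"))
    .withBitemporal()
    .withConfidence()
    .run(doc);

// Citation and confidence are looked up by field name.
Citation cite = result.citations().get("totalValue");
Confidence conf = result.confidence().get("totalValue");
// Write the audit trail as JSON-LD.
result.toAuditJson(Path.of("audit/contract.jsonld"));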

A small explicit API for reconstructable extraction.

The extracted value alone is not enough. The result also carries the exact quote, source location, match score, confidence rationale, model, version, and timestamps.

  • Per-field citations: page, line, quote, match score
  • Confidence scores with rationale, not only a number
  • Bi-temporal provenance for source time vs extraction time
  • Audit JSON that downstream systems can ingest
Open API documentation
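
Bi-temporal provenance separates when the source was published from when the value was extracted. A minimal sketch of the kind of query this enables follows, assuming a hypothetical `TimedValue` record and `asOf` helper that are not DocTruth's actual `Provenance` type.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical record for illustration; not DocTruth's actual Provenance type.
record TimedValue(String field, String value, Instant sourcePublishedAt, Instant extractedAt) {}

class BitemporalQuery {
    /**
     * Answers: "what did we believe at knownAt about the source as of validAt?"
     * Considers only sources published at or before validAt and extractions
     * performed at or before knownAt, then picks the most recently published
     * source; ties go to the latest extraction.
     */
    static Optional<TimedValue> asOf(List<TimedValue> history, String field,
                                     Instant validAt, Instant knownAt) {
        return history.stream()
                .filter(v -> v.field().equals(field))
                .filter(v -> !v.sourcePublishedAt().isAfter(validAt))
                .filter(v -> !v.extractedAt().isAfter(knownAt))
                .max(Comparator.comparing(TimedValue::sourcePublishedAt)
                        .thenComparing(TimedValue::extractedAt));
    }
}
```

With both timestamps stored, an auditor can replay what the system knew on a given date even after a newer version of the contract has been processed.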

Non-silent by design.

A failed citation match is not allowed to vanish. It emits a warning and surfaces a low match score so the caller can decide how to handle risk.

Source: Page 4 · Line 18 · exact quote
Match: Jaro-Winkler score 0.97
Confidence: field score derived from citation match quality
Time: source_published_at + extracted_at
Export: W3C PROV-O JSON-LD
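
The match score above is a Jaro-Winkler similarity. The sketch below is a textbook implementation (prefix scale 0.1, prefix capped at 4 characters, no boost threshold) plus a hypothetical `check` helper that warns instead of failing silently. The class name, the threshold parameter, and the use of `System.Logger` are assumptions for illustration, not DocTruth's internals.

```java
// Hypothetical helper for illustration; not DocTruth's actual matching code.
class CitationMatch {
    record Match(double score, boolean warning) {}

    /** Jaro similarity in [0, 1]. */
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a.charAt(i) != b.charAt(j)) outOfOrder++;
            j++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters. */
    static double jaroWinkler(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    /** Scores a quote against source text and logs a warning when below the threshold. */
    static Match check(String quote, String sourceText, double threshold) {
        double score = jaroWinkler(quote, sourceText);
        boolean warning = score < threshold;
        if (warning) {
            System.getLogger("citation").log(System.Logger.Level.WARNING,
                    String.format("Low citation match score %.2f for quote: %s", score, quote));
        }
        return new Match(score, warning);
    }
}
```

The caller always receives the score and the warning flag, so a weak match is surfaced as data rather than dropped.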

What is open source.

  • Provider clients for OpenAI-compatible endpoints, Anthropic, and Gemini
  • PDF/DOCX/XLSX/CSV parsing primitives
  • Citation, Confidence, Provenance records
  • Context strategies and audit JSON export
  • Current SPI hooks for signing, audit events, and OCR

What it refuses to become.

  • Not an agent framework
  • Not a vector-store wrapper
  • Not a Spring or LangChain4j plugin
  • Not a UI viewer
  • Not document Q&A or translation