PdfDocumentParser, DocxDocumentParser, XlsxDocumentParser, CsvDocumentParser API docs
Evidence-grounded extraction in Java.
Parse a document, extract a typed Java record, and inspect the citation, confidence, and provenance attached to each field.
Install
Use the published alpha from Maven Central.
<dependency>
  <groupId>ai.doctruth</groupId>
  <artifactId>doctruth-java</artifactId>
  <version>0.2.0-alpha</version>
</dependency>
1. Parse a source document
Parsers preserve document identity, metadata, section structure, and source locations so later extraction can cite page and line.
ParsedDocument doc = PdfDocumentParser.parse(Path.of("contract.pdf"));
String docId = doc.docId();
List<ParsedSection> sections = doc.sections();
DocumentMetadata metadata = doc.metadata();
2. Extract a typed record
The builder options make citation, confidence, retries, and bitemporal provenance an explicit part of the extraction contract.
record Contract(String partyA, String partyB, BigDecimal totalValue) {}
ExtractionResult<Contract> result = DocTruth.from(new OpenAiProvider(System.getenv("OPENAI_API_KEY")))
    .extract("Extract the contract terms", Contract.class)
    .withProvenance()
    .withSourcePublishedAt(Instant.parse("2026-01-01T00:00:00Z"))
    .withBitemporal()
    .withConfidence()
    .withMaxRetries(2)
    .run(doc);
3. Inspect evidence
Values are useful only when callers can reconstruct where they came from. Field paths map to citations, confidence rationales, and provenance.
Contract value = result.value();
Citation cite = result.citations().get("totalValue");
SourceLocation loc = cite.location();
double matchScore = cite.matchScore();
Confidence confidence = result.confidence().get("totalValue");
Provenance provenance = result.provenance();
4. Use a context strategy
Large documents should not be blindly truncated. Prioritize the sections that matter for the extraction task.
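As a rough illustration of the idea, independent of the library, the sketch below selects whole sections in priority order until a budget is spent. Everything in it is an assumption made for illustration: the Section record, the select method, and the choice to measure the budget in characters are hypothetical and are not DocTruth's types or its actual PriorityTruncate behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are stand-ins, not DocTruth classes.
final class PrioritySelectionSketch {

    /** A section reduced to a title and its text (hypothetical stand-in). */
    record Section(String title, String text) {}

    /**
     * Keeps sections in priority order until the character budget is spent.
     * Sections whose titles are not listed are considered last, in document order.
     * A section that does not fit is skipped whole rather than cut mid-text.
     */
    static List<Section> select(List<Section> sections,
                                List<String> priorityTitles,
                                int budgetChars) {
        // Listed titles first, in the order the caller ranked them.
        List<Section> ordered = new ArrayList<>();
        for (String title : priorityTitles) {
            for (Section s : sections) {
                if (s.title().equals(title)) {
                    ordered.add(s);
                }
            }
        }
        // Then everything else, in original document order.
        for (Section s : sections) {
            if (!priorityTitles.contains(s.title())) {
                ordered.add(s);
            }
        }
        // Greedily keep whatever still fits in the remaining budget.
        List<Section> kept = new ArrayList<>();
        int used = 0;
        for (Section s : ordered) {
            int cost = s.text().length();
            if (used + cost <= budgetChars) {
                kept.add(s);
                used += cost;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Section> doc = List.of(
            new Section("Preamble", "x".repeat(40)),
            new Section("Contract Terms", "y".repeat(30)),
            new Section("Scoring Criteria", "z".repeat(30)));
        List<Section> kept = select(doc, List.of("Scoring Criteria", "Contract Terms"), 70);
        kept.forEach(s -> System.out.println(s.title()));
    }
}
```

Skipping a section that does not fit, rather than cutting it mid-text, is one possible design choice; it keeps every retained passage intact so that page and line citations into it remain meaningful.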
ContextStrategy strategy = new PriorityTruncate(
    List.of("Qualifications", "Scoring Criteria", "Contract Terms"),
    25_000,
    OverBudgetPolicy.STRICT
);
ExtractionResult<Contract> result = DocTruth.from(provider)
    .extract("Extract contract terms", Contract.class)
    .withContextStrategy(strategy)
    .withProvenance()
    .run(doc);
Public surface
ParsedDocument, ParsedSection, TextSection, TableSection, FigureSection, SourceLocation
DocTruth, ExtractionBuilder<T>, ExtractionResult<T>
Citation, Confidence, Provenance
ContextStrategy, PriorityTruncate, SlidingWindow, Hierarchical
OpenAiProvider, AnthropicProvider, GeminiProvider, LlmProvider
ParseException, ExtractionException, ProviderException
Contract rule
Missing source evidence is a validation problem. When citation matching is weak, DocTruth surfaces a low match score instead of silently dropping the field's evidence chain.
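A caller could enforce this rule with a check along the following lines. This is a minimal sketch under stated assumptions: the weakFields helper is hypothetical, match scores are assumed to lie between 0 and 1, and the 0.8 threshold is an arbitrary illustrative value rather than a library default. In real code the scores would be read from each field's Citation via matchScore().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a caller-side validation pass, not part of DocTruth.
final class EvidenceCheckSketch {

    /**
     * Returns the required field paths whose evidence is missing entirely
     * or whose match score falls below minScore, in the order given.
     */
    static List<String> weakFields(List<String> requiredFields,
                                   Map<String, Double> matchScores,
                                   double minScore) {
        List<String> weak = new ArrayList<>();
        for (String field : requiredFields) {
            Double score = matchScores.get(field);
            // A missing entry is treated the same as a weak match.
            if (score == null || score < minScore) {
                weak.add(field);
            }
        }
        return weak;
    }

    public static void main(String[] args) {
        // Assumed scores for illustration; partyB has no citation at all.
        Map<String, Double> scores = Map.of("partyA", 0.95, "totalValue", 0.42);
        List<String> weak = weakFields(List.of("partyA", "partyB", "totalValue"), scores, 0.8);
        System.out.println("Fields needing review: " + weak);
    }
}
```

Treating a missing citation and a low score the same way keeps the check aligned with the rule above: either case sends the field to review instead of letting it pass silently.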