Charlie Axelbaum

What LLMs Get Wrong About Financial Analysis

LLMs are genuinely useful in finance workflows. They're also wrong in ways that are subtle and hard to detect — which makes careful deployment design essential.

January 15, 2025 · 7 min read

The Problem with Confident Wrongness

LLMs are confident by default. They produce fluent, well-reasoned-sounding text whether or not the underlying reasoning is correct. In most use cases, a confident wrong answer is obvious — it fails the vibe check. In finance, a confident wrong answer can look indistinguishable from a correct one.

This isn't a reason to avoid LLMs in finance. It's a reason to design systems that account for it.

The Main Failure Modes

Numerical Reasoning

LLMs are not calculators. They can reason about numerical relationships in language ("revenue grew faster than costs, implying margin expansion") but they frequently make arithmetic errors when you ask them to compute things.

Design rule: never ask an LLM to compute. Ask it to reason. Pipe computation through a dedicated tool or code execution environment.
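A minimal sketch of that split: the model's job ends at identifying the operands and the relationship; a deterministic function produces the number. The function name and figures below are illustrative, not part of any real system.

```python
# Sketch: keep arithmetic out of the model. The LLM (not shown) extracts
# the operands from the document; deterministic code does the math.
# `margin_expansion_bps` and the numbers are illustrative assumptions.

def margin_expansion_bps(rev0: float, cost0: float,
                         rev1: float, cost1: float) -> float:
    """Change in operating margin between two periods, in basis points."""
    m0 = (rev0 - cost0) / rev0
    m1 = (rev1 - cost1) / rev1
    return (m1 - m0) * 10_000

# The model can say "revenue grew faster than costs, implying margin
# expansion"; the figure itself comes from code:
delta = margin_expansion_bps(rev0=100.0, cost0=80.0, rev1=120.0, cost1=90.0)
print(round(delta, 1))  # 500.0 — margin went from 20% to 25%
```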

Temporal Confusion

Training data cutoffs create subtle errors. An LLM may confidently describe a company's current management team, strategy, or financial position based on stale data. The error is plausible enough that it may not trigger review.

Design rule: ground any factual claim in freshly retrieved data. Don't rely on the model's parametric knowledge for facts that change.
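One way to enforce that rule is at the prompt layer: inject only freshly retrieved facts, stamp them with a retrieval date, and instruct the model to refuse rather than fall back on memory. The prompt wording and the sample fact below are placeholders, not a real retrieval API.

```python
# Sketch of grounding a prompt in freshly retrieved data rather than the
# model's training-time knowledge. The facts, names, and prompt wording
# are illustrative assumptions.
from datetime import date

def build_grounded_prompt(question: str, facts: list[str], as_of: date) -> str:
    context = "\n".join(f"- {f}" for f in facts)
    return (
        f"Answer using ONLY the facts below (retrieved {as_of.isoformat()}).\n"
        f"If the facts are insufficient, say so instead of guessing.\n\n"
        f"Facts:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    question="Who is the company's current CFO?",
    facts=["10-Q filed 2025-01-10 lists Jane Doe as CFO."],  # hypothetical
    as_of=date(2025, 1, 14),
)
print(prompt)
```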

Source Conflation

LLMs are trained to be helpful. When asked about something they're uncertain about, they'll synthesize a plausible-sounding answer from related information — which in finance often means hallucinating a statistic, misattributing a data point, or conflating two similar but distinct securities.

Design rule: require citations. If the model can't point to a source for a specific claim, the claim should be flagged as unverified.

Regulatory and Legal Misinterpretation

Financial regulations are technical and jurisdiction-specific. LLMs frequently misinterpret or oversimplify regulatory requirements — sometimes in ways that could create material compliance exposure if acted on uncritically.

Design rule: legal and regulatory questions should go through validated retrieval pipelines, not raw LLM generation.
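A crude version of that gate: detect regulatory queries and refuse to generate unless a vetted corpus returned supporting passages. The keyword classifier and stubbed handlers below are deliberately simplistic placeholders; a production system would use a real classifier and retriever.

```python
# Sketch of a routing gate: regulatory queries are only answered when a
# vetted corpus returns supporting passages; otherwise the system declines.
# The term list and handler strings are illustrative assumptions.

REGULATORY_TERMS = {"sec", "finra", "mifid", "basel", "rule", "regulation"}

def looks_regulatory(question: str) -> bool:
    return any(term in question.lower() for term in REGULATORY_TERMS)

def answer(question: str, vetted_passages: list[str]) -> str:
    if looks_regulatory(question) and not vetted_passages:
        return "DECLINED: no vetted regulatory source retrieved."
    # ...otherwise pass question + passages to the generation step
    return "OK: answer grounded in retrieved passages."

print(answer("What does SEC Rule 10b-5 prohibit?", vetted_passages=[]))
```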

What They're Actually Good At

Despite the failure modes, there's genuine value here:

  • Language understanding — parsing complex prose in filings, transcripts, agreements
  • Summarization — producing coherent first-pass summaries of long documents
  • Pattern recognition — identifying when language in a document is unusual relative to boilerplate
  • Structured extraction — pulling specific fields from unstructured documents with schema validation
  • First-pass comparables — surface-level comparisons that a human then validates

The Design Principle

The common thread: LLMs are good at the parts of financial analysis that are about language and pattern. They're unreliable at the parts that require precision, recency, or domain-specific technical judgment.

Design systems that play to the strength and protect against the weakness. That means retrieval for facts, code execution for computation, human review for judgment, and LLMs for the language layer in between.
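The whole principle can be condensed into a dispatch table, with the cautious path as the default. The task labels and handler names are illustrative, not a prescribed taxonomy.

```python
# Sketch of the routing principle: each request type goes to the component
# suited to it. Labels and handlers are illustrative assumptions.

def route(task_type: str) -> str:
    handlers = {
        "computation": "code_execution",   # never the model
        "factual_lookup": "retrieval",     # never parametric memory
        "judgment_call": "human_review",   # never fully automated
        "summarize": "llm",                # the language layer
        "extract": "llm_with_schema_validation",
    }
    return handlers.get(task_type, "human_review")  # default to the cautious path

print(route("computation"))    # code_execution
print(route("unknown_task"))   # human_review
```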

Tags

LLM · finance · limitations · design