The Art of Not Being Fooled by Your Own AI

"The first principle is that you must not fool yourself — and you are the easiest person to fool."
— Richard Feynman

He said that about science. But it applies rather beautifully to Large Language Models, which have elevated self-deception to an industrial process. They don't mean to lie. They don't even know what lying is. They just produce sequences of tokens that look spectacularly plausible — and sometimes happen to be complete nonsense.

The question is: what do you do about it?


1. The Problem, Stated Plainly

Here is a sentence a Large Language Model might produce about an industrial motor:

"The hydraulic motor HM-2847 has 18,500 operating hours. The recommended maintenance interval is 25,000 hours. Next maintenance is due at 15,000 hours."

Read it again. If you didn't flinch, congratulations — you think like an LLM. The next maintenance is scheduled in the past. The maintenance interval exceeds the motor's physical lifespan. The whole thing is beautifully written and completely wrong.

This is the hallucination problem, and it is not a bug. It's a feature. LLMs are trained to optimize a loss function over token prediction:

\[\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)\]

Notice what's being optimized here: the probability of the next word given the previous words. Not truth. Not consistency. Not physical possibility. Just: "what word would a human probably put here?" If confident-sounding nonsense is what humans write on the internet — and oh boy, is it ever — then confident-sounding nonsense is what you get.
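To make that concrete, here is a toy calculation. The per-token probabilities are invented for illustration; the point is structural: under the loss above, a fluent-but-false continuation whose tokens the model finds likely scores strictly better than a clumsy-but-true one.

```python
import math

# Hypothetical per-token probabilities a model assigns to two continuations.
# The numbers are made up; only their ordering matters.
p_fluent_but_false = [0.9, 0.8, 0.85]   # confident-sounding nonsense
p_clumsy_but_true  = [0.3, 0.2, 0.25]   # awkward phrasing, correct content

def nll(token_probs):
    """Negative log-likelihood: the summand of L(theta) for one sequence."""
    return -sum(math.log(p) for p in token_probs)

# Lower loss = "better" to the optimizer. Truth never enters the equation.
print(nll(p_fluent_but_false) < nll(p_clumsy_but_true))  # True
```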

The model isn't broken. Your expectations are.


2. The Idea: Separation of Concerns (Finally)

The insight behind Logic-Guard-Layer is embarrassingly simple, the way all good ideas are embarrassingly simple after someone states them.

Neural networks are good at language. Symbolic logic is good at truth. So: let the neural network handle the language, and let symbolic logic handle the truth. Don't ask one system to do both jobs, because asking an LLM to be factually rigorous is like asking a poet to do your taxes — technically possible, but you're going to have a bad time.

The architecture looks like this:

    LLM Output               Logic-Guard-Layer                  Application
   ┌──────────┐      ┌─────────────────────────────┐      ┌──────────────┐
   │ "Motor   │      │                             │      │              │
   │  HM-2847 │─────▶│  Parse → Validate → Repair  │─────▶│  Validated   │
   │  has..." │◀─────│                             │      │  Output      │
   └──────────┘      └─────────────┬───────────────┘      └──────────────┘
                                   │
                       ┌───────────┴───────────┐
                       ▼                       ▼
                 ┌──────────┐          ┌──────────────┐
                 │ Ontology │          │ Knowledge    │
                 │ (Rules)  │          │ Sources      │
                 └──────────┘          └──────────────┘

We don't retrain the LLM. We don't fine-tune it. We don't even talk to it about its feelings. We just check its homework.


3. Making It Precise: Claims and Validation

3.1 From Prose to Propositions

The first thing we need is a function \(\varphi\) that transforms unstructured text into something we can actually reason about. Define a claim as a five-tuple:

\[c = (\text{subject},\; \text{predicate},\; \text{object},\; \text{unit},\; \text{provenance})\]

So the sentence "The water level at Cologne station was 3.45m on July 15, 2024 at 14:00" becomes:

\[c = (\text{Station Köln},\; \text{hasWaterLevel},\; 3.45,\; \text{m},\; \text{2024-07-15T14:00Z})\]

The claim extractor \(\varphi: T \rightarrow 2^{\mathcal{S}}\) maps text to a set of claims. This is where the LLM is actually useful — it's remarkably good at this kind of structured extraction. We're playing to its strengths instead of pretending its weaknesses don't exist.
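As a sketch, the five-tuple maps naturally onto a small dataclass. The field names here are my own, not a fixed schema from the system:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Claim:
    """One element of the claim set phi(t); field names are illustrative."""
    subject: str
    predicate: str
    obj: Union[float, str]
    unit: Optional[str]
    provenance: str   # here: the observation timestamp

# The water-level sentence from above, formalized:
c = Claim("Station Köln", "hasWaterLevel", 3.45, "m", "2024-07-15T14:00Z")
```

In practice the LLM itself fills these fields via structured extraction; the dataclass only pins down what "a claim" means for everything downstream.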

3.2 Two Flavors of Wrong

Here is where the ontology people in the audience start nodding, and everyone else starts reaching for coffee.

In description logic, a knowledge base has two parts:

  • The TBox (terminological box) — the schema. "Motors have operating hours. Operating hours are non-negative numbers. The maximum lifespan of a hydraulic motor is 20,000 hours."
  • The ABox (assertional box) — the facts. "Motor HM-2847 exists. Its current operating hours are 12,500."

This gives us two completely different ways something can be wrong:

Schema violations (TBox): The claim violates the rules of the domain. A negative number of operating hours. A maintenance interval exceeding the physical lifespan. A temperature measurement in kilograms.

\[\kappa_S: \mathcal{S} \times \mathcal{T} \rightarrow \{\text{VALID}, \text{INVALID}\}\]
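A minimal sketch of \(\kappa_S\), with the TBox as a table of named rules. The two rules and the 20,000-hour lifespan cap are the examples from the text, encoded by hand rather than in OWL:

```python
# TBox rules as predicates over a claim dict; each returns True when satisfied.
TBOX = {
    "operating_hours_nonnegative":
        lambda c: c["predicate"] != "hasOperatingHours" or c["value"] >= 0,
    "interval_within_lifespan":   # hydraulic motor lifespan: 20,000 h
        lambda c: c["predicate"] != "hasMaintenanceInterval" or c["value"] <= 20_000,
}

def kappa_S(claim: dict, tbox=TBOX) -> str:
    """Schema check: VALID iff every TBox rule holds for the claim."""
    return "VALID" if all(rule(claim) for rule in tbox.values()) else "INVALID"

# The hallucinated interval from section 1 fails the lifespan rule:
bad = {"predicate": "hasMaintenanceInterval", "value": 25_000}
print(kappa_S(bad))  # INVALID
```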

Fact violations (ABox): The claim contradicts known data. The motor is claimed to have 15,000 hours but the database says 12,500. The water level is claimed to be 3.45m but the API says 2.87m.

\[\kappa_F: \mathcal{S} \times \mathcal{D} \rightarrow \mathcal{R}\]

Where \(\mathcal{R}\) is — and this is the part I'm actually proud of — not a binary set. More on that in a moment.


4. The Error Algebra, or: Why "Not Found" Is Not "Not True"

Here's where most validation systems go wrong, and here's where we do something that I think is genuinely interesting.

The naive approach says: you check a fact against a database. It's either there (true) or it's not (false). Binary. Simple. Wrong.

Consider: you ask a weather API for the temperature in Berlin on March 15, 1847. The API returns nothing. Is the claim false? Of course not. The API only has data from 2020 onwards. Your query is out of scope.

Or: you query a water level station by name, but the API uses station IDs. Your lookup fails. Is the entity fictional? No. Your lookup strategy was inadequate.

These are epistemically different situations, and collapsing them into a single "false" is how you get a system that cries wolf at everything. So we define a six-valued result algebra:

\[\mathcal{R} = \{\text{MATCH},\; \text{MISMATCH},\; \text{ABSENCE},\; \text{OUT\_OF\_SCOPE},\; \text{LOOKUP\_FAILURE},\; \text{UNKNOWN}\}\]

The decision tree is pleasingly mechanical:

  1. Did the source respond at all? No → LOOKUP_FAILURE (try again later)
  2. Is the query within the source's scope? No → OUT_OF_SCOPE (don't blame the data)
  3. Was the entity found? No → ABSENCE under Closed World Assumption, UNKNOWN under Open World Assumption
  4. Does the value match (within tolerance)? Yes → MATCH, No → MISMATCH
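The tree is mechanical enough to write down directly. This sketch hard-codes the four questions in order; the argument names are mine:

```python
from enum import Enum

class R(Enum):
    MATCH = 1; MISMATCH = 2; ABSENCE = 3
    OUT_OF_SCOPE = 4; LOOKUP_FAILURE = 5; UNKNOWN = 6

def classify(responded: bool, in_scope: bool, found: bool, closed_world: bool,
             value: float = None, expected: float = None, tol: float = 0.0) -> R:
    if not responded:            # 1. the source did not answer at all
        return R.LOOKUP_FAILURE
    if not in_scope:             # 2. the query is outside the source's scope
        return R.OUT_OF_SCOPE
    if not found:                # 3. entity missing: CWA vs. OWA
        return R.ABSENCE if closed_world else R.UNKNOWN
    return R.MATCH if abs(value - expected) <= tol else R.MISMATCH
```

The 1847 weather query against an API whose coverage starts in 2020 lands in `R.OUT_OF_SCOPE`, not in "false" — which is the whole point.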

The key innovation — if you'll permit me a moment of immodesty — is the explicit treatment of the Closed World Assumption (CWA) versus the Open World Assumption (OWA). A PostgreSQL database of registered equipment operates under CWA: if motor X isn't in the table, motor X doesn't exist. Period. A SPARQL knowledge graph operates under OWA: if motor X isn't in the graph, we simply don't know.

Different sources, different epistemology. The error algebra makes this operationalizable instead of philosophical.


5. The Self-Correction Loop, or: Teaching the Machine to Fix Its Own Mistakes (Without Making New Ones)

So validation has found errors. Now what?

The naive approach: throw everything away, regenerate. This is like burning down your house because the kitchen faucet leaks. You lose all the correct parts of the output.

The better approach: surgical repair. Tell the LLM exactly what's wrong and ask it to fix only that. Then validate again. Repeat until either everything passes or you give up.

Formally, one iteration is:

\[(t_{i+1}, s_{i+1}) = \text{Correct}\big(t_i,\; \kappa_S(s_i, \mathcal{T}) \cup \kappa_F(s_i, \mathcal{D})\big)\]

The correction prompt contains the original output, the specific violations with explanations, and — critically — the instruction that non-violated claims must remain unchanged.
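A skeleton of the loop, with the LLM calls abstracted behind `extract`/`validate`/`correct` callables. The function names and the iteration cap are my assumptions for the sketch, not the system's actual API:

```python
def repair_loop(text, extract, validate, correct, max_iters=3):
    """Iterate Correct(...) until validation passes or the budget runs out."""
    for i in range(max_iters):
        violations = validate(extract(text))   # kappa_S results plus kappa_F results
        if not violations:
            return text, i                     # converged after i corrections
        # LLM call: fix exactly these violations, leave valid claims untouched
        text = correct(text, violations)
    return text, None                          # gave up; the caller decides what to do

# Stub demonstration: one violation, fixed on the first correction.
fixed, iters = repair_loop(
    "due at 15000 h",
    extract=lambda t: [t],
    validate=lambda claims: ["due date in the past"] if "15000" in claims[0] else [],
    correct=lambda t, v: "due at 32000 h",
)
```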

5.1 When Does This Actually Work?

Let's be honest about convergence, because hand-waving about convergence is how you end up in infinite loops in production at 3 AM on a Saturday.

Theorem (Sufficient Convergence Conditions). The self-correction loop converges if:
1. The constraint set \(\mathcal{T}\) is consistent (no contradictory rules)
2. The LLM produces deterministic corrections
3. Each correction reduces the number of hard violations by at least 1
4. The knowledge sources return consistent results

Proof sketch. Let \(V_i\) be the number of hard violations at iteration \(i\). Under these conditions, \(V_{i+1} < V_i\) whenever \(V_i > 0\). Since \(V_i \in \mathbb{N}_0\) and the sequence is strictly decreasing while positive, it reaches \(V_{i^*} = 0\) in finitely many steps. \(\square\)

Now, the punchline: conditions (2) and (3) are never satisfied in practice. LLMs are stochastic. A correction that fixes error A can introduce error B. This is not a theoretical concern — it happens all the time. So we need stabilization mechanisms.

5.2 Cycle Detection

The simplest pathology: the LLM oscillates between states.

Iteration 1: "Value is 3.45m"  → Error → Correct to 2.87m
Iteration 2: "Value is 2.87m"  → Error → Correct to 3.45m
Iteration 3: "Value is 3.45m"  → We've been here before.

Solution: hash every output state. Store hashes. If a hash recurs, you're in a cycle. Stop.
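The hash-based check, sketched. SHA-256 is my choice here; any stable hash of the normalized output works:

```python
import hashlib

def detect_cycle(outputs):
    """Return the index at which a previously seen state recurs, else None."""
    seen = set()
    for i, text in enumerate(outputs):
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h in seen:
            return i          # we've been here before: abort the loop
        seen.add(h)
    return None

# The oscillation from the example above is caught on the third iteration:
print(detect_cycle(["Value is 3.45m", "Value is 2.87m", "Value is 3.45m"]))  # 2
```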

5.3 Semantic Drift

This is the subtle one, and it's the one that keeps me up at night.

You ask the LLM to fix the operating hours. It fixes the operating hours — and also quietly changes the weight from 450 kg to 380 kg. The weight was correct before. Now it isn't. Your correction made things worse.

We measure this with a drift metric:

\[\Delta_i = 1 - \frac{|\mathcal{S}_i^+ \cap \mathcal{S}_{i+1}^+|}{|\mathcal{S}_i^+|}\]

Where \(\mathcal{S}_i^+\) is the set of non-violated claims at iteration \(i\). A drift of 0 means perfect non-regression: everything that was correct stayed correct. A drift of 1 means total catastrophe: every correct claim got mangled.

In practice, if \(\Delta_i > \delta_{\max}\) (we use 0.05, i.e., 5%), we abort the correction and prefer the previous state. Better an honest "I don't know" than a repair that introduces more damage than it fixes.
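The metric as code, over sets of hashable claim tuples — a direct transcription of the formula:

```python
def drift(valid_prev: set, valid_next: set) -> float:
    """Delta_i = 1 - |intersection of valid claim sets| / |previous valid set|."""
    if not valid_prev:
        return 0.0            # nothing was correct before, so nothing can regress
    return 1.0 - len(valid_prev & valid_next) / len(valid_prev)

# The weight-mangling example: hours get fixed, but the correct weight claim vanishes.
before = {("HM-2847", "hasWeight", 450), ("HM-2847", "hasOperatingHours", 12_500)}
after  = {("HM-2847", "hasOperatingHours", 12_500)}
print(drift(before, after))   # 0.5 -- far above delta_max = 0.05, so: abort
```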


6. The Architecture: Five Layers, Because Four Wasn't Enough

The complete system is organized into five layers, which I'll describe from top to bottom because that's how the data flows, and also because architects love layers the way physicists love symmetry groups — somewhat excessively.

| Layer      | What It Does                                  | Why It Exists                |
|------------|-----------------------------------------------|------------------------------|
| Input      | Claim extraction, parsing                     | Turn prose into propositions |
| Validation | TBox + ABox checking, error algebra           | Find what's wrong and why    |
| Knowledge  | Source aggregation, consensus                 | Handle conflicting truths    |
| Adapter    | PostgreSQL, SPARQL, REST connectors           | Abstract over messy reality  |
| Correction | Repair loop, drift detection, cycle detection | Fix it without breaking it   |
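How the Adapter layer abstracts a source can be sketched with a small interface. The names are illustrative, and the in-memory adapter stands in for the PostgreSQL connector:

```python
from typing import Any, Optional, Protocol

class KnowledgeAdapter(Protocol):
    closed_world: bool                    # CWA (database) vs. OWA (knowledge graph)
    def lookup(self, subject: str, predicate: str) -> Optional[Any]: ...

class InMemoryAdapter:
    """Closed-world stand-in for a registered-equipment database."""
    closed_world = True
    def __init__(self, table: dict):
        self.table = table
    def lookup(self, subject, predicate):
        return self.table.get((subject, predicate))   # None means "does not exist"

db = InMemoryAdapter({("HM-2847", "hasOperatingHours"): 12_500})
```

Because `closed_world` is part of the interface, the error algebra above can decide between ABSENCE and UNKNOWN without knowing what kind of source it is talking to.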

The Knowledge layer deserves a brief elaboration, because it solves a problem that naive fact-checking ignores: sources disagree. Your PostgreSQL database says the motor has 12,500 hours. Your ERP system says 12,800. Your maintenance log says 13,100.

The consensus algorithm uses weighted voting:

\[\text{votes}[r] = \sum_{\text{adapter} \in A} w_{\text{adapter}} \cdot \mathbb{1}[\text{adapter.result} = r]\]

Where weights reflect source reliability. Master data systems get \(w = 1.0\). External APIs get \(w = 0.6\). That one spreadsheet Dave maintains on his desktop gets \(w = 0.0\) (just kidding — Dave's spreadsheet doesn't have an API. Yet.)
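The weighted vote, sketched with the example numbers from above (the source weights are the illustrative ones in the text):

```python
from collections import defaultdict

def consensus(results: dict, weights: dict):
    """Return the value with the highest total adapter weight."""
    votes = defaultdict(float)
    for adapter, value in results.items():
        votes[value] += weights.get(adapter, 0.0)
    return max(votes, key=votes.get)

winner = consensus(
    {"postgres_master": 12_500, "erp": 12_800, "maintenance_log": 13_100},
    {"postgres_master": 1.0, "erp": 0.6, "maintenance_log": 0.6},
)
print(winner)  # 12500 -- master data outvotes the two cheaper sources
```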


7. What This Is Not

Let me save you some time by clarifying what Logic-Guard-Layer is not, because the landscape of "things that claim to make LLMs reliable" is vast and mostly disappointing.

It's not JSON Schema validation. Yes, modern LLM APIs can enforce output schemas. Great — your output is syntactically valid JSON. It can also contain {"maintenance_interval": 25000} for a motor that dies at 20,000 hours. Schema validity is necessary but laughably insufficient.
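The gap in one assertion. The payload and the 20,000-hour cap are the running example; I check the "schema" by hand here rather than with a schema library:

```python
payload = {"motor": "HM-2847", "operating_hours": 18_500, "maintenance_interval": 25_000}

# "Schema" level: right keys, right types, non-negative integers. All fine.
schema_ok = all(isinstance(payload[k], int) and payload[k] >= 0
                for k in ("operating_hours", "maintenance_interval"))

# Domain level: the interval must not exceed the motor's 20,000 h lifespan.
domain_ok = payload["maintenance_interval"] <= 20_000

print(schema_ok, domain_ok)  # True False
```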

It's not a guardrail framework. Tools like NeMo Guardrails check for toxicity, bias, and topic drift. Useful for chatbots. Useless for catching the claim that a water level of -3 meters is physically meaningful.

It's not RAG. Retrieval-Augmented Generation gives the LLM access to facts. It does not verify that the LLM actually used those facts correctly. RAG is giving a student a textbook; Logic-Guard-Layer is grading the exam.

| Capability                | JSON Schema | Guardrails | RAG | Logic-Guard-Layer |
|---------------------------|:-----------:|:----------:|:---:|:-----------------:|
| Syntactic validation      | ✓           | –          | –   | ✓                 |
| Type constraints          | ✓           | –          | –   | ✓                 |
| Value range checks        | ✓           | –          | –   | ✓                 |
| Relational logic          | –           | –          | –   | ✓                 |
| Formal reasoning          | –           | –          | –   | ✓                 |
| Cross-source verification | –           | –          | –   | ✓                 |
| Self-correction           | –           | –          | –   | ✓                 |

8. Evaluation: Real Data, Real Problems

We evaluate against real-world public APIs — no synthetic datasets, no toy problems.

Track A (Physical Measurements):
- Water levels from PEGELONLINE (German federal waterway monitoring, 15-minute intervals)
- Weather data from Bright Sky (DWD open data aggregation)

Track B (Public Procurement):
- Tender notices from TED (Tenders Electronic Daily)

These sources have three properties that make them ideal for testing: they're authoritative (this is the official data), they're public (anyone can reproduce our results), and they're messy (data gaps, revisions, inconsistencies — just like the real world).

The evaluation tests three hypotheses:

H1 (Two-tier validation): Schema + fact validation catches significantly more errors than either alone.

H2 (Error algebra): The six-valued result algebra reduces false positives compared to naive NOT_FOUND = FALSE.

H3 (Repair loop): Self-correction with cycle detection and drift control improves Success@\(k\) while keeping drift under \(\delta_{\max}\).

Target metrics: \(p_{95}\) latency < 3 seconds, Success@3 > 90%, drift rate < 5%. These are goals, not guarantees. This is research, not marketing.
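For concreteness, Success@\(k\) as I compute it here — my own transcription, where `outcomes` records, per sample, the correction iteration at which validation first passed (or `None` if it never did):

```python
def success_at_k(outcomes, k: int) -> float:
    """Fraction of samples validated within at most k correction iterations."""
    hits = sum(1 for first_pass in outcomes if first_pass is not None and first_pass <= k)
    return hits / len(outcomes)

# Invented outcomes for five samples: passed on iteration 1, 3, 1, never, 2.
print(success_at_k([1, 3, 1, None, 2], k=3))  # 0.8
```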


9. Open Questions (The Honest Part)

I want to end with what we don't know, because a scientific paper that pretends to have all the answers is not a scientific paper — it's a press release.

Semantic loss during formalization. The mapping \(\varphi: T \rightarrow 2^{\mathcal{S}}\) is lossy. Natural language contains hedging ("the part should probably be replaced"), irony, and implicit context that binary logic cannot capture. How much information do we lose? We don't know yet.

Convergence in practice. The convergence theorem has conditions that real LLMs don't satisfy. In practice, does the loop converge often enough to be useful? Under what conditions does it fail? We have preliminary numbers. We don't have a general theory.

Ontology creation. The whole system is only as good as the ontology. Creating formal ontologies requires domain experts, and domain experts have better things to do than write OWL files. Can LLMs help create the ontologies that will be used to validate LLMs? The circularity is amusing. The research problem is real.

Latency. Schema reasoning plus multiple API calls plus consensus computation plus potential correction iterations — can all of this happen fast enough for interactive use? Preliminary profiling is encouraging. Production-scale testing hasn't happened yet.


10. The Punchline

Large Language Models are the most impressive bullshit generators ever built. I mean that as a compliment — they have mastered the art of producing text that looks right, and for many applications, looking right is good enough.

But for applications where being right matters — maintenance systems, medical records, legal documents, procurement validation, anything where a wrong number can cost money or lives — looking right isn't good enough. You need a system that actually checks.

Logic-Guard-Layer is that system. It combines the linguistic flexibility of neural networks with the logical rigor of symbolic reasoning. The LLM does what it's good at (language). The ontology does what it's good at (logic). And a carefully designed validation pipeline makes sure they play nicely together.

Is it perfect? No. Is it better than trusting the LLM alone? By a rather comfortable margin.

As Feynman might have put it: the LLM writes the poetry. The Logic-Guard-Layer does the arithmetic. And in any system where arithmetic matters more than poetry, you want both — but you really want someone checking the math.


This post describes experimental development research on neuro-symbolic AI validation systems. The architecture specification, formal definitions, and evaluation framework are documented in the accompanying scientific paper. Feedback, skepticism, and particularly well-constructed counterexamples are welcome.

