<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://realtimdunbar.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://realtimdunbar.github.io/" rel="alternate" type="text/html" /><updated>2026-03-25T12:57:37+00:00</updated><id>https://realtimdunbar.github.io/feed.xml</id><title type="html">Tim Dunbar</title><subtitle>Explorer</subtitle><entry><title type="html">Stepping Back: A Literature Review of LLMs for Automated Theorem Proving</title><link href="https://realtimdunbar.github.io/stepping-back-a-formal-literature-review/" rel="alternate" type="text/html" title="Stepping Back: A Literature Review of LLMs for Automated Theorem Proving" /><published>2026-03-25T12:00:00+00:00</published><updated>2026-03-25T12:00:00+00:00</updated><id>https://realtimdunbar.github.io/stepping-back-a-formal-literature-review</id><content type="html" xml:base="https://realtimdunbar.github.io/stepping-back-a-formal-literature-review/"><![CDATA[<h2 id="why-im-doing-this-now">Why I’m Doing This Now</h2>

<p>Four posts into the Symbolic series, I’ve built a pipeline, scraped GitHub, validated specs through SANY and TLC, and arrived at a humbling number: <strong>zero</strong> TLA+ specifications in my dataset survive end-to-end model checking. Eight unique specs passed syntax validation. That’s it.</p>

<p>The honest thing to do here is not to grind harder on the same approach. It’s to stop and ask: <strong>what has everyone else already figured out?</strong></p>

<p>I should have done a literature review before writing a single line of code. That’s not hindsight — that’s methodology. In graduate school, a literature review is the first chapter of your thesis for a reason. It prevents you from reinventing wheels, reveals approaches you’d never consider, and — crucially — tells you where the open problems actually are.</p>

<p>So this post isn’t about Symbolic’s architecture or my dataset pipeline. It’s about the process of conducting a formal literature review at the intersection of large language models and automated theorem proving — what tools I’m using, how I’m organizing the search, and how I plan to synthesize what I find.</p>

<p>The actual findings will come in the next post. This one is about the method.</p>

<h2 id="what-is-a-formal-literature-review">What Is a Formal Literature Review?</h2>

<p>A literature review is not a Google search. It’s not skimming abstracts and citing whatever supports your argument. A formal literature review is a <strong>systematic, reproducible process</strong> for surveying existing research in a problem space.</p>

<p>The key properties:</p>

<ul>
  <li><strong>Defined scope.</strong> You state exactly what you’re looking for and what you’re not.</li>
  <li><strong>Reproducible search.</strong> Someone else could follow your search strategy and find the same papers.</li>
  <li><strong>Inclusion/exclusion criteria.</strong> You decide up front what counts, before you see results.</li>
  <li><strong>Synthesis, not summary.</strong> You identify themes, contradictions, and gaps — not just restate each paper.</li>
</ul>

<p>There are different levels of rigor. A full <strong>systematic literature review</strong> (SLR) follows protocols like PRISMA, pre-registers the search strategy, and may involve multiple reviewers for bias reduction. That’s what you’d do for a journal publication.</p>

<p>What I’m doing is closer to a <strong>scoping review</strong> — a structured but less rigid survey meant to map the landscape of a research area and identify key themes and gaps. It’s the right tool when you’re asking “what has been done?” rather than “what is the effect size of X?”</p>

<h2 id="defining-the-scope">Defining the Scope</h2>

<p>The first step is to define the research questions. Not the questions I want to <em>answer</em> — the questions that guide what I <em>search for</em>.</p>

<p><strong>Primary question:</strong> What approaches have been explored for using large language models to generate, assist with, or verify formal proofs in automated theorem proving systems?</p>

<p><strong>Secondary questions:</strong></p>
<ul>
  <li>What formal languages and proof assistants are being targeted (Lean, Coq, Isabelle, TLA+, others)?</li>
  <li>What LLM architectures and training strategies have shown promise?</li>
  <li>How are training datasets constructed for low-resource formal languages?</li>
  <li>What evaluation metrics and benchmarks are used?</li>
  <li>Where are the open problems and failure modes?</li>
</ul>

<p>Notice how much broader this is than “can I fine-tune Llama to write TLA+?” That’s deliberate. Symbolic targets TLA+ specifically, but the techniques for training LLMs on Lean proofs or Coq tactics may transfer. The dataset construction challenges for Isabelle are almost certainly relevant to mine. By widening the aperture, I avoid tunnel vision.</p>

<h2 id="the-search-strategy">The Search Strategy</h2>

<h3 id="choosing-databases">Choosing Databases</h3>

<p>Academic search is fragmented. No single database covers everything. Here’s what I’m using and why:</p>

<p><strong>Semantic Scholar</strong> (<a href="https://www.semanticscholar.org">semanticscholar.org</a>)
My primary search engine. Semantic Scholar indexes over 200 million papers, provides excellent API access, and has features specifically designed for literature reviews — citation graphs, TLDR summaries, and influence scores. Its AI-powered relevance ranking tends to surface highly cited foundational papers alongside recent work, which is exactly what I need.</p>

<p><strong>arXiv</strong> (<a href="https://arxiv.org">arxiv.org</a>)
The preprint server where most ML and formal methods research lands first. Papers here are often months ahead of journal publication. I’ll search arXiv directly for the most recent work that Semantic Scholar may not have indexed yet.</p>

<p><strong>Google Scholar</strong> (<a href="https://scholar.google.com">scholar.google.com</a>)
Broader coverage than Semantic Scholar, especially for older work and conference proceedings. I use it as a secondary source and for “cited by” chains — finding newer papers that cite a foundational one.</p>

<p><strong>ACM Digital Library and IEEE Xplore</strong>
For conference papers from venues like ICML, NeurIPS, ICLR, CAV, POPL, and ITP that may not be freely available on arXiv.</p>

<h3 id="constructing-search-queries">Constructing Search Queries</h3>

<p>The query design matters enormously. Too narrow and you miss relevant work. Too broad and you drown in noise. I’m using a structured approach with Boolean operators:</p>

<p><strong>Core query:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>("large language model" OR "LLM" OR "transformer" OR "neural theorem proving")
AND
("theorem proving" OR "formal verification" OR "proof assistant" OR "proof generation")
</code></pre></div></div>

<p><strong>Variant queries for specific aspects:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Dataset construction
("training data" OR "dataset" OR "benchmark")
AND ("formal proof" OR "theorem proving" OR "proof assistant")

# Specific proof assistants
("Lean" OR "Coq" OR "Isabelle" OR "TLA+" OR "HOL")
AND ("language model" OR "machine learning" OR "neural")

# Evaluation and metrics
("evaluation" OR "benchmark" OR "accuracy")
AND ("automated theorem proving" OR "proof generation")
AND ("language model" OR "neural" OR "transformer")
</code></pre></div></div>

<p>I’ll run each query across each database, tracking what I searched, when, and how many results I got. This is the “reproducible” part — if someone wanted to verify my survey, they could rerun these queries.</p>
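<p>To keep that tracking reproducible, the queries can be scripted rather than typed into a search box. A minimal sketch, assuming the Semantic Scholar Graph API’s paper-search endpoint and its <code class="language-plaintext highlighter-rouge">query</code>/<code class="language-plaintext highlighter-rouge">fields</code>/<code class="language-plaintext highlighter-rouge">limit</code> parameters; the function names and log-file layout are mine:</p>

```python
import json
import time
from urllib.parse import urlencode

# Assumption: the Semantic Scholar Graph API paper-search endpoint.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, fields=("title", "year", "abstract"), limit=100):
    """Build the exact search URL so a query can be logged and rerun later."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return f"{SEARCH_URL}?{urlencode(params)}"

def log_search(query, result_count, log_path="search_log.jsonl"):
    """Append one line per query run: what was searched, when, how many hits."""
    entry = {"query": query, "results": result_count,
             "run_at": time.strftime("%Y-%m-%dT%H:%M:%S")}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

<p>Fetching is then a single <code class="language-plaintext highlighter-rouge">urllib.request.urlopen(build_search_url(...))</code> call, and the JSONL log doubles as the search-strategy appendix of the review.</p>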

<h3 id="snowball-sampling">Snowball Sampling</h3>

<p>Queries only get you so far. Some of the most relevant papers will be found through <strong>snowball sampling</strong>:</p>

<ul>
  <li><strong>Backward snowballing:</strong> For each key paper I find, I check its references. If a paper on LLM-based Lean proving cites a foundational paper on neural theorem proving, I add that to my review.</li>
  <li><strong>Forward snowballing:</strong> For foundational papers, I check who has cited them since. Semantic Scholar’s “cited by” feature and Google Scholar’s citation tracking are essential here.</li>
</ul>

<p>This is how you find the papers that don’t match your keywords but are deeply relevant. A 2020 paper on “neural guided proof search” might not contain the phrase “large language model” but could be foundational to the entire field.</p>
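<p>Both snowball directions amount to a breadth-first traversal of the citation graph. A sketch with the fetchers injected as callables so it works against any citation source — in practice they would wrap something like Semantic Scholar’s references and citations endpoints (an assumption; <code class="language-plaintext highlighter-rouge">snowball</code> is my name for it):</p>

```python
from collections import deque

def snowball(seed_ids, get_references, get_citations, max_depth=1):
    """Breadth-first snowball: from each seed paper, follow references
    (backward) and citations (forward) up to max_depth hops out.

    get_references / get_citations map a paper ID to a list of paper IDs.
    Returns the set of all paper IDs encountered, seeds included.
    """
    seen = set(seed_ids)
    frontier = deque((pid, 0) for pid in seed_ids)
    while frontier:
        pid, depth = frontier.popleft()
        if depth == max_depth:
            continue  # don't expand past the hop limit
        for neighbor in get_references(pid) + get_citations(pid):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

<p>Each newly discovered paper still goes through the Pass 1 filter — the traversal only decides what gets surveyed, not what gets included.</p>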

<h2 id="organizing-with-zotero">Organizing with Zotero</h2>

<p>Raw search results are useless without organization. I’m using <strong>Zotero</strong> (<a href="https://www.zotero.org">zotero.org</a>) — a free, open-source reference manager — as my central hub.</p>

<h3 id="why-zotero">Why Zotero?</h3>

<ul>
  <li><strong>Free and open source.</strong> No subscription paywalls.</li>
  <li><strong>Browser extension.</strong> One click to save a paper from Semantic Scholar, arXiv, or any journal site. It automatically extracts title, authors, date, abstract, and DOI.</li>
  <li><strong>PDF management.</strong> Zotero stores and indexes PDFs. I can annotate directly in the reader and those annotations become searchable.</li>
  <li><strong>Tagging and collections.</strong> I create nested collections that mirror my research questions and tag papers by theme.</li>
  <li><strong>Citation export.</strong> When I write the synthesis post, Zotero generates citations in any format.</li>
  <li><strong>Zotero Connector + Better BibTeX plugin.</strong> If I later want to write in LaTeX, the integration is seamless.</li>
</ul>

<h3 id="my-collection-structure">My Collection Structure</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LLM-ATP Literature Review/
├── 01 - Foundational Papers/
│   ├── Neural Theorem Proving (pre-LLM)
│   └── Transformer Architecture for Math
├── 02 - LLM Proof Generation/
│   ├── Lean
│   ├── Coq
│   ├── Isabelle
│   ├── TLA+
│   └── Other/Multi-system
├── 03 - Dataset Construction/
│   ├── Synthetic Generation
│   ├── Corpus Extraction
│   └── Benchmarks
├── 04 - Training Strategies/
│   ├── Fine-tuning
│   ├── Prompt Engineering
│   ├── Reinforcement Learning
│   └── Retrieval-Augmented
├── 05 - Evaluation &amp; Metrics/
└── 06 - Surveys &amp; Meta-analyses/
</code></pre></div></div>

<p>Each paper gets tagged with relevant themes. A single paper might live in “02 - LLM Proof Generation / Lean” but also be tagged <code class="language-plaintext highlighter-rouge">dataset-construction</code> and <code class="language-plaintext highlighter-rouge">reinforcement-learning</code> if it covers those aspects.</p>

<h3 id="annotation-strategy">Annotation Strategy</h3>

<p>When I read each paper, I annotate with a consistent structure:</p>

<ul>
  <li><strong>Yellow highlight:</strong> Key claims and findings</li>
  <li><strong>Blue highlight:</strong> Methodology details I might adopt</li>
  <li><strong>Red highlight:</strong> Limitations, failure modes, open problems</li>
  <li><strong>Green highlight:</strong> Dataset details (size, source, construction method)</li>
  <li><strong>Notes:</strong> My own thoughts on relevance to Symbolic</li>
</ul>

<p>This isn’t busywork — it’s what makes synthesis possible later. When I sit down to write about dataset construction approaches across the field, I can filter by green highlights and get every relevant data point without re-reading 40 papers.</p>

<h2 id="inclusion-and-exclusion-criteria">Inclusion and Exclusion Criteria</h2>

<p>Before I start reading, I define what’s in scope and what’s not. This prevents the review from expanding infinitely.</p>

<p><strong>Inclusion criteria:</strong></p>
<ul>
  <li>Published 2019 or later (transformer era — pre-transformer neural theorem proving is foundational context only)</li>
  <li>Addresses the use of neural language models for theorem proving, proof generation, or formal verification</li>
  <li>Targets at least one established proof assistant or formal system</li>
  <li>Available in English</li>
  <li>Peer-reviewed publication, accepted preprint, or technical report from a recognized research group</li>
</ul>

<p><strong>Exclusion criteria:</strong></p>
<ul>
  <li>Pure code generation without formal verification (e.g., Copilot-style code completion)</li>
  <li>Natural language reasoning or informal mathematical problem solving (e.g., GSM8K, MATH benchmark)</li>
  <li>Papers focused solely on symbolic AI or traditional ATP without neural components</li>
  <li>Blog posts, tutorials, or documentation (useful for context but not for the review itself)</li>
</ul>

<p>The boundary between “code generation” and “proof generation” is blurry. A paper about using LLMs to generate Dafny code with verification conditions is relevant. A paper about generating Python with unit tests is not. I’ll make judgment calls at the margin and document them.</p>

<h2 id="the-reading-process">The Reading Process</h2>

<p>I don’t read every paper cover to cover. That’s not feasible, and it’s not necessary. I use a three-pass approach adapted from S. Keshav’s “<a href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">How to Read a Paper</a>”:</p>

<p><strong>Pass 1: Survey (5 minutes per paper)</strong>
Read the title, abstract, introduction, section headings, and conclusion. Decide: is this relevant enough for Pass 2? This is where the inclusion/exclusion criteria do their work.</p>

<p><strong>Pass 2: Comprehension (30 minutes per paper)</strong>
Read the full paper, but don’t get stuck on dense proofs or implementation details. Understand the approach, the key results, and the limitations. Annotate in Zotero. Add tags.</p>

<p><strong>Pass 3: Deep read (1-2 hours per paper)</strong>
Only for the most important papers — the ones I’ll discuss in detail in the synthesis. Understand the methodology well enough to evaluate it critically. Could I reproduce this? Where does it break? How does it relate to Symbolic?</p>

<p>I expect roughly:</p>
<ul>
  <li>100-150 papers from initial search results</li>
  <li>40-60 papers after Pass 1 filtering</li>
  <li>15-25 papers given a deep read in Pass 3</li>
</ul>

<h2 id="tracking-the-process">Tracking the Process</h2>

<p>I’m maintaining a simple spreadsheet alongside Zotero to track the review process itself:</p>

<p>For each paper, I track:</p>

<ul>
  <li><strong>Title and source</strong> — where I found it (Semantic Scholar, arXiv, etc.)</li>
  <li><strong>Pass 1 date</strong> — when I surveyed it</li>
  <li><strong>Include?</strong> — Yes / No / Maybe</li>
  <li><strong>Pass 2 and Pass 3 dates</strong> — if applicable</li>
  <li><strong>Key theme</strong> — mapped to my collection categories</li>
  <li><strong>Relevance to Symbolic</strong> — High / Medium / Low</li>
</ul>

<p>This serves two purposes. First, it keeps me honest — I can see if I’m spending too long in the weeds or skipping important categories. Second, it makes the review auditable. If someone questions whether I considered a particular line of research, I can point to the spreadsheet.</p>
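<p>The tracker doesn’t need to be anything fancier than an append-only CSV. A sketch with columns mirroring the fields above; the filename and helper name are illustrative, not part of any existing tooling:</p>

```python
import csv
from pathlib import Path

# Columns mirror the tracking fields: where found, pass dates, verdicts.
COLUMNS = ["title", "source", "pass1_date", "include", "pass2_date",
           "pass3_date", "key_theme", "relevance"]

def record_paper(row, path="review_tracker.csv"):
    """Append one paper's tracking row, writing the header on first use."""
    new_file = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

<p>An append-only file is deliberate: like a lab notebook, it preserves the order in which papers were surveyed, which is part of what makes the review auditable.</p>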

<h2 id="what-i-expect-to-find">What I Expect to Find</h2>

<p>I’m going in with hypotheses, not conclusions. But based on what I’ve already encountered tangentially, I expect the landscape to include:</p>

<p><strong>Well-explored territory:</strong></p>
<ul>
  <li>LLM-based tactic prediction for Lean (LeanDojo, ReProver, and related work)</li>
  <li>GPT-4 and similar frontier models on mathematical reasoning benchmarks</li>
  <li>Autoformalization — translating informal math to formal statements</li>
</ul>

<p><strong>Less-explored territory:</strong></p>
<ul>
  <li>Fine-tuning smaller, open-source models for specific proof assistants</li>
  <li>Dataset construction for low-resource formal languages (this is my problem)</li>
  <li>TLA+ specifically (I suspect very little exists here)</li>
  <li>Reliability and failure mode analysis of LLM-generated proofs</li>
</ul>

<p><strong>Open questions I’m watching for:</strong></p>
<ul>
  <li>How much training data do you actually need for useful proof generation?</li>
  <li>Does reinforcement learning from proof checker feedback outperform supervised fine-tuning?</li>
  <li>Can techniques that work for Lean (rich type theory, large mathlib corpus) transfer to TLA+ (temporal logic, sparse data)?</li>
</ul>

<h2 id="why-this-matters-for-symbolic">Why This Matters for Symbolic</h2>

<p>I could skip all of this and go back to grinding on dataset construction. But that’s how you end up building something that already exists, or worse, building something that the research community has already shown doesn’t work.</p>

<p>The literature review will inform Symbolic in specific ways:</p>

<ol>
  <li><strong>Dataset strategy.</strong> If the field has converged on synthetic generation over web scraping, I should know that before spending another month scraping GitHub.</li>
  <li><strong>Model selection.</strong> If fine-tuning 8B parameter models is a dead end for this task and the research points to other approaches, I need to know.</li>
  <li><strong>Evaluation framework.</strong> I’m currently measuring syntax validity and semantic validity. The field may have better metrics.</li>
  <li><strong>Positioning.</strong> If nobody has done this for TLA+ specifically, that’s a contribution. If someone has, I need to know what they found.</li>
</ol>

<h2 id="what-comes-next">What Comes Next</h2>

<p>The next post will be the synthesis — what I actually found. I’ll organize it by theme rather than by paper, identify the key technical approaches, map the gaps, and explain how it reshapes my plan for Symbolic.</p>

<p>For now, the work is the unglamorous part: running queries, reading abstracts, filling out spreadsheets, and annotating PDFs. It’s not as exciting as writing code, but it’s how you build something that matters instead of something that already failed somewhere else.</p>

<p>The formal literature review starts now.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[After hitting a wall with dataset construction, I'm doing what I should have done first — a formal literature review of the intersection of large language models and automated theorem proving.]]></summary></entry><entry><title type="html">Running the Real TLA+ Toolchain: What Survives SANY and TLC</title><link href="https://realtimdunbar.github.io/running-the-real-tla-toolchain/" rel="alternate" type="text/html" title="Running the Real TLA+ Toolchain: What Survives SANY and TLC" /><published>2026-02-17T06:00:00+00:00</published><updated>2026-02-17T06:00:00+00:00</updated><id>https://realtimdunbar.github.io/running-the-real-tla-toolchain</id><content type="html" xml:base="https://realtimdunbar.github.io/running-the-real-tla-toolchain/"><![CDATA[<p>In my <a href="/validating-tlaplus-dataset/">last post</a>, I collected 449 TLA+ files from GitHub and validated them down to 79 using basic structural checks — balanced brackets, module headers and footers. I reported that 52 of those 79 passed “TLC validation.”</p>

<p>Tonight I ran the <em>actual</em> TLA+ toolchain — the SANY parser and TLC model checker — on all 79 files. The results were humbling.</p>

<blockquote>
  <p><strong>New to this series?</strong> Start with <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</a> for the full context.</p>
</blockquote>

<h2 id="the-gap-between-structural-and-semantic-validation">The Gap Between Structural and Semantic Validation</h2>

<p>My previous validation checked for things like matching <code class="language-plaintext highlighter-rouge">---- MODULE ----</code> headers and balanced parentheses. That’s like checking that a Python file has proper indentation — necessary but nowhere near sufficient.</p>

<p>SANY (the official TLA+ parser) does full semantic analysis: operator resolution, type consistency, module dependency resolution. TLC goes further and actually model-checks the specification against its properties.</p>

<p>The difference matters.</p>

<h2 id="pre-analysis-the-dependency-problem">Pre-Analysis: The Dependency Problem</h2>

<p>Before running SANY, I analyzed what each file actually needs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dependency Analysis (79 files):
  Standard-only:  12  — only uses Naturals, Integers, Sequences, etc.
  Custom deps:    55  — INSTANCE/EXTENDS modules we don't have
  No deps:        12  — fully self-contained
</code></pre></div></div>

<p><strong>55 out of 79 files depend on custom modules from their source repository that we never scraped.</strong> These are mostly <code class="language-plaintext highlighter-rouge">MC*.tla</code> files — model-checking configurations that reference a main specification. For example, <code class="language-plaintext highlighter-rouge">MC_n4_f1.tla</code> from the CometBFT repo needs <code class="language-plaintext highlighter-rouge">TendermintAccDebug_004_draft.tla</code>, which we don’t have.</p>

<p>This was predictable in hindsight. GitHub’s code search returned individual files, not complete projects. We grabbed the configuration files without their dependencies.</p>
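<p>The classification itself is a small amount of parsing. A sketch of the kind of check the pre-analysis phase performs, assuming a (partial) list of the standard modules that ship with the TLA+ tools; <code class="language-plaintext highlighter-rouge">classify_dependencies</code> is an illustrative name, not the script’s actual function:</p>

```python
import re

# Partial list of the standard modules bundled with the TLA+ tools.
STANDARD_MODULES = {"Naturals", "Integers", "Reals", "Sequences",
                    "FiniteSets", "Bags", "TLC", "Randomization"}

def classify_dependencies(spec_text):
    """Bucket a spec as 'no_deps', 'standard_only', or 'custom_deps'
    based on its EXTENDS and INSTANCE declarations."""
    deps = set()
    for m in re.finditer(r"^\s*EXTENDS\s+(.+)$", spec_text, re.MULTILINE):
        deps.update(name.strip() for name in m.group(1).split(","))
    deps.update(re.findall(r"\bINSTANCE\s+(\w+)", spec_text))
    if not deps:
        return "no_deps"
    if deps <= STANDARD_MODULES:
        return "standard_only"
    return "custom_deps"
```

<p>Anything in the <code class="language-plaintext highlighter-rouge">custom_deps</code> bucket is guaranteed to fail SANY unless the referenced modules are in the same directory — which is exactly what the 55 orphaned files ran into.</p>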

<h2 id="results-sany-validation">Results: SANY Validation</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SANY Results (79 files):
  Passed:  18  (22.8%)
  Failed:  61  (77.2%)
</code></pre></div></div>

<p>I expected losses from the dependency problem. I did not expect nearly 80% of the dataset to fall away in a single step.</p>

<h3 id="failure-categories">Failure Categories</h3>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Count</th>
      <th>What it means</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>missing_module</strong></td>
      <td>57</td>
      <td>Can’t find a module the file depends on</td>
    </tr>
    <tr>
      <td><strong>semantic_error</strong></td>
      <td>2</td>
      <td>Valid syntax but semantic issues (duplicate definitions)</td>
    </tr>
    <tr>
      <td><strong>sany_error</strong></td>
      <td>2</td>
      <td>Internal SANY errors (malformed recursive declarations)</td>
    </tr>
    <tr>
      <td><strong>tlc_error</strong></td>
      <td>1</td>
      <td>Passed SANY, failed TLC</td>
    </tr>
    <tr>
      <td><strong>success</strong></td>
      <td>17</td>
      <td>Passed SANY (no Spec operator, so TLC can’t run)</td>
    </tr>
  </tbody>
</table>

<p>The dominant failure mode is clear: <strong>72% of files fail because they reference modules we don’t have.</strong> Not because the TLA+ is wrong — because we only scraped half the project. The remaining failures — semantic errors, malformed declarations — are genuine bugs in the specs, but they’re rounding errors next to the dependency problem.</p>

<h2 id="a-filename-gotcha">A Filename Gotcha</h2>

<p>One lesson from tonight: SANY requires the <code class="language-plaintext highlighter-rouge">.tla</code> filename to exactly match the <code class="language-plaintext highlighter-rouge">MODULE</code> declaration inside the file. Our GitHub scraper renamed files with repo prefixes — <code class="language-plaintext highlighter-rouge">Aqua-218_NyxNet_Gateway.tla</code> contains <code class="language-plaintext highlighter-rouge">MODULE Gateway</code>.</p>

<p>Every single file initially failed SANY with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>File name 'Aqua-218_NyxNet_Gateway' does not match the name 'Gateway'
of the top level module it contains.
</code></pre></div></div>

<p>The fix: copy each file to a temp directory with the correct name before validation. A small thing, but it would have been easy to misinterpret 0/79 passing as “all our specs are broken” rather than “our filenames are wrong.”</p>
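<p>The workaround is mechanical: read the module name out of the header and stage a correctly named copy. A sketch of the idea (the function name is mine):</p>

```python
import re
import shutil
import tempfile
from pathlib import Path

def stage_for_sany(tla_path):
    """Copy a .tla file into a temp directory under the name SANY expects:
    the name declared in its ---- MODULE Name ---- header."""
    text = Path(tla_path).read_text()
    m = re.search(r"-+\s*MODULE\s+(\w+)\s*-+", text)
    if m is None:
        raise ValueError(f"no MODULE header in {tla_path}")
    workdir = Path(tempfile.mkdtemp(prefix="sany_"))
    staged = workdir / f"{m.group(1)}.tla"
    shutil.copy(tla_path, staged)
    return staged
```

<p>Running SANY against the staged copy instead of the scraped file is what turned 0/79 into a meaningful measurement.</p>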

<h2 id="tlc-almost-nothing-to-check">TLC: Almost Nothing to Check</h2>

<p>Only <strong>1 file</strong> in the entire dataset defines a <code class="language-plaintext highlighter-rouge">Spec</code> operator — the entry point TLC needs for model checking. That file (<code class="language-plaintext highlighter-rouge">MySpec.tla</code>) passed SANY but failed TLC.</p>

<p>The other 17 SANY-passing files are specifications without a runnable <code class="language-plaintext highlighter-rouge">Spec</code>. They define operators, theorems, and constants, but nothing TLC can execute. SANY-pass is the best validation we can achieve for them.</p>

<h2 id="what-this-means-for-the-training-dataset">What This Means for the Training Dataset</h2>

<p>Let me be direct: this is a setback.</p>

<p>The previous post reported “52/79 passed TLC (65.8%).” That number came from the basic structural validator — my own regex-based checks — not from running actual SANY and TLC. I was measuring the wrong thing. The real numbers tell a different story:</p>

<table>
  <thead>
    <tr>
      <th>Validation Level</th>
      <th>Files</th>
      <th>Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GitHub scrape</td>
      <td>449</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Structural syntax</td>
      <td>79</td>
      <td>17.6%</td>
    </tr>
    <tr>
      <td>SANY (semantic)</td>
      <td>18</td>
      <td>4.0%</td>
    </tr>
    <tr>
      <td>TLC (model checking)</td>
      <td>0</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p><strong>Zero files pass TLC.</strong> Not one specification in the entire scraped dataset can be model-checked end to end.</p>

<p>And even the 18 that pass SANY aren’t what they seem. Deduplication wipes out most of them — 7 are from the same NyxNet project (related config and policy modules), 3 are copies of the same Cantor diagonal proof floating around different repos, and 3 are copies of a trivial <code class="language-plaintext highlighter-rouge">Foo1</code> test spec. The truly distinct, high-quality specifications number around <strong>8</strong>.</p>

<p>To fine-tune a model you need hundreds to thousands of training examples at minimum. 8 unique specs is not a dataset — it’s a handful. Training on this would produce a model that can recite Cantor’s diagonal proof and NyxNet gateway configs, and nothing else.</p>

<p>The scraping approach that felt like it was working two weeks ago has hit a wall.</p>

<h2 id="what-im-taking-away">What I’m Taking Away</h2>

<p>This is the kind of result that makes you question an assumption you didn’t realize you were making. I assumed that if a file exists on GitHub with a <code class="language-plaintext highlighter-rouge">.tla</code> extension, it’s probably a usable TLA+ specification. That assumption was wrong three different ways.</p>

<h3 id="1-scraping-individual-files-doesnt-work-for-tla">1. Scraping individual files doesn’t work for TLA+</h3>

<p>TLA+ specifications are multi-file projects. An <code class="language-plaintext highlighter-rouge">MC.tla</code> file without its companion modules is like a <code class="language-plaintext highlighter-rouge">test_app.py</code> without the app. GitHub’s code search returns individual files, and I treated each one as a self-contained example. 57 of 79 files punished me for that assumption. Future scraping needs to pull entire repository directories, not individual files.</p>

<h3 id="2-structural-validation-gave-me-false-confidence">2. Structural validation gave me false confidence</h3>

<p>My regex-based validator said 52 files were good. SANY said 18. That’s not a minor discrepancy — I was overestimating my dataset by nearly 3x. Balanced brackets tell you almost nothing about whether TLA+ is valid. The gap between “looks right” and “SANY accepts it” is enormous. Any future validation pipeline needs to run the real toolchain from the start, not as an afterthought.</p>

<h3 id="3-the-dataset-needs-a-fundamentally-different-approach">3. The dataset needs a fundamentally different approach</h3>

<p>8 unique specs is not a starting point for fine-tuning. It’s a dead end. The options I’m considering:</p>

<ul>
  <li><strong>Scrape complete projects</strong> — clone full repos, resolve dependencies, validate entire project trees</li>
  <li><strong>Target the tlaplus/Examples repository</strong> — curated, self-contained specs that are known to work</li>
  <li><strong>Generate synthetic specs</strong> — use an LLM to produce specs, validate with SANY/TLC, keep what passes</li>
  <li><strong>Manual curation</strong> — write specs by hand for common patterns (mutex, leader election, consensus)</li>
</ul>

<p>Each has tradeoffs. Scraping complete projects fixes the dependency problem but adds complexity. The Examples repo is high quality but may not be large enough on its own. Synthetic generation is scalable but risks teaching a model to imitate its own mistakes. Manual curation produces the best training pairs but doesn’t scale.</p>

<h2 id="technical-notes">Technical Notes</h2>

<p>The validation script (<code class="language-plaintext highlighter-rouge">symbolic/utils/tlc_validate.py</code>) runs in four phases:</p>

<ol>
  <li><strong>Pre-analysis</strong> — extracts EXTENDS, INSTANCE, Spec operators, dependency classification</li>
  <li><strong>SANY validation</strong> — runs <code class="language-plaintext highlighter-rouge">tla2sany.SANY</code> with 30s timeout, handles filename renaming</li>
  <li><strong>TLC validation</strong> — generates <code class="language-plaintext highlighter-rouge">.cfg</code>, runs <code class="language-plaintext highlighter-rouge">tlc2.TLC</code> with 60s timeout (only for Spec-having files)</li>
  <li><strong>Results</strong> — JSON, summary, pass/fail lists to <code class="language-plaintext highlighter-rouge">validation_output/tlc_validation/</code></li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python tlc_validate.py <span class="se">\</span>
  <span class="nt">--java</span> /opt/homebrew/opt/openjdk@17/bin/java <span class="se">\</span>
  <span class="nt">--tlc-jar</span> ~/tla-tools/tla2tools.jar
</code></pre></div></div>
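<p>For step 3, the generated config only needs to give TLC an entry point. A minimal sketch of what such a generator might look like — the actual script’s output may differ, but <code class="language-plaintext highlighter-rouge">SPECIFICATION</code> and <code class="language-plaintext highlighter-rouge">INVARIANT</code> are standard TLC config keywords:</p>

```python
from pathlib import Path

def write_tlc_config(spec_path, spec_op="Spec", invariants=()):
    """Write a minimal .cfg next to a spec so TLC has an entry point.
    SPECIFICATION names the behavior spec operator; each INVARIANT line
    names a state predicate for TLC to check."""
    cfg_path = Path(spec_path).with_suffix(".cfg")
    lines = [f"SPECIFICATION {spec_op}"]
    lines += [f"INVARIANT {inv}" for inv in invariants]
    cfg_path.write_text("\n".join(lines) + "\n")
    return cfg_path
```

<p>This is also why only one file ever reached TLC: without a <code class="language-plaintext highlighter-rouge">Spec</code> operator to name here, there is nothing to generate a config for.</p>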

<p>Full results are in the <a href="https://github.com/realtimdunbar/symbolic">Symbolic repository</a>.</p>

<h2 id="whats-next">What’s Next</h2>

<p>The honest answer is I need to step back and rebuild the data pipeline before anything else moves forward. The immediate plan:</p>

<ol>
  <li><strong>Clone the tlaplus/Examples repo</strong> and run the full SANY/TLC validation pipeline on it — this should give me a baseline of known-good specs to work with</li>
  <li><strong>Build a repo-level scraper</strong> that clones entire TLA+ projects from GitHub instead of pulling individual files, so dependencies stay intact</li>
  <li><strong>Re-evaluate the training strategy</strong> — depending on how many validated specs I can collect, fine-tuning may not be the right first step. Prompt engineering with a strong base model might get further, faster, while I build up the dataset in parallel</li>
</ol>
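<p>The repo-level scraper (item 2) is mostly plumbing. A sketch, assuming shallow <code class="language-plaintext highlighter-rouge">git clone</code> is acceptable; the function name and directory layout are illustrative:</p>

```python
import subprocess
from pathlib import Path

def clone_and_collect(repo_urls, dest="tla_repos"):
    """Shallow-clone whole repositories so module dependencies stay together,
    then collect every .tla file with its sibling modules intact."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    specs = []
    for url in repo_urls:
        name = url.rstrip("/").split("/")[-1].removesuffix(".git")
        target = dest / name
        if not target.exists():  # skip repos we already have locally
            subprocess.run(["git", "clone", "--depth", "1", url, str(target)],
                           check=True)
        specs.extend(sorted(target.rglob("*.tla")))
    return specs
```

<p>The key difference from the old scraper is that validation can now run per project tree: a spec is checked in the same directory as the modules it EXTENDS, so the dominant failure mode from tonight simply can’t occur.</p>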

<p>Two weeks ago I thought I had 52 validated specs and a clear path to fine-tuning. Tonight I have 8 and a list of hard questions. That’s progress — just not the kind that feels good.</p>

<hr />

<p><em>This is part of my ongoing work on <a href="https://github.com/realtimdunbar/symbolic">Symbolic</a>, an LLM-based system for generating TLA+ specifications from natural language. Previous posts: <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">Introducing Symbolic</a> | <a href="/validating-tlaplus-dataset/">Building the Dataset</a></em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://github.com/tlaplus/tlaplus">TLA+ SANY Parser</a> — the official syntax/semantic analyzer</li>
  <li><a href="https://github.com/tlaplus/Examples">TLA+ Examples Repository</a> — curated, complete specifications</li>
  <li><a href="https://github.com/realtimdunbar/symbolic">Symbolic Project</a></li>
</ul>]]></content><author><name></name></author><category term="formal-methods" /><category term="tla-plus" /><category term="machine-learning" /><category term="tla+" /><category term="sany" /><category term="tlc" /><category term="validation" /><category term="dataset" /><category term="symbolic" /><summary type="html"><![CDATA[In my last post, I collected 449 TLA+ files from GitHub and validated them down to 79 using basic structural checks — balanced brackets, module headers and footers. I reported that 52 of those 79 passed “TLC validation.”]]></summary></entry><entry><title type="html">Building a TLA+ Training Dataset: From GitHub to Model-Ready Specs</title><link href="https://realtimdunbar.github.io/validating-tlaplus-dataset/" rel="alternate" type="text/html" title="Building a TLA+ Training Dataset: From GitHub to Model-Ready Specs" /><published>2026-02-10T06:00:00+00:00</published><updated>2026-02-10T06:00:00+00:00</updated><id>https://realtimdunbar.github.io/validating-tlaplus-dataset</id><content type="html" xml:base="https://realtimdunbar.github.io/validating-tlaplus-dataset/"><![CDATA[<p>Tonight I made significant progress on <a href="https://github.com/realtimdunbar/symbolic">Symbolic</a>, my project to train LLMs to generate TLA+ formal specifications from natural language descriptions. The key milestone: <strong>collecting and validating a dataset of real-world TLA+ specifications from GitHub</strong>.</p>

<blockquote>
  <p><strong>New to this project?</strong> If you want to start from the beginning of the Symbolic project, read <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</a> first. Otherwise, continue reading to learn about dataset collection and validation.</p>
</blockquote>

<h2 id="the-challenge">The Challenge</h2>

<p>To fine-tune a model that can generate valid TLA+ specifications, I need training data. Lots of it. And not just any TLA+ code—I need specifications that are:</p>

<ol>
  <li><strong>Syntactically correct</strong> (proper module structure, balanced operators)</li>
  <li><strong>Semantically valid</strong> (pass the TLC model checker)</li>
  <li><strong>Diverse</strong> (covering different domains and patterns)</li>
</ol>

<p>The question: where do I get this data?</p>

<h2 id="phase-1-scraping-github-for-tla-files">Phase 1: Scraping GitHub for TLA+ Files</h2>

<p>I started by building a simple GitHub scraper to collect <code class="language-plaintext highlighter-rouge">.tla</code> files from public repositories. Using GitHub’s Code Search API, I searched for <code class="language-plaintext highlighter-rouge">extension:tla</code> and downloaded the raw file contents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># github-scraper.py (simplified)
</span><span class="n">SEARCH_QUERY</span> <span class="o">=</span> <span class="s">"extension:tla"</span>
<span class="n">SAVE_DIR</span> <span class="o">=</span> <span class="s">"tla_dataset"</span>

<span class="k">def</span> <span class="nf">search_tla_files</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/search/code?q=</span><span class="si">{</span><span class="n">SEARCH_QUERY</span><span class="si">}</span><span class="s">&amp;page=</span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s">&amp;per_page=100"</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">().</span><span class="n">get</span><span class="p">(</span><span class="s">'items'</span><span class="p">,</span> <span class="p">[])</span>

<span class="k">def</span> <span class="nf">download_file</span><span class="p">(</span><span class="n">item</span><span class="p">):</span>
    <span class="n">file_url</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="s">'url'</span><span class="p">]</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">file_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">res</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
        <span class="n">content_json</span> <span class="o">=</span> <span class="n">res</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">content_json</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'encoding'</span><span class="p">)</span> <span class="o">==</span> <span class="s">'base64'</span><span class="p">:</span>
            <span class="n">file_content</span> <span class="o">=</span> <span class="n">base64</span><span class="p">.</span><span class="n">b64decode</span><span class="p">(</span><span class="n">content_json</span><span class="p">[</span><span class="s">'content'</span><span class="p">])</span>
            <span class="c1"># Save to disk (simplified: flat layout keyed by file name)
</span>            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">SAVE_DIR</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SAVE_DIR</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s">'name'</span><span class="p">]),</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">file_content</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>
<ul>
  <li><strong>449 TLA+ files</strong> collected from GitHub</li>
  <li>Sourced from <strong>60+ open-source repositories</strong></li>
  <li>Including specs from CometBFT, Paxos implementations, PBFT, and various distributed systems</li>
</ul>

<p>This gave me a solid starting point, but the real work was just beginning.</p>

<h2 id="phase-2-validationseparating-the-wheat-from-the-chaff">Phase 2: Validation—Separating the Wheat from the Chaff</h2>

<p>Having 449 files is great, but are they actually valid? I built a validation pipeline with two levels:</p>

<h3 id="level-1-syntax-validation-basic-structure">Level 1: Syntax Validation (Basic Structure)</h3>

<p>First, I implemented a basic syntax validator that checks for:</p>
<ul>
  <li>Proper module headers (<code class="language-plaintext highlighter-rouge">---- MODULE Name ----</code>)</li>
  <li>Proper module footers (<code class="language-plaintext highlighter-rouge">====</code>)</li>
  <li>Balanced brackets and parentheses</li>
  <li>Balanced logical operators (<code class="language-plaintext highlighter-rouge">/\</code> and <code class="language-plaintext highlighter-rouge">\/</code>)</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># validate_dataset.py
</span><span class="k">class</span> <span class="nc">DatasetValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate_file</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">file_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ValidationResult</span><span class="p">:</span>
        <span class="n">content</span> <span class="o">=</span> <span class="n">file_path</span><span class="p">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span>

        <span class="c1"># Syntax validation
</span>        <span class="n">syntax_valid</span><span class="p">,</span> <span class="n">syntax_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">syntax_validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>

        <span class="c1"># TLC validation (if available)
</span>        <span class="k">if</span> <span class="n">syntax_valid</span> <span class="ow">and</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">skip_tlc</span><span class="p">:</span>
            <span class="n">tlc_valid</span><span class="p">,</span> <span class="n">tlc_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tlc_validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">ValidationResult</span><span class="p">(...)</span>
</code></pre></div></div>
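<p>For concreteness, the structural checks themselves fit in a few lines. This simplified sketch covers the header, footer, and bracket checks; the <code class="language-plaintext highlighter-rouge">/\</code> and <code class="language-plaintext highlighter-rouge">\/</code> balance check is omitted:</p>

```python
# Simplified version of the Level 1 structural checks (header, footer,
# balanced delimiters); the operator-balance check is omitted here.
import re

HEADER_RE = re.compile(r"^-{4,}\s*MODULE\s+\w+\s*-{4,}", re.MULTILINE)
FOOTER_RE = re.compile(r"^={4,}\s*$", re.MULTILINE)

def check_structure(content):
    errors = []
    if not HEADER_RE.search(content):
        errors.append("Missing module header")
    if not FOOTER_RE.search(content):
        errors.append("Missing module footer")
    for open_ch, close_ch, name in [("(", ")", "parentheses"),
                                    ("[", "]", "square brackets")]:
        if content.count(open_ch) != content.count(close_ch):
            errors.append(f"Unmatched {name}")
    return len(errors) == 0, errors
```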

<p>I ran this on all 449 files. The results were… sobering:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total files:        449
✅ Valid files:     79  (17.6%)
❌ Invalid files:   370 (82.4%)
</code></pre></div></div>

<p><strong>Only 17.6% passed basic validation!</strong></p>

<h3 id="what-went-wrong">What Went Wrong?</h3>

<p>Analyzing the 370 failed files revealed common patterns:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5,160 errors: Unmatched parentheses
4,185 errors: Unmatched square brackets
  342 errors: Unbalanced conjunction/disjunction operators
   14 errors: Missing module header
    4 errors: Missing module footer
</code></pre></div></div>

<p>Many files were:</p>
<ul>
  <li><strong>Incomplete specifications</strong> (truncated during GitHub API retrieval)</li>
  <li><strong>Helper modules</strong> with complex imports and advanced features</li>
  <li><strong>Configuration files</strong> (<code class="language-plaintext highlighter-rouge">.cfg</code>) mistakenly grabbed as <code class="language-plaintext highlighter-rouge">.tla</code></li>
  <li>Files with <strong>encoding issues</strong> or special characters</li>
</ul>

<h3 id="level-2-tlc-model-checker-semantic-validation">Level 2: TLC Model Checker (Semantic Validation)</h3>

<p>The 79 syntax-valid files are a good start, but they might still have semantic issues:</p>
<ul>
  <li>Deadlocks</li>
  <li>Invariant violations</li>
  <li>Unreachable states</li>
  <li>Liveness property failures</li>
</ul>

<p>I built a TLC validator that:</p>
<ol>
  <li>Creates temporary config files</li>
  <li>Runs TLC with a timeout (60s per file)</li>
  <li>Parses TLC output for errors</li>
  <li>Extracts error traces and state information</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TLCValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">_run_tlc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
        <span class="n">config_path</span> <span class="o">=</span> <span class="n">spec_path</span><span class="p">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s">'.cfg'</span><span class="p">)</span>
        <span class="n">config_content</span> <span class="o">=</span> <span class="s">"SPECIFICATION Spec</span><span class="se">\n</span><span class="s">"</span>  <span class="c1"># assumes the module defines a Spec formula</span>
        <span class="n">config_path</span><span class="p">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">config_content</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tlc_jar_path</span><span class="p">),</span>
            <span class="s">'tlc2.TLC'</span><span class="p">,</span> <span class="s">'-workers'</span><span class="p">,</span> <span class="s">'4'</span><span class="p">,</span>
            <span class="nb">str</span><span class="p">(</span><span class="n">spec_path</span><span class="p">)</span>
        <span class="p">]</span>

        <span class="k">return</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Note:</strong> I haven’t run full TLC validation yet (requires Java + TLA+ tools setup), but the infrastructure is ready.</p>
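<p>The output-parsing step (item 3) can be written and tested before the Java setup exists. The message strings below reflect typical TLC output and should be treated as assumptions until I can verify them against real runs:</p>

```python
# Sketch of the TLC output parser. The matched strings are my assumptions
# about TLC's messages; exact phrasing may differ between versions.
def parse_tlc_output(stdout):
    errors = []
    for line in stdout.splitlines():
        if line.startswith("Error:") or "Deadlock reached" in line:
            errors.append(line.strip())
        elif "Invariant" in line and "is violated" in line:
            errors.append(line.strip())
    passed = not errors and "Model checking completed" in stdout
    return passed, errors
```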

<h2 id="phase-3-preparing-training-data">Phase 3: Preparing Training Data</h2>

<p>With 79 validated specifications, I created a structured training dataset. Each example includes:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0gfoundation_cometbft_MC_n4_f1"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"tla_spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"---- MODULE MC_n4_f1 ----</span><span class="se">\n</span><span class="s2">..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"module_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MC_n4_f1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"source_repo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0gfoundation/cometbft"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"extends"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"TLC"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Naturals"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"constants"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"N"</span><span class="p">,</span><span class="w"> </span><span class="s2">"MaxRound"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"variables"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"state"</span><span class="p">,</span><span class="w"> </span><span class="s2">"round"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"operator_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
    </span><span class="nl">"line_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">42</span><span class="p">,</span><span class="w">
    </span><span class="nl">"char_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">1337</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"natural_language"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">natural_language</code> field stays empty for now (JSON has no comments); descriptions come later, in the annotation step.</p>
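<p>The metadata comes from a light regex pass over each spec. A simplified sketch (a real parse through SANY would be stricter, and multi-line declarations aren’t handled here):</p>

```python
# Sketch of the metadata extraction: a regex pass over the spec text.
import re

def extract_metadata(spec):
    def names_after(keyword):
        # Matches e.g. "CONSTANTS N, MaxRound" on a single line
        m = re.search(rf"^{keyword}S?\s+(.+)$", spec, re.MULTILINE)
        return [n.strip() for n in m.group(1).split(",")] if m else []

    module = re.search(r"-{4,}\s*MODULE\s+(\w+)\s*-{4,}", spec)
    extends = re.search(r"^EXTENDS\s+(.+)$", spec, re.MULTILINE)
    return {
        "module_name": module.group(1) if module else None,
        "extends": [e.strip() for e in extends.group(1).split(",")] if extends else [],
        "constants": names_after("CONSTANT"),
        "variables": names_after("VARIABLE"),
        "line_count": len(spec.splitlines()),
        "char_count": len(spec),
    }
```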

<h3 id="dataset-statistics">Dataset Statistics</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total examples:     79
Total lines:        1,813
Total characters:   53,239
Average:            22 lines per spec

Size Distribution:
  Small (&lt;20 lines):   49 specs (62%)
  Medium (20-50):      19 specs (24%)
  Large (50+):         11 specs (14%)

Common Modules:
  TLC:       30 specs
  Naturals:   8 specs
  EWD840:     5 specs
  Sequences:  3 specs
</code></pre></div></div>

<h2 id="key-insights">Key Insights</h2>

<h3 id="1-real-world-data-is-messy">1. Real-World Data Is Messy</h3>

<p>GitHub is full of incomplete files, abandoned projects, and experimental code. Only <strong>17.6%</strong> of collected files passed basic validation. This is actually typical for web-scraped datasets.</p>

<p><strong>Lesson:</strong> Build robust validation pipelines. Don’t assume data quality.</p>

<h3 id="2-two-stage-validation-is-essential">2. Two-Stage Validation Is Essential</h3>

<ul>
  <li><strong>Syntax validation</strong> catches structural issues (fast, no external tools)</li>
  <li><strong>Semantic validation</strong> catches logical errors (slower, requires TLC)</li>
</ul>

<p>For machine learning purposes, both matter. You don’t want to train a model on specifications that look correct but have deadlocks or invariant violations.</p>

<h3 id="3-quality--quantity-initially">3. Quality &gt; Quantity (Initially)</h3>

<p>79 high-quality examples is better than 449 low-quality ones. A model trained on valid specs will learn correct patterns. A model trained on invalid specs will learn to make the same mistakes.</p>

<h3 id="4-metadata-matters">4. Metadata Matters</h3>

<p>Extracting metadata (module dependencies, variables, operators) helps with:</p>
<ul>
  <li><strong>Dataset analysis</strong> (what patterns are common?)</li>
  <li><strong>Model evaluation</strong> (can the model handle different complexity levels?)</li>
  <li><strong>Training strategies</strong> (curriculum learning from simple to complex)</li>
</ul>
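<p>That last point is cheap to prototype: a first cut at curriculum ordering is just a sort on complexity proxies already present in the metadata, shortest and simplest specs first:</p>

```python
# Curriculum ordering as a sort on complexity proxies from each example's
# metadata (line count first, operator count as tiebreaker).
def curriculum_order(examples):
    return sorted(
        examples,
        key=lambda ex: (ex["metadata"]["line_count"],
                        ex["metadata"]["operator_count"]),
    )
```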

<h2 id="whats-next">What’s Next?</h2>

<h3 id="immediate-next-steps">Immediate Next Steps</h3>

<ol>
  <li><strong>Run full TLC validation</strong> on the 79 syntax-valid files
    <ul>
      <li>Expected: 40-60 files will pass</li>
      <li>Higher quality guarantee for training</li>
    </ul>
  </li>
  <li><strong>Add natural language descriptions</strong>
    <ul>
      <li>Manual annotation (slow, high quality)</li>
      <li>LLM-generated descriptions (fast, needs review)</li>
      <li>Hybrid approach</li>
    </ul>
  </li>
  <li><strong>Start fine-tuning experiments</strong>
    <ul>
      <li>Begin with Llama-3.1-8B (manageable size)</li>
      <li>Evaluate on held-out test set</li>
      <li>Iterate on training approach</li>
    </ul>
  </li>
</ol>

<h3 id="medium-term-goals">Medium-Term Goals</h3>

<ol>
  <li><strong>Expand the dataset</strong>
    <ul>
      <li>Fix common errors in invalid files</li>
      <li>Generate synthetic variations</li>
      <li>Scrape TLA+ examples repository</li>
      <li>Mine academic papers and tutorials</li>
      <li><strong>Target:</strong> 200-500 examples</li>
    </ul>
  </li>
  <li><strong>Build evaluation metrics</strong>
    <ul>
      <li>Syntax correctness rate</li>
      <li>TLC pass rate</li>
      <li>Human evaluation of quality</li>
      <li>Semantic similarity to reference specs</li>
    </ul>
  </li>
  <li><strong>Experiment with model architectures</strong>
    <ul>
      <li>Different base models (Llama, Mistral, CodeLlama)</li>
      <li>Different context lengths</li>
      <li>Different quantization strategies</li>
    </ul>
  </li>
</ol>
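<p>The first two evaluation metrics reduce to counting. A sketch, assuming each validation result carries boolean <code class="language-plaintext highlighter-rouge">syntax_valid</code> and <code class="language-plaintext highlighter-rouge">tlc_valid</code> flags:</p>

```python
# Syntax correctness rate and TLC pass rate over a batch of validation results.
def pass_rates(results):
    total = len(results)
    syntax = sum(r["syntax_valid"] for r in results)
    tlc = sum(r["syntax_valid"] and r["tlc_valid"] for r in results)
    return {
        "syntax_rate": syntax / total if total else 0.0,
        "tlc_rate": tlc / total if total else 0.0,
    }
```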

<h2 id="technical-details">Technical Details</h2>

<p>All code is available in the <a href="https://github.com/realtimdunbar/symbolic">Symbolic repository</a>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">utils/github-scraper.py</code> - GitHub data collection</li>
  <li><code class="language-plaintext highlighter-rouge">utils/validate_dataset.py</code> - Batch validation pipeline</li>
  <li><code class="language-plaintext highlighter-rouge">utils/prepare_training_data.py</code> - Training data preparation</li>
  <li><code class="language-plaintext highlighter-rouge">src/symbolic/validation/</code> - Validation modules (syntax + TLC)</li>
</ul>

<p>The validation pipeline is designed to be:</p>
<ul>
  <li><strong>Reproducible</strong> (detailed JSON results for every file)</li>
  <li><strong>Extensible</strong> (easy to add new validation checks)</li>
  <li><strong>Efficient</strong> (parallel processing, configurable timeouts)</li>
</ul>
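<p>The “efficient” part is plain <code class="language-plaintext highlighter-rouge">concurrent.futures</code>: validation time is dominated by subprocess and network waits, so worker threads are enough. A sketch:</p>

```python
# Fan validation out across worker threads; results come back in input order.
from concurrent.futures import ThreadPoolExecutor

def validate_all(paths, validate_fn, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(validate_fn, paths))
```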

<h2 id="reflections">Reflections</h2>

<p>Building a dataset for formal methods is harder than I expected. Unlike natural language or even code, TLA+ specifications have:</p>

<ul>
  <li><strong>Rigid syntax requirements</strong> (no room for approximation)</li>
  <li><strong>Complex semantics</strong> (requires model checking to validate)</li>
  <li><strong>Domain expertise</strong> (understanding distributed systems, concurrency, etc.)</li>
</ul>

<p>But it’s also incredibly rewarding. Each valid specification represents a carefully designed model of a complex system. Training an LLM to generate these could democratize formal methods—making them accessible to developers who don’t have PhD-level expertise.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>

<p><strong>Tonight’s Progress:</strong></p>
<ul>
  <li>✅ Collected 449 TLA+ files from GitHub</li>
  <li>✅ Validated to 79 high-quality specifications</li>
  <li>✅ Prepared structured training dataset</li>
  <li>✅ Built reusable validation infrastructure</li>
</ul>

<p><strong>Validation Rate:</strong> 17.6% (79/449)</p>

<p><strong>Dataset Ready:</strong> Yes, for initial experiments</p>

<p><strong>Next Milestone:</strong> Full TLC validation + model fine-tuning</p>

<p>The foundation is laid. Now comes the fun part: teaching an LLM to think formally.</p>

<hr />

<p><em>This is part of my ongoing work on Symbolic, an LLM-based system for generating TLA+ specifications from natural language. Follow along on <a href="https://github.com/realtimdunbar/symbolic">GitHub</a> or read my other posts about formal methods and machine learning.</em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+ Home Page</a></li>
  <li><a href="https://learntla.com/">Learn TLA+</a></li>
  <li><a href="https://github.com/tlaplus/Examples">TLA+ Examples Repository</a></li>
  <li><a href="https://github.com/realtimdunbar/symbolic">Symbolic Project</a></li>
</ul>

<hr />

<p><strong>Update (2026-02-10):</strong> After running full TLC validation, 52 of the 79 files passed semantic validation (65.8% of syntax-valid files). Total pipeline pass rate: 11.6% (52/449). Quality bar is high, but that’s exactly what we want for training data.</p>]]></content><author><name></name></author><category term="machine-learning" /><category term="formal-methods" /><category term="tla-plus" /><category term="tla+" /><category term="dataset" /><category term="validation" /><category term="llm" /><category term="fine-tuning" /><summary type="html"><![CDATA[Tonight I made significant progress on Symbolic, my project to train LLMs to generate TLA+ formal specifications from natural language descriptions. The key milestone: collecting and validating a dataset of real-world TLA+ specifications from GitHub.]]></summary></entry><entry><title type="html">What I Have Been Up To</title><link href="https://realtimdunbar.github.io/What-I-Have-Been-Up-To/" rel="alternate" type="text/html" title="What I Have Been Up To" /><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/What-I-Have-Been-Up-To</id><content type="html" xml:base="https://realtimdunbar.github.io/What-I-Have-Been-Up-To/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>It’s been over eight years since my last post here in August 2017. During that time, the world changed dramatically—we lived through a global pandemic, witnessed fundamental shifts in how we work and communicate, and saw artificial intelligence move from research labs into everyday tools. On a personal level, these years brought significant transitions: completing graduate school, advancing in my career, becoming an empty-nester, and relocating to Florida.</p>

<p>This post serves as a retrospective on the professional and personal growth that occurred during this period, and more importantly, sets the stage for where I’m heading next. After years of building production data systems and completing formal training in computer science, I’m now focusing on the intersection of artificial intelligence, formal methods, and quantum computing—areas that bridge theoretical computer science with practical systems engineering.</p>

<hr />

<h2 id="background">Background</h2>

<p>The past eight years encompassed major life transitions. My youngest child moved out, marking the transition to being empty-nesters. We relocated from Virginia to Clermont, Florida, seeking a change of pace and climate. There were the usual challenges—a car accident with a drunk driver, family moving in, the various emergencies and complexities that come with homeownership. Through it all, I maintained focus on professional development and continued exploring the mathematical and computational ideas that have fascinated me since my undergraduate studies.</p>

<p>Outside of work and study, I’ve remained active in Toastmasters International since 2018, developing communication and leadership skills. I continue to find creative expression through blues music, playing both guitar and harmonica—a reminder that not everything needs to be about logic and computation.</p>

<hr />

<h2 id="education">Education</h2>

<p>In 2021, I began the Master of Science in Computer Science program at Georgia Institute of Technology, completing it in 2025 with a 3.81 GPA. This was a rigorous program that allowed me to formalize knowledge I’d gained through years of practical experience while diving deep into areas I’d only explored superficially before.</p>

<p><strong>Key Areas of Focus:</strong></p>

<ul>
  <li><strong>Artificial Intelligence</strong>: Advanced coursework in machine learning, natural language processing, and knowledge representation</li>
  <li><strong>Quantum Computing</strong>: Specialized study in quantum algorithms and their applications</li>
  <li><strong>Formal Methods</strong>: Training in formal verification, model checking, and correctness proofs</li>
</ul>

<p><strong>Research Highlights:</strong></p>

<p>My most significant academic work involved simulating molecular systems using quantum computers, specifically extending the CAFQA (Clifford Ansatz For Quantum Accuracy) framework. This research sits at the intersection of quantum chemistry, quantum computing, and computational physics.</p>

<p><strong>The Problem:</strong> Classical computers struggle to simulate quantum mechanical systems accurately due to exponential scaling—simulating n quantum particles requires computational resources that grow as 2^n. Quantum computers can simulate these systems more naturally, but current NISQ devices are limited by noise and gate errors.</p>
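<p>To make the 2^n scaling concrete (a back-of-the-envelope sketch, not research code):</p>

```python
# An n-qubit state vector holds 2^n complex amplitudes, 16 bytes each
# at double precision (two 8-byte floats per amplitude).
def statevector_bytes(n_qubits):
    return (2 ** n_qubits) * 16
```

<p>Thirty qubits already need 16 GiB of amplitudes; fifty need tens of petabytes, which is why classical simulation hits a wall so quickly.</p>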

<p><strong>My Contribution:</strong> The CAFQA approach uses Clifford gates (a restricted set of quantum gates) to build quantum circuits for molecular simulation. While Clifford circuits are easier to implement and more noise-resilient, they have limited expressiveness. My research focused on augmenting the traditional Clifford gate set with T gates to determine if this could achieve additional accuracy not realized by CAFQA alone.</p>

<p><strong>Why This Matters:</strong> T gates are non-Clifford gates that add computational power to quantum circuits, allowing them to represent more complex quantum states. However, they’re also more difficult to implement on real quantum hardware and more susceptible to noise. The research question: does the increased expressiveness of Clifford+T circuits outweigh the additional error introduced by T gates for molecular simulation tasks?</p>

<p>The work involved:</p>
<ul>
  <li>Implementing quantum circuits with hybrid Clifford+T gate sets</li>
  <li>Comparing simulation accuracy against pure Clifford approaches (baseline CAFQA)</li>
  <li>Analyzing the trade-off between circuit expressiveness and noise resilience</li>
  <li>Benchmarking on small molecular systems (H₂, LiH, BeH₂)</li>
  <li>Evaluating performance on NISQ hardware with realistic error rates</li>
</ul>
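<p>As an illustrative aside (not code from the research itself): T is the diagonal gate diag(1, e^(iπ/4)), and applying it twice gives the Clifford phase gate S = diag(1, i). That extra half-step of phase is exactly the expressiveness that pure Clifford circuits lack:</p>

```python
# T = diag(1, e^{i*pi/4}); applying T twice equals the Clifford S gate diag(1, i).
import cmath

T = [1, cmath.exp(1j * cmath.pi / 4)]  # diagonal entries of T
S = [1, 1j]                            # diagonal entries of S (a Clifford gate)
T_squared = [t * t for t in T]
assert all(abs(a - b) < 1e-12 for a, b in zip(T_squared, S))
```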

<p>This research reinforced a key insight: the most interesting problems exist at the boundaries between disciplines. Quantum chemistry isn’t just physics—it’s a computational problem that requires expertise in algorithms, hardware limitations, and careful trade-off analysis between theoretical capability and practical implementation constraints.</p>

<p><strong>Broader Training:</strong></p>

<p>Beyond the graduate program, I maintained continuous learning through various certifications and courses:</p>
<ul>
  <li>Data Science at Scale specialization (Coursera, 2017)</li>
  <li>Practical Predictive Analytics (Coursera, 2017)</li>
  <li>Build a Modern Computer from First Principles (Coursera, 2016)</li>
  <li>Multiple certifications in R, Python, and data manipulation</li>
</ul>

<hr />

<h2 id="movement">Movement</h2>

<p>In 2024, we relocated from Virginia to Clermont, Florida. The move represented both a lifestyle change and a practical decision—lower cost of living, better weather, and proximity to growing tech communities in Orlando and Tampa. Working remotely as Director of Data Engineering made the geographic transition seamless professionally, while personally it offered a fresh start after years of intense focus on graduate school and career advancement.</p>

<p>Florida’s emerging tech scene has been a pleasant surprise. While not Silicon Valley or Austin, the state has been attracting significant tech investment, particularly in aerospace (Cape Canaveral’s private space industry), defense contractors, and enterprise software companies. The cost-of-living arbitrage allows for a better quality of life while maintaining the same professional standards and compensation.</p>

<hr />

<h2 id="work">Work</h2>

<p>My professional trajectory over the past eight years has been one of increasing scope and technical depth. I currently serve as <strong>Director of Data Engineering at Trader Interactive</strong>, where I lead initiatives at the intersection of data architecture, systems design, and intelligent automation.</p>

<p><strong>Professional Evolution:</strong></p>

<p>When I last posted in 2017, I was deep in data science and analytics—building predictive models, running statistical analyses, and working primarily with structured datasets. The field has evolved dramatically since then:</p>

<ul>
  <li>
    <p><strong>Infrastructure as Code</strong>: Data engineering now resembles software engineering more than statistics. We build pipelines using modern tooling—Airflow, dbt, Terraform—treating data infrastructure with the same rigor as application code.</p>
  </li>
  <li>
    <p><strong>Real-Time Systems</strong>: Batch processing has given way to streaming architectures. We’ve built systems that process millions of events per day using Kafka, Spark Streaming, and Lambda architectures.</p>
  </li>
  <li>
    <p><strong>ML Operations</strong>: Machine learning moved from Jupyter notebooks to production systems. This required building deployment pipelines, monitoring systems, and governance frameworks—bridging the gap between data science and platform engineering.</p>
  </li>
  <li>
    <p><strong>Cloud-Native Architecture</strong>: Migration from on-premise data centers to cloud infrastructure (primarily AWS) changed how we think about scalability, cost optimization, and system design.</p>
  </li>
</ul>

<p><strong>Key Accomplishments:</strong></p>

<ul>
  <li><strong>D.R.I.V.E. Award (2021)</strong>: Led the AVBT project team, recognized for innovation in data-driven decision making</li>
  <li><strong>Data Platform Modernization</strong>: Architected and led the migration from legacy ETL systems to modern ELT patterns using cloud-native tools</li>
  <li><strong>Team Building</strong>: Grew and mentored a team of data engineers, establishing best practices for code review, testing, and documentation</li>
  <li><strong>Cross-Functional Leadership</strong>: Bridged gaps between data science, analytics, software engineering, and business stakeholders</li>
</ul>

<p><strong>Technical Philosophy:</strong></p>

<p>Over these years, I’ve developed a perspective on data engineering that emphasizes:</p>

<ol>
  <li><strong>Correctness over Speed</strong>: Data pipelines should be provably correct. Late data is annoying; wrong data is catastrophic.</li>
  <li><strong>Simplicity over Cleverness</strong>: Complex systems fail in complex ways. Simple, well-documented systems are easier to debug, maintain, and extend.</li>
  <li><strong>End-to-End Ownership</strong>: Data engineers should understand both the source systems generating data and the downstream use cases consuming it.</li>
  <li><strong>Automation with Guardrails</strong>: Automate everything, but build validation and monitoring into every step.</li>
</ol>
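<p>Principle 4 can be made concrete with a small sketch: wrap each pipeline step in a validator so bad data fails loudly instead of flowing downstream. The names and sample data here are hypothetical, just to illustrate the pattern:</p>

```python
from functools import wraps

def with_guardrail(validator):
    """Wrap a pipeline step so its output is validated before it flows downstream."""
    def decorate(step):
        @wraps(step)
        def wrapper(*args, **kwargs):
            result = step(*args, **kwargs)
            if not validator(result):
                raise ValueError(f"guardrail failed after step {step.__name__!r}")
            return result
        return wrapper
    return decorate

@with_guardrail(lambda rows: all(r.get("amount", 0) >= 0 for r in rows))
def load_transactions():
    # Stand-in extract step; a real pipeline would read from a source system.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 7.5}]

rows = load_transactions()
print(len(rows))  # 2
```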

<p>This philosophy is increasingly influenced by formal methods and correctness proofs—concepts I encountered in graduate school that have direct applications to production data systems.</p>

<hr />

<h2 id="whats-next">What’s Next</h2>

<p>After years of building data infrastructure and completing formal computer science training, I’m pivoting toward three interconnected areas that represent the future of reliable, intelligent systems:</p>

<h3 id="1-ai-architecture-and-llm-systems">1. AI Architecture and LLM Systems</h3>

<p>Large language models have moved from research curiosities to production tools in just a few years. However, most organizations are still figuring out how to deploy them reliably. I’m particularly interested in:</p>

<ul>
  <li><strong>Fine-tuning and specialization</strong>: Adapting open-source models (Llama, Mistral) for domain-specific tasks where GPT-4 falls short</li>
  <li><strong>Retrieval-Augmented Generation (RAG)</strong>: Building systems that ground LLM outputs in verifiable data sources</li>
  <li><strong>LLM reliability</strong>: Developing validation frameworks that catch hallucinations, ensure consistency, and provide confidence scores</li>
  <li><strong>Cost optimization</strong>: Balancing model capability against inference costs—when to use 70B models vs. 7B models vs. prompt engineering</li>
</ul>

<p><strong>Current Project</strong>: I’m building Symbolic, a system that uses fine-tuned LLMs to generate formally verified specifications. This combines practical ML engineering with theoretical computer science, addressing the fundamental problem of AI reliability.</p>

<h3 id="2-formal-methods-and-verification">2. Formal Methods and Verification</h3>

<p>The software industry has largely relied on testing to ensure correctness: write code, write tests, hope you covered the important cases. Formal methods offer a different approach: mathematically prove that systems behave correctly under all possible conditions.</p>

<p><strong>Why This Matters Now:</strong></p>

<p>As systems become more complex—distributed databases, consensus algorithms, concurrent systems—the state space becomes too large to test exhaustively. A mutex with two processes has dozens of possible interleavings. With ten processes, it’s millions. Testing samples the state space; formal verification proves properties across the entire space.</p>
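<p>The blow-up is easy to quantify: <code>n</code> processes each running <code>k</code> atomic steps admit <code>(nk)! / (k!)^n</code> distinct interleavings, a multinomial coefficient. A quick sketch (the step counts here are illustrative):</p>

```python
from math import factorial

def interleavings(n_processes: int, steps_each: int) -> int:
    """Count the interleavings of n sequential processes of k steps each:
    the multinomial coefficient (n*k)! / (k!)^n."""
    total = n_processes * steps_each
    return factorial(total) // (factorial(steps_each) ** n_processes)

print(interleavings(2, 3))   # 2 processes, 3 steps each: 20 schedules
print(interleavings(10, 3))  # 10 processes: far beyond exhaustive testing
```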

<p>Companies like AWS, Microsoft, and MongoDB are already using formal methods (primarily TLA+) to verify critical systems. I believe this will become standard practice, not just for infrastructure companies, but for any organization building safety-critical or financially significant systems.</p>

<p><strong>Areas of Focus:</strong></p>

<ul>
  <li><strong>TLA+ and model checking</strong>: Specifying and verifying distributed systems, consensus protocols, and concurrent algorithms</li>
  <li><strong>Theorem provers</strong>: Exploring Coq, Lean, and other proof assistants for software verification</li>
  <li><strong>Accessibility</strong>: Making formal methods approachable for working engineers (hence the Symbolic project)</li>
</ul>

<h3 id="3-quantum-computing-applications">3. Quantum Computing Applications</h3>

<p>My graduate research in quantum simulation of molecular systems opened my eyes to both the promise and the current limitations of quantum computing. We’re in the NISQ (Noisy Intermediate-Scale Quantum) era—quantum computers exist and work, but they’re noisy, have limited qubits, and can’t yet outperform classical computers for most problems.</p>

<p><strong>Realistic Near-Term Applications:</strong></p>

<ul>
  <li><strong>Quantum chemistry</strong>: Simulating molecular systems for drug discovery and materials science</li>
  <li><strong>Optimization problems</strong>: Exploring quantum annealing and variational algorithms for combinatorial optimization</li>
  <li><strong>Quantum machine learning</strong>: Investigating whether quantum computers can accelerate specific ML workloads</li>
</ul>

<p><strong>What I’m Watching:</strong></p>

<ul>
  <li>Error correction progress (we need ~1000 physical qubits per logical qubit currently)</li>
  <li>Algorithm development for NISQ devices</li>
  <li>Hybrid quantum-classical approaches that leverage the strengths of both</li>
</ul>

<p>I don’t expect quantum computers to replace classical systems broadly, but there are specific domains—particularly in simulation and optimization—where they may provide exponential advantages.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>The past eight years have been transformative both personally and professionally. Graduate school provided formal training in areas I’d explored informally for years. My career evolved from data science and analytics to data engineering and systems architecture. I’ve moved from building models to building the platforms that enable others to build models.</p>

<p>The next phase focuses on reliability and correctness in AI systems—combining practical experience in production engineering with theoretical foundations in formal methods and quantum computing. The goal: build systems that aren’t just intelligent, but provably correct.</p>

<p>This blog will return to active use, documenting this journey. Expect deep dives into:</p>
<ul>
  <li>LLM fine-tuning and productionization</li>
  <li>Formal verification techniques for software systems</li>
  <li>Quantum algorithms and their practical applications</li>
  <li>The intersection of AI and formal methods</li>
</ul>

<p>The world has indeed moved on since 2017. But the fundamental questions remain: How do we build systems that work correctly? How do we make AI reliable? How do we bridge theory and practice? These questions will guide the next chapter.</p>

<hr />

<p><em>If you’re working on similar problems—AI reliability, formal methods, quantum applications—I’d love to connect. Reach out at <a href="mailto:timothy.c.dunbar@me.com">timothy.c.dunbar@me.com</a>.</em></p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[A retrospective on eight years of professional and personal growth: completing a Master's in Computer Science at Georgia Tech, advancing to Director of Data Engineering, and pivoting toward AI architecture, formal methods, and quantum computing applications.]]></summary></entry><entry><title type="html">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</title><link href="https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof/" rel="alternate" type="text/html" title="From Napkin Sketch to Mathematical Proof: Introducing Symbolic" /><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof</id><content type="html" xml:base="https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof/"><![CDATA[<h2 id="introduction-when-tests-arent-enough">Introduction: When Tests Aren’t Enough</h2>

<p>In 2014, Amazon Web Services prevented a catastrophic S3 outage using a specification language most developers have never heard of. The bug wasn’t caught by their extensive test suite, which had excellent coverage. It wasn’t caught by code review, performed by some of the industry’s best engineers. It was caught by <strong>TLA+</strong>, a formal specification language that can mathematically verify system properties across billions of possible states.</p>

<p>The bug? A subtle race condition in S3’s replication protocol that would only manifest under specific network partition scenarios—exactly the kind of edge case that’s nearly impossible to catch with traditional testing but trivial to find with formal methods. You can read more about <a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">how AWS uses formal methods in this paper</a>.</p>

<p>This raises an uncomfortable question: if 95% test coverage can still miss catastrophic bugs, what are we really testing?</p>

<h3 id="the-problem-with-testing">The Problem with Testing</h3>

<p>Traditional testing is example-based. You write test cases that check specific scenarios:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test_mutex_prevents_concurrent_access</span><span class="p">():</span>
    <span class="n">mutex</span> <span class="o">=</span> <span class="n">Mutex</span><span class="p">()</span>
    <span class="n">process1</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">mutex</span><span class="p">)</span>
    <span class="n">process2</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">mutex</span><span class="p">)</span>

    <span class="n">process1</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>
    <span class="k">assert</span> <span class="ow">not</span> <span class="n">process2</span><span class="p">.</span><span class="n">can_acquire</span><span class="p">()</span>  <span class="c1"># Checks ONE scenario
</span></code></pre></div></div>

<p>This test verifies one particular execution path. But what about:</p>
<ul>
  <li>The 10^15 other possible interleavings?</li>
  <li>Race conditions that only appear under specific timing?</li>
  <li>Deadlocks that emerge from complex state interactions?</li>
</ul>

<p><strong>Formal methods</strong> don’t check examples—they prove properties. A TLA+ specification can verify that “at most one process holds the mutex” across <em>all possible executions</em>. Not 1,000 test cases. Not 1,000,000. <em>All of them.</em></p>

<h3 id="the-accessibility-problem">The Accessibility Problem</h3>

<p>So why isn’t everyone using TLA+? Because it looks like this:</p>

<pre><code class="language-tla">Next ==
    \/ \E p \in Processes:
        /\ pc[p] = "idle"
        /\ critical = {}
        /\ critical' = {p}
        /\ pc' = [pc EXCEPT ![p] = "critical"]
    \/ \E p \in Processes:
        /\ pc[p] = "critical"
        /\ critical' = {}
        /\ pc' = [pc EXCEPT ![p] = "idle"]
</code></pre>

<p>For most developers, this is a significant barrier. Learning TLA+ requires understanding temporal logic, state machines, and a syntax that feels foreign compared to modern programming languages. Companies like AWS and Microsoft have the resources to train engineers in formal methods. Most don’t.</p>

<p><strong>What if we could make TLA+ as accessible as writing a test case?</strong></p>

<p>That’s the mission behind Symbolic: a project I’m building to translate natural language specifications into formally verified TLA+ code using large language models. This post introduces the architecture and explains the key design decisions.</p>

<hr />

<h2 id="the-planned-architecture-from-natural-language-to-mathematical-proof">The Planned Architecture: From Natural Language to Mathematical Proof</h2>

<p>How do you turn a sentence like “users can’t overdraw their account” into something a computer can verify across $10^{15}$ states? The answer is a carefully designed pipeline that combines natural language processing, large language models, and formal verification tools.</p>

<h3 id="system-overview">System Overview</h3>

<p>Symbolic will use a six-stage pipeline with feedback loops:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────┐
│ Natural Language│ "A mutex ensures mutual exclusion"
│ Input           │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessor   │ Extract: processes, variables, invariants
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM Generator  │ Llama 3.2-8B (fine-tuned)
│  (w/ LoRA)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Postprocessor   │ Clean markdown artifacts, ensure structure
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Syntax Validator│ TLA+ parser (SANY)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TLC Validator  │ Model checker (semantic verification)
└────────┬────────┘
         │
         ▼
    ┌───┴────┐
    │ Valid? │───NO──┐
    └───┬────┘       │
        │ YES        │
        ▼            ▼
    ┌────────┐  ┌──────────────┐
    │ Output │  │  Refinement  │
    │ TLA+   │  │  Loop (retry)│
    └────────┘  └──────┬───────┘
                       │
                       └──────┐
                              │
                     [Back to Generator]
</code></pre></div></div>
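<p>In code, the control flow above might be wired together like this. This is a sketch only: the stage functions are stubs standing in for the real components, not Symbolic's actual API.</p>

```python
from dataclasses import dataclass

# Stub stages: stand-ins for the real components described above.
def preprocess(text):
    return {"processes": ["p1", "p2"]}

def generate_spec(text, concepts, feedback):
    return "---- MODULE Generated ----\nVARIABLE x\nInit == x = 0\n===="

def postprocess(raw):
    return raw.strip()

def validate_syntax(spec):
    return ("MODULE" in spec, [])

def validate_semantics(spec):
    return (True, [])

@dataclass
class PipelineResult:
    spec: str
    valid: bool
    attempts: int

def run_pipeline(text: str, max_retries: int = 3) -> PipelineResult:
    """Drive the stages, feeding validator errors back into generation on failure."""
    feedback, spec = "", ""
    for attempt in range(1, max_retries + 1):
        concepts = preprocess(text)                                  # Stage 1
        spec = postprocess(generate_spec(text, concepts, feedback))  # Stages 2-3
        ok, errors = validate_syntax(spec)                           # Stage 4: SANY
        if ok:
            ok, errors = validate_semantics(spec)                    # Stage 5: TLC
        if ok:
            return PipelineResult(spec, True, attempt)
        feedback = "; ".join(errors)                                 # Refinement loop
    return PipelineResult(spec, False, max_retries)

result = run_pipeline("A mutex ensures mutual exclusion")
print(result.valid, result.attempts)  # True 1
```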

<p>Let’s examine each component in detail.</p>

<hr />

<h3 id="stage-1-natural-language-preprocessing">Stage 1: Natural Language Preprocessing</h3>

<p>The preprocessor’s job will be to extract structured information from unstructured text. While the LLM could theoretically do this, separating it into a dedicated stage provides:</p>

<ol>
  <li><strong>Faster iteration</strong> (no LLM call needed for debugging)</li>
  <li><strong>Explicit context</strong> for prompt engineering</li>
  <li><strong>Deterministic parsing</strong> of common patterns</li>
</ol>

<p><strong>Implementation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NLPreprocessor</span><span class="p">:</span>
    <span class="s">"""Extracts concepts from natural language input."""</span>

    <span class="n">PROCESS_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"process"</span><span class="p">,</span> <span class="s">"thread"</span><span class="p">,</span> <span class="s">"node"</span><span class="p">,</span> <span class="s">"agent"</span><span class="p">}</span>
    <span class="n">VARIABLE_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"variable"</span><span class="p">,</span> <span class="s">"state"</span><span class="p">,</span> <span class="s">"counter"</span><span class="p">,</span> <span class="s">"lock"</span><span class="p">}</span>
    <span class="n">INVARIANT_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"always"</span><span class="p">,</span> <span class="s">"never"</span><span class="p">,</span> <span class="s">"must"</span><span class="p">,</span> <span class="s">"ensures"</span><span class="p">}</span>

    <span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ExtractedConcepts</span><span class="p">:</span>
        <span class="n">normalized</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_normalize_text</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">ExtractedConcepts</span><span class="p">(</span>
            <span class="n">processes</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_processes</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">variables</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_variables</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">invariants</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_invariants</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">actions</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_actions</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">temporal_properties</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_temporal_properties</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span>
        <span class="p">)</span>
</code></pre></div></div>

<p><strong>Pattern Recognition Examples:</strong></p>

<table>
  <thead>
    <tr>
      <th>Input Pattern</th>
      <th>Extracted Concept</th>
      <th>Reasoning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“two processes compete”</td>
      <td><code class="language-plaintext highlighter-rouge">processes = {"p1", "p2"}</code></td>
      <td>Numeric detection</td>
    </tr>
    <tr>
      <td>“mutex ensures mutual exclusion”</td>
      <td><code class="language-plaintext highlighter-rouge">variables = {"critical", "pc"}</code></td>
      <td>Domain knowledge (mutex → critical section)</td>
    </tr>
    <tr>
      <td>“at most one process”</td>
      <td><code class="language-plaintext highlighter-rouge">invariants = ["Cardinality(critical) &lt;= 1"]</code></td>
      <td>Quantifier detection</td>
    </tr>
    <tr>
      <td>“acquire and release”</td>
      <td><code class="language-plaintext highlighter-rouge">actions = ["acquire", "release"]</code></td>
      <td>Verb extraction</td>
    </tr>
  </tbody>
</table>
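<p>A minimal sketch of the pattern matching behind mappings like these (the regexes and number-word table are illustrative, not the preprocessor's actual rules):</p>

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def _count(word: str) -> int:
    return int(word) if word.isdigit() else NUMBER_WORDS[word]

def extract_processes(text: str) -> set:
    """Numeric detection: 'two processes compete' -> {'p1', 'p2'}."""
    m = re.search(r"\b(one|two|three|four|five|\d+)\s+(?:process|thread|node)",
                  text.lower())
    if not m:
        return set()
    return {f"p{i}" for i in range(1, _count(m.group(1)) + 1)}

def extract_invariants(text: str) -> list:
    """Quantifier detection: 'at most one process' -> a cardinality bound."""
    m = re.search(r"at most (one|two|three|\d+)", text.lower())
    return [f"Cardinality(critical) <= {_count(m.group(1))}"] if m else []

print(sorted(extract_processes("Two processes compete for a lock")))  # ['p1', 'p2']
print(extract_invariants("At most one process holds the lock"))
```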

<p><strong>Why This Matters:</strong></p>

<p>When building the LLM prompt, this context can be injected:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Natural Language: "A mutex ensures mutual exclusion"

Extracted Context:
- Processes: p1, p2
- Variables: critical, pc
- Invariants: at most one process in critical section
- Actions: acquire, release

Generate a TLA+ specification that...
</code></pre></div></div>

<p>Initial experiments show this dramatically improves generation quality by giving the LLM structured information instead of raw text.</p>

<hr />

<h3 id="stage-2-llm-based-tla-generation">Stage 2: LLM-Based TLA+ Generation</h3>

<p>This is where the magic happens—but also where the complexity lies.</p>

<h4 id="model-selection-why-llama-32-8b">Model Selection: Why Llama 3.2-8B?</h4>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Pros</th>
      <th>Cons</th>
      <th>Decision</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>GPT-4</strong></td>
      <td>Best reasoning, strong few-shot</td>
      <td>Closed API, can’t fine-tune, expensive ($0.03/1K tokens)</td>
      <td>❌</td>
    </tr>
    <tr>
      <td><strong>Claude 3</strong></td>
      <td>Great for structured output</td>
      <td>Can’t fine-tune, rate limits</td>
      <td>❌</td>
    </tr>
    <tr>
      <td><strong>Llama 3.2-8B</strong></td>
      <td>Open source, fast inference, fine-tunable</td>
      <td>Needs fine-tuning for TLA+</td>
      <td>✅</td>
    </tr>
  </tbody>
</table>

<p>The key hypothesis: <strong>Fine-tuning an open-source model will beat prompt engineering a closed model</strong> for domain-specific tasks like TLA+ generation.</p>

<h4 id="fine-tuning-configuration">Fine-Tuning Configuration</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoModelForCausalLM</span>
<span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="n">LoraConfig</span><span class="p">,</span> <span class="n">get_peft_model</span>
<span class="kn">from</span> <span class="nn">bitsandbytes</span> <span class="kn">import</span> <span class="n">BitsAndBytesConfig</span>

<span class="c1"># 4-bit quantization for memory efficiency
</span><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
    <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s">"nf4"</span><span class="p">,</span>
    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span>
<span class="p">)</span>

<span class="c1"># LoRA configuration
</span><span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>                              <span class="c1"># Rank (controls adapter capacity)
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>                     <span class="c1"># Scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">],</span>  <span class="c1"># Adapt attention layers
</span>    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"meta-llama/Llama-3.2-8B"</span><span class="p">,</span>
    <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Why LoRA (Low-Rank Adaptation)?</strong></p>

<p>Full fine-tuning of an 8B parameter model requires:</p>
<ul>
  <li><strong>Memory</strong>: ~32GB GPU RAM</li>
  <li><strong>Time</strong>: 40+ hours on a single GPU</li>
  <li><strong>Cost</strong>: $500-1000 on cloud GPUs</li>
</ul>

<p>LoRA adaptation requires:</p>
<ul>
  <li><strong>Memory</strong>: ~12GB GPU RAM (fits on free Colab!)</li>
  <li><strong>Time</strong>: 4-6 hours</li>
  <li><strong>Cost</strong>: $0 (using free tier)</li>
</ul>

<p>LoRA works by freezing the base model and training small adapter matrices that modify attention projections. The adapters are only 45MB compared to the 13GB base model.</p>
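<p>Back-of-the-envelope arithmetic shows why the adapters stay tiny: each adapted weight matrix gains two low-rank factors, B (d_out x r) and A (r x d_in), so r * (d_in + d_out) trainable parameters. The dimensions below are illustrative of a Llama-class model, not exact:</p>

```python
r = 16          # LoRA rank, matching the config above
d_model = 4096  # hidden size (illustrative)
n_layers = 32   # transformer layers (illustrative)

# r * (d_in + d_out) per adapted square projection; q_proj and v_proj per layer.
per_matrix = r * (d_model + d_model)
total = per_matrix * 2 * n_layers
print(f"{total:,} trainable parameters")  # 8,388,608
print(f"fraction of an 8B-parameter base: {total / 8e9:.2%}")
```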

<h4 id="prompt-engineering">Prompt Engineering</h4>

<p>Even with fine-tuning, prompt structure matters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_build_prompt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">natural_language</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="n">ExtractedConcepts</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="k">return</span> <span class="sa">f</span><span class="s">"""You are an expert in TLA+ formal specifications.

Natural Language Description:
</span><span class="si">{</span><span class="n">natural_language</span><span class="si">}</span><span class="s">

Extracted Context:
- Processes: </span><span class="si">{</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">processes</span><span class="p">)</span><span class="si">}</span><span class="s">
- Variables: </span><span class="si">{</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">variables</span><span class="p">)</span><span class="si">}</span><span class="s">
- Invariants: </span><span class="si">{</span><span class="s">"; "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">invariants</span><span class="p">)</span><span class="si">}</span><span class="s">

Generate a complete TLA+ module with:
1. MODULE declaration and EXTENDS clause
2. VARIABLE declarations
3. Init predicate (initial state)
4. Action predicates (state transitions)
5. Next predicate (all possible actions)
6. Invariants to verify

TLA+ Specification:
"""</span>
</code></pre></div></div>

<p><strong>Key Design Decision: Why include extracted context?</strong></p>

<p>Early prototyping with base models suggests:</p>
<ul>
  <li><strong>Without context</strong>: ~60% syntax error rate (estimated)</li>
  <li><strong>With context</strong>: ~30% syntax error rate (target)</li>
  <li><strong>With context + fine-tuning</strong>: &lt;10% syntax error rate (goal)</li>
</ul>

<p>The combination of preprocessing and fine-tuning should be crucial to achieving production-quality results.</p>

<hr />

<h3 id="stage-3-postprocessing---making-llm-output-parser-ready">Stage 3: Postprocessing - Making LLM Output Parser-Ready</h3>

<p><strong>The Problem:</strong> LLMs are trained on code from the internet—Stack Overflow answers, GitHub READMEs, blog posts, documentation. This means they’ve learned that “code” often appears wrapped in markdown, surrounded by explanatory text, or includes inline comments explaining their reasoning.</p>

<p>When prompted to generate TLA+, a model might produce:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here's the TLA+ specification you requested:

```tla
---- MODULE Mutex ----
\* This is a simple mutex specification
VARIABLE critical, pc

Init ==
    /\ critical = {}
    /\ pc = [p \in {1,2} |-&gt; "idle"]  \* Both processes start idle
...
====
```

This specification ensures mutual exclusion by…
</code></pre></div></div>

<p>Or it might include natural language mixed with code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>First, we declare the variables:
VARIABLE critical

Then we define the initial state:
Init == critical = {}
</code></pre></div></div>

<p><strong>The Cleanup Tasks:</strong></p>

<p>The postprocessor needs to extract clean, parseable TLA+ from this messy output:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

class TLAPostprocessor:
    def process(self, raw_output: str) -&gt; str:
        # Remove markdown code fences
        cleaned = re.sub(r'```(?:tla|TLA)?\n(.*?)```', r'\1', raw_output, flags=re.DOTALL)

        # Remove common prefixes/suffixes (e.g., "Here's the specification:")
        cleaned = re.sub(r'^.*?(?=----\s*MODULE)', '', cleaned, flags=re.DOTALL)
        cleaned = re.sub(r'====.*?$', '====', cleaned, flags=re.DOTALL)

        # Remove inline comments that are really LLM explanations
        # (More sophisticated filtering may be needed)

        # Ensure required structure
        if not re.search(r'---- MODULE \w+ ----', cleaned):
            cleaned = f"---- MODULE Generated ----\n{cleaned}"
        if '====' not in cleaned:
            cleaned += "\n===="

        return cleaned.strip()
</code></pre></div></div>

<p><strong>Why This Matters:</strong></p>

<p>The TLA+ parser expects pure TLA+ syntax. Any extraneous text—even a single “Here’s your code:” prefix—will cause a parse error. The postprocessor acts as a bridge between “LLM conversational output” and “strict parser input.”</p>

<p>This is likely not exhaustive—as the system is tested with real model outputs, more edge cases will emerge (JSON formatting, escaped characters, hallucinated syntax extensions, etc.). The postprocessor will evolve to handle these as they’re discovered.</p>

<hr />

<h3 id="stage-4-syntax-validation-with-sany">Stage 4: Syntax Validation with SANY</h3>

<p>SANY (Syntactic Analyzer) is the official TLA+ parser, part of the standard TLA+ Tools distribution. It performs static analysis to catch:</p>
<ul>
  <li>Missing MODULE declaration</li>
  <li>Malformed operator expressions (e.g., a dangling <code class="language-plaintext highlighter-rouge">/\</code> with a missing operand)</li>
  <li>Undefined variables</li>
  <li>Level errors (TLA+ is untyped, but constant-, state-, and action-level expressions must be used consistently)</li>
</ul>

<p><strong>Integration:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SyntaxValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="nb">SyntaxError</span><span class="p">]]:</span>
        <span class="c1"># Write to temp file
</span>        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">'.tla'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>
            <span class="n">temp_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>

        <span class="c1"># Run SANY (TLA+ parser)
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
            <span class="p">[</span><span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tla_tools_path</span><span class="p">),</span> <span class="s">'tla2sany.SANY'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">temp_path</span><span class="p">)],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">30</span>
        <span class="p">)</span>

        <span class="c1"># Parse errors
</span>        <span class="n">errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_sany_output</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span> <span class="o">+</span> <span class="n">result</span><span class="p">.</span><span class="n">stderr</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="n">errors</span>
</code></pre></div></div>

<p><strong>Example Error:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input:  VARIABLE x, y
Output: line 5, col 12: Unknown operator: /\\
</code></pre></div></div>

<p>This gives us precise line/column information for refinement.</p>
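<p>That line/column information has to be recovered from SANY’s textual report. A minimal sketch of the parsing step, assuming messages shaped like the example above (the real report format varies between error kinds):</p>

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class ParsedError:
    line: int
    col: int
    message: str

# Assumed message shape: "line 5, col 12: Unknown operator";
# SANY's actual output differs between error kinds, so treat this as a sketch.
ERROR_RE = re.compile(r"line (\d+), col (\d+)[.:]?\s*(.+)")

def parse_sany_output(output: str) -> List[ParsedError]:
    """Extract (line, col, message) triples from SANY's textual report."""
    errors = []
    for raw in output.splitlines():
        match = ERROR_RE.search(raw)
        if match:
            errors.append(ParsedError(int(match.group(1)),
                                      int(match.group(2)),
                                      match.group(3).strip()))
    return errors
```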

<hr />

<h3 id="stage-5-semantic-validation-tlc">Stage 5: Semantic Validation (TLC)</h3>

<p>TLC is a model checker. It:</p>
<ol>
  <li>Enumerates all reachable states</li>
  <li>Checks invariants at each state</li>
  <li>Searches for deadlocks and liveness violations</li>
</ol>

<p><strong>Example:</strong></p>

<pre><code class="language-tla">---- MODULE BrokenMutex ----
EXTENDS Naturals, FiniteSets
VARIABLES critical

Init == critical = {}

Enter(p) ==
    /\ critical' = critical \cup {p}  (* BUG: No mutual exclusion check! *)

Next == \E p \in {1, 2}: Enter(p)

MutualExclusion == Cardinality(critical) &lt;= 1
====
</code></pre>

<p>TLC will find:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Invariant MutualExclusion is violated.
State 1: critical = {}
State 2: critical = {1}
State 3: critical = {1, 2}  (* Violation! *)
</code></pre></div></div>

<p>This is the killer feature: <strong>TLC proves the specification is wrong</strong>, not just that one test case fails.</p>

<p><strong>Integration:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TLCValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="n">TLCError</span><span class="p">]]:</span>
        <span class="c1"># Write the spec and a minimal TLC config to temp files
</span>        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s">'w'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s">'.tla'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>
            <span class="n">spec_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
        <span class="n">spec_path</span><span class="p">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s">'.cfg'</span><span class="p">).</span><span class="n">write_text</span><span class="p">(</span><span class="s">"SPECIFICATION Spec</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Run TLC (it picks up the .cfg that shares the spec's base name)
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
            <span class="p">[</span><span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tlc_jar_path</span><span class="p">),</span> <span class="s">'tlc2.TLC'</span><span class="p">,</span>
             <span class="s">'-workers'</span><span class="p">,</span> <span class="s">'4'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">spec_path</span><span class="p">)],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">300</span>  <span class="c1"># 5 minute timeout
</span>        <span class="p">)</span>

        <span class="n">errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_tlc_output</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="n">errors</span>
</code></pre></div></div>

<hr />

<h3 id="stage-6-refinement-loop">Stage 6: Refinement Loop</h3>

<p>When validation fails, the system will feed errors back to the LLM:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">refine</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">errors</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ValidationError</span><span class="p">],</span> <span class="n">max_iterations</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">is_valid</span><span class="p">,</span> <span class="n">new_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">is_valid</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">spec</span>

        <span class="c1"># Build refinement prompt
</span>        <span class="n">error_summary</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_format_errors</span><span class="p">(</span><span class="n">new_errors</span><span class="p">)</span>
        <span class="n">refinement_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
The following TLA+ specification has errors:

</span><span class="si">{</span><span class="n">spec</span><span class="si">}</span><span class="s">

Errors:
</span><span class="si">{</span><span class="n">error_summary</span><span class="si">}</span><span class="s">

Fix these errors and regenerate a valid specification.
"""</span>

        <span class="n">spec</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generator</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">refinement_prompt</span><span class="p">)</span>

    <span class="k">raise</span> <span class="n">RefinementError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Could not generate valid spec after </span><span class="si">{</span><span class="n">max_iterations</span><span class="si">}</span><span class="s"> attempts"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Target Success Rates:</strong></p>

<table>
  <thead>
    <tr>
      <th>Iteration</th>
      <th>Syntax Valid (Goal)</th>
      <th>Semantically Valid (Goal)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>40-50%</td>
      <td>20-30%</td>
    </tr>
    <tr>
      <td>2</td>
      <td>70-80%</td>
      <td>50-60%</td>
    </tr>
    <tr>
      <td>3</td>
      <td>85-90%</td>
      <td>70-80%</td>
    </tr>
    <tr>
      <td>4+</td>
      <td>&gt;90%</td>
      <td>&gt;80%</td>
    </tr>
  </tbody>
</table>

<p>The iterative approach should be essential: preliminary testing suggests one-shot generation rarely works for complex specifications.</p>
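<p>The target table is roughly what a simple compounding model predicts: if each refinement pass independently repairs a failing spec with probability p, the chance of having a valid spec within k passes is 1 - (1 - p)^k. A sketch, assuming a 45% per-pass fix rate (an illustration, not a measured number):</p>

```python
def cumulative_success(p_fix: float, k: int) -> float:
    """P(a valid spec within k passes), assuming each pass independently succeeds."""
    return 1 - (1 - p_fix) ** k

# With an assumed 45% per-pass syntax-fix rate:
rates = [round(cumulative_success(0.45, k), 2) for k in range(1, 5)]
print(rates)  # [0.45, 0.7, 0.83, 0.91]
```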

<hr />

<h2 id="design-decisions--tradeoffs">Design Decisions &amp; Tradeoffs</h2>

<h3 id="why-not-just-use-gpt-4">Why Not Just Use GPT-4?</h3>

<p><strong>Cost Analysis:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Cost per Spec</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4 API</td>
      <td>$0.15</td>
      <td>5K tokens in/out, 3 iterations</td>
    </tr>
    <tr>
      <td>Llama 3.2 (self-hosted)</td>
      <td>$0.001</td>
      <td>Inference on local GPU</td>
    </tr>
    <tr>
      <td>Llama 3.2 (cloud GPU)</td>
      <td>$0.02</td>
      <td>AWS g5.xlarge instance</td>
    </tr>
  </tbody>
</table>

<p>At 1,000 specs generated:</p>
<ul>
  <li>GPT-4: <strong>$150</strong></li>
  <li>Self-hosted Llama: <strong>$1</strong></li>
  <li>Cloud Llama: <strong>$20</strong></li>
</ul>
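<p>The totals are just linear scaling of the per-spec figures; a quick check:</p>

```python
# Per-spec costs from the table above, in USD
costs = {"gpt4_api": 0.15, "llama_self_hosted": 0.001, "llama_cloud": 0.02}

# Totals for 1,000 generated specs scale linearly
totals = {name: round(per_spec * 1000) for name, per_spec in costs.items()}
print(totals)  # {'gpt4_api': 150, 'llama_self_hosted': 1, 'llama_cloud': 20}
```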

<p><strong>Fine-Tuning Control:</strong></p>

<p>With open models, I’ll be able to:</p>
<ul>
  <li>Train on proprietary TLA+ specs (companies can’t send to OpenAI)</li>
  <li>Control the training data distribution</li>
  <li>Debug model behavior by inspecting weights</li>
  <li>Deploy on-premise (critical for security-sensitive applications)</li>
</ul>

<h3 id="why-iterative-refinement">Why Iterative Refinement?</h3>

<p><strong>Alternative: Multi-Agent Generation</strong></p>

<p>Some systems use multiple LLM calls in parallel:</p>
<ul>
  <li>Agent 1: Generate spec</li>
  <li>Agent 2: Generate invariants</li>
  <li>Agent 3: Generate test cases</li>
</ul>

<p>This is <strong>faster</strong> (parallel) but <strong>more expensive</strong> (3x API calls) and <strong>less coherent</strong> (agents don’t communicate).</p>

<p>Iterative refinement is <strong>sequential</strong> but should produce <strong>higher quality</strong> output because each iteration learns from validation feedback.</p>

<h3 id="why-tla-first">Why TLA+ First?</h3>

<p><strong>Alternative Targets:</strong></p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Alloy</strong></td>
      <td>Simpler syntax, better for relational models</td>
      <td>Weaker temporal logic</td>
    </tr>
    <tr>
      <td><strong>Z Notation</strong></td>
      <td>Mature, used in safety-critical systems</td>
      <td>Harder to tool</td>
    </tr>
    <tr>
      <td><strong>Coq</strong></td>
      <td>Theorem prover, ultimate verification</td>
      <td>Extremely steep learning curve</td>
    </tr>
    <tr>
      <td><strong>TLA+</strong></td>
      <td>Best temporal logic support, tooling (TLC), AWS/MS use it</td>
      <td>Unfamiliar syntax</td>
    </tr>
  </tbody>
</table>

<p>TLA+ hits the sweet spot of <strong>expressiveness</strong> (temporal logic), <strong>tooling</strong> (TLC model checker), and <strong>industry adoption</strong> (AWS, Azure).</p>

<hr />

<h2 id="the-roadmap">The Roadmap</h2>

<p>I’m building Symbolic in phases over the next 12 weeks:</p>

<p><strong>Phase 1: Foundation (Weeks 1-3)</strong></p>
<ul>
  <li>Core architecture implementation</li>
  <li>Preprocessor and postprocessor</li>
  <li>Basic validation pipeline integration</li>
</ul>

<p><strong>Phase 2: Fine-Tuning (Weeks 4-8)</strong></p>
<ul>
  <li>Dataset creation (target: 5,000+ NL-TLA+ pairs)</li>
  <li>Model fine-tuning with LoRA</li>
  <li>Evaluation and iteration</li>
</ul>

<p><strong>Phase 3: Refinement &amp; Polish (Weeks 9-12)</strong></p>
<ul>
  <li>Iterative refinement loop</li>
  <li>CLI tool development</li>
  <li>Documentation and examples</li>
</ul>

<p><strong>Future Goals:</strong></p>
<ul>
  <li>Multi-language support (Alloy, Z notation, SPIN)</li>
  <li>VS Code extension with real-time validation</li>
  <li>Web interface for non-technical users</li>
  <li>Formal verification as a service API</li>
</ul>

<p>The ultimate goal: <strong>make formal methods as ubiquitous as unit testing.</strong></p>

<hr />

<h2 id="follow-along">Follow Along</h2>

<p>I’m building this project in public and documenting the journey on this blog and on GitHub. Over the coming weeks, I’ll be sharing:</p>

<ul>
  <li><strong>Deep dives</strong> into TLA+ concepts and why they matter</li>
  <li><strong>Technical posts</strong> on fine-tuning LLMs for specialized domains</li>
  <li><strong>Lessons learned</strong> from building synthetic datasets</li>
  <li><strong>Performance metrics</strong> as the system improves</li>
  <li><strong>Open source code</strong> when it’s ready for early testing</li>
</ul>

<p>If you’re interested in formal methods, LLM fine-tuning, or just want to see a project built from scratch, subscribe or follow the GitHub repository (link coming soon).</p>

<p><strong>What would you want to formally verify?</strong> I’m collecting use cases and example systems to test Symbolic against. Reach out at <a href="mailto:timothy.c.dunbar@me.com">timothy.c.dunbar@me.com</a> if you have ideas or want to collaborate.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li><a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+ Homepage</a> - Leslie Lamport’s original work</li>
  <li><a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">AWS and TLA+</a> - How Amazon uses formal methods</li>
  <li><a href="https://learntla.com/">Learn TLA+</a> - Excellent tutorial by Hillel Wayne</li>
  <li><a href="https://arxiv.org/abs/2106.09685">LoRA Paper</a> - Low-Rank Adaptation of Large Language Models</li>
</ul>

<hr />

<p><em>This is part 1 of a series on building Symbolic. Next up: “I Spent 40 Hours Learning TLA+ So You Don’t Have To” - a practical guide to the 5 core concepts.</em></p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[Introducing Symbolic: a project to make formal verification accessible by translating natural language specifications into TLA+ using fine-tuned LLMs. This post explores the architecture, design decisions, and the mission to make formal methods as ubiquitous as unit testing.]]></summary></entry><entry><title type="html">Moran’s I Analysis of Ghent Housing Data</title><link href="https://realtimdunbar.github.io/Ghent-Clustering-Analysis/" rel="alternate" type="text/html" title="Moran’s I Analysis of Ghent Housing Data" /><published>2017-08-08T00:00:00+00:00</published><updated>2017-08-08T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Ghent-Clustering-Analysis</id><content type="html" xml:base="https://realtimdunbar.github.io/Ghent-Clustering-Analysis/"><![CDATA[<hr />

<h2 id="morans-i-explanation">Moran’s I Explanation</h2>

<p>Moran’s I is a measure of spatial autocorrelation: how much the values of a variable cluster in space. In this case I am comparing the Euclidean distances between homes in the Ghent neighborhood of Norfolk and their property values. I didn’t do a lot of cleaning of the data, preferring instead to get a baseline and then see how much the p-value improves after cleaning.</p>
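<p>For reference, the statistic itself is</p>

\[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}}\cdot\frac{\sum_{i}\sum_{j} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i}(x_i-\bar{x})^2} \]

<p>where the x’s are the property values and the spatial weights w are, in this analysis, the inverse Euclidean distances between homes.</p>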

<p>As always we need our libraries</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RDSTK</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ape</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Here I am pulling out the columns I am interested in, specifically the complete address</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">&lt;-</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"~/Naggle/2017-07_GhentHousingData/data/GhentDataSetWithGeo.csv"</span><span class="p">)</span><span class="w">

</span><span class="n">columns</span><span class="o">&lt;-</span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">17</span><span class="p">)</span><span class="w">

</span><span class="n">new_df</span><span class="o">&lt;-</span><span class="n">df</span><span class="p">[,</span><span class="n">columns</span><span class="p">]</span><span class="w">

</span><span class="n">new_df</span><span class="o">$</span><span class="n">whole_address</span><span class="o">&lt;-</span><span class="n">paste</span><span class="p">(</span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property Street`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property City`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property State`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property Zip`</span><span class="p">)</span><span class="w">
</span><span class="n">new_df</span><span class="o">$</span><span class="n">total</span><span class="o">&lt;-</span><span class="n">df</span><span class="o">$</span><span class="n">`2016 Building`</span><span class="o">+</span><span class="n">df</span><span class="o">$</span><span class="n">`2016 Land`</span><span class="w">
</span></code></pre></div></div>
<p>Let’s make a map to get a sense of any clustering in home values. On the map below, the darker the blue, the higher the total property value of the address.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pal</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">colorQuantile</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">),</span><span class="w"> </span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">new_df</span><span class="o">$</span><span class="n">total</span><span class="p">))</span><span class="w">

</span><span class="n">leaflet</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">addTiles</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">addCircleMarkers</span><span class="p">(</span><span class="n">lng</span><span class="o">=</span><span class="n">new_df</span><span class="o">$</span><span class="n">longitude</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="o">=</span><span class="n">new_df</span><span class="o">$</span><span class="n">latitude</span><span class="p">,</span><span class="w"> </span><span class="n">weight</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">radius</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">opacity</span><span class="o">=</span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">pal</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/images/moransI.png" alt="Map of Ghent property values" /></p>

<p>Finally, I calculate the Moran’s I of this dataset. The p-value of 0.375 below is not as low as I expected. It makes sense that similarly valued homes would sit close to each other; it’s not often that one sees a mansion next to a trailer park. I will clean up the data and see if the result can be improved.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xy</span><span class="o">&lt;-</span><span class="n">new_df</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">10</span><span class="p">)]</span><span class="w">

</span><span class="n">xy.dist</span><span class="o">&lt;-</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">dist</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">xy</span><span class="o">$</span><span class="n">longitude</span><span class="p">,</span><span class="w"> </span><span class="n">xy</span><span class="o">$</span><span class="n">latitude</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"euclidean"</span><span class="p">,</span><span class="w"> </span><span class="n">diag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">

</span><span class="n">xy.dist.inv</span><span class="w"> </span><span class="o">&lt;</span><span class="m">-1</span><span class="o">/</span><span class="n">xy.dist</span><span class="w">

</span><span class="n">diag</span><span class="p">(</span><span class="n">xy.dist.inv</span><span class="p">)</span><span class="o">&lt;</span><span class="m">-0</span><span class="w">
</span><span class="n">xy.dist.inv</span><span class="p">[</span><span class="n">xy.dist.inv</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">Inf</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="n">Moran.I</span><span class="p">(</span><span class="n">xy</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">xy.dist.inv</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>As seen below, the p-value is higher than .05, so we fail to reject the null hypothesis: this run does not show a statistically significant spatial correlation with property value.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">$</span><span class="n">observed</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.001799266</span><span class="w">

</span><span class="o">$</span><span class="n">expected</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">-0.0004823927</span><span class="w">

</span><span class="o">$</span><span class="n">sd</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.002575867</span><span class="w">

</span><span class="o">$</span><span class="n">p.value</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.3757347</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Benford’s Law - Ghent Housing Data</title><link href="https://realtimdunbar.github.io/Benford's-Law-Analysis-Ghent/" rel="alternate" type="text/html" title="Benford’s Law - Ghent Housing Data" /><published>2017-07-04T00:00:00+00:00</published><updated>2017-07-04T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Benford&apos;s%20Law%20Analysis%20-%20Ghent</id><content type="html" xml:base="https://realtimdunbar.github.io/Benford&apos;s-Law-Analysis-Ghent/"><![CDATA[<hr />

<h2 id="benfords-law-explained">Benford’s Law Explained</h2>

<p>Benford’s law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many naturally occurring sets of numbers. Roughly 30% of the numbers should start with <em>1</em>, roughly 18% should start with <em>2</em>, and so on down to about 4.6% starting with <em>9</em>. Benford’s law is usually used as a kind of “canary” for fraud: if the numbers in a dataset do not conform to it, there might be some manipulation going on and further investigation is required.</p>

<p>This is just a quick rundown of the probability formula for Benford’s law. For a leading digit d such that</p>

<p>\[ d\in\{1, 2, \ldots, 9\} \]</p>

<p>The formula is…</p>

\[P(d)=\log_{10}(d + 1)-\log_{10}(d)\]

<p>Because the log of a quotient is the difference of the logs (and vice versa), we can rewrite this as…</p>

\[P(d)=\log_{10}(\frac{d + 1}d)\]

<p>And finally…</p>

\[P(d)=\log_{10}(1+\frac{1}d)\]
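<p>As a quick numerical cross-check of that formula (sketched in Python here, though the analysis below uses R), the nine leading-digit probabilities should sum to 1, with d = 1 near 30%:</p>

```python
import math

# Expected Benford frequencies: P(d) = log10(1 + 1/d) for d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

print(round(benford[1], 3))  # 0.301 -> about 30% of leading digits are 1
print(round(benford[2], 3))  # 0.176
print(round(sum(benford.values()), 6))  # 1.0 -> the nine probabilities cover everything
```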

<p>Now on with the fun stuff</p>

<h2 id="administrative-stuff-package-loading-variables-etc">Administrative stuff, package loading, variables, etc.</h2>

<p>As always we need to load the libraries we are going to use as well as the data into a dataframe.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">ghent_df</span><span class="o">&lt;-</span><span class="n">readr</span><span class="o">::</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"~/Naggle/GhentDataSetTrain.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="we-need-to-filter-out-some-stuff-to-prepare-for-benfords-law">We need to filter out some stuff to prepare for Benford’s Law</h2>

<p>There is at least one type of construction represented in this dataset that needs to be filtered out (there may in fact be more). “Residential Outbuildings” are listed separately but repeat the same values as the main residential structure they are attached to. Leaving them in would make the analysis less accurate.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filtered_ghent_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">ghent_df</span><span class="p">,</span><span class="w"> </span><span class="n">ghent_df</span><span class="o">$</span><span class="n">`Property Use`</span><span class="o">!=</span><span class="s2">"Residential Outbuilding"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="peel-off-the-columns-we-are-interested-in-namely-2016-land-and-2016-building">Peel off the columns we are interested in (namely 2016 Land and 2016 Building)</h2>

<p>In this analysis I’m only interested in two of the columns and really only the individual sums of those two columns.  I want to use the total price of the properties in my analysis so I split out the assessed land value and the assessed building value and sum them.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">filtered_ghent_df</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">11</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">)]</span><span class="w">

</span><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="p">,</span><span class="w"> </span><span class="s1">'total'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">`2016 Land`</span><span class="o">+</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">`2016 Building`</span><span class="p">)</span><span class="w">

</span><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="p">,</span><span class="w"> </span><span class="s1">'first_digit'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">substr</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
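The `substr(total, 1, 1)` call above works because R coerces the numeric total to a character string before slicing it. The same first-digit extraction can be sketched in Python (the totals below are hypothetical, purely for illustration):

```python
# Extract the leading digit of an assessed total via string conversion,
# mirroring the substr(total, 1, 1) idiom used above.
def first_digit(value):
    # abs() guards against a leading minus sign; whole-dollar
    # assessments avoid any scientific-notation edge cases
    return str(abs(value))[0]

totals = [249500, 187300, 1020000]  # hypothetical assessed totals
print([first_digit(t) for t in totals])  # ['2', '1', '1']
```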
<h2 id="peel-off-the-first_digit-column-so-that-we-can-see-how-it-conforms-to-benfords-law">Peel off the first_digit column so that we can see how it conforms to Benford’s law</h2>

<p>And now to simply count all the 1s, 2s, 3s, and so on using the table function in R.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">benford_counts_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">first_digit</span><span class="p">)</span><span class="w">

</span><span class="n">benford_counts_table_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="p">))</span><span class="w">
</span><span class="n">benford_counts_table_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_counts_table_firsts</span><span class="p">,</span><span class="w"> </span><span class="n">percentage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">benford_counts_table_firsts</span><span class="o">$</span><span class="n">Freq</span><span class="o">/</span><span class="m">1737</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="firsts-table"><em>Firsts</em> table</h2>

<p>We can already see that there is something interesting happening with 3s and 4s, and there don’t seem to be enough 1s.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">benford_counts_table_firsts</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   benford_counts_firsts Freq percentage
## 1                     1  373 0.21473805
## 2                     2  293 0.16868164
## 3                     3  431 0.24812896
## 4                     4  270 0.15544041
## 5                     5  157 0.09038572
## 6                     6   82 0.04720783
## 7                     7   50 0.02878526
## 8                     8   51 0.02936097
## 9                     9   30 0.01727116
</code></pre></div></div>
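The hunch about the 1s, 3s, and 4s can be checked against Benford’s expected first-digit proportions, P(d) = log10(1 + 1/d). A minimal sketch in Python (the counts are copied from the table above; the arithmetic is language-agnostic):

```python
import math

# Observed first-digit counts from the table above, digits 1..9
observed = [373, 293, 431, 270, 157, 82, 50, 51, 30]
total = sum(observed)  # 1737 records

for digit, count in enumerate(observed, start=1):
    expected = math.log10(1 + 1 / digit)  # Benford's law: P(d) = log10(1 + 1/d)
    actual = count / total
    print(f"{digit}: expected {expected:.3f}, observed {actual:.3f}")
```

Benford predicts roughly 30.1% of values should start with 1; here only about 21.5% do, while 3s (24.8% vs. an expected 12.5%) and 4s (15.5% vs. 9.7%) are well over.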
<h2 id="histogram-of-the-resulting-counts-for-firsts">Histogram of the resulting counts for firsts</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="o">$</span><span class="n">`benford_prep_df$first_digit`</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">stat_count</span><span class="p">(</span><span class="n">binwidth</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"First Digit Counts"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Counts"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Benford's Law Analysis of Ghent Housing Data"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/benfords_law_hist.png" alt="First Digits Frequency Distribution" /><!-- --></p>

<h2 id="conclusion">Conclusion</h2>

<p>This data set does not comply with Benford’s law: more total assessed values begin with the number 3 than anything else, and there are more 4s than there should be as well.  This is not to say that there is fraud happening here, but there is something interesting that would require further investigation beyond the scope of this post.  Most likely, though, this is because Ghent is a mildly affluent neighborhood with a lot of expensive homes for upper-middle-class folks.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Assignment 6 for Data Science at Scale - Coursera</title><link href="https://realtimdunbar.github.io/Assignment-Post/" rel="alternate" type="text/html" title="Assignment 6 for Data Science at Scale - Coursera" /><published>2017-06-24T00:00:00+00:00</published><updated>2017-06-24T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Assignment-Post</id><content type="html" xml:base="https://realtimdunbar.github.io/Assignment-Post/"><![CDATA[<hr />

<h2 id="incidents-of-larcenytheft-are-more-frequent-on-saturdays-and-in-the-north-east-quadrant-of-san-fransisco">Incidents of Larceny/Theft are more frequent on Saturdays and in the North East Quadrant of San Francisco.</h2>

<p>This is to complete an assignment for my Data Science at Scale course.  Because this is a rather simple post for a grade and I’m short on time, there isn’t a lot of analysis here.  However, there are visuals in this post (finally), including a cool map chart that I’ve been meaning to try out.</p>

<p>First I need to specify my packages and read the data into a data frame.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 
## Attaching package: 'dplyr'
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## The following objects are masked from 'package:stats':
## 
##     filter, lag
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">

</span><span class="n">data</span><span class="o">&lt;-</span><span class="n">readr</span><span class="o">::</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'~/datasci_course_materials/assignment6/sanfrancisco_incidents_summer_2014.csv'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
##   IncidntNum = col_integer(),
##   Category = col_character(),
##   Descript = col_character(),
##   DayOfWeek = col_character(),
##   Date = col_character(),
##   Time = col_time(format = ""),
##   PdDistrict = col_character(),
##   Resolution = col_character(),
##   Address = col_character(),
##   X = col_double(),
##   Y = col_double(),
##   Location = col_character(),
##   PdId = col_double()
## )
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 28,993 x 13
##    IncidntNum                    Category
##         &lt;int&gt;                       &lt;chr&gt;
##  1  140734311                       ARSON
##  2  140736317                NON-CRIMINAL
##  3  146177923               LARCENY/THEFT
##  4  146177531               LARCENY/THEFT
##  5  140734220                NON-CRIMINAL
##  6  140734349               DRUG/NARCOTIC
##  7  140734349               DRUG/NARCOTIC
##  8  140734349 DRIVING UNDER THE INFLUENCE
##  9  140738147              OTHER OFFENSES
## 10  140734258                    TRESPASS
## # ... with 28,983 more rows, and 11 more variables: Descript &lt;chr&gt;,
## #   DayOfWeek &lt;chr&gt;, Date &lt;chr&gt;, Time &lt;time&gt;, PdDistrict &lt;chr&gt;,
## #   Resolution &lt;chr&gt;, Address &lt;chr&gt;, X &lt;dbl&gt;, Y &lt;dbl&gt;, Location &lt;chr&gt;,
## #   PdId &lt;dbl&gt;
</code></pre></div></div>

<p>This is a sample of the data in its raw form.  Let’s find out which crime has the highest number of incidents in this data set.</p>

<p>I take the whole Category column, separate it out, and turn it into a table. The R table function is a handy little piece of code that gives you the frequency of each item in your input as a second column.  The whole thing can then be turned into a data frame, as I did here.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">table</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">Category</span><span class="p">)</span><span class="w">
</span><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">)</span><span class="w">
</span><span class="n">dataCrime</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##                           Var1 Freq
## 1                        ARSON   63
## 2                      ASSAULT 2882
## 3                      BRIBERY    1
## 4                     BURGLARY    6
## 5           DISORDERLY CONDUCT   31
## 6  DRIVING UNDER THE INFLUENCE  100
## 7                DRUG/NARCOTIC 1345
## 8                  DRUNKENNESS  147
## 9                 EMBEZZLEMENT   10
## 10                   EXTORTION    7
## 11             FAMILY OFFENSES   10
## 12      FORGERY/COUNTERFEITING   18
## 13                       FRAUD  242
## 14                    GAMBLING    1
## 15                  KIDNAPPING  117
## 16               LARCENY/THEFT 9466
## 17                 LIQUOR LAWS   42
## 18                   LOITERING    3
## 19              MISSING PERSON 1266
## 20                NON-CRIMINAL 3023
## 21              OTHER OFFENSES 3567
## 22     PORNOGRAPHY/OBSCENE MAT    1
## 23                PROSTITUTION  112
## 24                     ROBBERY  308
## 25                     RUNAWAY   61
## 26             SECONDARY CODES  442
## 27             STOLEN PROPERTY    8
## 28                     SUICIDE   14
## 29              SUSPICIOUS OCC 1300
## 30                    TRESPASS  281
## 31                   VANDALISM   17
## 32               VEHICLE THEFT 1966
## 33                    WARRANTS 1782
## 34                 WEAPON LAWS  354
</code></pre></div></div>
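The table-then-order pattern used here (and again below for days of the week) has a direct analogue in Python’s <code>collections.Counter</code>, whose <code>most_common</code> does the sorting for you. A minimal sketch, using a few of the category names from the table above with made-up row order:

```python
from collections import Counter

# A few incident categories as they might appear row-by-row in the raw data
incidents = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "NON-CRIMINAL",
             "LARCENY/THEFT", "ASSAULT", "DRUG/NARCOTIC"]

counts = Counter(incidents)   # like table(data$Category)
top = counts.most_common(2)   # like ordering by -Freq and taking head(n)
print(top)  # [('LARCENY/THEFT', 3), ('ASSAULT', 2)]
```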
<p>Then apply a geom_col treatment so that we can visualize the data.  The categorical data (the types of crime) goes on the x axis and the quantitative data (the number of times each crime occurs in the Category column) goes on the y axis. Easy!  One thing to note: I did have to limit myself to the top ten crime categories, or the chart started to look terrible.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">head</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">[</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="o">-</span><span class="n">dataCrime</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">dataCrime</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Var1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Freq</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Type of Crime'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Count of Each Crime'</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency of Each Crime'</span><span class="p">,</span><span class="w">
       </span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2014 San Fransisco Crime Data"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="o">=</span><span class="s2">"none"</span><span class="p">,</span><span class="w"> </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">90</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/crimeFrequency.png" alt="Image description" /><!-- -->
In my opinion this is pretty predictable; it seems there are more incidents of Larceny/Theft than any other crime.</p>

<p>Let’s find out on which day of the week one is most likely to be stolen from.</p>

<p>I need to filter the Category variable for ‘LARCENY/THEFT’, take only the day-of-week column, and turn it into a table just like last time.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">Category</span><span class="o">==</span><span class="s1">'LARCENY/THEFT'</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">table</span><span class="p">(</span><span class="n">dataLT</span><span class="o">$</span><span class="n">DayOfWeek</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">dataLT</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">dataLT</span><span class="p">[</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="o">-</span><span class="n">dataLT</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">dataLT</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<p>Then I will turn that into a simple column chart with the nominal data (days of the week) on the x axis and the quantitative data (frequency of each day of the week) on the y axis.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">dataLT</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Var1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Freq</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="s2">"red"</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Day of Week'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency of Larceny/Theft'</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Larceny/Theft on Day of Week'</span><span class="p">,</span><span class="w">
       </span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2014 San Fransisco Crime Data"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="o">=</span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/theftDays.png" alt="Image description" /><!-- -->
It appears that one has a slightly higher chance of being stolen from on a Saturday in San Francisco (in 2014) than on any other day, though Sunday comes close, followed by Friday.  This makes sense in that the weekend seems to be when more of these offenses occur.  It is also interesting to note that Larceny/Theft occurrences Monday through Thursday remain pretty steady.</p>

<p>Let’s see if we can find out the places in San Francisco to avoid on Saturdays.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataLTMap</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">Category</span><span class="o">==</span><span class="s1">'LARCENY/THEFT'</span><span class="p">)</span><span class="w">
</span><span class="n">dataLTMap</span><span class="o">&lt;-</span><span class="n">dataLTMap</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">11</span><span class="p">)]</span><span class="w">

</span><span class="n">map</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_map</span><span class="p">(</span><span class="n">location</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'San Fransisco'</span><span class="p">,</span><span class="w"> </span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=San+Fransisco&amp;zoom=12&amp;size=640x640&amp;scale=2&amp;maptype=terrain&amp;language=en-EN&amp;sensor=false
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=San%20Fransisco&amp;sensor=false
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mapPoints</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">map</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="o">$</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="o">$</span><span class="n">Y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">21</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">guides</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="n">mapPoints</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/mapPoints.png" alt="Image description" /><!-- -->
It seems that most of the reported thefts (in 2014) occurred in that northeast quadrant.  Too bad my data set doesn’t tell me what was stolen; it would be interesting to see how many of those were bike thefts (I dig bikes).</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Naive Bayes Classifier Refactor</title><link href="https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor/" rel="alternate" type="text/html" title="Naive Bayes Classifier Refactor" /><published>2017-06-22T00:00:00+00:00</published><updated>2017-06-22T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor</id><content type="html" xml:base="https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor/"><![CDATA[<hr />

<h2 id="naive-bayes-classifier-refactor">Naive Bayes Classifier <em>Refactor</em></h2>

<p>As the title suggests, this post will be a refactoring of the code from the previous post.  I’m doing this partly because I recently watched all the videos from Robert Martin’s (Uncle Bob’s) Clean Code series, but also because I think refactoring code is a good way to learn about it.</p>

<p>I might try to make refactoring code a regular part of this blog.</p>

<h4 id="first-a-recap">First a Recap</h4>
<p>Here is the textCleaner function</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">textCleaner</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">scan</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes the author of the quote because I am only interested in male or female</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"--\\s.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes punctuation</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#splits on spaces</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">strsplit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"[[:space:]]+"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#formats as data frame</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
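The cleaning steps in textCleaner (strip the “-- Author” attribution, drop punctuation, split into words) translate almost line for line into other languages. A minimal sketch of the same pipeline in Python, with a hypothetical quote as input (the regexes mirror the gsub calls above, not a character-for-character port):

```python
import re

def text_cleaner(lines):
    """Mirror of the R textCleaner: strip the '-- Author' attribution,
    drop punctuation, and split each quote into individual words."""
    words = []
    for line in lines:
        line = re.sub(r"--\s.*", "", line)   # drop the quote attribution
        line = re.sub(r"[^\w\s]", "", line)  # drop punctuation
        words.extend(line.split())           # split on whitespace
    return words

quotes = ["Brevity is the soul of wit. -- Shakespeare"]  # hypothetical input
print(text_cleaner(quotes))  # ['Brevity', 'is', 'the', 'soul', 'of', 'wit']
```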
<p>And here is the Classifier code</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bayesClassifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">document</span><span class="p">,</span><span class="w"> </span><span class="n">menPrior</span><span class="p">,</span><span class="w"> </span><span class="n">womenPrior</span><span class="p">){</span><span class="w">
  </span><span class="c1">#gets counts of words in each class</span><span class="w">
  </span><span class="n">mCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">menClass</span><span class="p">)</span><span class="w">
  </span><span class="n">wCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#combines the menClass and womenClass dataframes into a vocabulary dataframe</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses like words in vocabAll and finds the count of all unique words in the vocabulary</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">))</span><span class="w">
  </span><span class="n">vocabCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses menClass and womenClass data frames and finds the frequency of each word</span><span class="w">
  </span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">menClass</span><span class="p">))</span><span class="w">
  </span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">womenClass</span><span class="p">))</span><span class="w">
  </span><span class="c1">#finds intersection of document data frame and the menClass and womenClass dataframes</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">&lt;-</span><span class="n">menClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">)),]</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">&lt;-</span><span class="n">womenClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">)),]</span><span class="w">
  </span><span class="c1">#conditional probabilities of each intersecting word, this would be the place to add smoothing if desired in place of the 0s</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">wCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="c1">#takes the product of each frequency column and multiplies by the corresponding prior</span><span class="w">
  </span><span class="n">posteriorM</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">menPrior</span><span class="w">
  </span><span class="n">posteriorW</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">womenPrior</span><span class="w">
  </span><span class="c1">#test for higher posterior</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">posteriorW</span><span class="o">&gt;</span><span class="n">posteriorM</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>I will tackle the textCleaner function first.  My goal will be to make the code read like “well-written prose,” to quote Uncle Bob.  What this means is that all the comments I have in the code are only necessary because I did a terrible job writing the code in the first place.</p>

<p>First, I must write a test that the current code passes so that I know I didn’t break anything while refactoring.  For that we are going to need the <em>testthat</em> library.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#install.packages('testthat')</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">testthat</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We also need a data frame made with the original function to test the new function against.  I’ve assigned it to a variable for simplicity’s sake.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleaned_test_file</span><span class="o">&lt;-</span><span class="n">textCleaner</span><span class="p">(</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h4 id="textcleaner-unit-test">textCleaner unit test</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_that</span><span class="p">(</span><span class="s1">'textCleaner cleans'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">test_file</span><span class="o">&lt;-</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">textCleaner</span><span class="p">(</span><span class="n">test_file</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="n">cleaned_test_file</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>I ran the unit test against the original function to prove the unit test itself works.  The lack of an error means that I am ready to refactor.</p>

<h4 id="refactored-textcleaner-function">Refactored textCleaner function</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Here, I've broken out each of the separate operations of the original code into its own function.</span><span class="w">
</span><span class="n">remove_author</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_author_pattern</span><span class="o">&lt;-</span><span class="s2">"--\\s.*"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">gsub</span><span class="p">(</span><span class="n">regex_author_pattern</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">remove_punctuation</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_punctuation_pattern</span><span class="o">&lt;-</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">gsub</span><span class="p">(</span><span class="n">regex_punctuation_pattern</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">split_file</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_split_pattern</span><span class="o">&lt;-</span><span class="s2">"[[:space:]]+"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">strsplit</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">regex_split_pattern</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">clean</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">remove_author</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">remove_punctuation</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">split_file</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># An argument could be made that I didn't have to break every cleaning step out into its own function, but I decided to go all out</span><span class="w">
</span><span class="n">clean_text_file_and_return_data_frame</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  
  </span><span class="n">file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">scan</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">clean</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="c1"># annoyingly to get the test to pass I had to rename cleaned_file to x</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">cleaned_file</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
    
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Now to use the test I wrote (and proved) earlier on the newly written function.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_that</span><span class="p">(</span><span class="s1">'textCleaner cleans'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">test_file</span><span class="o">&lt;-</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">test_file</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="n">cleaned_test_file</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>Again, the lack of an error means that everything works.  Let’s review:</p>

<ul>
  <li>I used the original code to get a data frame into a variable</li>
  <li>I wrote a unit test against the original code</li>
  <li>I tested my unit test against the data frame variable produced by the original code</li>
  <li>Finally, I wrote and tested the new code</li>
</ul>

<p>The circle is now complete.</p>

<p>As noted in the comment in the <em>clean_text_file_and_return_data_frame</em> function above, to get the test to pass I had to rename my cleaned_file variable to x before I called unlist and converted the result to a data frame.</p>
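<p>A tiny standalone R sketch (separate from the classifier code) shows what forced that rename: as.data.frame names the resulting column after the exact expression it receives, and downstream code looks the column up by the literal name unlist(x).</p>

```r
# Standalone demo: the column of the resulting data frame is named
# after the deparsed expression, so the variable must be called x.
x <- list(c("hello", "world"))
df <- as.data.frame(unlist(x))
colnames(df)  # "unlist(x)" - rename the variable and this name changes too
```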

<p>I have remedied that situation below.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="c1">#clean_text_file_and_return_data_frame&lt;-function(file){</span><span class="w">
  
  </span><span class="c1">#file&lt;-base::scan(file, what="", sep="\n")</span><span class="w">
  </span><span class="c1">#cleaned_file&lt;-clean(file)</span><span class="w">
    
  </span><span class="c1">#return(cleaned_file)</span><span class="w">
</span><span class="c1">#}</span><span class="w">
</span></code></pre></div></div>
<p>This code is much more readable and follows the single responsibility principle.  Now we need a whole new set of unit tests.</p>

<p>For the bayesClassifier function I am going to make lots of changes.  Not only am I going to refactor the code so that it abides by the single responsibility principle, but I am also going to combine all of these functions into one call.  This means that the new bayes_classifier function will call clean_text_file_and_return_data_frame itself.  All the user will have to do is provide the text files for the male and female quotes (training data), the test quote, and the priors.  Let’s get started.</p>

<p>Just as before, we first need a couple of unit tests that work on the current code so that we can test the new code against it.  I’ve created two unit-test text files to use as training data: one has a single female quote, the other a single male quote.  I will then use those same quotes as the test quotes so that we can confirm the classifier returns Male and Female when we expect it to.</p>

<h4 id="bayesclassifier-unit-test">bayesClassifier unit test</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First our input data frames, using our new clean_text_file_and_return_data_frame function</span><span class="w">
</span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/men_unit_test.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/women_unit_test.txt"</span><span class="p">)</span><span class="w">

</span><span class="c1">#then I'm going to make the classifier output the string "Male"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">womenQuote</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/women_unit_test_quote.txt"</span><span class="p">)</span><span class="w">  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenQuote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="c1">#second the string "Female"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">menQuote</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/men_unit_test_quote.txt"</span><span class="p">)</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">menQuote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>And I have passing unit tests.  A sharp observer will notice that I am using the womenQuote string to output “Male” and vice versa.  This is a consequence of how a Naive Bayes Classifier works: it needs large training datasets to be accurate, so that the words shared between each training set and the test quote occur with high frequency.  Since that is not the case here, I get the reverse of the output one would expect.  The accuracy of my Naive Bayes Classifier is beyond the scope of this blog post.  Time to refactor.</p>
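<p>Since the code comments call out where smoothing would go, here is a hypothetical standalone sketch (the counts are invented, and this is not the classifier’s code) of why an add-k smoothing constant matters: without it, a single unseen word contributes a probability of zero and wipes out the entire product of probabilities.</p>

```r
# Invented counts: how often each word of a test quote appears in one class.
word_freqs <- c(the = 3, cat = 1, nebula = 0)  # "nebula" never seen in training
class_count <- 10  # total words in this class's training data
vocab_count <- 8   # unique words across both classes
k <- 1             # add-k smoothing constant

# Without smoothing, the unseen word zeroes the whole product.
unsmoothed <- word_freqs / (class_count + vocab_count)
prod(unsmoothed)  # 0

# With smoothing, every word keeps a small nonzero probability.
smoothed <- (word_freqs + k) / (class_count + k * vocab_count)
prod(smoothed)    # small, but nonzero
```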

<h4 id="refactored-bayes_classifier-function">Refactored bayes_classifier function</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_count</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">df</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">combine_dataframes_and_make_table</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">){</span><span class="w">
  </span><span class="n">all</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">all</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">collapse_to_table</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">df</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">find_intersections</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">df1</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">df1</span><span class="o">$</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">df2</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">df1</span><span class="o">$</span><span class="n">df</span><span class="p">)),])</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_conditional_probabilities</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">intersections</span><span class="p">,</span><span class="w"> </span><span class="n">count1</span><span class="p">,</span><span class="w"> </span><span class="n">count2</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">((</span><span class="n">intersections</span><span class="o">$</span><span class="n">Freq</span><span class="o">+</span><span class="n">smoothing</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">count1</span><span class="o">+</span><span class="n">count2</span><span class="o">+</span><span class="n">smoothing</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_posterior</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">intersects</span><span class="p">,</span><span class="w"> </span><span class="n">prior</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersects</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">prior</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bayes_classifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">quote_test</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="o">=</span><span class="m">0</span><span class="p">){</span><span class="w">
  
  </span><span class="n">men_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">men_train</span><span class="p">)</span><span class="w">
  </span><span class="n">women_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">women_train</span><span class="p">)</span><span class="w">
  </span><span class="n">quote_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">quote_test</span><span class="p">)</span><span class="w">
  
  </span><span class="n">men_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">men_class</span><span class="p">)</span><span class="w">
  </span><span class="n">women_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">women_class</span><span class="p">)</span><span class="w">
  
  </span><span class="n">all_words</span><span class="o">&lt;-</span><span class="n">combine_dataframes_and_make_table</span><span class="p">(</span><span class="n">men_class</span><span class="p">,</span><span class="w"> </span><span class="n">women_class</span><span class="p">)</span><span class="w">
  </span><span class="n">all_words_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span><span class="w">
  
  </span><span class="n">men_class</span><span class="o">&lt;-</span><span class="n">collapse_to_table</span><span class="p">(</span><span class="n">men_class</span><span class="p">)</span><span class="w">
  </span><span class="n">women_class</span><span class="o">&lt;-</span><span class="n">collapse_to_table</span><span class="p">(</span><span class="n">women_class</span><span class="p">)</span><span class="w">
  
  </span><span class="n">intersects_men</span><span class="o">&lt;-</span><span class="n">find_intersections</span><span class="p">(</span><span class="n">men_class</span><span class="p">,</span><span class="w"> </span><span class="n">quote_class</span><span class="p">)</span><span class="w">
  </span><span class="n">intersects_women</span><span class="o">&lt;-</span><span class="n">find_intersections</span><span class="p">(</span><span class="n">women_class</span><span class="p">,</span><span class="w"> </span><span class="n">quote_class</span><span class="p">)</span><span class="w">

  </span><span class="n">intersects_men</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="n">get_conditional_probabilities</span><span class="p">(</span><span class="n">intersects_men</span><span class="p">,</span><span class="w"> </span><span class="n">men_count</span><span class="p">,</span><span class="w"> </span><span class="n">all_words_count</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">)</span><span class="w">
  </span><span class="n">intersects_women</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="n">get_conditional_probabilities</span><span class="p">(</span><span class="n">intersects_women</span><span class="p">,</span><span class="w"> </span><span class="n">women_count</span><span class="p">,</span><span class="w"> </span><span class="n">all_words_count</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">)</span><span class="w">

  </span><span class="n">posterior_men</span><span class="o">&lt;-</span><span class="n">get_posterior</span><span class="p">(</span><span class="n">intersects_men</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">)</span><span class="w">
  </span><span class="n">posterior_women</span><span class="o">&lt;-</span><span class="n">get_posterior</span><span class="p">(</span><span class="n">intersects_women</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">)</span><span class="w">
  
  </span><span class="k">if</span><span class="p">(</span><span class="n">posterior_women</span><span class="o">&gt;</span><span class="n">posterior_men</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h4 id="now-retest-with-the-modified-unit-tests">Now retest with the modified unit tests</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First our input data frames, using our new clean_text_file_and_return_data_frame function</span><span class="w">
</span><span class="n">men_train</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/men_unit_test.txt"</span><span class="w">
</span><span class="n">women_train</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/women_unit_test.txt"</span><span class="w">

</span><span class="c1">#then I'm going to make the classifier output the string "Male"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">women_quote</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/women_unit_test_quote.txt"</span><span class="w">  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayes_classifier</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_quote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="c1">#second the string "Female"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">men_quote</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/men_unit_test_quote.txt"</span><span class="w">
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayes_classifier</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">men_quote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>I have passing unit tests.  Note that I did change the unit tests a bit to account for the new functionality of taking in raw text files.  This is better because now all I have to do is call the bayes_classifier function.</p>

<p>I wasn’t able to make this code much shorter, but it is much more readable with descriptive function names.  I will probably go back and rework the bayes_classifier function to see what else I can do with it at a later date.</p>

<p>I could also now write a bunch more unit tests for all of the new functions I made but this post is already getting way too long.  Next time I promise I will have visuals, perhaps something to do with Benford’s Law.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Naive Bayes Classifier for Quotes (using R Notebook)</title><link href="https://realtimdunbar.github.io/Naive-Bayes-Classifier/" rel="alternate" type="text/html" title="Naive Bayes Classifier for Quotes (using R Notebook)" /><published>2017-06-06T00:00:00+00:00</published><updated>2017-06-06T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Naive-Bayes-Classifier</id><content type="html" xml:base="https://realtimdunbar.github.io/Naive-Bayes-Classifier/"><![CDATA[<hr />

<h3 id="naive-bayes-classifier">Naive Bayes Classifier</h3>

<p>This will be my first blog post; it is primarily for testing purposes. My workflow basically consists of RStudio, GitHub, and Jekyll, which is a Ruby gem. I will probably write another blog post detailing my process once I figure out what it is.</p>

<p>As the title suggests, this post will be about a Naive Bayes Classifier (NBC) I wrote after attending a meetup on NBCs written in Python. This classifier is trained with male and female quotations but would work equally well classifying other categorical data (note: I am not suggesting that my NBC is accurate).</p>

<p>This post will primarily consist of the mechanics behind my NBC and the resources I used to put it all together. I will write future blog posts regarding accuracy and eventual improvements.</p>
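<p>Before the mechanics, here is the decision rule the classifier implements, sketched with made-up numbers (the word probabilities below are purely hypothetical, not taken from my quote data): the posterior for a class is proportional to its prior times the product of the per-word conditional probabilities, and we pick the class with the larger posterior.</p>

```r
# Naive Bayes decision rule, with hypothetical numbers:
#   posterior(class) is proportional to prior(class) * prod_i P(word_i | class)
p_word_given_class <- c(0.02, 0.01, 0.005)  # P(word_i | class), made up
prior <- 0.5                                # equal priors, as in this post
posterior <- prior * prod(p_word_given_class)
posterior  # 5e-07
```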

<h4 id="first-we-need-a-function-to-build-the-data-frames-we-will-use-as-our-training-data-inputs">First we need a function to build the data frames we will use as our training data inputs:</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">textCleaner</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">scan</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes the author of the quote because I am only interested in male or female</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"--\\s.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes punctuation</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#splits on spaces</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">strsplit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"[[:space:]]+"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#formats as data frame</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h4 id="here-we-are-using-some-text-files-that-i-acquired-from-the-web-and-the-textcleaner-function-we-wrote-earlier-im-also-going-to-define-some-other-variable-we-will-need-later">Here we are using some text files that I acquired from the web and the textCleaner function we wrote earlier. I’m also going to define some other variables we will need later.</h4>

<p>We are using the following quote from Eleanor Roosevelt: <em>“A woman is like a tea bag, you can’t tell how strong she is until you put her in hot water.”</em></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#These are our corpora, made from male and female quotes</span><span class="w">
</span><span class="n">men_quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/men.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">women_quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/women.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/quote.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">men_prior</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">women_prior</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></code></pre></div></div>

<h4 id="we-obviously-need-a-function-that-does-the-classification-the-actuall-nbc-i-will-go-through-the-code-line-by-line-and-explain-whats-going-on-a-bit-later-but-for-now-we-will-just-write-it">We obviously need a function that does the classification, the actual NBC. I will go through the code line by line and explain what’s going on a bit later, but for now we will just write it.</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bayesClassifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">document</span><span class="p">,</span><span class="w"> </span><span class="n">menPrior</span><span class="p">,</span><span class="w"> </span><span class="n">womenPrior</span><span class="p">){</span><span class="w">
  </span><span class="c1">#gets counts of words in each class</span><span class="w">
  </span><span class="n">mCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">menClass</span><span class="p">)</span><span class="w">
  </span><span class="n">wCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#combines the menClass and womenClass dataframes into a vocabulary dataframe</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses like words in vocabAll and finds the count of all unique words in the vocabulary</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">))</span><span class="w">
  </span><span class="n">vocabCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses menClass and womenClass data frames and finds the frequency of each word</span><span class="w">
  </span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">menClass</span><span class="p">))</span><span class="w">
  </span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">womenClass</span><span class="p">))</span><span class="w">
  </span><span class="c1">#finds intersection of document data frame and the menClass and womenClass dataframes</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">&lt;-</span><span class="n">menClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">)),]</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">&lt;-</span><span class="n">womenClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">)),]</span><span class="w">
  </span><span class="c1">#conditional probabilities of each intersecting word; this is where smoothing could be added in place of the 0s</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">wCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="c1">#takes the product of the frequency column and multiplies by the prior</span><span class="w">
  </span><span class="n">posteriorM</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">menPrior</span><span class="w">
  </span><span class="n">posteriorW</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">womenPrior</span><span class="w">
  </span><span class="c1">#test for higher posterior</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">posteriorW</span><span class="o">&gt;</span><span class="n">posteriorM</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
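<p>The comment about smoothing in the code above can be made concrete. Here is a minimal sketch of Laplace (add-one) smoothing, which would replace the 0s in the probability calculation; the helper name and the counts are hypothetical, not part of the classifier:</p>

```r
# Laplace (add-one) smoothing: add a constant k to each word count so that
# words never seen in a class still get a small nonzero probability.
laplace_prob <- function(freq, class_count, vocab_count, k = 1) {
  (freq + k) / (class_count + k * vocab_count)
}

laplace_prob(0, 100, 250)  # unseen word: 1/350, not 0
laplace_prob(5, 100, 250)  # seen word:   6/350
```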

<h4 id="finally-we-call-our-nbc-function-and-pass-in-the-variables-we-made-earlier">Finally we call our NBC function and pass in the variables we made earlier</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">men_quote</span><span class="p">,</span><span class="w"> </span><span class="n">women_quote</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">)</span><span class="w">

</span><span class="n">answer</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Male"
</code></pre></div></div>
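<p>One caveat about the posterior product itself, not handled in the classifier above: multiplying many small conditional probabilities can underflow to zero in double precision, which makes the final comparison meaningless on longer documents. A common guard is to compare log posteriors instead; a sketch with hypothetical per-word probabilities:</p>

```r
# Hypothetical per-word conditionals for two classes over a long document.
probs_m <- rep(1e-4, 300)
probs_w <- rep(2e-4, 300)

prod(probs_m)  # underflows to 0 in double precision

# Summing logs keeps the comparison well-defined.
log_post_m <- log(0.5) + sum(log(probs_m))
log_post_w <- log(0.5) + sum(log(probs_w))
if (log_post_w > log_post_m) "Female" else "Male"
```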

<p>This is clearly wrong, but keep in mind that I am using very small data sets.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry></feed>