<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://realtimdunbar.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://realtimdunbar.github.io/" rel="alternate" type="text/html" /><updated>2026-03-25T12:57:37+00:00</updated><id>https://realtimdunbar.github.io/feed.xml</id><title type="html">Tim Dunbar</title><subtitle>Explorer</subtitle><entry><title type="html">Stepping Back: A Literature Review of LLMs for Automated Theorem Proving</title><link href="https://realtimdunbar.github.io/stepping-back-a-formal-literature-review/" rel="alternate" type="text/html" title="Stepping Back: A Literature Review of LLMs for Automated Theorem Proving" /><published>2026-03-25T12:00:00+00:00</published><updated>2026-03-25T12:00:00+00:00</updated><id>https://realtimdunbar.github.io/stepping-back-a-formal-literature-review</id><content type="html" xml:base="https://realtimdunbar.github.io/stepping-back-a-formal-literature-review/"><![CDATA[<h2 id="why-im-doing-this-now">Why I’m Doing This Now</h2>

<p>Four posts into the Symbolic series, I’ve built a pipeline, scraped GitHub, validated specs through SANY and TLC, and arrived at a humbling number: <strong>zero</strong> TLA+ specifications in my dataset survive end-to-end model checking. Eight unique specs passed syntax validation. That’s it.</p>

<p>The honest thing to do here is not to grind harder on the same approach. It’s to stop and ask: <strong>what has everyone else already figured out?</strong></p>

<p>I should have done a literature review before writing a single line of code. That’s not hindsight — that’s methodology. In graduate school, a literature review is the first chapter of your thesis for a reason. It prevents you from reinventing wheels, reveals approaches you’d never consider, and — crucially — tells you where the open problems actually are.</p>

<p>So this post isn’t about Symbolic’s architecture or my dataset pipeline. It’s about the process of conducting a formal literature review at the intersection of large language models and automated theorem proving — what tools I’m using, how I’m organizing the search, and how I plan to synthesize what I find.</p>

<p>The actual findings will come in the next post. This one is about the method.</p>

<h2 id="what-is-a-formal-literature-review">What Is a Formal Literature Review?</h2>

<p>A literature review is not a Google search. It’s not skimming abstracts and citing whatever supports your argument. A formal literature review is a <strong>systematic, reproducible process</strong> for surveying existing research in a problem space.</p>

<p>The key properties:</p>

<ul>
  <li><strong>Defined scope.</strong> You state exactly what you’re looking for and what you’re not.</li>
  <li><strong>Reproducible search.</strong> Someone else could follow your search strategy and find the same papers.</li>
  <li><strong>Inclusion/exclusion criteria.</strong> You decide up front what counts, before you see results.</li>
  <li><strong>Synthesis, not summary.</strong> You identify themes, contradictions, and gaps — not just restate each paper.</li>
</ul>

<p>There are different levels of rigor. A full <strong>systematic literature review</strong> (SLR) follows protocols like PRISMA, pre-registers the search strategy, and may involve multiple reviewers for bias reduction. That’s what you’d do for a journal publication.</p>

<p>What I’m doing is closer to a <strong>scoping review</strong> — a structured but less rigid survey meant to map the landscape of a research area and identify key themes and gaps. It’s the right tool when you’re asking “what has been done?” rather than “what is the effect size of X?”</p>

<h2 id="defining-the-scope">Defining the Scope</h2>

<p>The first step is to define the research questions. Not the questions I want to <em>answer</em> — the questions that guide what I <em>search for</em>.</p>

<p><strong>Primary question:</strong> What approaches have been explored for using large language models to generate, assist with, or verify formal proofs in automated theorem proving systems?</p>

<p><strong>Secondary questions:</strong></p>
<ul>
  <li>What formal languages and proof assistants are being targeted (Lean, Coq, Isabelle, TLA+, others)?</li>
  <li>What LLM architectures and training strategies have shown promise?</li>
  <li>How are training datasets constructed for low-resource formal languages?</li>
  <li>What evaluation metrics and benchmarks are used?</li>
  <li>Where are the open problems and failure modes?</li>
</ul>

<p>Notice how much broader this is than “can I fine-tune Llama to write TLA+?” That’s deliberate. Symbolic targets TLA+ specifically, but the techniques for training LLMs on Lean proofs or Coq tactics may transfer. The dataset construction challenges for Isabelle are almost certainly relevant to mine. By widening the aperture, I avoid tunnel vision.</p>

<h2 id="the-search-strategy">The Search Strategy</h2>

<h3 id="choosing-databases">Choosing Databases</h3>

<p>Academic search is fragmented. No single database covers everything. Here’s what I’m using and why:</p>

<p><strong>Semantic Scholar</strong> (<a href="https://www.semanticscholar.org">semanticscholar.org</a>)
My primary search engine. Semantic Scholar indexes over 200 million papers, provides excellent API access, and has features specifically designed for literature reviews — citation graphs, TLDR summaries, and influence scores. Its AI-powered relevance ranking tends to surface highly cited foundational papers alongside recent work, which is exactly what I need.</p>

<p><strong>arXiv</strong> (<a href="https://arxiv.org">arxiv.org</a>)
The preprint server where most ML and formal methods research lands first. Papers here are often months ahead of journal publication. I’ll search arXiv directly for the most recent work that Semantic Scholar may not have indexed yet.</p>

<p><strong>Google Scholar</strong> (<a href="https://scholar.google.com">scholar.google.com</a>)
Broader coverage than Semantic Scholar, especially for older work and conference proceedings. I use it as a secondary source and for “cited by” chains — finding newer papers that cite a foundational one.</p>

<p><strong>ACM Digital Library and IEEE Xplore</strong>
For conference papers from venues like ICML, NeurIPS, ICLR, CAV, POPL, and ITP that may not be freely available on arXiv.</p>

<h3 id="constructing-search-queries">Constructing Search Queries</h3>

<p>The query design matters enormously. Too narrow and you miss relevant work. Too broad and you drown in noise. I’m using a structured approach with Boolean operators:</p>

<p><strong>Core query:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>("large language model" OR "LLM" OR "transformer" OR "neural theorem proving")
AND
("theorem proving" OR "formal verification" OR "proof assistant" OR "proof generation")
</code></pre></div></div>

<p><strong>Variant queries for specific aspects:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Dataset construction
("training data" OR "dataset" OR "benchmark")
AND ("formal proof" OR "theorem proving" OR "proof assistant")

# Specific proof assistants
("Lean" OR "Coq" OR "Isabelle" OR "TLA+" OR "HOL")
AND ("language model" OR "machine learning" OR "neural")

# Evaluation and metrics
("evaluation" OR "benchmark" OR "accuracy")
AND ("automated theorem proving" OR "proof generation")
AND ("language model" OR "neural" OR "transformer")
</code></pre></div></div>

<p>I’ll run each query across each database, tracking what I searched, when, and how many results I got. This is the “reproducible” part — if someone wanted to verify my survey, they could rerun these queries.</p>
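<p>To keep that tracking reproducible, the queries can be scripted rather than typed into a search box. A minimal sketch, assuming the Semantic Scholar Graph API’s paper-search endpoint and its <code class="language-plaintext highlighter-rouge">query</code>/<code class="language-plaintext highlighter-rouge">fields</code>/<code class="language-plaintext highlighter-rouge">limit</code> parameters; the function names and log-file layout are mine:</p>

```python
import json
import time
from urllib.parse import urlencode

# Assumption: the Semantic Scholar Graph API paper-search endpoint.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, fields=("title", "year", "abstract"), limit=100):
    """Build the exact search URL so a query can be logged and rerun later."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return f"{SEARCH_URL}?{urlencode(params)}"

def log_search(query, result_count, log_path="search_log.jsonl"):
    """Append one line per query run: what was searched, when, how many hits."""
    entry = {"query": query, "results": result_count,
             "run_at": time.strftime("%Y-%m-%dT%H:%M:%S")}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

<p>Fetching is then a single <code class="language-plaintext highlighter-rouge">urllib.request.urlopen(build_search_url(...))</code> call, and the JSONL log doubles as the search-strategy appendix of the review.</p>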

<h3 id="snowball-sampling">Snowball Sampling</h3>

<p>Queries only get you so far. Some of the most relevant papers will be found through <strong>snowball sampling</strong>:</p>

<ul>
  <li><strong>Backward snowballing:</strong> For each key paper I find, I check its references. If a paper on LLM-based Lean proving cites a foundational paper on neural theorem proving, I add that to my review.</li>
  <li><strong>Forward snowballing:</strong> For foundational papers, I check who has cited them since. Semantic Scholar’s “cited by” feature and Google Scholar’s citation tracking are essential here.</li>
</ul>

<p>This is how you find the papers that don’t match your keywords but are deeply relevant. A 2020 paper on “neural guided proof search” might not contain the phrase “large language model” but could be foundational to the entire field.</p>
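<p>Both snowball directions amount to a breadth-first traversal of the citation graph. A sketch with the fetchers injected as callables so it works against any citation source — in practice they would wrap something like Semantic Scholar’s references and citations endpoints (an assumption; <code class="language-plaintext highlighter-rouge">snowball</code> is my name for it):</p>

```python
from collections import deque

def snowball(seed_ids, get_references, get_citations, max_depth=1):
    """Breadth-first snowball: from each seed paper, follow references
    (backward) and citations (forward) up to max_depth hops out.

    get_references / get_citations map a paper ID to a list of paper IDs.
    Returns the set of all paper IDs encountered, seeds included.
    """
    seen = set(seed_ids)
    frontier = deque((pid, 0) for pid in seed_ids)
    while frontier:
        pid, depth = frontier.popleft()
        if depth == max_depth:
            continue  # don't expand past the hop limit
        for neighbor in get_references(pid) + get_citations(pid):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

<p>Each newly discovered paper still goes through the Pass 1 filter — the traversal only decides what gets surveyed, not what gets included.</p>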

<h2 id="organizing-with-zotero">Organizing with Zotero</h2>

<p>Raw search results are useless without organization. I’m using <strong>Zotero</strong> (<a href="https://www.zotero.org">zotero.org</a>) — a free, open-source reference manager — as my central hub.</p>

<h3 id="why-zotero">Why Zotero?</h3>

<ul>
  <li><strong>Free and open source.</strong> No subscription paywalls.</li>
  <li><strong>Browser extension.</strong> One click to save a paper from Semantic Scholar, arXiv, or any journal site. It automatically extracts title, authors, date, abstract, and DOI.</li>
  <li><strong>PDF management.</strong> Zotero stores and indexes PDFs. I can annotate directly in the reader and those annotations become searchable.</li>
  <li><strong>Tagging and collections.</strong> I create nested collections that mirror my research questions and tag papers by theme.</li>
  <li><strong>Citation export.</strong> When I write the synthesis post, Zotero generates citations in any format.</li>
  <li><strong>Zotero Connector + Better BibTeX plugin.</strong> If I later want to write in LaTeX, the integration is seamless.</li>
</ul>

<h3 id="my-collection-structure">My Collection Structure</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LLM-ATP Literature Review/
├── 01 - Foundational Papers/
│   ├── Neural Theorem Proving (pre-LLM)
│   └── Transformer Architecture for Math
├── 02 - LLM Proof Generation/
│   ├── Lean
│   ├── Coq
│   ├── Isabelle
│   ├── TLA+
│   └── Other/Multi-system
├── 03 - Dataset Construction/
│   ├── Synthetic Generation
│   ├── Corpus Extraction
│   └── Benchmarks
├── 04 - Training Strategies/
│   ├── Fine-tuning
│   ├── Prompt Engineering
│   ├── Reinforcement Learning
│   └── Retrieval-Augmented
├── 05 - Evaluation &amp; Metrics/
└── 06 - Surveys &amp; Meta-analyses/
</code></pre></div></div>

<p>Each paper gets tagged with relevant themes. A single paper might live in “02 - LLM Proof Generation / Lean” but also be tagged <code class="language-plaintext highlighter-rouge">dataset-construction</code> and <code class="language-plaintext highlighter-rouge">reinforcement-learning</code> if it covers those aspects.</p>

<h3 id="annotation-strategy">Annotation Strategy</h3>

<p>When I read each paper, I annotate with a consistent structure:</p>

<ul>
  <li><strong>Yellow highlight:</strong> Key claims and findings</li>
  <li><strong>Blue highlight:</strong> Methodology details I might adopt</li>
  <li><strong>Red highlight:</strong> Limitations, failure modes, open problems</li>
  <li><strong>Green highlight:</strong> Dataset details (size, source, construction method)</li>
  <li><strong>Notes:</strong> My own thoughts on relevance to Symbolic</li>
</ul>

<p>This isn’t busywork — it’s what makes synthesis possible later. When I sit down to write about dataset construction approaches across the field, I can filter by green highlights and get every relevant data point without re-reading 40 papers.</p>

<h2 id="inclusion-and-exclusion-criteria">Inclusion and Exclusion Criteria</h2>

<p>Before I start reading, I define what’s in scope and what’s not. This prevents the review from expanding infinitely.</p>

<p><strong>Inclusion criteria:</strong></p>
<ul>
  <li>Published 2019 or later (transformer era — pre-transformer neural theorem proving is foundational context only)</li>
  <li>Addresses the use of neural language models for theorem proving, proof generation, or formal verification</li>
  <li>Targets at least one established proof assistant or formal system</li>
  <li>Available in English</li>
  <li>Peer-reviewed publication, accepted preprint, or technical report from a recognized research group</li>
</ul>

<p><strong>Exclusion criteria:</strong></p>
<ul>
  <li>Pure code generation without formal verification (e.g., Copilot-style code completion)</li>
  <li>Natural language reasoning or informal mathematical problem solving (e.g., GSM8K, MATH benchmark)</li>
  <li>Papers focused solely on symbolic AI or traditional ATP without neural components</li>
  <li>Blog posts, tutorials, or documentation (useful for context but not for the review itself)</li>
</ul>

<p>The boundary between “code generation” and “proof generation” is blurry. A paper about using LLMs to generate Dafny code with verification conditions is relevant. A paper about generating Python with unit tests is not. I’ll make judgment calls at the margin and document them.</p>

<h2 id="the-reading-process">The Reading Process</h2>

<p>I don’t read every paper cover to cover. That’s not feasible, and it’s not necessary. I use a three-pass approach adapted from S. Keshav’s “<a href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">How to Read a Paper</a>”:</p>

<p><strong>Pass 1: Survey (5 minutes per paper)</strong>
Read the title, abstract, introduction, section headings, and conclusion. Decide: is this relevant enough for Pass 2? This is where the inclusion/exclusion criteria do their work.</p>

<p><strong>Pass 2: Comprehension (30 minutes per paper)</strong>
Read the full paper, but don’t get stuck on dense proofs or implementation details. Understand the approach, the key results, and the limitations. Annotate in Zotero. Add tags.</p>

<p><strong>Pass 3: Deep read (1-2 hours per paper)</strong>
Only for the most important papers — the ones I’ll discuss in detail in the synthesis. Understand the methodology well enough to evaluate it critically. Could I reproduce this? Where does it break? How does it relate to Symbolic?</p>

<p>I expect roughly:</p>
<ul>
  <li>100-150 papers from initial search results</li>
  <li>40-60 papers after Pass 1 filtering</li>
  <li>15-25 papers given a deep read in Pass 3</li>
</ul>

<h2 id="tracking-the-process">Tracking the Process</h2>

<p>I’m maintaining a simple spreadsheet alongside Zotero to track the review process itself:</p>

<p>For each paper, I track:</p>

<ul>
  <li><strong>Title and source</strong> — where I found it (Semantic Scholar, arXiv, etc.)</li>
  <li><strong>Pass 1 date</strong> — when I surveyed it</li>
  <li><strong>Include?</strong> — Yes / No / Maybe</li>
  <li><strong>Pass 2 and Pass 3 dates</strong> — if applicable</li>
  <li><strong>Key theme</strong> — mapped to my collection categories</li>
  <li><strong>Relevance to Symbolic</strong> — High / Medium / Low</li>
</ul>

<p>This serves two purposes. First, it keeps me honest — I can see if I’m spending too long in the weeds or skipping important categories. Second, it makes the review auditable. If someone questions whether I considered a particular line of research, I can point to the spreadsheet.</p>
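<p>The tracker doesn’t need to be anything fancier than an append-only CSV. A sketch with columns mirroring the fields above; the filename and helper name are illustrative, not part of any existing tooling:</p>

```python
import csv
from pathlib import Path

# Columns mirror the tracking fields: where found, pass dates, verdicts.
COLUMNS = ["title", "source", "pass1_date", "include", "pass2_date",
           "pass3_date", "key_theme", "relevance"]

def record_paper(row, path="review_tracker.csv"):
    """Append one paper's tracking row, writing the header on first use."""
    new_file = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

<p>An append-only file is deliberate: like a lab notebook, it preserves the order in which papers were surveyed, which is part of what makes the review auditable.</p>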

<h2 id="what-i-expect-to-find">What I Expect to Find</h2>

<p>I’m going in with hypotheses, not conclusions. But based on what I’ve already encountered tangentially, I expect the landscape to include:</p>

<p><strong>Well-explored territory:</strong></p>
<ul>
  <li>LLM-based tactic prediction for Lean (LeanDojo, ReProver, and related work)</li>
  <li>GPT-4 and similar frontier models on mathematical reasoning benchmarks</li>
  <li>Autoformalization — translating informal math to formal statements</li>
</ul>

<p><strong>Less-explored territory:</strong></p>
<ul>
  <li>Fine-tuning smaller, open-source models for specific proof assistants</li>
  <li>Dataset construction for low-resource formal languages (this is my problem)</li>
  <li>TLA+ specifically (I suspect very little exists here)</li>
  <li>Reliability and failure mode analysis of LLM-generated proofs</li>
</ul>

<p><strong>Open questions I’m watching for:</strong></p>
<ul>
  <li>How much training data do you actually need for useful proof generation?</li>
  <li>Does reinforcement learning from proof checker feedback outperform supervised fine-tuning?</li>
  <li>Can techniques that work for Lean (rich type theory, large mathlib corpus) transfer to TLA+ (temporal logic, sparse data)?</li>
</ul>

<h2 id="why-this-matters-for-symbolic">Why This Matters for Symbolic</h2>

<p>I could skip all of this and go back to grinding on dataset construction. But that’s how you end up building something that already exists, or worse, building something that the research community has already shown doesn’t work.</p>

<p>The literature review will inform Symbolic in specific ways:</p>

<ol>
  <li><strong>Dataset strategy.</strong> If the field has converged on synthetic generation over web scraping, I should know that before spending another month scraping GitHub.</li>
  <li><strong>Model selection.</strong> If fine-tuning 8B parameter models is a dead end for this task and the research points to other approaches, I need to know.</li>
  <li><strong>Evaluation framework.</strong> I’m currently measuring syntax validity and semantic validity. The field may have better metrics.</li>
  <li><strong>Positioning.</strong> If nobody has done this for TLA+ specifically, that’s a contribution. If someone has, I need to know what they found.</li>
</ol>

<h2 id="what-comes-next">What Comes Next</h2>

<p>The next post will be the synthesis — what I actually found. I’ll organize it by theme rather than by paper, identify the key technical approaches, map the gaps, and explain how it reshapes my plan for Symbolic.</p>

<p>For now, the work is the unglamorous part: running queries, reading abstracts, filling out spreadsheets, and annotating PDFs. It’s not as exciting as writing code, but it’s how you build something that matters instead of something that already failed somewhere else.</p>

<p>The formal literature review starts now.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[After hitting a wall with dataset construction, I'm doing what I should have done first — a formal literature review of the intersection of large language models and automated theorem proving.]]></summary></entry><entry><title type="html">Running the Real TLA+ Toolchain: What Survives SANY and TLC</title><link href="https://realtimdunbar.github.io/running-the-real-tla-toolchain/" rel="alternate" type="text/html" title="Running the Real TLA+ Toolchain: What Survives SANY and TLC" /><published>2026-02-17T06:00:00+00:00</published><updated>2026-02-17T06:00:00+00:00</updated><id>https://realtimdunbar.github.io/running-the-real-tla-toolchain</id><content type="html" xml:base="https://realtimdunbar.github.io/running-the-real-tla-toolchain/"><![CDATA[<p>In my <a href="/validating-tlaplus-dataset/">last post</a>, I collected 449 TLA+ files from GitHub and validated them down to 79 using basic structural checks — balanced brackets, module headers and footers. I reported that 52 of those 79 passed “TLC validation.”</p>

<p>Tonight I ran the <em>actual</em> TLA+ toolchain — the SANY parser and TLC model checker — on all 79 files. The results were humbling.</p>

<blockquote>
  <p><strong>New to this series?</strong> Start with <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</a> for the full context.</p>
</blockquote>

<h2 id="the-gap-between-structural-and-semantic-validation">The Gap Between Structural and Semantic Validation</h2>

<p>My previous validation checked for things like matching <code class="language-plaintext highlighter-rouge">---- MODULE ----</code> headers and balanced parentheses. That’s like checking that a Python file has proper indentation — necessary but nowhere near sufficient.</p>

<p>SANY (the official TLA+ parser) does full semantic analysis: operator resolution, type consistency, module dependency resolution. TLC goes further and actually model-checks the specification against its properties.</p>

<p>The difference matters.</p>

<h2 id="pre-analysis-the-dependency-problem">Pre-Analysis: The Dependency Problem</h2>

<p>Before running SANY, I analyzed what each file actually needs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dependency Analysis (79 files):
  Standard-only:  12  — only uses Naturals, Integers, Sequences, etc.
  Custom deps:    55  — INSTANCE/EXTENDS modules we don't have
  No deps:        12  — fully self-contained
</code></pre></div></div>

<p><strong>55 out of 79 files depend on custom modules from their source repository that we never scraped.</strong> These are mostly <code class="language-plaintext highlighter-rouge">MC*.tla</code> files — model-checking configurations that reference a main specification. For example, <code class="language-plaintext highlighter-rouge">MC_n4_f1.tla</code> from the CometBFT repo needs <code class="language-plaintext highlighter-rouge">TendermintAccDebug_004_draft.tla</code>, which we don’t have.</p>

<p>This was predictable in hindsight. GitHub’s code search returned individual files, not complete projects. We grabbed the configuration files without their dependencies.</p>
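<p>The classification itself is a small amount of parsing. A sketch of the kind of check the pre-analysis phase performs, assuming a (partial) list of the standard modules that ship with the TLA+ tools; <code class="language-plaintext highlighter-rouge">classify_dependencies</code> is an illustrative name, not the script’s actual function:</p>

```python
import re

# Partial list of the standard modules bundled with the TLA+ tools.
STANDARD_MODULES = {"Naturals", "Integers", "Reals", "Sequences",
                    "FiniteSets", "Bags", "TLC", "Randomization"}

def classify_dependencies(spec_text):
    """Bucket a spec as 'no_deps', 'standard_only', or 'custom_deps'
    based on its EXTENDS and INSTANCE declarations."""
    deps = set()
    for m in re.finditer(r"^\s*EXTENDS\s+(.+)$", spec_text, re.MULTILINE):
        deps.update(name.strip() for name in m.group(1).split(","))
    deps.update(re.findall(r"\bINSTANCE\s+(\w+)", spec_text))
    if not deps:
        return "no_deps"
    if deps <= STANDARD_MODULES:
        return "standard_only"
    return "custom_deps"
```

<p>Anything in the <code class="language-plaintext highlighter-rouge">custom_deps</code> bucket is guaranteed to fail SANY unless the referenced modules are in the same directory — which is exactly what the 55 orphaned files ran into.</p>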

<h2 id="results-sany-validation">Results: SANY Validation</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SANY Results (79 files):
  Passed:  18  (22.8%)
  Failed:  61  (77.2%)
</code></pre></div></div>

<p>I expected losses from the dependency problem. I did not expect nearly 80% of the dataset to fall away in a single step.</p>

<h3 id="failure-categories">Failure Categories</h3>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Count</th>
      <th>What it means</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>missing_module</strong></td>
      <td>57</td>
      <td>Can’t find a module the file depends on</td>
    </tr>
    <tr>
      <td><strong>semantic_error</strong></td>
      <td>2</td>
      <td>Valid syntax but semantic issues (duplicate definitions)</td>
    </tr>
    <tr>
      <td><strong>sany_error</strong></td>
      <td>2</td>
      <td>Internal SANY errors (malformed recursive declarations)</td>
    </tr>
    <tr>
      <td><strong>tlc_error</strong></td>
      <td>1</td>
      <td>Passed SANY, failed TLC</td>
    </tr>
    <tr>
      <td><strong>success</strong></td>
      <td>17</td>
      <td>Passed SANY (no Spec operator, so TLC can’t run)</td>
    </tr>
  </tbody>
</table>

<p>The dominant failure mode is clear: <strong>72% of files fail because they reference modules we don’t have.</strong> Not because the TLA+ is wrong — because we only scraped half the project. The remaining failures — semantic errors, malformed declarations — are genuine bugs in the specs, but they’re rounding errors next to the dependency problem.</p>

<h2 id="a-filename-gotcha">A Filename Gotcha</h2>

<p>One lesson from tonight: SANY requires the <code class="language-plaintext highlighter-rouge">.tla</code> filename to exactly match the <code class="language-plaintext highlighter-rouge">MODULE</code> declaration inside the file. Our GitHub scraper renamed files with repo prefixes — <code class="language-plaintext highlighter-rouge">Aqua-218_NyxNet_Gateway.tla</code> contains <code class="language-plaintext highlighter-rouge">MODULE Gateway</code>.</p>

<p>Every single file initially failed SANY with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>File name 'Aqua-218_NyxNet_Gateway' does not match the name 'Gateway'
of the top level module it contains.
</code></pre></div></div>

<p>The fix: copy each file to a temp directory with the correct name before validation. A small thing, but it would have been easy to misinterpret 0/79 passing as “all our specs are broken” rather than “our filenames are wrong.”</p>
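<p>The workaround is mechanical: read the module name out of the header and stage a correctly named copy. A sketch of the idea (the function name is mine):</p>

```python
import re
import shutil
import tempfile
from pathlib import Path

def stage_for_sany(tla_path):
    """Copy a .tla file into a temp directory under the name SANY expects:
    the name declared in its ---- MODULE Name ---- header."""
    text = Path(tla_path).read_text()
    m = re.search(r"-+\s*MODULE\s+(\w+)\s*-+", text)
    if m is None:
        raise ValueError(f"no MODULE header in {tla_path}")
    workdir = Path(tempfile.mkdtemp(prefix="sany_"))
    staged = workdir / f"{m.group(1)}.tla"
    shutil.copy(tla_path, staged)
    return staged
```

<p>Running SANY against the staged copy instead of the scraped file is what turned 0/79 into a meaningful measurement.</p>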

<h2 id="tlc-almost-nothing-to-check">TLC: Almost Nothing to Check</h2>

<p>Only <strong>1 file</strong> in the entire dataset defines a <code class="language-plaintext highlighter-rouge">Spec</code> operator — the entry point TLC needs for model checking. That file (<code class="language-plaintext highlighter-rouge">MySpec.tla</code>) passed SANY but failed TLC.</p>

<p>The other 17 SANY-passing files are specifications without a runnable <code class="language-plaintext highlighter-rouge">Spec</code>. They define operators, theorems, and constants, but nothing TLC can execute. SANY-pass is the best validation we can achieve for them.</p>

<h2 id="what-this-means-for-the-training-dataset">What This Means for the Training Dataset</h2>

<p>Let me be direct: this is a setback.</p>

<p>The previous post reported “52/79 passed TLC (65.8%).” That number came from the basic structural validator — my own regex-based checks — not from running actual SANY and TLC. I was measuring the wrong thing. The real numbers tell a different story:</p>

<table>
  <thead>
    <tr>
      <th>Validation Level</th>
      <th>Files</th>
      <th>Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GitHub scrape</td>
      <td>449</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Structural syntax</td>
      <td>79</td>
      <td>17.6%</td>
    </tr>
    <tr>
      <td>SANY (semantic)</td>
      <td>18</td>
      <td>4.0%</td>
    </tr>
    <tr>
      <td>TLC (model checking)</td>
      <td>0</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p><strong>Zero files pass TLC.</strong> Not one specification in the entire scraped dataset can be model-checked end to end.</p>

<p>And even the 18 that pass SANY aren’t what they seem. Deduplication wipes out most of them — 7 are from the same NyxNet project (related config and policy modules), 3 are copies of the same Cantor diagonal proof floating around different repos, and 3 are copies of a trivial <code class="language-plaintext highlighter-rouge">Foo1</code> test spec. The truly distinct, high-quality specifications number around <strong>8</strong>.</p>

<p>To fine-tune a model you need hundreds to thousands of training examples at minimum. 8 unique specs is not a dataset — it’s a handful. Training on this would produce a model that can recite Cantor’s diagonal proof and NyxNet gateway configs, and nothing else.</p>

<p>The scraping approach that felt like it was working two weeks ago has hit a wall.</p>

<h2 id="what-im-taking-away">What I’m Taking Away</h2>

<p>This is the kind of result that makes you question an assumption you didn’t realize you were making. I assumed that if a file exists on GitHub with a <code class="language-plaintext highlighter-rouge">.tla</code> extension, it’s probably a usable TLA+ specification. That assumption was wrong three different ways.</p>

<h3 id="1-scraping-individual-files-doesnt-work-for-tla">1. Scraping individual files doesn’t work for TLA+</h3>

<p>TLA+ specifications are multi-file projects. An <code class="language-plaintext highlighter-rouge">MC.tla</code> file without its companion modules is like a <code class="language-plaintext highlighter-rouge">test_app.py</code> without the app. GitHub’s code search returns individual files, and I treated each one as a self-contained example. 57 of 79 files punished me for that assumption. Future scraping needs to pull entire repository directories, not individual files.</p>

<h3 id="2-structural-validation-gave-me-false-confidence">2. Structural validation gave me false confidence</h3>

<p>My regex-based validator said 52 files were good. SANY said 18. That’s not a minor discrepancy — I was overestimating my dataset by nearly 3x. Balanced brackets tell you almost nothing about whether TLA+ is valid. The gap between “looks right” and “SANY accepts it” is enormous. Any future validation pipeline needs to run the real toolchain from the start, not as an afterthought.</p>

<h3 id="3-the-dataset-needs-a-fundamentally-different-approach">3. The dataset needs a fundamentally different approach</h3>

<p>8 unique specs is not a starting point for fine-tuning. It’s a dead end. The options I’m considering:</p>

<ul>
  <li><strong>Scrape complete projects</strong> — clone full repos, resolve dependencies, validate entire project trees</li>
  <li><strong>Target the tlaplus/Examples repository</strong> — curated, self-contained specs that are known to work</li>
  <li><strong>Generate synthetic specs</strong> — use an LLM to produce specs, validate with SANY/TLC, keep what passes</li>
  <li><strong>Manual curation</strong> — write specs by hand for common patterns (mutex, leader election, consensus)</li>
</ul>

<p>Each has tradeoffs. Scraping complete projects fixes the dependency problem but adds complexity. The Examples repo is high quality but may not be large enough on its own. Synthetic generation is scalable but risks teaching a model to imitate its own mistakes. Manual curation produces the best training pairs but doesn’t scale.</p>

<h2 id="technical-notes">Technical Notes</h2>

<p>The validation script (<code class="language-plaintext highlighter-rouge">symbolic/utils/tlc_validate.py</code>) runs in four phases:</p>

<ol>
  <li><strong>Pre-analysis</strong> — extracts EXTENDS, INSTANCE, Spec operators, dependency classification</li>
  <li><strong>SANY validation</strong> — runs <code class="language-plaintext highlighter-rouge">tla2sany.SANY</code> with 30s timeout, handles filename renaming</li>
  <li><strong>TLC validation</strong> — generates <code class="language-plaintext highlighter-rouge">.cfg</code>, runs <code class="language-plaintext highlighter-rouge">tlc2.TLC</code> with 60s timeout (only for Spec-having files)</li>
  <li><strong>Results</strong> — JSON, summary, pass/fail lists to <code class="language-plaintext highlighter-rouge">validation_output/tlc_validation/</code></li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python tlc_validate.py <span class="se">\</span>
  <span class="nt">--java</span> /opt/homebrew/opt/openjdk@17/bin/java <span class="se">\</span>
  <span class="nt">--tlc-jar</span> ~/tla-tools/tla2tools.jar
</code></pre></div></div>
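<p>For step 3, the generated config only needs to give TLC an entry point. A minimal sketch of what such a generator might look like — the actual script’s output may differ, but <code class="language-plaintext highlighter-rouge">SPECIFICATION</code> and <code class="language-plaintext highlighter-rouge">INVARIANT</code> are standard TLC config keywords:</p>

```python
from pathlib import Path

def write_tlc_config(spec_path, spec_op="Spec", invariants=()):
    """Write a minimal .cfg next to a spec so TLC has an entry point.
    SPECIFICATION names the behavior spec operator; each INVARIANT line
    names a state predicate for TLC to check."""
    cfg_path = Path(spec_path).with_suffix(".cfg")
    lines = [f"SPECIFICATION {spec_op}"]
    lines += [f"INVARIANT {inv}" for inv in invariants]
    cfg_path.write_text("\n".join(lines) + "\n")
    return cfg_path
```

<p>This is also why only one file ever reached TLC: without a <code class="language-plaintext highlighter-rouge">Spec</code> operator to name here, there is nothing to generate a config for.</p>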

<p>Full results are in the <a href="https://github.com/realtimdunbar/symbolic">Symbolic repository</a>.</p>

<h2 id="whats-next">What’s Next</h2>

<p>The honest answer is I need to step back and rebuild the data pipeline before anything else moves forward. The immediate plan:</p>

<ol>
  <li><strong>Clone the tlaplus/Examples repo</strong> and run the full SANY/TLC validation pipeline on it — this should give me a baseline of known-good specs to work with</li>
  <li><strong>Build a repo-level scraper</strong> that clones entire TLA+ projects from GitHub instead of pulling individual files, so dependencies stay intact</li>
  <li><strong>Re-evaluate the training strategy</strong> — depending on how many validated specs I can collect, fine-tuning may not be the right first step. Prompt engineering with a strong base model might get further, faster, while I build up the dataset in parallel</li>
</ol>
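<p>The repo-level scraper (item 2) is mostly plumbing. A sketch, assuming shallow <code class="language-plaintext highlighter-rouge">git clone</code> is acceptable; the function name and directory layout are illustrative:</p>

```python
import subprocess
from pathlib import Path

def clone_and_collect(repo_urls, dest="tla_repos"):
    """Shallow-clone whole repositories so module dependencies stay together,
    then collect every .tla file with its sibling modules intact."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    specs = []
    for url in repo_urls:
        name = url.rstrip("/").split("/")[-1].removesuffix(".git")
        target = dest / name
        if not target.exists():  # skip repos we already have locally
            subprocess.run(["git", "clone", "--depth", "1", url, str(target)],
                           check=True)
        specs.extend(sorted(target.rglob("*.tla")))
    return specs
```

<p>The key difference from the old scraper is that validation can now run per project tree: a spec is checked in the same directory as the modules it EXTENDS, so the dominant failure mode from tonight simply can’t occur.</p>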

<p>Two weeks ago I thought I had 52 validated specs and a clear path to fine-tuning. Tonight I have 8 and a list of hard questions. That’s progress — just not the kind that feels good.</p>

<hr />

<p><em>This is part of my ongoing work on <a href="https://github.com/realtimdunbar/symbolic">Symbolic</a>, an LLM-based system for generating TLA+ specifications from natural language. Previous posts: <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">Introducing Symbolic</a> | <a href="/validating-tlaplus-dataset/">Building the Dataset</a></em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://github.com/tlaplus/tlaplus">TLA+ SANY Parser</a> — the official syntax/semantic analyzer</li>
  <li><a href="https://github.com/tlaplus/Examples">TLA+ Examples Repository</a> — curated, complete specifications</li>
  <li><a href="https://github.com/realtimdunbar/symbolic">Symbolic Project</a></li>
</ul>]]></content><author><name></name></author><category term="formal-methods" /><category term="tla-plus" /><category term="machine-learning" /><category term="tla+" /><category term="sany" /><category term="tlc" /><category term="validation" /><category term="dataset" /><category term="symbolic" /><summary type="html"><![CDATA[In my last post, I collected 449 TLA+ files from GitHub and validated them down to 79 using basic structural checks — balanced brackets, module headers and footers. I reported that 52 of those 79 passed “TLC validation.”]]></summary></entry><entry><title type="html">Building a TLA+ Training Dataset: From GitHub to Model-Ready Specs</title><link href="https://realtimdunbar.github.io/validating-tlaplus-dataset/" rel="alternate" type="text/html" title="Building a TLA+ Training Dataset: From GitHub to Model-Ready Specs" /><published>2026-02-10T06:00:00+00:00</published><updated>2026-02-10T06:00:00+00:00</updated><id>https://realtimdunbar.github.io/validating-tlaplus-dataset</id><content type="html" xml:base="https://realtimdunbar.github.io/validating-tlaplus-dataset/"><![CDATA[<p>Tonight I made significant progress on <a href="https://github.com/realtimdunbar/symbolic">Symbolic</a>, my project to train LLMs to generate TLA+ formal specifications from natural language descriptions. The key milestone: <strong>collecting and validating a dataset of real-world TLA+ specifications from GitHub</strong>.</p>

<blockquote>
  <p><strong>New to this project?</strong> If you want to start from the beginning of the Symbolic project, read <a href="/From-Napkin-Sketch-to-Mathematical-Proof/">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</a> first. Otherwise, continue reading to learn about dataset collection and validation.</p>
</blockquote>

<h2 id="the-challenge">The Challenge</h2>

<p>To fine-tune a model that can generate valid TLA+ specifications, I need training data. Lots of it. And not just any TLA+ code—I need specifications that are:</p>

<ol>
  <li><strong>Syntactically correct</strong> (proper module structure, balanced operators)</li>
  <li><strong>Semantically valid</strong> (pass the TLC model checker)</li>
  <li><strong>Diverse</strong> (covering different domains and patterns)</li>
</ol>

<p>The question: where do I get this data?</p>

<h2 id="phase-1-scraping-github-for-tla-files">Phase 1: Scraping GitHub for TLA+ Files</h2>

<p>I started by building a simple GitHub scraper to collect <code class="language-plaintext highlighter-rouge">.tla</code> files from public repositories. Using GitHub’s Code Search API, I searched for <code class="language-plaintext highlighter-rouge">extension:tla</code> and downloaded the raw file contents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># github-scraper.py (simplified)
</span><span class="n">SEARCH_QUERY</span> <span class="o">=</span> <span class="s">"extension:tla"</span>
<span class="n">SAVE_DIR</span> <span class="o">=</span> <span class="s">"tla_dataset"</span>

<span class="k">def</span> <span class="nf">search_tla_files</span><span class="p">(</span><span class="n">page</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.github.com/search/code?q=</span><span class="si">{</span><span class="n">SEARCH_QUERY</span><span class="si">}</span><span class="s">&amp;page=</span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s">&amp;per_page=100"</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">().</span><span class="n">get</span><span class="p">(</span><span class="s">'items'</span><span class="p">,</span> <span class="p">[])</span>

<span class="k">def</span> <span class="nf">download_file</span><span class="p">(</span><span class="n">item</span><span class="p">):</span>
    <span class="n">file_url</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="s">'url'</span><span class="p">]</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">file_url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">res</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
        <span class="n">content_json</span> <span class="o">=</span> <span class="n">res</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">content_json</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'encoding'</span><span class="p">)</span> <span class="o">==</span> <span class="s">'base64'</span><span class="p">:</span>
            <span class="n">file_content</span> <span class="o">=</span> <span class="n">base64</span><span class="p">.</span><span class="n">b64decode</span><span class="p">(</span><span class="n">content_json</span><span class="p">[</span><span class="s">'content'</span><span class="p">])</span>
            <span class="c1"># Save to disk (simplified: flat layout keyed by file name)
</span>            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">SAVE_DIR</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SAVE_DIR</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s">'name'</span><span class="p">]),</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">file_content</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>
<ul>
  <li><strong>449 TLA+ files</strong> collected from GitHub</li>
  <li>Sourced from <strong>60+ open-source repositories</strong></li>
  <li>Including specs from CometBFT, Paxos implementations, PBFT, and various distributed systems</li>
</ul>

<p>This gave me a solid starting point, but the real work was just beginning.</p>

<h2 id="phase-2-validationseparating-the-wheat-from-the-chaff">Phase 2: Validation—Separating the Wheat from the Chaff</h2>

<p>Having 449 files is great, but are they actually valid? I built a validation pipeline with two levels:</p>

<h3 id="level-1-syntax-validation-basic-structure">Level 1: Syntax Validation (Basic Structure)</h3>

<p>First, I implemented a basic syntax validator that checks for:</p>
<ul>
  <li>Proper module headers (<code class="language-plaintext highlighter-rouge">---- MODULE Name ----</code>)</li>
  <li>Proper module footers (<code class="language-plaintext highlighter-rouge">====</code>)</li>
  <li>Balanced brackets and parentheses</li>
  <li>Balanced logical operators (<code class="language-plaintext highlighter-rouge">/\</code> and <code class="language-plaintext highlighter-rouge">\/</code>)</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># validate_dataset.py
</span><span class="k">class</span> <span class="nc">DatasetValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate_file</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">file_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ValidationResult</span><span class="p">:</span>
        <span class="n">content</span> <span class="o">=</span> <span class="n">file_path</span><span class="p">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span>

        <span class="c1"># Syntax validation
</span>        <span class="n">syntax_valid</span><span class="p">,</span> <span class="n">syntax_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">syntax_validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>

        <span class="c1"># TLC validation (if available)
</span>        <span class="k">if</span> <span class="n">syntax_valid</span> <span class="ow">and</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">skip_tlc</span><span class="p">:</span>
            <span class="n">tlc_valid</span><span class="p">,</span> <span class="n">tlc_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tlc_validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">ValidationResult</span><span class="p">(...)</span>
</code></pre></div></div>
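<p>For concreteness, the structural checks themselves fit in a few lines. This simplified sketch covers the header, footer, and bracket checks; the <code class="language-plaintext highlighter-rouge">/\</code> and <code class="language-plaintext highlighter-rouge">\/</code> balance check is omitted:</p>

```python
# Simplified version of the Level 1 structural checks (header, footer,
# balanced delimiters); the operator-balance check is omitted here.
import re

HEADER_RE = re.compile(r"^-{4,}\s*MODULE\s+\w+\s*-{4,}", re.MULTILINE)
FOOTER_RE = re.compile(r"^={4,}\s*$", re.MULTILINE)

def check_structure(content):
    errors = []
    if not HEADER_RE.search(content):
        errors.append("Missing module header")
    if not FOOTER_RE.search(content):
        errors.append("Missing module footer")
    for open_ch, close_ch, name in [("(", ")", "parentheses"),
                                    ("[", "]", "square brackets")]:
        if content.count(open_ch) != content.count(close_ch):
            errors.append(f"Unmatched {name}")
    return len(errors) == 0, errors
```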

<p>I ran this on all 449 files. The results were… sobering:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total files:        449
✅ Valid files:     79  (17.6%)
❌ Invalid files:   370 (82.4%)
</code></pre></div></div>

<p><strong>Only 17.6% passed basic validation!</strong></p>

<h3 id="what-went-wrong">What Went Wrong?</h3>

<p>Analyzing the 370 failed files revealed common patterns:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5,160 errors: Unmatched parentheses
4,185 errors: Unmatched square brackets
  342 errors: Unbalanced conjunction/disjunction operators
   14 errors: Missing module header
    4 errors: Missing module footer
</code></pre></div></div>

<p>Many files were:</p>
<ul>
  <li><strong>Incomplete specifications</strong> (truncated during GitHub API retrieval)</li>
  <li><strong>Helper modules</strong> with complex imports and advanced features</li>
  <li><strong>Configuration files</strong> (<code class="language-plaintext highlighter-rouge">.cfg</code>) mistakenly grabbed as <code class="language-plaintext highlighter-rouge">.tla</code></li>
  <li>Files with <strong>encoding issues</strong> or special characters</li>
</ul>

<h3 id="level-2-tlc-model-checker-semantic-validation">Level 2: TLC Model Checker (Semantic Validation)</h3>

<p>The 79 syntax-valid files are a good start, but they might still have semantic issues:</p>
<ul>
  <li>Deadlocks</li>
  <li>Invariant violations</li>
  <li>Unreachable states</li>
  <li>Liveness property failures</li>
</ul>

<p>I built a TLC validator that:</p>
<ol>
  <li>Creates temporary config files</li>
  <li>Runs TLC with a timeout (60s per file)</li>
  <li>Parses TLC output for errors</li>
  <li>Extracts error traces and state information</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TLCValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">_run_tlc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
        <span class="n">config_path</span> <span class="o">=</span> <span class="n">spec_path</span><span class="p">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s">'.cfg'</span><span class="p">)</span>
        <span class="n">config_content</span> <span class="o">=</span> <span class="s">"SPECIFICATION Spec</span><span class="se">\n</span><span class="s">"</span>  <span class="c1"># assumes the module defines a Spec formula</span>
        <span class="n">config_path</span><span class="p">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">config_content</span><span class="p">)</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tlc_jar_path</span><span class="p">),</span>
            <span class="s">'tlc2.TLC'</span><span class="p">,</span> <span class="s">'-workers'</span><span class="p">,</span> <span class="s">'4'</span><span class="p">,</span>
            <span class="nb">str</span><span class="p">(</span><span class="n">spec_path</span><span class="p">)</span>
        <span class="p">]</span>

        <span class="k">return</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Note:</strong> I haven’t run full TLC validation yet (requires Java + TLA+ tools setup), but the infrastructure is ready.</p>
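<p>The output-parsing step (item 3) can be written and tested before the Java setup exists. The message strings below reflect typical TLC output and should be treated as assumptions until I can verify them against real runs:</p>

```python
# Sketch of the TLC output parser. The matched strings are my assumptions
# about TLC's messages; exact phrasing may differ between versions.
def parse_tlc_output(stdout):
    errors = []
    for line in stdout.splitlines():
        if line.startswith("Error:") or "Deadlock reached" in line:
            errors.append(line.strip())
        elif "Invariant" in line and "is violated" in line:
            errors.append(line.strip())
    passed = not errors and "Model checking completed" in stdout
    return passed, errors
```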

<h2 id="phase-3-preparing-training-data">Phase 3: Preparing Training Data</h2>

<p>With 79 validated specifications, I created a structured training dataset. Each example includes:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0gfoundation_cometbft_MC_n4_f1"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"tla_spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"---- MODULE MC_n4_f1 ----</span><span class="se">\n</span><span class="s2">..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"module_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MC_n4_f1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"source_repo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0gfoundation/cometbft"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"extends"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"TLC"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Naturals"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"constants"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"N"</span><span class="p">,</span><span class="w"> </span><span class="s2">"MaxRound"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"variables"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"state"</span><span class="p">,</span><span class="w"> </span><span class="s2">"round"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"operator_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
    </span><span class="nl">"line_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">42</span><span class="p">,</span><span class="w">
    </span><span class="nl">"char_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">1337</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"natural_language"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">natural_language</code> field stays empty for now (JSON has no comments); descriptions come later, in the annotation step.</p>
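<p>The metadata comes from a light regex pass over each spec. A simplified sketch (a real parse through SANY would be stricter, and multi-line declarations aren’t handled here):</p>

```python
# Sketch of the metadata extraction: a regex pass over the spec text.
import re

def extract_metadata(spec):
    def names_after(keyword):
        # Matches e.g. "CONSTANTS N, MaxRound" on a single line
        m = re.search(rf"^{keyword}S?\s+(.+)$", spec, re.MULTILINE)
        return [n.strip() for n in m.group(1).split(",")] if m else []

    module = re.search(r"-{4,}\s*MODULE\s+(\w+)\s*-{4,}", spec)
    extends = re.search(r"^EXTENDS\s+(.+)$", spec, re.MULTILINE)
    return {
        "module_name": module.group(1) if module else None,
        "extends": [e.strip() for e in extends.group(1).split(",")] if extends else [],
        "constants": names_after("CONSTANT"),
        "variables": names_after("VARIABLE"),
        "line_count": len(spec.splitlines()),
        "char_count": len(spec),
    }
```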

<h3 id="dataset-statistics">Dataset Statistics</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total examples:     79
Total lines:        1,813
Total characters:   53,239
Average:            22 lines per spec

Size Distribution:
  Small (&lt;20 lines):   49 specs (62%)
  Medium (20-50):      19 specs (24%)
  Large (50+):         11 specs (14%)

Common Modules:
  TLC:       30 specs
  Naturals:   8 specs
  EWD840:     5 specs
  Sequences:  3 specs
</code></pre></div></div>

<h2 id="key-insights">Key Insights</h2>

<h3 id="1-real-world-data-is-messy">1. Real-World Data Is Messy</h3>

<p>GitHub is full of incomplete files, abandoned projects, and experimental code. Only <strong>17.6%</strong> of collected files passed basic validation. This is actually typical for web-scraped datasets.</p>

<p><strong>Lesson:</strong> Build robust validation pipelines. Don’t assume data quality.</p>

<h3 id="2-two-stage-validation-is-essential">2. Two-Stage Validation Is Essential</h3>

<ul>
  <li><strong>Syntax validation</strong> catches structural issues (fast, no external tools)</li>
  <li><strong>Semantic validation</strong> catches logical errors (slower, requires TLC)</li>
</ul>

<p>For machine learning purposes, both matter. You don’t want to train a model on specifications that look correct but have deadlocks or invariant violations.</p>

<h3 id="3-quality--quantity-initially">3. Quality &gt; Quantity (Initially)</h3>

<p>79 high-quality examples is better than 449 low-quality ones. A model trained on valid specs will learn correct patterns. A model trained on invalid specs will learn to make the same mistakes.</p>

<h3 id="4-metadata-matters">4. Metadata Matters</h3>

<p>Extracting metadata (module dependencies, variables, operators) helps with:</p>
<ul>
  <li><strong>Dataset analysis</strong> (what patterns are common?)</li>
  <li><strong>Model evaluation</strong> (can the model handle different complexity levels?)</li>
  <li><strong>Training strategies</strong> (curriculum learning from simple to complex)</li>
</ul>
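<p>That last point is cheap to prototype: a first cut at curriculum ordering is just a sort on complexity proxies already present in the metadata, shortest and simplest specs first:</p>

```python
# Curriculum ordering as a sort on complexity proxies from each example's
# metadata (line count first, operator count as tiebreaker).
def curriculum_order(examples):
    return sorted(
        examples,
        key=lambda ex: (ex["metadata"]["line_count"],
                        ex["metadata"]["operator_count"]),
    )
```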

<h2 id="whats-next">What’s Next?</h2>

<h3 id="immediate-next-steps">Immediate Next Steps</h3>

<ol>
  <li><strong>Run full TLC validation</strong> on the 79 syntax-valid files
    <ul>
      <li>Expected: 40-60 files will pass</li>
      <li>Higher quality guarantee for training</li>
    </ul>
  </li>
  <li><strong>Add natural language descriptions</strong>
    <ul>
      <li>Manual annotation (slow, high quality)</li>
      <li>LLM-generated descriptions (fast, needs review)</li>
      <li>Hybrid approach</li>
    </ul>
  </li>
  <li><strong>Start fine-tuning experiments</strong>
    <ul>
      <li>Begin with Llama-3.1-8B (manageable size)</li>
      <li>Evaluate on held-out test set</li>
      <li>Iterate on training approach</li>
    </ul>
  </li>
</ol>

<h3 id="medium-term-goals">Medium-Term Goals</h3>

<ol>
  <li><strong>Expand the dataset</strong>
    <ul>
      <li>Fix common errors in invalid files</li>
      <li>Generate synthetic variations</li>
      <li>Scrape TLA+ examples repository</li>
      <li>Mine academic papers and tutorials</li>
      <li><strong>Target:</strong> 200-500 examples</li>
    </ul>
  </li>
  <li><strong>Build evaluation metrics</strong>
    <ul>
      <li>Syntax correctness rate</li>
      <li>TLC pass rate</li>
      <li>Human evaluation of quality</li>
      <li>Semantic similarity to reference specs</li>
    </ul>
  </li>
  <li><strong>Experiment with model architectures</strong>
    <ul>
      <li>Different base models (Llama, Mistral, CodeLlama)</li>
      <li>Different context lengths</li>
      <li>Different quantization strategies</li>
    </ul>
  </li>
</ol>
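<p>The first two evaluation metrics reduce to counting. A sketch, assuming each validation result carries boolean <code class="language-plaintext highlighter-rouge">syntax_valid</code> and <code class="language-plaintext highlighter-rouge">tlc_valid</code> flags:</p>

```python
# Syntax correctness rate and TLC pass rate over a batch of validation results.
def pass_rates(results):
    total = len(results)
    syntax = sum(r["syntax_valid"] for r in results)
    tlc = sum(r["syntax_valid"] and r["tlc_valid"] for r in results)
    return {
        "syntax_rate": syntax / total if total else 0.0,
        "tlc_rate": tlc / total if total else 0.0,
    }
```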

<h2 id="technical-details">Technical Details</h2>

<p>All code is available in the <a href="https://github.com/realtimdunbar/symbolic">Symbolic repository</a>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">utils/github-scraper.py</code> - GitHub data collection</li>
  <li><code class="language-plaintext highlighter-rouge">utils/validate_dataset.py</code> - Batch validation pipeline</li>
  <li><code class="language-plaintext highlighter-rouge">utils/prepare_training_data.py</code> - Training data preparation</li>
  <li><code class="language-plaintext highlighter-rouge">src/symbolic/validation/</code> - Validation modules (syntax + TLC)</li>
</ul>

<p>The validation pipeline is designed to be:</p>
<ul>
  <li><strong>Reproducible</strong> (detailed JSON results for every file)</li>
  <li><strong>Extensible</strong> (easy to add new validation checks)</li>
  <li><strong>Efficient</strong> (parallel processing, configurable timeouts)</li>
</ul>
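<p>The “efficient” part is plain <code class="language-plaintext highlighter-rouge">concurrent.futures</code>: validation time is dominated by subprocess and network waits, so worker threads are enough. A sketch:</p>

```python
# Fan validation out across worker threads; results come back in input order.
from concurrent.futures import ThreadPoolExecutor

def validate_all(paths, validate_fn, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(validate_fn, paths))
```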

<h2 id="reflections">Reflections</h2>

<p>Building a dataset for formal methods is harder than I expected. Unlike natural language or even code, TLA+ specifications have:</p>

<ul>
  <li><strong>Rigid syntax requirements</strong> (no room for approximation)</li>
  <li><strong>Complex semantics</strong> (requires model checking to validate)</li>
  <li><strong>Domain expertise</strong> (understanding distributed systems, concurrency, etc.)</li>
</ul>

<p>But it’s also incredibly rewarding. Each valid specification represents a carefully designed model of a complex system. Training an LLM to generate these could democratize formal methods—making them accessible to developers who don’t have PhD-level expertise.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>

<p><strong>Tonight’s Progress:</strong></p>
<ul>
  <li>✅ Collected 449 TLA+ files from GitHub</li>
  <li>✅ Validated to 79 high-quality specifications</li>
  <li>✅ Prepared structured training dataset</li>
  <li>✅ Built reusable validation infrastructure</li>
</ul>

<p><strong>Validation Rate:</strong> 17.6% (79/449)</p>

<p><strong>Dataset Ready:</strong> Yes, for initial experiments</p>

<p><strong>Next Milestone:</strong> Full TLC validation + model fine-tuning</p>

<p>The foundation is laid. Now comes the fun part: teaching an LLM to think formally.</p>

<hr />

<p><em>This is part of my ongoing work on Symbolic, an LLM-based system for generating TLA+ specifications from natural language. Follow along on <a href="https://github.com/realtimdunbar/symbolic">GitHub</a> or read my other posts about formal methods and machine learning.</em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+ Home Page</a></li>
  <li><a href="https://learntla.com/">Learn TLA+</a></li>
  <li><a href="https://github.com/tlaplus/Examples">TLA+ Examples Repository</a></li>
  <li><a href="https://github.com/realtimdunbar/symbolic">Symbolic Project</a></li>
</ul>

<hr />

<p><strong>Update (2026-02-10):</strong> After running full TLC validation, 52 of the 79 files passed semantic validation (65.8% of syntax-valid files). Total pipeline pass rate: 11.6% (52/449). Quality bar is high, but that’s exactly what we want for training data.</p>]]></content><author><name></name></author><category term="machine-learning" /><category term="formal-methods" /><category term="tla-plus" /><category term="tla+" /><category term="dataset" /><category term="validation" /><category term="llm" /><category term="fine-tuning" /><summary type="html"><![CDATA[Tonight I made significant progress on Symbolic, my project to train LLMs to generate TLA+ formal specifications from natural language descriptions. The key milestone: collecting and validating a dataset of real-world TLA+ specifications from GitHub.]]></summary></entry><entry><title type="html">What I Have Been Up To</title><link href="https://realtimdunbar.github.io/What-I-Have-Been-Up-To/" rel="alternate" type="text/html" title="What I Have Been Up To" /><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/What-I-Have-Been-Up-To</id><content type="html" xml:base="https://realtimdunbar.github.io/What-I-Have-Been-Up-To/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>It’s been over eight years since my last post here in August 2017. During that time, the world changed dramatically—we lived through a global pandemic, witnessed fundamental shifts in how we work and communicate, and saw artificial intelligence move from research labs into everyday tools. On a personal level, these years brought significant transitions: completing graduate school, advancing in my career, becoming an empty-nester, and relocating to Florida.</p>

<p>This post serves as a retrospective on the professional and personal growth that occurred during this period, and more importantly, sets the stage for where I’m heading next. After years of building production data systems and completing formal training in computer science, I’m now focusing on the intersection of artificial intelligence, formal methods, and quantum computing—areas that bridge theoretical computer science with practical systems engineering.</p>

<hr />

<h2 id="background">Background</h2>

<p>The past eight years encompassed major life transitions. My youngest child moved out, marking the transition to being empty-nesters. We relocated from Virginia to Clermont, Florida, seeking a change of pace and climate. There were the usual challenges—a car accident with a drunk driver, family moving in, the various emergencies and complexities that come with homeownership. Through it all, I maintained focus on professional development and continued exploring the mathematical and computational ideas that have fascinated me since my undergraduate studies.</p>

<p>Outside of work and study, I’ve remained active in Toastmasters International since 2018, developing communication and leadership skills. I continue to find creative expression through blues music, playing both guitar and harmonica—a reminder that not everything needs to be about logic and computation.</p>

<hr />

<h2 id="education">Education</h2>

<p>In 2021, I began the Master of Science in Computer Science program at Georgia Institute of Technology, completing it in 2025 with a 3.81 GPA. This was a rigorous program that allowed me to formalize knowledge I’d gained through years of practical experience while diving deep into areas I’d only explored superficially before.</p>

<p><strong>Key Areas of Focus:</strong></p>

<ul>
  <li><strong>Artificial Intelligence</strong>: Advanced coursework in machine learning, natural language processing, and knowledge representation</li>
  <li><strong>Quantum Computing</strong>: Specialized study in quantum algorithms and their applications</li>
  <li><strong>Formal Methods</strong>: Training in formal verification, model checking, and correctness proofs</li>
</ul>

<p><strong>Research Highlights:</strong></p>

<p>My most significant academic work involved simulating molecular systems using quantum computers, specifically extending the CAFQA (Clifford Ansatz For Quantum Accuracy) framework. This research sits at the intersection of quantum chemistry, quantum computing, and computational physics.</p>

<p><strong>The Problem:</strong> Classical computers struggle to simulate quantum mechanical systems accurately due to exponential scaling—simulating n quantum particles requires computational resources that grow as 2^n. Quantum computers can simulate these systems more naturally, but current NISQ devices are limited by noise and gate errors.</p>
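<p>To make the 2^n scaling concrete (a back-of-the-envelope sketch, not research code):</p>

```python
# An n-qubit state vector holds 2^n complex amplitudes, 16 bytes each
# at double precision (two 8-byte floats per amplitude).
def statevector_bytes(n_qubits):
    return (2 ** n_qubits) * 16
```

<p>Thirty qubits already need 16 GiB of amplitudes; fifty need tens of petabytes, which is why classical simulation hits a wall so quickly.</p>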

<p><strong>My Contribution:</strong> The CAFQA approach uses Clifford gates (a restricted set of quantum gates) to build quantum circuits for molecular simulation. While Clifford circuits are easier to implement and more noise-resilient, they have limited expressiveness. My research focused on augmenting the traditional Clifford gate set with T gates to determine if this could achieve additional accuracy not realized by CAFQA alone.</p>

<p><strong>Why This Matters:</strong> T gates are non-Clifford gates that add computational power to quantum circuits, allowing them to represent more complex quantum states. However, they’re also more difficult to implement on real quantum hardware and more susceptible to noise. The research question: does the increased expressiveness of Clifford+T circuits outweigh the additional error introduced by T gates for molecular simulation tasks?</p>

<p>The work involved:</p>
<ul>
  <li>Implementing quantum circuits with hybrid Clifford+T gate sets</li>
  <li>Comparing simulation accuracy against pure Clifford approaches (baseline CAFQA)</li>
  <li>Analyzing the trade-off between circuit expressiveness and noise resilience</li>
  <li>Benchmarking on small molecular systems (H₂, LiH, BeH₂)</li>
  <li>Evaluating performance on NISQ hardware with realistic error rates</li>
</ul>
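<p>As an illustrative aside (not code from the research itself): T is the diagonal gate diag(1, e^(iπ/4)), and applying it twice gives the Clifford phase gate S = diag(1, i). That extra half-step of phase is exactly the expressiveness that pure Clifford circuits lack:</p>

```python
# T = diag(1, e^{i*pi/4}); applying T twice equals the Clifford S gate diag(1, i).
import cmath

T = [1, cmath.exp(1j * cmath.pi / 4)]  # diagonal entries of T
S = [1, 1j]                            # diagonal entries of S (a Clifford gate)
T_squared = [t * t for t in T]
assert all(abs(a - b) < 1e-12 for a, b in zip(T_squared, S))
```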

<p>This research reinforced a key insight: the most interesting problems exist at the boundaries between disciplines. Quantum chemistry isn’t just physics—it’s a computational problem that requires expertise in algorithms, hardware limitations, and careful trade-off analysis between theoretical capability and practical implementation constraints.</p>

<p><strong>Broader Training:</strong></p>

<p>Beyond the graduate program, I maintained continuous learning through various certifications and courses:</p>
<ul>
  <li>Data Science at Scale specialization (Coursera, 2017)</li>
  <li>Practical Predictive Analytics (Coursera, 2017)</li>
  <li>Build a Modern Computer from First Principles (Coursera, 2016)</li>
  <li>Multiple certifications in R, Python, and data manipulation</li>
</ul>

<hr />

<h2 id="movement">Movement</h2>

<p>In 2024, we relocated from Virginia to Clermont, Florida. The move represented both a lifestyle change and a practical decision—lower cost of living, better weather, and proximity to growing tech communities in Orlando and Tampa. Working remotely as Director of Data Engineering made the geographic transition seamless professionally, while personally it offered a fresh start after years of intense focus on graduate school and career advancement.</p>

<p>Florida’s emerging tech scene has been a pleasant surprise. While not Silicon Valley or Austin, the state has been attracting significant tech investment, particularly in aerospace (Cape Canaveral’s private space industry), defense contractors, and enterprise software companies. The cost-of-living arbitrage allows for a better quality of life while maintaining the same professional standards and compensation.</p>

<hr />

<h2 id="work">Work</h2>

<p>My professional trajectory over the past eight years has been one of increasing scope and technical depth. I currently serve as <strong>Director of Data Engineering at Trader Interactive</strong>, where I lead initiatives at the intersection of data architecture, systems design, and intelligent automation.</p>

<p><strong>Professional Evolution:</strong></p>

<p>When I last posted in 2017, I was deep in data science and analytics—building predictive models, running statistical analyses, and working primarily with structured datasets. The field has evolved dramatically since then:</p>

<ul>
  <li>
    <p><strong>Infrastructure as Code</strong>: Data engineering now resembles software engineering more than statistics. We build pipelines using modern tooling—Airflow, dbt, Terraform—treating data infrastructure with the same rigor as application code.</p>
  </li>
  <li>
    <p><strong>Real-Time Systems</strong>: Batch processing has given way to streaming architectures. We’ve built systems that process millions of events per day using Kafka, Spark Streaming, and Lambda architectures.</p>
  </li>
  <li>
    <p><strong>ML Operations</strong>: Machine learning moved from Jupyter notebooks to production systems. This required building deployment pipelines, monitoring systems, and governance frameworks—bridging the gap between data science and platform engineering.</p>
  </li>
  <li>
    <p><strong>Cloud-Native Architecture</strong>: Migration from on-premise data centers to cloud infrastructure (primarily AWS) changed how we think about scalability, cost optimization, and system design.</p>
  </li>
</ul>

<p><strong>Key Accomplishments:</strong></p>

<ul>
  <li><strong>D.R.I.V.E. Award (2021)</strong>: Led the AVBT project team, recognized for innovation in data-driven decision making</li>
  <li><strong>Data Platform Modernization</strong>: Architected and led the migration from legacy ETL systems to modern ELT patterns using cloud-native tools</li>
  <li><strong>Team Building</strong>: Grew and mentored a team of data engineers, establishing best practices for code review, testing, and documentation</li>
  <li><strong>Cross-Functional Leadership</strong>: Bridged gaps between data science, analytics, software engineering, and business stakeholders</li>
</ul>

<p><strong>Technical Philosophy:</strong></p>

<p>Over these years, I’ve developed a perspective on data engineering that emphasizes:</p>

<ol>
  <li><strong>Correctness over Speed</strong>: Data pipelines should be provably correct. Late data is annoying; wrong data is catastrophic.</li>
  <li><strong>Simplicity over Cleverness</strong>: Complex systems fail in complex ways. Simple, well-documented systems are easier to debug, maintain, and extend.</li>
  <li><strong>End-to-End Ownership</strong>: Data engineers should understand both the source systems generating data and the downstream use cases consuming it.</li>
  <li><strong>Automation with Guardrails</strong>: Automate everything, but build validation and monitoring into every step.</li>
</ol>
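<p>Principle 4 can be made concrete with a small sketch: wrap each pipeline step in a validator so bad data fails loudly instead of flowing downstream. The names and sample data here are hypothetical, just to illustrate the pattern:</p>

```python
from functools import wraps

def with_guardrail(validator):
    """Wrap a pipeline step so its output is validated before it flows downstream."""
    def decorate(step):
        @wraps(step)
        def wrapper(*args, **kwargs):
            result = step(*args, **kwargs)
            if not validator(result):
                raise ValueError(f"guardrail failed after step {step.__name__!r}")
            return result
        return wrapper
    return decorate

@with_guardrail(lambda rows: all(r.get("amount", 0) >= 0 for r in rows))
def load_transactions():
    # Stand-in extract step; a real pipeline would read from a source system.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 7.5}]

rows = load_transactions()
print(len(rows))  # 2
```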

<p>This philosophy is increasingly influenced by formal methods and correctness proofs—concepts I encountered in graduate school that have direct applications to production data systems.</p>

<hr />

<h2 id="whats-next">What’s Next</h2>

<p>After years of building data infrastructure and completing formal computer science training, I’m pivoting toward three interconnected areas that represent the future of reliable, intelligent systems:</p>

<h3 id="1-ai-architecture-and-llm-systems">1. AI Architecture and LLM Systems</h3>

<p>Large language models have moved from research curiosities to production tools in just a few years. However, most organizations are still figuring out how to deploy them reliably. I’m particularly interested in:</p>

<ul>
  <li><strong>Fine-tuning and specialization</strong>: Adapting open-source models (Llama, Mistral) for domain-specific tasks where GPT-4 falls short</li>
  <li><strong>Retrieval-Augmented Generation (RAG)</strong>: Building systems that ground LLM outputs in verifiable data sources</li>
  <li><strong>LLM reliability</strong>: Developing validation frameworks that catch hallucinations, ensure consistency, and provide confidence scores</li>
  <li><strong>Cost optimization</strong>: Balancing model capability against inference costs—when to use 70B models vs. 7B models vs. prompt engineering</li>
</ul>

<p><strong>Current Project</strong>: I’m building Symbolic, a system that uses fine-tuned LLMs to generate formally verified specifications. This combines practical ML engineering with theoretical computer science, addressing the fundamental problem of AI reliability.</p>

<h3 id="2-formal-methods-and-verification">2. Formal Methods and Verification</h3>

<p>The software industry has largely relied on testing to ensure correctness: write code, write tests, hope you covered the important cases. Formal methods offer a different approach: mathematically prove that systems behave correctly under all possible conditions.</p>

<p><strong>Why This Matters Now:</strong></p>

<p>As systems become more complex—distributed databases, consensus algorithms, concurrent systems—the state space becomes too large to test exhaustively. A mutex with two processes has dozens of possible interleavings. With ten processes, it’s millions. Testing samples the state space; formal verification proves properties across the entire space.</p>
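<p>The blow-up is easy to quantify: <code>n</code> processes each running <code>k</code> atomic steps admit <code>(nk)! / (k!)^n</code> distinct interleavings, a multinomial coefficient. A quick sketch (the step counts here are illustrative):</p>

```python
from math import factorial

def interleavings(n_processes: int, steps_each: int) -> int:
    """Count the interleavings of n sequential processes of k steps each:
    the multinomial coefficient (n*k)! / (k!)^n."""
    total = n_processes * steps_each
    return factorial(total) // (factorial(steps_each) ** n_processes)

print(interleavings(2, 3))   # 2 processes, 3 steps each: 20 schedules
print(interleavings(10, 3))  # 10 processes: far beyond exhaustive testing
```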

<p>Companies like AWS, Microsoft, and MongoDB are already using formal methods (primarily TLA+) to verify critical systems. I believe this will become standard practice, not just for infrastructure companies, but for any organization building safety-critical or financially significant systems.</p>

<p><strong>Areas of Focus:</strong></p>

<ul>
  <li><strong>TLA+ and model checking</strong>: Specifying and verifying distributed systems, consensus protocols, and concurrent algorithms</li>
  <li><strong>Theorem provers</strong>: Exploring Coq, Lean, and other proof assistants for software verification</li>
  <li><strong>Accessibility</strong>: Making formal methods approachable for working engineers (hence the Symbolic project)</li>
</ul>

<h3 id="3-quantum-computing-applications">3. Quantum Computing Applications</h3>

<p>My graduate research in quantum simulation of molecular systems opened my eyes to both the promise and the current limitations of quantum computing. We’re in the NISQ (Noisy Intermediate-Scale Quantum) era—quantum computers exist and work, but they’re noisy, have limited qubits, and can’t yet outperform classical computers for most problems.</p>

<p><strong>Realistic Near-Term Applications:</strong></p>

<ul>
  <li><strong>Quantum chemistry</strong>: Simulating molecular systems for drug discovery and materials science</li>
  <li><strong>Optimization problems</strong>: Exploring quantum annealing and variational algorithms for combinatorial optimization</li>
  <li><strong>Quantum machine learning</strong>: Investigating whether quantum computers can accelerate specific ML workloads</li>
</ul>

<p><strong>What I’m Watching:</strong></p>

<ul>
  <li>Error correction progress (we need ~1000 physical qubits per logical qubit currently)</li>
  <li>Algorithm development for NISQ devices</li>
  <li>Hybrid quantum-classical approaches that leverage the strengths of both</li>
</ul>

<p>I don’t expect quantum computers to replace classical systems broadly, but there are specific domains—particularly in simulation and optimization—where they may provide exponential advantages.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>The past eight years have been transformative both personally and professionally. Graduate school provided formal training in areas I’d explored informally for years. My career evolved from data science and analytics to data engineering and systems architecture. I’ve moved from building models to building the platforms that enable others to build models.</p>

<p>The next phase focuses on reliability and correctness in AI systems—combining practical experience in production engineering with theoretical foundations in formal methods and quantum computing. The goal: build systems that aren’t just intelligent, but provably correct.</p>

<p>This blog will return to active use, documenting this journey. Expect deep dives into:</p>
<ul>
  <li>LLM fine-tuning and productionization</li>
  <li>Formal verification techniques for software systems</li>
  <li>Quantum algorithms and their practical applications</li>
  <li>The intersection of AI and formal methods</li>
</ul>

<p>The world has indeed moved on since 2017. But the fundamental questions remain: How do we build systems that work correctly? How do we make AI reliable? How do we bridge theory and practice? These questions will guide the next chapter.</p>

<hr />

<p><em>If you’re working on similar problems—AI reliability, formal methods, quantum applications—I’d love to connect. Reach out at <a href="mailto:timothy.c.dunbar@me.com">timothy.c.dunbar@me.com</a>.</em></p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[A retrospective on eight years of professional and personal growth: completing a Master's in Computer Science at Georgia Tech, advancing to Director of Data Engineering, and pivoting toward AI architecture, formal methods, and quantum computing applications.]]></summary></entry><entry><title type="html">From Napkin Sketch to Mathematical Proof: Introducing Symbolic</title><link href="https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof/" rel="alternate" type="text/html" title="From Napkin Sketch to Mathematical Proof: Introducing Symbolic" /><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof</id><content type="html" xml:base="https://realtimdunbar.github.io/From-Napkin-Sketch-to-Mathematical-Proof/"><![CDATA[<h2 id="introduction-when-tests-arent-enough">Introduction: When Tests Aren’t Enough</h2>

<p>In 2014, Amazon Web Services prevented a catastrophic S3 outage using a specification language most developers have never heard of. The bug wasn’t caught by their extensive test suite, which had excellent coverage. It wasn’t caught by code review, performed by some of the industry’s best engineers. It was caught by <strong>TLA+</strong>, a formal specification language that can mathematically verify system properties across billions of possible states.</p>

<p>The bug? A subtle race condition in S3’s replication protocol that would only manifest under specific network partition scenarios—exactly the kind of edge case that’s nearly impossible to catch with traditional testing but trivial to find with formal methods. You can read more about <a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">how AWS uses formal methods in this paper</a>.</p>

<p>This raises an uncomfortable question: if 95% test coverage can still miss catastrophic bugs, what are we really testing?</p>

<h3 id="the-problem-with-testing">The Problem with Testing</h3>

<p>Traditional testing is example-based. You write test cases that check specific scenarios:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test_mutex_prevents_concurrent_access</span><span class="p">():</span>
    <span class="n">mutex</span> <span class="o">=</span> <span class="n">Mutex</span><span class="p">()</span>
    <span class="n">process1</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">mutex</span><span class="p">)</span>
    <span class="n">process2</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">mutex</span><span class="p">)</span>

    <span class="n">process1</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>
    <span class="k">assert</span> <span class="ow">not</span> <span class="n">process2</span><span class="p">.</span><span class="n">can_acquire</span><span class="p">()</span>  <span class="c1"># Checks ONE scenario
</span></code></pre></div></div>

<p>This test verifies one particular execution path. But what about:</p>
<ul>
  <li>The 10^15 other possible interleavings?</li>
  <li>Race conditions that only appear under specific timing?</li>
  <li>Deadlocks that emerge from complex state interactions?</li>
</ul>

<p><strong>Formal methods</strong> don’t check examples—they prove properties. A TLA+ specification can verify that “at most one process holds the mutex” across <em>all possible executions</em>. Not 1,000 test cases. Not 1,000,000. <em>All of them.</em></p>

<h3 id="the-accessibility-problem">The Accessibility Problem</h3>

<p>So why isn’t everyone using TLA+? Because it looks like this:</p>

<pre><code class="language-tla">Next ==
    \/ \E p \in Processes:
        /\ pc[p] = "idle"
        /\ critical = {}
        /\ critical' = {p}
        /\ pc' = [pc EXCEPT ![p] = "critical"]
    \/ \E p \in Processes:
        /\ pc[p] = "critical"
        /\ critical' = {}
        /\ pc' = [pc EXCEPT ![p] = "idle"]
</code></pre>

<p>For most developers, this is a significant barrier. Learning TLA+ requires understanding temporal logic, state machines, and a syntax that feels foreign compared to modern programming languages. Companies like AWS and Microsoft have the resources to train engineers in formal methods. Most don’t.</p>

<p><strong>What if we could make TLA+ as accessible as writing a test case?</strong></p>

<p>That’s the mission behind Symbolic: a project I’m building to translate natural language specifications into formally verified TLA+ code using large language models. This post introduces the architecture and explains the key design decisions.</p>

<hr />

<h2 id="the-planned-architecture-from-natural-language-to-mathematical-proof">The Planned Architecture: From Natural Language to Mathematical Proof</h2>

<p>How do you turn a sentence like “users can’t overdraw their account” into something a computer can verify across $10^{15}$ states? The answer is a carefully designed pipeline that combines natural language processing, large language models, and formal verification tools.</p>

<h3 id="system-overview">System Overview</h3>

<p>Symbolic will use a six-stage pipeline with feedback loops:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────┐
│ Natural Language│ "A mutex ensures mutual exclusion"
│ Input           │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessor   │ Extract: processes, variables, invariants
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM Generator  │ Llama 3.2-8B (fine-tuned)
│  (w/ LoRA)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Postprocessor   │ Clean markdown artifacts, ensure structure
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Syntax Validator│ TLA+ parser (SANY)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TLC Validator  │ Model checker (semantic verification)
└────────┬────────┘
         │
         ▼
    ┌───┴────┐
    │ Valid? │───NO──┐
    └───┬────┘       │
        │ YES        │
        ▼            ▼
    ┌────────┐  ┌──────────────┐
    │ Output │  │  Refinement  │
    │ TLA+   │  │  Loop (retry)│
    └────────┘  └──────┬───────┘
                       │
                       └──────┐
                              │
                     [Back to Generator]
</code></pre></div></div>
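<p>In code, the control flow above might be wired together like this. This is a sketch only: the stage functions are stubs standing in for the real components, not Symbolic's actual API.</p>

```python
from dataclasses import dataclass

# Stub stages: stand-ins for the real components described above.
def preprocess(text):
    return {"processes": ["p1", "p2"]}

def generate_spec(text, concepts, feedback):
    return "---- MODULE Generated ----\nVARIABLE x\nInit == x = 0\n===="

def postprocess(raw):
    return raw.strip()

def validate_syntax(spec):
    return ("MODULE" in spec, [])

def validate_semantics(spec):
    return (True, [])

@dataclass
class PipelineResult:
    spec: str
    valid: bool
    attempts: int

def run_pipeline(text: str, max_retries: int = 3) -> PipelineResult:
    """Drive the stages, feeding validator errors back into generation on failure."""
    feedback, spec = "", ""
    for attempt in range(1, max_retries + 1):
        concepts = preprocess(text)                                  # Stage 1
        spec = postprocess(generate_spec(text, concepts, feedback))  # Stages 2-3
        ok, errors = validate_syntax(spec)                           # Stage 4: SANY
        if ok:
            ok, errors = validate_semantics(spec)                    # Stage 5: TLC
        if ok:
            return PipelineResult(spec, True, attempt)
        feedback = "; ".join(errors)                                 # Refinement loop
    return PipelineResult(spec, False, max_retries)

result = run_pipeline("A mutex ensures mutual exclusion")
print(result.valid, result.attempts)  # True 1
```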

<p>Let’s examine each component in detail.</p>

<hr />

<h3 id="stage-1-natural-language-preprocessing">Stage 1: Natural Language Preprocessing</h3>

<p>The preprocessor’s job will be to extract structured information from unstructured text. While the LLM could theoretically do this, separating it into a dedicated stage provides:</p>

<ol>
  <li><strong>Faster iteration</strong> (no LLM call needed for debugging)</li>
  <li><strong>Explicit context</strong> for prompt engineering</li>
  <li><strong>Deterministic parsing</strong> of common patterns</li>
</ol>

<p><strong>Implementation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NLPreprocessor</span><span class="p">:</span>
    <span class="s">"""Extracts concepts from natural language input."""</span>

    <span class="n">PROCESS_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"process"</span><span class="p">,</span> <span class="s">"thread"</span><span class="p">,</span> <span class="s">"node"</span><span class="p">,</span> <span class="s">"agent"</span><span class="p">}</span>
    <span class="n">VARIABLE_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"variable"</span><span class="p">,</span> <span class="s">"state"</span><span class="p">,</span> <span class="s">"counter"</span><span class="p">,</span> <span class="s">"lock"</span><span class="p">}</span>
    <span class="n">INVARIANT_KEYWORDS</span> <span class="o">=</span> <span class="p">{</span><span class="s">"always"</span><span class="p">,</span> <span class="s">"never"</span><span class="p">,</span> <span class="s">"must"</span><span class="p">,</span> <span class="s">"ensures"</span><span class="p">}</span>

    <span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ExtractedConcepts</span><span class="p">:</span>
        <span class="n">normalized</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_normalize_text</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">ExtractedConcepts</span><span class="p">(</span>
            <span class="n">processes</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_processes</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">variables</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_variables</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">invariants</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_invariants</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">actions</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_actions</span><span class="p">(</span><span class="n">normalized</span><span class="p">),</span>
            <span class="n">temporal_properties</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_extract_temporal_properties</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span>
        <span class="p">)</span>
</code></pre></div></div>

<p><strong>Pattern Recognition Examples:</strong></p>

<table>
  <thead>
    <tr>
      <th>Input Pattern</th>
      <th>Extracted Concept</th>
      <th>Reasoning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“two processes compete”</td>
      <td><code class="language-plaintext highlighter-rouge">processes = {"p1", "p2"}</code></td>
      <td>Numeric detection</td>
    </tr>
    <tr>
      <td>“mutex ensures mutual exclusion”</td>
      <td><code class="language-plaintext highlighter-rouge">variables = {"critical", "pc"}</code></td>
      <td>Domain knowledge (mutex → critical section)</td>
    </tr>
    <tr>
      <td>“at most one process”</td>
      <td><code class="language-plaintext highlighter-rouge">invariants = ["Cardinality(critical) &lt;= 1"]</code></td>
      <td>Quantifier detection</td>
    </tr>
    <tr>
      <td>“acquire and release”</td>
      <td><code class="language-plaintext highlighter-rouge">actions = ["acquire", "release"]</code></td>
      <td>Verb extraction</td>
    </tr>
  </tbody>
</table>
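<p>A minimal sketch of the pattern matching behind mappings like these (the regexes and number-word table are illustrative, not the preprocessor's actual rules):</p>

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def _count(word: str) -> int:
    return int(word) if word.isdigit() else NUMBER_WORDS[word]

def extract_processes(text: str) -> set:
    """Numeric detection: 'two processes compete' -> {'p1', 'p2'}."""
    m = re.search(r"\b(one|two|three|four|five|\d+)\s+(?:process|thread|node)",
                  text.lower())
    if not m:
        return set()
    return {f"p{i}" for i in range(1, _count(m.group(1)) + 1)}

def extract_invariants(text: str) -> list:
    """Quantifier detection: 'at most one process' -> a cardinality bound."""
    m = re.search(r"at most (one|two|three|\d+)", text.lower())
    return [f"Cardinality(critical) <= {_count(m.group(1))}"] if m else []

print(sorted(extract_processes("Two processes compete for a lock")))  # ['p1', 'p2']
print(extract_invariants("At most one process holds the lock"))
```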

<p><strong>Why This Matters:</strong></p>

<p>When building the LLM prompt, this context can be injected:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Natural Language: "A mutex ensures mutual exclusion"

Extracted Context:
- Processes: p1, p2
- Variables: critical, pc
- Invariants: at most one process in critical section
- Actions: acquire, release

Generate a TLA+ specification that...
</code></pre></div></div>

<p>Initial experiments show this dramatically improves generation quality by giving the LLM structured information instead of raw text.</p>

<hr />

<h3 id="stage-2-llm-based-tla-generation">Stage 2: LLM-Based TLA+ Generation</h3>

<p>This is where the magic happens—but also where the complexity lies.</p>

<h4 id="model-selection-why-llama-32-8b">Model Selection: Why Llama 3.2-8B?</h4>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Pros</th>
      <th>Cons</th>
      <th>Decision</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>GPT-4</strong></td>
      <td>Best reasoning, strong few-shot</td>
      <td>Closed API, can’t fine-tune, expensive ($0.03/1K tokens)</td>
      <td>❌</td>
    </tr>
    <tr>
      <td><strong>Claude 3</strong></td>
      <td>Great for structured output</td>
      <td>Can’t fine-tune, rate limits</td>
      <td>❌</td>
    </tr>
    <tr>
      <td><strong>Llama 3.2-8B</strong></td>
      <td>Open source, fast inference, fine-tunable</td>
      <td>Needs fine-tuning for TLA+</td>
      <td>✅</td>
    </tr>
  </tbody>
</table>

<p>The key hypothesis: <strong>Fine-tuning an open-source model will beat prompt engineering a closed model</strong> for domain-specific tasks like TLA+ generation.</p>

<h4 id="fine-tuning-configuration">Fine-Tuning Configuration</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoModelForCausalLM</span>
<span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="n">LoraConfig</span><span class="p">,</span> <span class="n">get_peft_model</span>
<span class="kn">from</span> <span class="nn">bitsandbytes</span> <span class="kn">import</span> <span class="n">BitsAndBytesConfig</span>

<span class="c1"># 4-bit quantization for memory efficiency
</span><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
    <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s">"nf4"</span><span class="p">,</span>
    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span>
<span class="p">)</span>

<span class="c1"># LoRA configuration
</span><span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>                              <span class="c1"># Rank (controls adapter capacity)
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>                     <span class="c1"># Scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">],</span>  <span class="c1"># Adapt attention layers
</span>    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"meta-llama/Llama-3.2-8B"</span><span class="p">,</span>
    <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Why LoRA (Low-Rank Adaptation)?</strong></p>

<p>Full fine-tuning of an 8B parameter model requires:</p>
<ul>
  <li><strong>Memory</strong>: ~32GB GPU RAM</li>
  <li><strong>Time</strong>: 40+ hours on a single GPU</li>
  <li><strong>Cost</strong>: $500-1000 on cloud GPUs</li>
</ul>

<p>LoRA adaptation requires:</p>
<ul>
  <li><strong>Memory</strong>: ~12GB GPU RAM (fits on free Colab!)</li>
  <li><strong>Time</strong>: 4-6 hours</li>
  <li><strong>Cost</strong>: $0 (using free tier)</li>
</ul>

<p>LoRA works by freezing the base model and training small adapter matrices that modify attention projections. The adapters are only 45MB compared to the 13GB base model.</p>
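<p>Back-of-the-envelope arithmetic shows why the adapters stay tiny: each adapted weight matrix gains two low-rank factors, B (d_out x r) and A (r x d_in), so r * (d_in + d_out) trainable parameters. The dimensions below are illustrative of a Llama-class model, not exact:</p>

```python
r = 16          # LoRA rank, matching the config above
d_model = 4096  # hidden size (illustrative)
n_layers = 32   # transformer layers (illustrative)

# r * (d_in + d_out) per adapted square projection; q_proj and v_proj per layer.
per_matrix = r * (d_model + d_model)
total = per_matrix * 2 * n_layers
print(f"{total:,} trainable parameters")  # 8,388,608
print(f"fraction of an 8B-parameter base: {total / 8e9:.2%}")
```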

<h4 id="prompt-engineering">Prompt Engineering</h4>

<p>Even with fine-tuning, prompt structure matters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_build_prompt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">natural_language</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="n">ExtractedConcepts</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="k">return</span> <span class="sa">f</span><span class="s">"""You are an expert in TLA+ formal specifications.

Natural Language Description:
</span><span class="si">{</span><span class="n">natural_language</span><span class="si">}</span><span class="s">

Extracted Context:
- Processes: </span><span class="si">{</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">processes</span><span class="p">)</span><span class="si">}</span><span class="s">
- Variables: </span><span class="si">{</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">variables</span><span class="p">)</span><span class="si">}</span><span class="s">
- Invariants: </span><span class="si">{</span><span class="s">"; "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context</span><span class="p">.</span><span class="n">invariants</span><span class="p">)</span><span class="si">}</span><span class="s">

Generate a complete TLA+ module with:
1. MODULE declaration and EXTENDS clause
2. VARIABLE declarations
3. Init predicate (initial state)
4. Action predicates (state transitions)
5. Next predicate (all possible actions)
6. Invariants to verify

TLA+ Specification:
"""</span>
</code></pre></div></div>

<p><strong>Key Design Decision: Why include extracted context?</strong></p>

<p>Early prototyping with base models suggests:</p>
<ul>
  <li><strong>Without context</strong>: ~60% syntax error rate (estimated)</li>
  <li><strong>With context</strong>: ~30% syntax error rate (target)</li>
  <li><strong>With context + fine-tuning</strong>: &lt;10% syntax error rate (goal)</li>
</ul>

<p>The combination of preprocessing and fine-tuning should be crucial to achieving production-quality results.</p>

<hr />

<h3 id="stage-3-postprocessing---making-llm-output-parser-ready">Stage 3: Postprocessing - Making LLM Output Parser-Ready</h3>

<p><strong>The Problem:</strong> LLMs are trained on code from the internet—Stack Overflow answers, GitHub READMEs, blog posts, documentation. This means they’ve learned that “code” often appears wrapped in markdown, surrounded by explanatory text, or includes inline comments explaining their reasoning.</p>

<p>When prompted to generate TLA+, a model might produce:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here's the TLA+ specification you requested:

```tla
---- MODULE Mutex ----
\* This is a simple mutex specification
VARIABLE critical, pc

Init ==
    /\ critical = {}
    /\ pc = [p \in {1,2} |-&gt; "idle"]  \* Both processes start idle
...
====
```

This specification ensures mutual exclusion by…
</code></pre></div></div>

<p>Or it might include natural language mixed with code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>First, we declare the variables:
VARIABLE critical

Then we define the initial state:
Init == critical = {}
</code></pre></div></div>

<p><strong>The Cleanup Tasks:</strong></p>

<p>The postprocessor needs to extract clean, parseable TLA+ from this messy output:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

class TLAPostprocessor:
    def process(self, raw_output: str) -&gt; str:
        # Remove markdown code fences
        cleaned = re.sub(r'```(?:tla|TLA)?\n(.*?)```', r'\1', raw_output, flags=re.DOTALL)

        # Remove common prefixes/suffixes (e.g., "Here's the specification:")
        cleaned = re.sub(r'^.*?(?=----\s*MODULE)', '', cleaned, flags=re.DOTALL)
        cleaned = re.sub(r'====.*?$', '====', cleaned, flags=re.DOTALL)

        # Remove inline comments that are really LLM explanations
        # (More sophisticated filtering may be needed)

        # Ensure required structure
        if not re.search(r'---- MODULE \w+ ----', cleaned):
            cleaned = f"---- MODULE Generated ----\n{cleaned}"
        if '====' not in cleaned:
            cleaned += "\n===="

        return cleaned.strip()
</code></pre></div></div>

<p><strong>Why This Matters:</strong></p>

<p>The TLA+ parser expects pure TLA+ syntax. Any extraneous text—even a single “Here’s your code:” prefix—will cause a parse error. The postprocessor acts as a bridge between “LLM conversational output” and “strict parser input.”</p>

<p>This is likely not exhaustive—as the system is tested with real model outputs, more edge cases will emerge (JSON formatting, escaped characters, hallucinated syntax extensions, etc.). The postprocessor will evolve to handle these as they’re discovered.</p>

<hr />

<h3 id="stage-4-syntax-validation-with-sany">Stage 4: Syntax Validation with SANY</h3>

<p>SANY (Syntactic Analyzer) is the official TLA+ parser, part of the standard TLA+ Tools distribution. It performs static analysis to catch:</p>
<ul>
  <li>Missing MODULE declaration</li>
  <li>Malformed operator expressions (e.g., a dangling <code class="language-plaintext highlighter-rouge">/\</code> with a missing operand)</li>
  <li>Undefined variables</li>
  <li>Level errors (TLA+ is untyped, but constant-, state-, and action-level expressions must be used consistently)</li>
</ul>

<p><strong>Integration:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SyntaxValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="nb">SyntaxError</span><span class="p">]]:</span>
        <span class="c1"># Write to temp file
</span>        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">'.tla'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>
            <span class="n">temp_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>

        <span class="c1"># Run SANY (TLA+ parser)
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
            <span class="p">[</span><span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tla_tools_path</span><span class="p">),</span> <span class="s">'tla2sany.SANY'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">temp_path</span><span class="p">)],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">30</span>
        <span class="p">)</span>

        <span class="c1"># Parse errors
</span>        <span class="n">errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_sany_output</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span> <span class="o">+</span> <span class="n">result</span><span class="p">.</span><span class="n">stderr</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="n">errors</span>
</code></pre></div></div>

<p><strong>Example Error:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input:  VARIABLE x, y
Output: line 5, col 12: Unknown operator: /\\
</code></pre></div></div>

<p>This gives us precise line/column information for refinement.</p>
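<p>That line/column information has to be recovered from SANY’s textual report. A minimal sketch of the parsing step, assuming messages shaped like the example above (the real report format varies between error kinds):</p>

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class ParsedError:
    line: int
    col: int
    message: str

# Assumed message shape: "line 5, col 12: Unknown operator";
# SANY's actual output differs between error kinds, so treat this as a sketch.
ERROR_RE = re.compile(r"line (\d+), col (\d+)[.:]?\s*(.+)")

def parse_sany_output(output: str) -> List[ParsedError]:
    """Extract (line, col, message) triples from SANY's textual report."""
    errors = []
    for raw in output.splitlines():
        match = ERROR_RE.search(raw)
        if match:
            errors.append(ParsedError(int(match.group(1)),
                                      int(match.group(2)),
                                      match.group(3).strip()))
    return errors
```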

<hr />

<h3 id="stage-5-semantic-validation-tlc">Stage 5: Semantic Validation (TLC)</h3>

<p>TLC is a model checker. It:</p>
<ol>
  <li>Enumerates all reachable states</li>
  <li>Checks invariants at each state</li>
  <li>Searches for deadlocks and liveness violations</li>
</ol>

<p><strong>Example:</strong></p>

<pre><code class="language-tla">---- MODULE BrokenMutex ----
EXTENDS Naturals, FiniteSets
VARIABLES critical

Init == critical = {}

Enter(p) ==
    /\ critical' = critical \cup {p}  (* BUG: No mutual exclusion check! *)

Next == \E p \in {1, 2}: Enter(p)

MutualExclusion == Cardinality(critical) &lt;= 1
====
</code></pre>

<p>TLC will find:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Invariant MutualExclusion is violated.
State 1: critical = {}
State 2: critical = {1}
State 3: critical = {1, 2}  (* Violation! *)
</code></pre></div></div>

<p>This is the killer feature: <strong>TLC proves the specification is wrong</strong>, not just that one test case fails.</p>

<p><strong>Integration:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TLCValidator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">bool</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="n">TLCError</span><span class="p">]]:</span>
        <span class="c1"># Write the spec and a minimal TLC config to temp files
</span>        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s">'w'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s">'.tla'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>
            <span class="n">spec_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
        <span class="n">spec_path</span><span class="p">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s">'.cfg'</span><span class="p">).</span><span class="n">write_text</span><span class="p">(</span><span class="s">"SPECIFICATION Spec</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Run TLC (it picks up the .cfg that shares the spec's base name)
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
            <span class="p">[</span><span class="s">'java'</span><span class="p">,</span> <span class="s">'-cp'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tlc_jar_path</span><span class="p">),</span> <span class="s">'tlc2.TLC'</span><span class="p">,</span>
             <span class="s">'-workers'</span><span class="p">,</span> <span class="s">'4'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">spec_path</span><span class="p">)],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">300</span>  <span class="c1"># 5 minute timeout
</span>        <span class="p">)</span>

        <span class="n">errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_tlc_output</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="n">errors</span>
</code></pre></div></div>

<hr />

<h3 id="stage-6-refinement-loop">Stage 6: Refinement Loop</h3>

<p>When validation fails, the system will feed errors back to the LLM:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">refine</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">errors</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ValidationError</span><span class="p">],</span> <span class="n">max_iterations</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">is_valid</span><span class="p">,</span> <span class="n">new_errors</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">validator</span><span class="p">.</span><span class="n">validate</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">is_valid</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">spec</span>

        <span class="c1"># Build refinement prompt
</span>        <span class="n">error_summary</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_format_errors</span><span class="p">(</span><span class="n">new_errors</span><span class="p">)</span>
        <span class="n">refinement_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
The following TLA+ specification has errors:

</span><span class="si">{</span><span class="n">spec</span><span class="si">}</span><span class="s">

Errors:
</span><span class="si">{</span><span class="n">error_summary</span><span class="si">}</span><span class="s">

Fix these errors and regenerate a valid specification.
"""</span>

        <span class="n">spec</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generator</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">refinement_prompt</span><span class="p">)</span>

    <span class="k">raise</span> <span class="n">RefinementError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Could not generate valid spec after </span><span class="si">{</span><span class="n">max_iterations</span><span class="si">}</span><span class="s"> attempts"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Target Success Rates:</strong></p>

<table>
  <thead>
    <tr>
      <th>Iteration</th>
      <th>Syntax Valid (Goal)</th>
      <th>Semantically Valid (Goal)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>40-50%</td>
      <td>20-30%</td>
    </tr>
    <tr>
      <td>2</td>
      <td>70-80%</td>
      <td>50-60%</td>
    </tr>
    <tr>
      <td>3</td>
      <td>85-90%</td>
      <td>70-80%</td>
    </tr>
    <tr>
      <td>4+</td>
      <td>&gt;90%</td>
      <td>&gt;80%</td>
    </tr>
  </tbody>
</table>

<p>The iterative approach should be essential: preliminary testing suggests one-shot generation rarely works for complex specifications.</p>
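<p>The target table is roughly what a simple compounding model predicts: if each refinement pass independently repairs a failing spec with probability p, the chance of having a valid spec within k passes is 1 - (1 - p)^k. A sketch, assuming a 45% per-pass fix rate (an illustration, not a measured number):</p>

```python
def cumulative_success(p_fix: float, k: int) -> float:
    """P(a valid spec within k passes), assuming each pass independently succeeds."""
    return 1 - (1 - p_fix) ** k

# With an assumed 45% per-pass syntax-fix rate:
rates = [round(cumulative_success(0.45, k), 2) for k in range(1, 5)]
print(rates)  # [0.45, 0.7, 0.83, 0.91]
```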

<hr />

<h2 id="design-decisions--tradeoffs">Design Decisions &amp; Tradeoffs</h2>

<h3 id="why-not-just-use-gpt-4">Why Not Just Use GPT-4?</h3>

<p><strong>Cost Analysis:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Cost per Spec</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4 API</td>
      <td>$0.15</td>
      <td>5K tokens in/out, 3 iterations</td>
    </tr>
    <tr>
      <td>Llama 3.2 (self-hosted)</td>
      <td>$0.001</td>
      <td>Inference on local GPU</td>
    </tr>
    <tr>
      <td>Llama 3.2 (cloud GPU)</td>
      <td>$0.02</td>
      <td>AWS g5.xlarge instance</td>
    </tr>
  </tbody>
</table>

<p>At 1,000 specs generated:</p>
<ul>
  <li>GPT-4: <strong>$150</strong></li>
  <li>Self-hosted Llama: <strong>$1</strong></li>
  <li>Cloud Llama: <strong>$20</strong></li>
</ul>
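<p>The totals are just linear scaling of the per-spec figures; a quick check:</p>

```python
# Per-spec costs from the table above, in USD
costs = {"gpt4_api": 0.15, "llama_self_hosted": 0.001, "llama_cloud": 0.02}

# Totals for 1,000 generated specs scale linearly
totals = {name: round(per_spec * 1000) for name, per_spec in costs.items()}
print(totals)  # {'gpt4_api': 150, 'llama_self_hosted': 1, 'llama_cloud': 20}
```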

<p><strong>Fine-Tuning Control:</strong></p>

<p>With open models, I’ll be able to:</p>
<ul>
  <li>Train on proprietary TLA+ specs (companies can’t send to OpenAI)</li>
  <li>Control the training data distribution</li>
  <li>Debug model behavior by inspecting weights</li>
  <li>Deploy on-premise (critical for security-sensitive applications)</li>
</ul>

<h3 id="why-iterative-refinement">Why Iterative Refinement?</h3>

<p><strong>Alternative: Multi-Agent Generation</strong></p>

<p>Some systems use multiple LLM calls in parallel:</p>
<ul>
  <li>Agent 1: Generate spec</li>
  <li>Agent 2: Generate invariants</li>
  <li>Agent 3: Generate test cases</li>
</ul>

<p>This is <strong>faster</strong> (parallel) but <strong>more expensive</strong> (3x API calls) and <strong>less coherent</strong> (agents don’t communicate).</p>

<p>Iterative refinement is <strong>sequential</strong> but should produce <strong>higher quality</strong> output because each iteration learns from validation feedback.</p>

<h3 id="why-tla-first">Why TLA+ First?</h3>

<p><strong>Alternative Targets:</strong></p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Alloy</strong></td>
      <td>Simpler syntax, better for relational models</td>
      <td>Weaker temporal logic</td>
    </tr>
    <tr>
      <td><strong>Z Notation</strong></td>
      <td>Mature, used in safety-critical systems</td>
      <td>Harder to tool</td>
    </tr>
    <tr>
      <td><strong>Coq</strong></td>
      <td>Theorem prover, ultimate verification</td>
      <td>Extremely steep learning curve</td>
    </tr>
    <tr>
      <td><strong>TLA+</strong></td>
      <td>Best temporal logic support, tooling (TLC), AWS/MS use it</td>
      <td>Unfamiliar syntax</td>
    </tr>
  </tbody>
</table>

<p>TLA+ hits the sweet spot of <strong>expressiveness</strong> (temporal logic), <strong>tooling</strong> (TLC model checker), and <strong>industry adoption</strong> (AWS, Azure).</p>

<hr />

<h2 id="the-roadmap">The Roadmap</h2>

<p>I’m building Symbolic in phases over the next 12 weeks:</p>

<p><strong>Phase 1: Foundation (Weeks 1-3)</strong></p>
<ul>
  <li>Core architecture implementation</li>
  <li>Preprocessor and postprocessor</li>
  <li>Basic validation pipeline integration</li>
</ul>

<p><strong>Phase 2: Fine-Tuning (Weeks 4-8)</strong></p>
<ul>
  <li>Dataset creation (target: 5,000+ NL-TLA+ pairs)</li>
  <li>Model fine-tuning with LoRA</li>
  <li>Evaluation and iteration</li>
</ul>

<p><strong>Phase 3: Refinement &amp; Polish (Weeks 9-12)</strong></p>
<ul>
  <li>Iterative refinement loop</li>
  <li>CLI tool development</li>
  <li>Documentation and examples</li>
</ul>

<p><strong>Future Goals:</strong></p>
<ul>
  <li>Multi-language support (Alloy, Z notation, SPIN)</li>
  <li>VS Code extension with real-time validation</li>
  <li>Web interface for non-technical users</li>
  <li>Formal verification as a service API</li>
</ul>

<p>The ultimate goal: <strong>make formal methods as ubiquitous as unit testing.</strong></p>

<hr />

<h2 id="follow-along">Follow Along</h2>

<p>I’m building this project in public and documenting the journey on this blog and on GitHub. Over the coming weeks, I’ll be sharing:</p>

<ul>
  <li><strong>Deep dives</strong> into TLA+ concepts and why they matter</li>
  <li><strong>Technical posts</strong> on fine-tuning LLMs for specialized domains</li>
  <li><strong>Lessons learned</strong> from building synthetic datasets</li>
  <li><strong>Performance metrics</strong> as the system improves</li>
  <li><strong>Open source code</strong> when it’s ready for early testing</li>
</ul>

<p>If you’re interested in formal methods, LLM fine-tuning, or just want to see a project built from scratch, subscribe or follow the GitHub repository (link coming soon).</p>

<p><strong>What would you want to formally verify?</strong> I’m collecting use cases and example systems to test Symbolic against. Reach out at <a href="mailto:timothy.c.dunbar@me.com">timothy.c.dunbar@me.com</a> if you have ideas or want to collaborate.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li><a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+ Homepage</a> - Leslie Lamport’s original work</li>
  <li><a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">AWS and TLA+</a> - How Amazon uses formal methods</li>
  <li><a href="https://learntla.com/">Learn TLA+</a> - Excellent tutorial by Hillel Wayne</li>
  <li><a href="https://arxiv.org/abs/2106.09685">LoRA Paper</a> - Low-Rank Adaptation of Large Language Models</li>
</ul>

<hr />

<p><em>This is part 1 of a series on building Symbolic. Next up: “I Spent 40 Hours Learning TLA+ So You Don’t Have To” - a practical guide to the 5 core concepts.</em></p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[Introducing Symbolic: a project to make formal verification accessible by translating natural language specifications into TLA+ using fine-tuned LLMs. This post explores the architecture, design decisions, and the mission to make formal methods as ubiquitous as unit testing.]]></summary></entry><entry><title type="html">Moran’s I Analysis of Ghent Housing Data</title><link href="https://realtimdunbar.github.io/Ghent-Clustering-Analysis/" rel="alternate" type="text/html" title="Moran’s I Analysis of Ghent Housing Data" /><published>2017-08-08T00:00:00+00:00</published><updated>2017-08-08T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Ghent-Clustering-Analysis</id><content type="html" xml:base="https://realtimdunbar.github.io/Ghent-Clustering-Analysis/"><![CDATA[<hr />

<h2 id="morans-i-explanation">Moran’s I Explanation</h2>

<p>Moran’s I is a measure of spatial autocorrelation: how much the values of a variable cluster in space. In this case I am comparing the Euclidean distances between homes in the Ghent neighborhood of Norfolk and their property values. I didn’t do a lot of cleaning of the data, preferring instead to get a baseline and then see how much the p-value improves after cleaning.</p>
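<p>For reference, the statistic itself is</p>

\[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}}\cdot\frac{\sum_{i}\sum_{j} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i}(x_i-\bar{x})^2} \]

<p>where the x’s are the property values and the spatial weights w are, in this analysis, the inverse Euclidean distances between homes.</p>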

<p>As always we need our libraries</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RDSTK</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ape</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Here I am pulling out the columns I am interested in, specifically the complete address</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">&lt;-</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"~/Naggle/2017-07_GhentHousingData/data/GhentDataSetWithGeo.csv"</span><span class="p">)</span><span class="w">

</span><span class="n">columns</span><span class="o">&lt;-</span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">17</span><span class="p">)</span><span class="w">

</span><span class="n">new_df</span><span class="o">&lt;-</span><span class="n">df</span><span class="p">[,</span><span class="n">columns</span><span class="p">]</span><span class="w">

</span><span class="n">new_df</span><span class="o">$</span><span class="n">whole_address</span><span class="o">&lt;-</span><span class="n">paste</span><span class="p">(</span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property Street`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property City`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property State`</span><span class="p">,</span><span class="w"> </span><span class="n">new_df</span><span class="o">$</span><span class="n">`Property Zip`</span><span class="p">)</span><span class="w">
</span><span class="n">new_df</span><span class="o">$</span><span class="n">total</span><span class="o">&lt;-</span><span class="n">df</span><span class="o">$</span><span class="n">`2016 Building`</span><span class="o">+</span><span class="n">df</span><span class="o">$</span><span class="n">`2016 Land`</span><span class="w">
</span></code></pre></div></div>
<p>Let’s make a map to get a sense of any clustering in home values. On the map below, the darker the blue, the higher the total property value of the address.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pal</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">colorQuantile</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">),</span><span class="w"> </span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">new_df</span><span class="o">$</span><span class="n">total</span><span class="p">))</span><span class="w">

</span><span class="n">leaflet</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">addTiles</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">addCircleMarkers</span><span class="p">(</span><span class="n">lng</span><span class="o">=</span><span class="n">new_df</span><span class="o">$</span><span class="n">longitude</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="o">=</span><span class="n">new_df</span><span class="o">$</span><span class="n">latitude</span><span class="p">,</span><span class="w"> </span><span class="n">weight</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">radius</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">opacity</span><span class="o">=</span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">pal</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/images/moransI.png" alt="Map of Ghent property values" /></p>

<p>Finally, I calculate the Moran’s I of this dataset. The p-value of 0.375 below is not as low as I expected. It makes sense that similarly valued homes would sit close to each other; it’s not often that one sees a mansion next to a trailer park. I will clean up the data and see if the result can be improved.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xy</span><span class="o">&lt;-</span><span class="n">new_df</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">10</span><span class="p">)]</span><span class="w">

</span><span class="n">xy.dist</span><span class="o">&lt;-</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">dist</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">xy</span><span class="o">$</span><span class="n">longitude</span><span class="p">,</span><span class="w"> </span><span class="n">xy</span><span class="o">$</span><span class="n">latitude</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"euclidean"</span><span class="p">,</span><span class="w"> </span><span class="n">diag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">

</span><span class="n">xy.dist.inv</span><span class="w"> </span><span class="o">&lt;</span><span class="m">-1</span><span class="o">/</span><span class="n">xy.dist</span><span class="w">

</span><span class="n">diag</span><span class="p">(</span><span class="n">xy.dist.inv</span><span class="p">)</span><span class="o">&lt;</span><span class="m">-0</span><span class="w">
</span><span class="n">xy.dist.inv</span><span class="p">[</span><span class="n">xy.dist.inv</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">Inf</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="n">Moran.I</span><span class="p">(</span><span class="n">xy</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="n">xy.dist.inv</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>As seen below, the p-value is higher than .05, so we fail to reject the null hypothesis: this run does not show a statistically significant spatial correlation with property value.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">$</span><span class="n">observed</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.001799266</span><span class="w">

</span><span class="o">$</span><span class="n">expected</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">-0.0004823927</span><span class="w">

</span><span class="o">$</span><span class="n">sd</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.002575867</span><span class="w">

</span><span class="o">$</span><span class="n">p.value</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.3757347</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Benford’s Law - Ghent Housing Data</title><link href="https://realtimdunbar.github.io/Benford's-Law-Analysis-Ghent/" rel="alternate" type="text/html" title="Benford’s Law - Ghent Housing Data" /><published>2017-07-04T00:00:00+00:00</published><updated>2017-07-04T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Benford&apos;s%20Law%20Analysis%20-%20Ghent</id><content type="html" xml:base="https://realtimdunbar.github.io/Benford&apos;s-Law-Analysis-Ghent/"><![CDATA[<hr />

<h2 id="benfords-law-explained">Benford’s Law Explained</h2>

<p>Benford’s law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many naturally occurring sets of numbers. Roughly 30% of the numbers should start with <em>1</em>, roughly 18% should start with <em>2</em>, and so on down to about 4.6% starting with <em>9</em>. Benford’s law is usually used as a kind of “canary” for fraud: if the numbers in a dataset do not conform to it, there might be some manipulation going on and further investigation is required.</p>

<p>This is just a quick rundown of the probability formula for Benford’s law. For a leading digit d such that</p>

<p>\[ d\in\{1, 2, \ldots, 9\} \]</p>

<p>The formula is…</p>

\[P(d)=\log_{10}(d + 1)-\log_{10}(d)\]

<p>Because the log of a quotient is the difference of the logs (and vice versa), we can rewrite this as…</p>

\[P(d)=\log_{10}(\frac{d + 1}d)\]

<p>And finally…</p>

\[P(d)=\log_{10}(1+\frac{1}d)\]
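<p>As a quick numerical cross-check of that formula (sketched in Python here, though the analysis below uses R), the nine leading-digit probabilities should sum to 1, with d = 1 near 30%:</p>

```python
import math

# Expected Benford frequencies: P(d) = log10(1 + 1/d) for d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

print(round(benford[1], 3))  # 0.301 -> about 30% of leading digits are 1
print(round(benford[2], 3))  # 0.176
print(round(sum(benford.values()), 6))  # 1.0 -> the nine probabilities cover everything
```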

<p>Now on with the fun stuff</p>

<h2 id="administrative-stuff-package-loading-variables-etc">Administrative stuff, package loading, variables, etc.</h2>

<p>As always we need to load the libraries we are going to use as well as the data into a dataframe.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">ghent_df</span><span class="o">&lt;-</span><span class="n">readr</span><span class="o">::</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"~/Naggle/GhentDataSetTrain.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="we-need-to-filter-out-some-stuff-to-prepare-for-benfords-law">We need to filter out some stuff to prepare for Benford’s Law</h2>

<p>There is at least one type of construction represented in this dataset that needs to be filtered out (there may in fact be more). “Residential Outbuildings” are listed separately but repeat the same values as the main residential structure they are attached to. Leaving them in would make the analysis less accurate.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filtered_ghent_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">ghent_df</span><span class="p">,</span><span class="w"> </span><span class="n">ghent_df</span><span class="o">$</span><span class="n">`Property Use`</span><span class="o">!=</span><span class="s2">"Residential Outbuilding"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="peel-off-the-columns-we-are-interested-in-namely-2016-land-and-2016-building">Peel off the columns we are interested in (namely 2016 Land and 2016 Building)</h2>

<p>In this analysis I’m only interested in two of the columns and really only the individual sums of those two columns.  I want to use the total price of the properties in my analysis so I split out the assessed land value and the assessed building value and sum them.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">filtered_ghent_df</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">11</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">)]</span><span class="w">

</span><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="p">,</span><span class="w"> </span><span class="s1">'total'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">`2016 Land`</span><span class="o">+</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">`2016 Building`</span><span class="p">)</span><span class="w">

</span><span class="n">benford_prep_df</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="p">,</span><span class="w"> </span><span class="s1">'first_digit'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">substr</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">total</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
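The `substr(total, 1, 1)` call above works because R coerces the numeric total to a character string before slicing it. The same first-digit extraction can be sketched in Python (the totals below are hypothetical, purely for illustration):

```python
# Extract the leading digit of an assessed total via string conversion,
# mirroring the substr(total, 1, 1) idiom used above.
def first_digit(value):
    # abs() guards against a leading minus sign; whole-dollar
    # assessments avoid any scientific-notation edge cases
    return str(abs(value))[0]

totals = [249500, 187300, 1020000]  # hypothetical assessed totals
print([first_digit(t) for t in totals])  # ['2', '1', '1']
```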
<h2 id="peel-off-the-first_digit-column-so-that-we-can-see-how-it-conforms-to-benfords-law">Peel off the first_digit column so that we can see how it conforms to Benford’s law</h2>

<p>And now to simply count all the 1s, 2s, 3s, and so on using the table function in R.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">benford_counts_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">benford_prep_df</span><span class="o">$</span><span class="n">first_digit</span><span class="p">)</span><span class="w">

</span><span class="n">benford_counts_table_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="p">))</span><span class="w">
</span><span class="n">benford_counts_table_firsts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">benford_counts_table_firsts</span><span class="p">,</span><span class="w"> </span><span class="n">percentage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">benford_counts_table_firsts</span><span class="o">$</span><span class="n">Freq</span><span class="o">/</span><span class="m">1737</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="firsts-table"><em>Firsts</em> table</h2>

<p>We can already see that there is something interesting happening with 3s and 4s, and there don’t seem to be enough 1s.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">benford_counts_table_firsts</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   benford_counts_firsts Freq percentage
## 1                     1  373 0.21473805
## 2                     2  293 0.16868164
## 3                     3  431 0.24812896
## 4                     4  270 0.15544041
## 5                     5  157 0.09038572
## 6                     6   82 0.04720783
## 7                     7   50 0.02878526
## 8                     8   51 0.02936097
## 9                     9   30 0.01727116
</code></pre></div></div>
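The hunch about the 1s, 3s, and 4s can be checked against Benford’s expected first-digit proportions, P(d) = log10(1 + 1/d). A minimal sketch in Python (the counts are copied from the table above; the arithmetic is language-agnostic):

```python
import math

# Observed first-digit counts from the table above, digits 1..9
observed = [373, 293, 431, 270, 157, 82, 50, 51, 30]
total = sum(observed)  # 1737 records

for digit, count in enumerate(observed, start=1):
    expected = math.log10(1 + 1 / digit)  # Benford's law: P(d) = log10(1 + 1/d)
    actual = count / total
    print(f"{digit}: expected {expected:.3f}, observed {actual:.3f}")
```

Benford predicts roughly 30.1% of values should start with 1; here only about 21.5% do, while 3s (24.8% vs. an expected 12.5%) and 4s (15.5% vs. 9.7%) are well over.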
<h2 id="histogram-of-the-resulting-counts-for-firsts">Histogram of the resulting counts for firsts</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">benford_counts_firsts</span><span class="o">$</span><span class="n">`benford_prep_df$first_digit`</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">stat_count</span><span class="p">(</span><span class="n">binwidth</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"First Digit Counts"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Counts"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Benford's Law Analysis of Ghent Housing Data"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/benfords_law_hist.png" alt="First Digits Frequency Distribution" /><!-- --></p>

<h2 id="conclusion">Conclusion</h2>

<p>This data set does not comply with Benford’s law: more total assessed values begin with the number 3 than anything else, and there are more 4s than there should be as well.  This is not to say that there is fraud happening here, but there is something interesting that would require further investigation beyond the scope of this post.  Most likely, though, this is because Ghent is a mildly affluent neighborhood with a lot of expensive homes for upper-middle-class folks.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Assignment 6 for Data Science at Scale - Coursera</title><link href="https://realtimdunbar.github.io/Assignment-Post/" rel="alternate" type="text/html" title="Assignment 6 for Data Science at Scale - Coursera" /><published>2017-06-24T00:00:00+00:00</published><updated>2017-06-24T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Assignment-Post</id><content type="html" xml:base="https://realtimdunbar.github.io/Assignment-Post/"><![CDATA[<hr />

<h2 id="incidents-of-larcenytheft-are-more-frequent-on-saturdays-and-in-the-north-east-quadrant-of-san-fransisco">Incidents of Larceny/Theft are more frequent on Saturdays and in the North East Quadrant of San Francisco.</h2>

<p>This is to complete an assignment for my Data Science at Scale course.  Because this is a rather simple post for a grade and I’m short on time, there isn’t a lot of analysis here.  However, there are visuals in this post (finally), including a cool map chart that I’ve been meaning to try out.</p>

<p>First I need to specify my packages and read the data into a data frame.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 
## Attaching package: 'dplyr'
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## The following objects are masked from 'package:stats':
## 
##     filter, lag
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">

</span><span class="n">data</span><span class="o">&lt;-</span><span class="n">readr</span><span class="o">::</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'~/datasci_course_materials/assignment6/sanfrancisco_incidents_summer_2014.csv'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
##   IncidntNum = col_integer(),
##   Category = col_character(),
##   Descript = col_character(),
##   DayOfWeek = col_character(),
##   Date = col_character(),
##   Time = col_time(format = ""),
##   PdDistrict = col_character(),
##   Resolution = col_character(),
##   Address = col_character(),
##   X = col_double(),
##   Y = col_double(),
##   Location = col_character(),
##   PdId = col_double()
## )
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 28,993 x 13
##    IncidntNum                    Category
##         &lt;int&gt;                       &lt;chr&gt;
##  1  140734311                       ARSON
##  2  140736317                NON-CRIMINAL
##  3  146177923               LARCENY/THEFT
##  4  146177531               LARCENY/THEFT
##  5  140734220                NON-CRIMINAL
##  6  140734349               DRUG/NARCOTIC
##  7  140734349               DRUG/NARCOTIC
##  8  140734349 DRIVING UNDER THE INFLUENCE
##  9  140738147              OTHER OFFENSES
## 10  140734258                    TRESPASS
## # ... with 28,983 more rows, and 11 more variables: Descript &lt;chr&gt;,
## #   DayOfWeek &lt;chr&gt;, Date &lt;chr&gt;, Time &lt;time&gt;, PdDistrict &lt;chr&gt;,
## #   Resolution &lt;chr&gt;, Address &lt;chr&gt;, X &lt;dbl&gt;, Y &lt;dbl&gt;, Location &lt;chr&gt;,
## #   PdId &lt;dbl&gt;
</code></pre></div></div>

<p>This is a sample of the data in its raw form.  Let’s find out which crime has the highest number of incidents in this data set.</p>

<p>I take the whole Category column, separate it out, and turn it into a table. The R table function is a handy little piece of code that gives you the frequency of each item in your input as a second column.  The whole thing can then be turned into a data frame, as I did here.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">table</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">Category</span><span class="p">)</span><span class="w">
</span><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">)</span><span class="w">
</span><span class="n">dataCrime</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##                           Var1 Freq
## 1                        ARSON   63
## 2                      ASSAULT 2882
## 3                      BRIBERY    1
## 4                     BURGLARY    6
## 5           DISORDERLY CONDUCT   31
## 6  DRIVING UNDER THE INFLUENCE  100
## 7                DRUG/NARCOTIC 1345
## 8                  DRUNKENNESS  147
## 9                 EMBEZZLEMENT   10
## 10                   EXTORTION    7
## 11             FAMILY OFFENSES   10
## 12      FORGERY/COUNTERFEITING   18
## 13                       FRAUD  242
## 14                    GAMBLING    1
## 15                  KIDNAPPING  117
## 16               LARCENY/THEFT 9466
## 17                 LIQUOR LAWS   42
## 18                   LOITERING    3
## 19              MISSING PERSON 1266
## 20                NON-CRIMINAL 3023
## 21              OTHER OFFENSES 3567
## 22     PORNOGRAPHY/OBSCENE MAT    1
## 23                PROSTITUTION  112
## 24                     ROBBERY  308
## 25                     RUNAWAY   61
## 26             SECONDARY CODES  442
## 27             STOLEN PROPERTY    8
## 28                     SUICIDE   14
## 29              SUSPICIOUS OCC 1300
## 30                    TRESPASS  281
## 31                   VANDALISM   17
## 32               VEHICLE THEFT 1966
## 33                    WARRANTS 1782
## 34                 WEAPON LAWS  354
</code></pre></div></div>
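The table-then-order pattern used here (and again below for days of the week) has a direct analogue in Python’s <code>collections.Counter</code>, whose <code>most_common</code> does the sorting for you. A minimal sketch, using a few of the category names from the table above with made-up row order:

```python
from collections import Counter

# A few incident categories as they might appear row-by-row in the raw data
incidents = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "NON-CRIMINAL",
             "LARCENY/THEFT", "ASSAULT", "DRUG/NARCOTIC"]

counts = Counter(incidents)   # like table(data$Category)
top = counts.most_common(2)   # like ordering by -Freq and taking head(n)
print(top)  # [('LARCENY/THEFT', 3), ('ASSAULT', 2)]
```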
<p>Then apply a geom_col treatment so that we can visualize the data.  The categorical data (the types of crime) goes on the x axis and the quantitative data (the number of times each crime occurs in the Category column) goes on the y axis. Easy!  One thing to note: I did have to limit myself to the top ten crime categories, or the chart started to look terrible.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataCrime</span><span class="o">&lt;-</span><span class="n">head</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">[</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="o">-</span><span class="n">dataCrime</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">dataCrime</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">dataCrime</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Var1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Freq</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Type of Crime'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Count of Each Crime'</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency of Each Crime'</span><span class="p">,</span><span class="w">
       </span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2014 San Fransisco Crime Data"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="o">=</span><span class="s2">"none"</span><span class="p">,</span><span class="w"> </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">90</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/crimeFrequency.png" alt="Image description" /><!-- -->
In my opinion this is pretty predictable; it seems there are more incidents of Larceny/Theft than any other crime.</p>

<p>Let’s find out on which day of the week one is most likely to be stolen from.</p>

<p>I need to filter the Category variable for ‘LARCENY/THEFT’, take only the day-of-week column, and turn it into a table just like last time.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">Category</span><span class="o">==</span><span class="s1">'LARCENY/THEFT'</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">table</span><span class="p">(</span><span class="n">dataLT</span><span class="o">$</span><span class="n">DayOfWeek</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">dataLT</span><span class="p">)</span><span class="w">
</span><span class="n">dataLT</span><span class="o">&lt;-</span><span class="n">dataLT</span><span class="p">[</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="o">-</span><span class="n">dataLT</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">dataLT</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<p>Then I will turn that into a simple column chart with the nominal data (days of the week) on the x axis and the quantitative data (frequency of each day of the week) on the y axis.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">dataLT</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Var1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Freq</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="s2">"red"</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Day of Week'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency of Larceny/Theft'</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Larceny/Theft on Day of Week'</span><span class="p">,</span><span class="w">
       </span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2014 San Fransisco Crime Data"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="o">=</span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/theftDays.png" alt="Image description" /><!-- -->
It appears that one has a slightly higher chance of being stolen from on a Saturday in San Francisco (in 2014) than on any other day, though Sunday comes close, followed by Friday.  This makes sense in that the weekend seems to be when more of these offenses occur.  It is also interesting to note that Larceny/Theft occurrences Monday through Thursday remain pretty steady.</p>

<p>Let’s see if we can find out the places in San Francisco to avoid on Saturdays.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataLTMap</span><span class="o">&lt;-</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">Category</span><span class="o">==</span><span class="s1">'LARCENY/THEFT'</span><span class="p">)</span><span class="w">
</span><span class="n">dataLTMap</span><span class="o">&lt;-</span><span class="n">dataLTMap</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">11</span><span class="p">)]</span><span class="w">

</span><span class="n">map</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_map</span><span class="p">(</span><span class="n">location</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'San Fransisco'</span><span class="p">,</span><span class="w"> </span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=San+Fransisco&amp;zoom=12&amp;size=640x640&amp;scale=2&amp;maptype=terrain&amp;language=en-EN&amp;sensor=false
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=San%20Fransisco&amp;sensor=false
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mapPoints</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">map</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="o">$</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dataLTMap</span><span class="o">$</span><span class="n">Y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">21</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">guides</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="n">mapPoints</span><span class="w">
</span></code></pre></div></div>

<p><img src="/images/mapPoints.png" alt="Image description" /><!-- -->
It seems that most of the reported thefts (in 2014) occurred in that northeast quadrant.  Too bad my data set doesn’t tell me what was stolen; it would be interesting to see how many of those were bike thefts (I dig bikes).</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Naive Bayes Classifier Refactor</title><link href="https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor/" rel="alternate" type="text/html" title="Naive Bayes Classifier Refactor" /><published>2017-06-22T00:00:00+00:00</published><updated>2017-06-22T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor</id><content type="html" xml:base="https://realtimdunbar.github.io/Naive-Bayes-Classifier-Refactor/"><![CDATA[<hr />

<h2 id="naive-bayes-classifier-refactor">Naive Bayes Classifier <em>Refactor</em></h2>

<p>As the title suggests, this post will be a refactoring of the code from the previous post.  I’m doing this partly because I recently watched all the videos from Robert Martin’s (Uncle Bob’s) Clean Code series, but also because I think refactoring code is a good way to learn about it.</p>

<p>I might try to make refactoring code a regular part of this blog.</p>

<h4 id="first-a-recap">First a Recap</h4>
<p>Here is the textCleaner function</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">textCleaner</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">scan</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes the author of the quote because I am only interested in male or female</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"--\\s.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes punctuation</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#splits on spaces</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">strsplit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"[[:space:]]+"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#formats as data frame</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
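The cleaning steps in textCleaner (strip the “-- Author” attribution, drop punctuation, split into words) translate almost line for line into other languages. A minimal sketch of the same pipeline in Python, with a hypothetical quote as input (the regexes mirror the gsub calls above, not a character-for-character port):

```python
import re

def text_cleaner(lines):
    """Mirror of the R textCleaner: strip the '-- Author' attribution,
    drop punctuation, and split each quote into individual words."""
    words = []
    for line in lines:
        line = re.sub(r"--\s.*", "", line)   # drop the quote attribution
        line = re.sub(r"[^\w\s]", "", line)  # drop punctuation
        words.extend(line.split())           # split on whitespace
    return words

quotes = ["Brevity is the soul of wit. -- Shakespeare"]  # hypothetical input
print(text_cleaner(quotes))  # ['Brevity', 'is', 'the', 'soul', 'of', 'wit']
```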
<p>And here is the Classifier code</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bayesClassifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">document</span><span class="p">,</span><span class="w"> </span><span class="n">menPrior</span><span class="p">,</span><span class="w"> </span><span class="n">womenPrior</span><span class="p">){</span><span class="w">
  </span><span class="c1">#gets counts of words in each class</span><span class="w">
  </span><span class="n">mCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">menClass</span><span class="p">)</span><span class="w">
  </span><span class="n">wCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#combines the menClass and womenClass dataframes into a vocabulary dataframe</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses like words in vocabAll and finds the count of all unique words in the vocabulary</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">))</span><span class="w">
  </span><span class="n">vocabCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses menClass and womenClass data frames and finds the frequency of each word</span><span class="w">
  </span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">menClass</span><span class="p">))</span><span class="w">
  </span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">womenClass</span><span class="p">))</span><span class="w">
  </span><span class="c1">#finds intersection of document data frame and the menClass and womenClass dataframes</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">&lt;-</span><span class="n">menClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">)),]</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">&lt;-</span><span class="n">womenClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">)),]</span><span class="w">
  </span><span class="c1">#conditional probabilities of each intersecting word, this would be the place to add smoothing if desired in place of the 0s</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">wCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="c1">#takes the product of each frequency column and multiplies by the corresponding prior</span><span class="w">
  </span><span class="n">posteriorM</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">menPrior</span><span class="w">
  </span><span class="n">posteriorW</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">womenPrior</span><span class="w">
  </span><span class="c1">#test for higher posterior</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">posteriorW</span><span class="o">&gt;</span><span class="n">posteriorM</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>I will tackle the textCleaner function first.  My goal will be to make the code read like “well-written prose,” to quote Uncle Bob.  What this means is that all the comments I have in the code are only necessary because I did a terrible job writing the code in the first place.</p>

<p>First, I must write a test that the current code passes so that I know I didn’t break anything while refactoring.  For that we are going to need the <em>testthat</em> library.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#install.packages('testthat')</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">testthat</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We also need a data frame made with the original function to test the new function against.  I’ve assigned it to a variable for simplicity’s sake.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cleaned_test_file</span><span class="o">&lt;-</span><span class="n">textCleaner</span><span class="p">(</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h4 id="textcleaner-unit-test">textCleaner unit test</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_that</span><span class="p">(</span><span class="s1">'textCleaner cleans'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">test_file</span><span class="o">&lt;-</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">textCleaner</span><span class="p">(</span><span class="n">test_file</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="n">cleaned_test_file</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>I ran the unit test against the original function to prove the unit test itself works.  The lack of an error means that I am ready to refactor.</p>

<h4 id="refactored-textcleaner-function">Refactored textCleaner function</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Here, I've broken out each of the separate operations of the original code into its own function.</span><span class="w">
</span><span class="n">remove_author</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_author_pattern</span><span class="o">&lt;-</span><span class="s2">"--\\s.*"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">gsub</span><span class="p">(</span><span class="n">regex_author_pattern</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">remove_punctuation</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_punctuation_pattern</span><span class="o">&lt;-</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">gsub</span><span class="p">(</span><span class="n">regex_punctuation_pattern</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">split_file</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">regex_split_pattern</span><span class="o">&lt;-</span><span class="s2">"[[:space:]]+"</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">strsplit</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">regex_split_pattern</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">clean</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">remove_author</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">remove_punctuation</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">split_file</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># An argument could be made that I didn't have to break every cleaning step out into its own function, but I decided to go all out</span><span class="w">
</span><span class="n">clean_text_file_and_return_data_frame</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">){</span><span class="w">
  
  </span><span class="n">file</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">scan</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  
  </span><span class="n">cleaned_file</span><span class="o">&lt;-</span><span class="n">clean</span><span class="p">(</span><span class="n">file</span><span class="p">)</span><span class="w">
  </span><span class="c1"># annoyingly to get the test to pass I had to rename cleaned_file to x</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">cleaned_file</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">base</span><span class="o">::</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
    
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Now to use the test I wrote (and proved) earlier on the newly written function.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_that</span><span class="p">(</span><span class="s1">'textCleaner cleans'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">test_file</span><span class="o">&lt;-</span><span class="s1">'~/naive-bayes-classifier/refactor_test_file.txt'</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">test_file</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="n">cleaned_test_file</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>Again, the lack of an error means that everything works.  Let’s review:</p>

<ul>
  <li>I used the original code to get a data frame into a variable</li>
  <li>I wrote a unit test against the original code</li>
  <li>I tested my unit test against the data frame variable produced by the original code</li>
  <li>Finally, I wrote and tested the new code</li>
</ul>

<p>The circle is now complete.</p>

<p>As noted in the comment in the <em>clean_text_file_and_return_data_frame</em> function above, to get the test to pass I had to rename my cleaned_file variable to x before I called unlist and converted the result to a data frame.</p>
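<p>A tiny standalone R sketch (separate from the classifier code) shows what forced that rename: as.data.frame names the resulting column after the exact expression it receives, and downstream code looks the column up by the literal name unlist(x).</p>

```r
# Standalone demo: the column of the resulting data frame is named
# after the deparsed expression, so the variable must be called x.
x <- list(c("hello", "world"))
df <- as.data.frame(unlist(x))
colnames(df)  # "unlist(x)" - rename the variable and this name changes too
```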

<p>I have remedied that situation below.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="c1">#clean_text_file_and_return_data_frame&lt;-function(file){</span><span class="w">
  
  </span><span class="c1">#file&lt;-base::scan(file, what="", sep="\n")</span><span class="w">
  </span><span class="c1">#cleaned_file&lt;-clean(file)</span><span class="w">
    
  </span><span class="c1">#return(cleaned_file)</span><span class="w">
</span><span class="c1">#}</span><span class="w">
</span></code></pre></div></div>
<p>This code is much more readable and follows the single responsibility principle.  Now we need a whole new set of unit tests.</p>

<p>For the bayesClassifier function I am going to make lots of changes.  Not only am I going to refactor the code so that it abides by the single responsibility principle, but I am also going to combine all of these functions into one call.  This means that the new bayes_classifier function will call clean_text_file_and_return_data_frame itself.  All the user will have to do is provide the text files for the male and female quotes (training data), the test quote, and the priors.  Let’s get started.</p>

<p>Just as before, we first need a couple of unit tests that work on the current code so that we can test the new code against it.  I’ve created two unit-test text files to use as training data: one has a single female quote, the other a single male quote.  I will then use those same quotes as the test quotes so that we can confirm the classifier returns Male and Female when we expect it to.</p>

<h4 id="bayesclassifier-unit-test">bayesClassifier unit test</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First our input data frames, using our new clean_text_file_and_return_data_frame function</span><span class="w">
</span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/men_unit_test.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/women_unit_test.txt"</span><span class="p">)</span><span class="w">

</span><span class="c1">#then I'm going to make the classifier output the string "Male"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">womenQuote</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/women_unit_test_quote.txt"</span><span class="p">)</span><span class="w">  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenQuote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="c1">#second the string "Female"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">menQuote</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="s2">"~/naive-bayes-classifier/men_unit_test_quote.txt"</span><span class="p">)</span><span class="w">
  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">menQuote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>And I have passing unit tests.  A sharp observer will notice that I am using the womenQuote string to output “Male” and vice versa.  This is a consequence of how a Naive Bayes Classifier works: it needs large training datasets to be accurate, so that the words shared between each training set and the test quote occur with high frequency.  Since that is not the case here, I get the reverse of the output one would expect.  The accuracy of my Naive Bayes Classifier is beyond the scope of this blog post.  Time to refactor.</p>
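<p>Since the code comments call out where smoothing would go, here is a hypothetical standalone sketch (the counts are invented, and this is not the classifier’s code) of why an add-k smoothing constant matters: without it, a single unseen word contributes a probability of zero and wipes out the entire product of probabilities.</p>

```r
# Invented counts: how often each word of a test quote appears in one class.
word_freqs <- c(the = 3, cat = 1, nebula = 0)  # "nebula" never seen in training
class_count <- 10  # total words in this class's training data
vocab_count <- 8   # unique words across both classes
k <- 1             # add-k smoothing constant

# Without smoothing, the unseen word zeroes the whole product.
unsmoothed <- word_freqs / (class_count + vocab_count)
prod(unsmoothed)  # 0

# With smoothing, every word keeps a small nonzero probability.
smoothed <- (word_freqs + k) / (class_count + k * vocab_count)
prod(smoothed)    # small, but nonzero
```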

<h4 id="refactored-bayes_classifier-function">Refactored bayes_classifier function</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_count</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">df</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">combine_dataframes_and_make_table</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">){</span><span class="w">
  </span><span class="n">all</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">all</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">collapse_to_table</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">df</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">find_intersections</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">df2</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">df1</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">df1</span><span class="o">$</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">df2</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">df1</span><span class="o">$</span><span class="n">df</span><span class="p">)),])</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_conditional_probabilities</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">intersections</span><span class="p">,</span><span class="w"> </span><span class="n">count1</span><span class="p">,</span><span class="w"> </span><span class="n">count2</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">((</span><span class="n">intersections</span><span class="o">$</span><span class="n">Freq</span><span class="o">+</span><span class="n">smoothing</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">count1</span><span class="o">+</span><span class="n">count2</span><span class="o">+</span><span class="n">smoothing</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_posterior</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">intersects</span><span class="p">,</span><span class="w"> </span><span class="n">prior</span><span class="p">){</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersects</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">prior</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bayes_classifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">quote_test</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="o">=</span><span class="m">0</span><span class="p">){</span><span class="w">
  
  </span><span class="n">men_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">men_train</span><span class="p">)</span><span class="w">
  </span><span class="n">women_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">women_train</span><span class="p">)</span><span class="w">
  </span><span class="n">quote_class</span><span class="o">&lt;-</span><span class="n">clean_text_file_and_return_data_frame</span><span class="p">(</span><span class="n">quote_test</span><span class="p">)</span><span class="w">
  
  </span><span class="n">men_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">men_class</span><span class="p">)</span><span class="w">
  </span><span class="n">women_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">women_class</span><span class="p">)</span><span class="w">
  
  </span><span class="n">all_words</span><span class="o">&lt;-</span><span class="n">combine_dataframes_and_make_table</span><span class="p">(</span><span class="n">men_class</span><span class="p">,</span><span class="w"> </span><span class="n">women_class</span><span class="p">)</span><span class="w">
  </span><span class="n">all_words_count</span><span class="o">&lt;-</span><span class="n">get_count</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span><span class="w">
  
  </span><span class="n">men_class</span><span class="o">&lt;-</span><span class="n">collapse_to_table</span><span class="p">(</span><span class="n">men_class</span><span class="p">)</span><span class="w">
  </span><span class="n">women_class</span><span class="o">&lt;-</span><span class="n">collapse_to_table</span><span class="p">(</span><span class="n">women_class</span><span class="p">)</span><span class="w">
  
  </span><span class="n">intersects_men</span><span class="o">&lt;-</span><span class="n">find_intersections</span><span class="p">(</span><span class="n">men_class</span><span class="p">,</span><span class="w"> </span><span class="n">quote_class</span><span class="p">)</span><span class="w">
  </span><span class="n">intersects_women</span><span class="o">&lt;-</span><span class="n">find_intersections</span><span class="p">(</span><span class="n">women_class</span><span class="p">,</span><span class="w"> </span><span class="n">quote_class</span><span class="p">)</span><span class="w">

  </span><span class="n">intersects_men</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="n">get_conditional_probabilities</span><span class="p">(</span><span class="n">intersects_men</span><span class="p">,</span><span class="w"> </span><span class="n">men_count</span><span class="p">,</span><span class="w"> </span><span class="n">all_words_count</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">)</span><span class="w">
  </span><span class="n">intersects_women</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="n">get_conditional_probabilities</span><span class="p">(</span><span class="n">intersects_women</span><span class="p">,</span><span class="w"> </span><span class="n">women_count</span><span class="p">,</span><span class="w"> </span><span class="n">all_words_count</span><span class="p">,</span><span class="w"> </span><span class="n">smoothing</span><span class="p">)</span><span class="w">

  </span><span class="n">posterior_men</span><span class="o">&lt;-</span><span class="n">get_posterior</span><span class="p">(</span><span class="n">intersects_men</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">)</span><span class="w">
  </span><span class="n">posterior_women</span><span class="o">&lt;-</span><span class="n">get_posterior</span><span class="p">(</span><span class="n">intersects_women</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">)</span><span class="w">
  
  </span><span class="k">if</span><span class="p">(</span><span class="n">posterior_women</span><span class="o">&gt;</span><span class="n">posterior_men</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<h4 id="now-retest-with-the-modified-unit-tests">Now retest with the modified unit tests</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First our input data frames, using our new clean_text_file_and_return_data_frame function</span><span class="w">
</span><span class="n">men_train</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/men_unit_test.txt"</span><span class="w">
</span><span class="n">women_train</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/women_unit_test.txt"</span><span class="w">

</span><span class="c1">#then I'm going to make the classifier output the string "Male"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">women_quote</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/women_unit_test_quote.txt"</span><span class="w">  
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayes_classifier</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_quote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="c1">#second the string "Female"</span><span class="w">
</span><span class="n">test_that</span><span class="p">(</span><span class="s1">'bayesClassifier classifies'</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">men_quote</span><span class="o">&lt;-</span><span class="s2">"~/naive-bayes-classifier/men_unit_test_quote.txt"</span><span class="w">
  </span><span class="n">expect_that</span><span class="p">(</span><span class="n">bayes_classifier</span><span class="p">(</span><span class="n">men_train</span><span class="p">,</span><span class="w"> </span><span class="n">women_train</span><span class="p">,</span><span class="w"> </span><span class="n">men_quote</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">equals</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>
<p>I have passing unit tests.  Note that I did change the unit tests a bit to account for the new functionality of taking in raw text files.  This is better because now all I have to do is call the bayes_classifier function.</p>

<p>I wasn’t able to make this code much shorter, but it is much more readable with descriptive function names.  I will probably go back and rework the bayes_classifier function to see what else I can do with it at a later date.</p>

<p>I could also now write a bunch more unit tests for all of the new functions I made but this post is already getting way too long.  Next time I promise I will have visuals, perhaps something to do with Benford’s Law.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Naive Bayes Classifier for Quotes (using R Notebook)</title><link href="https://realtimdunbar.github.io/Naive-Bayes-Classifier/" rel="alternate" type="text/html" title="Naive Bayes Classifier for Quotes (using R Notebook)" /><published>2017-06-06T00:00:00+00:00</published><updated>2017-06-06T00:00:00+00:00</updated><id>https://realtimdunbar.github.io/Naive-Bayes-Classifier</id><content type="html" xml:base="https://realtimdunbar.github.io/Naive-Bayes-Classifier/"><![CDATA[<hr />

<h3 id="naive-bayes-classifier">Naive Bayes Classifier</h3>

<p>This will be my first blog post; it is primarily for testing purposes. My workflow basically consists of RStudio, GitHub, and Jekyll, which is a Ruby gem. I will probably write another blog post detailing my process once I figure out what it is.</p>

<p>As the title suggests, this post will be about a Naive Bayes Classifier (NBC) I wrote after attending a meetup on NBCs written in Python. This classifier is trained with male and female quotations but would work equally well classifying other categorical data (note: I am not suggesting that my NBC is accurate).</p>

<p>This post will primarily consist of the mechanics behind my NBC and the resources I used to put it all together. I will write future blog posts regarding accuracy and eventual improvements.</p>
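<p>Before the mechanics, here is the decision rule the classifier implements, sketched with made-up numbers (the word probabilities below are purely hypothetical, not taken from my quote data): the posterior for a class is proportional to its prior times the product of the per-word conditional probabilities, and we pick the class with the larger posterior.</p>

```r
# Naive Bayes decision rule, with hypothetical numbers:
#   posterior(class) is proportional to prior(class) * prod_i P(word_i | class)
p_word_given_class <- c(0.02, 0.01, 0.005)  # P(word_i | class), made up
prior <- 0.5                                # equal priors, as in this post
posterior <- prior * prod(p_word_given_class)
posterior  # 5e-07
```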

<h4 id="first-we-need-a-function-to-build-the-data-frames-we-will-use-as-our-training-data-inputs">First we need a function to build the data frames we will use as our training data inputs:</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">textCleaner</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">scan</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\n"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes the author of the quote because I am only interested in male or female</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"--\\s.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#removes punctuation</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"([-'])|[[:punct:]]"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="c1">#splits on spaces</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">strsplit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"[[:space:]]+"</span><span class="p">)</span><span class="w">
  </span><span class="c1">#formats as data frame</span><span class="w">
  </span><span class="n">x</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h4 id="here-we-are-using-some-text-files-that-i-acquired-from-the-web-and-the-textcleaner-function-we-wrote-earlier-im-also-going-to-define-some-other-variable-we-will-need-later">Here we are using some text files that I acquired from the web and the textCleaner function we wrote earlier. I’m also going to define some other variables we will need later.</h4>

<p>We are using the following quote from Eleanor Roosevelt: <em>“A woman is like a tea bag, you can’t tell how strong she is until you put her in hot water.”</em></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#These are our corpora, made from male and female quotes</span><span class="w">
</span><span class="n">men_quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/men.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">women_quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/women.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">quote</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">textCleaner</span><span class="p">(</span><span class="s2">"/home/timothy/naive-bayes-classifier/quote.txt"</span><span class="p">)</span><span class="w">
</span><span class="n">men_prior</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">women_prior</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></code></pre></div></div>

<h4 id="we-obviously-need-a-function-that-does-the-classification-the-actuall-nbc-i-will-go-through-the-code-line-by-line-and-explain-whats-going-on-a-bit-later-but-for-now-we-will-just-write-it">We obviously need a function that does the classification, the actual NBC. I will go through the code line by line and explain what’s going on a bit later, but for now we will just write it.</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bayesClassifier</span><span class="o">&lt;-</span><span class="k">function</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">document</span><span class="p">,</span><span class="w"> </span><span class="n">menPrior</span><span class="p">,</span><span class="w"> </span><span class="n">womenPrior</span><span class="p">){</span><span class="w">
  </span><span class="c1">#gets counts of words in each class</span><span class="w">
  </span><span class="n">mCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">menClass</span><span class="p">)</span><span class="w">
  </span><span class="n">wCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#combines the menClass and womenClass dataframes into a vocabulary dataframe</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses like words in vocabAll and finds the count of all unique words in the vocabulary</span><span class="w">
  </span><span class="n">vocabAll</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">))</span><span class="w">
  </span><span class="n">vocabCount</span><span class="o">&lt;-</span><span class="n">nrow</span><span class="p">(</span><span class="n">vocabAll</span><span class="p">)</span><span class="w">
  </span><span class="c1">#collapses menClass and womenClass data frames and finds the frequency of each word</span><span class="w">
  </span><span class="n">menClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">menClass</span><span class="p">))</span><span class="w">
  </span><span class="n">womenClass</span><span class="o">&lt;-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">womenClass</span><span class="p">))</span><span class="w">
  </span><span class="c1">#finds intersection of document data frame and the menClass and womenClass dataframes</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">&lt;-</span><span class="n">menClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">menClass</span><span class="o">$</span><span class="n">menClass</span><span class="p">)),]</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">&lt;-</span><span class="n">womenClass</span><span class="p">[</span><span class="n">is.element</span><span class="p">(</span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">,</span><span class="w"> </span><span class="n">intersect</span><span class="p">(</span><span class="n">document</span><span class="o">$</span><span class="n">`unlist(x)`</span><span class="p">,</span><span class="w"> </span><span class="n">womenClass</span><span class="o">$</span><span class="n">womenClass</span><span class="p">)),]</span><span class="w">
  </span><span class="c1">#conditional probabilities of each intersecting word; this is where smoothing could be added in place of the 0s</span><span class="w">
  </span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="o">&lt;-</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="m">+0</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">wCount</span><span class="o">+</span><span class="n">vocabCount</span><span class="m">+0</span><span class="p">)</span><span class="w">
  </span><span class="c1">#takes the product of the frequency column and multiplies by the prior</span><span class="w">
  </span><span class="n">posteriorM</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectM</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">menPrior</span><span class="w">
  </span><span class="n">posteriorW</span><span class="o">&lt;-</span><span class="nf">prod</span><span class="p">(</span><span class="n">intersectW</span><span class="o">$</span><span class="n">Freq</span><span class="p">)</span><span class="o">*</span><span class="n">womenPrior</span><span class="w">
  </span><span class="c1">#test for higher posterior</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">posteriorW</span><span class="o">&gt;</span><span class="n">posteriorM</span><span class="p">){</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="s2">"Male"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
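<p>The comment about smoothing in the code above can be made concrete. Here is a minimal sketch of Laplace (add-one) smoothing, which would replace the 0s in the probability calculation; the helper name and the counts are hypothetical, not part of the classifier:</p>

```r
# Laplace (add-one) smoothing: add a constant k to each word count so that
# words never seen in a class still get a small nonzero probability.
laplace_prob <- function(freq, class_count, vocab_count, k = 1) {
  (freq + k) / (class_count + k * vocab_count)
}

laplace_prob(0, 100, 250)  # unseen word: 1/350, not 0
laplace_prob(5, 100, 250)  # seen word:   6/350
```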

<h4 id="finally-we-call-our-nbc-function-and-pass-in-the-variables-we-made-earlier">Finally we call our NBC function and pass in the variables we made earlier</h4>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">answer</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bayesClassifier</span><span class="p">(</span><span class="n">men_quote</span><span class="p">,</span><span class="w"> </span><span class="n">women_quote</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="p">,</span><span class="w"> </span><span class="n">men_prior</span><span class="p">,</span><span class="w"> </span><span class="n">women_prior</span><span class="p">)</span><span class="w">

</span><span class="n">answer</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Male"
</code></pre></div></div>
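<p>One caveat about the posterior product itself, not handled in the classifier above: multiplying many small conditional probabilities can underflow to zero in double precision, which makes the final comparison meaningless on longer documents. A common guard is to compare log posteriors instead; a sketch with hypothetical per-word probabilities:</p>

```r
# Hypothetical per-word conditionals for two classes over a long document.
probs_m <- rep(1e-4, 300)
probs_w <- rep(2e-4, 300)

prod(probs_m)  # underflows to 0 in double precision

# Summing logs keeps the comparison well-defined.
log_post_m <- log(0.5) + sum(log(probs_m))
log_post_w <- log(0.5) + sum(log(probs_w))
if (log_post_w > log_post_m) "Female" else "Male"
```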

<p>This is clearly wrong, but keep in mind that I am using very small data sets.</p>]]></content><author><name>Tim Dunbar</name></author><summary type="html"><![CDATA[]]></summary></entry></feed>