Building a TLA+ Training Dataset: From GitHub to Model-Ready Specs
Tonight I made significant progress on Symbolic, my project to train LLMs to generate TLA+ formal specifications from natural language descriptions. The key milestone: collecting and validating a dataset of real-world TLA+ specifications from GitHub.
New to this project? If you want to start from the beginning of the Symbolic project, read From Napkin Sketch to Mathematical Proof: Introducing Symbolic first. Otherwise, continue reading to learn about dataset collection and validation.
The Challenge
To fine-tune a model that can generate valid TLA+ specifications, I need training data. Lots of it. And not just any TLA+ code—I need specifications that are:
- Syntactically correct (proper module structure, balanced operators)
- Semantically valid (pass the TLC model checker)
- Diverse (covering different domains and patterns)
The question: where do I get this data?
Phase 1: Scraping GitHub for TLA+ Files
I started by building a simple GitHub scraper to collect .tla files from public repositories. Using GitHub’s Code Search API, I searched for extension:tla and downloaded the raw file contents.
# github-scraper.py (simplified)
import base64
import os

import requests

SEARCH_QUERY = "extension:tla"
SAVE_DIR = "tla_dataset"

def search_tla_files(page=1):
    url = f"https://api.github.com/search/code?q={SEARCH_QUERY}&page={page}&per_page=100"
    response = requests.get(url, headers=HEADERS)
    return response.json().get('items', [])

def download_file(item):
    file_url = item['url']
    res = requests.get(file_url, headers=HEADERS)
    if res.status_code == 200:
        content_json = res.json()
        if content_json.get('encoding') == 'base64':
            file_content = base64.b64decode(content_json['content'])
            # Save to disk
            save_path = os.path.join(SAVE_DIR, item['name'])
            with open(save_path, 'wb') as f:
                f.write(file_content)
Results:
- 449 TLA+ files collected from GitHub
- Sourced from 60+ open-source repositories
- Including specs from CometBFT, Paxos implementations, PBFT, and various distributed systems
This gave me a solid starting point, but the real work was just beginning.
Phase 2: Validation—Separating the Wheat from the Chaff
Having 449 files is great, but are they actually valid? I built a validation pipeline with two levels:
Level 1: Syntax Validation (Basic Structure)
First, I implemented a basic syntax validator that checks for:
- Proper module headers (---- MODULE Name ----)
- Proper module footers (====)
- Balanced brackets and parentheses
- Balanced logical operators (/\ and \/)
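These structural checks can be sketched with a few regexes and bracket counts. This is a toy version of the idea, not the project's actual validator; it ignores strings and comments, so it can miscount brackets in edge cases:

```python
import re

def check_balance(spec: str) -> list[str]:
    """Minimal structural checks for a TLA+ spec (sketch, not a full parser)."""
    errors = []
    # Module header: ---- MODULE Name ----
    if not re.search(r"-{4,}\s*MODULE\s+\w+\s*-{4,}", spec):
        errors.append("missing module header")
    # Module footer: ==== (four or more '=' in a row)
    if not re.search(r"={4,}", spec):
        errors.append("missing module footer")
    # Naive bracket balance: compare raw open/close counts
    for open_c, close_c, name in [("(", ")", "parentheses"),
                                  ("[", "]", "square brackets"),
                                  ("{", "}", "braces")]:
        if spec.count(open_c) != spec.count(close_c):
            errors.append(f"unmatched {name}")
    return errors
```

A well-formed toy spec produces an empty error list; a truncated one trips both the footer check and the parenthesis count.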
# validate_dataset.py
class DatasetValidator:
    def validate_file(self, file_path: Path) -> ValidationResult:
        content = file_path.read_text(encoding='utf-8')

        # Syntax validation
        syntax_valid, syntax_errors = self.syntax_validator.validate(content)

        # TLC validation (if available)
        if syntax_valid and not self.skip_tlc:
            tlc_valid, tlc_errors = self.tlc_validator.validate(content)

        return ValidationResult(...)
I ran this on all 449 files. The results were… sobering:
Total files: 449
✅ Valid files: 79 (17.6%)
❌ Invalid files: 370 (82.4%)
Only 17.6% passed basic validation!
What Went Wrong?
Analyzing the 370 failed files revealed common patterns:
5,160 errors: Unmatched parentheses
4,185 errors: Unmatched square brackets
342 errors: Unbalanced conjunction/disjunction operators
14 errors: Missing module header
4 errors: Missing module footer
Many files were:
- Incomplete specifications (truncated during GitHub API retrieval)
- Helper modules with complex imports and advanced features
- Configuration files (.cfg) mistakenly grabbed as .tla
- Files with encoding issues or special characters
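The mis-grabbed config files are cheap to weed out before syntax validation even runs. The TLC config keywords below (SPECIFICATION, INVARIANT, and so on) are real, but this filtering heuristic is my own illustrative sketch, not the repository's code:

```python
import re

# Keywords that begin statements in a TLC .cfg file
CFG_KEYWORDS = {"SPECIFICATION", "INVARIANT", "INVARIANTS", "PROPERTY",
                "PROPERTIES", "CONSTANT", "CONSTANTS", "INIT", "NEXT",
                "CONSTRAINT", "SYMMETRY", "VIEW"}

def looks_like_cfg(content: str) -> bool:
    """Heuristic: a file with no MODULE header whose first real line starts
    with a TLC config keyword is probably a .cfg masquerading as .tla."""
    if re.search(r"-{4,}\s*MODULE", content):
        return False
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("\\*"):  # skip blanks and comments
            continue
        return line.split()[0] in CFG_KEYWORDS
    return False
```

Running this as a pre-filter keeps obvious config files out of the per-error statistics above.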
Level 2: TLC Model Checker (Semantic Validation)
The 79 syntax-valid files are a good start, but they might still have semantic issues:
- Deadlocks
- Invariant violations
- Unreachable states
- Liveness property failures
I built a TLC validator that:
- Creates temporary config files
- Runs TLC with a timeout (60s per file)
- Parses TLC output for errors
- Extracts error traces and state information
class TLCValidator:
    def _run_tlc(self, spec_path: Path):
        # TLC picks up a same-named .cfg next to the spec
        module_name = spec_path.stem
        config_path = spec_path.with_suffix('.cfg')
        config_content = f"SPECIFICATION {module_name}\n"
        config_path.write_text(config_content)

        cmd = [
            'java', '-cp', str(self.tlc_jar_path),
            'tlc2.TLC', '-workers', '4',
            str(spec_path)
        ]
        return subprocess.run(cmd, capture_output=True, timeout=60)
Note: I haven’t run full TLC validation yet (requires Java + TLA+ tools setup), but the infrastructure is ready.
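Once TLC does run, its console output has to be turned into a pass/fail verdict. A rough parse might look like the sketch below, which assumes TLC's usual habit of prefixing failures with "Error:" and exiting non-zero; the function name and structure are mine:

```python
def parse_tlc_output(stdout: str, returncode: int) -> tuple[bool, list[str]]:
    """Rough parse of TLC console output into (passed, error_lines)."""
    # TLC reports problems on lines beginning with "Error:", e.g.
    # "Error: Invariant TypeOK is violated."
    errors = [line.strip() for line in stdout.splitlines()
              if line.strip().startswith("Error:")]
    # TLC exits non-zero on parse errors, invariant violations, deadlocks, etc.
    ok = returncode == 0 and not errors
    return ok, errors
```

Error traces and state dumps would need more parsing, but this is enough to decide which files survive semantic validation.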
Phase 3: Preparing Training Data
With 79 validated specifications, I created a structured training dataset. Each example includes:
{
  "id": "0gfoundation_cometbft_MC_n4_f1",
  "tla_spec": "---- MODULE MC_n4_f1 ----\n...",
  "metadata": {
    "module_name": "MC_n4_f1",
    "source_repo": "0gfoundation/cometbft",
    "extends": ["TLC", "Naturals"],
    "constants": ["N", "MaxRound"],
    "variables": ["state", "round"],
    "operator_count": 5,
    "line_count": 42,
    "char_count": 1337
  },
  "natural_language": ""  // To be added
}
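The metadata fields above can be pulled out of a spec with simple pattern matching. This is a sketch under the assumption that declarations sit on single lines; a real TLA+ parser (such as SANY, which ships with the TLA+ tools) would be far more robust:

```python
import re

def extract_metadata(spec: str) -> dict:
    """Extract rough metadata from a TLA+ spec with regexes (sketch only)."""
    def names(pattern: str) -> list[str]:
        # Grab a comma-separated name list after a declaration keyword
        m = re.search(pattern, spec, re.MULTILINE)
        return [n.strip() for n in m.group(1).split(",")] if m else []

    module = re.search(r"-{4,}\s*MODULE\s+(\w+)", spec)
    return {
        "module_name": module.group(1) if module else None,
        "extends": names(r"^EXTENDS\s+(.+)$"),
        "constants": names(r"^CONSTANTS?\s+(.+)$"),
        "variables": names(r"^VARIABLES?\s+(.+)$"),
        # Operator definitions look like "Name == ..." at the start of a line
        "operator_count": len(re.findall(r"^\s*\w+.*?==", spec, re.MULTILINE)),
        "line_count": spec.count("\n") + 1,
        "char_count": len(spec),
    }
```

On a toy spec this recovers the module name, EXTENDS list, and declared constants and variables in one pass.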
Dataset Statistics
Total examples: 79
Total lines: 1,813
Total characters: 53,239
Average: 22 lines per spec
Size Distribution:
Small (<20 lines): 49 specs (62%)
Medium (20-50): 19 specs (24%)
Large (50+): 11 specs (14%)
Common Modules:
TLC: 30 specs
Naturals: 8 specs
EWD840: 5 specs
Sequences: 3 specs
Key Insights
1. Real-World Data Is Messy
GitHub is full of incomplete files, abandoned projects, and experimental code. Only 17.6% of collected files passed basic validation. This is actually typical for web-scraped datasets.
Lesson: Build robust validation pipelines. Don’t assume data quality.
2. Two-Stage Validation Is Essential
- Syntax validation catches structural issues (fast, no external tools)
- Semantic validation catches logical errors (slower, requires TLC)
For machine learning purposes, both matter. You don’t want to train a model on specifications that look correct but have deadlocks or invariant violations.
3. Quality > Quantity (Initially)
79 high-quality examples is better than 449 low-quality ones. A model trained on valid specs will learn correct patterns. A model trained on invalid specs will learn to make the same mistakes.
4. Metadata Matters
Extracting metadata (module dependencies, variables, operators) helps with:
- Dataset analysis (what patterns are common?)
- Model evaluation (can the model handle different complexity levels?)
- Training strategies (curriculum learning from simple to complex)
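The curriculum-learning idea falls straight out of the metadata already stored with each example. A minimal sketch, where the complexity weighting is entirely arbitrary and would need tuning:

```python
def curriculum_order(examples: list[dict]) -> list[dict]:
    """Order training examples simple-to-complex using stored metadata.
    The weights (1x lines, 5x operators, 10x imports) are illustrative."""
    def complexity(ex: dict) -> float:
        md = ex["metadata"]
        return (md["line_count"]
                + 5 * md["operator_count"]
                + 10 * len(md["extends"]))
    return sorted(examples, key=complexity)
```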
What’s Next?
Immediate Next Steps
- Run full TLC validation on the 79 syntax-valid files
- Expected: 40-60 files will pass
- Higher quality guarantee for training
- Add natural language descriptions
- Manual annotation (slow, high quality)
- LLM-generated descriptions (fast, needs review)
- Hybrid approach
- Start fine-tuning experiments
- Begin with Llama-3.1-8B (manageable size)
- Evaluate on held-out test set
- Iterate on training approach
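For the held-out evaluation, a deterministic split is enough at this dataset size. A minimal sketch, where the seed and test fraction are arbitrary choices:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Deterministic train/held-out split; same seed always gives the
    same partition, so evaluation runs are comparable."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]  # (train, test)
```

With 79 examples and a 20% test fraction, that leaves roughly 63 examples for fine-tuning and 16 held out.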
Medium-Term Goals
- Expand the dataset
- Fix common errors in invalid files
- Generate synthetic variations
- Scrape TLA+ examples repository
- Mine academic papers and tutorials
- Target: 200-500 examples
- Build evaluation metrics
- Syntax correctness rate
- TLC pass rate
- Human evaluation of quality
- Semantic similarity to reference specs
- Experiment with model architectures
- Different base models (Llama, Mistral, CodeLlama)
- Different context lengths
- Different quantization strategies
Technical Details
All code is available in the Symbolic repository:
- utils/github-scraper.py - GitHub data collection
- utils/validate_dataset.py - Batch validation pipeline
- utils/prepare_training_data.py - Training data preparation
- src/symbolic/validation/ - Validation modules (syntax + TLC)
The validation pipeline is designed to be:
- Reproducible (detailed JSON results for every file)
- Extensible (easy to add new validation checks)
- Efficient (parallel processing, configurable timeouts)
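The parallel-processing piece can be as simple as a thread pool mapping a validator over the file list. A sketch of the idea, not the repository's implementation (threads are fine here because the heavy work happens in subprocesses like TLC):

```python
from concurrent.futures import ThreadPoolExecutor

def validate_all(paths, validate_one, workers=8):
    """Run a per-file validator across the dataset in parallel and
    collect results keyed by path. validate_one is any callable."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(validate_one, paths)))
```

The per-file JSON results then drop out of the returned dict, which keeps every run reproducible and easy to diff.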
Reflections
Building a dataset for formal methods is harder than I expected. Unlike natural language or even code, TLA+ specifications have:
- Rigid syntax requirements (no room for approximation)
- Complex semantics (requires model checking to validate)
- Domain expertise (understanding distributed systems, concurrency, etc.)
But it’s also incredibly rewarding. Each valid specification represents a carefully designed model of a complex system. Training an LLM to generate these could democratize formal methods—making them accessible to developers who don’t have PhD-level expertise.
The Bottom Line
Tonight’s Progress:
- ✅ Collected 449 TLA+ files from GitHub
- ✅ Validated to 79 high-quality specifications
- ✅ Prepared structured training dataset
- ✅ Built reusable validation infrastructure
Validation Rate: 17.6% (79/449)
Dataset Ready: Yes, for initial experiments
Next Milestone: Full TLC validation + model fine-tuning
The foundation is laid. Now comes the fun part: teaching an LLM to think formally.
This is part of my ongoing work on Symbolic, an LLM-based system for generating TLA+ specifications from natural language. Follow along on GitHub or read my other posts about formal methods and machine learning.
Update (2026-02-10): After running full TLC validation, 52 of the 79 files passed semantic validation (65.8% of syntax-valid files). Total pipeline pass rate: 11.6% (52/449). Quality bar is high, but that’s exactly what we want for training data.