Fine-Tuning Dataset Validator

Validate JSONL datasets for LLM fine-tuning with error detection, token counting, and cost estimation


Paste a JSONL dataset above to validate it for fine-tuning.

Supports OpenAI Chat, OpenAI Completions, and Anthropic formats. It detects errors, counts tokens, and estimates fine-tuning costs.

What is a Fine-Tuning Dataset Validator?

A fine-tuning dataset validator checks your JSONL training data for structural errors, missing fields, and format inconsistencies before you upload it to an AI provider's fine-tuning API. Catching these issues locally saves you from failed training runs, wasted compute costs, and hours of debugging cryptic API error messages.

Fine-tuning lets you customize a foundation model on your own examples — but the training data must follow a strict format. OpenAI requires a messages array with system, user, and assistant roles. Anthropic uses human and assistant alternating turns. Even a single malformed line can cause the entire training job to fail.

Our free validator auto-detects the format, validates every line individually, counts tokens per example, estimates fine-tuning costs across providers, and lets you export only the valid examples. All processing happens in your browser — your training data never leaves your machine.

How to Use This Tool

Validating your fine-tuning dataset takes just a few steps:

  1. Paste your JSONL data or drag-and-drop a .jsonl file into the input area. Each line should be a valid JSON object.
  2. The validator auto-detects the format (OpenAI Chat, OpenAI Completions, or Anthropic) and validates each line against the expected schema.
  3. Set the max token limit per example using the dropdown — examples exceeding this limit will be flagged with warnings.
  4. Review the summary stats: total examples, valid/invalid counts, token statistics (avg, min, max, median), and estimated fine-tuning costs.
  5. Check individual line results — errors show exactly what's wrong (missing roles, empty content), warnings flag token limit violations.
  6. Use the Copy buttons to export valid examples, statistics, or the full validation report.
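The steps above can be sketched in a few lines of Python. This is a minimal illustration of the per-line checks, not the tool's actual implementation; the `MAX_TOKENS` value stands in for the dropdown setting in step 3, and the OpenAI Chat schema checks mirror the errors and warnings described in steps 4 and 5.

```python
import json

MAX_TOKENS = 4096  # hypothetical per-example limit chosen in step 3

def validate_line(line: str):
    """Validate one JSONL line against the OpenAI Chat schema.

    Returns (is_valid, issues); entries prefixed "warning:" do not
    invalidate the example, mirroring the error/warning split above.
    """
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return False, ["missing or empty 'messages' array"]
    issues = []
    roles = {m.get("role") for m in messages}
    if not {"user", "assistant"} <= roles:
        issues.append("needs at least one user and one assistant message")
    if any(not m.get("content") for m in messages):
        issues.append("a message has empty content")
    if len(line) // 4 > MAX_TOKENS:  # rough ~4 chars/token estimate
        issues.append("warning: estimated tokens exceed the limit")
    errors = [i for i in issues if not i.startswith("warning")]
    return not errors, issues
```

Lines that only trigger warnings still count as valid, matching how the tool separates hard errors from token-limit warnings.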

Supported Fine-Tuning Formats

The validator supports the three most common fine-tuning dataset formats:

OpenAI Chat Format

The standard format for fine-tuning GPT models. Each example is a JSON object with a messages array containing objects with role (system, user, or assistant) and content fields. Every example must include at least one user message and one assistant message.
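A well-formed training example in this format looks like the following sketch (the content is invented for illustration; only the structure matters):

```python
import json

# One training example in OpenAI Chat format: a "messages" array with
# an optional system turn plus at least one user and one assistant turn.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Account > Reset Password."},
    ]
}

# In a JSONL file, each example is the object serialized onto one line.
line = json.dumps(example)
```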

OpenAI Completions Format (Legacy)

The older format uses prompt and completion string fields. While still supported for some models, OpenAI recommends migrating to the chat format for all new fine-tuning jobs.
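For comparison, a legacy completions example is a flat object with just two string fields (content invented for illustration):

```python
import json

# Legacy completions format: flat "prompt"/"completion" string fields.
# The leading space in the completion follows OpenAI's old guidance
# that each completion should start with whitespace.
legacy = {"prompt": "Translate to French: cheese ->", "completion": " fromage"}
line = json.dumps(legacy)
```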

Anthropic Format

Anthropic's fine-tuning format uses a messages array with human and assistant roles in alternating turns. The validator checks that roles alternate correctly and that no messages have empty content.
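The alternation and empty-content checks described above can be sketched like this (a simplified illustration, assuming turns must start with a human message):

```python
def validate_anthropic(messages) -> list[str]:
    """Return a list of issues for a messages array with human/assistant turns."""
    issues = []
    for i, msg in enumerate(messages):
        expected = "human" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            issues.append(f"message {i}: expected role '{expected}'")
        if not msg.get("content"):
            issues.append(f"message {i}: empty content")
    return issues
```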

Understanding Fine-Tuning Costs

Fine-tuning costs are based on the total number of training tokens across all examples multiplied by the per-token training price. The validator estimates costs for all models that currently support fine-tuning, including GPT-4o, GPT-4o Mini, and Mistral Small.

Token counts shown are estimates based on a character-to-token ratio of approximately 4:1 for English text. Actual token counts may vary by 10-20% depending on vocabulary and content. For exact counts, use the provider's tokenizer after validation.
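Putting the two paragraphs above together, a cost estimate is just the ~4 chars/token heuristic applied to the whole dataset, multiplied by a per-token training price. The prices below are placeholders; check your provider's current pricing page for real numbers.

```python
# Hypothetical per-million-token training prices (USD) for illustration.
PRICES_PER_M_TOKENS = {"gpt-4o-mini": 3.00, "gpt-4o": 25.00}

def estimate_cost(examples: list[str], model: str, epochs: int = 1) -> float:
    """Estimate training cost using the ~4 characters/token heuristic."""
    total_tokens = sum(len(e) for e in examples) // 4
    return total_tokens * epochs * PRICES_PER_M_TOKENS[model] / 1_000_000
```

Because the token count is an estimate, treat the result as a ballpark figure with the same 10-20% uncertainty.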

Frequently Asked Questions

Does this tool upload my data anywhere?

No. All validation happens entirely in your browser using JavaScript. Your training data never leaves your machine — no API calls, no server processing, no data storage. This is especially important for fine-tuning datasets which often contain proprietary or sensitive examples.

How accurate are the token count estimates?

The validator uses an approximation of ~4 characters per token for English text. This is accurate to within 10-20% for most content. For exact token counts, use OpenAI's tiktoken library or Anthropic's tokenizer after validating the dataset structure here.

What does the max token limit setting do?

It sets the maximum number of estimated tokens per training example. Examples exceeding this limit are flagged with warnings (not errors) because they may still be valid but could be truncated during training or rejected by the API. Common limits are 4,096 for GPT-4o Mini and 8,192 for GPT-4o fine-tuning.

Can I use this for formats other than JSONL?

Currently, the validator supports JSONL (JSON Lines) format only, which is the standard format required by OpenAI and Anthropic for fine-tuning. Each line must be a valid JSON object. CSV or other formats need to be converted to JSONL first.
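If your data starts out as a two-column CSV, converting it to OpenAI Chat JSONL is straightforward. This is a minimal sketch; the `prompt`/`response` column names are assumptions about your CSV layout.

```python
import csv
import io
import json

def csv_to_jsonl(csv_text: str, prompt_col: str = "prompt",
                 response_col: str = "response") -> str:
    """Convert a two-column CSV into OpenAI Chat JSONL, one example per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        example = {"messages": [
            {"role": "user", "content": row[prompt_col]},
            {"role": "assistant", "content": row[response_col]},
        ]}
        lines.append(json.dumps(example))
    return "\n".join(lines)
```

Run the converted output through the validator afterwards to confirm every line parses cleanly.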

What should I do with the validation report?

Fix all errors (invalid JSON, missing fields, wrong roles) and review warnings (token limit violations). Use the Copy Valid Examples button to export only the clean lines, then re-validate the clean dataset. Aim for zero errors before uploading to any fine-tuning API.

Related Tools

Explore more tools for your AI development workflow: