One seed, two runtimes, the same bytes.
csvtodashboard ships thirteen sample CSV datasets — sales transactions, customers, employees, weather, OHLC stock bars, IoT telemetry, web server logs. All fake, all CC0, each in five sizes from 100 rows to 1,000,000. The pages make a promise that sounds easy and isn't: your copy of sales-1000.csv is byte-identical to everyone else's, forever — and the million-row file your browser generates on click matches what the static file would contain at that size.
Two runtimes, one set of bytes. This post covers how that works, and the bug that nearly made the promise false.
Fake data is a contract
It's tempting to treat sample data as the one artifact where nothing can be wrong — it's made up. That stops being true the moment someone links to it.
Three groups depend on these exact bytes. The test suite asserts that datasets/sales-1000.csv has exactly 1,001 lines, that its header is date,region,product,channel,units,unit_price,revenue, and that revenue equals units × unit_price on every row it samples. Teachers link lessons against the files and write answer keys of the form "the answer is row 412" — if row 412 quietly changes before grading day, the lesson breaks with no error message. And every dataset page publishes schema.org Dataset JSON-LD whose DataDownload entries carry contentSize — the literal byte count of each file. Regenerate with different bytes and your structured data is lying to search engines.
So the spec file opens with the only rule that matters: "Seeds are arbitrary but FIXED: changing a seed (or column order) changes the published files — don't." The bytes are an API. Everything below is API discipline.
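As a rough sketch of what the test-suite assertions described above amount to, the function below validates a sales CSV held in a string. The name checkSalesCsv, the assumption of "\n" line endings with an optional trailing newline, and the choice to compare money in integer cents are all illustrative, not the project's actual test code. It also checks every row rather than a sample.

```javascript
// Hypothetical checker; assumes "\n" line endings and no quoted fields.
const SALES_HEADER = "date,region,product,channel,units,unit_price,revenue";

function checkSalesCsv(text, expectedRows) {
  // Drop one trailing newline (if any) so it is not counted as an extra line.
  const lines = text.replace(/\n$/, "").split("\n");
  if (lines.length !== expectedRows + 1) return "line count " + lines.length;
  if (lines[0] !== SALES_HEADER) return "header mismatch";
  for (let i = 1; i < lines.length; i++) {
    const cells = lines[i].split(",");
    const units = Number(cells[4]);
    // Compare in integer cents so binary floating point cannot blur the check.
    const priceCents = Math.round(Number(cells[5]) * 100);
    const revenueCents = Math.round(Number(cells[6]) * 100);
    if (units * priceCents !== revenueCents) return "revenue mismatch on row " + i;
  }
  return null; // null means every check passed
}
```

A caller would treat any non-null return value as a failed assertion and print it as the failure message.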
Generate the big files on the visitor's machine
The site is a static host with no backend; every tool runs client-side, on principle. A sample-data library strains that. A thousand rows of the sales table is 52 KB, so a million rows is roughly 52 MB; the web-logs dataset, whose user-agent strings are long, crosses 100 MB at that size. Committing thirteen datasets at five sizes would turn a static site into a CDN bill.
So the library splits. The 100- and 1,000-row files are real committed CSVs — the entire set is under a megabyte, small enough to hot-link from a lesson plan. The 10k, 100k and 1M sizes are buttons. Click one and an 8 KB script builds the file in your browser — 100,000 rows per chunk, yielding to the event loop so the progress label can repaint — then hands you a Blob download. The server never touches a big file, and a million rows costs one click.
That script is the whole trick, and the whole risk. dataset-gen.js runs in both runtimes with zero build step: Node imports it for its side effect (import "../dataset-gen.js", then read globalThis.DatasetGen), and browsers load the same file as a plain <script> tag:
(function (root) {
  "use strict";
  // mulberry32, column kinds, generate(), generateAsync() …
  root.DatasetGen = { generate: generate, generateAsync: generateAsync, headerNames: headerNames };
})(typeof window !== "undefined" ? window : globalThis);
The cost: two engines must now agree byte-for-byte — the Node that wrote the committed files and whatever browser the visitor brought. So the generator leans only on operations whose results ECMAScript pins exactly: int32 arithmetic through Math.imul and unsigned shifts, division by 2³², toFixed, toISOString. No Math.random, no locale-aware formatting, no local time.
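To make the "pinned operations only" rule concrete, here is a minimal sketch of two formatting helpers in that spirit. The names formatMoney and formatDay are hypothetical; the article does not show the module's real formatting code, so treat this as one plausible shape, not the implementation.

```javascript
// Hypothetical helpers illustrating locale-free, time-zone-free formatting.
function formatMoney(x) {
  // toFixed always uses "." as the decimal separator and never groups digits.
  return x.toFixed(2);
}

function formatDay(startIso, offsetDays) {
  // Parse "YYYY-MM-DD" by hand and build the date in UTC,
  // so the host machine's time zone never leaks into the output.
  const p = startIso.split("-").map(Number);
  const ms = Date.UTC(p[0], p[1] - 1, p[2] + offsetDays);
  return new Date(ms).toISOString().slice(0, 10);
}

console.log(formatMoney(19.5));            // "19.50"
console.log(formatDay("2024-01-01", 59));  // "2024-02-29"
```

The contrast is with calls such as toLocaleString or the local-time Date getters, whose output depends on the visitor's locale and time zone and so could differ between the machine that wrote the committed files and the browser generating the large ones.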
Seed per row, not per file
The PRNG is mulberry32 — 32 bits of state, a handful of lines, deterministic on anything that runs JavaScript:
// mulberry32 — tiny, fast, deterministic PRNG
function mulberry32(a) {
  return function () {
    a |= 0; a = (a + 0x6D2B79F5) | 0;
    var t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
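A short usage sketch may help here. It repeats the generator so it runs on its own, checks that two generators built from the same seed produce the same stream, and maps a draw onto an integer range. The helper intBetween is a hypothetical name for illustration, not something the article shows in dataset-gen.js.

```javascript
// mulberry32 as listed above, repeated so this sketch runs on its own.
function mulberry32(a) {
  return function () {
    a |= 0; a = (a + 0x6D2B79F5) | 0;
    var t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: map one draw in [0, 1) onto an inclusive integer range.
function intBetween(rnd, min, max) {
  return min + Math.floor(rnd() * (max - min + 1));
}

const first = mulberry32(101);
const second = mulberry32(101);
const sameStream = [1, 2, 3, 4, 5].every(() => first() === second());
console.log(sameStream); // true: equal seeds give equal streams
console.log(intBetween(mulberry32(7), 1, 12)); // an integer from 1 to 12
```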
The obvious design is one PRNG per file: seed it once, draw values top to bottom. It works until you touch anything. One extra draw anywhere shifts every value from there to the end of the file, chunked generation has to carry generator state across chunk boundaries, and a row is only ever "whatever the stream happened to contain at that point." So the generator reseeds per row:
// generate(spec, n) → CSV string. Deterministic in (spec.seed, n).
for (var i = 0; i < n; i++) {
  var rnd = mulberry32((spec.seed ^ Math.imul(i + 1, 0x9E3779B9)) | 0);
  rnd(); rnd(); // warm-up: first draws from adjacent seeds correlate
  var row = buildRow(spec, i, n, rnd);
  // … (rest of the loop body elided)
}
Row i's randomness is a pure function of (spec.seed, i): the row index is multiplied by 0x9E3779B9 — the 32-bit golden-ratio constant — to scatter consecutive rows across the integer space, then XORed into the dataset's fixed seed. Three things fall out. Row N draws the same values whether you asked for a thousand rows or a million. Generation can stop at any row boundary and resume — generateAsync parks in a setTimeout between 100k-row chunks with no PRNG state to carry. And damage is contained: adding a column to a spec changes later columns of the same row, not every cell to the end of the file.
One honest footnote: rows are pinned per (dataset, size), not across sizes — spread-mode date columns take the row count on purpose, so the 1,000-row sales file spans the same two-year window as the million-row one. sales-1000.csv is not a prefix of sales-10000.csv; each (dataset, size) pair is what's frozen forever.
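The resume property can be checked with a toy version of the loop. In the sketch below, rowRandom mirrors the per-row seeding shown above, while toyRow and rowsInRange are hypothetical stand-ins for buildRow and the generator. The toy row ignores the row count, so it sidesteps the spread-mode date caveat.

```javascript
// mulberry32 as listed earlier, repeated so this sketch runs on its own.
function mulberry32(a) {
  return function () {
    a |= 0; a = (a + 0x6D2B79F5) | 0;
    var t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Per-row generator: a pure function of (seed, i), including the warm-up.
function rowRandom(seed, i) {
  const rnd = mulberry32((seed ^ Math.imul(i + 1, 0x9E3779B9)) | 0);
  rnd(); rnd(); // warm-up draws
  return rnd;
}

// Toy row builder (hypothetical): two random cells per row.
function toyRow(seed, i) {
  const rnd = rowRandom(seed, i);
  const units = 1 + Math.floor(rnd() * 12);
  const price = (19 + rnd() * 180).toFixed(2);
  return units + "," + price;
}

// Generate rows [from, to) without touching any earlier row.
function rowsInRange(seed, from, to) {
  const out = [];
  for (let i = from; i < to; i++) out.push(toyRow(seed, i));
  return out;
}

const whole = rowsInRange(101, 0, 10);
const middle = rowsInRange(101, 4, 8);
console.log(whole.slice(4, 8).join("|") === middle.join("|")); // true
```

Because no state survives from one row to the next, starting at row 4 gives exactly the rows a full run would have produced at positions 4 through 7.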
Which leaves the strangest line in the file: rnd(); rnd();. Mulberry32's first output doesn't fully mix a fresh seed, and these row seeds come from a regular sequence — first draws from related seeds correlate, which shows up as visible banding down the first random column. So every row burns two draws before a single value is used.
The warm-up draws are part of the file format
There are two generation paths in the module. generate() is synchronous: Node calls it to write the committed files, and the build uses it for the 8-row preview tables on each dataset page. generateAsync() is the browser-button variant — the same loop sliced into chunks so a million rows doesn't freeze the tab. They are supposed to be twins.
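The chunked twin can be pictured roughly as follows. This is a sketch under assumptions, not the module's code: buildCsvInChunks, rowAt, onDone and the injectable schedule parameter are hypothetical names, and rowAt stands in for the real per-row generator.

```javascript
// Sketch only: build a CSV in chunks, yielding between chunks.
// rowAt(i) must return the CSV line for row i (without a newline).
function buildCsvInChunks(header, rowAt, n, chunkSize, onDone, schedule) {
  // Default scheduler yields to the event loop; tests may pass a synchronous one.
  const defer = schedule || function (fn) { setTimeout(fn, 0); };
  const parts = [header + "\n"];
  let next = 0;
  function step() {
    const end = Math.min(next + chunkSize, n);
    const lines = [];
    for (let i = next; i < end; i++) lines.push(rowAt(i));
    if (lines.length) parts.push(lines.join("\n") + "\n");
    next = end;
    if (next < n) defer(step); // pause here so a progress label can repaint
    else onDone(parts);        // all rows built: hand back the string parts
  }
  step();
}

buildCsvInChunks("a,b", (i) => i + "," + i * 2, 5, 2, (parts) => {
  console.log(parts.join(""));
});
```

In a browser, the parts array could be wrapped with new Blob(parts, { type: "text/csv" }) and offered as a download. Nothing but the next row index crosses a chunk boundary, which is what per-row seeding buys.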
While the chunked variant was being built, the twins briefly disagreed about the warm-up — the per-row draw counts didn't match. Nothing complained. The browser file had the right header, the right row count, the right number of commas. Every value was plausible. The derived columns were even internally consistent — revenue still equaled units × unit_price exactly, because derived columns are computed from whatever was drawn, right or wrong. Every invariant a reasonable test would check still held.
But inside each row, every random value had slid along the draw sequence — a category column consuming its neighbor's draw, in every row. Node said one file; the browser said a different, equally convincing file. No single-runtime test can see this. The only test that catches it is the bluntest one imaginable: generate in both runtimes and compare the bytes.
That test now exists, and it runs against a real browser engine, not a simulation. Headless Chromium loads the published page, clicks the actual 10,000-row button, captures the actual download, and compares it to Node's output of the same spec:
const [dl] = await Promise.all([
  page.waitForEvent("download", { timeout: 60000 }),
  page.click('[data-gen="10000"]')
]);
const browserCsv = fs.readFileSync(await dl.path(), "utf8");
const nodeCsv = globalThis.DatasetGen.generate(
  DATASETS.find(d => d.slug === "sales"), 10000);
ok("generate: browser output byte-identical to node output (determinism)",
  browserCsv === nodeCsv,
  "first diff at " + [...browserCsv].findIndex((c, i) => c !== nodeCsv[i]));
On failure it prints the index of the first differing character — which, for a warm-up mismatch, points straight at the first random cell of row one. And the rule now lives exactly where the next editor will be standing when they break it:
// in generate() — sync, writes the committed files:
rnd(); rnd(); // warm-up: first draws from adjacent seeds correlate
// in generateAsync() — the in-browser 10k / 100k / 1M buttons:
rnd(); rnd(); // warm-up — MUST match generate() exactly (same bytes)
The per-row draw sequence — including the two draws that produce nothing — is part of the file format. Change it and you haven't tweaked an implementation detail; you've silently republished every file in the library.
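The failure mode is easy to reproduce in miniature. The sketch below builds a toy three-column file with a configurable number of warm-up draws; toyCsv and firstDiff are hypothetical names, and the column ranges are borrowed from the sales spec only for flavor. Both outputs satisfy revenue = units × unit_price, yet they differ. The firstDiff helper also covers one case the findIndex expression in the test misses: when the browser file is a strict prefix of the Node file, findIndex returns -1, whereas this version reports the length of the shorter string.

```javascript
// mulberry32 as listed earlier, repeated so this sketch runs on its own.
function mulberry32(a) {
  return function () {
    a |= 0; a = (a + 0x6D2B79F5) | 0;
    var t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Toy generator (hypothetical): warmup is the number of discarded draws per row.
function toyCsv(seed, n, warmup) {
  let out = "units,unit_price,revenue\n";
  for (let i = 0; i < n; i++) {
    const rnd = mulberry32((seed ^ Math.imul(i + 1, 0x9E3779B9)) | 0);
    for (let w = 0; w < warmup; w++) rnd();
    const units = 1 + Math.floor(rnd() * 12);
    const cents = 1900 + Math.floor(rnd() * 18001); // 19.00 to 199.00 inclusive
    out += units + "," + (cents / 100).toFixed(2) + "," +
      ((units * cents) / 100).toFixed(2) + "\n";
  }
  return out;
}

// Index of the first differing UTF-16 code unit, or -1 if identical.
function firstDiff(a, b) {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) if (a[i] !== b[i]) return i;
  return a.length === b.length ? -1 : len;
}

const syncFile = toyCsv(101, 50, 2);
const driftedFile = toyCsv(101, 50, 1); // one warm-up draw short
console.log(firstDiff(syncFile, syncFile));    // -1
console.log(firstDiff(syncFile, driftedFile)); // index of the first mismatch
```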
Specs are data, not code
Nothing in dataset-gen.js knows what a "sale" is. The thirteen datasets live in scripts/_dataset-specs.mjs as plain objects — column lists the generator interprets. Here's the sales table, trimmed:
{
  slug: "sales", seed: 101,
  cols: [
    { name: "date", kind: "date", start: "2024-01-01", days: 730 },
    { name: "region", kind: "pickw",
      values: ["North", "South", "East", "West", "Central"],
      weights: [3, 2, 2.5, 2, 1] },
    { name: "channel", kind: "pickw", values: ["Online", "Retail", "Partner"], weights: [5, 3, 2] },
    { name: "units", kind: "int", min: 1, max: 12 },
    { name: "unit_price", kind: "money", min: 19, max: 199 },
    { name: "revenue", kind: "mul", a: "units", b: "unit_price" }
  ] // trimmed: product column, page copy, FAQ entries
}
Two column kinds do most of the work. pickw is a weighted categorical — a draw scaled by the weight total and walked through the cumulative weights — so North outsells Central three to one and charts look like a business instead of uniform noise. The derived kinds compute one column from others: mul gives sales its revenue = units × unit_price invariant, add gives orders total = subtotal + shipping, and mulf keeps relationships sane by construction — a product's price is its cost × 1.6–3.2 so margin can't go negative, and marketing clicks are 0.5–6% of impressions so CTR is never nonsense.
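The weighted pick described above can be sketched in a few lines. The function name pickWeighted and its signature are assumptions for illustration; the draw u is passed in explicitly (in the generator it would be one call to rnd()) so the example stays deterministic.

```javascript
// Hypothetical helper: u is a uniform draw in [0, 1).
function pickWeighted(values, weights, u) {
  let total = 0;
  for (const w of weights) total += w;
  const target = u * total; // scale the draw by the weight total
  let running = 0;
  for (let i = 0; i < values.length; i++) {
    running += weights[i];  // walk the cumulative weights
    if (target < running) return values[i];
  }
  return values[values.length - 1]; // guard against floating-point round-off
}

const regions = ["North", "South", "East", "West", "Central"];
const regionWeights = [3, 2, 2.5, 2, 1]; // total 10.5
console.log(pickWeighted(regions, regionWeights, 0));    // "North"
console.log(pickWeighted(regions, regionWeights, 0.5));  // "East"  (5.25 falls in 5..7.5)
console.log(pickWeighted(regions, regionWeights, 0.99)); // "Central"
```

Each value owns a slice of [0, total) proportional to its weight, so North's slice (3 of 10.5) is three times Central's (1 of 10.5).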
Because specs are data, one script fans each object out into everything else. node scripts/gen-datasets.mjs writes the committed 100- and 1,000-row files; builds the dataset's page (preview table produced by the same generator, schema table, FAQ); emits Dataset JSON-LD carrying the CC0 license and the exact contentSize of every file; and registers the page in the sitemap and the redirects, idempotently. The page itself embeds only { seed, cols } for the buttons. Adding dataset fourteen means writing one object.
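Since contentSize has to be a byte count and not a character count, the measurement is worth spelling out. The sketch below is an assumption-laden illustration: the function name dataDownload, the baseUrl parameter, the example.com address and the "<n> B" string format are all hypothetical, and the article does not show how gen-datasets.mjs actually builds its JSON-LD.

```javascript
// Hypothetical builder for one schema.org DataDownload entry.
function dataDownload(baseUrl, fileName, csvText) {
  // Count UTF-8 bytes; string length would undercount any non-ASCII character.
  const bytes = new TextEncoder().encode(csvText).length;
  return {
    "@type": "DataDownload",
    encodingFormat: "text/csv",
    contentUrl: baseUrl + "/" + fileName,
    contentSize: bytes + " B" // assumed format; schema.org leaves this as free text
  };
}

const entry = dataDownload("https://example.com/datasets", "demo.csv", "a,b\né,1\n");
console.log(entry.contentSize); // "9 B" (the string itself has length 8)
```

This is why regenerated bytes matter to structured data: any change to the generator's output changes this number, and the published JSON-LD has to change with it.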
And CC0 isn't generosity — it's load-bearing. The entire point of deterministic sample data is that other people build on it, and people only cite files they're allowed to take.
Determinism turns fake data into infrastructure
Most fake data is disposable: generated, used, never the same twice — and usually that's fine. Pinning the bytes costs three decisions. Fix a seed per dataset. Make every row a pure function of (seed, row index) so the sync and chunked paths can't drift apart. Then treat the per-row draw sequence as a published format, with a byte-diff test in a real browser standing behind it.
In exchange, the files stop being filler and become something you can cite: tests that reproduce on any machine, lesson answer keys that don't rot, and a million-row benchmark anyone can generate locally and trust to match yours — byte for byte.