← Engineering notes

Real SQL on a static host, without breaking the ads

csvtodashboard.com is a pile of static files on Hostinger shared hosting. There is no server code anywhere — the closest thing to a backend is an .htaccess that rewrites /csv-sql-query to /csv-sql-query.html. The site is ad-supported, and every tool makes the same promise: your file never leaves the tab.

We still wanted real SQL. Not SQL-flavored filtering — actual JOINs, HAVING, window functions, CTEs, against a 100 MB CSV, in the browser. DuckDB compiles to WebAssembly, so the compute turned out to be the easy half. The hard half was a pair of HTTP response headers we are never allowed to send.

Real SQL on a static site sounds like a contradiction

It stops sounding like one when you stop equating SQL with a database server. A browser tab on a mid-range laptop has gigabytes of memory and a very good JIT; DuckDB-wasm is the same vectorized columnar engine you'd run from a CLI, compiled to wasm. Registering a CSV as a table and running a window function over it is comfortably inside what a tab can do.

What a static host can't give you is headers-on-demand, server-side fallbacks, or excuses. Apache serves bytes; everything else is the client's problem. That constraint shaped three decisions worth writing down: which DuckDB build we ship, when its bytes get downloaded, and what answers queries when DuckDB isn't there.

The header you can't have

Multi-threaded DuckDB-wasm is built on pthreads, which in a browser means SharedArrayBuffer. Ever since Spectre, SharedArrayBuffer only exists on pages that are crossOriginIsolated, which you opt into with two response headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

COEP: require-corp flips the default for every cross-origin subresource on the page: anything that doesn't explicitly opt in via CORS or a Cross-Origin-Resource-Policy header simply does not load. An AdSense placement is a chain of cross-origin iframes and scripts from Google's ad pipeline, and that chain doesn't send CORP headers — on a cross-origin-isolated page the ad slots just stay empty. (COEP: credentialless was supposed to soften this; browser support is still uneven and ad iframes still aren't reliable under it.) COOP: same-origin additionally severs window.opener relationships, breaking some measurement flows for good measure.

On an ad-supported site this is a genuine fork: threads or ads — pick one. We picked ads, on purpose, and ship the single-threaded build. The code says so at the exact spot where future-us would otherwise try to "fix" it (converter.jsx, trimmed):

// SQL-query "Power mode": lazy-load DuckDB-WASM (the SINGLE-THREAD build — DuckDB
// auto-picks it when crossOriginIsolated is false, so NO COOP/COEP headers are
// needed and AdSense is unaffected). Loaded from the CDN only when the user opts
// in (the engine is ~10MB+ gzipped — too heavy to self-host in the repo).
const DUCKDB_CDN = "https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.29.0/+esm";
…
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());

selectBundle() feature-detects: the coi bundle (the threaded one — a 35,272,630-byte wasm plus a separate pthread worker) is only ever chosen when crossOriginIsolated is true. We never send the headers, so it never is, and selection lands on the eh build — single-threaded, wasm exception handling, 35,659,694 bytes — or the mvp build (~39 MB) on engines without exception-handling support.

What does one thread cost? Scans and joins that would fan out across cores run serially. For the data a tab will realistically hold — tens of megabytes, occasionally a couple hundred — that's extra milliseconds to low seconds per query, not minutes. Against the alternative (no ads means no site), it isn't close.

Pay for the engine only when you ask for it

The whole application is one 482 KB bundle.js (494,266 bytes — no framework build, just JSX panels concatenated and compiled by a small script). Heavy libraries are self-hosted in /vendor and lazy-loaded the first time a tool needs them: SheetJS is 951,904 bytes on disk, jsPDF 365,730, Apache Arrow 172,843, hyparquet 160,837 plus a 53,917-byte Parquet writer. None of them are in the main bundle either.

You might expect DuckDB to sit in vendor/ next to them. It doesn't — it's the one dependency we don't self-host. The wasm artifact alone is ~34 MB uncompressed (roughly 6 MB over the wire with jsDelivr's brotli), which is a silly thing to mirror onto the shared host that also serves your HTML. So ensureDuckDB() dynamically imports a pinned version from jsDelivr, and only when a user actually ticks Power mode:

function ensureDuckDB() {
  if (window.__duckdbApi) return Promise.resolve(window.__duckdbApi);
  if (window.__duckdbLoading) return window.__duckdbLoading;
  window.__duckdbLoading = (async () => {
    const duckdb = await import(/* @vite-ignore */ DUCKDB_CDN);
    const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
    // Cross-origin Worker is blocked, so wrap the CDN worker in a same-origin blob.
    const workerUrl = URL.createObjectURL(new Blob(
      ['importScripts("' + bundle.mainWorker + '");'], { type: "text/javascript" }));
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
    /* …build the query api, then cache it… */
    window.__duckdbApi = api;
    return api;
  })().catch(e => { window.__duckdbLoading = null; throw e; });
  return window.__duckdbLoading;
}

Three small things are doing real work there. The promise is the cache: __duckdbApi short-circuits once the engine is up, __duckdbLoading dedupes concurrent first calls, and a failed load nulls the slot so the next attempt retries instead of memoizing the error. The worker is wrapped in a blob, because you can't construct a Worker from a cross-origin URL; a one-line importScripts() shim makes it same-origin. And nothing loads at page load/csv-sql-query weighs the same as every other page until the checkbox is ticked.

The UI is honest about the one-time cost instead of hiding it behind a generic spinner. The status line is literally a ternary on whether the engine is already cached:

setOutput(window.__duckdbApi ? "Running query in DuckDB…"
                             : "Loading the DuckDB engine (first run only)…");

and the checkbox label commits up front: "Loads a ~10 MB engine in your browser on first use; nothing is uploaded." Rounding the wire cost up beats surprising people.

Two engines, one query box

Power mode is a checkbox, not the default, because most queries don't need a database. The default engine is a regex-grammar SQL interpreter that lives inside the 482 KB bundle: zero download, zero spin-up, results re-render as you type. This is its entire parser (converter.jsx):

function parseSQL(sql) {
  const re = /^\s*select\s+(.+?)\s+from\s+(\w+)(?:\s+where\s+(.+?))?(?:\s+group\s+by\s+(.+?))?(?:\s+order\s+by\s+("[^"]+"|\[[^\]]+\]|`[^`]+`|[^\s]+)(?:\s+(asc|desc))?)?(?:\s+limit\s+(\d+))?\s*;?\s*$/i;
  const m = sql.match(re);
  if (!m) throw new Error("Couldn't parse SQL. Use: SELECT cols FROM csv [WHERE …] [GROUP BY col] [ORDER BY col [DESC]] [LIMIT n]");

One regular expression, and the engine behind it honestly covers:

  • SELECT with *, bare columns, or aggregates — COUNT(*), COUNT(col), SUM, AVG, MIN, MAX — each with an optional AS alias; identifiers can be quoted "like this", [like this] or `like this` for columns with spaces;
  • WHERE col OP value with =, !=, <>, <, >, <=, >= and LIKE (%/_ wildcards, case-insensitive), chained with AND/OR left-to-right — no parentheses;
  • GROUP BY across one or more columns, where a bare selected column must appear in the grouping — mismatches throw the same complaint real SQL would;
  • implicit whole-table aggregates: SELECT COUNT(*), AVG(price) FROM csv with no GROUP BY collapses the filtered table to a single row;
  • ORDER BY on a source column or a SELECT alias, ASC/DESC, numeric-aware; and LIMIT n.

The last two bullets exist because of the SQL-on-CSV course. Each of its ten lessons checks exercises entirely client-side: your query and a hidden reference query both run through this same in-bundle engine, and the resulting value matrices are compared — aliases and row order are forgiven unless the lesson is teaching ORDER BY (tools7.jsx):

// Exercises self-check ENTIRELY client-side: the learner's query and a hidden
// reference query both run through the same in-bundle SQL engine
// (Convert.sqlQuery) on the same data, and the VALUE matrices are compared —
// header names are ignored so any AS alias is accepted; row order is ignored
// unless the exercise teaches ORDER BY (ex.ordered).
function execute(q) {
  const r = Convert.sqlQuery(profile, { sql: q });
  const parsed = r.csv ? CSVTool.parseCSV(r.csv) : { headers: [], rows: [] };
  return { csv: r.csv, headers: parsed.headers, rows: parsed.rows, count: parsed.rows.length };
}

A course that pauses to download 34 MB before checking SELECT name FROM csv would be a bad course, so the checker was pinned to the instant engine — which immediately exposed what that engine couldn't do. The commit that shipped the lessons says it plainly: "SQL engine upgrade (caught by the course test): whole-table aggregates without GROUP BY, AS aliases on plain columns, ORDER BY alias." The curriculum forced the engine to grow. The pipeline builder's SQL step runs on the same instant engine for the same reason.

The same judgment call shows up inverted in /csv-join. Joins are exactly what DuckDB is for, and the join tool still doesn't use it: a multi-key hash join in plain JavaScript is instant and costs zero extra bytes — "no 10 MB DuckDB download for the 99% case", as the comment in tools6.jsx puts it — so the tool does the join itself and prints the equivalent DuckDB SQL alongside, for anyone who wants to carry the operation over to the real engine. In the whole codebase, DuckDB has exactly one consumer: the Power-mode path of the query tool, where JOIN, HAVING, window functions and CTEs genuinely need it.

Getting data in and out

Power mode's data path is deliberately boring. The CSV already in the tab — it arrived by drag-and-drop or paste, nothing was uploaded — is handed to DuckDB as an in-memory virtual file, and a real table is built from it with DuckDB's own type inference (converter.jsx, trimmed):

async query(sql, csvText) {
  const conn = await db.connect();
  try {
    if (registered !== csvText) {
      await db.registerFileText("__data.csv", csvText);
      await conn.query("CREATE OR REPLACE TABLE csv AS SELECT * FROM read_csv_auto('__data.csv', header=true)");
      registered = csvText;
    }
    const res = await conn.query(sql);
    const cols = res.schema.fields.map(f => f.name);
    /* …rows → plain objects → CSV text… */
  } finally { try { await conn.close(); } catch (e) {} }
}

registered !== csvText is the entire cache-invalidation strategy: the table is rebuilt only when the text actually changes, so iterating on a query against the same file skips the re-parse. Results come back as Arrow, get flattened to plain rows, and are re-serialized to CSV with a ten-line escaper — the output box and the download button only ever see a string.

Equally important is what doesn't happen. No OPFS, no IndexedDB, no persistence of any kind: close the tab and the database ceases to exist. And the ad iframes on the same page are cross-origin documents — they can't read the parent page's JS heap, which is the only place your data lives. The local-first promise and the ad-supported model turn out to be compatible; they just both had to be designed for.

That's the whole trick: refuse two headers, accept one thread, make the heavy thing lazy and the common thing instant. A static host never stopped being enough.

Try the pieces: /csv-sql-query runs both engines — tick Power mode (DuckDB) for JOINs, window functions and CTEs. /csv-join does the hash-join-with-equivalent-SQL trick. And /sql-on-csv is the ten-lesson course that runs entirely on the instant engine. Everything stays in your tab on all three.

More engineering notes

← All engineering notes