Why I Reach for DuckDB When Reading Parquet from Swift or Zig

Outcome focus: Reader can decide when DuckDB is the right Parquet path for a Swift or Zig project, configure the SPM and build.zig integrations correctly the first time, and avoid the binary-size and linker failures that the unconfigured path produces.

Part 3 of 3. Part 1: How I Read Parquet in Rust and Go Without an OOM. Part 2: Reading Parquet from Elixir and Mojo Without Pretending the Runtime Is Native.

A Swift team statically linked the DuckDB C library into an iOS app and watched the binary jump from about 80 MB to over 250 MB. The TestFlight upload warned about cellular-download size. The product team asked whether DuckDB was the wrong choice, since Swift does not really have a mature native Parquet library and shoehorning an analytical database into a phone seemed extreme.

It was not the wrong choice. It was a configuration choice with a setup cost that nobody had paid yet. The fix was to use the duckdb-swift package's pre-built binary target, strip symbols, and let the linker do its job. The same app shipped a few weeks later at around 41 MB.

This post is about why DuckDB is the right answer for Swift and Zig today, what the configuration looks like in each ecosystem, and where the SQL boundary inside an otherwise typed codebase pays back the developer-experience cost. It is also about being honest that "use DuckDB" is a real architectural decision, not a workaround for the absence of a native library.

The Argument for DuckDB

Writing a Parquet reader is a multi-year project. The format requires Thrift-encoded metadata, several compression codecs (Snappy, Zstd, Gzip, LZ4, Brotli), dictionary encoding, page indexes, bloom filters, and the Dremel-style repetition and definition levels for nested types. Doing this correctly is what the Apache Arrow project has been doing for a decade across Rust, Go, Java, C++, and Python. Doing it well in a sixth or seventh language is not a weekend project.

DuckDB has already done it. Its Parquet reader supports predicate pushdown (filters apply during the scan), projection pushdown (only requested columns get decoded), parallel I/O across row groups, and all production compression codecs. That work is in the engine, not in the binding. Any language that can call the DuckDB C library inherits the work.

The cost is binary size, a SQL string boundary inside an otherwise typed codebase, and the platform-specific build configuration that this post is mostly about. The benefit is everything else.

The Swift Setup That Matters

The Swift binding ships through the Swift Package Manager. The DuckDB Swift package recommends a major-version pin against the 1.x series and provides a binary target that does not require the consumer to compile DuckDB from source. The first thing to get right is that pin.

// swift-tools-version: 5.9
// Package.swift (the tools-version comment above must be the first line of the manifest)
import PackageDescription
 
let package = Package(
    name: "ParquetReader",
    platforms: [.macOS(.v13), .iOS(.v16)],
    dependencies: [
        .package(
            url: "https://github.com/duckdb/duckdb-swift",
            .upToNextMajor(from: .init(1, 0, 0))
        )
    ],
    targets: [
        .executableTarget(
            name: "ParquetReader",
            dependencies: [
                .product(name: "DuckDB", package: "duckdb-swift")
            ]
        )
    ]
)

Reading a Parquet file is then a SQL query against the file path:

import DuckDB
import Foundation
 
let db = try Database(store: .inMemory)
let conn = try db.connect()
 
// DuckDB infers the format from the .parquet extension; the function-call form
// (read_parquet) is also available when the path does not have the extension.
let result = try conn.query("""
    SELECT account_id, amount
      FROM 'data.parquet'
     WHERE amount > 1000
     ORDER BY amount DESC
     LIMIT 10
""")
 
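// Results are column-oriented: cast each column to a Swift type, then zip.
let accountIds = result[0].cast(to: String.self)
let amounts = result[1].cast(to: Double.self)
for (accountId, amount) in zip(accountIds, amounts) {
    print("\(accountId ?? ""): \(amount ?? 0.0)")
}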
 
// Glob patterns work for multi-file datasets.
let multi = try conn.query("""
    SELECT count(*)
      FROM 'data/*.parquet'
""")

The SQL boundary is real. Result columns come back untyped and require an explicit cast(to:) to a Swift type before their values can be read, and every value is optional because any cell may be NULL. For analytical workloads (counts, sums, top-N queries), this trades a small per-column ergonomic cost for the optimizer doing the actual work. For high-throughput per-row reads, the SQL boundary has overhead per call, and a per-row stream from a typed Parquet reader would be faster. Most Swift apps reading Parquet are doing analytical work, not per-row streaming.
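One way to keep the cast noise out of application code is to confine it to a single function that turns a result into typed values. The sketch below assumes the duckdb-swift API as I understand it (subscripting a result by column index, and cast(to:) returning a column of optionals). The AccountTotal struct and the topAccounts function are hypothetical names introduced for illustration, not part of the package.

```swift
import DuckDB

// Hypothetical typed row; not part of duckdb-swift.
struct AccountTotal: Equatable {
    let accountId: String
    let amount: Double
}

// Runs the query and converts the first two result columns into typed rows.
// Assumes column 0 is VARCHAR and column 1 is DOUBLE.
// Rows where either column is NULL are dropped rather than defaulted.
func topAccounts(_ conn: Connection, sql: String) throws -> [AccountTotal] {
    let result = try conn.query(sql)
    let ids = result[0].cast(to: String.self)
    let amounts = result[1].cast(to: Double.self)
    return zip(ids, amounts).compactMap { pair -> AccountTotal? in
        guard let id = pair.0, let amount = pair.1 else { return nil }
        return AccountTotal(accountId: id, amount: amount)
    }
}
```

The rest of the app then works with [AccountTotal] and never touches a result column directly, so the SQL string and the casts sit in one reviewable place.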

The Binary Size Conversation

The 250 MB binary above came from statically linking DuckDB without using the published binary target. The fix is to consume duckdb-swift through SPM (which selects the right pre-built binary for the platform) rather than linking against a self-built DuckDB. SPM will pick xcframework artifacts that are already compiled for the target architecture, and the App Store strip pass during distribution build will remove debug symbols. A typical iOS app embedding DuckDB this way lands somewhere between 30 and 50 MB of added size, depending on which DuckDB extensions are included.

A few additional knobs that matter for size budgets:

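Set DuckDB's access_mode option to READ_ONLY in the database configuration when the app only queries a file-backed database. Read-only mode does not shrink the binary, but it avoids creating a write-ahead-log file alongside the database. It applies only to file-backed databases; DuckDB rejects read-only mode for an in-memory store like the one in the example above.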
Bundle Parquet files as on-demand resources or fetched assets rather than baking them into the IPA. The size budget that matters is the IPA, not what the app downloads after first launch.
If the app never uses one of DuckDB's optional extensions (the spatial extension, the full-text-search extension), make sure the bundled binary does not include it.
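The read-only knob above can be sketched as follows. This assumes duckdb-swift exposes a Database.Configuration type with setValue(_:forKey:) that forwards to DuckDB's access_mode option, and a .file(at:) store; treat both as assumptions to check against the package version you pin. The openReadOnly function is a hypothetical helper.

```swift
import DuckDB
import Foundation

// Opens an existing file-backed database in read-only mode.
// Assumption: Database.Configuration.setValue(_:forKey:) passes the pair
// through to DuckDB's configuration (access_mode = READ_ONLY).
// The database is returned alongside the connection so the caller keeps it alive.
func openReadOnly(at url: URL) throws -> (Database, Connection) {
    let config = Database.Configuration()
    try config.setValue("READ_ONLY", forKey: "access_mode")
    let db = try Database(store: .file(at: url), configuration: config)
    let conn = try db.connect()
    return (db, conn)
}
```

The database file has to exist already, since a read-only open cannot create it. Parquet files queried by path are unaffected either way; the setting governs the DuckDB database file itself.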

The binary-size objection to DuckDB on iOS is real but solvable. The objection that does not have a clean solution is per-row hot-path performance, where the SQL boundary cost is real and a custom typed reader would beat it. Most apps do not have that workload.

The Zig Setup That Matters

Zig has no Parquet library worth using. Its excellent C interoperability makes DuckDB integration straightforward in code; the complexity is in the build. Linking against libduckdb requires explicit configuration that is not obvious from the code itself.

The reading code is small. The example below uses the legacy per-value accessor API for clarity. DuckDB's C API documentation marks those accessors as deprecated and recommends the data-chunk API (duckdb_fetch_chunk, which supersedes the older duckdb_result_get_chunk) for new C-API consumers because it amortizes the per-cell overhead that the per-value functions pay. For a small example, the per-value API is easier to read; for a production Zig service, the chunk API is the right default.

const std = @import("std");
const c = @cImport({
    @cInclude("duckdb.h");
});
 
pub fn main() !void {
    var db: c.duckdb_database = undefined;
    var con: c.duckdb_connection = undefined;
    var result: c.duckdb_result = undefined;
 
    if (c.duckdb_open(null, &db) != c.DuckDBSuccess) return error.OpenFailed;
    defer c.duckdb_close(&db);
 
    if (c.duckdb_connect(db, &con) != c.DuckDBSuccess) return error.ConnectFailed;
    defer c.duckdb_disconnect(&con);
 
    const query = "SELECT account_id, amount FROM 'data.parquet' WHERE amount > 1000 LIMIT 10";
    const state = c.duckdb_query(con, query, &result);
    // Destroy the result even when the query fails; the error message lives in it.
    defer c.duckdb_destroy_result(&result);
    if (state != c.DuckDBSuccess) {
        std.debug.print("query error: {s}\n", .{c.duckdb_result_error(&result)});
        return error.QueryFailed;
    }
 
    const row_count = c.duckdb_row_count(&result);
    var i: u64 = 0;
    while (i < row_count) : (i += 1) {
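        const account = c.duckdb_value_varchar(&result, 0, i);
        defer c.duckdb_free(@ptrCast(account));
        // A SQL NULL comes back as a null pointer; skip it rather than print it.
        if (account == null) continue;
        const amount = c.duckdb_value_double(&result, 1, i);
        std.debug.print("{s}: {d:.2}\n", .{ account, amount });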
    }
}

The compile-time work is in build.zig. The example below is the Zig 0.14 syntax and is the most widely deployed shape today. Zig 0.16, released April 14, 2026, moves several configuration calls from the executable onto the root module; if the project is on 0.16 or later, the same set of operations applies with exe.root_module.addIncludePath(...) and similar. The shape of what needs to happen is the same: tell Zig where the DuckDB header lives, where the library lives, and to link against it and libc.

const std = @import("std");
 
pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});
 
    const exe = b.addExecutable(.{
        .name = "parquet-reader",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
 
    // Adjust these paths for your platform. On macOS with Homebrew this is
    // typically /opt/homebrew/include and /opt/homebrew/lib. On Linux with
    // the libduckdb release zip extracted to /usr/local, these are correct.
    exe.addIncludePath(.{ .cwd_relative = "/usr/local/include" });
    exe.addLibraryPath(.{ .cwd_relative = "/usr/local/lib" });
    exe.linkSystemLibrary("duckdb");
    exe.linkLibC();
 
    b.installArtifact(exe);
 
    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| run_cmd.addArgs(args);
    const run_step = b.step("run", "Run the parquet reader");
    run_step.dependOn(&run_cmd.step);
}

The failure I have seen on this build is a Linux CI box where pkg-config reported the wrong include path (the apt-installed libduckdb-dev package landed the header in a non-standard location), and the build hit error: ld returned 1 exit status. The fix is to set addIncludePath and addLibraryPath explicitly rather than relying on system search. On platforms where libduckdb is shared rather than static, an addRPath line that names the runtime library directory keeps the loader from failing on launch. On Apple platforms, code signing wants the LC_RPATH entries to match the embedded library's signed location, which is one more reason to prefer a static link or a frameworks-style packaging when you can.

The libduckdb release archives live at github.com/duckdb/duckdb/releases. For DuckDB 1.5.x, the relevant files are libduckdb-linux-amd64.zip, libduckdb-linux-aarch64.zip, libduckdb-osx-universal.zip, and the matching Windows archives. Extract duckdb.h to the include path and libduckdb.so (or .dylib / .lib) to the library path, run ldconfig on Linux when using a system-wide install, and the Zig build links cleanly.

What the SQL Boundary Buys You

Swift and Zig issue SQL queries against DuckDB. The engine reads Parquet directly with predicate and projection pushdown. The SQL string is the boundary; everything after it is engine work.

DuckDB's Parquet documentation describes the optimizations the engine applies automatically. Two of them matter for any non-trivial workload.

Projection pushdown means the engine only decodes the columns that the query references. A SELECT account_id, amount FROM 'data.parquet' against an 80-column file reads two column chunks per row group, not 80. This is the same win that Polars provides through LazyFrame::scan_parquet in Rust, but inside a SQL boundary that any language with a DuckDB binding can use.

Filter pushdown means the engine consults Parquet zonemaps (column-chunk min/max statistics) to skip row groups that cannot match the WHERE clause. For a query like SELECT * FROM 'data.parquet' WHERE ts > '2026-04-01', row groups whose maximum timestamp is on or before that date are skipped without decompression. The performance gain depends on whether the Parquet writer emitted statistics, which most modern writers do, and on how well the data is clustered on the filter column: if every row group spans the whole timestamp range, nothing can be skipped.
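The combined effect of the two pushdowns is easy to model. The sketch below is a toy calculation in plain Swift, not DuckDB's implementation: the RowGroupStats type, the chunksToDecode function, and the integer timestamps are all hypothetical, chosen only to show how min/max statistics and a narrow column list shrink the work.

```swift
// Toy model of projection and filter pushdown; not DuckDB's implementation.
// Each row group carries min/max statistics for the filter column.
struct RowGroupStats {
    let minTs: Int   // smallest ts value in the row group
    let maxTs: Int   // largest ts value in the row group
}

// Returns how many column chunks a scan has to decode for
// `WHERE ts > threshold` when the query references `referencedColumns` columns.
func chunksToDecode(rowGroups: [RowGroupStats],
                    referencedColumns: Int,
                    threshold: Int) -> Int {
    // Filter pushdown: skip a row group whose max cannot satisfy ts > threshold.
    let surviving = rowGroups.filter { $0.maxTs > threshold }
    // Projection pushdown: only the referenced columns are decoded.
    return surviving.count * referencedColumns
}

let groups = [
    RowGroupStats(minTs: 0, maxTs: 99),
    RowGroupStats(minTs: 100, maxTs: 199),
    RowGroupStats(minTs: 200, maxTs: 299),
]
let withPushdown = chunksToDecode(rowGroups: groups, referencedColumns: 2, threshold: 150)
let withoutPushdown = groups.count * 80   // every column of every row group
print(withPushdown, withoutPushdown)      // prints "4 240"
```

In this model the first row group is skipped outright and the remaining two contribute two chunks each, against 240 chunks for a reader that decodes all 80 columns of all three groups.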

Where DuckDB Is the Wrong Answer

DuckDB is not the right choice for every Parquet use case in Swift or Zig. The cases where I would not reach for it:

Hard binary-size budgets. A watch-face widget or a Zig program targeting microcontrollers cannot absorb 30 MB of analytical engine.
Hot per-row read paths. The SQL boundary has overhead per query, and a custom typed reader streaming rows in a tight loop will beat DuckDB on sub-millisecond reads.
Workloads where the file is the schema. If the application is mostly bytes-in, bytes-out and the schema is dynamic at runtime, a row-oriented reader is a better match than a column-oriented engine.

For a Swift app doing analytics on a few bundled Parquet files, or a Zig service ingesting Parquet from object storage and writing aggregates somewhere downstream, DuckDB is almost always the right answer. The configuration is a one-time cost. The engine is mature. The SQL boundary is the API.

A Note on Roc, and on Platform-Delegated Parsing

Roc is a young functional language whose architecture separates pure application logic from effectful platform operations. Heavy parsing (Parquet, Thrift metadata, decompression) belongs in the platform layer, written in Rust, Zig, or C. The basic-cli platform reads bytes; it does not parse Parquet. A Roc application that needs Parquet support today would either build a custom platform that wraps DuckDB or Arrow, or it would preprocess data into a simpler format and feed Roc the result.

I am not covering Roc in this series because the platform-delegation work is itself a significant undertaking, the language ecosystem is too small for me to write code samples that I can verify compile, and the syntax-highlighting layer on this site does not yet have a Roc grammar. The pattern Roc would use is the one this post is about: an embedded engine with a typed boundary, doing the work the host language does not need to do twice. A future note will cover what that platform looks like once the ecosystem catches up.

Close

Use DuckDB when the host language does not have a mature native Parquet reader and the workload is analytical. Pin the major version in SPM, open file-backed databases with access_mode set to READ_ONLY in query-only Swift apps, and let the published binary target keep the IPA size honest. In Zig, set addIncludePath and addLibraryPath explicitly rather than trusting pkg-config on every distribution, link libc, and prefer the data-chunk API over the per-value accessors when the program is doing more than a small example. The SQL string is the boundary. Everything after it is years of engine work that the team does not need to do.