Reading Parquet from Elixir and Mojo Without Pretending the Runtime Is Native

Outcome focus: Reader can ship Parquet-reading Elixir without surprise source compilation in CI, recognize where Mojo's Python interop boundary is the bottleneck rather than Mojo itself, and know which DataFrame guarantees leak at the BEAM and PyArrow boundaries.

Part 2 of 3. Part 1: How I Read Parquet in Rust and Go Without an OOM. Part 3: Why I Reach for DuckDB When Reading Parquet from Swift or Zig.

A Phoenix deploy ran for fourteen minutes and then failed. The build container did not have cargo installed. A precompiled NIF in the dependency graph did not ship an artifact for the target triple, and mix deps.compile quietly fell back to building from source. The container hit the timeout before the Rust toolchain had finished downloading.

The application read Parquet through Explorer.DataFrame.from_parquet/2. Explorer wraps Polars through Rustler-based NIFs, distributed by RustlerPrecompiled so that consumers do not need a Rust compiler at install time. That promise holds when the published artifact set covers the consumer's target. It does not hold for every dependency in every release.

The lesson is not "do not use Explorer." Explorer is excellent. The lesson is that a borrowed runtime is a deploy surface, not an implementation detail. When the runtime crosses a NIF, FFI, or Python boundary, the boundary needs to be on the operations diagram next to the database and the queue.

The Shape of the NIF Failure

The shape is the same regardless of which dependency triggers it. Some library in the application's transitive graph compiles a Rust crate and packages it as a precompiled NIF. The library publishes binaries for a list of target triples. The deploy environment uses a triple that is not on the list. mix deps.compile falls back to source compilation, which requires cargo. If cargo is not in the build container, the build hangs while it tries and eventually fails.

The current versions of Explorer ship precompiled artifacts for the common targets, including aarch64-unknown-linux-gnu, aarch64-unknown-linux-musl, x86_64-unknown-linux-gnu, x86_64-unknown-linux-musl, both Apple architectures, and Windows. That covers most production deploys. It does not cover every transitive dependency. A different Rustler-based crate added later in the project (a custom NIF, a less-popular ecosystem package) can ship a narrower artifact set and break the deploy on the same target the team was confident about.

I have seen this happen on Graviton (aarch64-unknown-linux-gnu) in 2024 with an older Explorer version, and on a glibc-mismatched Linux target with a transitive dependency that only published linux-musl artifacts. The variants are not interesting. The pattern is.
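The pattern can be caught before the build starts. The following is a minimal sketch of a CI preflight check, not part of any Elixir tooling: the `PUBLISHED_TARGETS` table, the `some_custom_nif` package, and both function names are hypothetical, and the artifact lists would have to be maintained by hand from each package's release assets.

```python
"""Preflight check: fail fast when a NIF dependency has no prebuilt artifact.

PUBLISHED_TARGETS is a hypothetical, hand-maintained table used only for
illustration; real lists live in each package's release assets.
"""
import shutil

PUBLISHED_TARGETS = {
    "explorer": {
        "aarch64-unknown-linux-gnu",
        "aarch64-unknown-linux-musl",
        "x86_64-unknown-linux-gnu",
        "x86_64-unknown-linux-musl",
    },
    # Hypothetical package with a narrower artifact set.
    "some_custom_nif": {"x86_64-unknown-linux-musl"},
}


def uncovered_deps(deploy_triple, published=PUBLISHED_TARGETS):
    """Return the dependencies that ship no artifact for deploy_triple."""
    return sorted(
        dep for dep, triples in published.items() if deploy_triple not in triples
    )


def preflight(deploy_triple, published=PUBLISHED_TARGETS, which=shutil.which):
    """Return an error message, or None when the build can proceed."""
    missing = uncovered_deps(deploy_triple, published)
    if missing and which("cargo") is None:
        return (
            f"no precompiled NIF for {deploy_triple} in {missing} "
            "and cargo is not on PATH"
        )
    return None
```

Run as the first CI step, a check like this turns a fourteen-minute timeout into an immediate, named failure. The `which` parameter exists only so the check can be exercised without a real toolchain.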

The Phoenix Dockerfile That Pays for Itself

The defensive shape is a multi-stage Dockerfile that installs the Rust toolchain in the build stage and ships the runtime image without it. The build stage compiles any NIFs that fall back to source. The runtime stage stays small.

# Build stage
FROM hexpm/elixir:1.17.3-erlang-27.1.2-ubuntu-noble-20240605 AS build
 
# Install Rust so any precompiled-NIF fall-through can build from source.
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential git curl ca-certificates \
  && rm -rf /var/lib/apt/lists/*
 
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \
    | sh -s -- -y --default-toolchain stable --profile minimal
ENV PATH="/root/.cargo/bin:${PATH}"
 
WORKDIR /app
ENV MIX_ENV=prod
 
COPY mix.exs mix.lock ./
RUN mix local.hex --force && mix local.rebar --force
RUN mix deps.get --only prod
RUN mix deps.compile
 
COPY . .
RUN mix release
 
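# Runtime stage: no Rust, no build toolchain
FROM ubuntu:noble AS runtime
# Generate the en_US.UTF-8 locale so the LANG setting below resolves.
RUN apt-get update && apt-get install -y --no-install-recommends \
    locales libstdc++6 openssl libncurses6 \
  && sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen \
  && rm -rf /var/lib/apt/lists/*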
 
WORKDIR /app
COPY --from=build /app/_build/prod/rel/myapp ./
 
ENV LANG=en_US.UTF-8
CMD ["bin/myapp", "start"]

Two operational notes. First, this Dockerfile produces a runtime image that does not contain cargo. The build is heavier but the production surface is smaller. Second, if the team uses a smaller base image for runtime, the libstdc++ and libncurses lines need to match what the Erlang runtime expects. Alpine images are appealing for size but introduce a glibc-versus-musl boundary that interacts with NIF artifact selection in ways that are easy to get wrong.

Explorer Reads, From the Top

Once the deploy is honest about the runtime boundary, Explorer is a good DataFrame surface for Elixir. The two read paths are eager and lazy.

# mix.exs
# {:explorer, "~> 0.11"}
 
# require (not alias) because DF.filter/2 is a macro.
require Explorer.DataFrame, as: DF
 
# Eager: load the whole Parquet file into a DataFrame. Good for small files
# and one-off reads. Use the bang version when raising on error is acceptable.
df = DF.from_parquet!("data.parquet")
IO.inspect(DF.shape(df))
IO.inspect(DF.dtypes(df))
DF.head(df, 5) |> DF.print()
 
# Eager with column projection. The :columns option asks Explorer to only
# materialize the columns the workload needs. For wide files, this is the
# cheapest performance win in Elixir Parquet code.
df_narrow = DF.from_parquet!("data.parquet",
  columns: ["account_id", "amount", "ts"]
)
 
# Lazy: open the file with lazy: true so the plan starts at the Parquet scan,
# records operations, and materializes them on collect/1. Use this when the
# dataset is large enough that filter and projection pushdown matter.
# Converting an already-loaded frame to lazy cannot push work into the read.
plan =
  "data.parquet"
  |> DF.from_parquet!(lazy: true)
  |> DF.filter(amount > 1000)
  |> DF.select(["account_id", "amount"])
 
result = DF.collect(plan)
DF.print(result)

The :columns option on from_parquet/2 is the most useful one for most pipelines, because it lets the underlying Polars engine skip column-page reads on the columns the query never uses. The other useful option is :max_rows, for sampling. Neither is new in 0.11; both are available in earlier Explorer releases as well.

The honest framing of the lazy story is that Explorer's lazy mode is a Polars query plan that runs through the NIF. Most operations transfer cleanly: filter, select, mutate, joins, aggregations. The places where the contract leaks are around custom kernels and types that do not have an Elixir-side equivalent. Series.cast/2 to a non-native type, user-defined Elixir functions inside a Polars expression, and complex regex rewrites are the most common failure points. Explorer surfaces the failure with a RuntimeError that names the unsupported operation, but the abstraction has briefly leaked at that point and the writer needs to either rewrite the operation in pure Polars terms or collect the plan to an eager DataFrame and finish the work in Elixir.

What the BEAM Boundary Actually Costs

Every call from BEAM into the Rust runtime crosses the NIF boundary and pays a fixed per-call cost. For batch operations on large columnar data, that cost is amortized over the work and is usually invisible. It becomes visible in two places: code that issues many small calls (a count here, a sum there, one per key), where the fixed overhead dominates the actual work, and code that converts large results into Elixir terms, where the data has to be copied out of the Rust side.

Figure: each call from BEAM into Polars crosses a NIF boundary. Reading Parquet is a Polars-side operation; the BEAM cost is amortized over the work the runtime does on the file.

The deployment shape that follows from this diagram is that Explorer scales with the underlying Polars engine, not with the BEAM. Adding more worker processes does not parallelize Parquet reads across the same file. Splitting the work across files, partitions, or row-group ranges and dispatching per-file reads to separate processes does. The Phoenix code that orchestrates that split lives on the BEAM. The Parquet reads live on the other side of the NIF.
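The split itself is plain arithmetic and is independent of the host language. As a rough sketch in Python (the function name is hypothetical, and the orchestration in a Phoenix application would live in Elixir), dividing a file's row groups into contiguous ranges, one per worker, might look like this:

```python
def split_row_groups(num_row_groups, num_workers):
    """Split range(num_row_groups) into at most num_workers contiguous ranges.

    Ranges differ in length by at most one, and no empty ranges are returned.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    workers = min(num_workers, num_row_groups)
    if workers == 0:
        return []
    base, extra = divmod(num_row_groups, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` workers take one additional row group each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

Contiguous ranges are used here on the assumption that neighbouring row groups sit next to each other on disk, which keeps each worker's reads sequential. Whether the reader on the other side of the NIF accepts a row-group range is a separate question to check against the library in use.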

Mojo's Case for Borrowing Python

Mojo is younger than Elixir's NIF story and has a different relationship to the runtime it borrows. Mojo can call Python directly through a typed wrapper:

from std.python import Python
 
def main() raises:
    pq = Python.import_module("pyarrow.parquet")
 
    table = pq.read_table("data.parquet")
    print("rows:", table.num_rows)
    print("columns:", table.num_columns)
 
    pf = pq.ParquetFile("large_file.parquet")
    print("row groups:", pf.metadata.num_row_groups)
 
    # process is a placeholder for the caller's per-batch logic;
    # it is not defined in this snippet.
    for batch in pf.iter_batches(batch_size=10000):
        process(batch)

The architectural choice here is intentional, not a workaround. Mojo's design assumes Python's ecosystem and treats interop as a first-class feature. For Parquet, that means PyArrow is the reader. PyArrow is the Python binding for the Apache Arrow C++ libraries, which include their own Parquet reader and writer; it is mature, well-tested, and used in production at scale.

The thing to be honest about is that Mojo programs that read Parquet through PyArrow inherit Python's runtime constraints. The build environment needs Python and PyArrow installed. The startup cost includes a Python interpreter. The performance ceiling is whatever PyArrow's internals can do. None of that is bad, and for analytical workloads that already live near Python, the integration is the right shape.

The GIL Boundary Inside PyArrow

The place where the borrowed-runtime cost shows up most clearly in Mojo is when the writer assumes that calling PyArrow from a parallelized Mojo path produces parallel speedup. PyArrow is careful about the GIL: it releases the lock during I/O-heavy operations like file reads, and it holds the lock during Python-side schema decoding. For a workload that reads many small Parquet files and parses each one's schema, the per-file Python work serializes on the GIL even though the bytes-on-disk part scales.

The fix is not to abandon the borrowed runtime. The fix is to put the work that releases the GIL inside the parallel section and the work that holds it outside. Concretely, this means reading the file's metadata once, using metadata.num_row_groups to enumerate the row groups, dispatching each row-group index to a worker that calls ParquetFile.read_row_group, and aggregating the results. (iter_batches yields record batches sequentially; it does not hand out row groups.) The row-group reads release the GIL more freely than the schema-construction call does.

Schema Evolution at the Borrowed-Runtime Boundary

Both Explorer and PyArrow tolerate Parquet schema evolution within the limits Polars and Arrow set. The cases that show up most often in production:

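A new column added to a writer between two batches of files. The older files simply lack the column. PyArrow's pyarrow.dataset API fills it with null when the dataset is opened with a schema that includes the new column; by default the schema is inferred from the first file, so the explicit schema matters. Explorer.Datasets only holds bundled sample data and has no directory reader, so on the Elixir side one route is to read each file, add the missing column filled with nil, and combine the frames with DF.concat_rows/1.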
A column whose type widened (int32 to int64, for instance). Both libraries promote the narrower values during the read. The widening is one-directional; narrowing the type later requires explicit casts.
A renamed column. Neither library reconciles renames automatically, because the metadata does not record the rename. The dataset reads as if the renamed column were two distinct columns across the file boundary.

The operational rule is to write schema-version metadata into the file alongside the data when the schema is going to evolve. Most production Parquet writers support custom file-level metadata. Reading that field out before the dataset scan is cheaper than reconstructing the rename history later.

Close

Treat the runtime boundary as a deploy surface. For Elixir, that means the Dockerfile knows whether cargo is in the build stage and the runtime stage knows it does not need to be. For Mojo, that means the GIL is on the architecture diagram, not in the appendix. Borrowed-runtime stacks are first-class for analytical Parquet work. They earn their place by giving the host language ergonomics on top of a battle-tested engine. They earn their cost by making the boundary something the team has to think about exactly once, when the application is being shaped, and then almost never again.