Exploring Trino with Apache Iceberg: My Hands-On Experiment

In this short write-up, I'll share my hands-on experiment pairing Trino with Apache Iceberg.
Table of Contents
- Introduction: Why Trino + Apache Iceberg Caught My Eye
- What is Trino?
- A Quick Look at Apache Iceberg
- Why Pair Trino with Iceberg?
- How Trino Works (In Plain English)
- Setting Up My Trino + Iceberg Playground
- Prerequisites
- Docker-based Setup
- Configuration Tips
- Running My First Query on Iceberg via Trino
- Common Gotchas I Encountered (And How I Fixed Them)
- Real-World Example: Building a Simple Data Lakehouse Query Layer
- Performance Notes from My Experiments
- Final Thoughts and Where to Go Next
1. Introduction: Why Trino + Apache Iceberg Caught My Eye
I’ve been playing with data lake technologies lately, and the combo of Trino and Apache Iceberg kept popping up in blogs, meetups, and developer forums.
Both tools are hot in the big data world — Trino for its lightning-fast SQL queries on pretty much anything, and Iceberg for its super-flexible table format that plays nice with huge datasets.
So, I decided to get my hands dirty and see what all the fuss is about.
2. What is Trino?
Trino (formerly PrestoSQL) is an open-source distributed SQL query engine designed for interactive analytics at scale. Think of it as a universal translator for data — it doesn’t store data itself, but it can query data from almost anywhere: object storage (like S3), relational databases, NoSQL systems, and, of course, table formats like Apache Iceberg.
What makes Trino stand out:
- Federated queries: You can join data from different sources in a single query.
- Speed: Optimized for interactive, low-latency analytics.
- Scalability: Designed to handle petabyte-scale datasets.
- Plugin-based connectors: Easy to extend with more data sources.
3. A Quick Look at Apache Iceberg
Apache Iceberg is a high-performance table format for huge analytic datasets. It solves the pain points of older formats like Hive tables:
- Schema evolution without rewrites.
- Partition evolution without breaking queries.
- ACID transactions for consistency.
- Hidden partitioning for simpler queries.
In short, Iceberg makes your data lake behave like a reliable SQL table — and Trino knows exactly how to talk to it.
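To make "schema evolution without rewrites" concrete, here is roughly what it looks like from Trino, using the sample table created later in this article. Both statements are metadata-only changes, so no data files get rewritten:

-- Add a column; existing files are untouched and old rows read back as NULL.
ALTER TABLE iceberg.test.sample_data ADD COLUMN email varchar;

-- Partition evolution (supported in recent Trino releases): new writes use the
-- new layout, old files keep their original partitioning.
ALTER TABLE iceberg.test.sample_data
SET PROPERTIES partitioning = ARRAY['month(created_at)'];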
4. Why Pair Trino with Iceberg?
The match is natural:
- Trino = super-fast SQL queries across many sources.
- Iceberg = modern, transactional, and scalable table format.
With these two, you get:
- A lakehouse-style architecture (data lake + data warehouse features).
- Ability to query massive datasets directly in object storage without ETL overhead.
- Flexibility to mix data from Iceberg tables with other systems in the same query.
5. How Trino Works (In Plain English)
When you run a SQL query in Trino:
- Coordinator Node: Parses the query and plans execution.
- Worker Nodes: Fetch and process data in parallel from the source (e.g., Iceberg in S3 or HDFS).
- Coordinator: Merges results and sends them back to you.
This architecture means Trino doesn’t need to “own” your data — it just reads it on demand.
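If you want to see that split for yourself, Trino can print the distributed plan for any query. A quick sketch, assuming the iceberg.test.sample_data table created later in this article:

EXPLAIN (TYPE DISTRIBUTED)
SELECT count(*) FROM iceberg.test.sample_data;

Each fragment in the output is a stage the coordinator hands to workers; the exchanges between fragments are where intermediate results flow back toward the coordinator.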
6. Setting Up My Trino + Iceberg Playground
Prerequisites
- Docker & Docker Compose
- Basic SQL knowledge
- A sample dataset (I used some public CSVs to start)
Docker-based Setup
Here's the docker-compose.yml I used for a quick setup:
version: '3.8'
services:
  trino:
    image: trinodb/trino:latest
    ports:
      - "8080:8080"
    volumes:
      - ./etc:/etc/trino
  minio:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
    environment:
      # Recent MinIO images use MINIO_ROOT_USER/MINIO_ROOT_PASSWORD;
      # the older MINIO_ACCESS_KEY/MINIO_SECRET_KEY variables are deprecated.
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
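With those two services defined, docker compose up -d brings the stack up; based on the port mappings above, the Trino web UI should be at http://localhost:8080 and MinIO's S3 API at http://localhost:9000.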
Configuration Tips
In /etc/trino/catalog/iceberg.properties (mounted into the container via the ./etc volume above):

connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://localhost:9083

Two caveats: recent Trino releases spell the catalog type hive_metastore (older releases accepted hive), and the Compose file above doesn't include a Hive Metastore service, so the thrift URI assumes one is already running. If you run the metastore as another Compose service, point the URI at its service name rather than localhost.
7. Running My First Query on Iceberg via Trino
Once Trino was running, I opened the Trino CLI and tried:
SHOW CATALOGS;
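Before creating a table, the test schema needs to exist. A minimal sketch; depending on how your metastore is configured, you may also need an explicit location pointing at a MinIO bucket (the bucket name below is hypothetical):

CREATE SCHEMA IF NOT EXISTS iceberg.test;
-- If the metastore has no default warehouse directory:
-- CREATE SCHEMA IF NOT EXISTS iceberg.test WITH (location = 's3a://warehouse/test');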
Then I created a sample table:
CREATE TABLE iceberg.test.sample_data (
  id bigint,
  name varchar,
  created_at timestamp
);
And inserted some rows. Querying them back was instant — even with a small setup on my laptop.
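For reference, the round trip looked roughly like this (illustrative values, not my actual dataset):

INSERT INTO iceberg.test.sample_data
VALUES
  (1, 'alice', TIMESTAMP '2024-01-15 09:30:00'),
  (2, 'bob', TIMESTAMP '2024-01-15 10:05:00');

SELECT * FROM iceberg.test.sample_data ORDER BY id;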
8. Common Gotchas I Encountered (And How I Fixed Them)
- Permissions with MinIO: Make sure your access/secret keys match across the MinIO and Trino catalog configs.
- Schema refresh: Sometimes Trino caches metadata. Trino has no Spark-style REFRESH TABLE statement; flushing the metadata cache (for example via the connector's system.flush_metadata_cache() procedure, where your version supports it) helps.
- Version mismatch: Make sure the Trino release you run supports the Iceberg table format version (v1 or v2) of your tables.
9. Real-World Example: Building a Simple Data Lakehouse Query Layer
Let’s say you have:
- User activity logs in Iceberg tables.
- Reference data in a PostgreSQL database.
With Trino, you can join them directly:
SELECT u.name, COUNT(a.event_id) AS total_actions
FROM iceberg.logs.activity a
JOIN postgresql.public.users u
ON a.user_id = u.id
GROUP BY u.name
ORDER BY total_actions DESC;
This single query spans two systems without moving the data.
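For that join to work, Trino needs a postgresql catalog registered alongside the iceberg one. A minimal sketch of etc/trino/catalog/postgresql.properties, with placeholder host, database name, and credentials:

connector.name=postgresql
connection-url=jdbc:postgresql://postgres-host:5432/appdb
connection-user=trino
connection-password=secret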
10. Performance Notes from My Experiments
Even with a modest local setup, I noticed:
- Queries on Iceberg tables are noticeably faster than on raw files in S3/MinIO.
- Partition pruning in Iceberg significantly reduces scan times (see the sketch after this list).
- Adding Trino workers improved query throughput roughly linearly in my tests.
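The partition pruning point is easy to try yourself. A sketch using a hypothetical events table partitioned by day; thanks to Iceberg's hidden partitioning, a plain timestamp filter is all Trino needs to skip irrelevant files:

CREATE TABLE iceberg.test.events (
  event_id bigint,
  user_id bigint,
  event_time timestamp(6)
)
WITH (partitioning = ARRAY['day(event_time)']);

-- Scans only the files for 2024-01-15; no explicit partition column required.
SELECT count(*)
FROM iceberg.test.events
WHERE event_time >= TIMESTAMP '2024-01-15 00:00:00'
  AND event_time < TIMESTAMP '2024-01-16 00:00:00';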
11. Final Thoughts and Where to Go Next
My experiment confirmed the hype: Trino + Apache Iceberg is a killer combo for modern analytics.
It’s fast, flexible, and doesn’t lock you into one storage engine or format.
Next on my list:
- Trying Trino’s Iceberg REST catalog.
- Testing with large-scale datasets on AWS S3.
- Benchmarking Trino vs. Spark SQL on Iceberg.
If you’re building a lakehouse or want to query massive datasets without the pain of traditional ETL, I highly recommend giving this setup a spin.