How to Install Apache Parquet on Docker: A Complete Step-by-Step Guide

Introduction

In today’s data-driven world, Apache Parquet has become a go-to choice for efficient data storage and analytics. It’s a columnar storage format optimized for performance, compression, and compatibility with big data frameworks such as Apache Spark, Hadoop, and Drill.

If you want to test or work with Parquet files in a lightweight and isolated environment, Docker offers the perfect solution. In this tutorial, we’ll walk you through the process of installing Apache Parquet inside a Docker container, along with the necessary Python tools to read and write Parquet data.

What Is Apache Parquet?

Apache Parquet is an open-source, column-oriented data storage format designed for efficient data analytics. Unlike row-based storage formats (like CSV or JSON), Parquet organizes data by columns, which reduces storage space and improves query performance.

Key Features of Apache Parquet:

  • Columnar storage for faster analytical queries (see the short example after this list).
  • Efficient compression and encoding for reduced disk space.
  • Compatibility with major data processing frameworks (Spark, Hive, Drill, etc.).
  • Schema evolution support for flexible data structures.
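
To see the columnar layout pay off in practice, you can ask pandas to read just one column; a Parquet reader can do this without scanning the rest of the file. A minimal sketch, assuming a file named sample.parquet already exists (we create exactly such a file later in this tutorial):

import pandas as pd

# Because Parquet is columnar, only the requested column is read from disk
names = pd.read_parquet('sample.parquet', columns=['Name'])
print(names)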

Why Use Docker for Apache Parquet?

Using Docker provides a controlled, portable environment to test and develop applications that read or write Parquet files. Here are a few advantages:

  1. No environment conflicts: Each container runs in isolation.
  2. Quick setup: Get started without installing multiple dependencies locally.
  3. Reproducibility: The same Docker image can be shared and run anywhere.
  4. Integration testing: Perfect for integrating Parquet operations into data pipelines.

Prerequisites

Before you begin, make sure you have the following:

  • Docker Engine (version 20.10 or later)
  • Basic knowledge of Docker commands
  • Internet access to pull images from Docker Hub

You can verify that Docker is installed by running:

docker --version

Step 1: Create a Working Directory

Start by creating a project directory to keep things organized:

mkdir parquet-docker
cd parquet-docker

Inside this directory, we’ll create a Dockerfile and a simple Python script to work with Parquet files.
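
By the end of Step 3, the layout will look like this:

parquet-docker/
├── Dockerfile
└── parquet_test.py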

Step 2: Create a Dockerfile

The Dockerfile defines the environment in which you'll create and read Parquet files. Since Parquet is well supported in Python through pandas and PyArrow, we'll use the official Python base image.

Create a file named Dockerfile and add the following:

# Use the official Python image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install necessary Python packages
RUN pip install pandas pyarrow fastparquet

# Copy project files
COPY . .

# Default command
CMD ["python", "parquet_test.py"]

This Dockerfile does the following:

  • Uses a lightweight Python image.
  • Installs pandas, pyarrow, and fastparquet for Parquet operations.
  • Copies the local files into the container.
  • Runs a test script when the container starts.
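
One optional refinement: since COPY . . copies everything in the project directory into the image, a .dockerignore file keeps generated or local-only artifacts out of the build. A minimal example:

# .dockerignore: exclude generated files from the image
sample.parquet
__pycache__/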

Step 3: Create a Python Script to Test Parquet

Next, create a simple Python file named parquet_test.py in the same directory.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'Tokyo']
}

df = pd.DataFrame(data)

# Write DataFrame to a Parquet file
df.to_parquet('sample.parquet', engine='pyarrow')

print("Parquet file successfully created!")

# Read the Parquet file back
df_read = pd.read_parquet('sample.parquet')
print("Data read from Parquet file:")
print(df_read)

This script will:

  1. Create a sample DataFrame.
  2. Write it into a Parquet file using pyarrow.
  3. Read the file back and display the content.
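
If you want to verify what was written without loading the data, PyArrow can read just the file's footer metadata. A small optional sketch (it relies on the pyarrow package we installed in the Dockerfile):

import pyarrow.parquet as pq

# Read the schema from the file footer without loading any rows
print(pq.read_schema('sample.parquet'))

# ParquetFile exposes row-group, row-count, and size metadata
print(pq.ParquetFile('sample.parquet').metadata)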

Step 4: Build the Docker Image

Now that you have your Dockerfile and Python script, build the Docker image using:

docker build -t parquet-demo .

Docker will download the base image, install the dependencies, and produce an image tagged parquet-demo.
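
You can confirm the image exists by listing it:

docker images parquet-demo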

Step 5: Run the Container

Once the image is built, you can run it with:

docker run --name parquet_container parquet-demo

Expected output:

Parquet file successfully created!
Data read from Parquet file:
      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35     Tokyo

You have successfully created and tested a Parquet file inside a Docker container.
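
Note that sample.parquet was written inside the container's filesystem, so it disappears when the container is removed. If you'd rather have the file land directly on your host, you can mount the project directory as a volume; the bind mount shadows the image's /app, which works here because parquet_test.py also lives in your project directory:

docker run --rm -v "$(pwd)":/app parquet-demo

With --rm, the container removes itself after the run, and sample.parquet appears in your local parquet-docker folder.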

Step 6: Accessing the Parquet File

Because the container exits as soon as the script finishes, docker exec -it parquet_container /bin/bash will fail with a "container is not running" error. Fortunately, docker cp works on stopped containers, so you can copy the generated file straight to your local system:

docker cp parquet_container:/app/sample.parquet .

If you want to explore interactively instead, start a fresh container that runs a shell in place of the default command:

docker run --rm -it parquet-demo /bin/bash

Inside it, run python parquet_test.py and then ls -l; you should see sample.parquet in the /app working directory. (If you used the volume-mount variant from Step 5, the file is already in your local folder.)

Step 7: Clean Up Resources

When you’re done testing, remove the container and image to free up space:

docker rm parquet_container
docker rmi parquet-demo
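
Optionally, docker system prune clears remaining stopped containers, dangling images, and build cache in one sweep. Review its confirmation prompt carefully, since it affects more than just this tutorial's resources:

docker system prune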

Troubleshooting Tips

  • Issue: ModuleNotFoundError: No module named 'pyarrow'
    Possible cause: the package was not installed during the build.
    Solution: rebuild the image and make sure the pip install line is present in the Dockerfile.

  • Issue: file-not-found errors when reading or writing sample.parquet
    Possible cause: wrong working directory inside the container.
    Solution: confirm that WORKDIR /app is set correctly in the Dockerfile.

  • Issue: docker build fails
    Possible cause: a network or permissions problem.
    Solution: check your connection and try running Docker with elevated privileges.

Best Practices for Using Apache Parquet in Docker

  • Use volumes to persist data outside containers.
  • Pin dependency versions in your Dockerfile for reproducibility (see the sketch after this list).
  • Integrate Parquet with Spark or Dask inside the same Docker environment for scalable analytics.
  • Leverage Docker Compose if you want to run Parquet operations alongside databases or analytics engines.
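
To act on the version-pinning advice above, replace the unpinned pip install line in the Dockerfile with exact versions. The numbers below are placeholders rather than recommendations; pin the versions you have actually tested:

# Pin exact versions for reproducible builds (versions shown are illustrative)
RUN pip install pandas==2.2.2 pyarrow==16.1.0 fastparquet==2024.5.0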

Conclusion

In this tutorial, we've built a Dockerized Python environment for creating and reading Apache Parquet files. By containerizing your Parquet setup, you can easily test, share, and deploy data processing applications without worrying about dependency conflicts.

With Docker and Parquet combined, you now have a powerful, efficient, and portable data processing workflow ready for modern analytics.
