Iceberg Go

iceberg-go is a native Go implementation of the Apache Iceberg table spec, for accessing and managing Iceberg tables.

Feature Support / Roadmap

FileSystem Support

| Filesystem Type      | Supported |
| -------------------- | --------- |
| S3                   | X         |
| Google Cloud Storage | X         |
| Azure Blob Storage   | X         |
| Local Filesystem     | X         |

Metadata

| Operation              | Supported |
| ---------------------- | --------- |
| Get Schema             | X         |
| Get Snapshots          | X         |
| Get Sort Orders        | X         |
| Get Partition Specs    | X         |
| Get Manifests          | X         |
| Create New Manifests   | X         |
| Plan Scan              | X         |
| Plan Scan for Snapshot | X         |

Catalog Support

| Operation                   | REST | Hive | Glue | SQL |
| --------------------------- | ---- | ---- | ---- | --- |
| Load Table                  | X    |      | X    | X   |
| List Tables                 | X    |      | X    | X   |
| Create Table                | X    |      | X    | X   |
| Register Table              | X    |      | X    |     |
| Update Current Snapshot     | X    |      | X    | X   |
| Create New Snapshot         | X    |      | X    | X   |
| Rename Table                | X    |      | X    | X   |
| Drop Table                  | X    |      | X    | X   |
| Alter Table                 | X    |      | X    | X   |
| Check Table Exists          | X    |      | X    | X   |
| Set Table Properties        | X    |      | X    | X   |
| List Namespaces             | X    |      | X    | X   |
| Create Namespace            | X    |      | X    | X   |
| Check Namespace Exists      | X    |      | X    | X   |
| Drop Namespace              | X    |      | X    | X   |
| Update Namespace Properties | X    |      | X    | X   |
| Create View                 | X    |      |      | X   |
| Load View                   | X    |      |      |     |
| List View                   | X    |      |      | X   |
| Drop View                   | X    |      |      | X   |
| Check View Exists           | X    |      |      | X   |

Read/Write Data Support

  • Data can currently be read as an Arrow Table or as a stream of Arrow record batches.
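As a sketch of the read path, the snippet below scans a previously loaded table into an Arrow table. It assumes a `tbl` loaded via a catalog as shown in the Catalog section; the scan options and `ToArrowTable` method names reflect the current API and may differ between releases:

```go
// Scan the table's current snapshot and materialize it as an Arrow table.
// tbl is a *table.Table obtained from a catalog (see the Catalog section).
scan := tbl.Scan(table.WithLimit(100))

arrowTbl, err := scan.ToArrowTable(context.Background())
if err != nil {
    log.Fatal(err)
}
defer arrowTbl.Release()

fmt.Println("rows read:", arrowTbl.NumRows())
```

For large tables, reading a stream of record batches instead of a single Arrow table keeps memory bounded.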

Supported Write Operations

As long as the FileSystem is supported and the Catalog supports altering the table, the following tracks the current write support:

| Operation         | Supported |
| ----------------- | --------- |
| Append Stream     | X         |
| Append Data Files | X         |
| Rewrite Files     |           |
| Rewrite Manifests |           |
| Overwrite Files   |           |
| Write Pos Delete  |           |
| Write Eq Delete   |           |
| Row Delta         |           |

Get in Touch

Install

This quickstart shows how to install the library, use the CLI, and work with catalogs and tables.

Requirements

Go 1.23 or later is required to build.

Installation

To install the iceberg-go package, you need Go installed and a Go workspace set up. If you don't have a go.mod file yet, create one with go mod init followed by your module name.

  1. Download and install it:
     go get -u github.com/apache/iceberg-go
  2. Import it in your code:
     import "github.com/apache/iceberg-go"

CLI

Run go build ./cmd/iceberg from the root of this repository to build the CLI executable. Alternatively, run go install github.com/apache/iceberg-go/cmd/iceberg to install it into the bin directory of your GOPATH.

The iceberg CLI usage is very similar to the pyiceberg CLI. You can pass the catalog URI with the --uri argument.

Example: start the Iceberg REST API Docker image, which listens on port 8181 by default,

docker pull apache/iceberg-rest-fixture:latest
docker run -p 8181:8181 apache/iceberg-rest-fixture:latest

and run the iceberg CLI pointing to the REST API server.

 ./iceberg --uri http://0.0.0.0:8181 list
┌─────┐
| IDs |
| --- |
└─────┘

Create Namespace

./iceberg --uri http://0.0.0.0:8181 create namespace taxitrips

List Namespace

 ./iceberg --uri http://0.0.0.0:8181 list
┌───────────┐
| IDs       |
| --------- |
| taxitrips |
└───────────┘


Catalog

Catalog is the entry point for accessing iceberg tables. You can use a catalog to:

  • Create and list namespaces.
  • Create, load, and drop tables.

The REST, SQL, and Glue catalogs are implemented, with other catalog types under active development. Here is an example of how to create a RestCatalog:

import (
    "context"
    "log"

    "github.com/apache/iceberg-go/catalog/rest"
)

// Create a REST catalog
cat, err := rest.NewCatalog(context.Background(), "rest", "http://localhost:8181",
    rest.WithOAuthToken("your-token"))
if err != nil {
    log.Fatal(err)
}

You can run the following code to list all root namespaces:

// List all root namespaces
namespaces, err := cat.ListNamespaces(context.Background(), nil)
if err != nil {
    log.Fatal(err)
}

for _, ns := range namespaces {
    fmt.Printf("Namespace: %v\n", ns)
}

Then you can run the following code to create a namespace:

// Create a namespace
namespace := catalog.ToIdentifier("my_namespace")
err = cat.CreateNamespace(context.Background(), namespace, nil)
if err != nil {
    log.Fatal(err)
}

Other Catalog Types

SQL Catalog

You can also use SQL-based catalogs:

import (
    "context"
    "log"

    "github.com/apache/iceberg-go"
    "github.com/apache/iceberg-go/catalog"
    "github.com/apache/iceberg-go/io"
)

// Create a SQLite catalog
cat, err := catalog.Load(context.Background(), "local", iceberg.Properties{
    "type":               "sql",
    "uri":                "file:iceberg-catalog.db",
    "sql.dialect":        "sqlite",
    "sql.driver":         "sqlite",
    io.S3Region:          "us-east-1",
    io.S3AccessKeyID:     "admin",
    io.S3SecretAccessKey: "password",
    "warehouse":          "file:///tmp/warehouse",
})
if err != nil {
    log.Fatal(err)
}

Glue Catalog

For AWS Glue integration:

import (
    "context"
    "log"

    "github.com/apache/iceberg-go/catalog"
    "github.com/apache/iceberg-go/catalog/glue"
    "github.com/aws/aws-sdk-go-v2/config"
)

// Create AWS config
awsCfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
    log.Fatal(err)
}

// Create Glue catalog
cat := glue.NewCatalog(glue.WithAwsConfig(awsCfg))

// Create a table in Glue (assumes a schema has been defined,
// as shown in the Table section below)
tableIdent := catalog.ToIdentifier("my_database", "my_table")
tbl, err := cat.CreateTable(
    context.Background(),
    tableIdent,
    schema,
    catalog.WithLocation("s3://my-bucket/tables/my_table"),
)
if err != nil {
    log.Fatal(err)
}

Table

After creating a catalog, you can manage tables through it.

You can use the following code to create a table:

import (
    "github.com/apache/iceberg-go"
    "github.com/apache/iceberg-go/catalog"
    "github.com/apache/iceberg-go/table"
)

// Create a simple schema
schema := iceberg.NewSchemaWithIdentifiers(1, []int{2},
    iceberg.NestedField{ID: 1, Name: "foo", Type: iceberg.PrimitiveTypes.String, Required: false},
    iceberg.NestedField{ID: 2, Name: "bar", Type: iceberg.PrimitiveTypes.Int32, Required: true},
    iceberg.NestedField{ID: 3, Name: "baz", Type: iceberg.PrimitiveTypes.Bool, Required: false},
)

// Create table identifier
tableIdent := catalog.ToIdentifier("my_namespace", "my_table")

// Create table with optional properties
tbl, err := cat.CreateTable(
    context.Background(),
    tableIdent,
    schema,
    catalog.WithProperties(map[string]string{"owner": "me"}),
    catalog.WithLocation("s3://my-bucket/tables/my_table"),
)
if err != nil {
    log.Fatal(err)
}

You can also load an existing table directly:

// Load an existing table
tbl, err := cat.LoadTable(context.Background(), tableIdent, nil)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Table: %s\n", tbl.Identifier())
fmt.Printf("Location: %s\n", tbl.MetadataLocation())

Schema Creation

Here are some examples of creating different types of schemas:

// Simple schema with primitive types
simpleSchema := iceberg.NewSchemaWithIdentifiers(1, []int{2},
    iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.PrimitiveTypes.Int32, Required: true},
    iceberg.NestedField{ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
    iceberg.NestedField{ID: 3, Name: "active", Type: iceberg.PrimitiveTypes.Bool, Required: false},
)

// Schema with nested struct
nestedSchema := iceberg.NewSchemaWithIdentifiers(1, []int{1},
    iceberg.NestedField{ID: 1, Name: "person", Type: &iceberg.StructType{
        FieldList: []iceberg.NestedField{
            {ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
            {ID: 3, Name: "age", Type: iceberg.PrimitiveTypes.Int32, Required: true},
        },
    }, Required: false},
)

// Schema with list and map types
complexSchema := iceberg.NewSchemaWithIdentifiers(1, []int{1},
    iceberg.NestedField{ID: 1, Name: "tags", Type: &iceberg.ListType{
        ElementID: 2, Element: iceberg.PrimitiveTypes.String, ElementRequired: true,
    }, Required: false},
    iceberg.NestedField{ID: 3, Name: "metadata", Type: &iceberg.MapType{
        KeyID: 4, KeyType: iceberg.PrimitiveTypes.String,
        ValueID: 5, ValueType: iceberg.PrimitiveTypes.String, ValueRequired: true,
    }, Required: false},
)

Table Operations

Here are some common table operations:

// List tables in a namespace
tables := cat.ListTables(context.Background(), catalog.ToIdentifier("my_namespace"))
for tableIdent, err := range tables {
    if err != nil {
        log.Printf("Error listing table: %v", err)
        continue
    }
    fmt.Printf("Table: %v\n", tableIdent)
}

// Check if table exists
exists, err := cat.CheckTableExists(context.Background(), tableIdent)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Table exists: %t\n", exists)

// Drop a table
err = cat.DropTable(context.Background(), tableIdent)
if err != nil {
    log.Fatal(err)
}

// Rename a table
fromIdent := catalog.ToIdentifier("my_namespace", "old_table")
toIdent := catalog.ToIdentifier("my_namespace", "new_table")
renamedTable, err := cat.RenameTable(context.Background(), fromIdent, toIdent)
if err != nil {
    log.Fatal(err)
}

Working with Table Metadata

Once you have a table, you can access its metadata and properties:

// Access table metadata
metadata := tbl.Metadata()
fmt.Printf("Table UUID: %s\n", metadata.TableUUID())
fmt.Printf("Format version: %d\n", metadata.Version())
fmt.Printf("Last updated: %d\n", metadata.LastUpdatedMillis())

// Access table schema
schema := tbl.Schema()
fmt.Printf("Schema ID: %d\n", schema.ID)
fmt.Printf("Number of fields: %d\n", schema.NumFields())

// Access table properties
props := tbl.Properties()
fmt.Printf("Owner: %s\n", props["owner"])

// Access current snapshot
if snapshot := tbl.CurrentSnapshot(); snapshot != nil {
    fmt.Printf("Current snapshot ID: %d\n", snapshot.SnapshotID)
    fmt.Printf("Snapshot timestamp: %d\n", snapshot.TimestampMs)
}

// List all snapshots
for _, snapshot := range tbl.Snapshots() {
    fmt.Printf("Snapshot %d: %s\n", snapshot.SnapshotID, snapshot.Summary.Operation)
}

Creating Tables with Partitioning

You can create tables with partitioning:

import (
    "github.com/apache/iceberg-go"
    "github.com/apache/iceberg-go/catalog"
)

// Create schema
schema := iceberg.NewSchemaWithIdentifiers(1, []int{1},
    iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.PrimitiveTypes.Int32, Required: true},
    iceberg.NestedField{ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
    iceberg.NestedField{ID: 3, Name: "date", Type: iceberg.PrimitiveTypes.Date, Required: false},
)

// Create partition spec
partitionSpec := iceberg.NewPartitionSpec(
    iceberg.PartitionField{SourceID: 3, FieldID: 1000, Transform: iceberg.IdentityTransform{}, Name: "date"},
)

// Create table with partitioning
tbl, err := cat.CreateTable(
    context.Background(),
    tableIdent,
    schema,
    catalog.WithPartitionSpec(&partitionSpec),
    catalog.WithLocation("s3://my-bucket/tables/partitioned_table"),
)
if err != nil {
    log.Fatal(err)
}

Glossary

This glossary defines important terms used throughout the Iceberg ecosystem, organized in tables for easy reference.

Core Concepts

| Term          | Definition |
| ------------- | ---------- |
| Catalog       | A centralized service that manages table metadata and provides a unified interface for accessing Iceberg tables. Catalogs can be implemented as Hive metastore, AWS Glue, REST API, or SQL-based solutions. |
| Table         | A collection of data files organized by a schema, with metadata tracking changes over time through snapshots. Tables support ACID transactions and schema evolution. |
| Schema        | The structure definition of a table, specifying field names, types, and whether fields are required or optional. Schemas are versioned and can evolve over time. |
| Snapshot      | A point-in-time view of a table's data, representing the state after a specific operation (append, overwrite, delete, etc.). Each snapshot contains metadata about the operation and references to data files. |
| Manifest      | A metadata file that lists data files and their metadata (location, partition information, record counts, etc.). Manifests are organized into manifest lists for efficient access. |
| Manifest List | A file that contains references to manifest files for a specific snapshot, enabling efficient discovery of data files without reading all manifests. |

Data Types

Primitive Types

| Type         | Description |
| ------------ | ----------- |
| boolean      | True/false values |
| int          | 32-bit integer values |
| long         | 64-bit integer values |
| float        | 32-bit single-precision floating point |
| double       | 64-bit double-precision floating point |
| date         | Date values (days since epoch) |
| time         | Time values (microseconds since midnight) |
| timestamp    | Timestamp values (microseconds since epoch) |
| timestamptz  | Timestamp with timezone |
| string       | UTF-8 encoded strings |
| uuid         | UUID values |
| binary       | Variable-length binary data |
| fixed[n]     | Fixed-length binary data of n bytes |
| decimal(p,s) | Decimal values with precision p and scale s |

Nested Types

| Type   | Description |
| ------ | ----------- |
| struct | Collection of named fields |
| list   | Ordered collection of elements |
| map    | Key-value pairs |

Operations

| Operation | Description |
| --------- | ----------- |
| Append    | An operation that adds new data files to a table without removing existing data. Creates a new snapshot with the additional files. |
| Overwrite | An operation that replaces existing data files with new ones, typically based on a partition predicate. Creates a new snapshot with the replacement files. |
| Delete    | An operation that removes data files from a table, either by marking them as deleted or by removing references to them. |
| Replace   | An operation that completely replaces all data in a table with new data, typically used for full table refreshes. |

Partitioning

| Term            | Definition |
| --------------- | ---------- |
| Partition       | A logical division of table data based on column values, used to improve query performance by allowing selective reading of relevant data files. |
| Partition Spec  | Defines how table data is partitioned by specifying source columns and transformations (identity, bucket, truncate, year, month, day, hour). |
| Partition Field | A field in the partition spec that defines how a source column is transformed for partitioning. |
| Partition Path  | The file system path structure created by partition values, typically in the format partition_name=value/. |
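The partition-path convention above can be illustrated with a few lines of standard-library Go. This is only an illustration of the partition_name=value/ layout; in the library itself, paths are generated by the location provider, and the bucket/table names here are made up:

```go
package main

import "fmt"

// partitionPath renders the conventional Hive-style partition path
// segment, partition_name=value/.
func partitionPath(name, value string) string {
	return fmt.Sprintf("%s=%s/", name, value)
}

func main() {
	// A data file under an identity partition on a date column:
	fmt.Println("s3://my-bucket/tables/taxitrips/data/" +
		partitionPath("date", "2024-01-01") + "part-00000.parquet")
	// prints s3://my-bucket/tables/taxitrips/data/date=2024-01-01/part-00000.parquet
}
```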

Partition Transforms

| Transform   | Description |
| ----------- | ----------- |
| identity    | Use the column value directly |
| bucket[n]   | Hash the value into n buckets |
| truncate[n] | Truncate strings to n characters |
| year        | Extract year from date/timestamp |
| month       | Extract month from date/timestamp |
| day         | Extract day from date/timestamp |
| hour        | Extract hour from timestamp |
| void        | Always returns null (used for unpartitioned tables) |
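The idea behind bucket[n] can be sketched as hashing the value and mapping the hash into n buckets. Note that the Iceberg spec mandates a 32-bit Murmur3 hash with specific byte encodings; the standard-library FNV hash below is a stand-in for illustration only:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketOf sketches the bucket[n] transform: hash the value, then
// reduce the hash modulo n. (Real Iceberg bucketing uses 32-bit
// Murmur3; FNV here is purely illustrative.)
func bucketOf(value string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(value))
	return h.Sum32() % n
}

func main() {
	for _, v := range []string{"alice", "bob", "carol"} {
		fmt.Printf("bucket[8](%q) = %d\n", v, bucketOf(v, 8))
	}
}
```

Because the transform is deterministic, equal values always land in the same bucket, which is what makes bucket partitioning usable for pruning.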

Expressions and Predicates

| Term              | Definition |
| ----------------- | ---------- |
| Expression        | A computation or comparison that can be evaluated against table data, used for filtering and transformations. |
| Predicate         | A boolean expression used to filter data, such as column comparisons, null checks, or set membership tests. |
| Bound Predicate   | A predicate that has been resolved against a specific schema, with field references bound to actual columns. |
| Unbound Predicate | A predicate that contains unresolved field references, typically in string form before binding to a schema. |
| Literal           | A constant value used in expressions and predicates, such as numbers, strings, dates, etc. |
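The bound/unbound distinction can be shown with a toy model: an unbound predicate names a column as a string, and binding resolves that name to a field ID in a schema. This is not the library's actual expression API (its types are richer); the structs below exist only to illustrate the glossary entries:

```go
package main

import (
	"errors"
	"fmt"
)

// schema is a toy stand-in mapping field names to field IDs.
type schema map[string]int

// unboundPredicate references a column by name.
type unboundPredicate struct {
	Field string
	Op    string
	Value any
}

// boundPredicate references a column by its resolved field ID.
type boundPredicate struct {
	FieldID int
	Op      string
	Value   any
}

// bind resolves an unbound predicate against a schema, failing if the
// named field does not exist.
func bind(s schema, p unboundPredicate) (boundPredicate, error) {
	id, ok := s[p.Field]
	if !ok {
		return boundPredicate{}, errors.New("unknown field: " + p.Field)
	}
	return boundPredicate{FieldID: id, Op: p.Op, Value: p.Value}, nil
}

func main() {
	s := schema{"id": 1, "name": 2}
	bp, err := bind(s, unboundPredicate{Field: "name", Op: "==", Value: "alice"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("bound to field ID %d\n", bp.FieldID) // prints "bound to field ID 2"
}
```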

File Formats

| Format  | Usage          | Description |
| ------- | -------------- | ----------- |
| Parquet | Data files     | The primary data file format used by Iceberg, providing columnar storage with compression and encoding optimizations. |
| Avro    | Metadata files | Used for manifests and manifest lists due to its schema evolution capabilities and compact binary format. |
| ORC     | Data files     | An alternative columnar format supported by some Iceberg implementations. |

Metadata

| Term              | Definition |
| ----------------- | ---------- |
| Metadata File     | A JSON file containing table metadata including schema, partition spec, properties, and snapshot information. |
| Metadata Location | The URI pointing to the current metadata file for a table, stored in the catalog. |
| Properties        | Key-value pairs that configure table behavior, such as compression settings, write options, and custom metadata. |
| Statistics        | Metadata about data files including record counts, file sizes, and value ranges for optimization. |

Transactions

| Term        | Definition |
| ----------- | ---------- |
| Transaction | A sequence of operations that are committed atomically, ensuring data consistency and ACID properties. |
| Commit      | The process of finalizing a transaction by creating a new snapshot and updating the metadata file. |
| Rollback    | The process of undoing changes in a transaction, typically by reverting to a previous snapshot. |

References

| Term   | Definition |
| ------ | ---------- |
| Branch | A named reference to a specific snapshot, allowing multiple concurrent views of table data. |
| Tag    | An immutable reference to a specific snapshot, typically used for versioning and releases. |

Storage

| Term              | Definition |
| ----------------- | ---------- |
| Warehouse         | The root directory or bucket where table data and metadata are stored. |
| Location Provider | A component that generates file paths for table data and metadata based on table location and naming conventions. |
| FileIO            | An abstraction layer for reading and writing files across different storage systems (local filesystem, S3, GCS, Azure Blob, etc.). |

Query Optimization

| Technique                     | Description |
| ----------------------------- | ----------- |
| Column Pruning                | A technique that reads only the columns needed for a query, reducing I/O and improving performance. |
| Partition Pruning             | A technique that skips reading data files from irrelevant partitions based on query predicates. |
| Predicate Pushdown            | A technique that applies filtering predicates at the storage layer, reducing data transfer and processing. |
| Statistics-based Optimization | Using table and file statistics to optimize query execution plans and file selection. |

Schema Evolution

| Term             | Definition |
| ---------------- | ---------- |
| Schema Evolution | The process of modifying a table's schema over time while maintaining backward compatibility. |
| Column Addition  | Adding new columns to a table schema, which are typically optional to maintain compatibility. |
| Column Deletion  | Removing columns from a table schema, which may be logical (marking as deleted) or physical. |
| Column Renaming  | Changing column names while preserving data and type information. |
| Type Evolution   | Changing column types in ways that maintain data compatibility (e.g., int32 to int64). |

Time Travel

| Term               | Definition |
| ------------------ | ---------- |
| Time Travel        | The ability to query a table as it existed at a specific point in time using snapshot timestamps. |
| Snapshot Isolation | A property that ensures queries see a consistent view of data as it existed at a specific snapshot. |

ACID Properties

| Property    | Description |
| ----------- | ----------- |
| Atomicity   | Ensures that all operations in a transaction either succeed completely or fail completely. |
| Consistency | Ensures that the table remains in a valid state after each transaction. |
| Isolation   | Ensures that concurrent transactions do not interfere with each other. |
| Durability  | Ensures that committed changes are permanently stored and survive system failures. |