Getting Started
This walkthrough takes you from go get to an Iceberg table that you have written and read back, in a few minutes. It uses a local SQLite-backed catalog, so you do not need any external services. By the end you will have:
- An Iceberg catalog stored in a SQLite file
- A table with a three-field Iceberg schema
- Data written from Apache Arrow Go and read back
- A schema evolution committed
- A branch created on top of the table
If you would rather see all the operations as a reference, jump straight to the API.
1. Install
go mod init iceberg-go-tutorial
go get github.com/apache/iceberg-go@latest
go get github.com/apache/arrow-go/v18@latest
go get github.com/uptrace/bun/driver/sqliteshim@latest
iceberg-go itself only registers the local file system. For S3, GCS, or Azure Blob you would also blank-import github.com/apache/iceberg-go/io/gocloud. We are staying on local disk for this tutorial.
2. Open a local catalog
The SQL catalog stores its metadata in a database, and the actual data files live under a "warehouse" directory. We will use SQLite for the catalog DB and a local folder for the warehouse.
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/apache/iceberg-go"
	sqlcat "github.com/apache/iceberg-go/catalog/sql"
	"github.com/uptrace/bun/driver/sqliteshim"
)

// openCatalog opens (or creates) the SQLite-backed catalog used in this tutorial.
func openCatalog(ctx context.Context) (*sqlcat.Catalog, error) {
	// Catalog metadata goes into a SQLite file in the current directory.
	db, err := sql.Open(sqliteshim.ShimName, "file:./tutorial-catalog.db?cache=shared")
	if err != nil {
		return nil, err
	}

	return sqlcat.NewCatalog("default", db, sqlcat.SQLite, iceberg.Properties{
		"warehouse":           "file:///tmp/iceberg-tutorial",
		"init_catalog_tables": "true",
	})
}
init_catalog_tables: "true" lets the catalog create its bookkeeping tables (iceberg_tables, iceberg_namespace_properties) on first use. Make sure /tmp/iceberg-tutorial is writable.
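Because the catalog writes its files under the warehouse path, it can help to check that location before opening the catalog. The following is a minimal standard-library sketch, not part of iceberg-go: the helper name ensureWritableWarehouse is hypothetical, and it assumes the warehouse is given as a file:// URI like the one above.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"os"
)

// ensureWritableWarehouse is a hypothetical helper (not part of iceberg-go).
// It accepts a file:// warehouse URI, creates the directory if needed, and
// proves write access by creating and removing a temporary probe file.
func ensureWritableWarehouse(warehouse string) error {
	u, err := url.Parse(warehouse)
	if err != nil {
		return fmt.Errorf("parse warehouse URI: %w", err)
	}
	if u.Scheme != "file" {
		return fmt.Errorf("expected a file:// warehouse, got scheme %q", u.Scheme)
	}
	if err := os.MkdirAll(u.Path, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", u.Path, err)
	}
	probe, err := os.CreateTemp(u.Path, ".write-probe-*")
	if err != nil {
		return fmt.Errorf("%s is not writable: %w", u.Path, err)
	}
	name := probe.Name()
	if err := probe.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	// Same value as the "warehouse" property passed to the catalog.
	if err := ensureWritableWarehouse("file:///tmp/iceberg-tutorial"); err != nil {
		log.Fatal(err)
	}
	log.Println("warehouse directory is writable")
}
```

Running a check like this first turns a late write failure into an early, readable error message.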
3. Create a namespace and a table
Iceberg organizes tables under namespaces (the analog of databases). We will create one and put a table inside it.
import (
	"github.com/apache/iceberg-go/catalog"
	"github.com/apache/iceberg-go/table"
)

func createTrips(ctx context.Context, cat *sqlcat.Catalog) (*table.Table, error) {
	ns := table.Identifier{"taxi"}
	if err := cat.CreateNamespace(ctx, ns, nil); err != nil {
		return nil, err
	}

	schema := iceberg.NewSchema(1,
		iceberg.NestedField{ID: 1, Name: "trip_id", Type: iceberg.PrimitiveTypes.Int64, Required: true},
		iceberg.NestedField{ID: 2, Name: "fare", Type: iceberg.PrimitiveTypes.Float64, Required: false},
		iceberg.NestedField{ID: 3, Name: "borough", Type: iceberg.PrimitiveTypes.String, Required: false},
	)

	ident := catalog.ToIdentifier("taxi", "trips")
	return cat.CreateTable(ctx, ident, schema)
}
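If the namespace already exists, CreateNamespace returns an error, which this example simply propagates to the caller for brevity. Production code should detect that case with errors.Is against the catalog package's sentinel (catalog.ErrNamespaceAlreadyExists in current releases) and carry on.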
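One way to recover is to treat the "already exists" error as success, so the setup step can be re-run safely. The sketch below shows that errors.Is pattern with stand-ins so that it runs without any dependencies: errNamespaceExists, namespaceCreator, fakeCatalog, ensureNamespace and createTwice are all hypothetical names, and real code would compare against whichever sentinel error your catalog version exports.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// errNamespaceExists is a stand-in for the catalog's "already exists" sentinel.
var errNamespaceExists = errors.New("namespace already exists")

// namespaceCreator is a hypothetical interface covering the single method
// this sketch needs; its shape mirrors the CreateNamespace call used above.
type namespaceCreator interface {
	CreateNamespace(ctx context.Context, ns []string, props map[string]string) error
}

// fakeCatalog is an in-memory stand-in that only exists to make the sketch runnable.
type fakeCatalog struct{ seen map[string]bool }

func (f *fakeCatalog) CreateNamespace(_ context.Context, ns []string, _ map[string]string) error {
	key := strings.Join(ns, ".")
	if f.seen[key] {
		// Wrap the sentinel, so callers have to use errors.Is rather than ==.
		return fmt.Errorf("%w: %s", errNamespaceExists, key)
	}
	f.seen[key] = true
	return nil
}

// ensureNamespace creates ns and treats "already exists" as success.
// Any other error is returned unchanged.
func ensureNamespace(ctx context.Context, cat namespaceCreator, ns []string) error {
	err := cat.CreateNamespace(ctx, ns, nil)
	if err != nil && !errors.Is(err, errNamespaceExists) {
		return err
	}
	return nil
}

// createTwice calls ensureNamespace twice to show that the second call is harmless.
func createTwice() error {
	cat := &fakeCatalog{seen: map[string]bool{}}
	ctx := context.Background()
	for i := 0; i < 2; i++ {
		if err := ensureNamespace(ctx, cat, []string{"taxi"}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := createTwice(); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("namespace taxi is ready")
}
```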
4. Write some Arrow data
The write path takes either a streaming array.RecordReader or a fully materialized arrow.Table. We will build a small Arrow table inline and append it.
import (
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/arrow/memory"
)

func writeSomeTrips(ctx context.Context, tbl *table.Table) (*table.Table, error) {
	mem := memory.NewGoAllocator()

	arrowSchema := arrow.NewSchema([]arrow.Field{
		{Name: "trip_id", Type: arrow.PrimitiveTypes.Int64, Nullable: false},
		{Name: "fare", Type: arrow.PrimitiveTypes.Float64, Nullable: true},
		{Name: "borough", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)

	b := array.NewRecordBuilder(mem, arrowSchema)
	defer b.Release()

	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
	b.Field(1).(*array.Float64Builder).AppendValues([]float64{12.50, 8.75, 22.10}, nil)
	b.Field(2).(*array.StringBuilder).AppendValues([]string{"Manhattan", "Brooklyn", "Queens"}, nil)

	rec := b.NewRecord()
	defer rec.Release()

	arrowTbl := array.NewTableFromRecords(arrowSchema, []arrow.Record{rec})
	defer arrowTbl.Release()

	return tbl.AppendTable(ctx, arrowTbl, 1024 /* batchSize */, nil)
}
AppendTable returns a refreshed *table.Table reflecting the new snapshot.
5. Read the data back
func readAll(ctx context.Context, tbl *table.Table) error {
	arrowTbl, err := tbl.Scan().ToArrowTable(ctx)
	if err != nil {
		return err
	}
	defer arrowTbl.Release()

	log.Printf("read %d rows in %d columns", arrowTbl.NumRows(), arrowTbl.NumCols())
	return nil
}
For larger tables, use streaming so only one batch is in memory at a time:
arrowSchema, batches, err := tbl.Scan().ToArrowRecords(ctx)
if err != nil {
	return err
}
log.Printf("scan schema: %s", arrowSchema)

for batch, err := range batches {
	if err != nil {
		return err
	}
	log.Printf("batch with %d rows", batch.NumRows())
	batch.Release()
}
6. Filter and project
Use the predicate DSL to push filters down into the scan, and WithSelectedFields to project columns:
filter := iceberg.GreaterThan(iceberg.Reference("fare"), float64(10.0))

arrowTbl, err := tbl.Scan(
	table.WithSelectedFields("trip_id", "fare"),
	table.WithRowFilter(filter),
).ToArrowTable(ctx)
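To make the expected result concrete, here is a plain-Go sketch of what this scan should return for the three rows written in step 4, assuming nothing else has been appended. It does not call iceberg-go: the trip and projected types and the helper functions are hypothetical, and the loop simply mirrors the fare > 10 filter and the trip_id/fare projection.

```go
package main

import "fmt"

// trip mirrors one full row of the tutorial table.
type trip struct {
	TripID  int64
	Fare    float64
	Borough string
}

// projected mirrors the two selected columns.
type projected struct {
	TripID int64
	Fare   float64
}

// sampleTrips returns the three rows appended in step 4.
func sampleTrips() []trip {
	return []trip{
		{1, 12.50, "Manhattan"},
		{2, 8.75, "Brooklyn"},
		{3, 22.10, "Queens"},
	}
}

// filterAndProject keeps rows whose fare is strictly greater than minFare
// and drops every column except trip_id and fare.
func filterAndProject(rows []trip, minFare float64) []projected {
	var out []projected
	for _, r := range rows {
		if r.Fare > minFare {
			out = append(out, projected{TripID: r.TripID, Fare: r.Fare})
		}
	}
	return out
}

func main() {
	for _, p := range filterAndProject(sampleTrips(), 10.0) {
		fmt.Printf("trip_id=%d fare=%.2f\n", p.TripID, p.Fare)
	}
	// Prints:
	// trip_id=1 fare=12.50
	// trip_id=3 fare=22.10
}
```

The Brooklyn trip drops out because its fare of 8.75 does not exceed 10, and the borough column never appears because it was not selected.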
7. Evolve the schema
Most schema changes go through a transaction. Add a tip column:
import "github.com/apache/iceberg-go/table"

func addTipColumn(ctx context.Context, tbl *table.Table) (*table.Table, error) {
	txn := tbl.NewTransaction()

	err := table.NewUpdateSchema(txn, true /* caseSensitive */, false /* allowIncompatible */).
		AddColumn([]string{"tip"}, iceberg.PrimitiveTypes.Float64, "Tip in dollars", false, nil).
		Commit()
	if err != nil {
		return nil, err
	}

	return txn.Commit(ctx)
}
The returned table has the new schema and a fresh metadata file.
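Adding an optional column is safe because Iceberg tracks columns by field ID rather than by name or position: the new column receives a fresh ID, and rows written earlier simply have no value for it. The sketch below models that idea in plain Go. It is an illustration under simplifying assumptions, not the iceberg-go implementation: the field type and both helpers are hypothetical, and the next ID is derived from the current schema, whereas real table metadata keeps a separate last-assigned ID so that IDs are never reused.

```go
package main

import "fmt"

// field is a simplified stand-in for a schema column.
type field struct {
	ID       int
	Name     string
	Required bool
}

// tripsSchema returns the three fields defined in step 3.
func tripsSchema() []field {
	return []field{
		{ID: 1, Name: "trip_id", Required: true},
		{ID: 2, Name: "fare"},
		{ID: 3, Name: "borough"},
	}
}

// addOptionalColumn returns a copy of schema with an optional column
// appended under the next unused field ID (simplified: max ID + 1).
func addOptionalColumn(schema []field, name string) []field {
	lastID := 0
	for _, f := range schema {
		if f.ID > lastID {
			lastID = f.ID
		}
	}
	out := append([]field(nil), schema...)
	return append(out, field{ID: lastID + 1, Name: name})
}

// readRow resolves a stored row (keyed by field ID) against a schema.
// Columns the row was written without come back as nil.
func readRow(schema []field, stored map[int]any) map[string]any {
	row := make(map[string]any, len(schema))
	for _, f := range schema {
		row[f.Name] = stored[f.ID]
	}
	return row
}

func main() {
	evolved := addOptionalColumn(tripsSchema(), "tip")
	fmt.Println(evolved[len(evolved)-1].ID) // prints: 4

	// A row written before the tip column existed.
	oldRow := map[int]any{1: int64(1), 2: 12.50, 3: "Manhattan"}
	fmt.Println(readRow(evolved, oldRow)["tip"]) // prints: <nil>
}
```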
8. Branch the table
A branch is a named ref that points at a snapshot. Creating one requires a metadata commit that adds the ref, because NewTransactionOnBranch only writes to a branch that already exists. The process takes two steps:
import (
	"errors"

	"github.com/apache/iceberg-go/table"
)

// Step 1: create the branch via Catalog.CommitTable.
func createExperimentBranch(ctx context.Context, cat *sqlcat.Catalog, tbl *table.Table) (*table.Table, error) {
	snap := tbl.CurrentSnapshot()
	if snap == nil {
		// A table that has never been written to has no snapshot to branch from.
		return nil, errors.New("table has no current snapshot to branch from")
	}

	update := table.NewSetSnapshotRefUpdate(
		"experiment", snap.SnapshotID, table.BranchRef,
		0 /* maxRefAgeMs */, 0 /* maxSnapshotAgeMs */, 0 /* minSnapshotsToKeep */,
	)
	reqs := []table.Requirement{
		table.AssertTableUUID(tbl.Metadata().TableUUID()),
		table.AssertRefSnapshotID("experiment", nil), // branch must not yet exist
	}

	if _, _, err := cat.CommitTable(ctx, tbl.Identifier(), reqs, []table.Update{update}); err != nil {
		return nil, err
	}

	// Reload so subsequent transactions see the new ref.
	return cat.LoadTable(ctx, tbl.Identifier())
}
// Step 2: open a transaction on the branch and write.
func writeOnBranch(ctx context.Context, tbl *table.Table, arrowTbl arrow.Table) (*table.Table, error) {
	txn := tbl.NewTransactionOnBranch("experiment")
	if err := txn.AppendTable(ctx, arrowTbl, 1024, nil); err != nil {
		return nil, err
	}
	return txn.Commit(ctx)
}
Reads can target the branch with tbl.Scan().UseRef("experiment").
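As a mental model for the two steps, a branch can be pictured as an entry in a map from ref name to snapshot ID. The sketch below is a plain-Go illustration under that assumption, not the iceberg-go implementation: refs, createBranch and advance are hypothetical names, and the snapshot IDs are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// refs maps a ref name (such as "main" or "experiment") to a snapshot ID.
type refs map[string]int64

// createBranch adds a new ref and fails if the name is already taken,
// mirroring the "branch must not yet exist" requirement in step 1.
func (r refs) createBranch(name string, snapshotID int64) error {
	if _, ok := r[name]; ok {
		return errors.New("ref already exists: " + name)
	}
	r[name] = snapshotID
	return nil
}

// advance moves an existing ref to a new snapshot and fails if the ref is
// missing, mirroring the rule that a branch must exist before it is written to.
func (r refs) advance(name string, newSnapshotID int64) error {
	if _, ok := r[name]; !ok {
		return errors.New("no such ref: " + name)
	}
	r[name] = newSnapshotID
	return nil
}

func main() {
	r := refs{"main": 100} // made-up snapshot ID

	// Step 1: the branch starts at the snapshot main currently points to.
	if err := r.createBranch("experiment", r["main"]); err != nil {
		fmt.Println(err)
		return
	}
	// Step 2: a write on the branch produces a new snapshot for that ref only.
	if err := r.advance("experiment", 101); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(r["main"], r["experiment"]) // prints: 100 101
}
```

Writes on the branch leave main untouched, which is why a scan has to name the branch explicitly to see them.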
Where to go next
- API - the full Go surface for catalogs, tables, scans, writes, transactions, schema and partition evolution, snapshot management, maintenance, and views.
- Configuration - cloud credentials, table properties, concurrency, custom catalog and IO registration.
- CLI - inspect, expire, compact, branch, and tag from the command line.
- Row Filter Syntax and Expression DSL - the predicate DSL in detail.
The full Iceberg specification, terminology, and engine integration policy live with the main project at iceberg.apache.org.