# Iceberg Go

`iceberg-go` is a Go-native implementation of the [Apache Iceberg](https://iceberg.apache.org/) table spec for accessing Iceberg tables.
## Feature Support / Roadmap

### FileSystem Support
Filesystem Type | Supported |
---|---|
S3 | X |
Google Cloud Storage | X |
Azure Blob Storage | X |
Local Filesystem | X |
### Metadata
Operation | Supported |
---|---|
Get Schema | X |
Get Snapshots | X |
Get Sort Orders | X |
Get Partition Specs | X |
Get Manifests | X |
Create New Manifests | X |
Plan Scan | X |
Plan Scan for Snapshot | X |
### Catalog Support
Operation | REST | Hive | Glue | SQL |
---|---|---|---|---|
Load Table | X | X | X | |
List Tables | X | X | X | |
Create Table | X | X | X | |
Register Table | X | X | | |
Update Current Snapshot | X | X | X | |
Create New Snapshot | X | X | X | |
Rename Table | X | X | X | |
Drop Table | X | X | X | |
Alter Table | X | X | X | |
Check Table Exists | X | X | X | |
Set Table Properties | X | X | X | |
List Namespaces | X | X | X | |
Create Namespace | X | X | X | |
Check Namespace Exists | X | X | X | |
Drop Namespace | X | X | X | |
Update Namespace Properties | X | X | X | |
Create View | X | X | | |
Load View | X | | | |
List View | X | X | | |
Drop View | X | X | | |
Check View Exists | X | X | | |
### Read/Write Data Support
- Data can currently be read as an Arrow Table or as a stream of Arrow record batches.
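As a minimal sketch of what a read looks like (this assumes the `Scan` and `ToArrowTable` methods of the current `table` package; names and signatures may differ between releases):

```go
import (
	"context"
	"fmt"
	"log"

	"github.com/apache/iceberg-go/table"
)

// readAll scans a loaded Iceberg table and materializes it as an Arrow table.
// tbl is assumed to have been loaded via a catalog, e.g. cat.LoadTable.
func readAll(ctx context.Context, tbl *table.Table) {
	// Plan and execute a scan of the current snapshot.
	arrTbl, err := tbl.Scan().ToArrowTable(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer arrTbl.Release()

	fmt.Println("rows read:", arrTbl.NumRows())
}
```

For streaming reads, the scan can instead yield Arrow record batches one at a time, which avoids materializing the whole table in memory.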
### Supported Write Operations

As long as the FileSystem is supported and the Catalog supports altering the table, the following table tracks the current write support:
Operation | Supported |
---|---|
Append Stream | X |
Append Data Files | X |
Rewrite Files | |
Rewrite Manifests | |
Overwrite Files | |
Write Pos Delete | |
Write Eq Delete | |
Row Delta | |
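For example, an append might look like this sketch (it assumes the `AppendTable` convenience method on `table.Table` and the Arrow v18 module path used by current releases; verify both against the version you depend on):

```go
import (
	"context"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/iceberg-go/table"
)

// appendData writes the rows of an Arrow table to an Iceberg table,
// producing a new table snapshot that includes the appended data files.
func appendData(ctx context.Context, tbl *table.Table, data arrow.Table) (*table.Table, error) {
	// The batch size controls how rows are chunked when writing data files;
	// nil snapshot properties means no extra snapshot metadata is attached.
	return tbl.AppendTable(ctx, data, data.NumRows(), nil)
}
```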
## Get in Touch

For questions and discussion, see the [Apache Iceberg community page](https://iceberg.apache.org/community/).
## Install

This quickstart shows how to add the library to your project.

### Requirements

To install the `iceberg-go` package, you need to install Go and set up your Go workspace first. If you don't have a `go.mod` file, create one with `go mod init <your-module-name>`.

### Installation
- Download and install it:

```sh
go get -u github.com/apache/iceberg-go
```

- Import it in your code:

```go
import "github.com/apache/iceberg-go"
```
## CLI

Run `go build ./cmd/iceberg` from the root of this repository to build the CLI executable. Alternatively, you can run `go install github.com/apache/iceberg-go/cmd/iceberg` to install it to the `bin` directory of your `GOPATH`.

The `iceberg` CLI usage is very similar to the pyiceberg CLI. You can pass the catalog URI with the `--uri` argument.

Example:

You can start the Iceberg REST API docker image, which runs on port 8181 by default:

```sh
docker pull apache/iceberg-rest-fixture:latest
docker run -p 8181:8181 apache/iceberg-rest-fixture:latest
```

and run the `iceberg` CLI pointing at the REST API server:

```sh
./iceberg --uri http://0.0.0.0:8181 list
```

```
┌─────┐
| IDs |
| --- |
└─────┘
```
### Create Namespace

```sh
./iceberg --uri http://0.0.0.0:8181 create namespace taxitrips
```

### List Namespaces

```sh
./iceberg --uri http://0.0.0.0:8181 list
```

```
┌───────────┐
| IDs       |
| --------- |
| taxitrips |
└───────────┘
```
## Catalog

`Catalog` is the entry point for accessing Iceberg tables. You can use a catalog to:

- Create and list namespaces.
- Create, load, and drop tables.

Multiple catalog implementations are available, including REST, SQL, and Glue (see the Catalog Support table above). Here is an example of how to create a `RestCatalog`:
```go
import (
	"context"
	"log"

	"github.com/apache/iceberg-go/catalog/rest"
)

// Create a REST catalog
cat, err := rest.NewCatalog(context.Background(), "rest", "http://localhost:8181",
	rest.WithOAuthToken("your-token"))
if err != nil {
	log.Fatal(err)
}
```
You can run the following code to list all root namespaces:

```go
// List all root namespaces
namespaces, err := cat.ListNamespaces(context.Background(), nil)
if err != nil {
	log.Fatal(err)
}

for _, ns := range namespaces {
	fmt.Printf("Namespace: %v\n", ns)
}
```
Then you can run the following code to create a namespace:

```go
// Create a namespace (catalog.ToIdentifier is from github.com/apache/iceberg-go/catalog)
namespace := catalog.ToIdentifier("my_namespace")
err = cat.CreateNamespace(context.Background(), namespace, nil)
if err != nil {
	log.Fatal(err)
}
```
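The Catalog Support table above also lists namespace existence checks and drops. A sketch of those operations (the method names `CheckNamespaceExists` and `DropNamespace` are assumed from the catalog interface; check the `catalog` package docs for your version):

```go
// Check whether the namespace exists before dropping it.
exists, err := cat.CheckNamespaceExists(context.Background(), namespace)
if err != nil {
	log.Fatal(err)
}

if exists {
	// Drop the namespace; most catalogs require it to be empty first.
	if err := cat.DropNamespace(context.Background(), namespace); err != nil {
		log.Fatal(err)
	}
}
```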
### Other Catalog Types

#### SQL Catalog

You can also use SQL-based catalogs:
```go
import (
	"context"
	"log"

	"github.com/apache/iceberg-go"
	"github.com/apache/iceberg-go/catalog"
	"github.com/apache/iceberg-go/io"
)

// Create a SQLite catalog
cat, err := catalog.Load(context.Background(), "local", iceberg.Properties{
	"type":               "sql",
	"uri":                "file:iceberg-catalog.db",
	"sql.dialect":        "sqlite",
	"sql.driver":         "sqlite",
	io.S3Region:          "us-east-1",
	io.S3AccessKeyID:     "admin",
	io.S3SecretAccessKey: "password",
	"warehouse":          "file:///tmp/warehouse",
})
if err != nil {
	log.Fatal(err)
}
```
#### Glue Catalog

For AWS Glue integration:

```go
import (
	"context"
	"log"

	"github.com/apache/iceberg-go/catalog"
	"github.com/apache/iceberg-go/catalog/glue"
	"github.com/aws/aws-sdk-go-v2/config"
)

// Create AWS config
awsCfg, err := config.LoadDefaultConfig(context.TODO())
if err != nil {
	log.Fatal(err)
}

// Create Glue catalog
cat := glue.NewCatalog(glue.WithAwsConfig(awsCfg))

// Create table in Glue (schema is defined as in the Table section below)
tableIdent := catalog.ToIdentifier("my_database", "my_table")
tbl, err := cat.CreateTable(
	context.Background(),
	tableIdent,
	schema,
	catalog.WithLocation("s3://my-bucket/tables/my_table"),
)
if err != nil {
	log.Fatal(err)
}
```
## Table

After creating a `Catalog`, you can manipulate tables through it.

You can use the following code to create a table:

```go
import (
	"context"
	"log"

	"github.com/apache/iceberg-go"
	"github.com/apache/iceberg-go/catalog"
)

// Create a simple schema
schema := iceberg.NewSchemaWithIdentifiers(1, []int{2},
	iceberg.NestedField{ID: 1, Name: "foo", Type: iceberg.PrimitiveTypes.String, Required: false},
	iceberg.NestedField{ID: 2, Name: "bar", Type: iceberg.PrimitiveTypes.Int32, Required: true},
	iceberg.NestedField{ID: 3, Name: "baz", Type: iceberg.PrimitiveTypes.Bool, Required: false},
)

// Create table identifier
tableIdent := catalog.ToIdentifier("my_namespace", "my_table")

// Create table with optional properties
tbl, err := cat.CreateTable(
	context.Background(),
	tableIdent,
	schema,
	catalog.WithProperties(map[string]string{"owner": "me"}),
	catalog.WithLocation("s3://my-bucket/tables/my_table"),
)
if err != nil {
	log.Fatal(err)
}
```
Also, you can load a table directly:

```go
// Load an existing table
tbl, err := cat.LoadTable(context.Background(), tableIdent, nil)
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Table: %s\n", tbl.Identifier())
fmt.Printf("Location: %s\n", tbl.MetadataLocation())
```
### Schema Creation

Here are some examples of creating different types of schemas:

```go
// Simple schema with primitive types; the required "id" field (ID 1)
// serves as the identifier field (identifier fields must be required)
simpleSchema := iceberg.NewSchemaWithIdentifiers(1, []int{1},
	iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.PrimitiveTypes.Int32, Required: true},
	iceberg.NestedField{ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
	iceberg.NestedField{ID: 3, Name: "active", Type: iceberg.PrimitiveTypes.Bool, Required: false},
)

// Schema with a nested struct (no identifier fields)
nestedSchema := iceberg.NewSchemaWithIdentifiers(1, nil,
	iceberg.NestedField{ID: 1, Name: "person", Type: &iceberg.StructType{
		FieldList: []iceberg.NestedField{
			{ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
			{ID: 3, Name: "age", Type: iceberg.PrimitiveTypes.Int32, Required: true},
		},
	}, Required: false},
)

// Schema with list and map types (no identifier fields)
complexSchema := iceberg.NewSchemaWithIdentifiers(1, nil,
	iceberg.NestedField{ID: 1, Name: "tags", Type: &iceberg.ListType{
		ElementID: 2, Element: iceberg.PrimitiveTypes.String, ElementRequired: true,
	}, Required: false},
	iceberg.NestedField{ID: 3, Name: "metadata", Type: &iceberg.MapType{
		KeyID: 4, KeyType: iceberg.PrimitiveTypes.String,
		ValueID: 5, ValueType: iceberg.PrimitiveTypes.String, ValueRequired: true,
	}, Required: false},
)
```
### Table Operations

Here are some common table operations:

```go
// List tables in a namespace (returned as an iterator of identifier/error pairs)
tables := cat.ListTables(context.Background(), catalog.ToIdentifier("my_namespace"))
for tableIdent, err := range tables {
	if err != nil {
		log.Printf("Error listing table: %v", err)
		continue
	}
	fmt.Printf("Table: %v\n", tableIdent)
}

// Check if table exists
exists, err := cat.CheckTableExists(context.Background(), tableIdent)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Table exists: %t\n", exists)

// Drop a table
err = cat.DropTable(context.Background(), tableIdent)
if err != nil {
	log.Fatal(err)
}

// Rename a table
fromIdent := catalog.ToIdentifier("my_namespace", "old_table")
toIdent := catalog.ToIdentifier("my_namespace", "new_table")
renamedTable, err := cat.RenameTable(context.Background(), fromIdent, toIdent)
if err != nil {
	log.Fatal(err)
}
```
### Working with Table Metadata

Once you have a table, you can access its metadata and properties:

```go
// Access table metadata
metadata := tbl.Metadata()
fmt.Printf("Table UUID: %s\n", metadata.TableUUID())
fmt.Printf("Format version: %d\n", metadata.Version())
fmt.Printf("Last updated: %d\n", metadata.LastUpdatedMillis())

// Access table schema
schema := tbl.Schema()
fmt.Printf("Schema ID: %d\n", schema.ID)
fmt.Printf("Number of fields: %d\n", schema.NumFields())

// Access table properties
props := tbl.Properties()
fmt.Printf("Owner: %s\n", props["owner"])

// Access current snapshot
if snapshot := tbl.CurrentSnapshot(); snapshot != nil {
	fmt.Printf("Current snapshot ID: %d\n", snapshot.SnapshotID)
	fmt.Printf("Snapshot timestamp: %d\n", snapshot.TimestampMs)
}

// List all snapshots
for _, snapshot := range tbl.Snapshots() {
	fmt.Printf("Snapshot %d: %s\n", snapshot.SnapshotID, snapshot.Summary.Operation)
}
```
### Creating Tables with Partitioning

You can create tables with partitioning:

```go
import (
	"context"
	"log"

	"github.com/apache/iceberg-go"
	"github.com/apache/iceberg-go/catalog"
)

// Create schema
schema := iceberg.NewSchemaWithIdentifiers(1, []int{1},
	iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.PrimitiveTypes.Int32, Required: true},
	iceberg.NestedField{ID: 2, Name: "name", Type: iceberg.PrimitiveTypes.String, Required: false},
	iceberg.NestedField{ID: 3, Name: "date", Type: iceberg.PrimitiveTypes.Date, Required: false},
)

// Create partition spec
partitionSpec := iceberg.NewPartitionSpec(
	iceberg.PartitionField{SourceID: 3, FieldID: 1000, Transform: iceberg.IdentityTransform{}, Name: "date"},
)

// Create table with partitioning
tbl, err := cat.CreateTable(
	context.Background(),
	tableIdent,
	schema,
	catalog.WithPartitionSpec(&partitionSpec),
	catalog.WithLocation("s3://my-bucket/tables/partitioned_table"),
)
if err != nil {
	log.Fatal(err)
}
```
## Glossary

This glossary defines important terms used throughout the Iceberg ecosystem, organized in tables for easy reference.

### Core Concepts
Term | Definition |
---|---|
Catalog | A centralized service that manages table metadata and provides a unified interface for accessing Iceberg tables. Catalogs can be implemented as Hive metastore, AWS Glue, REST API, or SQL-based solutions. |
Table | A collection of data files organized by a schema, with metadata tracking changes over time through snapshots. Tables support ACID transactions and schema evolution. |
Schema | The structure definition of a table, specifying field names, types, and whether fields are required or optional. Schemas are versioned and can evolve over time. |
Snapshot | A point-in-time view of a table's data, representing the state after a specific operation (append, overwrite, delete, etc.). Each snapshot contains metadata about the operation and references to data files. |
Manifest | A metadata file that lists data files and their metadata (location, partition information, record counts, etc.). Manifests are organized into manifest lists for efficient access. |
Manifest List | A file that contains references to manifest files for a specific snapshot, enabling efficient discovery of data files without reading all manifests. |
### Data Types

#### Primitive Types
Type | Description |
---|---|
boolean | True/false values |
int | 32-bit signed integer values |
long | 64-bit signed integer values |
float | 32-bit single precision floating point |
double | 64-bit double precision floating point |
date | Date values (days since epoch) |
time | Time values (microseconds since midnight) |
timestamp | Timestamp values (microseconds since epoch) |
timestamptz | Timestamp with timezone |
string | UTF-8 encoded strings |
uuid | UUID values |
binary | Variable length binary data |
fixed[n] | Fixed length binary data of n bytes |
decimal(p,s) | Decimal values with precision p and scale s |
#### Nested Types
Type | Description |
---|---|
struct | Collection of named fields |
list | Ordered collection of elements |
map | Key-value pairs |
### Operations
Operation | Description |
---|---|
Append | An operation that adds new data files to a table without removing existing data. Creates a new snapshot with the additional files. |
Overwrite | An operation that replaces existing data files with new ones, typically based on a partition predicate. Creates a new snapshot with the replacement files. |
Delete | An operation that removes data files from a table, either by marking them as deleted or by removing references to them. |
Replace | An operation that completely replaces all data in a table with new data, typically used for full table refreshes. |
### Partitioning
Term | Definition |
---|---|
Partition | A logical division of table data based on column values, used to improve query performance by allowing selective reading of relevant data files. |
Partition Spec | Defines how table data is partitioned by specifying source columns and transformations (identity, bucket, truncate, year, month, day, hour). |
Partition Field | A field in the partition spec that defines how a source column is transformed for partitioning. |
Partition Path | The file system path structure created by partition values, typically in the format partition_name=value/ . |
#### Partition Transforms
Transform | Description |
---|---|
identity | Use the column value directly |
bucket[n] | Hash the value into n buckets |
truncate[n] | Truncate values to width n (strings to n characters, numbers down to a multiple of n) |
year | Extract year from date/timestamp |
month | Extract month from date/timestamp |
day | Extract day from date/timestamp |
hour | Extract hour from timestamp |
void | Always produces null; used to drop a partition field while keeping older partition specs compatible |
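In `iceberg-go`, these transforms are value types in the `iceberg` package, used the same way as the `IdentityTransform` shown in the partitioning example above. A sketch (the field IDs and names here are illustrative, and the `BucketTransform`/`DayTransform` type names are assumed from the current package):

```go
// Partition by a 16-way hash of column 1 and by the day of column 3.
spec := iceberg.NewPartitionSpec(
	iceberg.PartitionField{SourceID: 1, FieldID: 1000, Transform: iceberg.BucketTransform{NumBuckets: 16}, Name: "id_bucket"},
	iceberg.PartitionField{SourceID: 3, FieldID: 1001, Transform: iceberg.DayTransform{}, Name: "date_day"},
)
```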
### Expressions and Predicates
Term | Definition |
---|---|
Expression | A computation or comparison that can be evaluated against table data, used for filtering and transformations. |
Predicate | A boolean expression used to filter data, such as column comparisons, null checks, or set membership tests. |
Bound Predicate | A predicate that has been resolved against a specific schema, with field references bound to actual columns. |
Unbound Predicate | A predicate that contains unresolved field references, typically in string form before binding to a schema. |
Literal | A constant value used in expressions and predicates, such as numbers, strings, dates, etc. |
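A sketch of building an unbound predicate in `iceberg-go` (this assumes the expression builders `EqualTo`, `GreaterThanEqual`, and `NewAnd` in the `iceberg` package; check the package docs for the exact set of builders in your version):

```go
// name == "alice" AND bar >= 10; the field references stay unbound
// until the expression is resolved against a concrete table schema.
pred := iceberg.NewAnd(
	iceberg.EqualTo(iceberg.Reference("name"), "alice"),
	iceberg.GreaterThanEqual(iceberg.Reference("bar"), int32(10)),
)
```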
### File Formats
Format | Usage | Description |
---|---|---|
Parquet | Data files | The primary data file format used by Iceberg, providing columnar storage with compression and encoding optimizations. |
Avro | Metadata files | Used for manifests and manifest lists due to its schema evolution capabilities and compact binary format. |
ORC | Data files | An alternative columnar format supported by some Iceberg implementations. |
### Metadata
Term | Definition |
---|---|
Metadata File | A JSON file containing table metadata including schema, partition spec, properties, and snapshot information. |
Metadata Location | The URI pointing to the current metadata file for a table, stored in the catalog. |
Properties | Key-value pairs that configure table behavior, such as compression settings, write options, and custom metadata. |
Statistics | Metadata about data files including record counts, file sizes, and value ranges for optimization. |
### Transactions
Term | Definition |
---|---|
Transaction | A sequence of operations that are committed atomically, ensuring data consistency and ACID properties. |
Commit | The process of finalizing a transaction by creating a new snapshot and updating the metadata file. |
Rollback | The process of undoing changes in a transaction, typically by reverting to a previous snapshot. |
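In `iceberg-go`, a transaction is staged against a table and committed as a single snapshot. A sketch (this assumes a `NewTransaction` method on `table.Table` plus `AppendTable`/`Commit` on the transaction; verify the signatures in the `table` package for your version):

```go
// Stage an append on a transaction, then commit it atomically.
txn := tbl.NewTransaction()
if err := txn.AppendTable(ctx, data, data.NumRows(), nil); err != nil {
	log.Fatal(err)
}

// Commit produces the updated table; if the commit fails,
// no partial changes become visible.
updated, err := txn.Commit(ctx)
if err != nil {
	log.Fatal(err)
}
_ = updated
```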
### References
Term | Definition |
---|---|
Branch | A named reference to a specific snapshot, allowing multiple concurrent views of table data. |
Tag | An immutable reference to a specific snapshot, typically used for versioning and releases. |
### Storage
Term | Definition |
---|---|
Warehouse | The root directory or bucket where table data and metadata are stored. |
Location Provider | A component that generates file paths for table data and metadata based on table location and naming conventions. |
FileIO | An abstraction layer for reading and writing files across different storage systems (local filesystem, S3, GCS, Azure Blob, etc.). |
### Query Optimization
Technique | Description |
---|---|
Column Pruning | A technique that reads only the columns needed for a query, reducing I/O and improving performance. |
Partition Pruning | A technique that skips reading data files from irrelevant partitions based on query predicates. |
Predicate Pushdown | A technique that applies filtering predicates at the storage layer, reducing data transfer and processing. |
Statistics-based Optimization | Using table and file statistics to optimize query execution plans and file selection. |
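In `iceberg-go`, column pruning and predicate pushdown surface as scan options. A sketch (assuming the `WithSelectedFields` and `WithRowFilter` options in the `table` package):

```go
// Read only the "name" column, and let the scan planner use the
// predicate to skip data files whose statistics rule out a match.
scan := tbl.Scan(
	table.WithSelectedFields("name"),
	table.WithRowFilter(iceberg.EqualTo(iceberg.Reference("name"), "alice")),
)

results, err := scan.ToArrowTable(ctx)
if err != nil {
	log.Fatal(err)
}
defer results.Release()
```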
### Schema Evolution
Term | Definition |
---|---|
Schema Evolution | The process of modifying a table's schema over time while maintaining backward compatibility. |
Column Addition | Adding new columns to a table schema, which are typically optional to maintain compatibility. |
Column Deletion | Removing columns from a table schema, which may be logical (marking as deleted) or physical. |
Column Renaming | Changing column names while preserving data and type information. |
Type Evolution | Changing column types in ways that maintain data compatibility (e.g., int32 to int64). |
### Time Travel
Term | Definition |
---|---|
Time Travel | The ability to query a table as it existed at a specific point in time using snapshot timestamps. |
Snapshot Isolation | A property that ensures queries see a consistent view of data as it existed at a specific snapshot. |
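A sketch of time travel in `iceberg-go` (assuming a `WithSnapshotID` scan option; snapshot IDs can be discovered via `tbl.Snapshots()` as shown in the metadata section above):

```go
// Query the table as it existed at a specific snapshot.
past, err := tbl.Scan(table.WithSnapshotID(snapshotID)).ToArrowTable(ctx)
if err != nil {
	log.Fatal(err)
}
defer past.Release()
```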
### ACID Properties
Property | Description |
---|---|
Atomicity | Ensures that all operations in a transaction either succeed completely or fail completely. |
Consistency | Ensures that the table remains in a valid state after each transaction. |
Isolation | Ensures that concurrent transactions do not interfere with each other. |
Durability | Ensures that committed changes are permanently stored and survive system failures. |