This glossary defines important terms used throughout the Iceberg ecosystem, organized in tables for easy reference.

| Term | Definition |
| --- | --- |
| Catalog | A centralized service that manages table metadata and provides a unified interface for accessing Iceberg tables. Catalogs can be backed by a Hive Metastore, AWS Glue, a REST service, or a SQL database. |
| Table | A collection of data files organized by a schema, with metadata tracking changes over time through snapshots. Tables support ACID transactions and schema evolution. |
| Schema | The structure definition of a table, specifying field names, types, and whether fields are required or optional. Schemas are versioned and can evolve over time. |
| Snapshot | A point-in-time view of a table's data, representing the state after a specific operation (append, overwrite, delete, etc.). Each snapshot contains metadata about the operation and references to data files. |
| Manifest | A metadata file that lists data files and their metadata (location, partition information, record counts, etc.). Each snapshot reaches its manifests through a manifest list. |
| Manifest List | A file that contains references to the manifest files for a specific snapshot, enabling efficient discovery of data files without reading all manifests. |
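
The relationship between snapshots, manifest lists, manifests, and data files can be sketched as a small in-memory model. This is an illustration only: the class names, fields, and file paths below are hypothetical and do not mirror the on-disk Avro layout or any library's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int


@dataclass(frozen=True)
class Manifest:
    # A manifest lists data files together with per-file metadata.
    data_files: tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str
    # Stands in for the manifest list file that the snapshot points to.
    manifest_list: tuple[Manifest, ...]


def plan_files(snapshot: Snapshot) -> list[str]:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    return [f.path for m in snapshot.manifest_list for f in m.data_files]


def total_records(snapshot: Snapshot) -> int:
    """Sum record counts from manifest metadata without opening data files."""
    return sum(f.record_count for m in snapshot.manifest_list for f in m.data_files)


m1 = Manifest((DataFile("data/a.parquet", 100), DataFile("data/b.parquet", 50)))
m2 = Manifest((DataFile("data/c.parquet", 25),))
snap = Snapshot(snapshot_id=1, operation="append", manifest_list=(m1, m2))
```

Here `plan_files(snap)` returns the three data file paths in manifest order, and `total_records(snap)` is answered from metadata alone.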

| Type | Description |
| --- | --- |
| `boolean` | True/false values |
| `int` | 32-bit integer values |
| `long` | 64-bit long integer values |
| `float` | 32-bit single precision floating point |
| `double` | 64-bit double precision floating point |
| `date` | Date values (days since epoch) |
| `time` | Time values (microseconds since midnight) |
| `timestamp` | Timestamp values (microseconds since epoch) |
| `timestamptz` | Timestamp values with time zone |
| `string` | UTF-8 encoded strings |
| `uuid` | UUID values |
| `binary` | Variable length binary data |
| `fixed[n]` | Fixed length binary data of n bytes |
| `decimal(p,s)` | Decimal values with precision p and scale s |
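
The integer encodings described for `date`, `time`, and the timestamp types can be illustrated with Python's standard library. This is a minimal sketch; the helper names are made up for this example, and it assumes that zoned timestamps are normalized to UTC before being encoded.

```python
from datetime import date, datetime, time, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_date(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def encode_time(t: time) -> int:
    """Microseconds since midnight."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def encode_timestamptz(ts: datetime) -> int:
    """Microseconds since the epoch for a timezone-aware datetime (assumes UTC normalization)."""
    delta = ts.astimezone(timezone.utc) - EPOCH_TS
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


# 01:00 at UTC+1 is the same instant as the epoch itself.
plus_one = timezone(timedelta(hours=1))
epoch_in_plus_one = datetime(1970, 1, 1, 1, 0, tzinfo=plus_one)
```

Integer arithmetic is used instead of `total_seconds()` so that microsecond precision is not lost to floating point rounding.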

| Type | Description |
| --- | --- |
| `struct` | Collection of named fields |
| `list` | Ordered collection of elements |
| `map` | Key-value pairs |

| Operation | Description |
| --- | --- |
| Append | An operation that adds new data files to a table without removing existing data. Creates a new snapshot with the additional files. |
| Overwrite | An operation that replaces existing data files with new ones, typically based on a partition predicate. Creates a new snapshot with the replacement files. |
| Delete | An operation that removes data from a table, either by dropping references to whole data files or by marking rows as deleted. |
| Replace | In the Iceberg table spec, an operation that rewrites files without changing the table's logical contents (for example, compaction). The term is also used more loosely for a full table refresh that swaps all existing data for new data. |
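
One way to picture these operations is that each one takes a parent snapshot and produces a new, immutable snapshot with a different set of files. The sketch below is a simplified model: the `Snapshot` class and the sequential snapshot IDs are assumptions made for illustration, and real snapshots track files through manifests rather than a flat set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str
    files: frozenset[str]


def append(parent: Snapshot, added: list[str]) -> Snapshot:
    # Existing files are kept; new files are added.
    return Snapshot(parent.snapshot_id + 1, "append", parent.files | frozenset(added))


def overwrite(parent: Snapshot, removed: list[str], added: list[str]) -> Snapshot:
    # Selected files are swapped for replacements in a single snapshot.
    files = (parent.files - frozenset(removed)) | frozenset(added)
    return Snapshot(parent.snapshot_id + 1, "overwrite", files)


def delete(parent: Snapshot, removed: list[str]) -> Snapshot:
    # Files are dropped from the new snapshot; the parent still references them.
    return Snapshot(parent.snapshot_id + 1, "delete", parent.files - frozenset(removed))


s0 = Snapshot(0, "append", frozenset())
s1 = append(s0, ["a.parquet", "b.parquet"])
s2 = overwrite(s1, removed=["a.parquet"], added=["c.parquet"])
s3 = delete(s2, removed=["b.parquet"])
```

Because every function returns a new object, `s1` still lists `a.parquet` and `b.parquet` after `s2` and `s3` are created, which is what makes older snapshots queryable.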

| Term | Definition |
| --- | --- |
| Partition | A logical division of table data based on column values, used to improve query performance by allowing selective reading of relevant data files. |
| Partition Spec | Defines how table data is partitioned by specifying source columns and transformations (identity, bucket, truncate, year, month, day, hour). |
| Partition Field | A field in the partition spec that defines how a source column is transformed for partitioning. |
| Partition Path | The file system path structure created by partition values, typically in the format `partition_name=value/`. |
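
A partition path in the `partition_name=value/` format can be assembled as follows. This is a sketch under stated assumptions: the function name is hypothetical, percent-encoding via `urllib.parse.quote` is a choice made for this example, and rendering a missing value as `null` is likewise an assumption. Implementations may escape values differently.

```python
from __future__ import annotations

from urllib.parse import quote


def partition_path(values: dict[str, object]) -> str:
    """Join partition fields as name=value/ segments, in the given order."""
    segments = []
    for name, value in values.items():
        text = "null" if value is None else str(value)
        # safe="" also escapes "/" so a value cannot introduce extra directories.
        segments.append(f"{quote(name, safe='')}={quote(text, safe='')}")
    return "/".join(segments) + "/"


path = partition_path({"event_date": "2024-01-15", "region": "us"})
```

Escaping matters because a raw `/` inside a value would otherwise be read as an additional directory level.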
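
| Transform | Description |
| --- | --- |
| `identity` | Use the column value directly |
| `bucket[n]` | Hash the value into n buckets |
| `truncate[n]` | Truncate the value to width n (strings to n characters, integers and decimals down to a multiple of n) |
| `year` | Extract year from date/timestamp |
| `month` | Extract month from date/timestamp |
| `day` | Extract day from date/timestamp |
| `hour` | Extract hour from timestamp |
| `void` | Always produces null (typically a placeholder for a partition field that has been dropped) |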
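
The temporal and truncate transforms can be sketched with plain Python. The function names are hypothetical; the results follow the convention that temporal transforms produce offsets from 1970-01-01 rather than calendar fields. The `bucket` transform is omitted because it depends on a specific hash function that is not reproduced here.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year(d: date) -> int:
    """Years since 1970."""
    return d.year - 1970


def month(d: date) -> int:
    """Months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def hour(ts: datetime) -> int:
    """Hours since 1970-01-01T00:00 UTC; expects a timezone-aware datetime."""
    return (ts - EPOCH_TS) // timedelta(hours=1)


def truncate_int(width: int, value: int) -> int:
    """Round down to a multiple of width (Python's % is non-negative for width > 0)."""
    return value - (value % width)


def truncate_str(width: int, value: str) -> str:
    """Keep at most the first `width` characters."""
    return value[:width]
```

For example, `truncate_int(10, 27)` gives `20` and `truncate_int(10, -3)` gives `-10`, so negative values round toward negative infinity rather than toward zero.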

| Term | Definition |
| --- | --- |
| Expression | A computation or comparison that can be evaluated against table data, used for filtering and transformations. |
| Predicate | A boolean expression used to filter data, such as column comparisons, null checks, or set membership tests. |
| Bound Predicate | A predicate that has been resolved against a specific schema, with field references bound to actual columns. |
| Unbound Predicate | A predicate whose field references are still unresolved, typically plain column names that have not yet been bound to a schema. |
| Literal | A constant value used in expressions and predicates, such as numbers, strings, dates, etc. |
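
The difference between unbound and bound predicates can be sketched as a binding step that resolves a column name to a schema field. All class and function names here are hypothetical and the operator set is deliberately tiny; this is not the expression API of any Iceberg library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    field_type: str


@dataclass(frozen=True)
class UnboundPredicate:
    op: str
    column: str  # a name only; not yet checked against any schema
    literal: object = None


@dataclass(frozen=True)
class BoundPredicate:
    op: str
    field: Field  # resolved schema field
    literal: object = None

    def evaluate(self, row: dict[str, object]) -> bool:
        value = row.get(self.field.name)
        if self.op == "is_null":
            return value is None
        if value is None:
            return False
        if self.op == "eq":
            return value == self.literal
        if self.op == "gt":
            return value > self.literal
        if self.op == "in":
            return value in self.literal
        raise ValueError(f"unsupported operator: {self.op}")


def bind(pred: UnboundPredicate, schema: list[Field]) -> BoundPredicate:
    """Resolve the predicate's column name against the schema, or fail."""
    for f in schema:
        if f.name == pred.column:
            return BoundPredicate(pred.op, f, pred.literal)
    raise ValueError(f"cannot find field {pred.column!r} in schema")


schema = [Field(1, "id", "long"), Field(2, "name", "string")]
bound = bind(UnboundPredicate("gt", "id", 10), schema)
```

Binding is also the natural point to reject predicates that name a column the schema does not contain, before any data is read.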

| Format | Usage | Description |
| --- | --- | --- |
| Parquet | Data files | The primary data file format used by Iceberg, providing columnar storage with compression and encoding optimizations. |
| Avro | Metadata files | Used for manifests and manifest lists due to its schema evolution capabilities and compact binary format. |
| ORC | Data files | An alternative columnar format supported by some Iceberg implementations. |

| Term | Definition |
| --- | --- |
| Metadata File | A JSON file containing table metadata including schema, partition spec, properties, and snapshot information. |
| Metadata Location | The URI pointing to the current metadata file for a table, stored in the catalog. |
| Properties | Key-value pairs that configure table behavior, such as compression settings, write options, and custom metadata. |
| Statistics | Metadata about data files including record counts, file sizes, and value ranges for optimization. |

| Term | Definition |
| --- | --- |
| Transaction | A sequence of operations that are committed atomically, ensuring data consistency and ACID properties. |
| Commit | The process of finalizing a transaction by creating a new snapshot and updating the metadata file. |
| Rollback | The process of undoing changes in a transaction, typically by reverting to a previous snapshot. |
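
A commit can be pictured as an atomic compare-and-swap on the table's current metadata location: the writer succeeds only if nobody else has committed since it started. The sketch below assumes a hypothetical in-memory catalog guarded by a lock; the class name, method names, and file names are illustrative, and real catalogs provide this guarantee through their own mechanisms.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog that tracks one table's current metadata location."""

    def __init__(self, metadata_location: str) -> None:
        self._lock = threading.Lock()
        self._current = metadata_location

    def current(self) -> str:
        with self._lock:
            return self._current

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Point the table at `new` only if it still points at `expected`."""
        with self._lock:
            if self._current != expected:
                return False
            self._current = new
            return True


def commit(catalog: InMemoryCatalog, base: str, new: str) -> None:
    """Finalize a transaction that was prepared against `base`."""
    if not catalog.compare_and_swap(base, new):
        raise RuntimeError("commit failed: metadata changed since the transaction began")


catalog = InMemoryCatalog("metadata/v1.metadata.json")
commit(catalog, base="metadata/v1.metadata.json", new="metadata/v2.metadata.json")
```

A second writer that also prepared its changes against `v1` would now fail and have to retry on top of `v2`. In this model a rollback is simply another commit whose new metadata points back at an earlier snapshot.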

| Term | Definition |
| --- | --- |
| Branch | A named, mutable reference to a snapshot that advances as new commits are made to it, allowing multiple concurrent views of table data. |
| Tag | An immutable reference to a specific snapshot, typically used for versioning and releases. |
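
Branches and tags can be modeled as named entries that map to snapshot IDs, where only branches are allowed to move. The dictionary layout below is loosely modeled on the `refs` map in table metadata, but the helper function and the reference names are hypothetical.

```python
def advance_branch(refs: dict, name: str, snapshot_id: int) -> dict:
    """Return a copy of `refs` with branch `name` moved to `snapshot_id`."""
    ref = refs[name]
    if ref["type"] != "branch":
        raise ValueError(f"{name!r} is a tag and cannot be advanced")
    return {**refs, name: {**ref, "snapshot-id": snapshot_id}}


refs = {
    "main": {"type": "branch", "snapshot-id": 3},
    "release-1": {"type": "tag", "snapshot-id": 1},
}

# A new commit on main moves the branch; the tag keeps pointing at snapshot 1.
refs = advance_branch(refs, "main", 4)
```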

| Term | Definition |
| --- | --- |
| Warehouse | The root directory or bucket where table data and metadata are stored. |
| Location Provider | A component that generates file paths for table data and metadata based on table location and naming conventions. |
| FileIO | An abstraction layer for reading and writing files across different storage systems (local filesystem, S3, GCS, Azure Blob, etc.). |
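
The split between a location provider (which decides where a file goes) and FileIO (which reads and writes bytes at that location) can be sketched as below. The protocol, the in-memory backend, the `data/` subdirectory convention, and the `s3://warehouse/...` URI are all assumptions made for this example rather than any library's actual interface.

```python
from __future__ import annotations

from typing import Protocol


class FileIO(Protocol):
    """Minimal storage abstraction: callers never see which backend is used."""

    def read(self, location: str) -> bytes: ...

    def write(self, location: str, data: bytes) -> None: ...


class InMemoryFileIO:
    """Toy backend; an S3 or local filesystem backend would expose the same methods."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, location: str) -> bytes:
        return self._files[location]

    def write(self, location: str, data: bytes) -> None:
        self._files[location] = data


def new_data_location(table_location: str, file_name: str) -> str:
    """Hypothetical location provider: place data files under <table>/data/."""
    return f"{table_location.rstrip('/')}/data/{file_name}"


def store(io: FileIO, location: str, payload: bytes) -> None:
    # Written against the protocol, so any conforming backend can be passed in.
    io.write(location, payload)


io = InMemoryFileIO()
location = new_data_location("s3://warehouse/db/events/", "00000-0.parquet")
store(io, location, b"example bytes")
```

Keeping path generation separate from byte transfer means a table can change its file layout without touching storage code, and vice versa.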

| Technique | Description |
| --- | --- |
| Column Pruning | A technique that reads only the columns needed for a query, reducing I/O and improving performance. |
| Partition Pruning | A technique that skips reading data files from irrelevant partitions based on query predicates. |
| Predicate Pushdown | A technique that applies filtering predicates at the storage layer, reducing data transfer and processing. |
| Statistics-based Optimization | Using table and file statistics to optimize query execution plans and file selection. |
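
Partition pruning and statistics-based file selection can be combined in a single planning pass: a file is skipped if its partition does not match, or if its recorded value range cannot overlap the query range. The sketch below uses hypothetical names and integer column bounds only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: dict[str, str]
    lower_bounds: dict[str, int]
    upper_bounds: dict[str, int]


def plan_files(
    files: list[DataFile],
    partition: dict[str, str],
    column: str,
    low: int,
    high: int,
) -> list[str]:
    """Select files that may hold rows with low <= column <= high in the partition."""
    selected = []
    for f in files:
        # Partition pruning: skip files from partitions the query does not touch.
        if any(f.partition.get(k) != v for k, v in partition.items()):
            continue
        # Statistics: skip files whose [lower, upper] range cannot overlap [low, high].
        if f.upper_bounds[column] < low or f.lower_bounds[column] > high:
            continue
        selected.append(f.path)
    return selected


files = [
    DataFile("a.parquet", {"day": "2024-01-01"}, {"id": 1}, {"id": 100}),
    DataFile("b.parquet", {"day": "2024-01-01"}, {"id": 101}, {"id": 200}),
    DataFile("c.parquet", {"day": "2024-01-02"}, {"id": 1}, {"id": 200}),
]
plan = plan_files(files, {"day": "2024-01-01"}, "id", low=150, high=160)
```

Only `b.parquet` survives: `a.parquet` is ruled out by its value range and `c.parquet` by its partition. Selected files may still contain no matching rows, so the predicate must still be applied when reading.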

| Term | Definition |
| --- | --- |
| Schema Evolution | The process of modifying a table's schema over time while maintaining backward compatibility. |
| Column Addition | Adding new columns to a table schema, which are typically optional to maintain compatibility. |
| Column Deletion | Removing columns from a table schema, which may be logical (marking as deleted) or physical. |
| Column Renaming | Changing column names while preserving data and type information. |
| Type Evolution | Changing column types in ways that maintain data compatibility (e.g., `int` to `long`). |
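
Schema evolution stays safe for existing data when columns are tracked by a stable field ID rather than by name. The sketch below models a schema as a mapping from field ID to name and type; the helper names are hypothetical, only two widening promotions are listed, and assigning new IDs as `max + 1` is a simplification made for this example.

```python
from __future__ import annotations

# Widening promotions allowed in this sketch; other type changes are rejected.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}

Schema = dict  # field ID -> (column name, type name)


def add_column(schema: Schema, name: str, field_type: str) -> Schema:
    new_id = max(schema, default=0) + 1  # simplification: real ID assignment never reuses IDs
    return {**schema, new_id: (name, field_type)}


def rename_column(schema: Schema, field_id: int, new_name: str) -> Schema:
    _, field_type = schema[field_id]
    return {**schema, field_id: (new_name, field_type)}


def promote_type(schema: Schema, field_id: int, new_type: str) -> Schema:
    name, old_type = schema[field_id]
    if (old_type, new_type) not in ALLOWED_PROMOTIONS:
        raise ValueError(f"cannot change {name!r} from {old_type} to {new_type}")
    return {**schema, field_id: (name, new_type)}


def project(stored_row: dict[int, object], schema: Schema) -> dict[str, object]:
    """Read a row written under an older schema; columns it lacks come back as None."""
    return {name: stored_row.get(field_id) for field_id, (name, _) in schema.items()}


v1 = {1: ("id", "int"), 2: ("name", "string")}
old_row = {1: 7, 2: "x"}  # values keyed by field ID, written under v1

v2 = rename_column(v1, 2, "full_name")
v3 = promote_type(v2, 1, "long")
v4 = add_column(v3, "email", "string")
row = project(old_row, v4)
```

The old row is still readable under `v4`: the renamed column resolves through its unchanged ID, and the added column is filled with `None`, which is why added columns are typically optional.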

| Term | Definition |
| --- | --- |
| Time Travel | The ability to query a table as it existed at a specific point in time, selected by snapshot ID or timestamp. |
| Snapshot Isolation | A property that ensures queries see a consistent view of data as it existed at a specific snapshot. |
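
Timestamp-based time travel amounts to finding the latest snapshot committed at or before the requested time. A minimal sketch, assuming a hypothetical snapshot log of `(timestamp_ms, snapshot_id)` pairs sorted by timestamp:

```python
from __future__ import annotations

from bisect import bisect_right


def snapshot_as_of(log: list[tuple[int, int]], timestamp_ms: int) -> int:
    """Return the ID of the last snapshot committed at or before timestamp_ms."""
    timestamps = [ts for ts, _ in log]
    index = bisect_right(timestamps, timestamp_ms)
    if index == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[index - 1][1]


log = [(1000, 11), (2000, 22), (3000, 33)]
```

With this log, a query as of `2500` resolves to snapshot `22`, and a query as of `3000` resolves to snapshot `33`. A reader that pins the resolved snapshot ID for the whole query gets the consistent view that snapshot isolation describes.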

| Property | Description |
| --- | --- |
| Atomicity | Ensures that all operations in a transaction either succeed completely or fail completely. |
| Consistency | Ensures that the table remains in a valid state after each transaction. |
| Isolation | Ensures that concurrent transactions do not interfere with each other. |
| Durability | Ensures that committed changes are permanently stored and survive system failures. |