Glossary

This glossary defines important terms used throughout the Iceberg ecosystem, organized in tables for easy reference.

Core Concepts

| Term | Definition |
| --- | --- |
| Catalog | A centralized service that manages table metadata and provides a unified interface for accessing Iceberg tables. Catalogs can be implemented as Hive metastore, AWS Glue, REST API, or SQL-based solutions. |
| Table | A collection of data files organized by a schema, with metadata tracking changes over time through snapshots. Tables support ACID transactions and schema evolution. |
| Schema | The structure definition of a table, specifying field names, types, and whether fields are required or optional. Schemas are versioned and can evolve over time. |
| Snapshot | A point-in-time view of a table's data, representing the state after a specific operation (append, overwrite, delete, etc.). Each snapshot contains metadata about the operation and references to data files. |
| Manifest | A metadata file that lists data files and their metadata (location, partition information, record counts, etc.). Manifests are organized into manifest lists for efficient access. |
| Manifest List | A file that contains references to manifest files for a specific snapshot, enabling efficient discovery of data files without reading all manifests. |
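
The relationship between these terms can be sketched as a small object model. This is a simplified illustration, not Iceberg's actual metadata layout: the class names, fields, and file paths below are hypothetical and carry only enough detail to show how a snapshot leads to its data files.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str          # location of the data file
    partition: dict    # partition values for the file
    record_count: int  # number of rows in the file


@dataclass
class Manifest:
    data_files: list   # DataFile entries tracked by this manifest


@dataclass
class ManifestList:
    manifests: list    # Manifest entries that make up one snapshot


@dataclass
class Snapshot:
    snapshot_id: int
    operation: str     # e.g. "append", "overwrite", "delete"
    manifest_list: ManifestList


def data_files(snapshot):
    """Walk snapshot -> manifest list -> manifests -> data files."""
    return [f for m in snapshot.manifest_list.manifests for f in m.data_files]


snap = Snapshot(
    snapshot_id=1,
    operation="append",
    manifest_list=ManifestList([
        Manifest([DataFile("data/a.parquet", {"day": "2024-01-01"}, 100)]),
        Manifest([DataFile("data/b.parquet", {"day": "2024-01-02"}, 250)]),
    ]),
)

print([f.path for f in data_files(snap)])             # ['data/a.parquet', 'data/b.parquet']
print(sum(f.record_count for f in data_files(snap)))  # 350
```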

Data Types

Primitive Types

| Type | Description |
| --- | --- |
| boolean | True/false values |
| int | 32-bit integer values |
| long | 64-bit integer values |
| float | 32-bit single-precision floating point |
| double | 64-bit double-precision floating point |
| date | Date values (days since 1970-01-01) |
| time | Time-of-day values (microseconds since midnight) |
| timestamp | Timestamp values without time zone (microseconds since epoch) |
| timestamptz | Timestamp values with time zone (microseconds since epoch, stored as UTC) |
| string | UTF-8 encoded strings |
| uuid | UUID values |
| binary | Variable-length binary data |
| fixed[n] | Fixed-length binary data of n bytes |
| decimal(p,s) | Decimal values with precision p and scale s |
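
The temporal encodings in the table are easier to see with concrete values. The sketch below uses only Python's standard library; the helper names are made up for illustration and are not part of any Iceberg API.

```python
from datetime import date, datetime, time

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1)


def date_to_days(d):
    """Encode a date as days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def time_to_micros(t):
    """Encode a time of day as microseconds since midnight."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1_000_000 + t.microsecond


def timestamp_to_micros(ts):
    """Encode a naive timestamp as microseconds since the epoch."""
    delta = ts - EPOCH_TS
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


print(date_to_days(date(2024, 1, 1)))             # 19723
print(time_to_micros(time(0, 0, 1)))              # 1000000
print(timestamp_to_micros(datetime(1970, 1, 2)))  # 86400000000
```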

Nested Types

| Type | Description |
| --- | --- |
| struct | Collection of named fields |
| list | Ordered collection of elements |
| map | Key-value pairs |

Operations

| Operation | Description |
| --- | --- |
| Append | An operation that adds new data files to a table without removing existing data. Creates a new snapshot with the additional files. |
| Overwrite | An operation that replaces existing data files with new ones, typically based on a partition predicate. Creates a new snapshot with the replacement files. |
| Delete | An operation that removes data files from a table, either by marking them as deleted or by removing references to them. |
| Replace | An operation that completely replaces all data in a table with new data, typically used for full table refreshes. |

Partitioning

| Term | Definition |
| --- | --- |
| Partition | A logical division of table data based on column values, used to improve query performance by allowing selective reading of relevant data files. |
| Partition Spec | Defines how table data is partitioned by specifying source columns and transformations (identity, bucket, truncate, year, month, day, hour). |
| Partition Field | A field in the partition spec that defines how a source column is transformed for partitioning. |
| Partition Path | The file system path structure created by partition values, typically in the format partition_name=value/. |
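
As a small illustration of the partition_name=value/ layout, the hypothetical helper below joins partition values into a path. It is a simplification: the field names are invented, and real writers may also escape special characters in values.

```python
def partition_path(partition_values):
    """Build a path such as 'event_day=2024-01-01/region=eu/' from (name, value) pairs."""
    return "".join(f"{name}={value}/" for name, value in partition_values)


print(partition_path([("event_day", "2024-01-01"), ("region", "eu")]))
# event_day=2024-01-01/region=eu/
```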

Partition Transforms

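| Transform | Description |
| --- | --- |
| identity | Use the column value directly |
| bucket[n] | Hash the value into n buckets |
| truncate[n] | Truncate the value to width n (strings to their first n characters, integers and decimals down to a multiple of n) |
| year | Extract the year from a date or timestamp |
| month | Extract the month from a date or timestamp |
| day | Extract the day from a date or timestamp |
| hour | Extract the hour from a timestamp |
| void | Always produces null (in format version 1, used in place of a dropped partition field) |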
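
A few of these transforms are simple enough to sketch in plain Python. The function names are illustrative only. The time-based transforms here return offsets from 1970-01-01, which is how the Iceberg table spec defines their results. The bucket transform is left out because it depends on a 32-bit Murmur3 hash that is too long to reproduce here.

```python
from datetime import date, datetime

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1)


def truncate_int(value, width):
    """Round an integer down to the nearest multiple of width."""
    return value - (value % width)  # Python's % is non-negative for width > 0


def truncate_str(value, length):
    """Keep the first `length` characters of a string."""
    return value[:length]


def year(d):
    """Years since 1970."""
    return d.year - 1970


def month(d):
    """Months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day(d):
    """Days since 1970-01-01 (expects a date, not a datetime)."""
    return (d - EPOCH_DATE).days


def hour(ts):
    """Hours since 1970-01-01 00:00 for a naive datetime."""
    delta = ts - EPOCH_TS
    return delta.days * 24 + delta.seconds // 3600


print(truncate_int(17, 10))                 # 10
print(truncate_int(-1, 10))                 # -10
print(truncate_str("iceberg", 3))           # ice
print(year(date(2024, 3, 15)))              # 54
print(month(date(2024, 3, 15)))             # 650
print(hour(datetime(1970, 1, 1, 5, 30)))    # 5
```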

Expressions and Predicates

| Term | Definition |
| --- | --- |
| Expression | A computation or comparison that can be evaluated against table data, used for filtering and transformations. |
| Predicate | A boolean expression used to filter data, such as column comparisons, null checks, or set membership tests. |
| Bound Predicate | A predicate that has been resolved against a specific schema, with field references bound to actual columns. |
| Unbound Predicate | A predicate whose field references are still unresolved, typically referring to columns by name, before it is bound to a schema. |
| Literal | A constant value used in expressions and predicates, such as a number, string, or date. |
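
The difference between unbound and bound predicates can be shown with a minimal sketch. Everything here is hypothetical: the classes, the `bind` function, and the list-of-names schema are stand-ins chosen for brevity, not the expression API of any Iceberg library.

```python
import operator
from dataclasses import dataclass

# Comparison operators supported by this sketch.
OPS = {"==": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass(frozen=True)
class UnboundPredicate:
    field_name: str   # column referenced by name only
    op: str
    literal: object   # constant to compare against


@dataclass(frozen=True)
class BoundPredicate:
    position: int     # resolved column position in the schema
    op: str
    literal: object

    def test(self, row):
        """Evaluate the predicate against one row (a tuple of column values)."""
        return OPS[self.op](row[self.position], self.literal)


def bind(predicate, schema):
    """Resolve the predicate's field name against a schema (a list of column names)."""
    if predicate.field_name not in schema:
        raise ValueError(f"unknown field: {predicate.field_name}")
    return BoundPredicate(schema.index(predicate.field_name), predicate.op, predicate.literal)


schema = ["id", "name", "age"]
bound = bind(UnboundPredicate("age", ">", 30), schema)

print(bound.position)               # 2
print(bound.test((1, "ann", 42)))   # True
print(bound.test((2, "bob", 18)))   # False
```

Binding fails early for a column that does not exist, which is one practical reason to separate the two forms.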

File Formats

| Format | Usage | Description |
| --- | --- | --- |
| Parquet | Data files | The primary data file format used by Iceberg, providing columnar storage with compression and encoding optimizations. |
| Avro | Metadata files | Used for manifests and manifest lists due to its schema evolution capabilities and compact binary format. |
| ORC | Data files | An alternative columnar format supported by some Iceberg implementations. |

Metadata

| Term | Definition |
| --- | --- |
| Metadata File | A JSON file containing table metadata including schema, partition spec, properties, and snapshot information. |
| Metadata Location | The URI pointing to the current metadata file for a table, stored in the catalog. |
| Properties | Key-value pairs that configure table behavior, such as compression settings, write options, and custom metadata. |
| Statistics | Metadata about data files including record counts, file sizes, and value ranges for optimization. |

Transactions

| Term | Definition |
| --- | --- |
| Transaction | A sequence of operations that are committed atomically, ensuring data consistency and ACID properties. |
| Commit | The process of finalizing a transaction by creating a new snapshot and updating the metadata file. |
| Rollback | The process of undoing changes in a transaction, typically by reverting to a previous snapshot. |
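
Commit and rollback can be pictured with a toy in-memory table in which every commit produces a new immutable snapshot and a rollback only moves the "current" pointer. This is a sketch under strong simplifying assumptions (a hypothetical `ToyTable` class, no concurrency, no metadata files), not how any Iceberg implementation is structured.

```python
class ToyTable:
    """In-memory stand-in for a table: snapshots map an id to a set of file paths."""

    def __init__(self):
        self.snapshots = {}      # snapshot id -> frozenset of data file paths
        self.current_id = None   # id of the current snapshot, if any

    def files(self):
        """Data files visible in the current snapshot."""
        return self.snapshots.get(self.current_id, frozenset())

    def commit(self, added=(), removed=()):
        """Apply all changes at once by publishing a new snapshot."""
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = (self.files() - frozenset(removed)) | frozenset(added)
        self.current_id = new_id
        return new_id

    def rollback(self, snapshot_id):
        """Revert to an earlier snapshot without deleting later ones."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
first = table.commit(added=["a.parquet"])                            # append
second = table.commit(added=["b.parquet"], removed=["a.parquet"])    # overwrite
print(sorted(table.files()))   # ['b.parquet']

table.rollback(first)
print(sorted(table.files()))   # ['a.parquet']
```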

References

| Term | Definition |
| --- | --- |
| Branch | A named reference to a specific snapshot, allowing multiple concurrent views of table data. |
| Tag | An immutable reference to a specific snapshot, typically used for versioning and releases. |

Storage

| Term | Definition |
| --- | --- |
| Warehouse | The root directory or bucket where table data and metadata are stored. |
| Location Provider | A component that generates file paths for table data and metadata based on table location and naming conventions. |
| FileIO | An abstraction layer for reading and writing files across different storage systems (local filesystem, S3, GCS, Azure Blob, etc.). |

Query Optimization

| Technique | Description |
| --- | --- |
| Column Pruning | A technique that reads only the columns needed for a query, reducing I/O and improving performance. |
| Partition Pruning | A technique that skips reading data files from irrelevant partitions based on query predicates. |
| Predicate Pushdown | A technique that applies filtering predicates at the storage layer, reducing data transfer and processing. |
| Statistics-based Optimization | Using table and file statistics to optimize query execution plans and file selection. |
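
Partition pruning and statistics-based file selection can be sketched together. The example assumes a hypothetical table partitioned by `day` with per-file minimum and maximum values for an `amount` column. The record layout and function name are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    day: str          # partition value
    amount_min: int   # lower bound of the `amount` column in this file
    amount_max: int   # upper bound of the `amount` column in this file


def plan_files(files, day, min_amount):
    """Select files that may contain rows where day == `day` and amount >= `min_amount`."""
    selected = []
    for f in files:
        if f.day != day:
            continue  # partition pruning: wrong partition
        if f.amount_max < min_amount:
            continue  # statistics: no row in this file can match
        selected.append(f.path)
    return selected


files = [
    DataFile("a.parquet", "2024-01-01", 5, 40),
    DataFile("b.parquet", "2024-01-01", 60, 90),
    DataFile("c.parquet", "2024-01-02", 70, 99),
]

print(plan_files(files, "2024-01-01", 50))  # ['b.parquet']
```

Only one of the three files has to be opened. The other two are skipped using metadata alone.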

Schema Evolution

| Term | Definition |
| --- | --- |
| Schema Evolution | The process of modifying a table's schema over time while maintaining backward compatibility. |
| Column Addition | Adding new columns to a table schema, which are typically optional to maintain compatibility. |
| Column Deletion | Removing columns from a table schema, which may be logical (marking as deleted) or physical. |
| Column Renaming | Changing column names while preserving data and type information. |
| Type Evolution | Changing column types in ways that maintain data compatibility (e.g., int to long). |
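
A compatibility check for type evolution might look like the sketch below. It covers only the widening promotions long permitted by the Iceberg table spec (int to long, float to double, and a decimal gaining precision at the same scale). The type representation, with strings for primitives and `("decimal", precision, scale)` tuples, is an assumption made for this example.

```python
# Widening promotions between primitive types covered by this sketch.
PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}


def is_decimal(t):
    """True for the tuple form ("decimal", precision, scale) used in this sketch."""
    return isinstance(t, tuple) and len(t) == 3 and t[0] == "decimal"


def can_promote(source, target):
    """Return True if a column of type `source` may evolve to type `target`."""
    if source == target:
        return True
    if is_decimal(source) and is_decimal(target):
        _, source_precision, source_scale = source
        _, target_precision, target_scale = target
        # Precision may grow, but the scale must stay the same.
        return source_scale == target_scale and target_precision > source_precision
    return (source, target) in PRIMITIVE_PROMOTIONS


print(can_promote("int", "long"))                               # True
print(can_promote("long", "int"))                               # False
print(can_promote(("decimal", 9, 2), ("decimal", 18, 2)))       # True
print(can_promote(("decimal", 9, 2), ("decimal", 18, 4)))       # False
```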

Time Travel

| Term | Definition |
| --- | --- |
| Time Travel | The ability to query a table as it existed at a specific point in time, selected by snapshot timestamp or snapshot ID. |
| Snapshot Isolation | A property that ensures queries see a consistent view of data as it existed at a specific snapshot. |
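
Resolving a timestamp to a snapshot amounts to picking the most recent snapshot committed at or before that time. The sketch below assumes snapshots are available as hypothetical `(snapshot_id, timestamp_ms)` pairs. It does not use any real catalog or table API.

```python
def snapshot_as_of(snapshots, timestamp_ms):
    """Return the id of the latest snapshot committed at or before `timestamp_ms`."""
    candidates = [s for s in snapshots if s[1] <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s[1])[0]


# (snapshot_id, commit time in milliseconds since epoch)
history = [(101, 1_000), (102, 2_000), (103, 3_000)]

print(snapshot_as_of(history, 2_500))  # 102
print(snapshot_as_of(history, 3_000))  # 103
```

A query that reads everything through the returned snapshot sees one consistent table state, which is the behavior snapshot isolation describes.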

ACID Properties

| Property | Description |
| --- | --- |
| Atomicity | Ensures that all operations in a transaction either succeed completely or fail completely. |
| Consistency | Ensures that the table remains in a valid state after each transaction. |
| Isolation | Ensures that concurrent transactions do not interfere with each other. |
| Durability | Ensures that committed changes are permanently stored and survive system failures. |