Exploring AiiDA’s ORM architecture#

In this post, we’ll take a deep dive into the implementation of AiiDA’s Object-Relational Mapping (ORM) system. We’ll explore how AiiDA leverages SQLAlchemy to create a flexible ORM backend that abstracts database operations and separates concerns between the user-facing Python objects and the underlying database persistence.

Architecture overview: The multi-layer design#

AiiDA’s ORM follows a four-layer architecture that provides clean separation between the user interface, business logic, and data persistence:

┌─────────────────────┐
│   User interface    │  ← Node (Python ORM class)
│     (orm/nodes)     │
├─────────────────────┤
│ Abstract interface  │  ← BackendNode (Abstract base class)
│  (implementation)   │
├─────────────────────┤
│     ORM backend     │  ← SqlaNode (SQLAlchemy implementation)
│  (psql_dos/sqlite)  │
├─────────────────────┤
│  Database models    │  ← DBNode (SQLAlchemy models)
│      (models)       │
└─────────────────────┘

Each layer serves a distinct purpose:

  • User interface (Node): Provides a clean, Pythonic API that hides database complexity

  • Abstract interface (BackendNode): Defines contracts that all backend implementations must follow

  • ORM backend (SqlaNode): Backend connection to the different database systems

  • Database model (DbNode): Defines the actual table schemas and relationships using SQLAlchemy’s declarative approach

Importantly, you, the user, will typically not have to interact directly with database-specific code. Instead, you can work with the high-level Node class (and its child classes), while AiiDA automatically delegates database operations to the appropriate backend implementation. The design further allows AiiDA to support multiple database backends (currently PostgreSQL and SQLite), while providing a unified Python interface.

So let’s start from the bottom and work our way up, shall we?

The database models: DbNode#

AiiDA uses SQLAlchemy’s declarative approach to define database tables. This means the database schema is defined using Python classes rather than raw SQL. Here’s how the core DbNode model is constructed[1]:

from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import relationship
from sqlalchemy.schema import Column, ForeignKey
from sqlalchemy.types import DateTime, Integer, String, Text

class DbNode(Base):
    """Database model to store data for :py:class:`aiida.orm.Node`."""

    __tablename__ = 'db_dbnode'

    # Primary key and unique identifier
    id = Column(Integer, primary_key=True)
    uuid = Column(UUID(as_uuid=True), default=get_new_uuid, nullable=False, unique=True)

    # Node classification
    node_type = Column(String(255), default='', nullable=False, index=True)
    process_type = Column(String(255), index=True)

    # Metadata
    label = Column(String(255), nullable=False, default='', index=True)
    description = Column(Text(), nullable=False, default='')
    ctime = Column(DateTime(timezone=True), default=timezone.now, nullable=False, index=True)
    mtime = Column(DateTime(timezone=True), default=timezone.now, onupdate=timezone.now, nullable=False, index=True)

    # JSON storage for flexible data
    attributes = Column(JSONB, default=dict)
    extras = Column(JSONB, default=dict)
    repository_metadata = Column(JSONB, nullable=False, default=dict)

    # Foreign key relationships
    dbcomputer_id = Column(Integer, ForeignKey('db_dbcomputer.id'), nullable=True, index=True)
    user_id = Column(Integer, ForeignKey('db_dbuser.id'), nullable=False, index=True)

    # SQLAlchemy relationships
    dbcomputer = relationship('DbComputer', backref='dbnodes')
    user = relationship('DbUser', backref='dbnodes')

The Base class from which DbNode inherits is obtained via SQLAlchemy’s declarative_base(), which provides the metaclass and base functionality that allows Python classes to be automatically mapped to database tables:

Base = declarative_base(
  cls=Model,
  name='Model',
  metadata=MetaData(naming_convention=dict(naming_convention))
)

This declarative approach means that table schemas, constraints, relationships, and indexes are all defined in Python code rather than separate SQL files, making the database structure self-documenting and version-controllable alongside the application logic.

Some of the key features of AiiDA’s SQLAlchemy models, that can already be seen from the DbNode definition above, are:

  1. Use of JSON(B) for flexibility: The use of PostgreSQL’s JSONB (and SQLite’s JSON) type for attributes, extras, and repository_metadata provides schema flexibility while maintaining query performance. This is crucial for scientific computing where one can’t predict what data users will want to store. Traditional relational databases would require creating new columns for every new property, but JSON columns allow storing arbitrary structured data (given that it’s JSON-serializable).

  2. UUID-based identity: Each node has both an integer primary key (id in the table, pk in the Python API[2]) for database efficiency and a UUID for global uniqueness and portability. The integer id is user-friendly, fast for database joins and indexing, while the UUID allows nodes to be moved between different AiiDA installations without conflicts.

  3. Automatic timestamps: Creation and modification times are automatically managed through SQLAlchemy’s default and onupdate parameters.

  4. Indexing: Important columns like node_type, process_type, and timestamps are indexed for query performance. Without indexes, queries like “find all calculation nodes from last month” would be extremely slow. The indexes make these common queries fast even with millions of nodes.

The abstract interface: BackendNode#

Above the database layer sits the abstract BackendNode class, which defines the interface contract that all database backend implementations must follow:

class BackendNode(BackendEntity, BackendEntityExtrasMixin, metaclass=abc.ABCMeta):
    """Backend implementation for the `Node` ORM class.

    A node stores data input or output from a computation.
    """

    @abc.abstractmethod
    def clone(self: BackendNodeType) -> BackendNodeType:
        """Return an unstored clone of ourselves."""

    @property
    @abc.abstractmethod
    def uuid(self) -> str:
        """Return the node UUID."""

    @property
    @abc.abstractmethod
    def node_type(self) -> str:
        """Return the node type."""

    # ... more abstract properties and methods

This abstract base class (ABC) ensures that regardless of which ORM backend is used, the same interface is available to higher-level code (currently, AiiDA uses only SQLAlchemy, but also Django ORM was supported in the past[3]). It defines essential properties like uuid, node_type, process_type, and methods for managing attributes, links, and storage operations.

The abstract class serves as a contract —any backend implementation must provide all these methods and properties.

The ORM backend: SqlaNode#

The SqlaNode class bridges the abstract BackendNode interface with the concrete SQLAlchemy models:

class SqlaNode(entities.SqlaModelEntity[models.DbNode], ExtrasMixin, BackendNode):
    """SQLA Node backend entity"""

    MODEL_CLASS = models.DbNode
    USER_CLASS = SqlaUser
    COMPUTER_CLASS = SqlaComputer
    LINK_CLASS = models.DbLink

    def __init__(self, backend, node_type, user, computer=None, **kwargs):
        """Construct a new `BackendNode` instance wrapping a new `DbNode` instance."""
        super().__init__(backend)

        arguments = {
            'node_type': node_type,
            'user': user.bare_model,
            'label': kwargs.get('label', ''),
            'description': kwargs.get('description', ''),
        }

        if computer:
            arguments['dbcomputer'] = computer.bare_model

        self._model = sqla_utils.ModelWrapper(self.MODEL_CLASS(**arguments), backend)

    # ... more properties and methods

    def store(self, links=None, clean=True):
        session = self.backend.get_session()

        if clean:
            self.clean_values()

        session.add(self.model)

        if links:
            for link_triple in links:
                self._add_link(*link_triple)

        if not session.in_nested_transaction():
            try:
                session.commit()
            except SQLAlchemyError:
                session.rollback()
                raise
        else:
            session.flush()

        return self

SqlaNode wraps a raw DbNode SQLAlchemy model in a ModelWrapper that handles session management automatically. This wrapping provides two levels of access: node.model for the wrapped model with automatic session tracking, and node.bare_model for direct access to the raw SQLAlchemy model when you need to bypass AiiDA’s management. Beyond model wrapping, SqlaNode is responsible for actual database storage via its store method (see above). The design further uses generic typing to ensure type safety and proper relationships between related objects like users and computers.

The user interface: Node#

Finally, we reach the top level, the Node class, which is the main interaction point for users. The implementation uses composition to wrap an SqlaNode instance, while providing a Pythonic interface to users:

class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta):
    """Base class for all nodes in AiiDA."""

    def __init__(self, backend=None, user=None, computer=None, extras=None, **kwargs):
        backend = backend or get_manager().get_profile_storage()
        user = user if user else backend.default_user

        backend_entity = backend.nodes.create(
            node_type=self.class_node_type,
            user=user.backend_entity,
            computer=computer.backend_entity if computer else None,
            **kwargs
        )
        super().__init__(backend_entity)

The creation magic happens in backend.nodes.create() which returns an SqlaNode instance[4] that gets stored in the backend_entity attribute. The Node class then delegates all database operations to this contained SqlaNode through the backend_entity property. From the user’s perspective, just a Node is created—the backend selection and delegation are transparent.

Here’s how all of this looks like in action in a verdi shell:

In [1]: n = Int(1).store()  # of type: aiida.orm.nodes.data.int.Int

In [2]: n
Out[2]: <Int: uuid: 2247e32a-c802-449c-9246-35b74e066c3f (pk: 12) value: 1>

In [3]: n.backend_entity
Out[3]: <aiida.storage.psql_dos.orm.nodes.SqlaNode at 0x7d8ac6d53880>

In [4]: n.backend_entity.bare_model
Out[4]: <DbNode id=103522, uuid=UUID('70cd..., node_type='data.core..., process_type=None, label='', description='', ctime=datetime.d..., mtime=datetime.d..., attributes={'value': ..., extras={'_aiida_h..., repository_metadata={}, dbcomputer_id=None, user_id=1,>

In [5]: n.backend_entity.model
Out[5]: <aiida.storage.psql_dos.orm.utils.ModelWrapper at 0x7d8ac6c6fcd0>

Namespacing

AiiDA further uses a namespace pattern on the Node class to organize functionality:

class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta):
    """Base class for all nodes in AiiDA."""

    # ... more properties and methods

    @cached_property
    def base(self) -> NodeBase:
        """Return the node base namespace."""
        return NodeBase(self)

    # ... more properties and methods

class NodeBase:
    """A namespace for node related functionality."""

    @cached_property
    def repository(self) -> 'NodeRepository':
        """Return the repository for this node."""
        return NodeRepository(self._node)

    @cached_property
    def attributes(self) -> 'NodeAttributes':
        """Return an interface to interact with the attributes."""
        return NodeAttributes(self._node)

    @cached_property
    def links(self) -> 'NodeLinks':
        """Return an interface to interact with the links."""
        return NodeLinks(self._node)

This namespace pattern was introduced because having all properties and methods directly on the Node class (as was the case in the past) made the API cluttered and prone to name conflicts.

The namespace approach further groups related functionality together, e.g.:

  • node.base.attributes -> NodeAttributes - attribute management

  • node.base.repository -> NodeRepository - file repository operations

  • node.base.links -> NodeLinks - provenance links and the use of @cached_property ensures that these namespace objects are created only once per node instance[5].

Honorable mentions#

Pydantic models: Modern serialization#

Recently, pydantic models were integrated into AiiDA’s ORM for modern data validation and serialization:

class Node(Entity['BackendNode', NodeCollection]):
    """Base class for all nodes in AiiDA with Pydantic model support."""

    class Model(Entity.Model):
        """Pydantic model for Node serialization and validation."""

        uuid: Optional[str] = MetadataField(None, description='The UUID of the node')
        node_type: Optional[str] = MetadataField(None, description='The type of the node')

        attributes: Optional[Dict[str, Any]] = MetadataField(
            None,
            description='The node attributes',
            orm_to_model=lambda node, _: node.base.attributes.all,
            is_subscriptable=True,
            exclude_to_orm=True,
        )

        repository_content: Optional[dict[str, bytes]] = MetadataField(
            None,
            description='Dictionary of file repository content encoded as base64',
            orm_to_model=lambda node, _: {
                key: base64.encodebytes(content)
                for key, content in node.base.repository.serialize_content().items()
            },
            exclude_to_orm=True,
        )

    def serialize(self, repository_path: Optional[pathlib.Path] = None) -> dict[str, Any]:
        """Serialize the entity instance to JSON."""
        if repository_path is None:
            repository_path = pathlib.Path(tempfile.mkdtemp()) / f'./aiida_serialization/{self.pk}/'
            repository_path.mkdir(parents=True)
        return self._to_model(repository_path).model_dump()

    @classmethod
    def from_serialized(cls, **kwargs: dict[str, Any]) -> 'Node':
        """Construct an entity instance from JSON serialized data."""
        return cls._from_model(cls.Model(**kwargs))

The integration of pydantic brings various additional features and advantages:

  • Automatic validation: Pydantic automatically validates data types and constraints based on type annotations

  • Serialization: ORM objects can be automatically converted to/from JSON for export/import

  • API integration: Pydantic’s automatic serialization and validation capability allows for easy creation standard APIs (e.g., aiida-restapi) and integration with frameworks like FastAPI

The QueryBuilder#

The QueryBuilder is AiiDA’s main Python API to retrieve data from the database. It provides a uniform, backend-agnostic interface:

# Simple node query
qb = QueryBuilder()
qb.append(Node, filters={'ctime': {'>': datetime(2025, 1, 1)}})
results = qb.all()

# Complex relationship query
qb = QueryBuilder()
qb.append(StructureData, tag='structure')
qb.append(CalcJobNode, with_incoming='structure', tag='calc')
qb.append(FolderData, with_incoming='calc', project=['*'])

The QueryBuilder automatically handles several complex translation tasks behind the scenes. Most importantly, it converts the high-level query syntax into the appropriate SQL statement (while respecting the specific dialect of the database backend), ensuring that one can write database-agnostic queries while still leveraging each backend’s specific capabilities. For this, among other operations, it converts ORM classes like Node to their corresponding database entities, translates relationship specifications such as with_incoming and with_outgoing into proper SQL joins, and automatically adds filters for node types and subtypes based on the classes specified in the query.

The case for NodeCollection#

In addition to the QueryBuilder, AiiDA also implements the NodeCollection class which provides a clean interface for managing …wait for it… collections of nodes:

class NodeCollection(EntityCollection[NodeType], Generic[NodeType]):
    """The collection of nodes."""

    def delete(self, pk: int) -> None:
        """Delete a Node from the collection with the given id"""
        node = self.get(id=pk)

        if node.base.links.get_incoming().all():
            raise exceptions.InvalidOperation(f'cannot delete Node<{node.pk}> because it has incoming links')

        if node.base.links.get_outgoing().all():
            raise exceptions.InvalidOperation(f'cannot delete Node<{node.pk}> because it has outgoing links')

        self._backend.nodes.delete(pk)

The collection pattern provides several benefits that complement the QueryBuilder’s capabilities. On one hand, while the QueryBuilder is suitable for complex analytical queries and relationship traversal, collections can handle the simpler but essential task of managing individual entities efficiently. Thus, collections enable efficient bulk operations and batch processing that would be cumbersome with multiple QueryBuilder calls. Finally, they also provide essential validation logic, such as preventing deletion of nodes with existing links to maintain provenance integrity, and again hide database-specific operations behind a clean interface.

Key Takeaways#

AiiDA’s ORM architecture provides the foundation for all derived data types, for both, data and processes. More specialized types like orm.Int, orm.CalcJobNode, and orm.SinglefileData all build on Node and the functionality it provides. The implementation makes use of several important design principles:

  1. Multi-layer abstraction: The multi-layer design achieves a clean separation and allows for a uniform API, while leveraging backend-specific optimizations.

  2. Multi-database support: The architecture supports both PostgreSQL and SQLite backends through SQLAlchemy’s dialect system.

  3. Use of JSON columns: JSON(B) columns provide schema flexibility without sacrificing query performance.

  4. Namespace organization: The namespace pattern on the user-facing Node class keeps the API clean by structuring it in distinct, nested categories.

  5. Modern integration: Pydantic model integration brings modern Python data validation and serialization capabilities to the ORM.

  6. Universal query interface: The QueryBuilder allows for identical query syntax across database backends.

  7. Collection patterns: The collection system provides a consistent, intuitive interface for data access that complements the QueryBuilder.

This architecture allows AiiDA to provide a powerful and flexible ORM that supports different database backends while maintaining a consistent user experience.

Footnotes