Building a Graph Database from Scratch in C: Architecture of a Research Storage Engine

This post was AI-generated from the project’s source code, thesis, and documentation. It is an automated summary, not original writing.

Most developers interact with databases through query languages and APIs. Few ever think about what happens below the surface – how bytes are arranged in files, how pages are loaded into memory, or why the order of records on disk can make one query ten times faster than another.

This post walks through the architecture of a graph database built entirely from scratch in C. It was designed not as a production system, but as a research instrument: a transparent, fully instrumented storage engine where every disk read can be traced, every cache eviction logged, and every record rearranged to test hypotheses about data locality. The project accompanies a master’s thesis at the University of Konstanz and weighs in at roughly 15,000 lines of C.

Design Constraints

Before diving into the architecture, it’s worth understanding the constraints that shaped it:

The Four Layers

The architecture follows the classic database layered pattern, with four distinct modules stacked on top of each other:

  +--------------------+     +------------------+
  |   Layout Methods   |     |     Queries      |
  +--------------------+     +------------------+
  |            Layout Utilities                  |
  +----------------------------------------------+
  |                  Access                      |
  +----------------------------------------------+
  |                  Cache                       |
  +----------------------------------------------+
  |                    I/O                       |
  +----------------------------------------------+

Layer 1: I/O (Disk Space Manager)

At the bottom sits the physical I/O layer. It provides two abstractions:

Files consist of a set of pages. A file tracks which pages are in use via a page directory, and can grow or shrink dynamically. The interface exposes read_page, write_page, grow, shrink, and clear_page operations. Underneath, it uses POSIX fopen, fread, fwrite, fseek, ftruncate, and fsync.

The Disk Manager sits above individual files, providing create_file, get_file, remove_file, and get_slots operations. It manages the collection of database files and coordinates space allocation.

Every read and write operation can be logged to a dedicated file, allowing precise measurement of I/O patterns.

Layer 2: Cache (Buffer Manager)

The page cache mediates between the I/O layer and everything above it. It splits a fixed amount of RAM into frames, each the size of one page. Its responsibilities:

The cache can be resized at runtime (via page_cache_change_n_frames), and its log file can be swapped – both features designed for running comparative experiments.

Layer 3: Access (Record Structures and Operators)

This is where raw bytes become graph elements. The access layer provides:

Node records – simplified versions of Neo4j’s 15-byte node format. Each node has:

Relationship records – inspired by Neo4j’s 34-byte format. Each relationship stores:

The Heap File provides the CRUD interface: create_node, read_node, update_node, delete_node (and equivalents for relationships). Record IDs are implicit – they encode the record’s byte offset in the file, so address translation is just multiplication. No B-tree index is needed.

The critical operation for graph traversal is expand: given a node ID and a direction (OUTGOING, INCOMING, or BOTH), it follows the incidence list pointers to collect all connected edges. This is where the linked-list structure of the storage format meets the traversal algorithms above.

Layer 4: Queries and Layout Methods

The top layer is split into two parallel modules:

Queries implement graph algorithms that use the access layer:

Each algorithm is instrumented to log the sequence of node and relationship IDs it accesses. This access log is the raw data for evaluating layout quality.

Layout Methods implement record rearrangement strategies:

The layout method interface is deliberately simple: given a database, return a mapping from current record IDs to new record IDs. The layout utilities handle the actual record swapping.

The Record Layout Interface

One of the cleanest design decisions is the layout method contract:

unsigned long[] layout_nodes(graph_database db) {
    require db != null;
    require db.num_nodes > 0;
    ensure db.num_nodes == result array length;
    ensure unique node IDs;
}

Any new layout method just needs to implement this function (and the equivalent for relationships). It receives the full database and returns a permutation. The framework handles reordering records according to the permutation, including updating all cross-references (incidence list pointers, node-to-relationship links).

This separation of “what order” from “how to reorder” makes adding new methods straightforward.

Data Structures

The project includes a substantial library of generic data structures, all in C:

Each data structure follows a consistent API pattern: create, destroy, size, insert/append, remove, get, contains, with iterators for hash-based structures.

The Cost Model

Rather than measuring wall-clock time (which depends on hardware, OS scheduling, and background processes), the evaluation uses a deterministic cost model based on block access counting.

Given the sequence of accessed record IDs logged during a query:

  1. Compute the block number for each access: block = (record_id * record_size) / block_size
  2. Count a block I/O whenever the block number changes between consecutive accesses
  3. Count a non-sequential I/O whenever the block number changes by more than one

The cache is modeled as holding exactly one block (worst case). This gives two clean metrics: total block accesses (measuring temporal locality) and non-sequential accesses (measuring spatial locality and prefetch-friendliness).

Setting the spatial locality radius to 8 blocks models a system with 512-byte blocks and 4096-byte pages – realistic for actual storage hardware.

Benchmarking Workflow

The intended workflow for evaluating a new layout method:

  1. Import a SNAP dataset into the database
  2. Run all queries on the natural (insertion-order) layout, logging accesses
  3. Apply the layout method to rearrange records
  4. Optionally sort the incidence lists
  5. Re-run all queries, logging accesses
  6. Compare block access counts and visualize access sequences

The visualization tools (Python with matplotlib/numpy) produce bar charts comparing total I/O across methods and line charts showing 50-access windows to reveal whether access patterns are sequential or jumping.

Lessons from the Implementation

A few interesting engineering observations from the codebase:

Implicit addressing is powerful. Because record IDs directly encode file offsets (ID * record_size = byte offset), there’s no need for an index to find a record. This is a key advantage of fixed-size records and is exactly how Neo4j works. The downside: moving a record changes its ID, so all references to it must be updated.

Doubly-linked incidence lists are expensive to maintain but cheap to traverse. Each relationship participates in two incidence lists simultaneously (one for the source node, one for the target node). Insertion is O(1) at the head, deletion is O(1) with the ID known. But the pointer-chasing nature of traversal is exactly what makes record order matter so much.

The Louvain method isn’t just for community detection. Using it as a record layout strategy (group records by community) is creative, but the lack of block-size awareness means communities don’t map cleanly to disk blocks. A community of 1000 nodes spans hundreds of blocks with no control over their ordering.

ICBL’s memory problems are fundamental, not incidental. The hierarchical clustering step requires a pairwise distance matrix that scales as |V|^2. For 300,000 nodes, that’s 335 GiB. The original papers didn’t address this, making the method impractical for the very scale of data where layout optimization matters most.

Who Is This For?

The project is designed for three audiences:

The code builds with CMake, compiles with Clang, includes unit tests, and has Doxygen documentation for the I/O and cache modules. It uses only two external libraries (curl for downloading SNAP datasets, zlib for decompression).

If you’ve ever wondered what happens below the query optimizer – how bytes are arranged on disk and why it matters – this is a good place to start looking.


Full document: Thesis (PDF)

This post describes the architecture of the Graph Record Layout Research Environment, part of the master’s thesis “Locality Optimization for Traversal-based Queries on Graph Databases” by Fabian Klopfer, University of Konstanz, 2020.