Flashix: KIV specifications

This page describes the KIV projects, specifications and proofs of the different layers in the construction of a verified Flash file system, notably the AFS and VFS layers described in A Formal Model of a Virtual Filesystem Switch [1] (with some small modifications), and the POSIX model + refinement proof described in Verification of a Virtual Filesystem Switch [2].

The effort is structured into layers that are connected by refinement, corresponding to the various logical parts of the file system. The top level is an abstract formal model of the file system interface as defined in the POSIX standard, serving as the specification of the functional requirements, i.e., what it means to create/remove a file/directory and how the contents of files are accessed.

The POSIX specification is implemented by a combination of VFS (generic part) and AFS (operational specification of the file system specific parts). AFS is then further refined to an abstract model of the UBIFS file system, independently of VFS.

At the bottom, UBIFS relies on UBI to provide wear-leveling and Linux MTD as an interface to the hardware.

Click on the layers in the diagram on the left to get a short description of the respective formal model.

The POSIX specification defines an algebraic directory tree with an indirection for hard-links to files at the leaves.

Directories and files are accessed by paths. The operations considered are create, link, unlink, mkdir, rmdir, open, close, read, write, truncate, readmeta (corresponds to stat), setmeta (corresponds to chmod, chown, ...), rename, and readdir.

Formal Model

The Virtual Filesystem Switch realizes aspects common to all file systems, in particular path lookup, access permission checks, mapping of file content to pages, open file handles and orphaned files. It depends on a file system implementation (precisely, on an abstract description of file system behavior) to provide lower-level operations.

Formal Model

The abstract file system specification (AFS) defines the expected behavior of a file system implementation. Its operations roughly correspond to the POSIX level operations, however, the modifications are local and are defined in terms of inodes instead of paths.

Formal Model

The LOGFS specification introduces the conceptual separation of persistent data on flash memory and data stored in RAM to study the effect of power loss.

Formal Model

Abstract UBIFS introduces flash file system specific concepts, in particular flat indices for file system objects, accessed by keys.

Formal Model

The UBIFS model splits up further subcomponents, and introduces caching of several data structures, in particular, it maintains only a partial RAM index, and tracks the set of orphaned files.

The RAM index stores a (partial) mapping from keys to addresses of file system objects. The Flash index is an old copy of that information, created during the last commit.

Purpose of the Journal/LOG is to write file system objects to flash memory in a sequential fashion, and to deal with the fact that flash is structured into erase blocks. It has a subcomponent, the write buffer (given as abstract specification) that maps file system objects to their representation as byte arrays.

The orphan component stores a set of inode numbers that have been unlinked completely from the file system tree and must be deleted after the last file handle is closed. This information is updated incrementally and written to flash during commit.

Properties (logical) Blocks are implemented by this module, examples are the space that is free or used.

The write buffer implements write buffer functionality on top of the UBI interface.

The abstract UBI specification defines the interface through with flash memory is accessed by the upper layers. Blocks are addressed logically and can be written (append only), read (randomly), and erased (as a whole).

The UBI model links the UBI interface to the Flash driver. It implements the mapping from logical to physical erase blocks (which is updated transparently to the upper layers). Its main purpose is wear-leveling, i.e., to distribute erases evenly between all physical blocks of the flash medium.

The MTD model specifies the Linux memory technology device model. It represents the lowest level in our refinement chain and encodes all assumptions about the hardware and the driver implementation.

Index

Data Model
POSIX
Virtual Filesystem Switch
Abstract File System
Refinement from POSIX to VFS
LOGFS (crash recovery)
Eraseblock Management Layer (EBM) Specification
MTD
Eraseblock Management Layer Implementation
Refinement from EBM Specification to Implementation
Scala Simulation

Data Model

Data structures that are shared between projects are defined in a common library

Inodes
Directory Entries
Pages
Errors and a classification ⌊.⌋ into low-level errors, that may arise nondeterministically.
Permissions are defined with respect to abstract metadata (and a user who will be chosen arbitrarily in the ASMs). Permissions are lifted to paths, leaving the file system state generic (which is instantiated later).

Differences to [1] and [2]

The presentation in the publications is slightly different and some minor technical aspects have been neglected. Here is a list of the differences:

Path components are separated by + instead of / and are technically instances of lists.
Pages and the content of files is modeled as arrays, not lists (through an instance buffer).
Partial function update is written as f[x,y] instead of f[x ↦ y], similarly tree update is written as t[p,t], allocation is ++, deallocation is --
Slightly different names, e.g. the constructor fdata is called mkfile
Splice, resize and copy are algebraic functions instead of procedures
File identifiers are defined by an injection from inode numbers. Consequently, file handles need a technical mapping as well abs-of-*.
Several conversions/injections between streams, buffers (= arrays) and lists have been omitted (see here and here)
All generic types (with a type-parameter) are instantiated explicitly, renaming the type name, in particular, partial function types such as for fs or dirs
There is an explicit recursion for the abstraction of the subtrees of a directory instead of iterated separating conjunction, see abs-rec

POSIX

See here for an overview of the specifications in the POSIX project. At the top level is the definition of the Abstract State Machine posix-asm that contains all top-level operations as well as some helper procedures.

The algebraic directory tree is split up in the data-type definition, definition of accessors, and the function fids,

Precondition predicates are defined here.

The POSIX main invariant (called fids-cons) is defined in specification posix-base. Proofs for this invariant can be found here, in particular see posix-op-fids-cons.

Virtual Filesystem Switch

See here for an overview of the specifications in the VFS project. At the top level is the definition of the Abstract State Machine vfs-asm that contains all top-level operations as well as some helper procedures.

Various functions and predicates are defined in vfs-base,

The state invariant of-cons: all open file handles point to valid files.
An abstraction of file content to a stream of bytes.

You can find all proofs about operations here.

All termination lemmas op-term ensure that preconditions of AFS are established, since these abort otherwise.
The lemmas vfs-op-afs-cons and vfs-op-of-cons ensure preservation of invariants.
Several (tricky) postconditions have been proven as well, notably for walk, read, and write.

Abstract File System

See here for an overview of the specifications in the AFS project.

The two state variables dirs and files are defined as instances of heaps (i.e., partial functions), that map to dirs resp. files.

The ASM defines the AFS operations.

The specification afs-base defines

The invariant afs-cons, which is proven for each operation, see here.
A predicate unique-parent, that is equivalent to the invariant given in the paper.

AFS operations (see below) are wrapped here to provide nondeterministic errors and precondition checking.

Refinement from POSIX to VFS

The refinement project is here.

Specification abs defines the abstraction functions, specification R aggregates invariants and abstraction into a predicate R. Various rewrite rules and theorems are lifted from the abstraction function (see here) to consequences of R (see here).

The latter ones are important for proof automation of the refinement proofs. Note that the proof for rename has not been completed yet, which we will do during the next days.

The proofs for read and write rely on invariants, defined here as predicates.

Finally, the generic refinement theory is instantiated, by showing the commuting diagrams, in particular this lemma.

LOGFS

The formal model can be found here. Basically, the AFS state is duplicated into a current version corresponding to the data in RAM, and an outdated version that represents the state of the last commit. Operations manipulate only the RAM state RDIRS, RFILES. The difference to the flash state is recorded in an append-only log LOGLI, which is a list of log entries. These correspond roughly to the different operations.

The main invariant is replay consistency. It specifies that the replay operation recovers the current RAM state from persistent data and the log, where replaying one log entry is a case distiction over the type of the entry, see here. Proofs for this invariant are here. They rely on a splitting lemma for replay, which in turn is established by several commutations.

The main difficulty of this model is to deal with orphaned files, which have been referenced by a file handle previously and must be deleted during the replay. Note that during commit, the current set of orphans is written to Flash.

Further functions and consistency invariants on the state are defined here

orphans denotes the set of orphaned files/directories
orphans-cons, subdirs-cons and logli-cons are disjointness properties of different parts of the state

A more systematic description of this model is yet to be done.

Eraseblock Management Layer (EBM) Specification

The Eraseblock Management Layer model provides read, write and erase functionality on logical eraseblocks (LEBs). Furthermore, it supports asynchronous erasure via unmap. The project (see here for a graphical overview) is structured as follows:

Logical eraseblocks are defined in specification leb.
Preconditions are specified in ebm-pre.
The invariant and initial state are defined in ebm-base.
Specification ebm-ops defines the abstract behavior of the operations. The preconditions and invariant proofs can be found in specification ebm-asm. The write operation is more complex for two reasons: Writing to an unmapped logical eraseblock is not prohibited, but instead leads to a mapping on demand. Second, for the verification of upper layers it is desirable that all modifications to the state are done in one atomic step.
The behavior wrt. power failures is specified in ebm-recover.

MTD

Specification peb contains the definition of a physical eraseblock (PEBs)
The invariant and initial state of the Flash device driver are defined in specification mtd-base
Specification mtd-asm. provides an abstract implementation of the Flash device driver's operations and contains the proofs that the invariant is maintained by the operations. Preconditions are checked by each operation. Violations lead to non-termination (abort).

The existence of a trace that leaves the flash device unmodified is shown explicitly, see theorems ex-trace-op in the theorem base of specification mtd-asm.

Eraseblock Management Layer Implementation (UBI)

The Unsorted Block Images layer consists of two projects.

UBI I/O

The first project (see here for a graphical overview) is an abstract I/O layer (see specifications ubi-io-ops, ubi-io-inv, ubi-io-pre for operations, preconditions and invariants) responsible for en/decoding of EC/VID-headers. It splits the physical eraseblocks of MTD into three distinct data structures: the ec-headers, the vid-headers and the data area (see specifications aecheader, avidheader, apeb). Its state is an array APEBS of elements of type apeb. The implementation of this I/O layer (see specifications ubi-io-impl, echeader, vidheader, data-conv for the data structures and conversion operations) uses the MTD operations. For reads several retries are attempted. The refinement between the abstract I/O specification and the implementation is proven in specifications abs and refinement. The abstraction relates any bad abstract erase block to a bad concrete erase block, because bad blocks are not read/written to and the contents of the abstract erase block therefore may be arbitrary.

Again, existence of a trace that leaves the state unmodified is shown explicitly, see theorems ex-trace-op in the theorem base of specification ubi-io-ops.

UBI

The second project (see here for a graphical overview) consists of the UBI implementation on top of the abstract I/O layer. The operations are implemented in specification ubi-ops. The UBI state consists of

the mapping VOLS,
wear-leveling array WLARRAY,
erase queue ERASEQ,
the index of the PEB used as the volume table VTBLPNUM
maximum sequence number SQNUM and

in addition to the state APEBS of the abstract I/O layer. The wear-leveling array is used to store a status (free, used, erasing, erroneous) for each PEB and caches the erase counter of a good PEB. Invariants are specified in ubi-base. The crucial invariants are:

unique-sqns(APEBS): sequence numbers stored in PEBs VID-header are unique for each mapping, i.e., there are no two PEBs with the same sequence number for a mapping (v,l)
valid-wl(APEBS, WLARRAY): asserts that every used (i.e. mapped) PEB has the maximum sequence number for its mapping
valid-ebatbl(APEBS, WLARRAY, VOLS, VTBLPNUM): states that a PEB is mapped at (v,l) iff it is marked as used in the wear-leveling array and the VID-header stores the mapping (v,l)

Additional predicates are used to specify that the PEB containing the volume table matches the current in-RAM mapping (valid-vtbl) and constrain the erase queue (three types of valid-eraseq). The predicate ubi-cons(APEBS, VOLS, WLARRAY, ERASEQ, VTBLPNUM, SQNUM) is defined as equivalent to the conjunction of all invariants.

It proved easier to use a multiset of sequence numbers to specify uniqueness and maximality of them uniformly. The function valid-sqns(v, l, APEBS) (see specification sqns is defined recursively over APEBS and returns the multiset of all sequence numbers of valid PEBs. Validity is defined in specification valid-peb

The operation to format an initial device as well as the recovery from an unexpected power failure are defined in specification ubi-recover. The operation format is the initialization operation, which writes headers to each PEB and writes an initial version of a volume table without any preexisting volumes.

The invariant proofs for each normal operations is in the theorem base of specification ubi-ops.

The main correctness theorem crash-recover about the recovery can be found in specification ubi-crash-recovery. It asserts that after a crash and subsequent recovery, a state is reached, which

has a maximal volume
the set of by the erase queue referenced LEBs did not increase
if a PEB n is mapped for the LEB (v, l) and was not mapped beforehand, then the erase queue before the crash contained a PEB with last known mapping (v,l), (Therefore, the LEB could not have been 'erased' in the abstract state before the crash)
satisfies the invariant

Refinement from the EBM Specification to Implementation

The refinement is in the project ubi-refines-ebm.

The abstraction relation is defined in specification abs and the refinement proofs in the specification refinement.

The refinement between the recovery operations is proven in specification crash-recovery. The theorem basically follows directly from the above observations.

Scala Simulation

Download from here.

The Simulation code is in src/de/uniaugsburg/flashix/:

Package fuse contains the glue code
Package common provides some data types
Package vfs implements the VFS layer on top of afs
Package posix defines a POSIX-like interface as a Scala-Trait

Note that the package might be slightly out of sync with the formal models.

The main class is

de.uniaugsburg.flashix.fuse.Mount

Requirements

Scala (tested with version 2.9.2)
Fuse (tested with version 2.9.1)
Fuse-J (tested with version 2.4-prerelease1)

Included as lib/fuse-j.jar.

Native library only included for x86_64 (sorry).

To build the native library for ix86, download the package from http://sourceforge.net/projects/fuse-j/ as well as the fuse development package. Compile it (ant) and copy

./jni/libjavafs.so

to lib (or lib64 on 64 bit machines, the correct one will be picked automatically).

fuse-j-2.4-prerelease1 has a compile bug with Java 7. You can copy the updated build script files in folder "patch" and the patch to the jni subfolder of fuse to work around this issue.

Build the Simulation

./build.sh

Run the Simulation

./run <mountpoint> [-d]

where specifies the directory to mount and -d optionally enables debug output and prevents forking to background

fusermount -u <mountpoint>

removes the mount (and terminates the simulation). Please note that it's not the other way round, i.e., killing the simulation does not unmount.