Diffstat (limited to 'docs/developer/server/data-models.md')
| -rw-r--r-- | docs/developer/server/data-models.md | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/docs/developer/server/data-models.md b/docs/developer/server/data-models.md
new file mode 100644
index 0000000..f7d42dd
--- /dev/null
+++ b/docs/developer/server/data-models.md
@@ -0,0 +1,68 @@
+# Data models
+
+## Synchronized data
+
+Pilcrow is primarily a synchronization service for conversations, messages, and other social data. That synchronization is accomplished by expressing all changes to social data in terms of events, which are generated in response to user actions, recorded in sequence by Pilcrow, and published back to users.
+
+Internally, the Pilcrow server breaks events up into _histories_. A history is a sequence of events that determines the state of a single entity, which can be considered in isolation from other events occurring within the system to make the majority of decisions. For example, a message's history contains everything you need to know to reconstruct a message at any point in time, and to make decisions about operations like deleting or editing the message, since those decisions do not depend on the overall state of the service.
+
+No model is perfect, and some operations will inevitably need to consider multiple histories. In general, this is acceptable if it's exceptional; however, if two histories end up consulted together on a regular basis, it may be worth reviewing whether they should be combined into a single history. Equally, histories can be split up, if they have been combined in ways that turn out to be less useful.
+
+### History storage
+
+The underlying storage generally reflects the specific structure of the kind of history it deals with, rather than being general. Operators and developers should be able to reason about a history directly from the single row, or the small set of rows, needed to reconstruct it.
+
+For example, a message history is represented by:
+
+- A row in `message`, which contains the information needed to reconstruct the `message` `sent` event, and
+- Optionally, a row in `message_deleted`, which can be joined to `message` to provide the information needed to reconstruct a `message` `deleted` event if the message has actually been deleted.
+
+The resulting stored data reads very much like a classic, state-based relational model. `select * from message left join message_deleted using (id) where id = 'Mw9j2ttdx1f68k71'` gives message `Mw9j2ttdx1f68k71` in full detail:
+
+```
+              id = Mw9j2ttdx1f68k71
+    conversation = Cshp86chf87fypf6
+          sender = L3twnpw918cftfkh
+   sent_sequence = 4839
+         sent_at = 2025-06-02T05:49:36.418527558+00:00
+            body = Oof those headings are too big
+   last_sequence = 4839
+deleted_sequence =
+      deleted_at =
+```
+
+These rows can be further joined to other rows in the database. Pilcrow's schema takes advantage of this for referential integrity checks, where possible, as well. Similar patterns arise in the implementation of other history types.
+
+### History operations
+
+A history type generally exposes a few key operations:
+
+- It can be loaded from an appropriate repository. This is usually provided by the repository interface, rather than by the history type itself.
+
+- It can be converted into a sequence of events (generally via an `events()` method), which can then be combined with other histories' events to provide a service-wide view of historical events.
+
+- It can produce snapshots of the underlying entity (generally via methods like `as_created()`, `as_of(instant)`, `as_snapshot()`, and similar) at various points in time.
+
+However, there is not, at this time, a common trait or type used for all histories, as no code in Pilcrow tries to abstract over them. We've found that generalizations tend to happen at the level of event sequences or streams, rather than at the level of the histories used to produce those sequences.
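The storage layout and the history operations described above can be illustrated with a minimal sketch. This is not Pilcrow's implementation: it is written in Python against an in-memory SQLite database purely for illustration. Only the column names and the `left join` query come from the text; the column types and constraints, the `MessageHistory` class, the `load_message_history` function, and the shape of the event dictionaries are all hypothetical. The sketch also assumes that the "instant" passed to `as_of` is a sequence number, which the text does not confirm.

```python
import sqlite3
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical schema: column names follow the query output shown in the
# document; types and constraints are assumptions made for this sketch.
SCHEMA = """
create table message (
    id text primary key,
    conversation text not null,
    sender text not null,
    sent_sequence integer not null,
    sent_at text not null,
    body text not null,
    last_sequence integer not null
);
create table message_deleted (
    id text primary key references message (id),
    deleted_sequence integer not null,
    deleted_at text not null
);
"""


@dataclass(frozen=True)
class MessageHistory:
    """Hypothetical history type for a single message."""

    id: str
    conversation: str
    sender: str
    sent_sequence: int
    sent_at: str
    body: str
    deleted_sequence: Optional[int] = None
    deleted_at: Optional[str] = None

    def events(self) -> List[Dict[str, Any]]:
        """Rebuild this history's events, oldest first."""
        events = [{
            "sequence": self.sent_sequence, "type": "message", "event": "sent",
            "id": self.id, "at": self.sent_at, "conversation": self.conversation,
            "sender": self.sender, "body": self.body,
        }]
        if self.deleted_sequence is not None:
            events.append({
                "sequence": self.deleted_sequence, "type": "message",
                "event": "deleted", "id": self.id, "at": self.deleted_at,
            })
        return events

    def as_of(self, sequence: int) -> Optional[Dict[str, Any]]:
        """Snapshot at a sequence number (assumed meaning of "instant").

        Returns None if the message was not yet sent or was already deleted.
        """
        if sequence < self.sent_sequence:
            return None
        if self.deleted_sequence is not None and sequence >= self.deleted_sequence:
            return None
        return {"id": self.id, "conversation": self.conversation,
                "sender": self.sender, "body": self.body}


def load_message_history(db: sqlite3.Connection, message_id: str) -> Optional[MessageHistory]:
    """Load one history from the small set of rows that represents it."""
    row = db.execute(
        "select * from message left join message_deleted using (id) where id = ?",
        (message_id,),
    ).fetchone()
    if row is None:
        return None
    return MessageHistory(
        id=row["id"], conversation=row["conversation"], sender=row["sender"],
        sent_sequence=row["sent_sequence"], sent_at=row["sent_at"],
        body=row["body"], deleted_sequence=row["deleted_sequence"],
        deleted_at=row["deleted_at"],
    )


db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row  # allows access to columns by name
db.executescript(SCHEMA)
db.execute(
    "insert into message values (?, ?, ?, ?, ?, ?, ?)",
    ("Mw9j2ttdx1f68k71", "Cshp86chf87fypf6", "L3twnpw918cftfkh", 4839,
     "2025-06-02T05:49:36.418527558+00:00", "Oof those headings are too big", 4839),
)
history = load_message_history(db, "Mw9j2ttdx1f68k71")
for event in history.events():
    print(event["sequence"], event["type"], event["event"])
```

Because the whole history fits in one joined row, both the event sequence and the point-in-time snapshot are derived without consulting any other part of the database, which is the property the text attributes to histories.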
+
+### Soft and hard deletes
+
+One consequence of using events to synchronize data is that the service _must_ retain enough information to tell clients that a given thing has been deleted, at least temporarily, so that clients can remain in sync. This shows up in histories, and is even illustrated above: a deleted entity must have a non-deleted history, or the service cannot tell clients about deletion after the fact.
+
+To limit data growth, the server hard-deletes ("purges") histories for deleted entities after a while. After this happens, clients can no longer synchronize old state for that entity. Clients that get into this state must start over from the beginning of the event stream and rebuild their state.
+
+To manage social responsibility, the server aggressively discards data in live histories for deleted entities, as well. For example, a message's body is blanked when the message is deleted. Clients that synchronize with the server after that point will see a `message` `sent` event, but rather than containing the original body, it will contain a placeholder. In general, users who delete entities expect the server to stop disclosing the contents of those entities, even retroactively, and this compromise accomplishes that.
+
+### Paths not taken: generic event storage
+
+One reasonable response to the observation that Pilcrow is a synchronization service built on events is to propose storing events as the basic data primitive. When the server does need to make decisions based on moment-in-time state, that state can be derived from the stored events in the same way a Pilcrow client is expected to derive its own state. That has a pleasant symmetry, and would work.
+
+However, prior experience suggests that a data model consisting of an undifferentiated list of events is difficult for developers to work with. For clients, we've chosen to adopt this burden, in return for keeping Pilcrow's API orthogonal and reducing the benefit of being the "first-party" client. For the server, however, it is useful for developers to be able to determine at a glance what the state of a specific entity is, without reference to the whole recorded history.
+
+Prior experience also suggests that the same pressure acts on operators. While it's not built for this purpose, the underlying database _is_ necessarily a user interface, facing operators, who will occasionally opt to intervene in it to try to address operational needs or fix problems. They do so with incomplete understanding of the system, so the more the data model can guide them in the direction of correct changes, the more useful it is and the more successful they will be.
+
+And, finally, storing an event stream makes checking the data for consistency, both in the moment and after the fact, _very_ difficult. One cannot, for example, express the constraint that a message event is only valid if the conversation it addresses exists, whereas this constraint is utterly mundane in noun-based "current state only" data models.
+
+## Non-synchronized data
+
+Some data (as of this writing, invites and authentication tokens) are not synchronized to clients. Instead, Pilcrow uses a more classical data modelling approach of storing only their current state, or nothing, in its database, and operates only on the current state regardless of the passage of time.
