| Commit message (Collapse) | Author | Age |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The semantics of this endpoint are somewhat complex, and are incompletely captured in the associated docs change. For posterity, the intended workflow is:
1. Obtain Pilcrow's current VAPID key by connecting (it's in the events, either from boot or from the event stream).
2. Use the browser push APIs to create a push subscription, using that VAPID key.
3. Send Pilcrow the push subscription endpoint and keys, plus the VAPID key the client used to create it so that the server can detect race conditions with key rotation.
4. Wait for messages to arrive.
This commit does not introduce any actual messages, just subscription management endpoints.
When the server's VAPID key is rotated, all existing subscriptions are discarded. Without the VAPID key, the server cannot service those subscriptions. We can't exactly notify the broker to stop processing messages on those subscriptions, so this is an incomplete solution to what to do if the key is being rotated due to a compromise, but it's better than nothing.
The shape of the API endpoint is heavily informed by the [JSON payload][web-push-json] provided by browser Web Push implementations, to ease client development in a browser-based context. The idea is that a client can take that JSON and send it to the server verbatim, without needing to transform it in any way, to submit the subscription to the server for use.
[web-push-json]: https://developer.mozilla.org/en-US/docs/Web/API/PushSubscription/toJSON
Push subscriptions are operationally associated with a specific _user agent_, and have no inherent relationship with a Pilcrow login or token (session). Taken as-is, a subscription created by user A could be reused by user B if they share a user agent, even if user A logs out before user B logs in. Pilcrow therefore _logically_ associates push subscriptions with specific tokens, and abandons those subscriptions when the token is invalidated by
* logging out,
* expiry, or
* changing passwords.
(There are no other token invalidation workflows at this time.)
Stored subscriptions are also abandoned when the server's VAPID key changes.
|
| |
|
|
|
|
| |
The `web-push` crate's VAPID signing support requires a private key. The `p256` crate is more than capable of generating one, but the easiest way to get a key from a `p256::ecdsa::SigningKey` to a `web_push::PartialVapidSignature` is via PKCS #8 PEM, not via the bytes. Since we'll need it in that form anyways, store it that way, so that we don't have to decode it using `p256`, re-encode to PEM, then decode to `PartialVapidSignature`.
The migration in this commit invalidates existing VAPID keys. We could include support for re-encoding them on read, but there's little point: this code is still in flux anyways, and only development deployments exist. By the time this is final, the schema will have settled.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
VAPID is used to authenticate applications to push brokers, as part of the [Web Push] specification. It's notionally optional, but we believe [Apple requires it][apple], and in any case making it impossible to use subscription URLs without the corresponding private key available, and thus harder to impersonate the server, seems like a good security practice regardless.
[Web Push]: https://developer.mozilla.org/en-US/docs/Web/API/Push_API
[apple]: https://developer.apple.com/documentation/usernotifications/sending-web-push-notifications-in-web-apps-and-browsers
There are several implementations of VAPID for Rust:
* [web_push](https://docs.rs/web-push/latest/web_push/) includes an implementation of VAPID but requires callers to provision their own keys. We will likely use this crate for Web Push fulfilment, but we cannot use it for key generation.
* [vapid](https://docs.rs/vapid/latest/vapid/) includes an implementation of VAPID key generation. It delegates to `openssl` to handle cryptographic operations.
* [p256](https://docs.rs/p256/latest/p256/) implements NIST P-256 in Rust. It's maintained by the RustCrypto team, though as of this writing it is largely written by a single contributor. It isn't specifically designed for use with VAPID.
I opted to use p256 for this, as I believe the RustCrypto team are the most likely to produce a correct and secure implementation, and because openssl has consistently left a bad taste in my mouth for years. Because it's a general implementation of the algorithm, I expect that it will require more work for us to adapt it for use with VAPID specifically; I'm willing to chance it and we can swap it out for the vapid crate if it sucks.
This has left me with one area of uncertainty: I'm not actually sure I'm using the right parts of p256. The choice of `ecdsa::SigningKey` over `p256::SecretKey` is based on [the MDN docs] using phrases like "This value is part of a signing key pair generated by your application server, and usable with elliptic curve digital signature (ECDSA), over the P-256 curve." and on [RFC 8292]'s "The 'k' parameter includes an ECDSA public key in uncompressed form that is encoded using base64url encoding. However, we won't be able to test my implementation until we implement some other key parts of Web Push, which are out of scope of this commit.
[the MDN docs]: https://developer.mozilla.org/en-US/docs/Web/API/PushSubscription/options
[RFC 8292]: https://datatracker.ietf.org/doc/html/rfc8292#section-3.2
Following the design used for storing logins and users, VAPID keys are split into a non-synchronized part (consisting of the private key), whose exposure would allow others to impersonate the Pilcrow server, and a synchronized part (consisting of event coordinates and, notionally, the public key), which is non-sensitive and can be safely shared with any user. However, the public key is derived from the stored private key, rather than being stored directly, to minimize redundancy in the stored data.
Following the design used for expiring stale entities, the app checks for and creates, or rotates, its VAPID key using middleware that runs before most API requests. If, at that point, the key is either absent, or more than 30 days old, it is replaced. This imposes a small tax on API request latency, which is used to fund prompt and automatic key rotation without the need for an operator-facing key management interface.
VAPID keys are delivered to clients via the event stream, as laid out in `docs/api/events.md`. There are a few reasons for this, but the big one is that changing the VAPID key would immediately invalidate push subscriptions: we throw away the private key, so we wouldn't be able to publish to them any longer. Clients must replace their push subscriptions in order to resume delivery, and doing so promptly when notified that the key has changed will minimize the gap.
This design is intended to allow for manual key rotation. The key can be rotated "immedately" by emptying the `vapid_key` and `vapid_signing_key` tables (which destroys the rotated kye); the server will generate a new one before it is needed, and will notify clients that the key has been invalidated.
This change includes client support for tracking the current VAPID key. The client doesn't _use_ this information anywhere, yet, but it has it.
|
| |
|
|
| |
We'll be building separate entities around this in future commits, to better separate the authentication data (non-synchronized and indeed "not public") from the chat data (synchronized and public).
|
| |
|
|
|
|
| |
I've - somewhat arbitrarily - started renaming column aliases, as well, though the corresponding Rust data model, API fields and nouns, and client code still references them as "channel" (or as derived terms).
As with so many schema changes, this entails a complete rebuild of a substantial portion of the schema. sqlite3 still doesn't have very many `alter table` primitives, for renaming columns in particular.
|
| | |
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is meant to limit the amount of messages that event replay needs to examine. Previously, the query required a table scan; table scans on `message` can be quite large, and should really be avoided. The new schema allows replays to be carried out using an index scan.
The premise of this change is that, for each (channel, message), there is a range of event sequence numbers that the (channel, message) may appear in. We'll notate that as `[start, end]` in the general case, but they are:
* for active channels, `[ch.created_sequence, ch.created_sequence]`.
* for deleted channels, `[ch.created_sequence, ch_del.deleted_sequence]`.
* for active messages, `[mg.sent_sequence, mg.sent_sequence]`.
* for deleted messages, `[mg.sent_seqeunce, mg_del.deleted_sequence]`.
(The two "active" ranges may grow in future releases, to support things like channel renames and message editing. That won't change the logic, but those two things will need to update the new `last_sequence` field.)
There are two families of operations that need to retrieve based on these ranges:
* Boot creates a snapshot as of a specific `resume_at` sequence number, and thus must include any record whose `start` falls on or before `resume_at`. We can't exclude records whose `end` is also before it, as their terminal state may be one that is included in boot (eg. active channels).
* Event replay needs to include any events that fall after the same `resume_at`, and thus must include events from any record whose `end` falls strictly after `resume_at`. We can't exclude records whose `start` is also strictly after `resume_at`, as we'd omit them from replay, inappropriately, if we did.
This gives three interesting cases:
1. Record fully after `resume_at`:
event sequence --increasing-->
x-a … x … x+k …
resume_at start end
This record should be omitted by boot, but included for event replay.
2. Record fully before `resume_at`:
event sequence --increasing-->
x … x+k … x+a
start end resume_at
This record should be omitted for event replay, but included for boot.
3. Record overlapping `resume_at`:
event sequence --increasing-->
x … x+a … x+k
start resume_at end
This record needs to be included for both boot and event replay.
However, the bounds of that range were previously stored in two tables (`channel` and `channel_deleted`, or `message` and `message_deleted`, respectively), which sqlite (indeed most SQL implementations) cannot index. This forced a table scan, leading to the program considering every possible (channel, message) during event replay.
This commit adds a `last_sequence` field to channels and messages, which is set to the above values as channels and messages are operated on. This field is indexed, and queries can use it to rapidly identify relevant rows for event replay, cutting down the amount of reading needed to generate events on resume.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
I've exempted inserts (they never scan in the first place), queries on `event_sequence` (at most one row), and the coalesce()s used for event replay (for now; these are obviously a performance risk area and need addressing).
Method:
```
find .sqlx -name 'query-*.json' -exec jq -r '"explain query plan " + .query + ";"' {} + > explain.sql
```
Then go query by query through the resulting file.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Canonicalization does two things:
* It prevents duplicate names that differ only by case or only by normalization/encoding sequence; and
* It makes certain name-based comparisons "case-insensitive" (generalizing via Unicode's case-folding rules).
This change is complicated, as it means that every name now needs to be stored in two forms. Unfortunately, this is _very likely_ a breaking schema change. The migrations in this commit perform a best-effort attempt to canonicalize existing channel or login names, but it's likely any existing channels or logins with non-ASCII characters will not be canonicalize correctly. Since clients look at all channel names and all login names on boot, and since the code in this commit verifies canonicalization when reading from the database, this will effectively make the server un-usuable until any incorrectly-canonicalized values are either manually canonicalized, or removed
It might be possible to do better with [the `icu` sqlite3 extension][icu], but (a) I'm not convinced of that and (b) this commit is already huge; adding database extension support would make it far larger.
[icu]: https://sqlite.org/src/dir/ext/icu
For some references on why it's worth storing usernames this way, see <https://www.b-list.org/weblog/2018/nov/26/case/> and the refernced talk, as well as <https://www.b-list.org/weblog/2018/feb/11/usernames/>. Bennett's treatment of this issue is, to my eye, much more readable than the referenced Unicode technical reports, and I'm inclined to trust his opinion given that he maintains a widely-used, internet-facing user registration library for Django.
|
| |
|
|
|
|
|
| |
This accomplishes two things:
* It removes the need for an additional `channel_name_reservation` table, since `channel.name` now only contains non-null values for active channels, and
* It nicely dovetails with the idea that `null` means an unknown value in SQL-land.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously, when a channel (message) was deleted, `hi` would send events to all _connected_ clients to inform them of the deletion, then delete all memory of the channel (message). Any disconnected client, on reconnecting, would not receive the deletion event, and would de-synch with the service. The creation events were also immediately retconned out of the event stream, as well.
With this change, `hi` keeps a record of deleted channels (messages). When replaying events, these records are used to replay the deletion event. After 7 days, the retained data is deleted, both to keep storage under control and to conform to users' expectations that deleted means gone.
To match users' likely intuitions about what deletion does, deleting a channel (message) _does_ immediately delete some of its associated data. Channels' names are blanked, and messages' bodies are also blanked. When the event stream is replayed, the original channel.created (message.sent) event is "tombstoned", with an additional `deleted_at` field to inform clients. The included client does not use this field, at least yet.
The migration is, once again, screamingingly complicated due to sqlite's limited ALTER TABLE … ALTER COLUMN support.
This change also contains capabilities that would allow the API to return 410 Gone for deleted channels or messages, instead of 404. I did experiment with this, but it's tricky to do pervasively, especially since most app-level interfaces return an `Option<Channel>` or `Option<Message>`. Redesigning these to return either `Ok(Channel)` (`Ok(Message)`) or `Err(Error::NotFound)` or `Err(Error::Deleted)` is more work than I wanted to take on for this change, and the utility of 410 Gone responses is not obvious to me. We have other, more pressing API design warts to address.
|
| | |
|
| |
|
|
|
|
| |
The original version of this migration happened to work correctly, by accident, for databases with exactly one login. I missed this, and so did Kit, because both of our test databases _actually do_ contain exactly one login, and because I didn't run the tests before committing the migration.
The fixed version works correctly for all scenarios I tested (zero, one, and two users, not super thorough). I've added code to patch out the original migration hash in databases that have it; no further corrective work is needed, as if the migration failed, then it got backed out anyways, and if it succeeded, you fell into the "one user" case.
|
| | |
|
| |
|
|
|
|
| |
The migration path from the original project inception to now was complicated and buggy, and stranded _both_ Kit and I with broken databases due to oversights and incomplete migrations. We've agreed to start fresh, once.
If this is mistakenly started with an original-schema-flavour DB, startup will be aborted.
|
| |
|
|
| |
Per-channel event sequences were a cute idea, but it made reasoning about event resumption much, much harder (case in point: recovering the order of events in a partially-ordered collection is quadratic, since it's basically graph sort). The minor overhead of a global sequence number is likely tolerable, and this simplifies both the API and the internals.
|
| |
|
|
|
|
|
|
| |
expires.
When tokens are revoked (logout or expiry), the server now publishes an internal event via the new `logins` event broadcaster. These events are used to guard the `/api/events` stream. When a token revocation event arrives for the token used to subscribe to the stream, the stream is cut short, disconnecting the client.
In service of this, tokens now have IDs, which are non-confidential values that can be used to discuss tokens without their secrets being passed around unnecessarily. These IDs are not (at this time) exposed to clients, but they could be.
|
| | |
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
streams.
The timestamp-based approach had some formal problems. In particular, it assumed that time always went forwards, which isn't necessarily the case:
* Alice calls `/api/channels/Cfoo` to send a message.
* The server assigns time T to the request.
* The server stalls somewhere in send() for a while, before storing and broadcasting the message. If it helps, imagine blocking on `tx.begin().await?` for a while.
* In this interval, Bob calls `/api/events?channel=Cfoo`, receives historical messages up to time U (after T), and disconnects.
* The server resumes Alice's request and finishes it.
* Bob reconnects, setting his Last-Event-Id header to timestamp U.
In this scenario, Bob never sees Alice's message unless he starts over. It wasn't in the original stream, since it wasn't broadcast while Bob was subscribed, and it's not in the new stream, since Bob's resume point is after the timestamp on Alice's message.
The new approach avoids this. Each message is assigned a _sequence number_ when it's stored. Bob can be sure that his stream included every event, since the resume point is identified by sequence number even if the server processes them out of chronological order:
* Alice calls `/api/channels/Cfoo` to send a message.
* The server assigns time T to the request.
* The server stalls somewhere in send() for a while, before storing and broadcasting.
* In this interval, Bob calls `/api/events?channel=Cfoo`, receives historical messages up to sequence Cfoo=N, and disconnects.
* The server resumes Alice's request, assigns her message sequence M (after N), and finishes it.
* Bob resumes his subscription at Cfoo=N.
* Bob receives Alice's message at Cfoo=M.
There's a natural mutual exclusion on sequence numbers, enforced by sqlite, which ensures that no two messages have the same sequence number. Since sqlite promises that transactions are serializable by default (and enforces this with a whole-DB write lock), we can be confident that sequence numbers are monotonic, as well.
This scenario is, to put it mildly, contrived and unlikely - which is what motivated me to fix it. These kinds of bugs are fiendishly hard to identify, let alone reproduce or understand.
I wonder how costly cloning a map is going to turn out to be…
A note on database migrations: sqlite3 really, truly has no `alter table … alter column` statement. The only way to modify an existing column is to add the column to a new table. If `alter column` existed, I would create the new `sequence` column in `message` in a much less roundabout way. Fortunately, these migrations assume that they are being run _offline_, so operations like "replace the whole table" are reasonable.
|
| | |
|
| |
|
|
|
|
|
|
| |
issued.
This lets us shorten the expiry interval - by quite a bit. Tokens in regular use will now live indefinitely, while tokens that go unused for _one week_ will be invalidated and deleted. This will reduce the number of "dead" tokens (still valid, but _de facto_ no longer in use) stored in the table, and limit the exposure period if a token is leaked and then not used immediately.
It's also much less likely to produce surprise logouts three months after installation. You'll either stay logged in, or have to log in again much, much sooner, making it feel a lot more regular and less surprising.
|
| |
|
|
|
|
| |
This came out of a conversation with Kit. Their position, loosely, was that seeing scrollback when you look at a channel is useful, and since message delivery isn't meaningfully tied to membership (or at least doesn't have to be), what the hell is membership even doing? (I may have added that last part.)
My take, on top of that, is that membership increases the amount of concepts we're committed to. We don't need that commitment yet.
|
| | |
|
| | |
|
| |
|
|
|
|
|
|
|
|
| |
This is a beefy change, as it adds a TON of smaller pieces needed to make this all function:
* A database migration.
* A ton of new crates for things like password validation, timekeeping, and HTML generation.
* A first cut at a module structure for routes, templates, repositories.
* A family of ID types, for identifying various kinds of domain thing.
* AppError, which _doesn't_ implement Error but can be sent to clients.
|
| |
|