Routing Protocol

danger

This is a work-in-progress and is under active development. I am deeply sorry to anyone who has to read this in its current state. Please check back soon™️ for a more complete and coherent version.

Each user has a unique ID (derived from their public key ???).

Sending a message to someone's UID first tries internet, then P2P BLE, then Amazon Sidewalk (BLE), then Sidewalk (LoRa), then P2P LoRa. This order can change, especially if Sidewalk can be detected as an option easily/with minimal power impact, in which case it may be beneficial to be before its P2P counterpart.

If internet connection is established (MQTT) we send the message to the user's UID topic (QoS 1).

If no internet, we send a broadcast of the message and the recipient's UID over the next medium in the list.

Wait x amount of time (where x >= the round trip time for the subsequent medium(s) (including the round trip time for the MQTT QoS 1 packet over the internet medium)) for a response before iterating mediums again.

If another device (we'll call it our "relay") is connected to the internet and receives the broadcast message, it (the relay) uploads it via MQTT (QoS 1) to the recipient's UID topic. Once this message is uploaded and confirmed, the relay broadcasts a message (over the same medium it received the message to be uploaded on) that the sender ID's message ID was uploaded. This is a QoS 0 message, as it is the ACK for the QoS 1 message from the sender. The relay should also store (for __ amount of time) the message ID and sender ID so that it can re-ACK with QoS 0 if the sender doesn't receive the ACK and implements a retry as per QoS 1. If the relay during this process overhears a broadcast of this same message ID and sender ID, it should drop the message and not do anything. To prevent a malicious relay from sending a fake ACK and preventing transmission of the message, when messages are relayed/delivered to the MQTT broker, the broker should sign/complete a nonce challenge that only the broker could complete so that the relay can't fake the ACK. This idea of ignoring messages that have an ACK should only be done if the relay can confirm the ACK it hears was actually sent by the broker.

If the target device is within the medium's range, it should receive the message (and perform all normal decryption and CRC/FEC checks and assert they pass) and send a recipient signed ACK (QoS 0) back to the sender. This is effectively a full delivery receipt.

If a non-internet connected relay receives the broadcast message, it should increment the hop count (for this medium) and rebroadcast the message, attaching (or more probably replacing the previous hop ID, if applicable, with its own storing the previous hop ID for some time) its ID on as a chain. It should only do this, however, if the hop/medium count meets the criteria below, AND if it hasn't already rebroadcast the message (based on the message ID). This means message IDs should be stored for some amount of time to prevent loops. Additionally, the RSSI of the incoming message should be considered, and if it is not below some threshold (potentially a moving threshold based on the device type, antenna gain, battery size, remaining charge, etc.) than it is not considered a "significant" enough improvement in RSSI to relay the message, so it ignores it. While a malicious relay could rebroadcast the message with a higher hop count than it should have, we have a max hop limit for each medium at a protocol definition/implementation level that valid devices won't exceed, which mitigates impact from the attack. Additionally, it is assumed that since the attack would have limited impact, it is not worth the effort to prevent it at this time, nor is it expected to happen frequently. This could be revisited if it becomes a problem. If the hop count is greater than some threshold for that specific medium, it should look to the next medium. This continues either until the message is delivered or the hop count is too high for all mediums. Once the hop count is too high for a medium, it should descend the medium list again with fresh hop counts for each medium. It gives up once it has tried all mediums with a hop count greater than the threshold and the message has ascended and descended the medium list without delivery.x

If a device receives a message that it has already received (based on the message ID), it should drop the message and not do anything.

If an internet connected device (or the intended recipient) eventually receives the message, it should upload it to the recipient's UID topic (QoS 1) and then broadcast a message that the sender ID's message ID was uploaded. If it was done through another prior relay, this relay's ID is also a secondary target on this QoS 0 ACK. Then, the message can back propagate through the network to the sender in the same way it was delivered, but in reverse, with each device sending a QoS 0 ACK to the previous device in the chain. This is effectively a full delivery receipt that also mirrors the same medium path. It is assumed here that the path the message got to its recipient and/or the internet in was "ideal" purely because it had the lowest latency, so we would like to use this path again if possible. Alternatively, the route chain may have broken, so the sender ID is still the primary target, and a new route chain could form to deliver the ACK (without adding the relay ID to the ACK, as it won't need to back propagate again). This is a new route formation in this case, but should mean that one mechanism or the other should get the ACK delivered and the route to do it could be the back propagation or a new route formation, whichever is more efficient (in terms of latency) (although both may end up delivering it). ACKs should also have (some algorithmically determined) less number of max hops than the original message, as we don't want them to have as much of an impact on the network as the original message, and if the sender doesn't receive the ACK and retries the message, some device on the incomplete ACK path will receive the message again and send the ACK back without the message ID (which was already seen) from needing to be sent all the way again.

Additionally, any internet connected device that hears any sender or relay IDs (basically, any IDs it hears that it is able to deduce are not internet connected themselves) should subscribe to those non-internet connected devices' UID topics for a period of time, perhaps similar to the broadcast/advertising period for that medium, resubscribing (in implementation this may be setting a custom subscription timeout or unsubscribing after a certain amount of time) if it hears the sender or relay ID again. This is so that it can act as a route for any messages intended for those devices. This is effectively a route discovery mechanism. The number of IDs it can subscribe to at max should be a customizable number that could be varied based on factors like battery size, remaining charge, internet connection quality/bandwidth, etc. If the number of IDs it can subscribe to is reached, it should drop the least recently relayed ID and subscribe to the new one. Note that this is not the least recently heard ID advertising/beaconing, but the least recently used. The assumption here is that the ID's recency in advertising its need for a route is less important than if the ID actually will make use of said route (i.e. if it has messages to send or receive) with the premise being if a message was sent or received recently, it is likely another will be soon too, whereas a longer period of silence suggests more silence to come. The one exception to this is if the ID is advertising its need for a route for the first time, in which case it should be subscribed to immediately. This is because the device is likely just coming online and will likely have messages to send or receive soon. The broker should also keep track of the last time it had a subscription to a device's UID topic, and internet connected devices can ignore devices with > some number of subscriptions. This does mean that, again, device IDs should be stored for some amount of time to prevent loops of subscribing and unsubscribing to the same device repeatedly thinking it is a new device.

Non-internet connected devices should also broadcast their own ID periodically (if they haven't already for some other reason) so that internet connected devices can subscribe to their UID topics and act as a route for messages intended for them. The time between broadcasts should be based on the medium's range and become less frequent as the range increases, as it is assumed devices will be able to hear the broadcasts/stay in this range for longer periods of time. Additionally, the periodicity of the broadcasts should be exponentially backed off when no internet connected devices are heard acknowledging the broadcasts. This is so that the device doesn't waste power broadcasting if there are no internet connected devices around to hear it. In this case, all messages would have to be delivered through P2P anyway, which wouldn't require periodic message checking. The exponential back off should be based on the number of broadcasts that have gone unacknowledged, but can be reset if an internet connected device acknowledges the broadcast or if the user forces a refresh of the device.

Internet connected devices should listen for these periodic broadcasts and subscribe to the UID topics of the devices that broadcasted them. If there are non-internet connected advertisers within the medium's range of this internet-connected device that are advertising their ID, the internet connected device should acknowledge the broadcast with their own ID so that the non-internet connected device doesn't need to switch to the next medium to try to deliver the request for route message.

Non-internet connected devices should also hold onto non-internet connected device IDs they hear for some amount of time so that if they become internet connected, they can subscribe to those devices' UID topics and act as a route for messages intended for them. Additionally, if they become aware of a device that is internet connected, they should communicate the IDs they have heard to that device so that it can act as a route for messages intended for those devices. This could be a customizable number of IDs to store that were heard within the last __ amount of time (at least the minimum broadcast yourself frequency) or even a custom number of hops away from the device (probably not a good idea).

Internet Routing

The original protocol above specifies the use of MQTT to connect to centralized servers, but some modifications to our network may require a more decentralized approach. In this case, we would use a DHT (Distributed Hash Table) to store the routing information for each device. This would allow for a more decentralized approach to routing, as each device would be responsible for storing the routing information for a subset of devices. This would also allow for more efficient routing, as devices could query the DHT for the routing information for a specific device, rather than having to broadcast the message to all devices on the network.

Alternatively, if we stick with the centralized approach, we could connect to Matrix servers for cross-compatibility with other chat services. We also could then add security to our protocol that MQTT may not have, like hash validation before publishing messages to the broker, bloom filters for message IDs, instead of full lists, etc. If doing MQTT, let's use regional brokers - i.e. put the region in the topic, and have a broker for each region. This would help with scaling and reduce latency.

Ideas:

May be worth adding a timestamp to messages to prevent replay attacks/old or out of date hopped messages from causing congestion.
Stabilization period between online and offline states.
Potential changes to all configuration parameters for "priority" emergency messages.
Jamming signal on collisions
Need CSMA/CA or CSMA/CD for LoRa and BLE channels (with some randomization after detecting dead air to transmit on)
Randomized back off for LoRa and BLE channels
Time slots for LoRa and BLE channels (requires synchronization...?)
Change from purely exponential back off to a more adaptive approach (see this)
Clarify that only sender and destination are allowed retries, not relays
Figure out if storing IDs can be done in a more efficient way (i.e. bloom filter) (can a bloom filter roll out old IDs?)
Tiered MQTT brokers (local, regional, global) to help with scaling and reduce latency
Look for potential areas in protocol that could lead to closed loops
Clarify that delivery order is not guaranteed – we attempt to guarantee delivery, but not that the messages will be delivered in the order they were sent or that it will be done in a timely manner
Tie ACKs to message IDs in some way to prevent flooding of ACKs in reply to repeated messages, as then devices with ACK caches ready to repeat the ACK instead of relaying the message again can hear that someone else sent the cached ACK and not bother doing it themselves. Potentially also consider ACK expiration times. Also randomize ACK times to prevent flooding of ACKs.
SaaS for Matrix client server to allow messaging through bridges like Beeper over our protocol
Eliminating packet sizes and properly handling fragmentation (Maybe just use TCP?)
Allowing internet connectivity generically (not just routing)
- Could enable auto firmware updates
- Mobile apps could act as a VPN for the device to get more apps working when offline

Internet Routing​

Internet Routing