It was drawn to our attention this afternoon that there is a bug in our GDPR data portability tooling that resulted in the data dump including some events that should not have been included.
This tooling has recently been updated (here is the new code), and the bug only affects reports generated with the updated tool. So far we have generated one report using the updated tooling.
The bug affects events which:
As a reminder, 'shared' message visibility means anyone in the room can view the message, from the point in time at which visibility was set to 'shared' and 'world readable' means anyone can read the messages without joining the room, from the point in time at which visibility was set to 'world readable'.
Events are pulled onto a homeserver over federation when a user on that homeserver tries to access events which, for whatever reason, their homeserver does not already have a local copy. This most often happens when their homeserver is offline for any period of time, but can also happen when a user is the first user from their homeserver to join a room with active participants on other homeservers.
We're still analysing the data but so far it looks like the bug resulted in only a small number of events that were not publicly-accessible being shared (there were also publicly-accessible events mistakenly included). At this stage we have identified 19 events from 4 users across 2 rooms (the dump contained ~3.5 million events). This is not to diminish the severity of the bug - just to reassure that the scale of its impact appears to be extremely limited.
It is also worth noting that any encrypted events erroneously included in the dump will not have been decryptable (since the data subject would not have had access to the keys).
In our original analysis we stated that 19 events were shared erroneously. On closer analysis we missed 5 other timeline events - the correct figure is 24 timeline events originating from 4 users over 2 rooms. However, this figure focused on timeline data and does not take into account all state events (such as user joins, parts, topic changes etc). When considering these too, a further 56 state events were erroneously shared, referencing 64 users across these 2 rooms (mainly detailing when users had joined/left the room after the requesting user themselves had left). These membership events contained avatar & display name details which may not have been public (but in practice, the vast majority appear to be public data).
Aside from the events referenced above, the full dump contained ~20,000 events that also ought not to have been included; however these events were already publicly accessible due to being part of publicly accessible rooms (eg Matrix HQ) and so we do not consider them a breach of data.
Events that are pulled in over federation are assigned a negative 'stream ordering' ID. This is designed to avoid their being sent down the sync (where they would likely be out of sequence). In normal operation (accessing your homeserver via a Matrix client) these events would be appropriately filtered, but a bug in the data dump tooling caused them to be included.
The bug was introduced as a result of two factors:
We are working to fix the bug, and we'll update here when it is resolved. As a reminder, please do report security bugs responsibly as per the Security Disclosure Policy so we can validate the issue and mitigate abuse.
As is standard practice for any data breach, we have notified the ICO.
As a step towards implementing Terms of Service for Sydent Identity Servers (MSC2140), we're rolling out a couple of changes to the two Identity Servers run by New Vector (running at vector.im and matrix.org):
The impact of these changes is that users on homeservers not run by New Vector will no longer be discoverable by their email or telephone number via the Identity Servers running at vector.im and matrix.org. As we roll out the rest of the changes for Terms of Service for Identity Servers, this functionality will again be made available for users who make an informed choice to opt in.
In the short term, the New Vector Identity Servers will continue to support registration with email (signing up with an email address as well as a matrix username) and password reset. However, as we continue to improve Identity Server data hygiene practices, we will phase out their use in registration with email and password reset entirely. We have already made the change to Synapse to support password reset without relying on an Identity Server (though this can optionally be re-enabled).
Once Synapse can support registration with email without relying on an Identity Server we will announce a schedule for disabling registration with email and password reset in our Identity Servers entirely. After this point, homeserver administrators will have to make sure their homeservers are configured to send email to keep registration with email and password reset working. More details on this to follow - please watch this space.