When working on Tact, I sometimes run into unexplored and underdocumented corners of CloudKit that push the system a bit. I then need to run experiments to understand the system’s design and behavior, and design my client to work within those constraints.

This is one of those posts, exploring how querying based on CKReference works, and what its limits and behaviors are.

Let’s start from the basic schema, and work our way up from that.

The Tact schema

Tact has the following schema for its chats, messages, and reactions.

[Figure: Tact schema]

It’s as basic as you’d expect: a very simple tree structure. Chat is the top-level object that provides all the context and permissions. A chat can have zero or many messages, and each message belongs in exactly one chat. A message can have zero or many reactions, and each reaction belongs to exactly one message.

In CloudKit terms, these relationships are expressed with the parent property of CKRecord. Message records have their parent set to the chat record, and reaction records have their parent set to the message record. Parent relationships establish the permissions, among other things: in Tact, everyone who can see a chat can also see all the messages and reactions in that chat.

Here is one important bit: you can’t do queries by parent property. I couldn’t immediately find the authoritative source for this in the SDK documentation, but the recommendation in Apple forums and elsewhere has always been to add your own CKReference fields if you want to query by parent record. So the pattern I use in Tact is that all child records have a relatedRecords field which is a set of CKReferences. A message’s relatedRecords contains a reference to the chat, and a reaction’s relatedRecords contains references to the message and the chat. So I can do queries like “give me all messages for this chat”, “give me all reactions for this message”, and so on.

Designing “Load more” queries

I recently updated the “Load more” functionality in Tact, and encountered some unexpected behaviors that this post documents.

Tact can host chats that literally continue for years and have lots of messages, reactions, pictures, videos, and other content. If you install Tact on a new device, it only loads recent content for each chat by default, but you can “Load more” yourself if you want to go back in time. The user experience of “Load more” may vary, and will improve in future Tact versions, but from the system point of view, it’s always the same: you want to load some more messages and reactions in a given chat.

Here’s how I set up the query to load messages.

let chatRecordId = task.model.chat.cloudKitRecordID
let referenceToMatch = CKRecord.Reference(recordID: chatRecordId, action: .none)
let date = task.model.earliestMessage?.date ?? Date()
let predicate = NSPredicate(format: "creationDate < %@ AND relatedRecords CONTAINS %@", date as NSDate, referenceToMatch)
let query = CKQuery(tactRecordType: .message, predicate: predicate)
query.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
let result = await databaseAPI.queryRecords(
  with: query,
  in: task.model.chat.cloudKitRecordZoneID,
  resultsLimit: 100,
  qualityOfService: .userInitiated
)

In plain language, this is: query for messages in a given chat whose creation date is earlier than the date of the earliest known message in the chat (or the current date if there are no known messages), order them by creation date in descending order, and retrieve 100 of those messages. (Why 100? I’ll get back to that.)

(Side note: all this assumes that you have relevant indexes set up on the CloudKit side, which I will not cover in this post.)

Now, how to query reactions? You might think of doing the same and querying them based on the creation date and chat. The schema affords that. But consider that reactions might get created at a different time than the messages they refer to. When you just query for reactions based on time, you may get reactions which refer to messages that you don’t have locally, and you may miss reactions for messages that you do have, if the reaction creation time falls outside of the time window.
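To make the mismatch concrete, here is a small worked example in plain Swift, with no CloudKit involved. The Message and Reaction structs, the day numbers, and the window are all made up for illustration; they are not Tact’s actual model.

```swift
// Hypothetical, simplified stand-ins for message and reaction records.
// `day` is just a day number standing in for the creation date.
struct Message { let id: String; let day: Int }
struct Reaction { let id: String; let messageID: String; let day: Int }

let messages = [
    Message(id: "m1", day: 10),
    Message(id: "m2", day: 20),
    Message(id: "m3", day: 30),
]
let reactions = [
    Reaction(id: "r1", messageID: "m1", day: 25), // late reaction to an older message
    Reaction(id: "r2", messageID: "m3", day: 31),
    Reaction(id: "r3", messageID: "m2", day: 40), // reaction created after the window
]

// Suppose one "Load more" page covers days 15 through 35.
let window = 15...35
let loadedMessages = messages.filter { window.contains($0.day) } // m2, m3
let loadedIDs = Set(loadedMessages.map(\.id))

// Querying reactions by the same time window:
let byTime = reactions.filter { window.contains($0.day) } // r1, r2

// Reactions returned for messages we did not load.
let orphaned = byTime.filter { !loadedIDs.contains($0.messageID) } // r1

// Querying reactions by the loaded messages instead:
let byMessage = reactions.filter { loadedIDs.contains($0.messageID) } // r2, r3

// Reactions that belong to loaded messages but that the time query missed.
let missedByTime = byMessage.filter { reaction in
    !byTime.contains(where: { $0.id == reaction.id })
} // r3
```

The time-based query returns r1, whose message is not loaded, and misses r3, whose message is.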

So, for consistency, it makes sense to me that I first get the messages, and then get the reactions for exactly those messages. Here is how to set up such a query, if I have previously obtained a bunch of message records:

let messageIdReferences = messageRecords.map { CKRecord.Reference(recordID: $0.recordID, action: .none) }
let predicate = NSPredicate(format: "ANY %@ in relatedRecords", messageIdReferences)
let query = CKQuery(tactRecordType: .reaction, predicate: predicate)
let result = await databaseAPI.queryRecords(
  with: query,
  in: recordZoneID,
  qualityOfService: .userInitiated
)

The key bit is the predicate for the reference set. I got the idea for setting it up this way from this StackOverflow post. The official CloudKit documentation does have examples for querying based on references, but not for reference sets and more complex scenarios.

Now, if you just attempt to run all this, what do you think will happen? Will it all work out of the box?

Here’s where the interesting part of this post begins. 😀

How reference-based queries behave with various inputs

Here’s what will happen if you try to run the above queries with various inputs.

First, let’s say that you don’t want to limit the number of message ID inputs to your reactions query. You just say, let’s YOLO, and retrieve all messages, and all reactions for all of those messages right away. The message side of this works fine: if you have many messages, CloudKit will return a page of messages, and a cursor to retrieve the remaining ones, and you rinse and repeat this until there is no more cursor. And it’s all reasonably fast. You can retrieve thousands of messages distributed across several pages in just seconds.
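The “rinse and repeat” loop can be sketched generically. This is a minimal sketch, not Tact’s or Canopy’s actual code: the Page type, fetchAll, and the fetchPage closure are hypothetical names, and the closure stands in for whatever call performs the query or continues it from a cursor.

```swift
// One page of results, plus a cursor if the server has more to return.
// Hypothetical type for illustration; not a CloudKit or Canopy type.
struct Page<Item, Cursor> {
    let items: [Item]
    let nextCursor: Cursor?
}

// Keeps requesting pages until no cursor comes back, collecting all items.
// `fetchPage` receives nil for the first request and the previous cursor afterwards.
func fetchAll<Item, Cursor>(
    fetchPage: (Cursor?) async throws -> Page<Item, Cursor>
) async throws -> [Item] {
    var all: [Item] = []
    var cursor: Cursor? = nil
    repeat {
        let page = try await fetchPage(cursor)
        all.append(contentsOf: page.items)
        cursor = page.nextCursor
    } while cursor != nil
    return all
}
```

Because the page fetch is passed in as a closure, the loop itself can be exercised in tests with canned pages and no network.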

Now, if I attempt to grab all those thousands of messages and run the above reactions query with them, here’s what I get:

Error querying chat reactions: CanopyTypes.CKRecordError(code: 27, localizedDescription: "Query filter exceeds the limit of values: 250 for container \'iCloud.com.justtact.Tact", retryAfterSeconds: 0.0, errorDump: "<CKError 0x600002ba6b50: \"Limit Exceeded\" (27/2023); server message = \"Query filter exceeds the limit of values: 250 for container \'iCloud.com.justtact.Tact\"; op = DAD629CFCE8A1C85; uuid = 60D6BC5A-8ED7-489D-9C75-08A8433E3C1C; container ID = \"iCloud.com.justtact.Tact\">", batchErrors: [:])

I have wrapped the CloudKit error in some types, but you see the CloudKit error right there. CKError code 27 is indeed CKErrorLimitExceeded.

Fair enough. This tells me that the filtering condition in the clause ANY %@ in relatedRecords can’t contain more than 250 objects. So, let’s limit it, leave some headroom, and run it with 200 records instead.

If I run the reactions query with 200 message records in the filtering condition, here’s what I get:

Error querying chat reactions: CanopyTypes.CKRecordError(code: 15, localizedDescription: "Request failed with http status code 500", retryAfterSeconds: 0.0, errorDump: "<CKError 0x6000032f6bb0: \"Server Rejected Request\" (15/2001); \"Request failed with http status code 500\"; uuid = 1A749162-6811-45E4-87E7-2A61F12A50D2>", batchErrors: [:])

Even though the input contains only 200 objects, I get this error. I speculate that even though the input amount is below the nominal stated limit, some relationship query to construct the results causes some internal limit overflow on the CloudKit side, and it doesn’t handle this situation well.

Over several days, I got different CKErrors in the same situation when running this same query. Sometimes I got code 6, serviceUnavailable, and interestingly, it also had rate limiting set (retryAfterSeconds was >0). When I tried to YOLO and ran another query without honoring the rate limit, I got CKError code 7, requestRateLimited, as expected. Sometimes I also got CKError code 12, invalidArguments.

The way to recover from all of these is to split the input set into two, re-run the query with the two smaller batches (respecting the retryAfterSeconds parameter), and merge the results. I built some code to do that, and it appears to work fine. With Canopy, it’s easy to build fast, deterministic tests asserting that it behaves correctly in all cases. I haven’t yet moved this retry code itself (re-running the CKReference queries with smaller batches when the initial batch is too big and returns an error) into Canopy.
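Here is a rough sketch of that split-and-retry idea, not the code that ships in Tact. QueryError is a hypothetical stand-in for the wrapped CloudKit error shown earlier, queryInBatches and runQuery are made-up names, and treating exactly the error codes seen in this post as “split and retry” is my assumption.

```swift
// Hypothetical stand-in for the wrapped CloudKit error shown above.
struct QueryError: Error {
    let code: Int
    let retryAfterSeconds: Double

    // Codes observed in this post: serviceUnavailable (6), requestRateLimited (7),
    // invalidArguments (12), serverRejectedRequest (15), limitExceeded (27).
    var mayRecoverBySplitting: Bool {
        [6, 7, 12, 15, 27].contains(code)
    }
}

// Runs `runQuery` on the inputs. If it fails with one of the errors above,
// waits for the requested delay, splits the inputs in half, retries each half
// recursively, and merges the results in input order.
func queryInBatches<Input, Output>(
    _ inputs: [Input],
    runQuery: ([Input]) async throws -> [Output]
) async throws -> [Output] {
    guard !inputs.isEmpty else { return [] }
    do {
        return try await runQuery(inputs)
    } catch let error as QueryError where error.mayRecoverBySplitting && inputs.count > 1 {
        // Respect the server's requested delay before trying again.
        if error.retryAfterSeconds > 0 {
            try await Task.sleep(nanoseconds: UInt64(error.retryAfterSeconds * 1_000_000_000))
        }
        let middle = inputs.count / 2
        let firstHalf = try await queryInBatches(Array(inputs[..<middle]), runQuery: runQuery)
        let secondHalf = try await queryInBatches(Array(inputs[middle...]), runQuery: runQuery)
        return firstHalf + secondHalf
    }
    // Any other error, or a failure on a single-element batch, propagates to the caller.
}
```

Passing the query in as a closure keeps the retry logic independent of CloudKit, so a test can feed it a fake query that fails above a chosen batch size.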

Back to why I load 100 messages at a time: those messages become the input to the reactions filter. When I was testing one day with batches of 200, I always got serviceUnavailable with the rate limit set to something like 20 seconds. Re-running a query immediately is fine, but waiting out a long rate limit is a bad user experience, because the user has to wait a long time for results. When I ran it with 100 records, everything worked as expected and I never got rate limited, which is why I am keeping it at 100 for now. It’s possible that CloudKit varies its server-side behavior and the errors it returns based on its usage load and possibly other parameters that I don’t know and can’t control.
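If you already hold more message records than you want in one filter, capping the input comes down to chunking. A minimal sketch follows; the chunked(into:) helper is a hypothetical name I made up, not something from CloudKit or Canopy, and 100 is just the value that worked for me.

```swift
extension Array {
    // Splits the array into consecutive chunks of at most `size` elements.
    // Hypothetical helper for illustration.
    func chunked(into size: Int) -> [[Element]] {
        precondition(size > 0, "chunk size must be positive")
        return stride(from: 0, to: count, by: size).map { start in
            // Swift.min is spelled out so the global function is used,
            // not the array's own min() method.
            Array(self[start..<Swift.min(start + size, count)])
        }
    }
}

// Example: 250 hypothetical message IDs split into filter-sized batches.
let messageIDs = (1...250).map { "message-\($0)" }
let filterBatches = messageIDs.chunked(into: 100)
// filterBatches has three chunks with 100, 100, and 50 IDs.
```

Each chunk would then feed one reactions query, with the results merged afterwards.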

Conclusion

You can do advanced CloudKit queries based on CKReference set fields that tie multiple levels of records together, but be aware of the system’s limitations and quirks. Often, the error is recoverable by splitting your input into smaller batches and re-running the query for each batch. In some cases, you may get rate limited, and choosing a smaller default input size may help you avoid the rate limiting and give the best user experience.