Updated: Apr 17, 2020
Welcome to our first multi-lingual blog post, coming to you in English, Spanish, Italian, and "Digital". Today, we discuss how to perform live bulk uploads, updates, and deletions of over 300,000 records on FileMaker Server/FileMaker Cloud in about 35 minutes or less, safely and without incurring any downtime. Hold on to your hats, this one is definitely for the "square pegs in the round holes".
Whether on premises or in the cloud, most multi-user databases require routine updating, loading, or unloading of large data sets. Data feeds often originate in other systems and arrive as flat files such as a .csv or fixed-length ASCII file. The go-to method for such updates is to take the database offline and perform all mass imports in single-user mode, but what if that isn't an option?
Some applications are "Always On", 24 hours a day. In a scenario where three work shifts access your application around the clock, the maximum available downtime may be less than one (1) hour, sometimes as little as 30 minutes. In some extreme scenarios, downtime may not be an option at all.
What are the challenges presented by typical import processes? There are several factors to consider:
1. An import locks a table
Operations such as importing records or performing a "replace all" on the values in a field place an exclusive lock on the entire table. On a 300,000-record import, that means the table stays locked for a prolonged period, until the import is complete.
2. Concurrent imports will actually run slower
If there are multiple large imports to perform into different tables in a limited timespan, running two or more concurrent imports to save time isn't advisable either, as the performance of each individual import may actually suffer. There may also be unintended consequences caused by unknown dependencies such as joins between the tables (for example, auto-entered/computed fields being populated with erroneous information due to incomplete calculations).
3. An import can corrupt startup restoration logs
Compounding the problem is the issue of transaction logs used by the database server for startup restoration. When performing a large import, Claris recommends disabling transaction logging to avoid potential database corruption. Unfortunately, disabling transaction logging requires a soft reboot of the database engine, which defeats the purpose of avoiding server downtime.
4. A server-side import script takes too long
Although it is a safer practice, in our tests a server-side scheduled import of 300,000+ records into a single table can take upwards of 48 hours to complete, even for a narrow table of fewer than 40 fields. If you have three such tables to populate (approximately 1 million records to upload) and a limited time window in which to do it, that math simply doesn't add up.
5. Time may be of the essence
While similar results can be achieved using certain sleight-of-hand techniques such as "sliding window" table switching (which may be achievable via table graph manipulation and the separation model), for any OLTP system under maintenance window constraints, the bottleneck continues to be how long it takes to upload the data to the server. For situations where the data needs to be online "now", some heavy lifting is required.
The solution – first in 'Digital':
Implement a client-side RPC Bot with chunking and throttling algorithms
Rather than a bulk import, we decided to perform INSERT operations using stored procedures. The outcome was highly successful, as we were able to quickly and safely populate our table with a large data set (nearly 300,000 rows across 34 columns) in just over 33 minutes (even when testing with an iPad!).
Using this approach, we are able to perform the equivalent of a mass "import" live and without interruption (no table locking, no taking file offline, etc.). The process is also high performance, with numerous throughput tests across multiple hosts (both virtual and bare metal) consistently averaging 145-150 rows per second.
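As a quick sanity check on those figures, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: it assumes a steady rate for the whole run and uses 300,000 as a round row count, and the function name is ours, not part of the solution.

```python
def estimated_minutes(rows: int, rows_per_second: float) -> float:
    """Estimate wall-clock minutes to process `rows` at a steady rate."""
    return rows / rows_per_second / 60


# Assumption: 300,000 is a round figure; the test table held slightly fewer rows.
for rate in (145, 150):
    print(f"{rate} rows/s -> {estimated_minutes(300_000, rate):.1f} min")
# 145 rows/s -> 34.5 min
# 150 rows/s -> 33.3 min
```

Both ends of the measured range land in the same 33 to 35 minute window as the run described above.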
For more full-featured table maintenance functionality, we extended our RPC Queue to include selective UPDATE operations, selective DELETE operations, and optionally TRUNCATE-table operations. Based on numerous tests across multiple hosts (both virtual and bare metal), total updates and total deletions of every row in the entire table took only slightly longer to execute than the original insert of all rows.
Here's an overview of the techniques we employ:
Input Stream: We use a Buffered Input Stream to read the entire data file and ingest it into a temporary (global) text column within a working table. Reading our test 170MB text file and inserting its contents into a column takes us less than 30 seconds.
Chunking: We developed a data chunking algorithm to parse the data file into payloads that can be parametrized for consumption by stored procedures.
Looping: To expedite data processing, a For-Loop iterates over the chunks, firing off one stored procedure per chunk in rapid-fire sequence.
Stored Procedures: A stored procedure iterates over each payload, processing one row at a time. The host executes multiple stored procedures in parallel.
Throttling: To guard against crashing the server's script engine or overtaxing the server's resources, we developed a self-regulating, queue throttling mechanism for managing the client-side queue.
Truncating: For situations where the table must be re-populated from scratch, the table can optionally be truncated.
Interruptibility: Throughout execution, our client-side RPC Queue can be paused, resumed, or aborted.
Robotic Process Automation: In addition to on-demand tasks, we also enable time-based scheduling of tasks using "Install OnTimer Script". This converts your device running FileMaker client into a basic Robotic Process Automation station.
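To make the looping, throttling, and interruptibility ideas above more concrete, here is a minimal sketch in Python rather than FileMaker script. It is not our production code. The class name, the `call` parameter (a stand-in for the remote procedure call), and the fixed `max_in_flight` cap are all assumptions made for illustration. A fixed cap is also simpler than the self-regulating mechanism described above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThrottledQueue:
    """Client-side queue that caps how many remote calls are in flight at once."""

    def __init__(self, call, max_in_flight=4):
        self._call = call                       # hypothetical stand-in for the remote call
        self._max = max_in_flight
        self._slots = threading.Semaphore(max_in_flight)
        self._running = threading.Event()       # cleared while the queue is paused
        self._running.set()
        self._aborted = threading.Event()

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def abort(self):
        self._aborted.set()
        self._running.set()                     # wake a paused loop so it can exit

    def _run_one(self, chunk):
        try:
            return self._call(chunk)
        finally:
            self._slots.release()               # free a slot for the next chunk

    def run(self, chunks):
        futures = []
        with ThreadPoolExecutor(max_workers=self._max) as pool:
            for chunk in chunks:
                self._running.wait()            # blocks while paused
                if self._aborted.is_set():
                    break
                self._slots.acquire()           # blocks while the cap is reached
                futures.append(pool.submit(self._run_one, chunk))
        return [f.result() for f in futures]


# Usage: `len` stands in for a remote call that returns a row count.
print(ThrottledQueue(len, max_in_flight=2).run(["ab", "cde", "f"]))  # [2, 3, 1]
```

The semaphore is what throttles the loop: the client cannot submit a new chunk until one of the outstanding calls has finished.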
Now for the details:
If the source data file has been delivered as a BLOB file inserted into a binary column, the file is first exported to the Documents directory on the local file system, from which it can then be read back in as plain text using File Input Stream commands.
The practice of data chunking is well understood and commonly employed for data synchronization and deduplication use cases. For our algorithm, chunks are capped at 560,000 characters, which is just over half of FileMaker's maximum parameter length for stored procedures (current limit is 1 million characters).
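For illustration, a chunker that respects that cap might look like the sketch below. It is written in Python, not FileMaker script, and it is not our actual algorithm. It assumes newline-delimited rows that are kept whole within a chunk, and the function name `chunk_rows` is hypothetical.

```python
MAX_CHUNK_CHARS = 560_000  # cap quoted in the text


def chunk_rows(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Group whole lines into chunks whose length never exceeds max_chars."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines():
        if len(line) > max_chars:
            raise ValueError("a single row exceeds the chunk cap")
        # +1 accounts for the newline joining this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n".join(current))   # flush the full chunk
            current, size = [], 0
            added = len(line)
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))       # flush the final partial chunk
    return chunks


print(chunk_rows("aaa\nbbb\nccc", max_chars=7))  # ['aaa\nbbb', 'ccc']
```

Keeping rows whole means no chunk ever ends mid-record, so each payload can be processed independently of the others.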
Likewise, most database platforms support stored procedures. We leverage the FileMaker Server Script Engine (FMSE), which implements this functionality across the entire server product line and can be invoked via Perform Script on Server.
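The row-at-a-time processing done by each stored procedure can be sketched roughly as follows. This is a Python analogy, not FileMaker script. It assumes a comma-separated payload, and `process_payload` and `insert_row` are hypothetical names standing in for the server-side script and its record-creation step.

```python
import csv
import io


def process_payload(payload: str, insert_row) -> int:
    """Parse a CSV payload and hand each row to insert_row; return the row count."""
    count = 0
    for fields in csv.reader(io.StringIO(payload)):
        if not fields:          # skip blank lines
            continue
        insert_row(fields)      # one row at a time, as described above
        count += 1
    return count


table = []                      # stand-in for the target table
print(process_payload("1,Ada\n2,Grace\n", table.append))  # 2
```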
The missing piece of the puzzle on the FileMaker platform (at least as of this writing) has been RPC queuing. Whether client-side or server-side, there is currently no native support for RPC queuing/stored procedure queuing on the FileMaker platform. Scheduled Tasks/Cron Jobs are available, but that is not the same as a queue.
Because it is easier to configure, monitor, and manage (and interrupt if necessary), our solution implements a self-monitoring, interruptible, client-side queue. This means a dedicated work