By David Bachowski, Director – Technology, Mission Capital Advisors
Big Data, or maybe just big files?
Our tech team does random technical support for deals on our Asset Sales team. This could mean anything from teaching them curl for scripting thousands of downloads, writing python scripts to parse proprietary file formats, or creating thousands of PDFs from rows of excel data.
A couple months ago we were involved with a deal that included loan documents with sensitive data. The client shipped us a secure, password protected hard drive with password protected zip files on it. We opened up the zip files and ended up with a couple million files. Normally we upload these to our Virtual Data Room (a cloud fileshare, named VDR from now on) where we can delegate access to potential buyers of these loans.
None of this would be much of an issue, except that we found out several hundred thousand of these files had Tiff extensions, which our VDR did not support for watermarking. Our VDR supports ONLY pdf files, which left us in quite a bind.
The tiff files needed to be converted to PDF within a few days. How do we convert 220K files in time?
The main difficulty here is time. Following the 1st rule of optimization (don’t), our first attempt looked something like this.
We found a great library called ImageMagick that converts pretty much any image type to another image type. We created a simple powershell script to recursively loop through each file and do the conversion.
We let this run for about an hour and checked the progress. Conversions were running at about 3-4 seconds per file. Doing the math…
This wasn’t going to be fast enough! Plus, what if the script failed in the middle of processing? There was no way to keep track of which tiff file had been processed already without deleting the tiff file which is dangerous if the conversion needs to be reprocessed.
The next logical step was to parallelize the processing. Maybe we could use a bunch of workstations to connect to the fileshare, download the images, convert, and re-upload. We split the files into 6 ‘pools’, and hooked each workstation up to 1 pool and let it run overnight.
How did it do? Conversions didn’t speed up as much as we had hoped. Presumably, the download and upload overhead added time to the processing. In addition, we still had the risk of a script failing, and the scripts were locking up user’s computers and making it hard for them to do their normal work. Processing an image now took around 2-3 sec per file, which was disappointing.
Note: Throughout this process we also considered adding additional cores and RAM to the server, but unfortunately the virtual server was on a host that had maxed out its cores and RAM already. We were dealing with a machine that had only 4 cores and 8GB of RAM.
Move it to the cloud
We were left with very few options at this point, and our next inclination was to process these in Amazon Web Services (AWS), where we had potentially limitless compute and networking resources. We created some terraform scripts that could create a secure VPC, add some EC2 worker instances, and execute the conversion with the EC2 worker cluster.
We had already been using terraform for other projects, so we had a template that we could copy and modify for expediency (We were still under pressure to resolve all files within a few days). Having this template around also ensured that we maintained strict, standardized security protocols around encryption, networking, and data access.
Getting the files into s3:
- We created an AWS S3 bucket to house the raw tiff files.
- We put a trigger on the S3 bucket to put a ‘processing message’ in an SQS queue for each new file.
- We uploaded all of the documents to s3 using powershell.
Powershell Upload Script:
Processing the files:
We then used our terraform script to spin up a cluster of around 18 servers that
- Pulled a message off of the processing queue
- Downloaded the tiff file based on the message
- Converted the image using ImageMagick
- Uploaded the file to a new s3 location
Note: We would have used more than 18 servers but we hit some AWS-imposed limits on the number of concurrent servers and it took too long to increase the limit.
Downloading the files
The new s3 location for processed files also had a trigger to send a message to the ‘Download Queue’. A new powershell script pulled messages off of that queue, downloaded the appropriate file, and saved it into the proper location.
The final architecture looked like this:
The whole conversion process, including writing the scripts, took about 5 days. There was then another day or two of uploading to the VDR, for a total of 10 days.
3 days of unzipping + 5 days of conversion + 2 days of uploading = 10 days
Most of the conversion time was trial and error problems. For example, we used a different, pure python conversion library that failed to convert some files, so we had to re-process all of them. The actual conversion of 220K tiff files took around 10 hours. This is a drastic improvement from the original 9 days of a single script running on the windows server. Also, since we were using a queue we could guarantee conversion of every file and, in the case of corrupted files, produce a list of failed conversions for additional investigation.
Oh no, we got another batch!
Earlier this month, a new batch of files came in, but this time there was twice as many files! We reconsidered our architecture and realized that the major pain points were not actually in the cloud processing, but in the unzipping of the files and uploading to the VDR. We took our cluster script and made it generic enough to handle any type of processing against a queue. We also removed the windows server from the equation since it was underpowered and became both a CPU and networking bottleneck.
Our next version of the project takes less than 3 days to process files, including unzipping and uploading to the VDR.
- Some huge benefits of the cloud are having access to infinite compute power, but the biggest difficulties we had involved the interface with other systems – unzipping 1-2TB of data and uploading 1-2TB of data to our VDR.
- We’ve spent time writing code to instantly create entire secured networks ready for processing and deployment. Having your infrastructure ‘in code’ with a tool like Terraform allows you to repurpose it for unforeseen projects without limiting security or letting the project ‘bleed into’ the rest of your infrastructure.
- Another problem we ran into very quickly was the default limits in AWS on how many servers you can have at any given point. We wanted to run 50 worker instances at the same time but ran into a hard limit of 15-20. You can increase that limit for free, but you have to open a support ticket and wait for them to approve it. We would recommend increasing your limits before you actually need them to something reasonable but still high enough for processing large volumes.
- In subsequent runs we’ve moved the initial pass of conversions to Docker/Fargate. Some percentage of these conversions fail due to memory constraints, but it’s much cheaper to process. We can get 80-90% of the files converted through smaller container images on Fargate and then throw the heavier EC2 instances against the rest.
There’s tons of different ways to solve a problem like this – we’ve even used it as a problem-solving interview question since it doesn’t require any actual coding and can pretty easily be drawn on a whiteboard. You can find out pretty easily how comfortable someone is with estimating scale and constraints and their knowledge of certain paradigms like queues and worker processing.