Finding Files in S3 (without a known prefix)
S3 is a fantastic storage service. We use it all over the place, but sometimes it can be hard to find what you’re looking for in buckets with massive data sets. Consider the following questions:
What happens when you know the file name, but perhaps not the full prefix (path) of the file?
I hit this in production today, which is the motivation for this blog post. The second question is one I thought would make a useful example once I started extrapolating what I'd learned to other use cases.
How do you find files modified on specific dates, regardless of prefix?
Typically, this is when things start to get difficult. As a starting point, I use Transmit - it's a fantastic tool. However, it starts to fall down when you need to deal with "folders" that contain many tens of thousands of documents, or when you need to look for things that could be in multiple folders.
Both of the above questions can be answered relatively easily by using the `--query` parameter of the aws cli. You can pass *any* JMESPath query to `--query`.
Find file by partial name
For this example, we will search for a file name containing `1018441`. I have slightly redacted some parts of the path, as this is a query against production data.
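A command along these lines does the job - a sketch only, since the bucket name below is a placeholder and the original path is redacted:

```shell
# List every object whose key contains the string 1018441,
# regardless of prefix. The bucket name is a placeholder.
aws s3api list-objects-v2 \
  --bucket my-production-bucket \
  --query "Contents[?contains(Key, '1018441')].{Key: Key, LastModified: LastModified}"
```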
In this case, we can quickly see that where we were expecting only two files with this name (a cache file and a processed file), we actually have two different processed files! Here's our problem.
Find files modified on a given date
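One way to express this (again a sketch, with a placeholder bucket and date) is to match on the date portion of the `LastModified` timestamp, which S3 returns as an ISO 8601 string:

```shell
# Objects modified on a particular day, anywhere in the bucket.
# LastModified is an ISO 8601 string, so matching its date
# prefix selects the whole day.
aws s3api list-objects-v2 \
  --bucket my-production-bucket \
  --query "Contents[?starts_with(LastModified, '2018-02-27')].Key"
```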
Find files modified between given times
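Because `LastModified` is an ISO 8601 string, lexicographic comparison matches chronological order, so a time-range query can be sketched as follows (placeholder bucket and timestamps):

```shell
# Objects modified between two timestamps (inclusive lower bound,
# exclusive upper bound). This works because ISO 8601 strings
# sort chronologically.
aws s3api list-objects-v2 \
  --bucket my-production-bucket \
  --query "Contents[?LastModified>='2018-02-27T09:00:00' && LastModified<'2018-02-27T17:00:00'].Key"
```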
Technically, the ability to perform a comparison on a date is not part of the official spec for JMESPath, but I found a GitHub issue noting that many users rely on being able to do just this.
That issue was closed by this commit which mentions that it will be added explicitly to the spec:
The spec doesn’t officially support string types yet, but enough people are relying on this behavior that it’s been added back. This should eventually become part of the official spec.
Caveats
When you use the `--query` option, it's all done as post-processing on your local system. Nothing is done remotely by S3; it's essentially the same as piping the output to `jq`, `jp`, or another filtering program. This means that if you have many hundreds of thousands or millions of objects, or a slow connection, it will not be a fast process.
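For comparison, here is a sketch of an equivalent `jq` pipeline (placeholder bucket name again); the same caveat applies, since the full listing still crosses the wire before being filtered:

```shell
# Equivalent post-processing with jq: S3 still returns every
# object; the filtering happens locally.
aws s3api list-objects-v2 --bucket my-production-bucket \
  | jq '.Contents[] | select(.Key | contains("1018441"))'
```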