A fairly common task I need to do is:
- query a remote service which returns a large amount of data
- extract just the bits I need from that data
- do something with the extracted bits of data
Often the initial query takes a few seconds to run, and I find myself thinking: I can’t be bothered to wait for this, so why don’t I just cache the data?
If I decide it is worth it, the next question is where and how to cache.
Perl AnyEvent
And what I generally think of first is an AnyEvent-based Proxy Server. Just as quickly, I discard that option as I can’t be bothered to figure out how to ensure the proxy is up when I need it. For example, what happens when the physical host where the proxy is running reboots? What happens if the sysadmin kills my process? Etc.
Storable and Freeze / Thaw
So, next I think: the filesystem is always available (hopefully!), I’ll use that. So then I’m thinking about Cache::File, Storable and Freeze/Thaw.
This often leads to a couple of issues too – where should I store my data files? Should I store them somewhere in $HOME, in which case everyone who wants to cache the data still hits the remote service themselves, or should I find a shared area? Then, should this shared area be on a network drive, or local to the box?
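A minimal sketch of the filesystem option, using only core modules (Storable, File::Spec, File::Temp). The `fetch_from_service` and `cached_fetch` subs, the `.sto` file suffix and the one-hour expiry are assumptions made up for illustration, and the temporary directory stands in for whichever $HOME or shared location you settle on.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the slow remote query.
my $remote_calls = 0;
sub fetch_from_service {
    my ($query) = @_;
    $remote_calls++;
    return { query => $query, rows => [ 1 .. 5 ] };
}

# Return cached data if a fresh cache file exists; otherwise query and store.
sub cached_fetch {
    my ($cache_dir, $query, $max_age_secs) = @_;

    # Crude filename-safe cache key derived from the query string.
    (my $name = $query) =~ s/[^A-Za-z0-9_.-]/_/g;
    my $file = File::Spec->catfile($cache_dir, "$name.sto");

    # Element 9 of stat() is the file's last-modified time.
    if (-e $file && time() - (stat $file)[9] <= $max_age_secs) {
        return retrieve($file);
    }

    my $data = fetch_from_service($query);

    # Write under a temporary name, then rename, so that other readers
    # never pick up a half-written cache file.
    my $tmp = "$file.$$.tmp";
    nstore($data, $tmp);
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
    return $data;
}

my $cache_dir = tempdir(CLEANUP => 1);
my $first  = cached_fetch($cache_dir, 'samples?run=42', 3600);
my $second = cached_fetch($cache_dir, 'samples?run=42', 3600);
print "remote calls: $remote_calls\n";
```

The second call is served from the file, so the remote service is only hit once. Note that this does nothing about the permission and location questions above; it only shows the mechanics.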
Database
So, finally, I get to thinking about caching the data in the database. I know for sure that will always be available.
Can you guess which option I generally choose? Does anyone else have any thoughts on where and how to cache data?
We have just resolved this problem as well. We have an analysis pipeline which calls out to multiple webservices to retrieve sample information, often multiple times.
Because the pipeline cannot work without the filesystem (it is all about manipulating files in a run folder), we chose to create a cache folder inside the run folder on the FS, and then retrieve from there. To populate it, the first thing we do is spider out for all the information we are going to need, which has the added advantage of croaking early if any of the webservice information isn’t available.
We have found so far that this has been a viable setup.
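A rough sketch of that kind of setup, not the actual pipeline code: the `lookup_sample`, `prefetch` and `cached_sample` subs, the `cache` folder name and the sample IDs are all hypothetical, and an in-memory hash stands in for the webservices.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Storable qw(nstore retrieve);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for a webservice call; returns undef when data is missing.
my %service = (sample_1 => { name => 'A' }, sample_2 => { name => 'B' });
sub lookup_sample { return $service{ $_[0] } }

# Fetch everything the run will need up front; croak if anything is missing.
sub prefetch {
    my ($run_folder, @sample_ids) = @_;
    my $cache_dir = File::Spec->catdir($run_folder, 'cache');
    make_path($cache_dir);
    for my $id (@sample_ids) {
        my $info = lookup_sample($id);
        croak "No webservice data for $id" unless defined $info;
        nstore($info, File::Spec->catfile($cache_dir, "$id.sto"));
    }
    return $cache_dir;
}

# Later pipeline steps read only from the cache folder.
sub cached_sample {
    my ($cache_dir, $id) = @_;
    return retrieve(File::Spec->catfile($cache_dir, "$id.sto"));
}

my $run_folder = tempdir(CLEANUP => 1);
my $cache_dir  = prefetch($run_folder, qw(sample_1 sample_2));
my $sample     = cached_sample($cache_dir, 'sample_2');
print "cached name: $sample->{name}\n";

# A missing sample stops the run before any real work starts.
my $ok = eval { prefetch($run_folder, 'sample_3'); 1 };
print "prefetch failed early: $@" unless $ok;
```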
Cheers
Andy
I haven’t played with it yet – but I just came across http://search.cpan.org/dist/CHI/ which looks very interesting – even if it is not a direct solution, you could create a driver for it that does what you want.
DB_File, BerkeleyDB, SQLite
@setitesuk – I do often think carefully about caching to the filesystem. Then I generally worry about permissions: who can write to the cache, who can read from it, and that sort of thing. As the database already handles all kinds of permissions for me, I default to the database.
@Leo – Is CHI the modern version of Cache::* then? How did you come across it? Speaking of which, that brings another blogpost to mind.
@Anonymous – thanks. But I’m more interested in the reasons why you use those things. Why don’t you just store to a plain file with Storable, for example?
@Jared – it was being discussed on IRC – yea, ‘modern’ version of Cache::* stuff – as I say I’ve not played, but several people I respect are now using it.
@Jared – Its goal is to be the Cache equivalent of DBI – hence my suggestion to create a driver that works with it.