Sunday, August 22, 2010

Google’s “High Performance Image Serving”

[Update] I stand corrected - I now have billing enabled on my Apps account, and can confirm that images are served with all the correct response headers set. The URLs do indeed support 304 conditional GETs on the production infrastructure, which makes this a very attractive image hosting solution. Well done Google, and apologies for the misrepresentation.

(NB – this article is based on the development SDK (1.3.6) – I haven’t yet been able to test on the production infrastructure because, for some reason, Google won’t authorize billing on my AppEngine account, without which the Blobstore is unavailable!)

The concept of a hosted service that manages image serving / caching, resizing & cropping is something that has cropped up in projects I’ve been involved with for the past 5 years (and the rest).

I had a hand in a startup about 6 years ago now that processed MMS (picture messages) that people sent in. At its simplest this involved storing people’s uploaded images, and then cropping / rotating and resizing the images to fit a certain profile (we were printing them out, and needed to convert landscape to portrait, and crop to a fixed aspect ratio.)

Subsequent to that experience I worked for many years in online digital entertainment, developing a platform for the processing of music / video assets, and one of the elements we also needed was the ability to take in hi-res images, shrink to an acceptable size / format (e.g. not TIFF), and then host them for delivery across a CDN.

The project I’m currently working on has a combination of the two – processing user-generated content for serving back over the web. What we need is an image processing service, backed with a large data store, and a high-performance cache.

So, it was with great excitement that I noticed that the Google App Engine SDK comes with an in-built library to do exactly this. It’s built on top of the Picasa library (it even includes the “I feel lucky” transformation), and enables cropping, resizing, rotation etc. The App Engine platform has no file-backed storage, but the datastore does include the BlobProperty type, which can be used to store binary data (such as images). A simple image processor using this took about fifteen minutes to set up (which was mainly cut-and-paste from their sample app here).
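
By way of illustration, here is a minimal sketch of that sort of processor – a datastore entity with a BlobProperty, an upload handler, and a handler that resizes on the way out. The model and handler names below are my own, not Google’s sample code, so treat it as a sketch rather than a drop-in:

# Minimal sketch: store an uploaded image in a BlobProperty and resize on serve.
# Model / handler names are illustrative, not from Google's sample app.
from google.appengine.api import images
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app


class Picture(db.Model):
    data = db.BlobProperty()  # raw image bytes (subject to the 1MB entity limit)


class Upload(webapp.RequestHandler):
    def post(self):
        # expects a multipart/form-data POST with a file field named 'img'
        pic = Picture(data=db.Blob(self.request.get('img')))
        pic.put()
        self.redirect('/img?id=%s' % pic.key())


class Serve(webapp.RequestHandler):
    def get(self):
        pic = db.get(self.request.get('id'))
        thumb = images.resize(pic.data, 200, 200)  # PNG output by default
        self.response.headers['Content-Type'] = 'image/png'
        self.response.out.write(thumb)


application = webapp.WSGIApplication([('/upload', Upload), ('/img', Serve)])

if __name__ == '__main__':
    run_wsgi_app(application)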

(Some people may by now be thinking “Tinysrc” – the online service that resizes images for mobile screens – well, no surprises, tinysrc runs on AppEngine – this is precisely how they do it, except that they pull the images from a remote server – they are not stored.)

The downside of this approach is that the datastore has a 1MB limit per entity, which makes it borderline useful if you’re dealing with UGC (web-optimised images should never be 1MB, but the image someone just uploaded from their new digital camera could easily be.)

Fortunately, Google provides a secondary datastore specifically for large binary objects, called the Blobstore. It’s well documented (here), so I won’t repeat that, but what I can say is that it integrates directly with the Images API (see here). There are some complex limits about the amount of data you can process, so read the article carefully, but suffice to say it can be done. (See here for a nice example of the blobstore / image api interaction.)
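
A hedged sketch of how the two fit together – the upload is posted to a Blobstore-generated URL, and the Images API is then pointed straight at the resulting blob key. The paths, field names and handler names here are my own assumptions, not taken from the linked example:

# Sketch of a Blobstore upload plus an Images API transform against the blob key.
# Paths, field names and handler names are illustrative assumptions.
from google.appengine.api import images
from google.appengine.ext import blobstore, webapp
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp.util import run_wsgi_app


class UploadForm(webapp.RequestHandler):
    def get(self):
        # Blobstore hands us a one-shot URL to POST the file to
        upload_url = blobstore.create_upload_url('/upload')
        self.response.out.write(
            '<form action="%s" method="POST" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>' % upload_url)


class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        blob_info = self.get_uploads('file')[0]
        self.redirect('/thumb?key=%s' % blob_info.key())


class Thumbnail(webapp.RequestHandler):
    def get(self):
        # The Image class can be constructed from a blob key, so the transform
        # runs against the Blobstore object rather than a 1MB datastore entity
        img = images.Image(blob_key=self.request.get('key'))
        img.resize(width=400)
        self.response.headers['Content-Type'] = 'image/jpeg'
        self.response.out.write(img.execute_transforms(output_encoding=images.JPEG))


application = webapp.WSGIApplication(
    [('/', UploadForm), ('/upload', UploadHandler), ('/thumb', Thumbnail)])

if __name__ == '__main__':
    run_wsgi_app(application)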

A killer function, which has been publicised this last week (as “High Performance Image Serving”), is the “get_serving_url” function in the Images API – which takes in a Blobstore object key, and returns a fixed URL that can be used as the static image URL. This looks almost like a Google CDN – the ability to serve images as static content, with the ability to crop / resize on the fly (albeit using fixed sizes) thrown in for free.
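
Assuming you already have a Blobstore key (the blob_key variable below), usage looks roughly like this – the size / crop arguments, and the “=sNNN” suffix on the returned URL, are what give you the fixed-size resize / crop for free:

# Sketch of get_serving_url usage; blob_key is assumed to come from an upload
# such as the Blobstore example above.
from google.appengine.api import images

url = images.get_serving_url(blob_key)                          # full-size image
thumb = images.get_serving_url(blob_key, size=200)              # resized to 200px
square = images.get_serving_url(blob_key, size=200, crop=True)  # resized and cropped

# The same transforms can be requested by appending a suffix to the base URL,
# e.g. url + '=s200' or url + '=s200-c'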

[Updated – see intro] And yet… if you set up an image service using these amazing (and practically free) resources, you’ll find a fairly large hole in the implementation. It’s our old friend HTTP status codes. The fixed URL exposed by the images service does not support a 304 (Not Modified) response – meaning that every time you call for it, you get the whole thing again, increasing server bandwidth and client download times. (See introductory note – this may just be a development server issue – TBC.)
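
To check for yourself, you can re-request a serving URL with conditional headers and see whether you get a 304 back or the full body again – something along these lines (the URL below is a placeholder, not a real serving URL):

# Rough check for conditional GET support on a serving URL (placeholder URL).
import httplib
import urlparse

parts = urlparse.urlparse('http://example.appspot.com/serving-url-goes-here')

# First request: fetch the image and note its validators
conn = httplib.HTTPConnection(parts.netloc)
conn.request('GET', parts.path)
first = conn.getresponse()
first.read()

headers = {}
if first.getheader('ETag'):
    headers['If-None-Match'] = first.getheader('ETag')
if first.getheader('Last-Modified'):
    headers['If-Modified-Since'] = first.getheader('Last-Modified')

# Second request: ask only for the image if it has changed
conn2 = httplib.HTTPConnection(parts.netloc)
conn2.request('GET', parts.path, headers=headers)
second = conn2.getresponse()
print second.status, len(second.read())  # 304 with an empty body = conditional GETs work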

I can only assume that this is deliberate – as Google gets its money from the bandwidth charge. It is however extremely annoying.

Saturday, August 21, 2010

Cloud computing – where’s the PHP platform?

As the fog begins to lift from the world of cloud computing, the classification of cloud services is becoming clearer. Whilst the likes of Salesforce and Google Docs made the running in the Application space (Software-as-a-Service), and Amazon was the clear leader in cloud infrastructure (IaaS), the most interesting area (at least for me) is Platform-as-a-Service (PaaS).

PaaS has been around for a while, and offers some compelling advantages over IaaS for the software development community. Virtualisation (the basis for IaaS) might be cost effective, but from the point of view of the software developer it’s still an O/S. PaaS removes much of the boilerplate code that developers have to write, and can make architectural decisions around scaling and performance redundant (the platform providers do the hard work, you just have to follow the rules.)

The two leading offerings at the moment are Microsoft’s Azure (.NET based) and Google’s AppEngine (Python or Java). There are a couple of Ruby offerings (Heroku and Engine Yard), and a new arrival in Djangy.com (Python / Django). The obvious missing link is a PHP-based PaaS offering – the LAMP community is now behind the curve where it once led (most early EC2 adopters were touting their LAMP credentials).

Now, where could a PHP service come from? What we need is a company that has experience running a global infrastructure, supports well-understood PHP web-frameworks (documented, and preferably OSS), and is looking to encourage the army of keen LAMP developers out there to stick with the stack, rather than migrate to Python / Ruby (or heaven forbid .NET!)

Facebook have just posted an update on their developer support blog (here), and PHP isn’t in it – it’s all about Facebook apps and integration. It would be nice to see them being a little more ambitious – they are the obvious choice, and by integrating things like HipHop, Hive, Scribe, Tornado, Memcache, and of course Cassandra, they would have an incredibly compelling service on offer.

Saturday, August 07, 2010

Schema-less data and strongly-typed objects

There has been some talk on the NoSQL grapevine recently about schema-mapping and the associated issues when dealing with schema-less data stores (specifically document-databases).

To my mind the core problem is not that it’s difficult to control the schema – that, after all, is the point – but rather the use of statically-typed languages to access the contents. Deserializing documents back into strongly-typed binary objects will cause a problem if the structure of the documents changes. Deserializing JSON back into JavaScript objects doesn’t suffer from this problem.

It then becomes a client-side issue of understanding what to do with an object when it doesn’t have the property you’re looking for. This is a great example of why you should consider the use of NoSQL data stores carefully – caveat emptor is the guiding principle.
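
A contrived example of the kind of decision involved (purely illustrative – not taken from any particular driver or ODM):

# Illustrative only: an older document that predates a newly added field
# breaks a strict mapping, so the client has to decide what "missing" means.
import json


class UserProfile(object):
    """Strongly-typed view of a 'user' document; twitter_handle was added later."""
    def __init__(self, name, email, twitter_handle):
        self.name = name
        self.email = email
        self.twitter_handle = twitter_handle


# An old document, written before twitter_handle existed
old_doc = json.loads('{"name": "Alice", "email": "alice@example.com"}')

# Naive strict mapping blows up on the old shape...
try:
    profile = UserProfile(old_doc['name'], old_doc['email'], old_doc['twitter_handle'])
except KeyError:
    # ...so the client must choose a policy for the missing property:
    # default it, lazily migrate the document, or reject it outright.
    profile = UserProfile(old_doc['name'], old_doc['email'],
                          old_doc.get('twitter_handle'))  # policy: default to None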

(Here’s an example of the sort of thing going on to mitigate such problems - http://www.jwage.com/2010/07/30/doctrine-mongodb-odm-schema-migrations/ – and yes, I do know that MongoDB uses BSON, not JSON – I’m just illustrating the point.)

Thursday, August 05, 2010

Times Online pricing

I received another email from the Times today extolling the virtues of its new site, and suggesting I take out a subscription. The thing is, I don’t want to read the digital equivalent of the back-breaking, forest-destroying Sunday Times – however I might want to read a few choice articles or sections / categories. Which is surely the major advantage of digital over physical: I can pick the articles I want to read, and ignore everything else. Except that the Times has applied the same pricing policy to their online version. Aaargh. Idiots.

Tuesday, August 03, 2010

IT, innovation and the internet

After seeing this post by Martin Fowler, I thought I should revisit / update the article I posted here, given that the IT / innovation divide seems to be gaining traction.

My original post was a bit OTT, and primarily a reaction to the conference I’d attended the previous day. I’ve had plenty of time to think through the issues since then, and as a result I now believe in the divide more than ever.

There is an excellent article here that describes some of the issues, but I still think there is a further distinction between companies who make their technical innovation part of their corporate DNA, and those who don’t – those who pursue strategies of operational efficiency and economies of scale.

It seems to me that this distinction is clearest in the case of pure-play internet businesses. The internet is an entire ecosystem within which innovation is key. People talk about “internet time” precisely because the rate of change is so great. And in this environment, I think that the responsibility for innovation (whether that be application development or infrastructure-led) should not sit with the IT department, however clear the distinction. IT should report to the operational director (or COO); “innovation” should report to whoever is responsible for strategy and growth.

I’d almost go so far as to suggest that the Waterfall / Agile schism is a reflection of this change. Perhaps all projects that can / should be scoped in full in advance of implementation should fall under IT, and those with a more ‘uncertain’ outcome should fall under the auspices of the new department, whatever it’s called.

What is clear (to me) is that the fact that a project / program / initiative involves either a computer, or software, does NOT automatically mean that it should come under the banner of IT. When the web was starting, the IT department appropriated web development simply because they knew one end of a computer from the other.

As Ross Pettit points out “IT has no definition on its own … it only has definition as part of a business”.