Big Data and Singing Crowds

Written by Piers Cawley on

I watched the rugby yesterday. England vs Wales at Cardiff Arms Pack. It was a great game of rugby - England were comprehensively outthought by a Welsh side with more experience where it counts, but by gum, they went down fighting to the very end. It’s going to be an interesting few years in the run up to the next World Cup.

While the game was going on, I found myself wondering why the crowd’s singing sounded so very good. It’s not a particularly Welsh thing (though Cwm Rhonda, Bread of Heaven and the whole Welsh crowd’s repertoire are have fabulous tunes). The Twickenham crowd getting behind Swing Low, Sweet Chariot sound pretty special too, even if I wish they still sang Jerusalem occasionally. How come a crowd of thousands, singing entirely ad lib with no carefully learned arrangements or conductor can sound so tight?

After all, if you took, say, 30 people and asked ‘em to sing a song they all know, it would sound ropey as hell (unless they were a choir in disguise and had already practiced). Three or four together might sound good because, with that few of you, it’s much easier to listen to your fellow singers and adapt, but 30’s too many for that and without some kind of conductor or leader, things aren’t likely to sound all that great.

I think it’s a statistical thing. Once you get above a certain number of singers, the fact that everyone’s going to sing a bum note now and again, or indeed be completely out of tune and time with everyone else, the song is going to start to make itself heard. Because, though everyone is wrong in a different way, everyone is right the same way. So the wrongs will start to cancel themselves out and be drowned by the ‘organised’ signal that is the song. And all those voices, reinforcing each other make a mighty noise.

That’s how big data works too. Once you have sufficient data (and for some signals sufficient is going to be massive) then the still small voices of whichever fraction of that data is saying the same thing will start to be amplified by where the noise is dissipated.

Just ask an astrophotographer. I have a colleague who takes rather fine photographs of deep space objects that are, to the naked eye nothing more than slightly fuzzy patches of space, only visible on the darkest of nights but which, through the magic of stacked imaging can produce images of stunning depth and clarity.

The Flame Nebula (NGC 2024), star Alnitak, and the Horsehead Nebula, Orion

If you’ve ever taken photographs with a digital camera at the kind of high ISO settings that Mike used to take this, you’ll be used to seeing horrible noisy images. But it turns out that, by leveraging the nature of the noise involved and the wonder of statistics, great photographs like this can be pulled out of noisy data. It works like this:

Any given pixel in a digital photograph is made up of three different componants:

  • Real light from the scene in front of the camera
  • Systematic error which is the same in every image
  • Thermal (usually) noise

The job of an astrophotographer is to work out some way of extracting the signal at the expense of the noise. And to do that, they have one massive advantage compared to the landscape or portrait photographer. The stars and nebulae may be a very very long way away. They may be very dim. But they don’t move. Once you’ve corrected for the motion of the earth, if you point your scope at the horsehead nebula today it’s going to look the same as it did yesterday and the day before that. Obviously, things do change, but, from the distance we’re looking, the change only happens on multi-hundred year timescales. This constancy makes the astrophotgraphers task, if not easy, at least possible.

So… the stars (like the tune of Cwm Rhonda) are unchanging, but the noise is different with every exposure (that’s why it’s called noise after all). Even if, on any given exposure the noise is as strong as the signal, by taking lots and lots of exposures and then averaging them, the noise will get smeared away to black (or very dark grey) and the stars will emerge from the gloom. Sorry. The stars and the systematic error will emerge from the gloom. So, all that remains to do is to take a photograph of the systematic error and take that away from the image.

Huh? How does one take a photograph of systematic error? You do it by photographing a grey sheet. Or, because it’s probably easier, by throwing your telescope completely out of focus so what you see is to all intents and purposes a grey sheet and taking a photograph (or lots of photographs - you’ve still got noise to contend with…) and subtracting the resulting error map from your stack of photographs and bingo, you’re left with an image that’s mostly signal. All that remains is to mess with the levels and curves and possibly to stack in a few false colour images grabbed from the infra red or the hydrogen alpha line where there’s lots of detail and you’re on your way to a cracking photograph.

Obviously, it’s not as easy as that - telescope mounts aren’t perfect, they drift, camera error changes over time. It’s bloody cold outside on a clear night. Sodium street lights play merry hell with the sky. And so on. But if you persevere, you end up with final images like the one above. That sort of thing’s not for me, but I’m very glad there are folk like Mike taking advantage of every clear night to help illuminate the awesome weirdness of our universe.

Noisy data is a pain, but, we’re starting to realise that, if you have enough data and computing power, you can pull some amazing signals out of it. Whether that’s the sound of thousands of Welsh rugby fans combining to sound like the voice of God; an improbably clear photograph of something that happened thousands of years ago a very long way away; your email client getting the spam/ham classification guess right 99 times out of 100; or Google tracking flu epidemics by analysing searches, if you have enough data and the smarts to use it, you can do some amazing things.

Some of them are even worth doing.