We need to focus on Data Basics before embarking on Big Data

I got an incredible response to this post on LinkedIn, with more than 3,000 views, 224 likes and, more importantly, 27 comments. You can read the comments here at LinkedIn:

Here is the post in its entirety. I’d welcome more comments and discussion here too.

With the proliferation of software-as-a-service applications across most organisations, it is likely that many organisations are suffering from a fragmented data environment. This is a problem because just at the time that most organisations need to homogenise their data strategy to take advantage of Big Data learnings, the opposite is happening: data decentralisation and even chaos.

In many cases, organisations have been focussed on data storage and not data quality. Just managing the demanding growth of data volumes for the last 15 years has been enough of a challenge for CIOs. Rapidly scaling data storage infrastructure – including software and networking as well as hardware – has been overwhelming and all too often the actual quality of the data has not been good. How many companies can genuinely claim their database was sound, that their CRM data was clean and that the insane complexity of spreadsheets was under control let alone consolidated? The age old adage “garbage in, garbage out” scales in severity with the size of data volume.

Yet as data storage now decamps to the cloud and the focus moves to Big Data strategies, it seems that data quality is still not a priority. I wonder if the industry – here in Australia as well as globally – is doing enough to enhance the human data skills rather than relying solely on Hadoop et al to do all the work. I’ve written before on the disconnect between data technology and human data skills. There is a lot of talk about “Data Scientists” but is that nothing more than just a fancy title for BI analysts?

Bona fide Data Scientists are like real-life scientists. They have a hypothesis, they test this hypothesis against different sets of data and validate or disprove it. Then they try to look for further correlation and causation, and then they might come up with some real insight and a discovery. But in a commercial situation, the data scientist might invest a lot of time in developing a hypothesis but then find that the data isn’t available or is too messy to use. So what then? (It is worth reading this New York Times story on “Data Wrangling”.)

Organisations need to work out – strategically and operationally – how to collect data appropriately, what data they need and then what they might need to look for. There are data scrubbing tools, deduping tools and analytical tools, but they are of little help if the raw data is not in an appropriate state, and obviously it isn’t possible to scrub or dedupe data that doesn’t exist.

So it is crucial for CIOs to look initially at their overall application architecture and work out the data flows and how they integrate, then what insight they might need and, operationally, what data is needed and where it can be sourced. This isn’t difficult but it requires formality and strategy rather than ad hoc evolution. The current trend of SaaS proliferation, with services bought ad hoc on the credit card at the departmental level, is haphazard and is making data increasingly difficult for CIOs to manage. This is not only because the data is decentralised, in different clouds, but because there are now different data models that are often quite difficult to access and often quite complicated to understand.

If organisations want to truly benefit from the Big Data opportunity there needs to be some sober and disciplined thoughts about data analysis skills, data quality control and data strategy before the kind of frantic technology acquisition that the media and vendors promote and discuss. Otherwise we are going to get no closer to any kind of data optimisation than we are now – we will just create more data mayhem and the Return on Investment will remain just as elusive.

Picture credit: bigdatapix.tumblr.com

Forecast is Cloudy with the chance of Pain

I am continually frustrated with the way that the IT industry has sometimes embraced cloud computing in a manner that I can only describe as naive and short-sighted. I wrote earlier this year about this, questioning whether CIOs are really considering what the Total Cost of Ownership (TCO) of their cloud investment is as they leap onto the cloud bandwagon at the behest of vendor and media hype.

Well I have been forced to put pen to paper again this week as I question whether enough thought has been put into the future roadmap of all the integration required to pull all their various ad hoc and short term cloud deployments together.

Rust Report: Cloud has come of age, but now it’s time to grow 

It has led me to try and draw a map of what I think the future of the cloud industry looks like and who I think the key cloud players will be. This should help any speculative investment decisions I think because it identifies where the real value is going to come from. I think we have moved out of the first phase of cloud – its childhood if you like – and it will be interesting to see how it matures into a fully grown industry from here. 

Just like any adolescent, this industry could learn the lessons of its past and adapt according to how the world needs it to develop; or it could completely go off the rails, neglect its study and concentrate on the partying! I certainly hope for all our sakes it is the former…but to mix metaphors, I think the forecast will get worse before it gets better.

[Picture Credit: nature.com]