I got an incredible response to this post on LinkedIn, with more than 3,000 views, 224 likes and, more importantly, 27 comments. You can read the comments here at LinkedIn:
Here is the post in its entirety; I’d welcome more comments and discussion here as well…
With the proliferation of software-as-a-service applications across most organisations, many are likely suffering from a fragmented data environment. This is a problem because, just when organisations most need to homogenise their data strategy to take advantage of Big Data learnings, the opposite is happening: data decentralisation and even chaos.
In many cases, organisations have been focussed on data storage rather than data quality. Just managing the relentless growth of data volumes over the last 15 years has been challenge enough for CIOs. Rapidly scaling data storage infrastructure – software and networking as well as hardware – has been overwhelming, and all too often the actual quality of the data has suffered. How many companies can genuinely claim that their databases are sound, that their CRM data is clean and that the insane complexity of their spreadsheets is under control, let alone consolidated? The age-old adage “garbage in, garbage out” scales in severity with data volume.
Yet as data storage now decamps to the cloud and the focus moves to Big Data strategies, it seems that data quality is still not a priority. I wonder whether the industry – here in Australia as well as globally – is doing enough to develop human data skills rather than relying solely on Hadoop et al to do all the work. I’ve written before on the disconnect between data technology and human data skills. There is a lot of talk about “Data Scientists”, but is that just a fancy title for BI analysts?
Bona fide Data Scientists are like real-life scientists. They have a hypothesis, they test that hypothesis against different sets of data, and they validate or disprove it. Then they look for further correlation and causation, and they might come up with some real insight and a discovery. But in a commercial setting, the data scientist might invest a lot of time in developing a hypothesis only to find that the data isn’t available or is too messy to use. So what then? (It is worth reading this New York Times story on “Data Wrangling”.)
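To make that loop concrete, here is a minimal sketch in Python using pandas and SciPy. It is purely illustrative: the channels, spend figures and 0.05 threshold are invented, not drawn from any real analysis.

```python
# Minimal sketch of the hypothesis-test loop described above.
# The data, column names and significance threshold are illustrative only.
import pandas as pd
from scipy import stats

# Hypothesis: customers acquired via referral spend more than those from paid ads.
df = pd.DataFrame({
    "channel": ["referral", "referral", "paid", "paid", "referral", "paid", "paid", "referral"],
    "spend":   [120.0, 95.5, 60.0, None, 110.0, 72.5, 58.0, 101.0],
})

# Messy reality: missing values must be dealt with before any test is valid.
clean = df.dropna(subset=["spend"])

referral = clean.loc[clean["channel"] == "referral", "spend"]
paid = clean.loc[clean["channel"] == "paid", "spend"]

# Two-sample (Welch's) t-test: does the difference in means look real or like noise?
t_stat, p_value = stats.ttest_ind(referral, paid, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Only if the hypothesis survives is it worth digging into correlation and causation.
if p_value < 0.05:
    print("Evidence supports the hypothesis - worth investigating further.")
else:
    print("No significant difference - back to the drawing board (or better data).")
```

The interesting part is not the test itself but the dropna line: in practice, how much of the raw data survives that step often decides whether the hypothesis can be tested at all.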
Organisations need to work out – strategically and operationally – how to collect data appropriately, what data they need and what they might need to look for. There are data scrubbing tools, deduping tools and analytical tools, but they can only work with what has been captured: it obviously isn’t possible to scrub or dedupe data that doesn’t exist.
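As a rough illustration of what scrubbing and deduping actually involve – and where they hit their limits – here is a small sketch in pandas. The contact records and field names are made up for the example.

```python
# Illustrative scrub-and-dedupe pass over some invented CRM contact records.
import pandas as pd

contacts = pd.DataFrame({
    "name":  ["Jane Smith", "jane smith ", "Bob Jones", None],
    "email": ["jane@example.com", "JANE@EXAMPLE.COM", "bob@example.com", "unknown@example.com"],
})

# Scrubbing: normalise the fields we can actually repair.
contacts["name"] = contacts["name"].str.strip().str.title()
contacts["email"] = contacts["email"].str.lower()

# Deduping: collapse rows that now look identical on a chosen key.
deduped = contacts.drop_duplicates(subset=["email"])
print(deduped)

# No amount of scrubbing recovers the name that was never captured (the None row) -
# which is the point: tools cannot manufacture data that was never collected.
```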
So it is crucial for CIOs to look first at their overall application architecture, work out the data flows and how they integrate, and then determine what insight the business needs, what data that requires and where it can be sourced. This isn’t difficult, but it requires formality and strategy rather than ad hoc evolution. The current trend of SaaS proliferation, with services bought ad hoc on the credit card at the departmental level, is haphazard and is making data increasingly difficult for CIOs to manage – not only because the data is decentralised across different clouds, but also because there are now multiple data models that are often difficult to access and complicated to understand.
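One possible starting point – and this is only an illustrative sketch, with every system name and field invented – is a simple data-flow inventory that makes the gaps visible before any tooling decisions are made.

```python
# A back-of-the-envelope data-flow inventory. The systems and keys are invented,
# but this is the kind of map worth having before any Big Data spend makes sense.
data_flows = [
    {"source": "CRM (SaaS)",     "entity": "customers", "key": "email",      "feeds": ["marketing platform", "data warehouse"]},
    {"source": "Billing (SaaS)", "entity": "invoices",  "key": "account_id", "feeds": ["data warehouse"]},
    {"source": "Support desk",   "entity": "tickets",   "key": "account_id", "feeds": []},  # not integrated anywhere yet
]

# Even a crude inventory exposes the gaps: which systems never feed the warehouse,
# and which entities lack a common key to join on.
orphans = [flow["source"] for flow in data_flows if not flow["feeds"]]
print("Systems whose data goes nowhere:", orphans)
```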
If organisations want to truly benefit from the Big Data opportunity, there needs to be some sober and disciplined thought about data analysis skills, data quality control and data strategy before the kind of frantic technology acquisition that the media and vendors promote. Otherwise we will get no closer to any kind of data optimisation than we are now – we will just create more data mayhem, and the Return on Investment will remain just as elusive.
Picture credit: bigdatapix.tumblr.com