The arrival of new data types in unprecedented volumes, the phenomenon known as big data, continues to cause CIOs and business leaders to rethink their technology portfolios. Few companies build their own infrastructure. Most will buy it. But what should they buy? And how can they put the pieces together into a coherent whole?
The first challenge of big data is that it requires new technology. At the same time, the arrival of big data has not rendered all other types of data and technology obsolete. Hadoop, NoSQL databases, analytic databases, and data warehouses live side by side. Analysts don't care where the data comes from: they will crunch it from any source.
The second challenge is data integration. How can the new technology for processing big data use all of the data and technology now in place? How can existing technology and data be improved by adding big data? How can new forms of analytics and applications use both the old and the new?
CITO Research believes that CIOs and business leaders can speed progress by focusing on the task of integrating the new world of big data with the existing world of business intelligence (BI). This buyer’s guide describes how to think about purchasing technology for big data integration.
Challenges of Big Data Integration: New and Old
Aficionados of big data will be familiar with the ways that big data is different from previous generations of data. It’s often described in terms of 3 V’s—volume, variety, and velocity—a handy construct for thinking about big data introduced by Gartner analyst Doug Laney.
The challenge is to find a repository that can handle huge volumes of data. A related problem is analyzing streams of data that come from machines, servers, mobile devices, and sensors from the Internet of Things (IoT). The Hadoop ecosystem has emerged to handle the volume, velocity, and variety of data.
Another challenge is dealing with the fact that big data often requires new techniques for exploration and analysis. Big data is typically unstructured or semi-structured, and raw text documents and video are often included as data types. Machine learning, text and video analytics, and many other techniques applied to the data in Hadoop, NoSQL databases, and analytic databases help messy data become meaningful. Once you have met these challenges, the tasks related to using big data start to feel very much like those involved in processing existing data (see "Challenges Big Data and Existing Data Share").
The equation for handling big data looks something like this:

new big data technology + existing data and BI technology = expanded, unified use of all data
While big data may change many things about the way BI is done, it will not make BI obsolete. This means that the right path to big data integration is likely to come through existing data integration solutions that have been adapted to incorporate big data.
In addition, there is a difference between performing proof of concepts and operationalizing big data. A big data integration technology should not only enable a science experiment but should also support the beginning, middle, and end of the journey toward making full use of big data in conjunction with existing applications and systems for BI.
What You Need for Big Data Integration
To make the right choice about assembling a system for big data integration, consider what you will need. Most organizations will need the following capabilities.
Connect, Transport, and Transform
Accessing, moving, and transforming data have been at the heart of several generations of data integration technology. Big data integration adds some new twists.
The ability to handle complex data onboarding at scale from many data sources is critical. Today, enterprises face the challenge of ingesting hundreds of data sources in many different formats, some of which change over time. Additional data sources are added regularly. Solutions should leverage templates and automation to reduce the manual work of creating jobs and transformations to onboard data into Hadoop. The data onboarding process should be flexible, repeatable, reliable, and automated as much as is feasible.
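To make the template idea concrete, here is a minimal sketch in Python. The template fields, source names, and the `build_onboarding_job` helper are all hypothetical illustrations of the pattern, not features of any particular product.

```python
# Hypothetical sketch of template-driven onboarding: a shared template
# plus per-source overrides generates one ingestion job per source, so
# adding a source means adding a config entry, not writing new code.

CSV_TEMPLATE = {"parser": "csv", "delimiter": ",", "target": "hadoop"}

def build_onboarding_job(source_name, overrides=None):
    """Merge a source's overrides into the shared template."""
    job = dict(CSV_TEMPLATE)           # start from the template defaults
    job["source"] = source_name
    job.update(overrides or {})        # per-source deviations win
    return job

# Hundreds of sources become a registry of small configs.
sources = {
    "web_logs": {"parser": "json"},         # JSON instead of CSV
    "pos_transactions": {"delimiter": "|"}, # pipe-delimited records
}
jobs = [build_onboarding_job(name, cfg) for name, cfg in sources.items()]
```

When a source's format changes, only its config entry changes; the shared template and job-generation logic stay put, which is what keeps onboarding hundreds of sources tractable.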
Access to data through Hadoop, NoSQL databases, and analytic databases must be supported, as well as connectivity to a variety of data formats, including JSON, XML, various log formats, emerging IoT standards, and so on. The ability to define or discover a schema is crucial. Modern data integration technology must be deployable both in the cloud and on-premises.
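Schema discovery can be as simple as scanning incoming records and recording the types observed for each field. The sketch below, with a hypothetical `discover_schema` helper, shows the idea for JSON data; a real tool would also handle nesting, arrays, and schemas that evolve over time.

```python
import json

def discover_schema(records):
    """Infer a field -> type-name map by scanning records; fields that
    appear with more than one type are flagged as 'mixed'."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            tname = type(value).__name__
            if field not in schema:
                schema[field] = tname
            elif schema[field] != tname:
                schema[field] = "mixed"
    return schema

# Two log lines whose 'msg' field drifts between string and null.
raw = ['{"id": 1, "msg": "ok"}', '{"id": 2, "msg": null}']
schema = discover_schema(json.loads(line) for line in raw)
```

Flagging type drift at ingestion time is what lets downstream transformations fail fast instead of silently corrupting analyses.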
When companies make use of data, it is vital that everyone—analysts, end-users, developers, and anyone else who is interested—is able to explore the data and ask questions. This need for a hands-on way to examine and play with the data is required at all levels of the system.
It doesn’t matter whether the data resides in a Hadoop cluster, a NoSQL database, a special purpose repository, an in-memory analytics environment, or an application. The best results will come when anyone with a question can bang away and see if the data can answer it.
For big data, this usually means that some sort of exploratory environment will be used in conjunction with the repositories, which typically only allow data access through writing programs or using complicated query mechanisms.
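The gap between "write a program" and "ask a question" can be narrowed with thin exploratory helpers layered over whatever repository holds the data. The `explore` function below is a hypothetical sketch of such an ad-hoc filter over rows already pulled into memory, not the interface of any specific exploratory tool.

```python
def explore(rows, **criteria):
    """Ad-hoc equality filter: explore(rows, status="error") returns
    the rows whose fields match every keyword criterion."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in criteria.items())]

# Works the same whether the rows came from a Hadoop cluster,
# a NoSQL database, or a flat file export.
rows = [
    {"region": "west", "status": "error"},
    {"region": "east", "status": "ok"},
]
hits = explore(rows, status="error")
```

The point is the interface, not the implementation: anyone with a question can express it as a filter, without touching MapReduce jobs or repository-specific query syntax.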
It is well known among data analysts in any domain that as much as 80 percent of the work to get an answer or to create an analytical application is done up front to clean and prepare the data. Data integration technology has long been the workhorse of analysts who seek to accelerate the process of cleaning and massaging data.
In the realm of big data, this means that all of the capabilities mentioned so far must be present: easy to use mechanisms for defining transformations, the ability to capture and reuse transformations, the ability to create and manage canonical data stores, and the ability to execute queries. Of course, all of this needs to be present for big data repositories as well as those that combine all forms of data.
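One way to capture and reuse transformations is to express each cleaning step as a small function and compose named steps into pipelines that can be applied to rows from any repository. The step names and the `make_pipeline` helper below are illustrative, not any vendor's API.

```python
# Each cleaning step is a plain function on a row dict; make_pipeline
# captures a sequence of steps as a reusable, shareable transformation.

def trim_fields(row):
    """Strip stray whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in row.items()}

def drop_empty(row):
    """Remove fields that carry no information."""
    return {k: v for k, v in row.items() if v not in ("", None)}

def make_pipeline(*steps):
    """Compose steps into one function that cleans a stream of rows."""
    def run(rows):
        for row in rows:
            for step in steps:
                row = step(row)
            yield row
    return run

clean = make_pipeline(trim_fields, drop_empty)
result = list(clean([{"name": "  Ada ", "note": ""}]))
```

Because the pipeline is a first-class object, it can be named, versioned, and reused, which is how the 80 percent of effort spent on cleaning gets amortized across analysts instead of repeated by each one.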
A guided approach to exploration and analysis is required because so many machine learning and advanced analytical techniques are available for so many different types of data. The machine learning techniques used to create predictive models from streaming data are far different from those used to categorize unstructured text.
Once an analyst has created a useful, clean data set, the value of the work can be amplified by allowing sharing and reuse. Right now, environments to support sharing and collaboration are emerging. Some environments support architected blending of big data at the source to enable easier use and optimal storage of big data. Big data integration technology should support such environments.
Preferred Technology Architecture
The ideal system for big data integration is different at every company. The most data-intensive firms will likely need every capability mentioned. Most companies will need quite a few of them, and more as time goes on.
The best way to provision the capabilities for big data integration is to acquire as few systems as possible that have the needed features. Most of the capabilities mentioned are stronger when they are built to work together.
The ideal big data integration technology should simplify complexity, be future proof through abstractions, and invite as many people and systems as possible to make use of data.
The Rewards of Getting Big Data Integration Right
Data does no good unless it is presented to a human who can somehow benefit from it or unless it is used in an automated system that a human designed. The point of big data integration is to make it as easy as possible to access, understand, and make use of data.
The rewards of getting big data integration right are the benefits that come from greatly expanded and timely use of all available data. Reducing delays, eliminating skill bottlenecks, and getting fresh data to analysts and applications means that an organization can move faster and more effectively.