Data drives modern organizations. Making sense of this data, discovering the patterns in it, and revealing unseen connections has become critical to business success. Organizations are using more data than ever before to make better decisions and unlock new data-driven revenue streams. Being able to handle massive data growth and extract timely insights from this increasingly important asset is therefore a strategic imperative for enterprises today, and it has spurred an entire Big Data industry.
Hadoop has emerged as the platform of choice for Big Data analytics. The Hadoop Distributed File System (HDFS), the de facto scale-out file system for Hadoop, has served to scale storage and compute for big data workloads across commodity hardware. However, as deployments scale beyond the petabyte range, the economics of storing and archiving this valuable asset have become a major concern for the big data industry.
Several challenges limit the overall adoption and scalability of traditional big data deployments:
- The need to move data from the source system's storage into an HDFS cluster, which delays analytics.
- Increased operational and capital cost of holding copies of data in different systems.
- Security risks associated with data residing outside the enterprise IT security domain in shadow IT environments.
- Inability to scale storage nodes independently of compute in HDFS, leading to unnecessary resource cost, especially for warm or archive data that uses storage but little to no compute.
- Reliance on data replication (factor of 3) for data protection at large scale, which increases the overall cost of data storage.
SOLUTION FEATURES & BENEFITS
- Independent scaling of storage from expensive compute resources
- Efficient data durability with erasure coding, doubling usable capacity compared to 3x replication
- In-place analytics directly from storage leading to faster insight and reduced infrastructure costs
- Support for multi-site automated DR and local storage access at remote locations
- Single namespace across locations eliminates management workload and complexity through a unified view of data
- Start small and easily expand solution with non-disruptive scaling
- Hybrid and multi-cloud interoperability with AWS, Google Cloud Platform, and Azure.
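The capacity claim in the list above can be illustrated with a little arithmetic. The following is a minimal sketch, not a description of HyperStore's actual configuration: it assumes an erasure-coding layout of 4 data fragments plus 2 parity fragments (a hypothetical choice, since no scheme is named here) and compares it with 3x replication. The function names and the 900 TB raw capacity are illustrative only.

```python
from fractions import Fraction


def replication_usable_fraction(copies: int) -> Fraction:
    """Fraction of raw capacity that holds unique data when every object is stored `copies` times."""
    return Fraction(1, copies)


def erasure_usable_fraction(data_fragments: int, parity_fragments: int) -> Fraction:
    """Fraction of raw capacity that holds unique data under a k+m erasure-coding layout."""
    return Fraction(data_fragments, data_fragments + parity_fragments)


def usable_capacity_tb(raw_tb: int, usable_fraction: Fraction) -> float:
    """Usable capacity in TB for a given raw capacity and usable fraction."""
    return float(raw_tb * usable_fraction)


if __name__ == "__main__":
    raw_tb = 900  # illustrative raw capacity, not a figure from the source

    replicated = usable_capacity_tb(raw_tb, replication_usable_fraction(3))
    # Assumed 4+2 layout: 4 data fragments, 2 parity fragments.
    erasure_coded = usable_capacity_tb(raw_tb, erasure_usable_fraction(4, 2))

    print(f"3x replication: {replicated:.0f} TB usable")
    print(f"4+2 erasure coding: {erasure_coded:.0f} TB usable")
    print(f"ratio: {erasure_coded / replicated:.1f}x")
```

Under these assumptions, two thirds of the raw capacity is usable with erasure coding versus one third with 3x replication, which is where the "double" figure comes from. Other layouts give different ratios.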
Object-based storage systems overcome the limitations of HDFS described above and are an attractive complement to HDFS, offering a scalable and cost-effective option for creating Big Data lakes.
Cloudera's Enterprise Data Hub (EDH) is a modern big data platform powered by Apache Hadoop at the core. It provides a central, scalable, flexible, and secure environment for handling workloads ranging from batch and interactive to real-time analytics. Cloudian HyperStore provides a software-defined storage platform that delivers industry-leading scalability, supporting hundreds of petabytes (PB) via the S3 RESTful API, with data integrity and protection.
A Big Data lake built on Cloudian's HyperStore and connected to Cloudera's EDH via an S3A connector provides an economical and scalable storage option for Big Data workloads.
This enables enterprises to:
- Create a tiered Big Data storage environment where hot data can reside within HDFS for immediate access, while HyperStore provides the storage/archive layer for warm data
- Analyze data in place without moving it from the source system into HDFS
- Scale storage independently of compute nodes, which yields a significant cost saving
- Take advantage of the associated metadata stored within HyperStore for accessing and searching through data under archive
- Use erasure coding instead of 3x replication for the HyperStore layer, getting increased storage density and doubling the usable storage capacity as compared to traditional HDFS
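As a rough illustration of how a Hadoop client might be pointed at an S3-compatible object store through the S3A connector, the sketch below builds a small set of standard `fs.s3a.*` properties and renders them in `core-site.xml` form. It is a sketch under stated assumptions rather than a recommended configuration: the endpoint URL and credential placeholders are hypothetical, and a real deployment would normally keep secrets in a credential provider rather than in plain text.

```python
import xml.etree.ElementTree as ET


def s3a_properties(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = True, path_style: bool = True) -> dict:
    """Collect Hadoop S3A settings for an S3-compatible endpoint."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        # Path-style addressing is commonly needed for non-AWS endpoints.
        "fs.s3a.path.style.access": str(path_style).lower(),
    }


def to_core_site_xml(props: dict) -> str:
    """Render the properties as a Hadoop <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical endpoint and placeholder credentials, for illustration only.
    props = s3a_properties(
        endpoint="https://hyperstore.example.internal",
        access_key="ACCESS_KEY_PLACEHOLDER",
        secret_key="SECRET_KEY_PLACEHOLDER",
    )
    print(to_core_site_xml(props))
```

With settings along these lines in place, jobs could address data as `s3a://<bucket>/<path>` and read it where it lives instead of first copying it into HDFS.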
In addition, HyperStore supports multi-site deployment for data protection and disaster recovery. HyperStore's policy-based replication ensures that key assets are automatically copied to and available at two or more sites, providing failover instances if the primary site goes offline.