Streaming data is becoming a core component of enterprise data architecture. Streaming technologies are not new, but they have matured considerably over the past year. The industry is moving from painstaking integration of technologies like Kafka and Storm towards full-stack solutions that provide an end-to-end streaming data architecture.
What is Streaming Data Architecture?
A streaming data architecture can ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may perform real-time processing, data manipulation and analytics.
Why Streaming Data Architecture? Benefits of Stream Processing
Stream processing is becoming essential data infrastructure for many organizations. Typical use cases include clickstream analytics, which allows companies to track web visitor activities and personalize content; eCommerce analytics, which helps online retailers avoid shopping cart abandonment and display more relevant offers; and analysis of large volumes of streaming data from sensors and connected devices in the Internet of Things (IoT).
Stream processing provides several benefits that other data platforms cannot match:
- Able to deal with never-ending streams of events: some data is naturally structured this way. Traditional batch processing tools require stopping the stream of events, capturing batches of data and combining the batches to draw overall conclusions. Stream processing, although it makes combining and capturing data from multiple streams challenging, lets you derive immediate insights from large volumes of streaming data.
- Real-time or near-real-time processing: most organizations adopt stream processing to enable real-time data analytics. While real-time analytics is also possible with high-performance database systems, the data often lends itself to a stream processing model.
- Detecting patterns in time-series data: detecting patterns over time, for example looking for trends in website traffic data, requires data to be continuously processed and analyzed. Batch processing makes this more difficult because it breaks data into batches, so a pattern that spans several events can be split across two or more batches.
- Easy data scalability: growing data volumes can break a batch processing system, requiring you to provision more resources or modify the architecture. Modern stream processing infrastructure is highly scalable and can handle gigabytes of data per second with a single stream processor. This allows you to deal with growing data volumes without infrastructure changes.
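To make the time-series point concrete, here is a minimal Python sketch of how a pattern can be missed when data is cut into independent batches. The `rising_runs` function, the sample `traffic` values, and the batch size of 3 are all assumptions made for illustration; they do not come from any particular streaming product.

```python
from collections import deque
from typing import Iterable, Iterator


def rising_runs(values: Iterable[float], run_length: int = 3) -> Iterator[int]:
    """Yield the index at which a run of `run_length` strictly rising values completes."""
    window: deque = deque(maxlen=run_length)  # keeps only the most recent values
    for index, value in enumerate(values):
        window.append(value)
        items = list(window)
        if len(items) == run_length and all(a < b for a, b in zip(items, items[1:])):
            yield index


# Hypothetical page views per minute.
traffic = [5, 4, 6, 7, 9, 3]

# Continuous processing: one window slides over the whole stream.
stream_hits = list(rising_runs(traffic))

# Batch processing: each batch of 3 values is analyzed on its own,
# so a rising run that straddles a batch boundary is never seen whole.
batches = [traffic[i:i + 3] for i in range(0, len(traffic), 3)]
batch_hits = [hit for batch in batches for hit in rising_runs(batch)]

print(stream_hits)  # [3, 4]
print(batch_hits)   # []
```

In this sample the rising runs 4, 6, 7 and 6, 7, 9 both cross the boundary between the two batches, which is why the per-batch analysis finds nothing. A real batch system could work around this by carrying state between batches, at the cost of extra complexity.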
The Components of a Traditional Streaming Architecture
1. The Message Broker
This is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can then listen in and consume the messages passed on by the broker.
The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ, relied on the Message-Oriented Middleware (MOM) paradigm. Later, hyper-performant messaging platforms emerged which are more suitable for a streaming paradigm. Two popular streaming brokers are Apache Kafka and Amazon Kinesis Data Streams.
Unlike the older MOM brokers, streaming brokers support very high performance with persistence, have a massive capacity of a gigabyte per second or more of message traffic, and are tightly focused on streaming, with no support for data transformations or task scheduling. You can learn more about message brokers in our article on analyzing Apache Kafka data.
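The broker's role can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the `InMemoryBroker` class, its `publish` and `read` methods, and the `clicks` topic are hypothetical names, not the API of Kafka, Kinesis, or any other real broker. It assumes JSON as the standard message format and an append-only log per topic from which each consumer reads at its own offset.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


class InMemoryBroker:
    """Toy broker: one append-only log per topic; consumers track their own offsets."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[str]] = defaultdict(list)

    def publish(self, topic: str, payload: Dict[str, Any]) -> int:
        """Serialize the payload to JSON, append it to the topic log, return its offset."""
        self._logs[topic].append(json.dumps(payload, sort_keys=True))
        return len(self._logs[topic]) - 1

    def read(self, topic: str, offset: int, max_messages: int = 10) -> List[Dict[str, Any]]:
        """Return up to `max_messages` messages starting at `offset`.

        Reading does not remove messages, so several consumers can
        process the same topic independently.
        """
        log = self._logs.get(topic, [])
        return [json.loads(message) for message in log[offset:offset + max_messages]]


broker = InMemoryBroker()
broker.publish("clicks", {"user": "u1", "page": "/home"})
broker.publish("clicks", {"user": "u2", "page": "/cart"})

# Two independent consumers read the same topic from different positions.
analytics_view = broker.read("clicks", offset=0)
alerting_view = broker.read("clicks", offset=1)
```

Keeping messages in the log after they are read is the design choice that lets a broker of this style serve many consumers at once. A production broker adds durable storage, partitioning, and replication, none of which is modeled here.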
2. Stream Processor / Streaming Data Aggregator
The stream processor collects data streams from one or more message brokers. It receives queries from users, fetches events from message queues, and applies each query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
A few examples of stream processors are Apache Storm, Spark Streaming and WSO2 Stream Processor. While stream processors work in different ways, they are all capable of listening to message streams, processing the data and saving it to storage. Some stream processors, including Spark Streaming and WSO2 Stream Processor, provide a SQL syntax for querying and manipulating the data.
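As a rough illustration of what a stream processor does, the Python sketch below applies a fixed query ("count page views per page in each 60-second window") to a stream of events and emits the results as a new stream. The event shape, the `count_per_window` function, and the window size are assumptions for this example, and the sketch assumes events arrive in timestamp order. Real stream processors such as Storm or Spark Streaming also handle late data, distribution, and fault tolerance, which are left out here.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, page): a hypothetical event shape


def count_per_window(
    events: Iterable[Event], window_seconds: int = 60
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Emit (window_start, counts) each time a tumbling window closes.

    Assumes events arrive ordered by timestamp.
    """
    current_start: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, page in events:
        window_start = timestamp - timestamp % window_seconds
        if current_start is not None and window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = window_start
        counts[page] += 1
    if current_start is not None:
        # Flush the last open window once the input ends.
        yield current_start, dict(counts)


events = [(5, "/home"), (20, "/cart"), (59, "/home"), (61, "/home"), (130, "/cart")]
results = list(count_per_window(events))
# results == [(0, {"/home": 2, "/cart": 1}), (60, {"/home": 1}), (120, {"/cart": 1})]
```

Because `count_per_window` is a generator, each result is produced as soon as its window closes rather than after the whole input has been read, which mirrors how a stream processor turns one stream into another.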
3. Data Analytics Engine
After streaming data is prepared for consumption by the stream processor, it must be analyzed to provide value. There are many different approaches to streaming data analytics. Here are some of the tools most commonly used for streaming data analytics.
4. Streaming Data Storage
With the advent of low-cost storage technologies, most organizations today are storing their streaming event data. Here are several options for storing streaming data, and their pros and cons. A data lake is the most flexible and inexpensive option for storing event data, but it has several limitations for streaming data applications.
Upsolver provides a data lake platform that ingests streaming data into a data lake, creates schema-on-read, and extracts metadata. This allows data consumers to easily prepare data for analytics tools and real-time analytics.
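The general idea of schema-on-read can be shown with a short Python sketch: raw events are stored as-is, and the schema is derived when the data is read. This is a generic illustration, not a description of how Upsolver implements it; the `infer_schema` function and the sample events are hypothetical.

```python
import json
from typing import Dict, Iterable, Set


def infer_schema(raw_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each top-level field name to the set of Python type names observed for it."""
    schema: Dict[str, Set[str]] = {}
    for line in raw_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Raw JSON events as they might sit in a data lake, with no schema declared up front.
raw_events = [
    '{"user": "u1", "amount": 10}',
    '{"user": "u2", "amount": 12.5, "coupon": "SPRING"}',
]

schema = infer_schema(raw_events)
# schema == {"user": {"str"}, "amount": {"int", "float"}, "coupon": {"str"}}
```

Note that the second event introduces a new field and a different numeric type without any change to how the data was written. Handling that kind of drift at read time is the trade-off schema-on-read makes in exchange for flexible ingestion.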