Using data to make business decisions is nothing new. Once, “data-based decision-making” might have meant noticing a correlation between a print ad campaign and anecdotal accounts of higher-than-usual sales. Businesses used whatever data they could get, when they could get it.
Two factors make the current landscape different from past evolutions. The first is an exponential increase in the volume and diversity of data being generated by billions of users and devices. The second is a demand for immediate access to high-quality data and insights. Each has brought new urgency to how companies manage data. In addition, the cost and performance of many cloud capabilities have reached a tipping point, helping make machine learning (ML) and artificial intelligence (AI) accessible to every business.
Despite widespread recognition of the value of data, few companies have implemented modern data strategies.1 Building on original research and Google’s own contributions in the cloud, this guide is designed to help IT and business leaders implement modern, cloud-based strategies for data management. In each section, we highlight technologies helping companies turn a vast, complex data landscape into useful business insights.
Google Cloud’s Guide to Data Analytics & Machine Learning draws upon Google’s twenty years of tackling some of the industry’s toughest data problems. Along the way, we’ve contributed original research that has helped to shape the Big Data landscape: from two research papers in late 2003 and 2004, which together spawned the Hadoop movement; to the Dremel paper, which forms the basis for the cloud data warehouse capability you’ll read about in this guide.
We designed, built, and deployed Spanner, the first system to distribute data at global scale and support externally consistent distributed transactions—and, in 2017, made it generally available to our customers.3 More recently, Google Brain has helped fuel the industry’s renewed interest in AI, culminating in the open-source release of our TensorFlow project.4 With this Guide, we look forward to sharing our experience with leaders looking for ways to unlock the promise of machine learning and AI for their organizations.
THE NEW DATA LANDSCAPE
Managing data would be easier if growth were limited to a few sources, or if data were uniform. The challenge lies in the diversity of sources and formats. This includes the growing volume of unstructured data: emails, system logs, web pages, customer transcripts, documents, slides, informal chats, and an exploding volume of rich media like HD images and video. Enormous volumes of information are available instantaneously from any device connected to the Internet, driving new expectations around the availability and immediacy of data.
IT teams are stuck in the middle. They must find ways to deliver a real-time view of the business while also managing a larger and more complex data landscape. As with many software initiatives, reducing complexity is an important determinant of success.
SERVERLESS: THE PATH TO IT PRODUCTIVITY
Modern serverless architectures are the culmination of a series of efforts to shrink the surface area of responsibility that developers and IT teams must manage. Fundamentally, the goal of serverless computing is to eliminate commodified work—managing server clusters, sharding databases, load balancing, capacity planning, ensuring availability—so IT teams can focus on what matters to the business. Serverless draws a sharp distinction between commodified IT—the mundane maintenance work that looks roughly the same at every company—and differentiated work that elevates IT to a direct provider of business value.
CLOUD STORAGE & DATA WAREHOUSING
Centralizing raw data from key business processes into cloud storage is one of the first steps organizations can take to modernize. In doing so, they position themselves to tap analytics capabilities in the cloud.
Data silos scattered across the enterprise continue to vex business and IT teams alike, with new silos (whether for organizational or technical reasons, or both) created daily.6 Harvard Business Review has written about the need for a single source of truth for data, as well as distinct lenses through which different lines of business can view that data.
Capture Raw Data for Future Analysis
IDC estimates that less than 1% of all files get analyzed.8 The other 99%—depending on the timing of business needs—contains insights material to decision-making. Since organizations cannot predict the business questions that will arise, they need frictionless ways to store large volumes of data cheaply and flexibly. This is especially true for unstructured files, which make up the majority of data generated.
Besides saving money, cloud storage serves as the basis for powerful analytics. Businesses can capture structured and unstructured files seamlessly in their native formats. Because storage is intentionally separated from processing and analysis, teams can defer structuring raw data for analytics until business questions arise. Crucially, raw data from the same foundation can be restructured easily to answer new questions on the fly. What sets cloud storage apart is how efficiently these data-capture and repurposing steps can happen. To position an organization to benefit from analytics, teams need to ensure that raw data from their business processes is captured and centralized.
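This schema-on-read pattern can be sketched in a few lines of Python. The example below is illustrative only (the field names and records are hypothetical): raw events are captured as-is, and a schema is applied later, at query time, so the same raw data can serve different business questions.

```python
import json
from datetime import datetime

# Raw events are captured in their native format, with no schema
# enforced at write time.
raw_events = [
    '{"ts": "2017-03-01T09:15:00", "store": "north", "amount": "12.50"}',
    '{"ts": "2017-03-01T09:20:00", "store": "south", "amount": "7.25", "coupon": "SPRING"}',
]

def apply_schema(raw_lines, fields):
    """Project raw JSON records onto a schema chosen at query time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, cast in fields.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# One business question needs timestamps and amounts...
sales_view = apply_schema(raw_events, {"ts": datetime.fromisoformat, "amount": float})

# ...while a later question reuses the same raw data with a different schema.
coupon_view = apply_schema(raw_events, {"store": str, "coupon": str})
```

Because the raw lines are never mutated, restructuring for a new question is just another projection over the same stored data.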
Data from cloud storage can be imported into a cloud data warehouse for analytics. At this stage, a schema can be formalized based on the business questions that need answering, bringing structure to raw data for analysis. Once business goals are determined, companies must identify sources of input data across silos to import into a cloud data warehouse. Typical input sources include:
Analytic and transactional databases
Data stored in analytic and transactional databases can be batch-loaded or streamed row by row directly into a cloud data warehouse.
Data stored within cloud services
Data stored with popular SaaS providers can be imported into a cloud data warehouse—in many cases automatically.
Web, mobile, and IoT applications
Data from web, mobile, and IoT applications can bypass cloud storage and be streamed directly into a cloud data warehouse.
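The two ingestion paths above—bulk loads from databases and row-by-row streaming from applications—can be sketched with a toy in-memory table. This is an illustrative stand-in, not a real warehouse API; the class and method names are hypothetical.

```python
class WarehouseTable:
    """Toy stand-in for a cloud data warehouse table (illustrative only)."""

    def __init__(self, schema):
        self.schema = schema  # list of expected column names
        self.rows = []

    def batch_load(self, records):
        """Bulk ingest, e.g. a periodic export from a transactional database."""
        self.rows.extend(self._validate(r) for r in records)

    def stream_insert(self, record):
        """Row-by-row ingest, e.g. events arriving live from an application."""
        self.rows.append(self._validate(record))

    def _validate(self, record):
        if set(record) != set(self.schema):
            raise ValueError(f"record does not match schema: {record}")
        return record

orders = WarehouseTable(schema=["order_id", "amount"])
orders.batch_load([{"order_id": 1, "amount": 9.99},
                   {"order_id": 2, "amount": 4.50}])   # batch path
orders.stream_insert({"order_id": 3, "amount": 15.00})  # streaming path
```

Either path lands rows in the same queryable table, which is what makes mixing batch and streaming sources practical.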
With role-based access, any individual or application developer can query data stored in a cloud data warehouse, generate reports, or access visualizations. Cloud data warehousing supports individualized, need-to-know access management. Tailored access controls and complete auditability help democratize data science, while still maintaining security safeguards. Indeed, over one-half of firms across the U.S., Europe, and Asia-Pacific report they either are implementing, have implemented, or are expanding their use of self-service business intelligence tools across the enterprise.
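The need-to-know access model described above amounts to checking a requested action against a role's grants before running a query. A minimal sketch, with hypothetical role and table names:

```python
# Hypothetical grants: role -> table -> permitted actions.
ROLE_GRANTS = {
    "analyst": {"sales.orders": {"query"}},
    "data_engineer": {"sales.orders": {"query", "load"}},
}

def is_allowed(role, table, action):
    """Return True if the role's grants permit the action on the table."""
    return action in ROLE_GRANTS.get(role, {}).get(table, set())

# Analysts can query but not load; unknown roles get nothing.
can_query = is_allowed("analyst", "sales.orders", "query")
can_load = is_allowed("analyst", "sales.orders", "load")
```

Real cloud warehouses layer auditing on top of checks like this, so every access decision is also recorded.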
REAL-TIME DATA INTEGRATION
Data scientists report spending 50 to 80% of their time mired in the “data wrangling, data munging, and data janitor work” required to prepare data for analysis.18 The need to provision resources and scale server clusters up and down against unpredictable workloads continues to plague teams doing data preparation on-prem.
Fully managed cloud services help insulate IT from the infrastructure work involved in doing large-scale data prep and data integration. Consider a smart thermostat seeking to learn and adjust to the preferences of different teams in an office building. While the thermostat is in use, the cloud ingests raw usage data, such as temperature settings and energy consumption levels throughout the day. As data comes in, a processing pipeline can be spun up on demand to prepare the raw data: ensuring inputs fall within a valid range, converting temperature and energy use into the desired units, formatting time data. The data pipeline formally structures this data, then loads the transformed results into a cloud data warehouse. Queries, visualizations, and reports are available instantly.
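The transform step in the thermostat pipeline—range validation, unit conversion, and time formatting—can be sketched as a single function. The field names and valid range below are assumptions for illustration, not part of any real device's schema.

```python
from datetime import datetime, timezone

VALID_F_RANGE = (40.0, 95.0)  # assumed plausible indoor range, in Fahrenheit

def transform(raw):
    """Prepare one raw thermostat reading for loading into the warehouse."""
    temp_f = float(raw["temp_f"])
    lo, hi = VALID_F_RANGE
    if not lo <= temp_f <= hi:
        return None  # drop readings outside the valid range
    return {
        "temp_c": round((temp_f - 32) * 5 / 9, 2),  # convert to desired units
        "kwh": float(raw["energy_wh"]) / 1000,       # Wh -> kWh
        "recorded_at": datetime.fromtimestamp(
            raw["ts"], tz=timezone.utc).isoformat(),  # format time data
    }

raw_readings = [
    {"temp_f": 72.0, "energy_wh": 350, "ts": 1500000000},
    {"temp_f": 250.0, "energy_wh": 10, "ts": 1500000060},  # sensor glitch
]
cleaned = [row for row in (transform(r) for r in raw_readings) if row is not None]
```

In a managed pipeline, a function like this runs inside workers that the cloud provisions and scales on demand, so the team writes only the transform logic.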
Toward Real-Time Data Analytics
With real-time streaming data analytics, data streams directly into processing pipelines. The transformed data can then be integrated into a cloud data warehouse—allowing for queries, visualization, and reporting within seconds. In this way, the processing pipeline serves as a kind of middleware that can be spun up on demand, able to join data streaming in real time with batch data pulled in from storage. Data can be structured flexibly to answer an organization’s business questions as they arise.
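A common form of the stream-plus-batch join described above is enriching each streaming event with reference data pulled from storage. A minimal sketch in plain Python (device IDs and fields are hypothetical):

```python
# Batch reference data pulled from storage: device_id -> location.
device_locations = {"dev-1": "conference-room", "dev-2": "lobby"}

def enrich(stream):
    """Join each streaming event with batch reference data as it arrives."""
    for event in stream:
        location = device_locations.get(event["device_id"], "unknown")
        yield {**event, "location": location}

events = [
    {"device_id": "dev-1", "temp_c": 21.5},
    {"device_id": "dev-9", "temp_c": 19.0},  # no batch record for this device
]
enriched = list(enrich(iter(events)))
```

Because `enrich` is a generator, events flow through one at a time—the same shape a real streaming pipeline takes, just without the distributed machinery.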
Recent breakthroughs in machine learning (ML) and artificial intelligence (AI) frequently make headlines. Computers have bested human world champions in Go, a board game with more positions than there are atoms in the universe.22 They’ve mastered popular video games and, critically, learned to recognize cats.23 More recently, an AI effort achieved massive savings in energy costs, highlighting machine learning as “a general-purpose framework to understand complex dynamics.”24 This framework is starting to find diverse applications—and deliver results—across many industries.
The AI opportunity extends beyond simply automating once-manual tasks. In online retail, for example, machine-learning algorithms can ingest and analyze enormous volumes of consumer data as potential buyers navigate through a retailer’s online store or mobile app. The more data the model ingests, the closer it comes to understanding exactly when—and why—a particular buyer will decide to make a particular purchase. Eventually, this learning becomes predictive, enabling the retailer to surface the right product for the right person at the right time. This level of personalization—once typified by the small-town shopkeeper who knew the names and birthdays of her customers’ children—is now possible at scale.
ML: THE NEW PROVING GROUND FOR COMPETITIVE ADVANTAGE
The age of machine learning has finally arrived—and it’s already in full swing within smaller, tech-forward companies, according to a new survey of business and technology leaders by MIT Technology Review Custom. Some key findings:
In an age of abundant data and immediate answers, the ability to extract value from data—regardless of source, size, and requirements around timeliness—will be at the heart of an organization’s competitive advantage.