Cast a line, Catch a Byte: Fishing for Knowledge in the Data Lake

In the vast oceans of digital data, where waves of ones and zeros crash onto the shores of our technological era, there lies a unique and expansive body of water – the Data Lake. Just as seasoned fishermen know that the richest catches aren’t always visible on the surface, data enthusiasts recognize that beneath the placid surface of these lakes lie untapped reservoirs of information. So, grab your digital fishing rod and prepare your data nets, because we’re about to embark on an expedition into the depths, casting lines to catch bytes and fishing for the invaluable knowledge hidden within Data Lakes.

Alright, enough with the fishing puns. You may be thinking, “So what the %&@# is a data lake and why do I need one?” Put simply, data lakes at their core are centralized storage repositories that house historical data that businesses can query for analytics and business intelligence purposes. Sound like a data warehouse? That’s because it is, and a whole lot more. While traditional data warehouses store only structured data (data in a tabular, row/column format with a predefined schema), data lakes handle both structured and unstructured data. This means traditional database tables, CSV files, videos, data from IoT devices, social media comments, text files, emails, text messages, call logs, sensor data… the list goes on. Data of any type, including data that doesn’t fit neatly into a tabular format, can be stored in a data lake.

With the exponential growth of data generation in recent years, the importance of having an efficient, scalable, and flexible data storage system cannot be overstated. Traditional databases, while effective for specific structured data tasks, are often ill-suited to handle the vast and varied streams of real-time data in today’s enterprise. Data lakes not only store vast volumes of raw data in its native format but also support powerful tools and platforms for advanced analytics, machine learning, and artificial intelligence. By harnessing the potential of data lakes, organizations can gain deeper insights, drive innovation, and streamline decision-making processes.

On the shores of these vast bodies of unstructured data sits the “data lake house”. A relatively new term in the big data ecosystem, the data lake house aims to blend the best features of data warehouses and data lakes: the low-cost storage and support for disparate data types of a data lake, with the performance, reliability, and mature BI ecosystem typically found in a data warehouse. A data lake house typically consists of a distributed query engine, BI and machine learning integration, and data streaming/data ingestion. The result? A single, unified data lake management platform that allows an organization to analyze any number of disparate data sources, sizes, and types, at virtually any scale.

Let’s dive into a real-world scenario. EComPro (ECP), a high-volume e-commerce company, services hundreds of thousands of online orders per day. Their main transaction processing system is backed by a finely tuned, high-performance OLTP database on MySQL. MySQL has change data capture configured, which streams data changes in real time to EComPro’s data lake in Amazon’s Simple Storage Service (S3). A rapidly expanding organization, EComPro runs distribution centers all over the country. These distribution centers are equipped with all types of network-enabled sensors: temperature sensors, weighing stations, x-ray machines, even radiation dosimeters. All of these sensors output daily logs as either CSV or text files, which get uploaded to S3.
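To make the sensor side of this concrete, here is a minimal Python sketch of how a raw daily temperature log might be interpreted at read time, an approach often called schema-on-read. The file layout, the field names, the sample values, and the `parse_temperature_log` helper are all hypothetical, since EComPro's actual log format isn't specified here. The point is only that the file sits in the lake untouched, and structure is applied when someone reads it.

```python
import csv
import io
from datetime import datetime
from typing import NamedTuple


class TemperatureReading(NamedTuple):
    warehouse: str
    recorded_at: datetime
    temp_f: float


def parse_temperature_log(raw_text: str) -> list[TemperatureReading]:
    """Parse a raw CSV sensor log into typed readings, skipping malformed rows."""
    readings = []
    for row in csv.reader(io.StringIO(raw_text)):
        if len(row) != 3:
            continue  # wrong number of fields
        warehouse, timestamp, temp = (field.strip() for field in row)
        try:
            reading = TemperatureReading(
                warehouse, datetime.fromisoformat(timestamp), float(temp)
            )
        except ValueError:
            continue  # unparseable timestamp or temperature (this also skips the header row)
        readings.append(reading)
    return readings


# Hypothetical log contents, as they might sit untouched in object storage.
SAMPLE_LOG = """warehouse,recorded_at,temp_f
TX-01,2023-08-14T11:00:00,88.5
TX-01,2023-08-14T12:00:00,101.2
TX-01,garbled-line
NC-02,2023-08-14T12:00:00,not-a-number
NC-02,2023-08-14T13:00:00,103.0
"""

readings = parse_temperature_log(SAMPLE_LOG)
for reading in readings:
    print(reading)
```

Skipping bad rows rather than failing is a design choice for this sketch; a production pipeline would more likely count or quarantine them so data quality problems stay visible.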

Over the last month (August), ECP has received numerous return requests for a perishable food item, the Japanese brand Kewpie Mayonnaise. All return requests have the same complaint: the product is separated and appears spoiled upon delivery. However, inbound shipments of the product are confirmed fresh, and all expiration dates are well in the future. With return requests increasing for this product, ECP turns to its data analysts to identify any possible trends using their vast data lake.

First, analysts query the data lake for all orders involving Kewpie mayonnaise in the last month for which return requests were submitted. Using this first query, they are able to trace all problem orders to two distribution centers, one in Texas and one in North Carolina, and obtain the dates and times these orders were present in those distribution centers’ warehouses. Using the same query engine, analysts then query the temperature sensor data for these two warehouses (stored in raw text format) from the data lake, using these dates and times as a filter, and perform a JOIN against the order data. This outputs a table of problem orders for this product, the dates and times the orders were present in the warehouses of these two distribution centers, and the temperatures of the warehouses for those time periods. Finally, they plot this data using a graphing/charting tool to visualize the trends.
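The join described above can be sketched in a few lines. In a real lake house, a distributed query engine would run this over files in S3; here Python's built-in `sqlite3` module stands in purely so the example runs locally. The table names, column names, and sample rows are assumptions made for illustration, not EComPro's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE returned_orders (
        order_id TEXT, warehouse TEXT, arrived_at TEXT, departed_at TEXT
    );
    CREATE TABLE temperature_readings (
        warehouse TEXT, recorded_at TEXT, temp_f REAL
    );
""")

# Hypothetical result of the first query: returned orders and their warehouse stays.
conn.executemany(
    "INSERT INTO returned_orders VALUES (?, ?, ?, ?)",
    [
        ("A100", "TX-01", "2023-08-14T10:30:00", "2023-08-14T14:15:00"),
        ("A101", "NC-02", "2023-08-15T11:00:00", "2023-08-15T13:30:00"),
    ],
)

# Hypothetical sensor readings, already parsed from the raw text logs.
conn.executemany(
    "INSERT INTO temperature_readings VALUES (?, ?, ?)",
    [
        ("TX-01", "2023-08-14T09:00:00", 84.0),
        ("TX-01", "2023-08-14T12:00:00", 101.5),
        ("TX-01", "2023-08-14T13:00:00", 102.3),
        ("NC-02", "2023-08-15T12:00:00", 100.8),
        ("NC-02", "2023-08-15T16:00:00", 86.1),
    ],
)

# Match each order to the readings taken in its warehouse while it was there.
# ISO-8601 timestamps stored as text compare correctly as plain strings.
JOIN_QUERY = """
    SELECT o.order_id, o.warehouse, t.recorded_at, t.temp_f
    FROM returned_orders AS o
    JOIN temperature_readings AS t
      ON t.warehouse = o.warehouse
     AND t.recorded_at BETWEEN o.arrived_at AND o.departed_at
    ORDER BY o.order_id, t.recorded_at
"""
rows = conn.execute(JOIN_QUERY).fetchall()
for row in rows:
    print(row)
```

Joining on both the warehouse and the time window is what turns two unrelated datasets into one table that can be charted: each output row pairs a problem order with a temperature it was actually exposed to.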

So what did the analysts discover? All of these orders were present in these two warehouses during midday hours, 11am-2pm in their respective time zones, and temperatures in the warehouses spiked to over 100 °F for at least one hour during those times. This pointed to heat exposure as a likely cause of the spoiled product being shipped from these facilities. The finding prompted warehouse maintenance staff to check the HVAC systems, and they confirmed that Freon was low and units were regularly entering defrost cycles to compensate. The HVAC systems were repaired, which stopped the spikes and ultimately eliminated the repeated return requests for this product.
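The pattern the analysts spotted on their chart, temperatures above 100 °F sustained for at least an hour, can also be checked programmatically. The following is a rough sketch under stated assumptions: the `find_sustained_spikes` helper, the 30-minute sampling interval, and the sample temperatures are all hypothetical, and the duration is measured from the first to the last consecutive reading above the threshold.

```python
from datetime import datetime, timedelta


def find_sustained_spikes(readings, threshold_f=100.0, min_duration=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive readings stay above threshold_f
    for at least min_duration. `readings` is a time-sorted list of (datetime, temp_f)."""
    spikes = []
    run_start = run_end = None
    for recorded_at, temp_f in readings:
        if temp_f > threshold_f:
            if run_start is None:
                run_start = recorded_at  # a new hot run begins
            run_end = recorded_at
        else:
            if run_start is not None and run_end - run_start >= min_duration:
                spikes.append((run_start, run_end))
            run_start = run_end = None
    # Close out a run that lasts until the final reading.
    if run_start is not None and run_end - run_start >= min_duration:
        spikes.append((run_start, run_end))
    return spikes


# Hypothetical readings every 30 minutes, starting at 10:00.
base = datetime(2023, 8, 14, 10, 0)
temps = [90.0, 95.0, 99.0, 101.0, 104.0, 103.0, 100.5, 98.0, 101.0, 97.0]
readings = [(base + timedelta(minutes=30 * i), t) for i, t in enumerate(temps)]

spikes = find_sustained_spikes(readings)
for start, end in spikes:
    print(f"Above threshold from {start:%H:%M} to {end:%H:%M}")
```

With this sample data, the run from 11:30 to 13:00 qualifies, while the single hot reading at 14:00 is ignored because it does not last an hour. A check like this could run on a schedule, so a failing HVAC unit gets flagged before customers start sending product back.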

In the realm of digital evolution and data lake exploration, the vastness of data can sometimes seem as incomprehensible and mysterious as the deepest corners of our oceans. Yet, as EComPro’s case underscores, there’s a tangible, practical, and powerful reality awaiting those willing to chart these waters. The marriage of vast storage capabilities with refined analytical tools, embodied in the data lake house, offers businesses a powerful way to understand, adapt, and thrive. The modern-day challenge isn’t just collecting the data, but interpreting it in meaningful, actionable ways. EComPro’s effective response to its spoiled-product crisis is a testament to the capabilities and potential of data lakes and lake houses. In our digital age, this is the new frontier of problem-solving. And for those equipped with the right tools, the depth of data is no longer an intimidating abyss but a treasure trove of insights waiting to be uncovered. Just as explorers once navigated the uncharted waters of our world, today’s data pioneers are charting the new territories of the digital landscape, not just surviving but thriving in an ever-evolving market.