What is a data lake?
A data lake is a storage platform designed to hold, process, and analyse structured and unstructured data. It is a system or repository of data stored in its natural/raw format, usually as object blobs or files. Compared to data silos (where data is stored in closed containers and cannot interact with other systems), a data lake gives immediate access to the entire body of data.
A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as Reporting, Visualization, Advanced Analytics, and Machine Learning.
How do we implement a data lake?
A data lake can be established “on premise” (within an organization's data centers) or “in the cloud” (using cloud services from vendors such as Amazon, Google and Microsoft).
Many companies use a cloud storage service such as Google Cloud Storage or Amazon S3, or a distributed file system such as Apache Hadoop HDFS.
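As an illustration, a minimal Python sketch of landing a raw source file in an S3-based lake might look like the following. The bucket name and prefix layout are hypothetical examples, and the boto3 AWS SDK is assumed to be installed and configured with credentials.

```python
# Minimal sketch: landing a raw source extract in an S3-based data lake.
# The bucket name and prefix layout below are hypothetical examples.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical bucket name

def land_raw_file(local_path: str, source_system: str, table: str, load_date: str) -> str:
    """Upload a raw extract into a date-partitioned 'raw' zone of the lake."""
    filename = os.path.basename(local_path)
    key = f"raw/{source_system}/{table}/load_date={load_date}/{filename}"
    s3.upload_file(local_path, BUCKET, key)
    return key

# Example usage (hypothetical file and source system):
# land_raw_file("orders_2024-01-15.csv", "erp", "orders", "2024-01-15")
```

Keeping raw data under a consistent, date-partitioned prefix layout like this makes it easier for downstream tools to locate and reprocess source extracts later.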
Earlier data lakes (Hadoop 1.0) had limited capabilities: batch-oriented processing (MapReduce) was the only processing paradigm associated with them. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools such as Apache Pig and Apache Hive, which were themselves batch-oriented.
Typically, the IT organization takes the lead on vetting potential technology options and approaches to building data lakes, with little input from the business units.
What type of data could we use?
A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
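To make this concrete, here is a minimal sketch of how objects of very different shapes can sit side by side in the same lake; the bucket name, keys, and sample payloads are hypothetical examples.

```python
# Minimal sketch: structured, semi-structured, and binary objects stored
# side by side in one S3 bucket. Bucket, keys, and payloads are hypothetical.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical bucket name

# Structured: a CSV extract from a relational table.
s3.put_object(
    Bucket=BUCKET,
    Key="structured/customers/customers.csv",
    Body=b"id,name\n1,Alice\n2,Bob\n",
)

# Semi-structured: a JSON event record.
event = {"event": "page_view", "user_id": 1, "ts": "2024-01-15T10:00:00Z"}
s3.put_object(
    Bucket=BUCKET,
    Key="semi_structured/events/event-0001.json",
    Body=json.dumps(event).encode("utf-8"),
)

# Unstructured / binary: a PDF document uploaded as-is.
# s3.upload_file("invoice.pdf", BUCKET, "unstructured/documents/invoice.pdf")
```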
Examples of data lake technologies: Amazon Web Services S3
Amazon S3 is the core service at the heart of the modern data architecture on AWS. It provides virtually unlimited, durable, elastic, and cost-effective storage for data and data lakes.
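As a rough illustration of how such a bucket might be provisioned, the sketch below enables versioning and adds a lifecycle rule that moves cold raw data to a cheaper storage class. The bucket name, region, prefix, and 90-day threshold are all hypothetical choices, not prescribed settings.

```python
# Minimal sketch: creating a data lake bucket with versioning and a lifecycle
# rule that archives cold raw data. Names and values are hypothetical.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "example-data-lake-bucket"

s3.create_bucket(Bucket=BUCKET)  # us-east-1 needs no LocationConstraint

# Versioning protects against accidental overwrites and deletes.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Transition objects under raw/ to Glacier after 90 days to control cost.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```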
A data lake on S3 can be used for Reporting, Analytics, Artificial Intelligence (AI), and Machine Learning (ML), as it can be shared across the entire AWS Big Data Ecosystem.
Once the data is available on S3, various AWS services can be used to consume the data for Reporting, Analytics, ML, or Ad-hoc queries.
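One simple consumption pattern is to read an object from S3 straight into pandas for ad-hoc analysis. The sketch below assumes a hypothetical bucket and key, with boto3 and pandas installed.

```python
# Minimal sketch: pulling a lake object into pandas for ad-hoc analysis.
# Bucket and key are hypothetical examples.
import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="example-data-lake-bucket",
    Key="structured/customers/customers.csv",
)
df = pd.read_csv(obj["Body"])  # the response body is file-like, so pandas can read it
print(df.head())
```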
Amazon Athena lets you run ad-hoc SQL queries directly against S3 data, or feed visualization tools for dashboards. Amazon Redshift Spectrum lets you query S3 data as external tables from Amazon Redshift, the AWS data warehousing service. The infrastructure costs for building an Amazon S3 data lake are modest, and using a consulting service can shave months off the implementation. It can also cut down on ongoing maintenance, lowering the overall total cost of ownership (TCO) of the solution.
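For ad-hoc SQL, a minimal boto3 sketch of submitting an Athena query and fetching its results might look like the following; the database name, table, query, and results location are hypothetical.

```python
# Minimal sketch: running an ad-hoc SQL query over S3 data with Athena.
# Database, table, query, and output location are hypothetical examples.
import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) AS views FROM events GROUP BY user_id",
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```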
Note that the time to build does not include the time needed to obtain access, establish connectivity, troubleshoot network issues, or resolve any other infrastructure issues. It assumes that all access is available and that all prerequisites listed in the manual have been completed. It also assumes that no more than 100 GB of data is being ingested, since the initial refresh depends on network speeds.