In this paper, we examine what data lakes are, how they compare with data warehouses, and how they apply to financial services.
What are Data Lakes?
A data lake is a centralised repository that allows an organisation to store all its structured and unstructured data at any scale. Data can be stored 'as-is', without first having to structure it, and different types of analytics can then be run on that data, ranging from dashboards and visualisations to big-data processing, real-time analytics, and machine learning, to guide better decisions.
According to Aberdeen Research, organisations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These technical leaders were able to carry out new types of analytics, such as machine learning, over new data sources stored in the lake: log files, click-stream data, social media, and internet-connected devices. This helped them identify and act on opportunities for business growth faster, such as attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.
Data Lakes compared to Data Warehouses
Depending on its requirements, a typical organisation will need both a data warehouse and a data lake, as they serve different purposes. A data warehouse is a database structured and optimised to analyse relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimise for fast SQL queries, and the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the "single source of truth" that users can trust.

A data lake is different: it stores both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure, or schema, of the data is not fully defined when the data is captured, which means organisations can store all of their data without careful up-front design and without needing to know in advance what questions they might need to answer. Different types of analytics, such as SQL queries, big-data analytics, full-text search, real-time analytics, and machine learning, can then be used to uncover insights.
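The schema-on-write (warehouse) versus schema-on-read (lake) distinction can be illustrated with a minimal Python sketch. The table, record shapes, and field names below are invented for illustration; a real lake would sit on object storage rather than an in-memory list.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the table structure is fixed in advance,
# and every record must conform to it before it can be stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE payments (id INTEGER, amount REAL, currency TEXT)")
warehouse.execute("INSERT INTO payments VALUES (?, ?, ?)", (1, 120.50, "AUD"))

# Schema-on-read (lake style): records of any shape are stored as-is;
# structure is imposed only when a question is asked.
lake = [
    json.dumps({"id": 2, "amount": 99.0, "currency": "AUD"}),        # relational-style
    json.dumps({"device": "atm-17", "event": "cash_out", "ts": 1}),  # IoT event
    json.dumps({"user": "@cust", "text": "great service!"}),         # social post
]

# A query defines the schema it needs at read time and skips everything else.
amounts = [json.loads(r)["amount"] for r in lake if "amount" in json.loads(r)]
print(amounts)  # [99.0]
```

The warehouse rejects anything that does not fit its table; the lake accepts everything and defers the cost of structuring to query time.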
As organizations with data warehouses see the benefits of data lakes, they are evolving their data storage to include data lakes, thus enabling diverse query capabilities, data science use-cases, and advanced capabilities for discovering new information models.
How Data Lakes relate to financial services
For financial services organisations, managing and analysing customer data is essential. This data can help identify cost savings and operational efficiencies, ensure compliance, and drive innovation for the customers of authorised deposit-taking institutions (ADIs).
Harnessing data can improve internal business outcomes, such as more streamlined processes, accurate forecasting, and risk assessment. Corporate data can also impact external business outcomes, such as understanding market trends, delighting customers with personalized solutions, and uncovering new opportunities. In short, when used to full advantage, corporate data can directly impact the bottom line.
Financial services organisations also face increasing compliance and security pressure. ADIs must comply with regulatory requirements such as the General Data Protection Regulation (GDPR). Increasingly, banks must ensure that personal data is used only in ways for which customers have given explicit consent, and that data is stored securely while maintaining the access required to make critical decisions in real time. Through data lakes on cloud services, ADIs can integrate data across silos, visualise it, take steps toward protecting it, and plan for data governance. The four key benefits for financial institutions from using data lakes are:
1) Data Integrity
2) Adherence to regulations
3) Compliance planning for data volatility and scalability
4) Mitigating risk and fraud threats - useful when reporting to AUSTRAC or maintaining ongoing license obligations.
Further, ADIs can collect and store different types of data from multiple sources across the organisation, at any scale. They can catalogue, search, and find the relevant data in a central repository, and finally use it to gather deeper insights.
Additionally, from a compliance and governance perspective, data lakes help secure data against unauthorised access and make it easier to place policies and parameters around data, ensuring it complies with internal processes and regulatory requirements.
In the Australian market, Banks and Insurance companies are racing to flood new data lakes with millions of customer records in order to have a central pool of information that can be trawled for insight. At the same time, Australian ADIs must be aware that data lakes do not offer a silver bullet.
Because of the way financial-sector information systems have traditionally been structured, customer data remains widely dispersed, stored in multiple silos and systems. To obtain a single view of the customer, or to perform big-data analysis that identifies patterns in customer behaviour, organisations must first bring that data together. Data lakes are seen as the best, most flexible option currently available. The industry hope is that data lake processing frameworks and platforms, such as Apache Hadoop or Microsoft Azure, can then be used to trawl the lake for insights about an individual customer.
One example of fast data lake adoption is NAB, which is designing its data architecture to meet foreseeable higher standards of privacy in the domestic market. The bank is already subject to the General Data Protection Regulation (GDPR) because it processes data relating to European Economic Area (EEA) residents. Even though NAB has well-established models for credit risk, market risk, and liquidity risk, the bank decided to enhance its assurance and governance processes with AI and machine learning (ML). NAB has developed a series of key components that make up its data architecture, centred on the data lake, known as the NAB Data Hub (NDH).
NAB also implemented a data management layer within the NDH as a book of record for metadata, lineage, and data quality. Apache Kafka streams data into the hub, Apache Beam moves data between zones, and Amazon Web Services (AWS) storage holds the data, completing the data management infrastructure for now. NAB is evolving towards a multi-cloud approach in which Microsoft Azure sits alongside AWS as a comprehensive alternative platform, avoiding concentration risk.
One of the main pain points that drove NAB to implement such a comprehensive data infrastructure was the need to demonstrate the provenance and lineage of its data, that is, to prove transparently where data came from and how it had been transformed into its current state. Put another way: to avoid creating a data swamp.
Sources within NAB stated, “as we're building out our platform, we're actually landing data in our raw zone and tagging it with business and technical metadata, and then as we move it from raw to curated and curated to conformed, we are publishing up those lineage components so that we have an automated trail of the lineage of the data as it moves through the cloud”.
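The workflow described above, tagging data on landing and publishing lineage entries as it is promoted from raw to curated to conformed, can be sketched in a few lines of Python. The `Record` class, zone names, tags, and fingerprinting scheme are illustrative assumptions, not NAB's actual implementation.

```python
import hashlib

class Record:
    """A toy record that carries metadata tags and an automated lineage trail."""

    def __init__(self, payload, business_tag, technical_tag):
        self.payload = payload
        self.zone = "raw"
        # Business and technical metadata are attached when the data lands.
        self.tags = {"business": business_tag, "technical": technical_tag}
        # The lineage trail starts with the raw zone and a content fingerprint.
        self.lineage = [("raw", self._fingerprint())]

    def _fingerprint(self):
        return hashlib.sha256(repr(self.payload).encode()).hexdigest()[:12]

    def promote(self, zone, transform):
        """Apply a transformation and publish a lineage entry for the move."""
        self.payload = transform(self.payload)
        self.zone = zone
        self.lineage.append((zone, self._fingerprint()))

rec = Record({"acct": " 123 ", "bal": "42.0"},
             business_tag="retail-deposits", technical_tag="csv-batch-7")
# raw -> curated: basic cleansing (strip whitespace from every field).
rec.promote("curated", lambda p: {k: v.strip() for k, v in p.items()})
# curated -> conformed: map to the agreed enterprise model and types.
rec.promote("conformed",
            lambda p: {"account_id": p["acct"], "balance": float(p["bal"])})

print([zone for zone, _ in rec.lineage])  # ['raw', 'curated', 'conformed']
```

Each promotion leaves an auditable entry, so the trail of where the data came from and how it was transformed is built automatically as a side effect of moving it.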
For the Technologists
The main advantages of data lakes for financial institutions are:
1. Banks and ADIs can consolidate their data into one place: Data lakes, such as Azure Data Lake, bring together big data from disparate sources across cloud and on-premises environments into one central place. Stored data can be monitored and managed more easily, without having to go back and forth between multiple silos.
If there is a need to reduce the number of places data is stored or to consolidate the tools used for data analytics, data lakes are an ideal solution for data consolidation.
2. Data lakes are cost-effective: options are available for on-demand clusters or a pay-per-job model when data is processed. No hardware, licences, or service-specific support agreements are necessarily required. The data lake scales up or down with the bank's business needs, meaning banks and ADIs never pay for more than they use. Data lakes allow storage and compute to scale, delivering greater economic flexibility than traditional big-data solutions. Using a data lake can therefore minimise costs while maximising the return on data investment.
3. Data lakes are secure, compliant, and fully managed and supported by the service provider, backed by SLAs and a service desk. A data lake protects data assets and makes it easy to extend on-premises security and governance controls to the cloud.
4. Data lakes are supported by developer-friendly platforms. Traditionally, finding the right tools to design and tune big-data queries could be difficult; data lake platforms reduce this friction by integrating with familiar development environments.
5. For the more technically minded: data should always be encrypted, in transit using TLS/SSL, and at rest using service-managed or user-managed HSM-backed keys kept in a key vault. Capabilities such as single sign-on (SSO) and multi-factor authentication are typically built into the platform through additional directory services.
6. Data lakes provide deep integration with tools such as Jira, Confluence, Azure DevOps, Visual Studio, Eclipse, and IntelliJ, so that familiar tools are available to run, debug, and tune code. They also provide visualisations for engines such as U-SQL, Apache Spark, and Apache Hive.
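The in-transit half of the encryption guidance in point 5 can be sketched with Python's standard-library `ssl` module: a client-side TLS context that enforces certificate validation and hostname checking, and refuses legacy protocol versions. (At-rest encryption with HSM-backed keys is normally delegated to the platform's key-management service, such as a cloud key vault, and is not shown here.)

```python
import ssl

# Build a client TLS context with safe defaults: server certificates are
# verified against the system trust store and hostnames are checked.
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

# Refuse legacy protocol versions; TLS 1.2 is a common regulatory floor.
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True
```

A context like this would be passed to whichever client library connects to the lake's endpoints, so that data never travels unencrypted or to an unverified peer.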