What are Data Lakes?
A data lake is a centralized repository that allows an organisation to store all its structured and unstructured data at any scale. Data can be stored ‘as-is’, without first having to be structured, and different types of analytics can be run on the as-is data. These can range from dashboards and visualizations to big-data processing, real-time analytics, and machine learning to guide better decisions.
According to Aberdeen Research, organizations that implemented a data lake tend to outperform similar companies by 9% in organic revenue growth. These technical leaders were able to carry out new types of analytics like machine learning, over new data sources such as log files, data from click-streams, social media, and internet-connected devices stored in the data lake. This helped them to identify and act upon opportunities for business growth faster such as attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.
Data Lakes compared to Data Warehouses
Depending on its requirements, a typical organization will need both a data warehouse and a data lake, as they serve different purposes (use cases). A data warehouse is a database structured and optimized to analyse relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, and the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust. A data lake is different, because it stores both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not fully defined when data is captured. This means organisations can store all of their data without careful up-front design or the need to know in advance what questions might need answers in the future. Different types of analytics on corporate data – such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning – can be used to uncover insights.
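The schema-on-read difference can be illustrated with a short Python sketch (purely illustrative – the record shapes and field names are assumed, not drawn from any particular product): heterogeneous records are stored as-is, and a schema is applied only when a question is asked.

```python
import json

# Schema-on-read sketch: heterogeneous records are stored "as-is",
# with no structure imposed at ingestion time.
raw_lake = [
    json.dumps({"type": "transaction", "amount": 120.50, "currency": "AUD"}),
    json.dumps({"type": "clickstream", "page": "/home", "ms_on_page": 3400}),
    json.dumps({"type": "transaction", "amount": 75.00, "currency": "AUD"}),
]

def total_transactions(lake):
    """Apply a schema only at read time: pick out the records that
    answer today's question and aggregate them."""
    records = (json.loads(line) for line in lake)
    return sum(r["amount"] for r in records if r.get("type") == "transaction")

print(total_transactions(raw_lake))  # 195.5
```

A data warehouse would instead require all three records to fit a predefined relational schema before they could be loaded at all.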
As organizations with data warehouses see the benefits of data lakes, they are evolving their data storage to include data lakes, thus enabling diverse query capabilities, data science use-cases, and advanced capabilities for discovering new information models.
How Data Lakes relate to financial services
For financial services organizations, managing and analysing customer data is essential. This data can help identify cost savings and operational efficiencies, ensure compliance, and drive innovation for the customers of authorised deposit-taking institutions (ADIs).
Harnessing data can improve internal business outcomes, such as more streamlined processes, accurate forecasting, and risk assessment. Corporate data can also impact external business outcomes, such as understanding market trends, delighting customers with personalized solutions, and uncovering new opportunities. In short, when used to full advantage, corporate data can directly impact the bottom line.
Financial services organizations also face increasing compliance and security pressure. ADIs must comply with regulatory requirements, such as the General Data Protection Regulation. Increasingly, banks must ensure that personal data is used only in ways for which customers have given explicit consent, and that data is stored securely while maintaining the access required to make critical decisions in real time. Through data lakes on cloud services, ADIs can integrate data across silos, visualize it, take steps toward protecting it, and plan for data governance. The four key benefits for financial institutions from using data lakes are:
1) Data Integrity
2) Adherence to regulations
3) Compliance planning for data volatility and scalability
4) Mitigating risk and fraud threats - useful when reporting to AUSTRAC or maintaining ongoing license obligations.
Further, ADIs can collect and store different types of data from multiple sources across the organization, at any scale. They can catalogue, search, and find the relevant data in a central repository, and finally use it to gather deeper insights.
Additionally, from a compliance and governance perspective, data lakes help to secure data against unauthorized access and to place policies and parameters around it, ensuring it remains compliant with internal processes and regulatory requirements.
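One simple way such policies can work is tag-based access control, sketched below in Python (a hypothetical model – the policy names, tags, and roles are illustrative assumptions, not any specific product's mechanism): each record carries governance tags, and access is granted only if the requester's role is permitted by the policy for every tag.

```python
# Sketch of tag-based access control over data lake records.
# Policy names, tags, and roles are illustrative assumptions.
POLICIES = {
    "pii":       {"compliance_officer", "data_steward"},
    "financial": {"analyst", "compliance_officer"},
}

def can_access(role, record_tags):
    """A record is accessible only if the role is permitted
    by the policy attached to every tag on the record."""
    return all(role in POLICIES.get(tag, set()) for tag in record_tags)

record = {"id": "cust-42", "tags": ["pii", "financial"]}
print(can_access("compliance_officer", record["tags"]))  # True
print(can_access("analyst", record["tags"]))             # False
```

Centralising checks like this in the lake, rather than in each application, is what makes the access rules auditable.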
In the Australian market, Banks and Insurance companies are racing to flood new data lakes with millions of customer records in order to have a central pool of information that can be trawled for insight. At the same time, Australian ADIs must be aware that data lakes do not offer a silver bullet.
Because of the way financial sector information systems have traditionally been structured, customer data remains widely dispersed, stored in multiple silos and systems. In order to get to a single view of the customer, or to perform big data analysis to identify patterns in customer behaviour, organisations first have to bring that data together. Data lakes are seen as the best, most flexible option currently available. The industry hope is that emerging data lake platforms and analysis frameworks, such as Apache Hadoop or Microsoft Azure, can then be used to trawl the lake for insights about an individual customer.
One example of fast data lake adoption is NAB. NAB is designing its data architecture to meet foreseeable higher standards of privacy in the domestic market. The bank is already subject to the General Data Protection Regulation (GDPR), since it processes data relating to European Economic Area (EEA) residents. However, even though NAB has well-established models for credit risk, market risk, and liquidity risk, the bank decided to enhance its assurance and governance processes with AI and machine learning (ML). NAB has developed a series of key components that make up its data architecture, centred on the data lake – known as the NAB data hub, or NDH.
NAB also implemented a data management layer within the NDH as a book-of-record for metadata, lineage, and data quality. Kafka streams data into the data hub, Apache Beam moves data between the zones, and Amazon Web Services (AWS) storage holds the data, completing the data management infrastructure for now. NAB is evolving towards a multi-cloud approach in which Microsoft Azure sits side by side with AWS, offering a comprehensive alternative platform and avoiding concentration risk.
One of the main pain points that drove NAB to implement such a comprehensive data infrastructure was the need to demonstrate the provenance and lineage of the data, i.e. to prove where data came from and how it had been transformed into its current ‘state’ with transparency. Or to put it another way, to avoid creating a data swamp.
Sources within NAB stated, “as we're building out our platform, we're actually landing data in our raw zone and tagging it with business and technical metadata, and then as we move it from raw to curated and curated to conformed, we are publishing up those lineage components so that we have an automated trail of the lineage of the data as it moves through the cloud”.
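The zone-and-lineage pattern described above can be sketched in Python (a deliberately simplified, hypothetical model – the zone names, field names, and transformations are illustrative, not NAB's actual implementation): every move between zones applies a transformation and publishes a lineage event, so an automated trail accumulates as data flows through.

```python
# Sketch of zone-to-zone promotion with automated lineage tagging.
# Zones, fields, and transforms are illustrative, not NAB's design.
lineage_log = []  # the "book-of-record" for lineage events

def promote(record, source_zone, target_zone, transform):
    """Move a record between zones, apply a transformation, and
    publish a lineage event describing the step."""
    new_record = transform(dict(record))  # work on a copy
    lineage_log.append({
        "record_id": record["id"],
        "from": source_zone,
        "to": target_zone,
        "transform": transform.__name__,
    })
    return new_record

raw = {"id": "txn-001", "amt": "120.50", "ccy": "aud"}

def standardise(r):   # raw -> curated: clean up types and casing
    r["amt"] = float(r["amt"])
    r["ccy"] = r["ccy"].upper()
    return r

def conform(r):       # curated -> conformed: common naming convention
    return {"id": r["id"], "amount": r["amt"], "currency": r["ccy"]}

curated = promote(raw, "raw", "curated", standardise)
conformed = promote(curated, "curated", "conformed", conform)
# lineage_log now holds the automated trail: raw -> curated -> conformed
```

In a production system the lineage events would be published to a metadata catalogue rather than an in-memory list, but the principle – lineage captured as a side effect of every promotion – is the same.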
For the Technologists
The main advantages of data lakes to financial institutions are:
1. Banks and ADIs can consolidate their data into one place: Data lakes, such as Azure Data Lake, bring together big data from disparate sources across cloud and on-premises environments into one central place. Stored data can be monitored and managed more easily, without having to go back and forth between multiple silos.
If there is a need to reduce the number of places data is stored or to consolidate the tools used for data analytics, data lakes are an ideal solution for data consolidation.
2. Data lakes are cost-effective: options are available for on-demand clusters or a pay-per-job model when data is processed. No hardware, licences, or service-specific support agreements are necessarily required. The data lake scales up or down with the bank’s business needs, meaning that banks and ADIs never pay for more than they need. Data lakes allow storage and compute to scale independently, delivering greater economic flexibility than traditional big data solutions. Using a data lake can therefore minimise costs while maximising the return on data investment.
3. Data lakes are secure, compliant, and fully managed, supported by the service provider and backed by SLAs, support services, and a service desk. A data lake protects data assets and easily extends on-premises security and governance controls to the cloud.
4. Data lakes are supported by developer-friendly platforms: traditionally, finding the right tools to design and tune big data queries could be difficult, but data lake platforms bundle familiar development tooling with the storage and analytics services.
5. For the more technically minded: data should always be encrypted – in motion using TLS (commonly still called SSL), and at rest using service-managed or user-managed HSM-backed keys kept in a key vault. Capabilities such as single sign-on (SSO) and multi-factor authentication are typically built into the platform through additional directory services.
6. Data lakes provide deep integration with tools such as Jira, Confluence, Azure DevOps, Visual Studio, Eclipse, and IntelliJ, so that familiar tools are available to run, debug, and tune code. Visualisations of, for example, U-SQL, Apache Spark, Apache Hive, and Apache Storm jobs show how code runs at scale and help identify bottlenecks and cost optimisations.
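The encryption-in-motion point in item 5 can be made concrete with Python's standard ssl module: "SSL" in modern practice means TLS, and a client can be configured to refuse anything below TLS 1.2 (the minimum version chosen here is an illustrative baseline, not a specific regulatory requirement):

```python
import ssl

# Encryption in motion: build a client TLS context that verifies
# server certificates and refuses protocol versions below TLS 1.2.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Certificate verification and hostname checking remain enabled by
# default with create_default_context() -- do not switch these off.
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True
```

Encryption at rest, by contrast, is normally configured on the storage service itself (for example via platform-managed or customer-managed keys) rather than in application code.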
Because of their scalability and cost-effectiveness, data lakes are increasingly being used to handle big data. Big data is generally high in volume and takes a long time to process and analyse for meaningful insights. A scalable, centralised solution that stores massive amounts of raw, unstructured information without requiring it to be transformed first – while integrating natively with powerful data analysis tools – is therefore becoming an essential toolset for businesses that want to become more data-driven in their decision making.
The Impact of Data Governance on Data Lakes
When it comes to data, many banks do not have a documented set of processes outlining how essential data assets are to be managed and how data quality is to be maintained. This is a significant source of avoidable risk and may leave these institutions unable to focus on data quality and accountability.
Data lakes demand excellent data management, so it should be no surprise that a lack of data governance is the root cause of many data lake failures. Banks and ADIs need to consider how universally data is accessed throughout their organisations, and through which applications. During the initial phase of any data lake implementation the focus may not be on how to organise and control data; however, given that data will be accessed by multiple users through multiple applications, significant governance problems must be addressed early in the build. Banks need to think before they move data into a data lake – governance cannot be an afterthought.
Hadoop and Azure Data Lake are simple storage systems which are amenable to a wide variety of control mechanisms. This enables banks and ADIs to ingest data in an intentional and planned way that leads with governance. 
As the next generation of data lake technologies emerges, the focus on tactical data ingestion, enterprise planning, data quality, and security will drive new approaches to data governance, built upon a more robust foundation. Gen Advisory is here to help our clients turn data lakes into valuable, well-governed resources.
Data lake engineering will mature as banks and financial institutions learn from previous experience and tools such as Hadoop and Azure evolve new capabilities. Those banks that can achieve the vision of making data lakes truly user-centric and useful will be at the cutting edge of analytics-driven product and service innovation, safeguarding what is surely one of their most significant technology investments and ensuring ROI is delivered quickly and efficiently.
Banks are under pressure to meet compliance standards, and central to this are governance issues associated with data aggregation. Banks are required to implement policies around data governance, data aggregation, IT infrastructure, reporting, and more to ensure a timely, accurate, and complete view of data across multiple lines of business, in order to better understand, anticipate, manage, and mitigate risks. A well-managed data lake enables financial institutions to capture and automate the aggregation of data across the organization, improving their understanding of data quality. Banks can aggregate risk data in near real time and generate risk reports both regularly and on demand, including during crisis situations, to support changing internal needs and to answer auditing or supervisory queries. When financial crises occur, they tend to escalate quickly, and executive leadership needs accurate, up-to-the-minute status reports in order to manage the crisis. Well-governed data lakes can be a vital component of that.
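The kind of on-demand, cross-business-line aggregation described above can be sketched in a few lines of Python (the business lines, counterparties, and exposure figures are invented for illustration – the point is that one pool of data supports reports along any dimension):

```python
from collections import defaultdict

# Illustrative exposures drawn together from multiple lines of
# business; the line names and figures are assumed for the sketch.
exposures = [
    {"line": "retail",        "counterparty": "A", "exposure": 1_200_000},
    {"line": "retail",        "counterparty": "B", "exposure":   800_000},
    {"line": "institutional", "counterparty": "A", "exposure": 5_000_000},
    {"line": "markets",       "counterparty": "C", "exposure": 2_500_000},
]

def risk_report(rows, group_by):
    """Aggregate exposure on demand by any dimension (business line,
    counterparty, ...), as a crisis-time report might require."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row["exposure"]
    return dict(totals)

by_line = risk_report(exposures, "line")            # view by business line
by_counterparty = risk_report(exposures, "counterparty")  # view by counterparty
```

With the data siloed per business line, the counterparty view above would require cross-system reconciliation; with a well-governed lake it is a single query.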
Transaction fraud, identity fraud, and money laundering are also major concerns for financial services organizations, not least because many fraudsters can manipulate billing faster than investigators can audit. A well-managed data lake can enable near real-time data ingestion and automated fraud detection, with algorithms that detect patterns of potential fraud hidden in huge volumes of data. A centralized data catalogue provides an intuitive user interface for search and ad-hoc analytics of all data, enabling non-technical staff, such as attorneys, to quickly perform self-service data analytics.
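A pattern-detection rule of the kind mentioned can be sketched as a simple statistical outlier test in Python (a toy z-score check, far simpler than production fraud models; the threshold and transaction amounts are assumed for illustration):

```python
import statistics

def flag_outliers(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from
    the mean -- a toy stand-in for real fraud-detection models."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Mostly routine amounts with one suspiciously large transaction.
amounts = [120.0, 95.0, 130.0, 110.0, 105.0, 98.0, 25_000.0]
print(flag_outliers(amounts))  # [25000.0]
```

Production systems layer many such signals (velocity, geography, device fingerprints) and score them together, but each rule reduces to the same idea: scanning the full pool of ingested data for records that break an expected pattern.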
How Gen Advisory can support
· Supporting banks with automated compliance and regulatory reporting
· Providing strategic advice to ADIs and FinTechs on data lakes
· Delivering use cases covering the opportunities and threats of data lake implementation and usage
· Delivering bespoke research
· Conducting due diligence on solutions from a governance perspective