The importance of data for generative AI

As more businesses embrace the transformative power of generative AI, they’re coming to appreciate the importance of data integrity to producing quality results. Data and AI go hand in hand, and it’s never been more important to develop best practices around the data we’re feeding to LLMs. In this post, we’re taking a close look at the considerations you’ll need to make when using data for generative AI.

New Gartner research suggests that at least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025. Early adopters often struggle with escalating costs, and deployments can range from $5M to $20M.

With so much on the line, ensuring that AI and GenAI projects succeed is imperative for organizations that want to drive efficiency and generate new revenue streams. The challenges facing any business that hopes to take advantage of the promise of GenAI include collecting, storing, organizing, and managing the data that informs today’s models. 


An AI cautionary tale

One cautionary example of how data can negatively impact the outcomes of AI projects is the case of IBM Watson and the MD Anderson Cancer Center.

IBM partnered with MD Anderson Cancer Center to use Watson to help oncologists diagnose and recommend treatments for cancer patients. By 2017, after several years and over $60M invested, the project was halted due to data issues. Because medical records were stored in multiple locations, in unstructured formats, and across different systems, it was difficult to unify the data in a way that Watson could effectively use. As a result, Watson struggled to produce accurate results, and the project was ultimately considered too costly.

This is just one example of how the best-laid plans can go awry when the integrity of data isn’t attended to from the beginning of a major initiative. It’s one thing to ask whether an organization is ready to take on an AI project. A better question is whether the data itself is ready. 

Common reasons GenAI models fail to deliver ROI

According to a new RAND Corporation report, an estimated 80% of AI projects fail to achieve their intended goals, with data quality emerging as the second most pressing hurdle. This high failure rate often stems from:

  • Lack of clear business objectives. Without well-defined goals, organizations struggle to identify suitable AI use cases and measure success.
  • Inadequate data. A shortfall in either the quality or the quantity of data needed to properly train AI models hinders organizations in the long run.
  • Too much focus on novelty. Many AI projects fail because the organization is more concerned with using the latest technology to generate flashy content than with solving tangible problems for its users.
  • Improper infrastructure. Organizations often lack the infrastructure needed to manage data and deploy AI models efficiently.
  • Project complexity. AI projects fail when the technology is applied to problems that are too complex for it to handle.

GenAI model success starts and ends with your data

Data quality and data infrastructure are two major reasons why AI and GenAI projects come up short. Since GenAI models require large volumes of high-quality data for training, inaccurate or incomplete data can compromise the results. In the case of LLMs, many users have seen hallucinations: confidently delivered but erroneous answers to questions that could easily be answered with access to common knowledge. Let’s look at two examples of what can go wrong.

In May 2023, a lawyer used ChatGPT to help with legal research and ended up submitting a brief with six completely made-up cases. The judge was not impressed, to say the least, and the lawyer faced potential sanctions. This incident highlighted the dangers of relying on AI without double-checking its output, especially in high-stakes situations.

In another case, an Australian mayor was falsely described by ChatGPT as having been imprisoned for bribery. This incident demonstrated how AI hallucinations can damage reputations and spread false information about individuals, leading to troubling real-world consequences.

Hallucinations are just one of many ways that feeding an AI model inaccurate data can lead to inconsistent or erroneous results.

Three V’s of data and their impact on AI models

AI may be evolving at breakneck speed, but many of the core rules around data that professionals have relied upon for decades have remained the same. Ten years ago, big data technology was rapidly transforming how organizations understood and used data. 

The 3 V’s (volume, velocity, and variety) formed one of the primary frameworks defining the properties of big data. These rules still apply today, especially when it comes to data for generative AI. Let’s break down how the 3 V’s can adversely impact today’s AI and GenAI models if not managed correctly.

  • Volume. Since GenAI models like LLMs need to be trained on vast amounts of data to provide more accurate results, the volume of data is critical. Machine learning models provide better insights when they’re trained on large amounts of data. As data scales and the cost of storage rises, data management becomes even more critical.
  • Velocity. “Time is money, money is time.” Indeed it is, especially when it comes to critically important data. In the domain of cybersecurity, ML models constantly search for anomalous behaviors that can result in a data breach. The better prepared your organization is to handle real-time or near real-time data, the better fortified you are against a potential security breach.
  • Variety. In 2024, there are all types of data with which people want to engage. Many of today’s LLMs are now multimodal large language models (MLLMs), a type of AI that can process and generate information from multiple modalities, including text, audio, images, and video. As data sets become more diverse, data storage and data management grow more complex, and getting them right is critical to producing valuable results. This is where proper labeling and the capacity to handle both structured and unstructured data are so important, as the IBM Watson example illustrates. A simple sketch of what checking these three dimensions might look like follows this list.
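
To make the framework a bit more concrete, here is a minimal sketch of pre-training data checks loosely organized around the 3 V’s. The records structure, the ingested_at and modality fields, and the thresholds are illustrative assumptions rather than part of any particular platform.

    # Minimal sketch of volume, velocity, and variety checks on a dataset
    # before it reaches a training or fine-tuning pipeline. Field names and
    # thresholds are illustrative assumptions.
    from datetime import datetime, timedelta, timezone

    def check_three_vs(records: list[dict]) -> dict:
        """Return simple volume, velocity, and variety signals for a non-empty dataset."""
        now = datetime.now(timezone.utc)

        # Volume: is there enough data to train on at all?
        volume_ok = len(records) >= 10_000

        # Velocity: how stale is the newest record? (ingested_at is assumed
        # to be a timezone-aware datetime.)
        newest = max(r["ingested_at"] for r in records)
        velocity_ok = (now - newest) <= timedelta(hours=1)

        # Variety: which modalities (text, image, audio, ...) are represented?
        modalities = sorted({r["modality"] for r in records})

        return {
            "volume_ok": volume_ok,
            "velocity_ok": velocity_ok,
            "modalities": modalities,
        }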

Google Cloud Databases and GenAI

Google played a major role in defining the modern practice of data management, data operations, analytics, and scalable storage, and its solutions are built to support AI and GenAI applications. When it comes to properly addressing the 3 V’s of data, Google Cloud offers a set of solutions designed to handle complex data architectures and integrate seamlessly with Vertex AI and powerful models like Gemini.

Three reasons to consider Google Cloud solutions for AI data readiness 

  1. Experience and frameworks. Google Cloud products represent years of engineering experience managing massive datasets from Search, Gmail, and other cloud-based SaaS products. Technologies like Dataflow, Bigtable, Google File System, and MapReduce established a robust foundation for AI and data-intensive workloads.
  2. Building enterprise GenAI apps faster. Google Cloud databases simplify integration with existing developer ecosystems. Support for popular open-source standards like PostgreSQL and HBase ensures easy migration of legacy databases. With Spanner, Google Cloud also supports global scale, while Bigtable provides high-performance solutions for large-scale workloads.
  3. AlloyDB for AI. AlloyDB is purpose-built for enterprise GenAI applications that require real-time and accurate responses. Its superior performance for transactional, analytical, and vector workloads makes it ideal for demanding GenAI applications, as the sketch below illustrates.
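
As a rough illustration of the vector workloads mentioned above, here is a minimal sketch of a similarity query against a PostgreSQL-compatible database such as AlloyDB with the pgvector extension enabled. The connection details, table, and column names are hypothetical, and in practice the query embedding would come from an embedding model (for example, one served through Vertex AI).

    # Minimal sketch: nearest-neighbor search over stored embeddings in a
    # PostgreSQL-compatible database (e.g. AlloyDB) with the pgvector
    # extension enabled. Connection details, table name, and column names
    # are hypothetical.
    import psycopg2

    conn = psycopg2.connect(
        host="10.0.0.5",       # hypothetical database IP
        dbname="appdb",
        user="app_user",
        password="change-me",
    )

    # In a real application this embedding would be produced by an embedding
    # model; here it is a short placeholder vector.
    query_embedding = [0.02, -0.17, 0.31]
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

    with conn, conn.cursor() as cur:
        # <-> is pgvector's Euclidean-distance operator; this returns the
        # five rows whose stored embeddings are closest to the query vector.
        cur.execute(
            """
            SELECT id, description
            FROM product_docs
            ORDER BY embedding <-> %s::vector
            LIMIT 5;
            """,
            (vector_literal,),
        )
        for row in cur.fetchall():
            print(row)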

Data readiness for AI-driven transformation with SADA

The path to AI-driven transformation begins with preparing your data. SADA’s expertise includes a track record of identifying, cataloging, and examining the data sources organizations need to achieve desired business outcomes. What steps does SADA take to make your data AI-ready?

Step 1: Discover 

Oftentimes, organizations need a clearer data strategy to avoid data sprawl and disorganization. SADA collaborates with your team to locate and organize your data, ensuring that everything is accounted for.

Step 2: Move 

Sometimes data needs to be relocated. For example, you might have legacy data sitting in outdated systems like Teradata or Hadoop, or scattered across aging databases in your data center. SADA frequently migrates such data to the cloud for optimal processing. If your data resides in another cloud platform, you’ll have several options, from leaving it in place to summarizing, hashing, or moving it to Google Cloud.

Step 3: Validate and enhance 

SADA combines your domain expertise with AI and data expertise to resolve issues at the data’s source. Your dedicated SADA team ensures that the corrected data is accurately flowing to AI training systems. If data is missing, your SADA team will help you fill in those gaps. In other instances, corrections necessitate using methods like third-party APIs for address verification, deduplication, cleaning up formatting issues, or fixing erroneous records. These processes are automated whenever possible. 
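
As an example of the kind of automated correction described above, here is a minimal sketch of normalizing record formatting and removing duplicates with pandas. The column names and cleanup rules are illustrative assumptions, not a description of SADA’s actual tooling.

    # Minimal sketch of an automated cleanup pass: normalize formatting,
    # then drop duplicate customer records. Columns and rules are
    # illustrative assumptions.
    import pandas as pd

    records = pd.DataFrame(
        {
            "name": ["Ada Lovelace", "Ada Lovelace", "Grace Hopper"],
            "email": ["Ada@Example.com", "ada@example.com ", "grace@example.com"],
            "phone": ["(555) 010-0001", "555-010-0001", "555 010 0002"],
        }
    )

    # Normalize formatting so superficially different values compare as equal.
    records["email"] = records["email"].str.strip().str.lower()
    records["phone"] = records["phone"].str.replace(r"\D", "", regex=True)

    # Drop duplicates on the normalized keys, keeping the first occurrence.
    deduped = records.drop_duplicates(subset=["email", "phone"], keep="first")

    print(deduped)  # two unique customers remain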

Get your data ready and start innovating

For more insights into how to harness your operational databases for enterprise GenAI apps, we recommend the informative new guide from Google Cloud, Accelerating generative AI-driven transformation with databases. This helpful resource covers topics such as:

  • Why GenAI is making modernization more urgent than ever
  • What factors put databases at the heart of GenAI apps
  • How Google Cloud databases can help you build apps grounded in your enterprise data

When you’re ready to talk to a SADA AI and data expert, feel free to sign up for SADA’s GenAI Journey Accelerator program for your customized blueprint for GenAI success.

LET'S TALK

Our expert teams of consultants, architects, and solutions engineers are ready to help with your bold ambitions, provide you with more information on our services, and answer your technical questions. Contact us today to get started.
