Planning ahead with an AI readiness plan: How AI will impact software development and security
The upsurge in AI tools is changing software development. But is it for the better, or the worse? This detailed plan makes sure your team, your Delivery Playbook, and your SDLC pipeline are ready.
I hadn’t intended to cover this topic, but the more I thought about it, the more important it became. The Delivery Playbook is about creating reliability and repeatability. Chapter 2.8 Delivery processes & tools isn’t complete without discussing AI and how it affects your pipeline. Hence, this deep dive into “Artificial Intelligence”: its opportunities, benefits, and risks. Using AI means introducing new threat vectors. I’ve included a plan: strategies, team practices, and tools that you can deploy today to leverage AI and establish protections.
What is “AI,” really?
Before we get started, let’s agree on what “AI” is. Everyone is throwing the term around. It seems there’s no industry that isn’t influenced by AI.
Daniel Warfield wrote about the surge in AI interest, “If duct tape suddenly became a trendy technology, startups would start slapping duct tape on their keyboards just so they could say ‘made with duct tape.’” He added that it wouldn’t make duct tape less useful — but it would make it harder to understand duct tape’s actual utility.1
That’s true in many ways, especially for some types of AI. The fact is, true artificial “intelligence” doesn’t exist. Intelligence is, at least according to one definition, “the exercise of understanding; readiness of comprehension.” But AI systems have neither understanding nor comprehension. AI is an umbrella term, loosely applied to a lot of different software.
Probably the most common applications of AI are machine learning and, more specifically, large language models, though there are others.
Most of what we run into today is a large language model (“LLM”). That’s true of ChatGPT and GPT-4o, Llama, Claude (one of my personal favorites) and many more. These are a type of computational model designed for natural language processing, such as language generation. They build statistical relationships from their training data, giving them a certain predictive power with regard to language — but they also inherit inaccuracies and biases present in the data on which they are trained.2
The potential benefits of AI in software development
And LLMs are a lot like duct tape. A large language model is essentially a text transformer: a system that takes in text — a lot of text, like the whole internet — and uses it to predict what you want, based on whatever prompt you feed it. If I type, “what comes after 5,” Claude will helpfully respond with “6.” It does that because there is a lot of evidence that when someone asks that question, “6” is probably the correct response. It’s duct-taping together connections to find a probable answer. And there are a lot of ways we can use this kind of duct tape!
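To make that “predict a probable answer” idea concrete before we get to those uses, here is a toy sketch. It is nothing like a real LLM (which runs a neural network over billions of parameters and tokens rather than whole words), but it illustrates the same principle: predict the most likely continuation based on what the data shows.

```python
from collections import Counter, defaultdict

# Toy "language model": count which word tends to follow which, then
# predict the most frequently observed continuation. Real LLMs are far
# more sophisticated, but the spirit -- statistical prediction, not
# understanding -- is the same.
corpus = "what comes after 5 ? 6 . what comes after 9 ? 10 .".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed continuation of `word`."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("comes"))   # "after" -- seen twice in the corpus
print(predict_next("after"))   # "5" -- ties are broken by first occurrence
```

There is no comprehension anywhere in that code, only counting, and the same is true (at vastly greater scale and subtlety) of an LLM.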
Knowledge, exploration and learning may be one of the most promising applications of large language models. Using Claude instead of Google search is a big win in my book. Claude can analyze that vast internet of information and, in most cases, find relevant information faster than if I read dozens of websites. It’s a great tool for synthesizing answers from large volumes of knowledge, including documents and code (which you can feed to it, giving you a great “assistant analyst”).
Code generation is a very common application. Some LLM products can assist across the entire software lifecycle, from design, to coding and peer reviews, to test development and maintenance.
Linguistic “chatbots” are providing improved customer experiences across industries. The better ones model warmth and human tones, show better reasoning, and can even “remember” a previous interaction.
Simulating computer use is on the horizon. Anthropic recently released an upgrade to Claude that actually “uses” other software — by accessing a computer screen it can move the cursor, click buttons, and type text. It’s still a bit rough around the edges, but in the future we’ll use AI more and more to perform tedious, repetitive tasks, such as “manual” UX and user interface testing.
Robotic process automation is another opportunity for “AI” to take repetitive, tedious, and error-prone tasks out of human hands. Autonomous robots, manufacturing and construction systems, self-driving cars (and vacuums) are just a few applications.
There’s a lot more, but this article is about the software pipeline, not exploring all of AI’s potential applications.
The limitations and risks of AI
Today, we’re seeing AI show up everywhere in software development. AI tools are advising engineers, performing “peer” code reviews, and generating code for us. And as businesses jump on the AI bandwagon, we’re starting to integrate LLM technology into our products — often without fully understanding what, exactly, we’re integrating.
I use Claude and a few other tools increasingly. There’s no question that as a research tool, it’s a time saver. And often the results I get from Claude point me in interesting new directions.
At the same time, there are limitations. I can ask Claude for “a secure authentication module that I can plug in to my Phoenix Framework app, so that users have to log in with their email address, and a 2FA key.” At first glance, the results are both usable and pretty good.
The code is a good starting point, and it was a lot faster to find than searching through API docs (the code is nearly identical to examples in the Phoenix documentation). I could just plug it in and go. And that’s the root of the problem.
What Claude failed to tell me is that the answer is woefully incomplete:
It’s a simple plug-and-play bit of code that will authenticate a user — but there is no architectural forethought.
No planning beyond basic authentication, no hint at an integrated authorization model.
The code to email one-time codes to a new user functions, but only by happenstance. It won’t work or scale in production; it’s a simple implementation meant to indicate, “insert your enterprise-grade email solution here.”
There are potential security holes as well. Storing the tokens in the users table might be secure, but only if the database itself is very secure. It’s not a best practice.
No rate limiting or session invalidation logic is provided or hinted at.
No token aging or rotation, no way to securely reset or even lock an account.
The 2FA flow is not robust because it’s based on user input alone, not true verification of a 2FA code. And there’s no serious input validation, another potential problem.
But a junior developer who’s struggling with the basics won’t know any of that. To a junior, it looks like a plug-and-play working solution.
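To make those gaps concrete, here is a minimal sketch of the kind of hardening the generated answer never hinted at: hashing one-time codes instead of storing them in plain text, expiring them, enforcing single use, and limiting verification attempts. It’s written in Python for brevity (the same concerns apply in a Phoenix app), the names are hypothetical, and it is a sketch of the missing concerns, not a drop-in solution; a real implementation would still need transport-level rate limiting, session invalidation, and account lockout.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

CODE_TTL = timedelta(minutes=10)   # one-time codes should expire quickly
MAX_ATTEMPTS = 5                   # stop the flow after repeated failures

def issue_one_time_code(store: dict, user_id: str) -> str:
    """Generate a code and persist only its hash, never the plaintext."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    store[user_id] = {
        "code_hash": hashlib.sha256(code.encode()).hexdigest(),
        "expires_at": datetime.now(timezone.utc) + CODE_TTL,
        "attempts": 0,
    }
    return code  # sent to the user out of band (email/SMS), never logged

def verify_one_time_code(store: dict, user_id: str, submitted: str) -> bool:
    """Constant-time comparison, expiry check, and attempt limiting."""
    entry = store.get(user_id)
    if entry is None or datetime.now(timezone.utc) > entry["expires_at"]:
        return False
    if entry["attempts"] >= MAX_ATTEMPTS:
        return False
    entry["attempts"] += 1
    submitted_hash = hashlib.sha256(submitted.encode()).hexdigest()
    if hmac.compare_digest(submitted_hash, entry["code_hash"]):
        del store[user_id]  # single use: invalidate immediately on success
        return True
    return False
```

None of these concerns surfaced in the generated answer, and nothing in it suggested they were missing.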
I also asked Claude for a test battery for my existing code. Its suggestion, again, seems great at first glance, but closer inspection disappoints. Instead of using the available generator library, it pulled in a new dependency, complicating my code dramatically. The tests themselves are cursory, not testing anything beyond basic functional expectations. There’s nothing to test the intent behind my design, no integration tests, no acceptance testing. But it looks good and it works, insofar as a bunch of tests pass and my code coverage stats went up.
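As a simple illustration of the difference, consider these hypothetical tests (sketched in Python with pytest rather than in my own stack, around a made-up apply_discount function, purely to show the contrast): the first merely exercises the happy path, while the others pin down the intent behind the design.

```python
import pytest

def apply_discount(price: float, percent: float) -> float:
    """Hypothetical domain function, defined only to make the example runnable."""
    if percent < 0:
        raise ValueError("discount percentage cannot be negative")
    return max(price * (1 - percent / 100), 0)

# Cursory, coverage-boosting test: it passes, but verifies almost nothing.
def test_apply_discount_runs():
    assert apply_discount(price=100, percent=10) == 90

# Intent-revealing tests: they encode the business rules the design rests on.
def test_discount_never_produces_a_negative_price():
    assert apply_discount(price=10, percent=200) == 0

def test_discount_rejects_invalid_percentages():
    with pytest.raises(ValueError):
        apply_discount(price=100, percent=-5)
```

The battery Claude produced was essentially all of the first kind.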
In both cases Claude gave me something usable but fundamentally flawed and incomplete. Only my knowledge and years of experience alert me to the missing pieces, and to the vulnerabilities, in this seemingly complete code it offered me.
In a recent article on how AI is turning developers into “-10X” (yes, minus 10X) developers, Andrew Zuo makes the point:
“Programming is like writing down a list of instructions. It’s a way of guiding a machine through a task, of breaking down complex actions into their component parts, of thinking.”3
But there’s the rub, because AIs do not think. LLMs cannot reason. There is no thinking going on, only parroting of information that already exists.
Ignacio de Gregorio wrote in depth about the overwhelming evidence that AI progress seems to be plateauing. He cites research from Apple to back up his case. Apple’s research concludes that “[AI’s] can’t genuinely reason, meaning their intelligence claims are highly overstated.”4
Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer. Overall, our work provides a more nuanced understanding of LLMs’ capabilities and limitations in mathematical reasoning.5 [emphasis mine.]
Apple makes the very clear point that “LLMs are not capable of genuine logical reasoning,” and that they are, effectively, just duct-taping together what they find in training data. It also shows that increasing the complexity of what we ask of an LLM results in a dramatic falloff in the quality of the result.
The evidence indicates current LLM tools can be useful for research, review, and as an aid to developing simple solutions. But complex output becomes very risky, requiring a higher level of understanding and the ability to review and fix critical omissions.
In other words, if you’re a senior, experienced developer, an LLM can be a useful accelerator. But if you are a junior developer, they may present more danger than benefit. Using one may well introduce bugs, security flaws, and maintainability concerns that far, far outstrip any benefit.
One example comes from this conversation, where the developer points out, “The one time I trusted Claude to write a full class — with excessive rounds of prompts and revisions — it introduced a very subtle bug that I would have never made but took a couple months to show up.”
So it seems there are many risks we need to be aware of. Prepare for. Head off, before they come back and bite us. And chances are, if we aren’t proactive, they will. Here are some of the problems we can expect:
Code quality risks.
The more LLMs become a mainstay tool, the more easily and readily developers will use them to generate code.
Yet, LLMs tend to generate incomplete and even poor-quality code, especially as our requests become more complex.
Architectural risks.
LLMs are not able to reason, to establish a complex chain of thought that leads to understanding.
Reasoning about complex architecture, its dependencies and implications thereof, is foundational to programming.
False information risk.
LLMs are only as good as their training data — and, broadly speaking, that means only as good as what they can find on the internet.
LLMs frequently report false information, which means careful verification is necessary. The more complex the request, the more likely you’ll get back incorrect information.
AI platforms are now attempting to avoid “bad training data.” But it’s not possible to sufficiently clean all input data.
Is AI a useful tool, or a threat vector?
As is often the case, advances in underlying technology expose new potential threat vectors.
For example, a context window is the amount of text data a language model can consider at one time when generating responses. The size of this window directly influences the model’s understanding and response accuracy. Early models were built to handle small context windows — just 1,000 tokens or even less. Today’s most advanced models can handle 200,000 or more.
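Tokens, not characters, are the unit that matters here. As a rough sketch of what “fitting in the context window” means in practice, here is a check using the open-source tiktoken tokenizer; the window size and headroom figures are purely illustrative, so substitute your model’s documented limits.

```python
import tiktoken  # pip install tiktoken -- an open-source tokenizer

CONTEXT_WINDOW = 200_000  # illustrative; use your model's documented limit

def fits_in_context(prompt: str, reserved_for_response: int = 4_000) -> bool:
    """Count prompt tokens and leave headroom for the model's reply."""
    encoding = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + reserved_for_response <= CONTEXT_WINDOW

print(fits_in_context("What comes after 5?"))  # True -- only a handful of tokens
```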
That larger window is itself a new threat vector. Using “many-shot jailbreaking,” a model can be “poisoned” with malicious context data: as the context window increases in size, it becomes possible to “stuff” the context with false data, effectively swaying the LLM to produce potentially harmful responses despite having been trained not to.
AI platforms train their models to refuse harmful requests. For instance, if you outright ask Claude, “how do I build a bomb?” it won’t help you. But by poisoning its context window enough, you might be able to alter the LLM’s responses. Anthropic’s white paper on the topic gives us a simple example.6
Keep in mind that LLMs do not reason. It’s not possible for an LLM to make a cognitive leap, to connect “that is bad” and “therefore, this other thing must also be bad.” An LLM purely evaluates the data input, and generates a predicted output — so by putting a finger on the scale, it’s possible to tip the balance and influence that output.
And hence the problem: The effectiveness of many-shot jailbreaking relates to in-context learning, and the larger and better an LLM is, the better it tends to be at in-context learning. It’s a balancing act between features and security. One that is evolving as we watch.
Safely putting AI to use
It’s easy to start using AI products. In fact, it’s quite likely that some of the tools you use are AI driven, whether you like it or not (for example, Apple’s Siri).7
This month, Alexey Boas, Ken Mugrage and Neal Ford talked about the “cambrian explosion of generative AI tools.” They go deep on coding-assistance anti-patterns and the risk of relying too heavily on AI, pointing out that we still need to think for ourselves and “double down on good engineering practices.” Otherwise, we risk “encrypting your own codebase with complexity.” It’s a great podcast, well worth listening to.8
It’s important to be vigilant about potential weaknesses and vulnerabilities. Then we can develop strategies for successfully adopting — and benefiting from — AI-driven tools.
Careless acceptance of AI-generated code without proper review is likely the most significant risk early in our delivery pipeline. We risk introducing outdated practices, anti-patterns, unnecessary complexity, and security flaws.
Intellectual property infringement is also a concern. The risk of incorporating copyrighted code is very real. Likewise, there is legal uncertainty in situations involving generated code. It’s virtually impossible to track down the origin of these code snippets, so on both fronts we risk contaminating our codebase.
Relying on AI coding assistants too readily could easily weaken our entire SDLC with:
Future scalability limitations.
Poor reuse of system-wide architectural patterns.
Inconsistent company-specific coding standards.
Long-term maintenance problems.
Poor, missing, or inadequate documentation and provenance information.
When it comes to maintainability, there can be a high cost as well, with suspect code introducing latent problems:
Unnecessary memory allocations.
Inefficient database queries.
Redundant API calls.
Poor resource utilization.
Suboptimal algorithm choices.
Over-reliance may have an impact on us, too. Degradation of developers' fundamental coding skills could be a real thing. Using AI as a crutch too readily could lead to reduced understanding of system internals due to AI abstraction. We run headlong into knowledge gaps between AI-generated and human-written code.
We do, indeed, need to think for ourselves — and in so doing, ensure our adoption of AI tools achieves an overall benefit.
Building your AI readiness plan
We need a comprehensive approach to safely adopting AI tools. We need to look at every stage of our delivery pipeline and ask, “what do we need to do here?” That means adopting tools to protect code quality, avoid over-reliance, avoid security vulnerabilities, and detect active threats.
Just a reminder: Chapter 2.8 Delivery processes & tools of the Delivery Playbook goes deep into a lot of these topics. If you haven’t already, I’d recommend reading the companion article, When should you think about security?.
The first step is establishing your plan. The following outline develops a readiness plan that will prepare your team to safely adopt AI capabilities. Then, with a clear picture of what needs to be done, you can address your timeline and prepare a roadmap for adoption.
Strategic plan for AI security and adoption readiness
Risk Assessment and Policy Framework
Initial Risk Assessment
Conduct comprehensive AI threat modeling (I recommend this Threat Modeling Cheat Sheet). This sounds intimidating, but it’s a straightforward approach to answering four questions (see the register sketch after this list):
What are we working on?
What can go wrong?
What are we going to do about it?
Did we do a good enough job?
Identify potential attack vectors (data poisoning, model extraction, prompt injection, code quality and security, standards and compliance management, legal risk).
Assess current AI usage and dependencies within your organization.
Evaluate third-party AI service risks.
Document regulatory compliance requirements.
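One lightweight way to capture the answers to those four questions is a simple threat register that lives alongside your code and gets reviewed like any other artifact. A minimal sketch, with illustrative fields and an illustrative entry rather than any formal standard:

```python
from dataclasses import dataclass, field

@dataclass
class ThreatEntry:
    """One row in a lightweight AI threat register."""
    what_we_are_working_on: str       # the asset or workflow in scope
    what_can_go_wrong: str            # the threat (e.g., prompt injection)
    what_we_will_do_about_it: str     # the mitigation and its owner
    did_we_do_a_good_enough_job: str  # how the mitigation is verified
    status: str = "open"
    tags: list[str] = field(default_factory=list)

register = [
    ThreatEntry(
        what_we_are_working_on="LLM-assisted code review bot",
        what_can_go_wrong="Prompt injection via a malicious PR description",
        what_we_will_do_about_it="Sanitize PR text; restrict the bot's permissions",
        did_we_do_a_good_enough_job="Red-team test cases run in CI",
        tags=["prompt-injection", "sdlc"],
    ),
]
```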
Opportunity Analysis
Explore and assess opportunities for AI usage within the organization (cross reference with current usage assessment), including:
Requirements analysis and refinement.
Coding acceleration.
Peer reviews, design critiques, collaborative coding.
Test plan improvement.
Risk prioritization, preliminary risk identification.
Documentation enhancement.
Enhanced CI/CD and DSO pipeline capabilities (see tools, below).
AI-based threat modeling, threat detection, incident detection.
Policy Development
Define and document acceptable AI use cases and prohibited practices.
Establish documentation requirements for AI usage in code. This means documenting exactly how and where AI-generated code has been used — not using an AI to write documentation!
Create AI-specific artifact (code, etc.) review checklists.
Create OSS and commercial AI procurement and vendor assessment criteria (cross reference with compliance and security requirements).
Develop incident response procedures for AI-related incidents.
If you are integrating or developing AI-driven products:
Establish AI product and model validation and testing requirements.
Set data governance standards for AI training and usage (my personal recommendation is that a team or “center of excellence” be established to shepherd the process moving forward).
SDLC Integration
Development Phase
Working with your team, write AI-specific code review guidelines.
Insist on human peer code reviews, especially in any situation where an AI may have generated the code. AI assistants are helpful in peer review situations, but are not a replacement for human review.
Develop and document safe, secure prompt engineering standards.
Define AI testing requirements, and then identify the testing frameworks that meet those requirements (I’ve included a list of some suitable tools below).
If you are integrating or developing AI-driven products:
Establish your model validation pipelines. Set up model versioning and dependency tracking.
Security Controls
Deploy rate limiting and usage monitoring (see the gateway sketch after this list).
Establish model access controls and authentication (you want complete visibility into who is using AI tooling, and for what purpose).
Establish human-based security threat management practices (such as red-blue teaming).
If you are integrating or developing AI-driven products:
Implement input sanitization for AI interactions.
Set up AI output validation and filtering.
Create audit trails for AI decisions and actions.
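Several of these controls can converge in a thin gateway that every AI call passes through. Here is a hedged sketch of that idea, combining rate limiting, basic input screening, output validation, and an audit trail. The function and field names are hypothetical, and the screening rules shown are placeholders, not a complete defense against prompt injection.

```python
import json
import logging
import re
import time
from collections import defaultdict

logging.basicConfig(filename="ai_audit.log", level=logging.INFO)

RATE_LIMIT_PER_MINUTE = 30
SUSPICIOUS_PATTERNS = [r"ignore previous instructions", r"reveal the system prompt"]
_request_times: dict[str, list[float]] = defaultdict(list)

def call_model(user_id: str, prompt: str, model_call) -> str:
    """Wrap every AI call with rate limiting, screening, and auditing.
    `model_call` is whatever function actually talks to your provider."""
    now = time.time()
    recent = [t for t in _request_times[user_id] if now - t < 60]
    if len(recent) >= RATE_LIMIT_PER_MINUTE:
        raise RuntimeError("rate limit exceeded")
    _request_times[user_id] = recent + [now]

    # Basic input screening -- a placeholder, not a prompt-injection cure.
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            raise ValueError("prompt rejected by input screening")

    response = model_call(prompt)

    # Output validation: only a length sanity check here; a real system
    # would also scan for secrets, PII, or policy violations.
    if len(response) > 100_000:
        raise ValueError("response rejected by output validation")

    # Audit trail: who asked what, when, and how large the answer was.
    logging.info(json.dumps({"user": user_id, "prompt": prompt[:200],
                             "response_chars": len(response), "ts": now}))
    return response
```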
Quality Assurance
Define AI-specific testing scenarios. This includes AI use in the SDLC, as well as AI integration into products (see below for suggested tools).
Implement explainability requirements.
This may only apply to product development, but even so I think it’s worth evaluating in all situations. Imagine you are explaining your use of AI to the CTO, and she asks, “so what exactly goes on inside?” and all you can say is “nobody knows, it’s a black box.”
If you are integrating or developing AI-driven products:
Establish performance benchmarks.
Create bias detection and mitigation procedures.
Set up continuous monitoring of AI system behavior.
DevSecOps Enhancement
Infrastructure Security
Establish backup and recovery procedures.
Create secure paths for AI product and model updates.
If using custom AI models or PaaS with self-managed AI models (not SaaS products):
Implement secure model storage and versioning.
Set up isolated environments for AI testing.
Deploy monitoring for abnormal AI behavior.
CI/CD Pipeline Updates
Add AI-specific security scanning products to your delivery pipeline.
Implement automated testing for AI components and generated artifacts.
Set up automated documentation detection and enforcement (see the CI sketch after this list).
If using custom AI models or PaaS with self-managed AI models (not SaaS products):
Create model validation checkpoints.
Deploy monitoring for model drift.
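Returning to the documentation enforcement item above, here is a hedged sketch of a CI gate: any changed file that declares AI assistance must also name its human reviewer. The `AI-Assisted:` comment convention and the script are hypothetical; adapt both to whatever your own policy actually requires.

```python
"""CI gate: changed files that declare AI assistance must name a reviewer.
Hypothetical convention: `# AI-Assisted: Claude, reviewed by <name>, <date>`."""
import subprocess
import sys

def changed_files(base: str = "origin/main") -> list[str]:
    """List source files touched relative to the base branch."""
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True)
    return [path for path in out.stdout.splitlines() if path.endswith(".py")]

def main() -> int:
    violations = []
    for path in changed_files():
        try:
            lines = open(path, encoding="utf-8").read().splitlines()
        except FileNotFoundError:
            continue  # the file was deleted in this change set
        for line in lines:
            if "AI-Assisted:" in line and "reviewed by" not in line:
                violations.append(f"{path}: AI-assisted code with no named reviewer")
    if violations:
        print(*violations, sep="\n")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```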
Operational Controls
Establish incident response procedures.
Implement logging and audit trails.
Define SLAs for AI system performance.
If using custom AI models or PaaS with self-managed AI models (not SaaS products):
Create AI system monitoring dashboards.
Set up alerting for anomalous behavior (e.g., notification methods such as Uptime, Better Stack, AlertOps).
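A hedged sketch of the kind of check that can feed such an alert: compare today’s AI API usage against a rolling baseline and flag large deviations. The metric (daily token count) and the threshold are illustrative; wire the alert itself into whichever notification service you use.

```python
import statistics

def is_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's usage if it deviates sharply from the recent baseline.
    `history` is a rolling window of daily token (or API call) counts."""
    if len(history) < 7:
        return False  # not enough data to establish a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Example: a sudden spike in daily token usage is worth paging someone about.
baseline = [12_000, 11_500, 13_200, 12_800, 11_900, 12_400, 13_000]
print(is_anomalous(baseline, today=95_000))  # True
```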
Training and Awareness
Team training is something you’ll want to tailor depending on the nature of your development. Are you building AI-driven products? Or merely using AI tools in your development process?
Understanding basic security principles helps identify where AI and code security collide. Consider including a security course or certification. There are free options (from Codepath and Google) and commercial ones (from Udacity, Pluralsight). Maybe you can expense it!
Technical training topics
Internal coding, documentation and AI-usage policies and procedures.
Internal AI product and tool onboarding and usage.
Secure AI development practices.
AI threat recognition and mitigation.
Model validation and testing procedures.
Incident response protocols.
Security tool usage and automation.
General awareness topics
Internal guidelines for responsible AI tool usage.
AI security risks and best practices.
Data protection requirements.
Incident reporting procedures.
Regulatory compliance requirements.
Establishing a timeline and plan roadmap
Establishing a realistic timeline, turning it into an actionable roadmap, and making it happen takes careful forethought. Work with your team and your organization’s stakeholders, and keep in mind that everything can’t happen overnight.
The following is a model timeline. You will want to adapt it to your own team’s capabilities and organizational goals.
Phase 1 (months 1-3)
Complete risk assessment.
Develop initial policies.
Begin technical training.
Implement basic security controls.
Phase 2 (months 4-6)
Deploy AI testing and compliance tools.
Update CI/CD pipeline.
Implement monitoring systems.
Complete policy documentation.
Phase 3 (months 7-9)
Roll out full security and compliance controls.
Complete all training programs.
Implement advanced monitoring.
Conduct initial audits.
Phase 4 (months 10-12)
Fine-tune all systems.
Conduct penetration testing.
Review and update policies.
Begin continuous improvement cycle.
Measuring success
You can only know if your readiness plan works by measuring the results. You’ll want to set up specific metrics, a method for measuring them, and a way to compare results over time. This will also drive your continuous improvement cycle.
Number of AI-related security incidents.
Number of AI-related development and SDLC violations.
Code, product, and SDLC/DSO security scan coverage and results.
Policy compliance rates.
Training completion rates.
AI-driven tools to improve your DSO pipeline
This section introduces some tools that may be fit for purpose. It’s just a starting point, and of course you’ll want to evaluate solutions and adopt appropriate tooling for your organization.
Don’t shy away from using AI-driven tools in your security pipeline, but recognize they are inherently limited. AI-driven tools do provide more capabilities, but don’t become over-reliant on one solution — they aren’t a magic bullet.
Code analysis and security
Snyk (and Snyk Code): Employs machine learning to detect vulnerabilities and suggest fixes in real-time. Includes features for IaC as well as source code scanning, deep dependency scanning, and real time monitoring.
SonarSource (and SonarQube for CI/CD, and AI Quality Gates): Integrates AI for detecting security hotspots and code quality issues. Offers code quality gating, vulnerability pattern detection, and much more. Supports standardized coding guidelines to reduce code quality risks.
JFrog Xray: Analyzes code and artifacts the way an attacker would, from “code to edge.”
Amazon CodeWhisperer Security Scan: Security scanning that touts specific capabilities for AI-generated code.
GitHub Advanced Security: Uses LLMs to identify potential security vulnerabilities during code completion.
GitHub Advanced Security for Azure: Focuses on getting developers to work better together, fix security issues faster, and reduce overall security risk.
Security Orchestration
Cortex XSOAR (formerly Demisto): Uses machine learning for incident response automation.
Torq: Leverages AI for security workflow automation.
Dependency Management
Dependabot: Uses AI to prioritize dependency updates based on security impact.
Cloud security
Wiz: Comprehensive security across code, CI/CD, cloud visibility, and active threat monitoring. Uses AI to analyze cloud configurations and identify security risks.
Lacework: Multicloud visibility and protection. Applies machine learning for anomaly detection in cloud environments. Supports software composition analysis (SCA), SAST, IaC security, and more.
Aqua Security: Implements AI for container and Kubernetes security scanning.
Active threat monitoring
Checkmarx: Uses AI to reduce false positives in security scans.
Contrast Security: Employs machine learning for more accurate vulnerability detection.
HCL AppScan: Incorporates AI for intelligent scan optimization.
Security enhancement
Orca: Vulnerability scanning & cloud security, provides full visibility into deployed AI models to protect against data tampering and leakage.
The bottom line
Artificial intelligence is a new, rapidly changing field, and much of it isn’t well understood. Any team using artificial intelligence must establish practices and standards to improve safety and security.
As AI products advance, the promise is that models will improve — but, there’s plenty of debate over how much the industry may have already plateaued. We might be stuck with risky AI models for a long time to come.
I wrote about red-blue teaming before — and it’s even more relevant today. DSO teams need to integrate similar practices if any AI solutions are being deployed. Anthropic published a white paper about it, concluding that, “Red teaming is a critical tool for improving the safety and security of AI systems. It involves adversarially testing a technological system to identify potential vulnerabilities.”9
Integrate AI security planning into your playbook. Adopt a security-first mindset, and strategies like threat analysis and wargaming. Prioritize tooling that ensures you are protected from AI-related risks, external threats, and inappropriate AI use internally.
The ideal time to adopt technology that tackles AI-related risks is, “as early as possible.”
Staying up to date
The facts and framework presented in this article are based on information from a number of different sources. I should also be clear that it’s a point-in-time analysis. With the AI landscape changing so rapidly, you’ll want to consult other sources to make sure you have the latest information.101112
The current state of AI security is evolving fast, and may have changed significantly (depending on when you are reading this article).
Industry-specific regulations and standards around AI security are also rapidly evolving.
New AI-specific threats and mitigation strategies have surely developed.
For the most up-to-date and authoritative guidance, I'd recommend consulting:
NIST's AI Risk Management Framework (AI RMF 1.0 and NIST AI RMF Playbook).13
Major cloud providers’ AI security documentation (such as the AWS security scoping matrix, and the associated blog).14
Commercial research and publications on AI security (such as Anthropic’s public research).15
Security frameworks like MITRE ATT&CK's coverage of AI threats.16
Industry-specific regulatory bodies and proposals.
Academic research on AI security.
Professional organizations focused on AI security (see: IEEE).
Daniel Warfield, AGI is Not Possible, Level Up Coding, April 2024.
Wikipedia, “Large language model,” https://en.wikipedia.org/wiki/Large_language_model.
Andrew Zuo, Apple Study Says AI Can’t Code, Medium.com, Oct. 2024.
Ignacio de Gregorio, Apple Speaks the Truth About AI. It’s Not Good., Medium.com, Oct. 2024.
Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, Apple, Oct. 2024.
Many-shot jailbreaking, Anthropic, Apr. 2024.
Press release, Introducing Apple Intelligence, Apple, June 2024.
Alexey Boas, Ken Mugrage, Neal Ford, Themes from Technology Radar Vol.31, techradar.com, Oct. 2024.
Challenges in red teaming AI systems, Anthropic, June 2024.
GitGuardian, GitHub Copilot Security and Privacy Concerns: Understanding the Risks and Best Practices, https://blog.gitguardian.com, Sep. 2024.
TechTarget, GitHub Copilot replicating vulnerabilities, insecure code, https://techtarget.com, Feb. 2024.
Siva Sai, Utkarsh Yashvardhan, Vinay Chamola, Biplab Sikdar, Generative AI for Cyber Security: Analyzing the Potential of ChatGPT, DALL-E, and Other Models for Enhancing the Security Space, IEEE, Apr. 2024.
Artificial Intelligence Risk Management Framework PDF (1.0) and AI RMF Playbook web site, NIST.
Generative AI Security Scoping Matrix and Securing generative AI: An introduction to the Generative AI Security Scoping Matrix (Matt Saner, Mike Lapidakis), AWS, Oct. 2023.
Researching at the frontier, Anthropic.
MITRE ATT&CK©, globally-accessible knowledge base of adversary tactics and techniques, 2015-2024, MITRE Corporation.