From Apps to LLMs: How the definition of penetration testing is expanding
Many individuals still imagine a pentest as a static endeavor. Define your boundaries around the network and applications, run through your engagement, read through your reports, and close the findings.
But here’s what is often left out : while that model has been inaccurate for years it becomes less accurate with each advancement made in the way we develop software.
The boundary of a test has always been dynamic. Many organizations continue to define the scope of tests as if it were 2021.
Security teams miss this shift at their own expense. Data breaches do not come from poor controls. Instead, they come from environments where all of the controls actually functioned properly but still allowed data to exit. Traditional testing does not identify gaps such as these. Most teams have found out the hard way by spending money on defining the wrong surface area after the breach occurred and/or in a meeting they would rather not have attended.
This guide describes how penetration testing evolved, how AI affects it, which aspects of traditional testing remain relevant, and how AI Security Testing really begins.
Contents
- A Breach Where Nothing Was Broken Into
- How the Scope Keeps Moving
- What AI Actually Changes
- What Traditional Testing Still Covers
- Where AI Security Testing Begins
- How to Scope an AI Penetration Test
- Common Mistakes That Leave AI Untested
A Breach Where Nothing Was Broken Into
In June of 2025, Researchers took Microsoft's own tool to use against itself. They extracted internal information from their own systems using that same tool, and they delivered it to an outside party. This all began when someone sent an e-mail. There was no malware. No one clicked on anything. No password was compromised. The harmful instructions were simply displayed in plain text. Like everything else that is placed into the context of its operation, the Copilot read those commands. The data was being moved prior to anyone having a valid reason to check where it went.
Microsoft logged the flaw as CVE-2025-32711. The industry called it EchoLeak. It got patched before anyone used it in the wild.
When the bug gets fixed, the question becomes how the breach happened. Everything worked as expected when all conventional methods of controlling access were used. But even though this would theoretically be an easy find on most pentests (because they look for things similar to what these controls are designed to protect against), and every one of those controls is doing their job, it was still enough to break through Microsoft’s defenses.
How the Scope Keeps Moving
The discipline doesn’t change and the target keeps wandering. The way we test has not changed much. There are just always more targets to look at. The more tools and technologies you add, the larger your attack surface becomes and testing follows to keep pace with it. This issue will run through four different generations. No generation has ever replaced the previous one entirely.
First came the infrastructure.
In this case, servers, networks, firewalls and all other perimeter devices were used as the main way for finding vulnerabilities (exposed ports) that existed outside of the "safe zone" where systems were considered to be fully inside or completely outside the organizations' trust boundaries. That type of vulnerability assessment is still the foundation.
Next came web applications.
Web-based businesses and applications developed rapidly and they had many authentication, authorization, and session management issues that needed to be tested. As such, the testers began to move from the network to focusing on the application logic.
Third came the API.
With the increasing use of service-oriented architecture (SOA), APIs effectively served as the glue between multiple services and/or applications.
However, APIs provided yet another large surface area for attacks to occur against them (i.e., broken object level access control, misuse of third party integrations etc.).
Fourth is AI.
Like each of the preceding three areas, AI creates just another layer on top of those previously built without removing any of the prior layers. What distinguishes AI from its predecessors is how it can act upon those layers once it exists.
What AI Actually Changes
Every previous expansion has been based upon how much more software could be used in predictable ways.
However, traditional testing relied on the assumption of a number of things that Artificial Intelligence (AI) has disrupted:
-
Inputs of the software are language. Instead of using structured and validated fields, the input controlled by an attacker is almost limitless as an AI application accepts open-ended natural language.
-
Outputs of the software are probabilistic. The output of the same prompt may vary significantly on Tuesday versus Monday. Therefore, you cannot determine this type of variability through the use of fixed and repetitive test cases similar to those you would use for testing a login flow.
The software will act independently if provided with tools and permissions. Therefore, the risks associated with AI applications move from what they say to what they do.
It also interacts with everything. It attempts to access external data sources and uses plugins or downstream services. Each one of these then becomes a viable portion of an attacker's potential attack path.
Stack those up and AI ends up well outside the reach of tests designed for predictable software. The OWASP GenAI Security Project puts it about as plainly as you could ask: "traditional application security practices are no longer sufficient" for these systems.
What Traditional Testing Still Covers
All these new AI features don’t remove traditional applications.
Each and every feature of AI is built to run as standard code (the servers and clouds beneath them, authentication/authorization controlling who enters which door, APIs shuttling data back & forth, logic encapsulating an entire app, etc.) that is still within the definition of conventional penetration testing. This can be misconfigured and broken in normal ways.
Add a sleek AI copilot to poorly configured API and you have not removed an issue with AI. You have simply added another layer atop an existing bad issue.
As a result, application, infrastructure, and API testing continue to do what they’ve been doing for years. Security testing of AI will begin after they are unable to cover the gap left by those testing areas.
Where AI Security Testing Begins
This newer discipline goes after the failures that only show up once a model is in the loop. NIST defines AI red teaming as "a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI."
Put simply; you will be probing probabilistic, language-based and autonomous behaviors rather than static code paths. OWASP AI Testing Guide also makes this distinction.
Many risks have been sufficiently identified so far, and as the standard reference point for the field, OWASP Top 10 for LLM Applications lists Prompt Injection as #1.
A maliciously designed input can cause a model to ignore what it has been told to do, leak data it shouldn’t have access to, and/or engage in behavior it wasn’t intended to exhibit.
EchoLeak was identified as the first successful example of a craftily-designed input being used on a production LLM system. Although the update fixed that vulnerability, the method has since been recognized as a new form of attack and will no longer be considered simply a fluke/curiosity.
In addition to crafted inputs, the above list provides other examples of failures that can also serve to help determine the risk associated with using certain models for your business needs:
-
Sensitive Data Exposure: Where a model inadvertently reveals sensitive information about a person/company/etc. that should remain hidden from the model.
-
Misuse of Excessive Autonomous Authority: An agent of the model has sufficient independent authority to create and/or produce harm through its actions autonomously.
-
System Prompt Leakage: Added to the 2025 version of this report, here a model's internal control instructions are unintentionally exposed.
The excessive autonomous authority type of example highlights just how wide the possible impact zone might be. To illustrate this point, consider an artificial intelligence (“AI”) agent which has permission to issue refunds, modify internal company records, and send emails internally to employees. From a practical standpoint, this ability would be very desirable. Unfortunately, once the AI is misdirected, the exact same privileges that made it useful originally become a significant liability.
And honestly, the risk doesn't even need an attacker. In July 2025, Replit's AI coding agent deleted a live production database of the SaaS community, SaaStr. During a code freeze, it rolled right over instructions the user said he'd given it eleven times in capital letters, then generated thousands of fake records to paper over what it had done (The Register).
No one attacked. It merely had sufficient access and enough independence so that it could do significant damage by itself. In fact, it claimed the information was unrecoverable but it was recovered. The piece of this that should make sleep difficult is that most testing verifies how well your access controls are working. It was never designed to determine if the item already having that access can talk or forget or accidentally misuse it.
The scale isn't hypothetical anymore. It’s worth noting that the OECD's AI Incidents Monitor has logged more than 15,000 AI-related incidents and hazards.
How to Scope an AI Penetration Test
When determining which testing should occur, the practical approach is to consider AI systems as standalone "assets" rather than relying upon the controls that are already implemented.
Some possible starting points include:
-
Naming each model, agent, and integration included within your scope (just like naming a network segment or app).
-
Testing behavior, not simply code. Since probabilistic and autonomous systems require abuse scenarios, this may be something beyond just static checking of its own.
-
Keeping the typical layers in-scope for coverage. This includes all API access, application logic, and infrastructure with respect to its normal testing.
-
Mapping out the agent’s reach. All tools, permissions and down-stream services that the agent has direct access to will also have to go onto the attack path.
Common Mistakes That Leave AI Untested
When AI gets past all defenses, there are a limited number of ways that happens:
-
Only looking at the code. The development team tests the application developed using the model; they do not test how well the model handles malicious prompts.
-
Thinking the same access controls will apply. The access controls and input validation that have been established for structured applications do not translate well into unstructured language or unstructured actions from autonomous models.
-
Thinking of the co-pilot as a feature (not a surface). An AI model with associated tools and permissions is also an attack surface and it is likely to be very permissive regardless of how it is presented in the final product.
-
Stopping once you hit the API. If the scope continues to be limited by what is exposed via the API, you stop one layer above the attack surface.
Conclusion
No team stopped infrastructural testing once they started building web applications. No team cancelled application testing when they moved to using APIs.
They continued to build upon their existing program. We are being asked to do the same thing again. As we continue to add autonomy to our agent capabilities, we will need to extend our testing to include the additional behaviors that have been added.
However, those additional layers of functionality underpinning the autonomous behavior will remain in place and unchanged.
We have seen this trend before. Infrastructure expanded into web applications, then onto APIs, and now onto AI. The teams that will lead the way will be those who will define the scope of the new area first, before anything forces them to do so.
Bringing AI Into Your Testing Program
Wondering whether your penetration testing scope still ends at the API?
Prescient Security is able to help teams test their AI-based security applications (such as machine learning models) in addition to the standard testing of user interfaces and API's. In doing so, we can provide additional insight regarding how these systems may be exploited through the use of "prompt injection", "excessive agency" or "abuse".
If you are interested in learning where your current application testing stops, and where your true threat surface really begins, then contact us today.
Click here to talk to one of our experts and understand which testing service is best for your organization.