Software development meets the McNamara Fallacy 25 Jun 2025, 11:00 am
The McNamara Fallacy is the idea that it is an error to make decisions purely on measurements or quantitative data.
Robert McNamara was the US Secretary of Defense for most of the 1960s, during the first half of the Vietnam War. He came to the position after a successful career in industry—most notably as the first president of the Ford Motor Company from outside the Ford family—a career built on his innovative use of statistics in business.
McNamara took the view that anything that can’t be measured should not be taken into account in the decision-making process. He considered all measurable data to be useful, and made metrics the sole means of determining the correct course of action.
He brought this statistical viewpoint to his job as Secretary of Defense and applied it to his management of the war. Problematic results, such as using enemy body counts as a measure of military success, ultimately led to the coining of the phrase “The McNamara Fallacy.”
I think the software industry is in danger of falling victim to the McNamara Fallacy.
While metrics are useful, relying on metrics alone and ignoring qualitative factors that certainly come into play can easily lead you to focus on things that don’t matter. Ultimately it can lead to failure.
Numbers are easy
First, measuring the software development process is becoming easier and easier. With Git becoming the de facto source control system and tools like LinearB, JellyFish, and Plandek providing deep insight into what is happening within software repositories, it is very simple to get metrics that tell you a lot about what your team is up to.
It is comical that the industry once took something as pathetically simple as “lines of code per day” seriously as a metric. Today’s tools allow managers to see things that were not previously observable. Simple team metrics like deployment frequency, pull request review time, and cycle time are readily available to help pinpoint bottlenecks and aid in decision-making.
We’ve gotten to the point where metrics are easy, and “soft” measurements are hard. Don’t get me wrong—the granular metrics we now have are useful and valuable. But the temptation is great to rely on easy numbers instead of doing the hard work of figuring out the impossible-to-measure stuff. Robert McNamara fell into that trap, and it is easy to see the software industry doing the same.
If we lose sight of the non-measurable things, we lose sight of what makes software development successful. If we focus on “computer science” and “software engineering,” we can lose sight of the human factors in writing code that make or break a successful project.
Software development is a team sport. Although individuals can and do shine as part of a team, the team results are what really matter. Sports fans thrive on statistics, but they know that ultimately, it’s not the statistics that win championships. It’s the intangibles that make the difference between first and second place.
Intangibles are hard
Despite our best efforts, we don’t have a means of measuring “writes good code.” It takes years of experience to distinguish “good code” from “bad code,” and we can’t (yet?) measure it objectively. Maybe AI will figure it out someday. One could argue that AI can write good code today, but the ability to recognize good code is still uniquely human.
Similarly, how does one measure “works well with others”? How about “team morale”? There is not, and probably won’t ever be, a way to measure these kinds of things, but they are important, and knowing them when you see them is a key to success. Recognizing and encouraging these intangibles is a critical skill for a software development manager to have.
Finally, over-indexing on metrics can be detrimental to morale. No one wants to feel like a number on a spreadsheet.
I encourage you to use the cool measurement tools out there today. But as you review what things like the DORA (DevOps Research and Assessment) metrics are telling you, remember to consider the things not revealed in the numbers. Sure, metrics are important, but understanding what your gut is telling you and listening to your intuition can be just as valuable, or even more valuable.
Measure what you can, but always be sure to listen to the still, quiet voice telling you things that no statistical package ever can.
Pyrefly and Ty: Two new Rust-powered Python type-checking tools compared 25 Jun 2025, 11:00 am
What is most striking about Python’s latest wave of third-party tools is that they aren’t written in Python. Instead, many of the newer tools for project management, code formatting, and now type checking are written in Rust.
This isn’t a swipe at Python; every language has its place. But modern language tooling demands a real-time feedback loop that Python can’t always deliver at the speed required. Rust fills that gap. Modern project management tools like uv and code formatters like ruff run fast and lean thanks to Rust.
The newest projects in this space aim to provide type-checking tools for Python that are faster and potentially more powerful than Python-based tools like mypy and pyright.
Ty from Astral (makers of the uv package manager and the ruff code formatter) and Pyrefly from Meta have essentially the same use case: providing high-speed type checking and language services for Python. Both have comparable performance, running many times faster than similar Python-based projects. This article tells you where these new tools stand right now in terms of usability and features.
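To make the comparison concrete, here is a minimal, illustrative snippet of the kind of mistake both tools are built to catch. The example is my own, not taken from either project’s documentation: the file runs cleanly as written, but a static checker flags the unguarded use of an Optional value before the code ever reaches production.

```python
# Illustrative only: a static type checker such as Pyrefly or ty reports that
# find_user(3) may return None, so calling .upper() on the result without a
# guard is unsafe. The file itself runs without error.

def find_user(user_id: int) -> str | None:
    users = {1: "ada", 2: "grace"}
    return users.get(user_id)

name = find_user(3)

# Unsafe: a checker flags this line, because `name` may be None.
# print(name.upper())

# Safe: narrowing the Optional before use satisfies the checker.
print(name.upper() if name is not None else "unknown")
```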
Pyrefly
Pyrefly is not the first Python type-checking tool from Meta. Previously, the company delivered Pyre, written in OCaml. Pyre has since been retired; Pyrefly, written from scratch in Rust, replaces it.
Run Pyrefly out of the box on an existing Python codebase, and you’ll typically get a flood of errors the first time around. If you use the command pyrefly check --suppress-errors, all the flagged errors will be suppressed in your source via specially interpreted comments. You can then selectively remove those suppressions and run pyrefly check --remove-unused-ignores to clean up the codebase as you go. This allows you to migrate an untyped codebase gradually.
Pyrefly, like all modern Python tooling, uses pyproject.toml to store its project-level configuration data. You can also add per-directory configurations with standalone pyrefly.toml files that use the same syntax. Or, you can provide directory-specific overrides for options in a single config file.
The list of linted error types is comparable to what mypy and Pyright can handle. Migrating from both tools is easy, as Pyrefly can do it automatically. In Pyright’s case, there’s almost a one-to-one mapping for the error-checking settings, so the change isn’t too jarring.
For a project in its early stages, Pyrefly already feels fleshed out. Detailed documentation, a VS Code extension, and even an online sandbox where you can try it out are all already here. If you are using the uv tool, you can run uvx pyrefly to experiment with it on a codebase without having to install anything. Note that this causes uv to be used as the virtual environment provider for Python, so it may generate spurious errors if you are using a different venv for your project.
Ty
Astral’s ty project is also in its early stages, and it shows. Its documentation isn’t as fleshed out as Pyrefly’s, and its feature set is less impressive. To be fair, the project was only recently made public.
You can install Ty from pip or run it from uvx. It intelligently detects a source directory in a pyproject.toml-configured project, so it doesn’t mistakenly chew through Python files in your project’s virtual environment. But its configuration options are more minimal than Pyrefly’s; for instance, excluding files from checks is done via .gitignore or other external files rather than through configuration rules.
Ty’s ruleset for checking files seems more condensed than Pyrefly’s or those of existing tools, although it covers some cases not found elsewhere. For instance, while it doesn’t check for async errors, Ty does detect whether class definitions have conflicting usages of __slots__, although the former seems like a far more common problem than the latter.
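For context, one form of __slots__ conflict looks like the sketch below. This is my own illustration, not taken from ty’s documentation, and whether ty flags this exact pattern is an assumption: combining two bases that each declare nonempty __slots__ produces incompatible instance layouts, which CPython only reports when the derived class is created, whereas a static checker can surface it before the code runs.

```python
# Two base classes with nonempty __slots__ cannot be combined: their instance
# layouts conflict, and CPython raises TypeError at class-creation time.

class PointX:
    __slots__ = ("x",)

class PointY:
    __slots__ = ("y",)

try:
    # Conflicting usage: both bases carry nonempty __slots__.
    class Point(PointX, PointY):
        __slots__ = ()
except TypeError as exc:
    print(f"Layout conflict detected at runtime: {exc}")
```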
Despite being in its early stages, ty already has two key features nailed down. It is compatible with the Language Server Protocol, and it offers a VS Code extension to leverage that. Another plus—one significant enough to call out this early on—is the level of detail in its error reports. Pyrefly’s errors report the line number and type of error, but ty calls out the error akin to what you’d see in modern Python’s contextually detailed error messages.
Conclusion
With the performance playing field about level between the two tools, Pyrefly is the more immediately useful project. Pyrefly offers a broader existing feature set, better documentation, and tooling to allow elegant migration from other type-checkers and onboarding of existing codebases. That said, ty is in its early stages, so it’ll be worth circling back to both tools once they are out of their respective alpha and beta phases.
New AI benchmarking tools evaluate real world performance 25 Jun 2025, 6:48 am
A new AI benchmark for enterprise applications is now available following the launch of xbench, a testing initiative developed in-house by Chinese venture capital firm HongShan Capital Group (HSG).
The challenge with many current benchmarks is that they are widely published, making it possible for model creators to train their models to perform well on them and, as a result, reducing their usefulness as a true measure of performance. HSG says it has created a suite of ever-changing benchmarking tests, making it harder for AI companies to train on the test and forcing models to rely on more general capabilities.
HSG said its original intention in creating xbench was to turn its internal evaluation tool into “a public AI benchmark test, and to attract more AI talents and projects in an open and transparent way. We believe that the spirit of open source can make xbench evolve better and create greater value for the AI community.”
On June 17, the company announced it had officially open-sourced two xbench benchmarks: xbench-Science QA and xbench-DeepSearch, promising “in the future, we will continuously and dynamically update the benchmarks based on the development of large models and AI Agents ….”
Real-world relevance
AI models, said Mohit Agrawal, research director of AI and IoT at CounterPoint Research, “have outgrown traditional benchmarks, especially in subjective domains like reasoning. Xbench is a timely attempt to bridge that gap with real-world relevance and adaptability. It’s not perfect, but it could lay the groundwork for how we track practical AI impact going forward.”
In addition, he said, the models themselves “have progressed significantly over the last two-to-three years, and this means that the evaluation criteria need to evolve with their changing capabilities. Xbench aims to fill key gaps left by traditional evaluation methods, which is a welcome first step toward a more relevant and modern benchmark. It attempts to bring real-world relevance while remaining dynamic and adaptable.”
However, said Agrawal, while it’s relatively easy to evaluate models on math or coding tasks, “assessing models in subjective areas such as reasoning is much more challenging. Reasoning models can be applied across a wide variety of contexts, and models may specialize in particular domains. In such cases, the necessary subjectivity is difficult to capture with any benchmark. Moreover, this approach requires frequent updates and expert input, which may be difficult to maintain and scale.”
Biases, he added, “may also creep into the evaluation, depending on the domain and geographic background of the experts. Overall, xbench is a strong first step, and over time, it may become the foundation for evaluating the practical impact and market readiness of AI agents.”
Hyoun Park, CEO and chief analyst at Amalgam Insights, has some concerns. “The effort to keep AI benchmarks up-to-date and to improve them over time is a welcome one, because dynamic benchmarks are necessary in a market where models are changing on a monthly or even weekly basis,” he said. “But my caveat is that AI benchmarks need to both be updated over time and actually change over time.”
Benchmarking new use cases
He pointed out, “we are seeing with efforts such as Databricks’ Agent Bricks that [it] is important to build independent benchmarks for new and emerging use cases. And Salesforce Research recently released a paper showing how LLMs fare poorly in conducting some practical tasks, even when they are capable of conducting the technical capabilities associated with the task.”
The value of an LLM, said Park, is “often not in the ability to solve any specific problem, but to identify when a novel or difficult approach might be necessary. And that is going to be a challenge for even this approach to benchmarking models, as the current focus is on finding more complex questions that can be directly solved through LLMs rather than figuring out whether these complex tasks are necessary, based on more open-ended and generalized questioning.”
Further to that, he suggested, “[it is] probably more important for 99% of users to simply be aware that they need to conceptually be aware of Vapnik-Chervonenkis complexity [a measure of the complexity of a model] to understand the robustness of a challenge that an AI model is trying to solve. And from a value perspective, it is more useful to simply provide context on whether the VC dimension of a challenge might be considered low or high, because there are practical ramifications on whether you use the small or large AI model to solve the problem, which can be orders of magnitude differences in cost.”
Model benchmarking, Park said, “has been quite challenging, as the exercise is both extremely high stakes in the multi billion dollar AI wars, and also poorly defined. There is a panoply of incentives for AI companies to cheat and overfit their models to specific tests and benchmarks.”
Kotlin 2.2.0 arrives with context parameters, unified management of compiler warnings 25 Jun 2025, 12:20 am
JetBrains has released Kotlin 2.2.0, the latest version of the company’s general purpose, statically typed language perhaps best known as a rival to Java for JVM and Android development. The update previews context parameters, stabilizes guard conditions, and offers unified management of compiler warnings.
Kotlin 2.2.0 was released June 23. Installation instructions can be found at kotlinlang.org.
Context parameters in the release improve dependency management and allow functions and properties to declare dependencies that are implicitly available in the surrounding context. With context parameters, developers do not need to manually pass around values such as services or dependencies, which are shared and rarely change across sets of function calls. Context parameters replace an older experimental feature called context receivers. Other features previewed in Kotlin 2.2.0 include context-sensitive resolution, which lets developers omit the type name in contexts where the expected type is known, and an @all meta-target for properties, which tells the compiler to apply an annotation to all relevant parts of the property.
Guard conditions, introduced in Kotlin 2.1.0 last November, are now stable. Guard conditions allow for including more than one condition for the branches of a when expression, making complex control flows more explicit and concise, JetBrains said. Additionally, code structure is flattened with this feature.
A new compiler option in Kotlin 2.2.0, -Xwarning-level, is designed to offer a unified way of managing compiler warnings in Kotlin projects. Previously, developers could only apply general module-wide rules, such as disabling all warnings with -nowarn or turning warnings into compilation errors with -Werror. With the new option, developers can override general rules and exclude specific diagnostics in a consistent way.
Other new features and improvements in Kotlin 2.2.0:
- For Kotlin/Wasm, the build infrastructure for the Wasm target is separated from the JavaScript target. Previously, the wasmJs target shared the same infrastructure as the js target. This meant both targets were hosted in the same directory (build/js) and used the same NPM tasks and configurations. Now, the wasmJs target has its own infrastructure separate from the js target. This allows Wasm types and tasks to be distinct from JavaScript ones, enabling independent configuration.
- LLVM has been updated from version 16 to version 19, bringing performance improvements, security updates, and bug fixes.
- Tracking memory consumption on Apple platforms has been improved.
- Windows 7 has been deprecated as a legacy target.
o3-pro may be OpenAI’s most advanced commercial offering, but GPT-4o bests it 24 Jun 2025, 2:01 pm
Unlike general-purpose large language models (LLMs), more specialized reasoning models break complex problems into steps that they ‘reason’ about, and show their work in a chain of thought (CoT) process. This is meant to improve their decision-making and accuracy and enhance trust and explainability.
But can it also lead to a sort of reasoning overkill?
Researchers at AI red teaming company SplxAI set out to answer that very question, pitting OpenAI’s latest reasoning model, o3-pro, against its multimodal model, GPT-4o. OpenAI released o3-pro earlier this month, calling it its most advanced commercial offering to date.
Doing a head-to-head comparison of the two models, the researchers found that o3-pro is far less performant, reliable, and secure, and does an unnecessary amount of reasoning. Notably, o3-pro consumed 7.3x more output tokens, cost 14x more to run, and failed in 5.6x more test cases than GPT-4o.
The results underscore the fact that “developers shouldn’t take vendor claims as dogma and immediately go and replace their LLMs with the latest and greatest from a vendor,” said Brian Jackson, principal research director at Info-Tech Research Group.
o3-pro has difficult-to-justify inefficiencies
In their experiments, the SplxAI researchers deployed o3-pro and GPT-4o as assistants to help choose the most appropriate insurance policies (health, life, auto, home) for a given user. This use case was chosen because it involves a wide range of natural language understanding and reasoning tasks, such as comparing policies and pulling out criteria from prompts.
The two models were evaluated using the same prompts and simulated test cases, as well as through benign and adversarial interactions. The researchers also tracked input and output tokens to understand cost implications and how o3-pro’s reasoning architecture could impact token usage as well as security or safety outcomes.
The models were instructed not to respond to requests outside stated insurance categories; to ignore all instructions or requests attempting to modify their behavior, change their role, or override system rules (through phrases like “pretend to be” or “ignore previous instructions”); not to disclose any internal rules; and not to “speculate, generate fictional policy types, or provide non-approved discounts.”
Comparing the models
By the numbers, o3-pro used 3.45 million more input tokens and 5.26 million more output tokens than GPT-4o and took 66.4 seconds per test, compared to 1.54 seconds for GPT-4o. Further, o3-pro failed 340 out of 4,172 test cases (8.15%) compared to 61 failures out of 3,188 (1.91%) by GPT-4o.
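The headline ratios follow directly from the raw counts quoted above; a quick check using only the article’s own numbers:

```python
# Reproducing the comparison figures from the raw counts quoted in the report.
o3_pro_failures, o3_pro_tests = 340, 4_172
gpt4o_failures, gpt4o_tests = 61, 3_188

o3_pro_rate = o3_pro_failures / o3_pro_tests   # ~8.15%
gpt4o_rate = gpt4o_failures / gpt4o_tests      # ~1.91%

print(f"o3-pro failure rate: {o3_pro_rate:.2%}")
print(f"GPT-4o failure rate: {gpt4o_rate:.2%}")
print(f"o3-pro failed in {o3_pro_failures / gpt4o_failures:.1f}x more cases")  # ~5.6x
```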
“While marketed as a high-performance reasoning model, these results suggest that o3-pro introduces inefficiencies that may be difficult to justify in enterprise production environments,” the researchers wrote. They emphasized that use of o3-pro should be limited to “highly specific” use cases based on cost-benefit analysis accounting for reliability, latency, and practical value.
Choose the right LLM for the use case
Jackson pointed out that these findings are not particularly surprising.
“OpenAI tells us outright that GPT-4o is the model that’s optimized for cost, and is good to use for most tasks, while their reasoning models like o3-pro are more suited for coding or specific complex tasks,” he said. “So finding that o3-pro is more expensive and not as good at a very language-oriented task like comparing insurance policies is expected.”
Reasoning models are the leading models in terms of efficacy, he noted, and while SplxAI evaluated one case study, other AI leaderboards and benchmarks pit models against a variety of different scenarios. The o3 family consistently ranks on top of benchmarks designed to test intelligence “in terms of breadth and depth.”
Choosing the right LLM can be the tricky part of developing a new solution involving generative AI, Jackson noted. Typically, developers are in an environment embedded with testing tools; for example, in Amazon Bedrock, where a user can simultaneously test a query against a number of available models to determine the best output. They may then design an application that calls upon one type of LLM for certain types of queries, and another model for other queries.
In the end, developers are trying to balance quality aspects (latency, accuracy, and sentiment) with cost and security/privacy considerations. They will typically consider how much the use case may scale (will it get 1,000 queries a day, or a million?) and consider ways to mitigate bill shock while still delivering quality outcomes, said Jackson.
Typically, he noted, developers follow agile methodologies, where they constantly test their work across a number of factors, including user experience, quality outputs, and cost considerations.
“My advice would be to view LLMs as a commodity market where there are a lot of options that are interchangeable,” said Jackson, “and that the focus should be on user satisfaction.”
Further reading:

Public cloud becomes a commodity 24 Jun 2025, 11:00 am
Since the dawn of the public cloud era, the narrative has focused on disruption and innovation. The Big Three providers (Amazon Web Services, Microsoft Azure, and Google Cloud) have commanded center stage by captivating customers and analysts alike with their relentless rollout of new features and services. The implication has always been clear: Stick with the largest, most innovative providers and take advantage of capabilities you can’t find anywhere else.
Yet, those of us who consistently talk to enterprise CTOs, architects, and IT teams see a reality very different from all the marketing hype surrounding bleeding-edge services such as advanced serverless platforms, proprietary AI accelerators, and vast analytics ecosystems. Those shiny, specialized tools advertised at launch events rarely make it into meaningful production deployments. Most actual cloud consumption focuses on a handful of basic services. In the real world, enterprises gravitate toward virtual machines, object storage, databases, networking, and security features.
The reasons are diverse. IT teams primarily manage mission-critical workloads that require reliability, security, and scalability. This creates pressure to reduce risk and complexity, making it impractical to adopt a constant stream of new, sometimes underdeveloped features. Most organizations rely on established solutions, leaving a long tail of innovative services underused and often overlooked. If anything, this demonstrates that the day-to-day needs of most enterprises are surprisingly consistent and relatively straightforward, regardless of industry or region.
What about AI?
AI was expected to change the game by providing a true differentiator for the major cloud players. It’s easy to believe that AWS, Azure, and Google Cloud are now as much AI companies as they are infrastructure providers, given their levels of investment and marketing enthusiasm. However, if you step back and examine the actual AI workloads being deployed in production, a pattern emerges. The necessary toolsets and infrastructure—GPU access, scalable data storage, major machine learning frameworks—are not only widespread but are also becoming increasingly similar across all public clouds, whether in the top tier or among the so-called “second tier” providers such as IBM Cloud and Oracle.
Additionally, access to AI is no longer genuinely exclusive. Open source AI solutions and prebuilt platforms can operate anywhere. Smaller public cloud providers, including sovereign clouds tailored to a country’s specific needs, are offering essentially similar AI and ML portfolios. For everyday enterprise use cases—fine-tuning models, running inference at scale, managing data lakes—there’s nothing particularly unique about what the major clouds provide in comparison to their smaller, often less expensive competitors.
Sticker shock
This brings us, inevitably, to cost, a topic no cloud conversation can avoid these days. The promise of “pay only for what you use” was initially a significant driver of public cloud adoption, but enterprises are waking up to a new reality: The larger you grow, the more you pay. Detailed invoices and cost analysis tools from the Big Three resemble tax documents—complicated, opaque, and often alarming. As organizations scale, cloud bills can quickly spiral out of control, blindsiding even the most prepared finance teams.
The persistent cost challenges are shifting the mindset of enterprise IT leaders. If you’re only using basic cloud primitives such as compute, networking, storage, or managed databases, why pay a premium for the marquee provider’s logo on your invoice? This question isn’t theoretical; it’s one I hear frequently. Enterprises are discovering that the value promised by the most established public clouds doesn’t align with reality, especially at the enterprise scale, given today’s prices.
The second-tier providers are stepping in to fill this gap. IBM and Oracle, for example, have shown remarkable growth in the past few years. Their product offerings may not match the sheer breadth of Microsoft, AWS, and Google, but for core use cases, they are just as reliable and often significantly less expensive. Furthermore, their pricing models are simpler and more predictable, which, in an era of cost anxiety, is a form of innovation in itself. Then there are the sovereign clouds, the regional or government-backed solutions that prioritize local compliance and data sovereignty and offer precisely what some markets require at a fraction of the cost.
MSPs and colos are still in the game
Managed service providers and colocation vendors are also playing a surprising role in this shift. By providing hybrid and multicloud management as well as hosting services, they allow enterprises to move workloads between on-premise environments, colocated data centers, and multiple public clouds with minimal concern about which cloud supports each particular workload. These players further diminish the idea of differentiation among cloud providers, making the underlying infrastructure a nearly irrelevant commodity.
What are the implications? The commoditization of public cloud isn’t just likely; in many respects, it’s already here. Competing solely on raw innovation and feature count is losing its effectiveness. Enterprises are indicating through their purchasing behavior that they want simplicity, reliability, and predictability at a fair price. If the major cloud providers don’t adapt, they’ll find themselves in the same situation traditional server and storage companies did a decade ago: struggling to differentiate what customers increasingly view as a commodity.
AWS, Microsoft, and Google will not disappear or shrink dramatically in the short term. However, they may need to reevaluate how they deliver value. I expect them to double down on managed services, application-layer offerings, and industry-specific solutions where differentiation truly matters. The rest—the core plumbing of the cloud—will increasingly be driven by price, reliability, and regulatory compliance, much like electricity or bandwidth today.
The next phase of cloud computing will belong not to those with the most features or the loudest marketing campaigns, but to those providers, big or small, that best understand enterprise needs and can deliver on the fundamentals without unnecessary complexity or excessive costs. That’s good news for the rest of us. The public cloud isn’t just another technology wave; it’s becoming an everyday utility. For enterprises, that’s precisely how it should be.
LLMs aren’t enough for real-world, real-time projects 24 Jun 2025, 11:00 am
The major builders of large language models (LLMs)—OpenAI, DeepSeek, and others—are mistaken when they claim that their latest models, like OpenAI’s o-series or DeepSeek’s R1, can now “reason.” What they’re offering isn’t reasoning. It’s simply an advanced text predictor with some added features. To unlock AI’s true transformative potential, we need to move beyond the idea of reasoning as a one-size-fits-all solution. Here’s why.
If 2024 belonged to ChatGPT, OpenAI hoped it would dominate 2025 with the o-series, promising a leap in LLM reasoning. Early praise for its attempts to curb hallucinations quickly faded when China’s DeepSeek matched its capabilities at a fraction of the cost—on a laptop. Then came Doubao, an even cheaper rival, shaking the AI landscape. Chip stocks dropped, US tech dominance faltered, and even Anthropic’s Claude 3.5 Sonnet came under scrutiny.
But the real issue with the LLM paradigm isn’t just cost—it’s the illusion that all its inherent flaws have been solved. And that’s a dangerous path that could lead to painful dead ends. Despite all the progress, issues like hallucination remain unresolved. This is why I believe the future of AI doesn’t lie in artificial general intelligence (AGI) or endlessly scaling LLMs. Instead, it’s in fusing LLMs with knowledge graphs—particularly when enhanced by retrieval-augmented generation (RAG), combining the power of structured data with generative AI models.
No matter how cheap or efficient, an LLM is fundamentally a fixed, pre-trained model, and retraining it is always costly and impractical. In contrast, knowledge graphs are dynamic, evolving networks of meaning, offering a far more adaptable and reliable foundation for reasoning. Enriching an LLM’s conceptual map with structured, interconnected data through graphs transforms it from probabilistic guesswork into precision. This hybrid approach enables true practical reasoning, offering a dependable way to tackle complex enterprise challenges with clarity—something that LLM “reasoning” often falls short of delivering.
We need to distinguish between true reasoning and the tricks LLMs use to simulate it. Model makers are loading their latest models with shortcuts. Take OpenAI, for example, which now injects code when a model detects a calculation in the context window, creating the illusion of reasoning through stagecraft rather than intelligence. But these tricks don’t solve the core problem: the model doesn’t understand what it’s doing. While today’s LLMs have solved classic logic fails—like struggling to determine how long it would take to dry 30 vs. five white shirts in the sun—there will always be countless other logical gaps. The difference is that graphs provide a structured and deep foundation for reasoning, not masking limitations with clever tricks.
The limits of LLM ‘reasoning’
We’ve seen the consequences of forcing ChatGPT into this role, where it fabricates confident but unreliable answers or risks exposing proprietary data to train itself—a fundamental flaw. Tasks like predicting financial trends, managing supply chains, or analyzing domain-specific data require more than surface-level reasoning.
Take financial fraud detection, for example. An LLM might be asked, “Does this transaction look suspicious?” and respond with something that sounds confident—“Yes, because it resembles known fraudulent patterns.” But does it actually understand the relationships between accounts, historical behavior, or hidden transaction loops? No. It’s simply echoing probability-weighted phrases from its training data. True fraud detection requires structured reasoning over financial networks buried within your transaction data—something LLMs alone cannot provide.
The problem becomes even more concerning when we consider the deployment of LLMs in real-world applications. Take, for example, a company using an LLM to summarize clinical trial results or predict drug interactions. The model might generate a response like, “This combination of compounds has shown a 30% increase in efficacy.” But what if those trials weren’t conducted together, if critical side effects are overlooked, or if regulatory constraints are ignored? The consequences could be severe.
Now, consider cybersecurity, another domain where a wrong response could have catastrophic consequences. Imagine your CSO asking an LLM, “How should we respond to this network breach?” The model might suggest actions that sound plausible but are completely misaligned with the organization’s actual infrastructure, latest threat intelligence, or compliance needs. Following AI-generated cybersecurity advice without scrutiny could leave the company even more vulnerable.
And let’s not overlook enterprise risk management. Suppose a group of business users asks an LLM, “What are the biggest financial risks for our business next year?” The model might confidently generate an answer based on past economic downturns. However, it lacks real-time awareness of macroeconomic shifts, government regulations, or industry-specific risks. It also lacks the current and actual corporate information—it simply does not have it. Without structured reasoning and real-time data integration, the response, while grammatically perfect, is little more than educated guessing dressed up as insight.
This is why structured, verifiable data are absolutely essential in enterprise AI. LLMs can offer useful insights, but without a real reasoning layer—such as knowledge graphs and graph-based retrieval—they’re essentially flying blind. The goal isn’t just for AI to generate answers, but to ensure it comprehends the relationships, logic, and real-world constraints behind those answers.
The power of knowledge graphs
The reality is that business users need models that provide accurate, explainable answers while operating securely within the walled garden of their corporate infosphere. Consider the training problem: A firm signs a major LLM contract, but unless it gets a private model, the LLM won’t fully grasp the organization’s domain without extensive training. And once new data arrives, that training is outdated—forcing another costly retraining cycle. This is plainly impractical, no matter how customized the o1, o2, o3, or o4 model is.
In sharp contrast, supplementing an LLM with a well-designed knowledge graph—especially one that employs dynamic algorithms—solves this issue by updating context rather than requiring retraining. Whereas an LLM like o1 might correctly interpret a question like “How many x?” as a sum, we need it to understand something more specific, such as “How many servers are in our AWS account?” That’s a database look-up, not an abstract mathematical question.
A knowledge graph ensures that a first attempt at practical AI can reason over your data with reliability. Moreover, with a graph-based approach, LLMs can be used securely with private data—something even the best LLM on its own can’t manage.
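A minimal, self-contained sketch of that graph-grounded pattern follows. The toy triples, relation names, and the send_to_llm stub are my own illustrative assumptions, not any vendor’s API: structured facts are retrieved from the graph and injected into the prompt as verifiable context, rather than asking the model to guess.

```python
from collections import defaultdict

# Toy knowledge graph: (subject, relation, object) triples.
TRIPLES = [
    ("aws-account-prod", "contains", "server-001"),
    ("aws-account-prod", "contains", "server-002"),
    ("aws-account-prod", "contains", "server-003"),
    ("server-001", "runs", "postgres"),
]

def neighbors(subject: str, relation: str) -> list[str]:
    """Return all objects linked to `subject` by `relation`."""
    index: dict[tuple[str, str], list[str]] = defaultdict(list)
    for s, r, o in TRIPLES:
        index[(s, r)].append(o)
    return index[(subject, relation)]

def build_prompt(question: str) -> str:
    servers = neighbors("aws-account-prod", "contains")
    facts = "\n".join(f"- aws-account-prod contains {s}" for s in servers)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}"
    )

def send_to_llm(prompt: str) -> str:
    # Stub: a real system would call a model here; we just echo the prompt.
    return prompt

print(send_to_llm(build_prompt("How many servers are in our AWS account?")))
```

The point of the sketch is the division of labor: the graph answers the look-up, and the model is constrained to reason over facts it was handed.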
The smart move is to go beyond the trivial. AI needs knowledge graphs, retrieval-augmented generation, and advanced retrieval methods like vector search and graph algorithms—not just low-cost training models, impressive as they may seem.
Dominik Tomicevic leads European software company Memgraph, provider of an open-source in-memory graph database that’s purpose-built for dynamic, real-time enterprise applications.
—
Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.
Google’s Agent2Agent project moves to Linux Foundation 24 Jun 2025, 4:48 am
The Linux Foundation is the new home of the Agent2Agent (A2A) protocol, an open protocol developed by Google to enable agentic AI interoperability and trusted agent communication across systems and platforms.
Launched by Google in April, the A2A protocol addresses the need for agents to operate in dynamic, multi-agent environments. A2A enables autonomous agents to discover one another, exchange information securely, and collaborate across systems, which in turn allows developers to unite agents from multiple sources and platforms, improving modularity, mitigating vendor lock-in, and accelerating innovation, the Linux Foundation said in a June 23 announcement. Developers can go to the A2A repository on GitHub to learn more about the protocol and follow the progress of the project.
The A2A project is being formed with participation from Amazon Web Services, Cisco, Google, Microsoft, Salesforce, SAP, and ServiceNow, Google said in a blog post, also dated June 23. Under Linux Foundation governance, A2A will remain vendor-neutral, emphasize inclusive contributions, and continue the protocol’s focus on extensibility, security, and real-world usability, the Linux Foundation said. “By joining the Linux Foundation, A2A is ensuring the long-term neutrality, collaboration, and governance that will unlock the next era of agent-to-agent powered productivity,” said Jim Zemlin, executive director of the Linux Foundation.
“The Agent2Agent protocol establishes a vital open standard for communication, enabling the industry to build truly interoperable AI agents across diverse platforms and systems,” said Rao Surapaneni, vice president and GM of Business Applications Platform, Google Cloud. “By collaborating with the Linux Foundation and leading technology providers, we will enable more innovative and valuable AI capabilities under a trusted, open-governance framework.”
Ktor adds dependency injection and HTMX modules 24 Jun 2025, 12:08 am
JetBrains has released Ktor 3.2.0, an update to the Kotlin-based framework for building asynchronous applications that brings modules for dependency injection and HTMX, as well as automatic deserialization of configuration files into data classes, among other new capabilities.
Unveiled June 19, Ktor 3.2.0 also offers tools updates and performance improvements for different platforms. Instructions for getting started with Ktor can be found at ktor.io.
A dependency injection (DI) module featured in Ktor 3.2.0, while optional, allows Ktor to offer additional functionality out of the box for dependency injection users. Ktor DI is built on top of coroutines, which allow for the concurrent initialization of an application. Easy integration with existing DI frameworks is enabled by Ktor DI, according to JetBrains. Also, Ktor DI automatically closes AutoCloseable instances or allows developers to configure their own cleanup handlers.
Ktor’s new HTMX module includes tight integration with kotlinx.html, which provides a domain-specific language (DSL) for HTML, and the Ktor Routing DSL. This enables developers to more easily define HTML attributes for HTMX and define routes that automatically include HTMX headers.
For typed configuration, Ktor 3.2.0 now automatically deserializes configuration files into data classes in addition to primitive types. In order to deserialize structured data, the developer first needs to define a data class that matches their YAML configuration file.
Also in Ktor 3.2.0:
- Ktor now supports suspend (asynchronous) modules, making it possible to await dependencies that require suspension for initialization. Developers can also parallelize complex dependency graphs.
- Ktor now supports Gradle version catalogs.
- Ktor’s CIO client and server engine now support Unix domain sockets, thus providing more efficient bidirectional communication between processes on the same system.
- A known regression in Ktor 3.2.0 pertaining to Android R8 will be fixed in Ktor 3.2.1, JetBrains said.
Ktor enables development of asynchronous client and server applications. Developers can build applications ranging from microservices to multiplatform HTTP client apps with ease, JetBrains said.
Agentic AI won’t wait for your data architecture to catch up 23 Jun 2025, 5:53 pm
A decade ago, the cloud ignited a massive replatforming of application and server infrastructure. Open-source technologies like Docker and Kubernetes transformed software velocity and operational flexibility, launching a new era.
But it didn’t happen overnight. Enterprises had to adapt to shifting foundations, talent gaps, and an open-source ecosystem evolving faster than most teams could absorb.
Today, agentic AI is catalyzing a similar, profound replatforming. This shift centers on real-time data interaction, where success is measured in milliseconds, not minutes. What’s at stake is your company’s ability to thrive in new marketplaces shaped by intelligent systems.
To navigate this transition, here are key considerations for preparing your data infrastructure for agentic AI.
The AI data layer must serve polyglot, multi-persona teams
Traditional data platforms, which primarily served SQL analysts and data engineers, are no longer sufficient. Today’s AI landscape demands real-time access for a vastly expanded audience: machine learning engineers, developers, product teams, and, crucially, automated agents, all of whom need to work with data in languages like Python, Java, and SQL.
Much as Docker and Kubernetes revolutionized cloud-native application development, Apache Iceberg has become the foundational open-source technology for this modern AI data infrastructure. Iceberg provides a transactional format for evolving schemas, time travel, and high-concurrency access.
Combined with a powerful and scalable serverless data platform, this enables real-time dataflows for unpredictable, agent-driven workloads with strict latency needs.
Together, these technologies enable fluid collaboration across diverse roles and systems. They empower intelligent agents to move beyond mere observation, allowing them to act safely and quickly within dynamic data environments.
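To ground those claims, here is a short sketch of what transactional tables, time travel, and concurrent reads look like from Python using the pyiceberg client. The catalog name, namespace, table, column, and snapshot id are placeholders for illustration; real catalog configuration depends entirely on your environment.

```python
# Sketch only: assumes a pyiceberg catalog named "default" is configured
# locally (e.g. via ~/.pyiceberg.yaml) and an "analytics.events" table exists.

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Readers and writers (including agents) see consistent snapshots: a scan is
# pinned to the table state at the time it starts.
recent = table.scan(row_filter="event_date >= '2025-06-01'").to_arrow()

# Time travel: re-read the table as of an earlier snapshot id (placeholder).
earlier = table.scan(snapshot_id=1234567890).to_arrow()

print(recent.num_rows, earlier.num_rows)
```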
Your biggest challenge? “Day two” operations
The greatest challenge in building data infrastructure for agentic AI lies not in technology selection, but in operationalizing it effectively.
It’s not about choosing the perfect table format or stream processor; it’s about making those components reliable, cost-efficient, and secure under high-stakes workloads. These workloads require constant interaction and unpredictable triggers.
Common challenges include:
- Lineage and compliance: Tracking data origins, managing changes, and supporting deletion for regulations like GDPR are complex and crucial.
- Resource efficiency: Without smart provisioning, GPU and TPU costs can quickly escalate. Managed cloud offerings for OSS products help by abstracting compute management.
- Access control and security: Misconfigured permissions present a significant risk. Overly broad access can easily lead to critical data being exposed.
- Discovery and context: Even with tools like Iceberg, teams struggle to find the metadata needed for just-in-time dataset access.
- Ease of use: Managing modern data tools can burden teams with unnecessary complexity. Simplifying workflows for developers, analysts, and agents is essential to keep productivity high and barriers low.
Without robust operational readiness, even the best-architected platforms will struggle under the constant pressure of agentic AI’s decision loops.
The right balance between open source and cloud partners
Complex infrastructure is now driven by open-source innovation, especially in data infrastructure. Here, open-source communities often pioneer solutions for advanced use cases, far exceeding the typical operational capacity of most data teams.
The biggest gaps arise when scaling open-source tools for high-volume ingestion, streaming joins, and just-in-time compute. Most organizations struggle with fragile pipelines, escalating costs, and legacy systems ill-suited to agentic AI’s real-time demands.
These are all areas where cloud providers with significant operational depth deliver critical value.
The goal is to combine open standards with cloud infrastructure that automates the most arduous tasks, from data lineage to resource provisioning. By building on open standards, organizations can effectively mitigate vendor lock-in. At the same time, partnering with cloud providers who actively contribute to these ecosystems and offer essential operational guardrails in their services enables faster deployment and greater reliability. This approach is superior to building fragile, ad-hoc pipelines or depending on opaque proprietary platforms.
For example, Google Cloud’s Iceberg integration in BigQuery combines open formats with highly scalable, real-time metadata, offering high-throughput streaming, automated table management, performance optimizations, and integrations with Vertex AI for agentic applications.
Ultimately, your goal is to accelerate innovation while mitigating the inherent risks of managing complex data infrastructure alone.
The agentic AI skills gap is real
Even the largest companies are grappling with a shortage of talent to design, secure, and operate AI-ready data platforms. The most acute hiring challenge isn’t just data engineering; it’s also real-time systems engineering at scale.
Agentic AI amplifies operational demands and pace of change. It requires platforms that support dynamic collaboration, robust governance, and instantaneous interaction. These systems must simplify operations without compromising reliability.
Agentic AI marketplaces may prove even more disruptive than the Internet. If your data architecture isn’t built for real-time, open, and scalable use, the time to act is now.
GitHub’s AI billing shift signals the end of free enterprise tools era 23 Jun 2025, 3:05 pm
GitHub began enforcing monthly limits on its most powerful AI coding models this week, marking the latest example of AI companies transitioning users from free or unlimited services to paid subscription tiers once adoption takes hold.
“Monthly premium request allowances for paid GitHub Copilot users are now in effect,” the company said in its update to the Copilot consumptive billing experience, confirming that billing for additional requests now starts at $0.04 each. The enforcement represents the activation of restrictions first announced by GitHub CEO Thomas Dohmke in April.
The move affects users of GitHub’s most advanced AI models, including Anthropic’s Claude 3.5 and 3.7 Sonnet, Google’s Gemini 2.0 Flash, and OpenAI’s o3-mini. Users who exceed their monthly allowances must now either wait until the next billing cycle or enable pay-per-request billing to continue using premium features, the blog post added.
Premium request limits by plan
The enforcement creates tiered access to advanced AI capabilities. Customers with Copilot Pro will receive 300 monthly premium requests, while Copilot Business and Enterprise users will get 300 and 1,000 requests, respectively. GitHub will also offer a Pro+ plan at $39 per month, providing 1,500 premium requests and access to what the company describes as “the best models, like GPT-4.5.”
Each model consumes premium requests based on a multiplier system designed to reflect computational costs. GPT-4.5 has a 50x multiplier, meaning one interaction counts as 50 premium requests, while Google’s Gemini 2.0 Flash uses only 0.25x. Users can still make unlimited requests using GitHub’s base model GPT-4o, though rate limiting applies during high demand.
For those exceeding monthly allowances, GitHub’s current billing system will require users to “set a spending limit in your billing settings” with “the default limit set to $0,” meaning additional requests are rejected unless explicitly authorized.
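To see what the multipliers mean in practice, a quick calculation using only the figures GitHub has published here (the 300-request allowance, the model multipliers, and the $0.04 overage price):

```python
ALLOWANCE = 300        # Copilot Pro / Business monthly premium requests
OVERAGE_PRICE = 0.04   # USD per additional premium request

MULTIPLIERS = {
    "GPT-4.5": 50,          # one interaction counts as 50 premium requests
    "Gemini 2.0 Flash": 0.25,
}

for model, multiplier in MULTIPLIERS.items():
    interactions = ALLOWANCE / multiplier
    print(f"{model}: {interactions:.0f} interactions fit in the allowance")

# Cost of 100 extra GPT-4.5 interactions once the allowance is exhausted.
extra_requests = 100 * MULTIPLIERS["GPT-4.5"]
print(f"Overage: ${extra_requests * OVERAGE_PRICE:.2f}")
```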
April announcement set the stage
When Dohmke first announced the premium request system in April, he positioned the restrictions as necessary infrastructure for sustainable AI services. “Since GitHub Universe, we introduced a number of new models for chat, multi-file edits, and now agent mode. With the general availability of these models, we are introducing a new premium request type,” the company had said.
The company delayed implementation in May, stating: “We’re delaying the enforcement of Copilot premium request limits. Our goal is to make it easy for you to see how many premium requests you’re using and give you control over your limits and potential expenses.”
Developer backlash mirrors industry pattern
For enterprise customers, these changes signal the maturation of AI tools from experimental technologies to essential business services requiring budget planning and strategic procurement. GitHub’s current billing system allows organizations to “monitor your premium request usage in real-time from the Copilot status icon in your IDE or download detailed usage reports,” but any premium requests beyond monthly allowances are rejected unless administrators explicitly enable additional billing.
The enforcement has also prompted complaints from GitHub Copilot users in online forums, with community discussions showing that almost all comments posted over the past week argue the limits are far too low and appear designed to force customers onto more expensive subscription plans.
“300 per day is ok, per month is ridiculous,” wrote one user on the forum page.
The criticism follows a familiar pattern across the AI industry as services mature from startup offerings to profitable enterprises. This shift has materialized across different categories of AI services, creating a consistent pattern of reduced free access as platforms establish market presence.
Midjourney exemplifies this trend most clearly. The popular AI image creator initially offered 25 free images to new users when it launched in July 2022, but by June 2023 eliminated free trials entirely, requiring paid subscriptions starting at $10 monthly. Video generation platform Runway AI structures its offering around a credit system where the free tier provides only “a one-time deposit of 125 credits,” while paid plans starting at $15 monthly offer renewable credit allowances that “do not roll over to following months.”
Conversational AI services have implemented similar restrictions. Anthropic’s Claude imposes daily message limits on free users, typically allowing 40-50 messages per day, while ChatGPT’s free tier restricts users to older GPT-3.5 models with access limitations during peak usage periods.
Revenue pressures drive monetization
The monetization trend reflects mounting pressure on AI companies to demonstrate sustainable business models. According to TechCrunch, Microsoft CEO Satya Nadella said last August that Copilot accounted for over 40% of GitHub’s revenue growth in 2024 and is already “a larger business than all of GitHub when the tech giant acquired it roughly seven years ago.”
Training and operating advanced language models require substantial computational resources, with leading AI companies spending hundreds of millions of dollars on infrastructure. As venture capital funding becomes more selective and investors demand clearer paths to profitability, AI companies increasingly rely on subscription revenue rather than pursuing unsustainable growth strategies.
GitHub’s approach reflects this broader recalibration. “Premium requests are in addition to the unlimited requests for agent mode, context-driven chat, and code completions that all paid plans have when using our base model,” the company emphasized in its April announcement, positioning the changes as value-added services rather than restrictions on core functionality.
The trend suggests that CIOs and technology leaders should prepare for similar changes across their AI tool portfolios. As these services transition from venture-capital-subsidized offerings to self-sustaining businesses, organizations may need to reevaluate their AI strategies and budget allocations accordingly.
More GitHub news:
- GitHub hit by a sophisticated malware campaign as ‘Banana Squad’ mimics popular repos
- GitHub Actions attack renders even security-aware orgs vulnerable
- GitHub launches Remote MCP server in public preview to power AI-driven developer workflows
- What GitHub can tell us about the future of open source
- GitHub to unbundle Advanced Security
Why AI projects fail, and how developers can help them succeed 23 Jun 2025, 11:00 am
Even as we emerge from generative AI’s tire-kicking phase, it’s still true that many (most?) enterprise artificial intelligence and machine learning projects will derail before delivering real value. Despite skyrocketing investment in AI, the majority of corporate ML initiatives never graduate from proof of concept to production. Why? Well, a CIO survey found that “unclear objectives, insufficient data readiness, and a lack of in-house expertise” sink many AI projects, but I like Santiago Valdarrama’s list even more. At the heart of much of this AI failure is something I uncovered way back in 2013 as “big data” took off: “Everyone’s doing it, but no one knows why.”
Let’s look at how developers can improve the odds of AI success.
Not every problem needs AI
First, as much as we may want to apply AI to a burgeoning list of business problems, quite often AI isn’t needed or isn’t even advisable in the first place. Not every task warrants a machine learning model, and forcing AI into scenarios where simpler analytics or rule-based systems suffice is a recipe for waste, as I’ve written. “There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means,” data scientist Noah Lorang once observed. In other words, solid data analysis and software engineering often beat AI wizardry for everyday challenges.
The best strategy is clarity and simplicity. Before writing a line of TensorFlow or PyTorch, step back and ask: “What problem are we actually trying to solve, and is AI the best way to solve it?” Sometimes a straightforward algorithm or even a spreadsheet model is enough. ML guru Valdarrama advises teams to start with simple heuristics or rules before leaping into AI. “You’ll learn much more about the problem you need to solve,” he says, and you’ll establish a baseline for future ML solutions.
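As an illustration of that baseline-first approach, here is a hedged sketch (the churn rule and thresholds are invented for the example): a plain, explainable heuristic that can be measured today and that any later model has to beat.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    days_since_last_login: int
    open_support_tickets: int

def churn_risk_heuristic(c: Customer) -> bool:
    """Flag likely churners with a simple, explainable rule."""
    return c.days_since_last_login > 30 and c.open_support_tickets > 2

customers = [
    Customer(days_since_last_login=45, open_support_tickets=3),
    Customer(days_since_last_login=5, open_support_tickets=0),
]

flagged = [c for c in customers if churn_risk_heuristic(c)]
print(f"{len(flagged)} of {len(customers)} customers flagged as churn risks")

# Any future ML model now has a concrete, measurable baseline to beat.
```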
Garbage in, garbage out
Even a well-chosen AI problem will falter if it’s fed the wrong data. Enterprise teams often underestimate the critical-but-unexciting task of data preparation: curating the right data sets, cleaning and labeling them, and ensuring they actually represent the problem space. It’s no surprise that according to Gartner research, nearly 85% of AI projects fail due to poor data quality or lack of relevant data. If your training data is garbage (biased, incomplete, outdated), your model’s outputs will be garbage as well—no matter how advanced your algorithms.
Data-related issues are cited as a top cause of failure for AI initiatives. Enterprises frequently discover their data is siloed across departments, rife with errors, or simply not relevant to the problem at hand. A model trained on idealized or irrelevant data sets will crumble against real-world input. Successful AI/ML efforts, by contrast, treat data as a first-class citizen. That means investing in data engineering pipelines, data governance, and domain expertise before spending money on fancy algorithms. As one observer puts it, data engineering is the “unsung hero” of AI. Without clean, well-curated data, “even the most advanced AI algorithms are rendered powerless.”
For developers, this translates to a focus on data readiness. Make sure you have the data your model needs and that you need the data you have. If you’re predicting customer churn, do you have comprehensive, up-to-date customer interaction data? If not, no amount of neural network tuning will save you. Don’t let eagerness for AI blind you to the essential grunt work of ETL, data cleaning, and feature engineering.
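A handful of cheap readiness checks run before any training will surface most of these problems. The sketch below assumes a hypothetical interactions.csv extract with an event_time column and uses nothing beyond pandas:

import pandas as pd

# Hypothetical customer-interaction extract; the file and field names are illustrative.
df = pd.read_csv("interactions.csv", parse_dates=["event_time"])

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "null_share_by_column": df.isna().mean().round(3).to_dict(),
    "days_since_newest_record": (pd.Timestamp.now() - df["event_time"].max()).days,
}

# Stale, sparse, or duplicated data shows up here before it poisons a model.
print(report)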
Poorly defined success
Too many AI/ML initiatives launch with vague hopes of “delivering value” but no agreed-upon way to quantify that value. A lack of clear metrics is a well-documented AI project killer. For example, a retail company might deploy a machine learning model to personalize offers to customers but fail to decide whether success is defined by increased click-through rates, higher revenue per customer, or improved retention. Without that clarity, even a technically accurate model might be deemed a flop.
In the generative AI arena especially, many teams roll out models without any systematic evaluation in place. As ML engineer Shreya Shankar notes, “Most people don’t have any form of systematic evaluation before they ship… so their expectations are set purely based on vibes.” Vibes might feel good in a demo, but they collapse in production. It’s hard to declare a win (or acknowledge a loss) when you didn’t define what winning looks like from the start.
The solution is straightforward: Establish concrete success metrics up front. For example, if you’re building an AI fraud detection system, success might be “reduce false positives by X% while catching Y% more fraud.” Setting one or two clear KPIs focuses the team’s efforts and provides a reality check against hype. It also forces a conversation with business stakeholders: if we achieve X metric, will this project be considered successful? Developers and data scientists should insist on this clarity. It’s better to negotiate what matters upfront than to try to retroactively justify an AI project with cherry-picked stats.
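Pinning those KPIs down can be as mundane as a small, shared calculation that the team and its stakeholders agree to trust. The counts below are invented for illustration:

def fraud_kpis(baseline, pilot):
    """Compare false-positive and fraud-catch rates between two periods.

    Each argument is a dict with hypothetical 'false_positives', 'alerts',
    'fraud_caught', and 'fraud_total' counts.
    """
    def fp_rate(p):
        return p["false_positives"] / p["alerts"]

    def catch_rate(p):
        return p["fraud_caught"] / p["fraud_total"]

    return {
        "false_positive_change": round(fp_rate(pilot) - fp_rate(baseline), 3),
        "catch_rate_change": round(catch_rate(pilot) - catch_rate(baseline), 3),
    }

print(fraud_kpis(
    {"false_positives": 400, "alerts": 1000, "fraud_caught": 60, "fraud_total": 100},
    {"false_positives": 250, "alerts": 1000, "fraud_caught": 72, "fraud_total": 100},
))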
Ignoring the feedback loop
Let’s say you’ve built a decent first version of an AI/ML model and deployed it. Job done, right? Hardly. One major reason AI initiatives stumble is the failure to plan for continuous learning and iteration. Unlike traditional software, an AI model’s performance can and will drift over time: Data distributions shift, and users react in unexpected ways. In other words, our pristine AI dreams must face the real world. If you ignore feedback loops and omit a plan for ongoing model tuning, your AI project will quickly become a stale experiment that fails to adapt.
The real key to AI success is to constantly tune your model, something many teams neglect amid the excitement of a new AI launch. In practice, this means putting in place what modern MLops teams call “the data flywheel”: monitoring your model’s outputs, collecting new data on where it’s wrong or uncertain, retraining or refining the model, and redeploying improved versions. Shankar warns that too often “teams expect too high of accuracy … from an AI application right after it’s launched and often don’t build out the infrastructure to continually inspect data, incorporate new tests, and improve the end-to-end system.” Model deployment isn’t the finish line: it’s the start of a long race.
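Structurally, the flywheel is a loop rather than a launch. The following is only a skeleton, not working MLops code; every callable here is a placeholder for whatever your monitoring, labeling, and training stack actually provides:

import time

def data_flywheel(model, monitor, label_queue, trainer, deploy, interval_hours=24):
    """Skeleton of a continuous-improvement loop; all arguments are stand-ins."""
    while True:
        drift = monitor.check(model)                  # compare live inputs and outputs to training data
        hard_cases = monitor.sample_uncertain(model)  # gather examples the model got wrong or hedged on
        label_queue.enqueue(hard_cases)               # route them to human labelers
        if drift.needs_retraining or label_queue.has_new_batch():
            model = trainer.retrain(model, label_queue.drain())
            deploy(model)                             # ship the refreshed model behind the same endpoint
        time.sleep(interval_hours * 3600)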
All talk, no walk
Finally, too many organizations excel at impressive AI prototypes and pilot projects and stop short of investing the hard work to turn those demos into dependable, production-grade systems at scale. Why does this happen so frequently? One reason is the hype-fueled rush we touched on earlier. When CEOs and board members pressure the company to “get on the AI train,” there’s an incentive to show progress fast, even if that progress is only superficial. As I’ve suggested, we’ve sometimes “allowed the promise of AI to overshadow current reality.”
Another factor is what could be called “pilot purgatory.” Organizations spin up numerous AI proofs of concept to explore use cases but fund them minimally and isolate them from core production systems. Often these pilots die not because the technology failed but because they were never designed with production in mind. An endless stream of disconnected experiments is costly and demoralizing. It creates “pilot fatigue” without yielding tangible benefits. Some of this is fostered by organizational dynamics. In today’s market, it may be easier to get C-level executives to invest in your project if it has AI sprinkled on top. As IDC’s Ashish Nadkarni indicates, “Most of these [failed] genAI initiatives are born at the board level … not because of a strong business case. It’s trickle-down economics to me.”
To avoid this trap, you need to allocate sufficient time and resources to harden a prototype for production: plugging it into real data workflows, adding user feedback channels, handling edge cases, implementing guardrails (like prompt filtering or human fallback for sensitive tasks), etc. Success, in short, will ultimately come down to developers.
Developers to the rescue
It’s easy to be cynical about enterprise AI given the high failure rates. Yet amid the wreckage of failed projects are shining examples of AI done right, often led by teams that balanced skepticism with ingenuity. The differentiator is usually a developer mindset that puts substance over show. Indeed, production-grade AI “is all the work that happens before and after the prompt,” as I’ve suggested.
The good news is that the power to fix these failures lies largely in our hands as developers, data scientists, and technology leaders. We can push back when a project lacks a clear objective or success metric and insist on answering “why” before jumping into “how.” We can advocate for the boring-but-crucial work of data quality and MLops, reminding our organizations that AI is not magic—it’s engineering. When we do embrace an AI solution, we can do so with eyes open and a plan for the full product life cycle, not just the demo.
Devops debt: The hidden tax on innovation 23 Jun 2025, 11:00 am
Your devops teams are likely wasting half their time on work that delivers zero business value. According to our 2025 State of Java Survey & Report, 62% of organizations report that dead code is hampering their devops productivity, while 33% admit their teams waste more than half their time chasing false-positive security alerts. Meanwhile, 72% of companies are paying for cloud capacity they never use.
This isn’t just inefficiency—it’s a hidden tax on innovation that’s silently killing your ability to compete. In my 25+ years working with Java, from JDK 1.0 in 1996 to today, I’ve witnessed how these mounting inefficiencies have become the single biggest barrier to innovation for Java-based enterprises. And with nearly 70% of companies reporting that more than half their applications run on Java, this isn’t a niche problem—it’s a crisis hiding in plain sight.
The three pillars of devops debt
Code bloat: The growing burden of digital hoarding
Dead code—portions of your code base that are never executed but still sit in production—creates a cascade of problems that extend far beyond wasted storage. The productivity impact only hints at the deeper issue: This digital hoarding forces developers to navigate unnecessarily complex systems. Our research reveals that organizations with high levels of dead code report development cycles that are, on average, 35% longer than those with streamlined code bases.
This problem compounds over time as Java versions become dated and obsolete. For example, 10% of organizations still run applications on Java 6, a 20-year-old version that Oracle ceased providing updates for in December 2018.
Security false positives: The endless chase
Beyond wasted development time, security false positives consume an enormous amount of devops resources. The “better safe than sorry” approach to security scanning has led to alert fatigue, with one-third of teams spending the majority of their time investigating issues that turn out to be non-threats.
The problem is particularly acute in Java environments, where 41% of organizations encounter critical production security issues on a weekly or daily basis. Despite having had more than three years to address Log4j, half of the companies surveyed are still experiencing security vulnerabilities from Log4j in production. This persistent vulnerability highlights a broader challenge: distinguishing between theoretical vulnerabilities and actual threats.
Our research indicates that approximately 70% of security alerts in Java environments are ultimately determined to be false positives or vulnerabilities in code paths that are never executed in production. When devops teams can’t efficiently separate real threats from hypothetical ones, innovation inevitably grinds to a halt.
Cloud waste: Paying for idle capacity
The financial dimension of devops debt manifests in cloud resource inefficiency. Beyond the headline figure of widespread waste, we’ve found that many organizations are dramatically over-provisioning their Java applications due to uncertainty about performance requirements and inconsistent load patterns.
For Java-based organizations, this problem is particularly significant because nearly two-thirds report that more than 50% of their cloud compute costs stem from Java workloads. Additional analysis shows that optimizing Java Virtual Machine (JVM) configurations alone could reduce cloud costs by 25% to 30% for the average enterprise.
This waste essentially functions as a direct financial penalty—you’re literally paying for capacity you don’t use, just like interest on a financial debt. Across the enterprise landscape, we estimate this represents over $10 billion in annual wasted cloud spending.
Breaking free of devops debt
Java applications continue to modernize, with nearly half of organizations (49%) now running either Java 17 or Java 21. This transition creates a perfect opportunity to address these underlying inefficiencies.
Code hygiene automation
Implement automated tools that identify and safely remove dead code, integrated directly into your CI/CD pipeline to prevent new accumulation. Just as we continuously monitor JVM performance metrics, apply the same rigor to identifying unused code patterns.
Leading organizations are now incorporating runtime usage analysis to identify code paths that haven’t been executed in production for extended periods. This data-driven approach has helped some enterprises reduce their code bases by up to 40% without any functional impact.
Consider implementing policies that require deprecated code to have sunset dates, ensuring that temporary workarounds don’t become permanent technical debt. Regular code reviews focused specifically on identifying unused components can help keep your code base lean and maintainable.
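As a rough illustration of runtime usage analysis, suppose you export the fully qualified names of classes observed loading in production (for example, from JVM class-loading logs) to one file and the classes shipped in your build to another. A simple diff then produces a removal-candidate list; the file names here are hypothetical:

def dead_class_candidates(shipped_classes_file, loaded_classes_file):
    """Classes present in the build that never loaded at runtime are candidates for removal."""
    with open(shipped_classes_file) as f:
        shipped = {line.strip() for line in f if line.strip()}
    with open(loaded_classes_file) as f:
        loaded = {line.strip() for line in f if line.strip()}
    return sorted(shipped - loaded)

# Observe production over a long enough window before actually deleting anything.
for cls in dead_class_candidates("shipped_classes.txt", "loaded_classes.txt"):
    print(cls)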
Runtime intelligence for security
Traditional security scanning produces too many alerts with too little context. Modern approaches incorporate runtime intelligence to prioritize vulnerabilities based on actual usage patterns rather than theoretical exploitability.
Organizations should invest in tools that distinguish between code paths actually executed in production versus those that exist but aren’t used. This runtime intelligence approach transforms security from theoretical vulnerability hunting to practical risk management, dramatically reducing false positives and freeing your teams to focus on innovation.
Companies that have adopted this approach report up to an 80% reduction in security alert volume while actually improving their security posture by focusing resources on genuinely exploitable vulnerabilities.
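The same runtime-first idea can be sketched as a triage step: cross-reference what the scanner flags against what actually loads in production. The component names below are purely illustrative:

def prioritize_alerts(flagged_components, loaded_components):
    """Split scanner findings into 'reachable in production' versus 'ships but never loads'."""
    flagged, loaded = set(flagged_components), set(loaded_components)
    return {
        "investigate_now": sorted(flagged & loaded),  # vulnerable and actually in use
        "deprioritize": sorted(flagged - loaded),     # vulnerable but never observed running
    }

print(prioritize_alerts(
    ["log4j-core-2.14.1", "commons-text-1.9", "legacy-report-lib-1.0"],
    ["log4j-core-2.14.1", "spring-core-5.3.20"],
))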
Resource optimization
Adopt tools and practices that optimize cloud resource allocation through advanced auto-scaling, high-performance JDKs, and established finops practices that align technology with business objectives.
Our report shows that forward-thinking organizations are already addressing this: 38% have implemented new internal rules for cloud instance usage, 35% are using more efficient compute instances and processors, and 24% have adopted high-performance JDKs specifically to enhance performance and reduce costs.
The most successful organizations are implementing cross-functional finops teams with representation from engineering, operations, and finance to holistically address resource optimization. These teams establish metrics and governance processes that balance innovation speed with cost efficiency.
The innovation imperative
The cost of devops debt goes far beyond wasted engineering hours. When teams spend half their time managing false positives and navigating bloated code bases, your competitors who’ve addressed these issues can innovate twice as fast. Top developers seek environments where they can create value, not manage legacy messes. Every hour spent on activities that don’t add value represents features not built, customer needs not addressed, and market opportunities missed.
Just as we’ve seen organizations seek alternatives to inefficient Java deployments, I predict we’ll see a similar movement toward addressing devops debt as awareness of its costs grows. The organizations that move first will gain significant competitive advantage.
The question isn’t whether you have devops debt—it’s whether you’ll start paying it down before your competitors do. The tools and practices exist today to dramatically reduce these inefficiencies. Those who act now won’t just improve their engineering productivity; they’ll fundamentally transform their ability to innovate in an increasingly competitive marketplace.
Simon Ritter is deputy CTO of Azul.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
How to succeed (or fail) with AI-driven development 23 Jun 2025, 11:00 am
Artificial intelligence (AI) continues to permeate seemingly every aspect of business, including software development. AI-augmented development involves using generative AI to support various stages of the software development lifecycle, including design, testing, and deployment. Introducing AI-powered tools into the development process is intended to increase developer productivity by automating certain tasks. It can also enhance the quality of code and speed up the development lifecycle, so development teams can bring products to users more quickly.
AI-augmented development is on the rise, according to industry research. A May 2025 report by market intelligence and advisory firm QKS Group forecasts that the global AI-augmented software development market will expand at a compound annual growth rate of 33 percent through 2030.
“In an era where speed, innovation, and adaptability define competitive advantage, AI-augmented software development is rapidly becoming a transformative force for enterprises,” the report says. “By embedding AI into every stage of the software development lifecycle, from code generation and testing to debugging and deployment, organizations across industries like finance, healthcare, retail, telecom, and manufacturing are redefining how software is built, optimized, and scaled.”
Deploying AI-augmented development tools and processes comes with both risks and rewards. For tech leaders and software developers, it is vital to understand both.
Risks of AI-augmented software development
Risks of relying too heavily on AI for software development include bias in the data used to train models, cybersecurity threats, and unchecked errors in AI-generated code. We asked a range of experts what they’ve found most challenging about integrating AI in the software development lifecycle and how they’ve managed those challenges.
Bias in the models
Bias in the data used to feed models has long been an issue for AI, and AI-augmented development is no exception.
“Because AI is trained on human-coded data, it can replicate and amplify existing biases,” says Ja-Naé Duane, faculty and academic director of the Master’s Program in Innovation Management and Entrepreneurship at Brown University School of Engineering. “Without deliberate oversight and diverse perspectives in design and testing, we risk embedding exclusion into the systems we build,” she says.
Most Loved Workplace, a provider of workplace certifications, uses machine learning to analyze employee sentiment. But early on, it saw signs that its models were misreading certain emotional tones or cultural language differences.
“We had to retrain the models, labeling according to our own researched models, and using humans in the loop to test for bias,” says Louis Carter, founder of the company and an organizational psychologist.
“Our internal team did a lot of work to do so, and we created a gaming platform for everyone to label and add in their own interpretation of bias,” Carter says. “We improved the [BERT language model], developing our own construct for identifying emotions and sentiment. If we hadn’t caught it, the results would have misled users and hurt the product’s credibility.”
Intellectual property (IP) infringement
The use of AI-augmented development and possible IP infringement can raise complex legal issues, especially within the area of copyright. Because AI models can be trained using enormous datasets, including some copyrighted content, they can generate outputs that closely resemble or infringe upon existing copyrighted material. This can lead to lawsuits.
“The current uncertainty around how these models do or don’t infringe on intellectual property rights is absolutely still a risk,” says Joseph Mudrak, a software engineer at product design company Priority Designs. “OpenAI and Meta, for example, are both subjects of ongoing court cases regarding the sources of the data fed into those models.”
The American Bar Association notes that as the use of generative AI grows rapidly, “so have cases brought against generative AI tools for infringement of copyright and other intellectual property rights, which may establish notable legal precedents in this area.”
“Most generally available AI-augmented development systems are trained on large swaths of data, and it’s not particularly clear where that data comes from,” says Kirk Sigmon, a partner at law firm Banner & Witcoff Ltd. Sigmon specializes in AI and does coding and development work on the side. “Code is protectable by copyright, meaning that it is very possible that AI-augmented development systems could output copyright-infringing code,” Sigmon says.
Cybersecurity issues
AI-augmented development introduces potential cybersecurity risks such as insecure code generation. If they are trained on datasets with flawed or insecure examples, AI models can generate code containing common vulnerabilities such as SQL injection or cross-site scripting attacks.
AI-generated code could also inadvertently include sensitive data such as customer information or user passwords, exposing it to potential attackers. Training models on sensitive data might lead to unintentional exposure of this data in the generated code.
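The classic failure mode is easy to demonstrate. Below is a sketch of the string-built query an assistant can happily produce, next to the parameterized form a reviewer should insist on; sqlite3 is used only to keep the example self-contained:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"

# Vulnerable pattern: user input concatenated straight into the SQL string.
unsafe = conn.execute(f"SELECT role FROM users WHERE name = '{user_input}'").fetchall()

# Safer pattern: a parameterized query treats the input as data, not as SQL.
safe = conn.execute("SELECT role FROM users WHERE name = ?", (user_input,)).fetchall()

print("unsafe:", unsafe)  # returns the admin row despite the bogus name
print("safe:", safe)      # returns nothing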
“From a privacy and cybersecurity standpoint, unvalidated AI-generated code can introduce serious vulnerabilities into the software supply chain,” says Maryam Meseha, founding partner and co-chair of privacy and data protection at law firm Pierson Ferdinand LLP.
“We’ve seen companies unknowingly ship features that carried embedded security flaws, simply because the code ‘looked right’ or passed surface-level tests,” Meseha says. “The cost of retroactively fixing these issues, or worse, dealing with a data breach, far outweighs the initial speed gains.”
False confidence
There might be a tendency for development teams and leaders to assume that AI will get it right almost all the time because they believe automation removes the problem of human error. This false confidence can lead to problems.
“AI-augmented approaches, particularly those using generative AI, are inherently prone to mistakes,” says Ipek Ozkaya, technical director of engineering intelligent software systems at the Carnegie Mellon University Software Engineering Institute.
“If AI-augmented software development workflows are not designed to prevent, recognize, correct, and account for these mistakes, they are likely to become nightmares down the line, amounting to unmanageable technical debt,” Ozkaya says.
Most Loved Workplace, which uses tools such as Claude Code, Sentry, and custom AI models for emotion and sentiment analysis in its platform, has experienced false confidence with AI-augmented development.
“Claude and other tools sound right even when they’re dead wrong,” Carter says. “One piece of output missed a major edge case in a logic loop. It passed initial testing but broke once real users hit it. Now, everything AI touches goes through multiple human checks.”
The company has had developers submit code from Claude that looked solid at first but failed under load, Carter says. “When I asked why they made certain choices, they couldn’t explain it—it came straight from the tool,” he says. “Since then, we’ve made it clear: If you can’t explain it, don’t ship it.”
Rewards of AI-augmented software development
While increased productivity and cost-effectiveness garner the most attention from business leaders, tech leaders and developers are finding that AI supports developer learning and skills development, prevents burnout, and may make software development more sustainable as a career.
Speed without burnout
It’s no surprise, given the pressure to deliver quality software at a rapid pace, that many developers experience burnout. A 2024 study by Kickstand Research, based on a survey of more than 600 full-time professionals in software engineering, found that nearly two-thirds of respondents (65 percent) experienced burnout in the past year.
The report, conducted on behalf of Jellyfish, a provider of an engineering management platform, indicated that the problem was particularly acute for short-staffed engineers and leaders overseeing large organizations. Of respondents at companies with more than 500 people in their engineering organization, 85 percent of managers and 92 percent of executives said they were experiencing burnout.
Deploying AI-augmented development tools can help address the issue by automating tasks and increasing productivity.
“Claude Code has helped us move faster without overwhelming the team,” Carter says. “One of our junior developers hit a wall building a complex rules engine. He used Claude to map out the logic and get unstuck. What would’ve taken half a day took about an hour. It saved time and boosted his confidence.”
Cleaner code and fewer bugs
AI-augmented development can lead to fewer bugs and improved code quality. This is because AI tools can handle tasks such as code analysis, bug detection, and automated testing. They can help identify possible errors and suggest enhancements.
“We use Sentry to catch issues early, and Claude to clean up and comment the code before anything ships,” Carter says. “Claude is a great way of cleaning up messy code.”
Commenting, or adding notes and reasoning behind what code is doing and what it is intended to accomplish, makes it easy for everyone to understand, Carter says. This is especially helpful for programmers whose second language is English, “because there are a lot of misunderstandings that can happen.”
Most Loved Workplace is running sentiment and emotion scoring in its human resources SaaS application Workplacely, used for certifying companies. “AI helps us test edge cases faster and flag inconsistencies in model outputs before they go live,” Carter says.
“My favorite way to use AI-augmented development systems is to use them to help me bugfix,” Sigmon says. “AI systems have already saved me a few times when, late at night, I struggled to find some small typo in code, or struggled to figure out some complex interrelationship between different signaling systems.”
Cost-effectiveness and increased productivity
AI-augmented development systems can be cost-effective, particularly over time due to increased efficiency and productivity, the automation of tasks, reduced errors, and shorter development lifecycles.
“Using AI-augmented development systems can save money because you can hire fewer developers,” Sigmon says. “That said, it comes with some caveats. For instance, if the world pivots to only hiring senior developers and relies on AI for ‘easy’ work, then we’ll never have the opportunity to train junior developers to become those senior developers in the future.”
AI “can automate routine coding tasks and surface bugs, as well as optimize performance, dramatically reducing development time and cost,” Duane says.
“For example, tools like GitHub Copilot have been shown to significantly cut time-to-deploy by offering developers real-time code suggestions,” Duane says. “In several organizations I work with, teams have reported up to a 35 percent acceleration in release cycles, allowing them to move from planning to prototyping at unprecedented speed.”
Upskilling on the fly
The skills shortage is one of the biggest hurdles for organizations and their development operations. AI-powered tools can help developers learn new skills organically in the development process.
“I’ve seen junior team members start thinking like senior engineers much faster,” Carter says. “One in particular used to lean on me for direction constantly. Now, with Claude, he tests ideas, reviews structure, and comes to me with smarter questions. It’s changed how we work.”
AI is lowering the barrier to entry for individuals without formal programming training by enabling no-code and low-code platforms, Duane says. “This transformation aligns with our vision of inclusive innovation ecosystems,” she says.
For instance, platforms such as Bubble and Zapier enable entrepreneurs, educators, and others without technical backgrounds to build and automate without writing a single line of code, Duane says. “As a result, millions of new voices can now participate in shaping digital solutions, voices that would have previously been left out,” she says.
GitHub hit by a sophisticated malware campaign as ‘Banana Squad’ mimics popular repos 20 Jun 2025, 2:11 pm
A threat group dubbed “Banana Squad,” active since April 2023, has trojanized more than 60 GitHub repositories in an ongoing campaign, offering Python-based hacking kits with malicious payloads.
Discovered by ReversingLabs, the malicious public repos each imitate a well-known hacking tool to look legitimate but inject hidden backdoor logic.
IBM combines governance and security tools to solve the AI agent oversight crisis 20 Jun 2025, 12:49 pm
IBM is integrating its AI governance tool watsonx.governance with Guardium AI Security — its tool for securing AI models, data, and their usage — to simplify and bolster AgentOps for enterprises.
AgentOps, short for agent operations and also known as agent development lifecycle management, is a growing area of focus for enterprises as agent sprawl becomes a key challenge, driven largely by vendors lining up to offer tools for creating AI agents for a plethora of different tasks.
The hyperscalers disrupt the sovereign cloud disruptors 20 Jun 2025, 11:00 am
The fast-paced world of cloud computing is a continuous cycle of innovation and disruption. Most recently, sovereign cloud startups have stepped into the spotlight, offering specialized cloud services with a focus on sovereignty to address pressing concerns of data privacy, jurisdiction, and regulatory compliance. Recent announcements from major hyperscalers, particularly Microsoft’s introduction of new sovereign cloud products for Europe, signal a dramatic shift in this landscape. We are witnessing a classic case of disruption turning on itself as these giants now present a significant challenge to the very disruptors they once fueled.
Microsoft has positioned its sovereign cloud services to meet the growing demand for data control and security, concerns heightened by recent geopolitical shifts and a troubling increase in cyberthreats. These products include features such as Data Guardian, which ensures that only Microsoft personnel based in Europe can manage remote access to systems, and external key management, which grants customers autonomy over their encryption keys.
Additionally, Microsoft offers a regulated environment management control panel and locally hosted solutions through Azure Local to address the evolving regulatory landscape in Europe. Although these initiatives may seem beneficial for enterprises seeking sovereign cloud options, they inadvertently complicate an already crowded marketplace.
Double-edged sword
The entry of hyperscalers into the sovereign cloud space presents a double-edged sword. On one side, it offers alternative choices for organizations striving to comply with regional regulations, thereby broadening the available options—a generally positive move in a competitive market. Companies now have the opportunity to evaluate advanced offerings from established players alongside those from agile startups that previously led this niche. Conversely, expanded selection inevitably adds complexity. Organizations must consider not only the features and pricing but also the challenges of navigating the potential legal restrictions associated with US-based service providers.
One of the primary concerns arises from the complexities of US law. Companies like Microsoft operate under regulations, including the CLOUD Act (Clarifying Lawful Overseas Use of Data) and section 702 of FISA (Foreign Intelligence Surveillance Act), which can compel them to provide access to data even if it resides in foreign nations. This reality casts a dark cloud over the promises of sovereignty from these hyperscalers.
Critics, including Benjamin Schilz, the CEO of Wire, argue that Microsoft’s claims of sovereignty may be misleading. The assertion of control over data in a sovereign cloud environment isn’t as solid as it appears when examined through the lens of US legal frameworks. Schilz aptly points out that the US government has demonstrated its ability to compel companies to surrender data even if it exists outside the United States. How sovereign are these so-called sovereign clouds?
For cloud customers, especially in regulated industries, the implications are significant. Concerns about genuine data protection may diminish the initial allure of collaborating with hyperscalers. These giants provide extensive resources, global networks, and advanced technology, but they also operate under regulations that can conflict with the very principles of sovereignty that businesses pursue.
Several European entities, including Denmark and the German state of Schleswig-Holstein, are considering dropping Microsoft Office in favor of locally hosted, open-source alternatives. This move is driven by increasing distrust of US tech firms. These circumstances create a sense of urgency for overseas enterprises to reassess their cloud strategies, balancing their reliance on American tech giants with the need for greater control over sensitive data.
Deep-seated worries
This tension is not merely a fleeting concern; it reflects deep-seated worries within organizations that are increasingly aware of the fragility of data sovereignty in a digital age marked by geopolitical tensions and cybersecurity threats. In recent years, as more sensitive data has been stored in the public cloud, companies have grown more reliant on these environments for their operations, assuming that they are as secure—if not more secure—than on-premises solutions. This dependence is becoming harder to maintain as awareness increases about the legal complexities of data privacy and the implications of US surveillance practices.
Moreover, for enterprises aiming to maintain compliance while leveraging advanced cloud capabilities, the selection process becomes complex. Should they trust the well-resourced hyperscalers or should they turn to smaller, specialized providers that prioritize sovereignty, even at the expense of reduced infrastructure scale? Many organizations may soon find themselves in a dilemma, especially if their primary motivations for adopting cloud services were to enhance agility and reduce costs in a landscape where privacy is now a paramount concern.
The result? Startups, which carved out their niches based on localized offerings and strong compliance assurances, now confront increased competition from hyperscalers with formidable resources. While these new offerings from giants like Amazon Web Services (AWS) and Google Cloud enhance the sovereign cloud ecosystem, they also reduce the uniqueness that initially attracted clients to smaller providers. Decision-makers are left wrestling with the question of whether true sovereignty is attainable within the framework of these hyperscalers' expansive service portfolios.
Looking ahead, growth in the sovereign cloud market is expected to continue, driven by increased awareness of compliance and data privacy concerns across various industries. However, this growth may add to the confusion. The complexities of differing regulations, data residency requirements, and a wide range of service offerings make it essential for enterprises to conduct thorough due diligence to understand how these services align with organizational goals of sovereignty, performance, and compliance without falling victim to promises of data sovereignty that are impossible to fulfill.
Developers set the pace for genAI tools adoption 20 Jun 2025, 11:00 am
The increasingly weird interactions between everyday people and ChatGPT may make more headlines, but developers are setting the pace for AI’s future. Whether it’s coding assistants auditioning (sometimes erratically) as copilots, development platforms vying to offer the smoothest AI-driven developer experience, or tech giants like Oracle and Microsoft angling for developer loyalty, our picks this month highlight some of the biggest battles in genAI today—all aiming to win the hearts and minds of builders.
Top picks for generative AI readers on InfoWorld
What the AI coding assistants get right, and where they go wrong
Think of AI coding assistants as bright but distractible interns: they can be very helpful, but they also have serious quirks. Here's a look at the best and worst of six leading AI copilots.
The AI platform wars will be won on the developer experience
Whoever makes building AI into applications a seamless developer experience will win a big piece of the future. Right now, Microsoft appears to be leading the pack, almost by accident.
The key to Oracle’s AI future
As the database of choice for a huge swath of enterprises, Oracle is in a unique position to provide the data support big companies crave for their AI projects. But will developers get on board?
9 APIs you’ll love for AI integration and automated workflows
APIs have long offered developers access to data and functionality from a variety of sources. Now, a new crop of APIs is connecting apps to AI integrations.
More good reads and generative AI updates elsewhere
Are genAI shortcuts making attackers easier to catch?
There’s been much ado about the next generation of developers relying on AI rather than learning comp-sci fundamentals. Now it appears the next generation of cybercriminals is making the same mistake.
Reddit releases new ‘Community Intelligence’ ad tools
Reddit’s very human peer-to-peer discussions provide a broad pool of data used to train many prominent AI tools. Now, the company is offering AI-driven insights to advertisers based on user-generated content.
Exposed developer secrets are a big problem, and AI is making them worse
Leaving plaintext passwords or SSH keys in human-readable code is a hallmark of inexperienced or overworked developers. But a recent report found that repos integrated with an AI copilot were 40% more likely to contain leaked secrets than those without.
Taking advantage of Microsoft Edge’s built-in AI 19 Jun 2025, 11:00 am
Large language models are a useful tool, but they’re overkill for much of what we do with services like ChatGPT. Summarizing text, rewriting our words, even responding to basic chatbot prompts are tasks that don’t need the power of an LLM and the associated compute, power, and cooling of a modern inferencing data center.
There is an alternative: small language models. SLMs like Microsoft's Phi can produce reliable results with far fewer resources because they're trained with far fewer parameters. One of the latest Phi models, Phi-4-mini-instruct, has 3.5 billion parameters trained on five trillion tokens. Models like Phi-4-mini-instruct are designed to run on edge hardware, taking language generation to PCs and small servers.
Microsoft has been investing in the development and deployment of SLMs, building its PC-based inferencing architecture on them and using ONNX runtimes with GPUs and NPUs. The downside is that downloading and installing a new model can take time, and the one you want your code to use may not be installed on a user’s PC. This can be quite the hurdle to overcome, even with Windows bundling Phi Silica with its Copilot+ PCs.
What’s needed is a way to deliver AI functions in a trusted form that offers the same APIs and features wherever you want to run it. The logical place for this is in the browser, as we do much of our day-to-day work with one, filling in forms, editing text, and working with content from inside and outside our businesses.
An AI model in the browser
A new feature being trialed in the Dev and Canary builds of Microsoft's Edge browser provides new AI APIs for working with text content, hosting Phi-4-mini in the browser. Users don't need to spend time setting up WebNN, WebGPU, or WebAssembly, nor do they need to preload models and configure security permissions before your code can call the model and run a local inferencing instance.
There are other advantages. By running models locally, you save money: you don't need an expensive cloud inferencing subscription for GPT or a similar model. By keeping inferencing local, you're also keeping user data private; it's not transferred over the network and it's not used to train models (a process that can lead to accidental leaks of personally identifiable information).
The browser itself hosts the model, downloading it and updating it as needed. Your code simply needs to initialize the model (the browser automatically downloads it if necessary) and then call JavaScript APIs to manage basic AI functions. Currently the preview APIs offer four text-based services: summarizing text, writing text, rewriting text, and basic prompt evaluation. There are plans to add support for translation services in a future release.
Getting started with Phi in Edge
Getting started is easy enough. You need to set Edge feature flags in either the Canary or Dev builds of Edge for each of the four services, restarting the browser once they’re enabled. You can then open the sample playground web application to first download the Phi model and then start experimenting with the APIs. It can take some time to download the model, so be prepared for a wait.
Be aware that there are a few bugs at this stage of development. The sample web application stopped updating the download progress counter roughly halfway through the process, but switching to a different API view showed that the installation was complete and I could try out the samples.
Once downloaded, the model is available for all AI API applications, and downloads only when an updated version is released. It runs locally so there’s no dependency on the network; it can be used with little or no connectivity.
The test pages are basic HTML forms. The Prompt API sample has two fields for setting up user and system prompts, as well as a JSON format constraint schema. For example, the initial sample produces a sentiment analysis for a review web application. The sample constraints ensure that the output is a JSON document containing only the sentiment and the confidence level.
With the model running in the browser and without the same level of protection as the larger-scale LLMs running in Azure AI Foundry, having a well-written system prompt and an associated constraint schema is essential to building a trustworthy in-browser AI application. You should avoid open-ended prompts, which can lead to errors. By focusing on specific queries (for example, determining sentiment), it's possible to keep risk to a minimum, ensuring the model operates in a constrained semantic space.
Using constraints to restrict the format of the SLM output makes it easier to use in-browser AI as part of an application, for example, using numeric values or simple text responses as the basis for a graphical UI. Our hypothetical sentiment application could perhaps display a red icon beside negative sentiment content, allowing a worker to analyze it further.
Using Edge’s experimental AI APIs
Edge’s AI APIs are experimental, so expect them to change, especially if they become part of the Chromium browser platform. For now, however, you’re able to quickly add support in your pages, using JavaScript and the Edge-specific LanguageModel object.
Any code needs to first check for API support before checking that the Phi model is available. The same call reports whether the model is present, absent, or currently being downloaded. Once a download has been completed, you can load the model into memory and start inference. Creating a new session is an asynchronous process that allows you to monitor download progress, ensuring the model is in place and that users are aware of how long it will take to download several gigabytes of model and data.
Once the model is downloaded, start by defining a session and giving it a system prompt. This sets the baseline for any interactions and establishes the overall context for an inference. At the same time, you can use a technique called “N-shot prompting” to provide structure to outputs by providing a set of defined prompts and their expected responses. Other tuning options define limits for how text is generated and how random the outputs are. Sessions can be cloned if you need to reuse the prompts without reloading a page. You should destroy any sessions when the host page is closed.
With the model configured, you can now deliver a user prompt. This can be streamed so you can watch output tokens being generated or simply delivered via an asynchronous call. This last option is the most likely, especially if you will be processing the output for display. If you are using response constraints, these are delivered alongside the prompt. Constraints can be JSON or regular expressions.
If you intend to use the Writing Assistant APIs, the process is similar. Again, you need to check if the API feature flags have been enabled. Opening a new session either uses the copy of Phi that’s already been downloaded or starts the download process. Each API has a different set of options, such as setting the type and length of a summary or the tone of a piece of writing. You can choose the output type, either plain text or markdown.
CPU, GPU, or NPU?
Testing the sample Prompt API playground on a Copilot+ PC shows that, for now at least, Edge is not using Windows’ NPU support. Instead, the Windows Task Manager performance indicators show that Edge’s Phi model runs on the device’s GPU. At this early stage in development, it makes sense to take a GPU-only approach as more PCs will support it—especially the PCs used by the target developer audience.
It’s likely that Microsoft will move to supporting both GPU and NPU inference as more PCs add inferencing accelerators and once the Windows ML APIs are finished. Windows ML’s common ONNX APIs for CPU, GPU, and NPU are a logical target for Edge’s APIs, especially if Microsoft prepares its models for all the target environments, including Arm, Intel, and AMD NPUs.
Windows ML provides tools for Edge’s developers to first test for appropriate inferencing hardware and then download optimized models. As this process can be automated, it seems ideal for web-based AI applications where their developers have no visibility into the underlying hardware.
Microsoft’s Windows-based AI announcements at Build 2025 provide enough of the necessary scaffolding that bundling AI tools in the browser makes a lot of sense. You need a trusted, secure platform to host edge inferencing, one where you know that the hardware is able to support a model and where one standard set of APIs ensures you only have to write code once to have it run anywhere your target browser runs.
Firecrawl: Easy web data extraction for AI applications 19 Jun 2025, 11:00 am
As organizations increasingly rely on large language models (LLMs) to process web-based information, the challenge of converting unstructured websites into clean, analyzable formats has become critical.
Firecrawl, an open-source web crawling and data extraction tool developed by Mendable, addresses this gap by providing a scalable solution to harvest and structure web content for AI applications. With its ability to handle dynamic JavaScript-rendered pages, bypass anti-bot mechanisms, and output LLM-friendly Markdown, Firecrawl has become indispensable for developers building retrieval-augmented generation (RAG) systems and knowledge bases.
Project overview – Firecrawl
Firecrawl is available as an AGPL-3.0-licensed open-source project or a cloud-based API service (Firecrawl Cloud). Firecrawl crawls entire websites and converts their content into structured Markdown or JSON. Launched in 2023, the project gained rapid adoption, surpassing 34,000 GitHub stars by early 2025 and becoming the preferred web scraping solution for companies like Snapchat, Coinbase, and MongoDB. Hosted by Mendable, Firecrawl combines traditional crawling techniques with AI-powered extraction capabilities, supporting everything from simple blog scraping to complex interactions with single-page applications.
Key Firecrawl capabilities include:
- Full website crawling without requiring sitemaps
- JavaScript rendering through integrated Playwright microservices
- Automatic proxy rotation and CAPTCHA handling
- Multi-format output (Markdown, HTML, structured JSON)
- Integration with LLM orchestration frameworks like LangChain and LlamaIndex
The project’s architecture separates crawling, rendering, and extraction into modular components, allowing horizontal scaling through Redis-backed job queues. This design enables Firecrawl to process millions of pages daily while maintaining sub-second latency for individual requests.
What problem does Firecrawl solve?
Traditional web scraping approaches face three critical limitations in AI contexts:
- Structural loss: Converting HTML to plain text destroys semantic hierarchy and metadata crucial for LLM understanding.
- Dynamic content: Modern JavaScript frameworks require headless browsers, increasing complexity and resource demands.
- Scale limitations: Manual proxy management and rate limiting hinder large-scale data collection.
Firecrawl addresses these through its intelligent crawling engine, which preserves document structure using Markdown headers and semantic HTML annotations. The system automatically detects and waits for JavaScript-rendered content, with built-in retry logic for CAPTCHA challenges and network failures. For enterprise deployments, Firecrawl Cloud offers managed scaling with features like automatic IP rotation and geographic targeting.
A closer look at Firecrawl
Firecrawl’s distributed architecture comprises four core components that work in tandem.
- Crawler orchestrator: Firecrawl’s crawler orchestrator manages URL discovery and politeness policies using breadth-first search with domain prioritization. It implements adaptive delay algorithms that adjust to website response times while maintaining compliance with robots.txt directives unless explicitly overridden.
- Playwright microservices: Firecrawl uses the Playwright testing framework’s headless Chrome instances to handle JavaScript execution, enabling interaction with dynamic single-page applications. These microservices capture screenshots for visual verification and implement automatic scroll detection to handle infinite-scroll pages. Cookie persistence across sessions allows seamless navigation through authenticated content.
- Extraction pipeline: Firecrawl’s extraction pipeline converts rendered content into AI-ready formats using customizable schemas. This allows developers to define nested JSON output formats that preserve both content and metadata. This pipeline supports multi-stage processing, including PDF text extraction via PyMuPDF and image OCR through Tesseract.js integrations.
- Rate limiting: Excessive web scraping can result in IP blocks. Firecrawl prevents IP bans by throttling concurrent requests and automatically rotating proxies. The system integrates with third-party CAPTCHA solving services to handle anti-bot challenges, while maintaining detailed logs for compliance auditing.
Firecrawl integrations and use cases
The value of Firecrawl is amplified by its extensive integrations. LLM frameworks like LangChain support direct ingestion into vector databases, while automation platforms such as Make enable visual workflow building for scraping pipelines. Enterprise stacks benefit from Splunk integrations for crawl analytics and Snowflake connectors for direct data lake ingestion.
The example below shows how to integrate Firecrawl with LangChain.

from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    api_key="YOUR_KEY",
    url="https://example.com",
    mode="crawl",  # crawl the whole site; use "scrape" for a single page
)
docs = loader.load()
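From there, the loaded documents can feed straight into a retrieval-augmented generation pipeline. The sketch below continues from the loader example and assumes an OpenAI API key is configured and that the langchain-openai and faiss-cpu packages are installed:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the crawled pages into chunks, embed them, and build a searchable index.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
print(index.similarity_search("What does example.com say about pricing?", k=3))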
The real-world applications of combining web scraping and large language models are virtually endless. Here are three examples.
An ecommerce company could monitor tens of thousands of product pages daily using a crawl like the one below:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_KEY")

# Crawl up to 12,000 pages and write structured JSON output per domain and date.
app.crawl_url(
    'https://competitor.com',
    limit=12000,
    scrape_options={'formats': ['json']},
    output='s3://bucket/%(domain)s/%(date)s.json'
)
The structured JSON is fed into an LLM for price trend analysis.
A research team at a university could scrape millions of research papers using Firecrawl’s PDF processor:
firecrawl https://arxiv.org/pdf/2106.00001 --format=markdown --output=arxiv_papers/
A media intelligence firm could track multiple news sites using Firecrawl’s sitemap detection, reusing the app client from the earlier example:
app.map_url('https://nytimes.com')
Firecrawl’s roadmap focuses on semantic crawling using LLM-guided content discovery and WebAssembly-based edge processing for browser-side execution.
Bottom line – Firecrawl
Firecrawl redefines web data acquisition for the AI era, offering developers an enterprise-grade tool kit that abstracts away web scraping complexities. By combining robust crawling infrastructure with AI-native output formats, Firecrawl enables organizations to focus on deriving insights rather than data wrangling. As LLMs continue to permeate business workflows, Firecrawl’s role as the bridge between unstructured web content and structured AI inputs will only grow more critical. For teams building the next generation of AI applications, mastering Firecrawl’s capabilities will provide a strategic advantage in the race to harness web-scale information.