Latest news with #OlgaMegorskaya


Forbes
6 days ago
- Business
- Forbes
How Virtual Testing Environments Are Unlocking Enterprise AI Agent Adoption
Olga Megorskaya is Founder & CEO of Toloka AI, a high-quality data partner for all stages of AI development. AI agents are having a moment. From customer service automation to complex workflow orchestration, these systems promise to revolutionize how enterprises operate. Yet despite the hype, actual deployment remains frustratingly limited. The reason? Safety risks are simply too high for most organizations to stomach. Unlike traditional AI models that respond to prompts in isolation, AI agents perform sequences of actions across multiple systems—accessing databases, modifying files, interacting with APIs and making decisions that cascade through entire business processes. When an agent malfunctions in a production environment, the potential consequences include operational disruptions, security breaches, compliance violations and reputational damage. The solution isn't to restrict AI agents—it's to test them properly before deployment. Enter agentic environments: realistic virtual spaces that mirror your actual business operations, allowing you to stress-test agents without risking your real systems. The Enterprise Safety Paradox Here's the challenge enterprise leaders face: AI agents must interact with real tools and data to be effective, but testing them in production environments is unacceptably risky. A customer service agent who accidentally exposes PII, a workflow automation that corrupts financial records or a research assistant who inadvertently violates compliance protocols can cause immediate, tangible damage. Traditional testing approaches fall short because they can't capture the complexity of real enterprise environments. Sandbox testing with dummy data doesn't reveal how agents behave when faced with the messy realities of actual business operations, like incomplete datasets, legacy system quirks or the subtle contextual cues that guide human decision making. This creates a deployment deadlock. Organizations need to see agents perform in realistic conditions to build confidence, but they can't afford the risks of letting unproven agents loose on production systems. What Are Agentic Environments? Agentic environments solve this paradox by creating high-fidelity digital twins of real business operations. These aren't simple testing sandboxes; they're comprehensive virtual organizations, complete with realistic data, workflows and system integrations. Leading organizations are now implementing sophisticated virtual companies that serve as testing grounds for enterprise AI agents. These environments typically include fully functional instances of Google Workspace, Confluence, Salesforce CRM, Jira, GitHub and Slack, along with virtual employees, department structures and ongoing projects. These environments can be customized to match specific industry requirements or integrate proprietary tools, allowing them to safely validate AI agent performance before real-world deployment. The value of this approach has become clear as companies are using these virtual environments to test agents before launch. Realistic environments reveal critical vulnerabilities that would be invisible in traditional testing scenarios—issues that could compromise data security or cause system failures in production. By identifying and addressing these problems in virtual environments, organizations prevent what could be costly deployment failures. To build these virtual environments, data companies are employing real human experts to generate realistic scenarios that reflect the nuanced decision making and complex interactions agents will encounter in actual business operations. The key to success is realism. Rather than testing agents with sanitized, artificial scenarios, these environments present the complexity agents will encounter in production: conflicting information across systems, incomplete data sets, varying user permissions and the kind of edge cases that only emerge in real-world operations. Beyond Basic Testing: Agent-Oriented Benchmarks This focus on realistic testing has driven the development of sophisticated agent-oriented benchmarks. SWE-bench tests coding agents in structured development environments, while TAU-bench evaluates agents in retail and airline scenarios where they must navigate complex, domain-specific rules while interacting with both humans and APIs over extended periods. What makes these benchmarks valuable for enterprise deployment isn't just their technical rigor—it's their emphasis on real-world conditions. Human insight remains essential for creating benchmarks that address customers' unique operational requirements and designing realistic scenarios that reflect actual business complexities. These benchmarks test whether agents can maintain performance across millions of interactions, follow nuanced compliance requirements and handle the kind of ambiguous situations that human workers navigate daily. Enterprise teams are increasingly requesting custom versions of these benchmarks tailored to their specific industries and use cases. The demand signals that disruptive technologies require brand-new testing approaches. Security Implications From a security perspective, agentic environments are essential for identifying potential vulnerabilities before deployment. Agents that can modify files, access databases and interact with external APIs represent significant attack surfaces. Testing in isolated environments allows security teams to observe agent behavior under various conditions, including adversarial scenarios designed to exploit potential weaknesses. These environments also enable red team exercises where security professionals can attempt to manipulate agents into performing unauthorized actions or accessing restricted data. Better to discover these vulnerabilities in a controlled setting than after deployment. The Path Forward Using realistic testing environments enables organizations to build the confidence needed to deploy agents safely and effectively. Companies that establish robust agentic environments now will be positioned to take advantage of future advances while maintaining the safety and reliability their operations demand. Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives. Do I qualify?


Forbes
29-05-2025
- Business
- Forbes
3 Breakthrough Ways Data Is Powering The AI Reasoning Revolution
Olga Megorskaya is Founder & CEO of Toloka AI, a high quality data partner for all stages of AI development. The buzz around reasoning models like DeepSeek R1, OpenAI o1 and Grok 3 signals a turning point in AI development that pivots on reasoning. When we talk about reasoning, we mean that models can do more than repeat patterns—they think through problems step by step, consider multiple perspectives before giving a final answer and double-check their work. As reasoning skills improve, modern LLMs are pushing us closer to a future where AI agents can autonomously handle all sorts of tasks. AI agents will become useful enough for widespread use when they learn to truly reason, meaning they adapt to new challenges, generalize skills from one area to apply them in a new domain, navigate multiple environments and reliably produce correct answers and outputs. Behind these emerging skills, you'll find sophisticated datasets used for training and evaluating the models. The better the data, the stronger the reasoning skills. How is data shaping the next generation of reasoning models and agents? As a data partner to frontier labs, we've identified three ways that data drives AI reasoning right now: domain diversity and complexity, refined reasoning and robust evaluations. By building stronger reasoning skills in AI systems, these new approaches to data for training and testing will open a door to the widespread adoption of AI agents. Current models often train well in structured environments like math and coding, where answer verification is straightforward, fitting nicely into classical reinforcement learning frameworks. But the next leap requires pushing into more complex data across a wider knowledge spectrum. This is to achieve better generalization and performance as models transfer learning across areas. Beyond math and coding, here's the kind of data becoming essential for training the next wave of AI: These data points cover multi-step scenarios like web research trajectories with verification checkpoints. This includes open-ended domains such as law or business consulting that have multifaceted answers, which makes them difficult to verify but important for advanced reasoning. Think of complex legal issues with multiple valid approaches or comprehensive market assessments with validation criteria. Agent datasets are based on taxonomies of use cases, domains and categories as well as real-world tasks. For instance, a task for a corporate assistant agent would be to respond to a support request using simulated knowledge bases and company policies. Agents also need contexts and environments that simulate how they interact with specific software, data in a CRM or knowledge base or other infrastructure. These contexts are created manually for agent training and testing. The path a model takes to an answer is becoming as critical as the answer itself. As classical model training approaches are revisited, techniques like reward shaping (providing intermediate guidance) are vital. Current methods focus on guiding the process with feedback from human experts for better coherence, efficiency and safety: This focuses on a model's "thinking" rather than the outcome by guiding it through logical reasoning steps or guiding an agent through interactions with the environment. Think of it like checking step-by-step proofs in math, where human experts review each step and identify where a model makes a mistake instead of evaluating the final answer. Preference-based learning trains models to prioritize better reasoning paths. Experts review alternative paths and choose the best ones for models to learn from. This data can compare entire trajectories or individual steps in a process. These include data crafted from scratch to show high-quality reasoning sequences, much like teaching by example. Another approach is to edit LLM reasoning steps to improve them and let the model learn from the corrections. Current LLM evaluations have two main limitations: They struggle to provide meaningful signals of substantial improvements, and they are slow to adapt. The challenges mirror those in training data, including limited coverage of niche domains and specialized skills. To drive real progress, benchmarks need to specifically address the quality and safety of reasoning models and agents. Based on our own efforts, here's how to collaborate with clients on evaluations: Include a wider range of domains, specialized skill sets and more complex, real-world tasks. Move beyond single-metric evaluations to assess interdisciplinary and long-term challenges like forecasting. Use fine-grained, use-case-specific metrics. Co-develop these with subject-matter experts to add depth and capture nuances that standard benchmarks miss. As models develop advanced reasoning, safety evaluations must track the full chain of thought. For agents interacting with external tools or APIs, red teaming becomes critical. We recommend developing structured testing environments for red teamers and using the outcomes to generate new datasets focused on identified vulnerabilities. Even as model architectures advance, data remains the bedrock. In the era of reasoning models and agents, the emphasis has shifted decisively toward data quality, diversity and complexity. New approaches to data production are having a tremendous impact on the pace of AI development, urging reasoning models forward faster. With data providers upping their game to support the reasoning paradigm, we expect the near future to bring a wave of domain-specific, task-optimized reasoning agents—a new era of agentic AI. Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives. Do I qualify?


CNA
07-05-2025
- Business
- CNA
Amazon's Bezos leads new investment in AI data company Toloka
LONDON :Jeff Bezos' personal firm Bezos Expeditions is leading a $72-million investment in artificial intelligence data solutions company Toloka, which is looking to scale up its business, Toloka told Reuters on Wednesday. Toloka, which helps train and evaluate AI models using a network of human experts and testers, is part of Nasdaq-listed AI infrastructure firm Nebius Group and has worked with Amazon, Microsoft and Anthropic. Amsterdam-based Nebius emerged from a $5.4 billion deal, finalised last year, that split the domestic and international assets of Russian internet giant Yandex in the largest corporate exit from Russia since Moscow's 2022 invasion of Ukraine. Toloka founder and CEO Olga Megorskaya said the Bezos-led strategic investment was a milestone that should accelerate the company's growth, particularly in the U.S. market, and support its development of products through human experts collaborating with AI agents. "There will always be the need for control, verification, and help from human experts to ensure that the result is actually of high quality," Megorskaya told Reuters. Before finalising the split from Yandex, Nebius would have struggled to secure external investment from the likes of Amazon founder Bezos due to sanctions prohibiting U.S. investments in Russia. Chip giant Nvidia was among the investors in a $700 million private placement raised by Nebius late last year. In addition to Bezos Expeditions, Mikhail Parakhin, CTO of Canadian e-commerce company Shopify, is participating in the investment, Nebius said, and will join Toloka's new board of directors as executive chairman. Parakhin said he was excited to be joining Toloka at a time when the demand for world-class AI data expertise is more urgent than ever. Bezos Expeditions did not immediately respond to a request for comment. Nebius said it would continue to support Toloka as a shareholder, retaining a significant majority economic stake, but relinquishing majority voting control. "The important thing about this round is that we gained the needed flexibility to operate independently," Megorskaya said. She said Toloka anticipated another investment round in the future.