
From strengthening armor for U.S. warfighters to patching supply chain vulnerabilities, the convergence of AI and biotechnology could redefine U.S. national and economic security. Dramatic technical advancements in both fields mean that the building blocks of life, such as DNA and RNA, are understood and programmable. Humans now have the tools to shape life, carrying the potential to create better crops, livestock, materials, medicines, and manufacturing. The AI-biotechnology nexus stands to promote food and energy security, power the economy, safeguard the environment, and protect Americans from future pandemics and biothreats.
Seizing this potential, however, will hinge on improving the United States’ access to high-quality, secure biological data (biodata) designed specifically for AI. Existing biodata repositories lack the consistency, metadata, and trust needed to develop tailored AI models. To lead in biotechnology, the United States must modernize its data infrastructure and create biodata built for the AI future. U.S. leadership in AI-enabled biotechnology will depend less on new algorithms than on whether the United States treats biodata as secure national infrastructure.
Why Biodata Matters
Advancements in AI and biotechnology go hand-in-hand. AI provides the computational power to make sense of biology’s complexity at scales previously impossible, while biotechnology generates the information required to train better AI models. Biodata is the foundation of this synergistic relationship.
Biodata—including DNA, RNA, proteins, and metabolites—explains the structure, function, and processes of biological systems. Biodata is to AI what fuel is to an engine—the fundamental resource that determines how powerful and useful AI models can become. Google DeepMind’s AlphaFold helps illuminate this concept.
AlphaFold is an AI system that predicts a protein’s structure from amino acid sequences. The system is based on datasets of known protein structures from decades of experimental work in structural biology. AlphaFold’s neural networks “study” and learn from the datasets, which enables the system to predict new protein structures from previous patterns. The datasets provide the scientific understanding of protein structures, while the AI system provides computational power and pattern recognition to drive new insights. AlphaFold allows researchers to solve protein folding problems in just minutes—a task that would have previously taken years of experimental work—paving the way to accelerated drug discovery, enzyme engineering, and improved understanding of diseases.
In short, the United States needs access to high-quality, secure, and diverse biodata to maintain leadership in AI and biotechnology. But the United States faces several challenges that threaten to cede its edge in both.
Current Challenges
First, the vast majority of U.S. investment into the AI-biotechnology nexus favors biomedical applications due to economic returns, regulatory and market dynamics, and the perceived risk-reward profile. To be sure, investment in biomedical applications has produced enormous benefits. With this support, companies like Insilico Medicine and Recursion Pharmaceuticals have developed AI models that help design new drugs, predict treatment responses, and automate diagnostics. But a narrow focus on biomedicine creates a data ecosystem that lacks the variety needed to train AI models that advance U.S. biotechnology leadership across other domains. Disproportionate biomedical investment leaves other critical areas like energy, agriculture, and defense under-resourced and under-developed.
Biomedical data will not necessarily help build an AI model that enables precision farming or self-repairing concrete for military airfields, for instance. To realize the broad potential of biotechnology, the United States needs diversity in biodata at multiple levels: genetic, geographic, environmental, and experimental. Diversity is a technical imperative that improves model robustness, reduces bias, and enhances transferability across contexts.
Another challenge is that most U.S. biodata is not built for AI and lacks the security, structure, and interoperability to support model development. The most impactful AI models, whether OpenAI’s ChatGPT or Google DeepMind’s AlphaFold, are trained primarily on “found” data, or data readily available on the Internet. But that data was never specifically designed for AI and often lacks needed structure and organization. In the absence of a national strategy to coordinate biodata collection, labeling, and storage, individual labs and researchers operate autonomously, following their own practices and standards. The result? A fragmented and disorganized trove of biodata, and subpar AI models that could obscure underlying biological signals and produce unreliable outputs.
The Case for Action
These challenges have produced a biodata ecosystem that is ill-prepared for the AI future, threatening U.S. biotechnology leadership. As the National Security Commission on Emerging Biotechnology (NSCEB) warned last year in its final report, failing to secure and modernize U.S. biodata infrastructure poses major national security risks.
U.S. adversaries are investing heavily in tightly integrated, state-controlled biotechnology ecosystems in which AI and biodata infrastructure co-evolve. China has made biotechnology a strategic national priority for more than two decades—spending tens of billions of dollars, building hundreds of research parks, and integrating AI and biodata across sectors from agriculture to defense. The United States, on the other hand, lacks a unified federal strategy or equivalent large-scale integrated data infrastructure. Absent intervention, the United States risks dependency on foreign biodata, software, and supply chains. In other words, the United States could risk a future in which China controls key means to feed, heal, and even defend U.S. citizens.
The United States has taken some action to shore up its biodata. The Department of Energy’s Bioenergy Technologies Office and National Laboratory testbeds are exploring open data standards for industrial biotechnology. Programs at ARPA-H and DARPA are also funding open-source algorithm development to ensure equitable access to foundational tools. And, encouragingly, the 2026 National Defense Authorization Act (NDAA) calls on the Department of Defense (DoD) to develop requirements to ensure that biodata created by DoD-funded research are collected and stored in a manner that facilitates its use for “advanced computational methods.” These initiatives address important pieces of the problem, but they do not establish shared standards, coordinated data commissioning, or durable governance for AI-ready biodata.
The United States still lacks a unified mechanism for coordinating public-private data collection, ensuring quality control, or maintaining the secure, distributed compute environments necessary for high-value AI model training. The United States must act quickly to align agency, industry, and academic biodata efforts. It can start by building a secure biodata infrastructure and commissioning biological datasets purpose-built for AI.
Data Infrastructure: Build a Secure, Cloud-Based Biodata Portal
If biodata is to function as national infrastructure, the United States must create a secure environment in which AI models can be trained on sensitive biological data. The United States should build a secure, cloud-based federal biodata portal for AI model development, similar to the NSCEB’s proposed Web of Biodata. This portal—which could be managed by the National Institutes of Health (NIH), the Department of Agriculture, or participating national labs—should function as a computational sandbox that allows authorized users to train, test, and validate AI models.
To keep sensitive data secure, researchers should bring their algorithms to the data instead of moving the data around. Access should be managed in layers, including an open tier for public information, a controlled tier for verified researchers, and a highly restricted tier for sensitive or proprietary data handled in secure environments. This approach offers both openness—in line with U.S. policies like ARPA-H and DARPA’s programs—and protection from nefarious actors—a national security imperative. It lets scientists push the boundaries of discovery while keeping the country’s most sensitive biodata safe and accountable.
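The layered access model described above can be illustrated with a toy sketch. The tier names follow the text; the enforcement logic, type names, and function signature here are assumptions for illustration, not a proposed portal design.

```python
# Toy sketch of tiered access control for a biodata portal.
# Tier names follow the article; the enforcement rule is an assumption.
from enum import IntEnum

class Tier(IntEnum):
    OPEN = 0          # public information
    CONTROLLED = 1    # verified researchers
    RESTRICTED = 2    # sensitive or proprietary data, secure enclave only

def can_access(user_tier: Tier, dataset_tier: Tier) -> bool:
    """A user may read any dataset at or below their clearance tier."""
    return user_tier >= dataset_tier

print(can_access(Tier.CONTROLLED, Tier.OPEN))        # True
print(can_access(Tier.CONTROLLED, Tier.RESTRICTED))  # False
```

A real portal would layer authentication, auditing, and purpose-of-use checks on top of a rule like this; the point is that tier comparisons make access decisions explicit and reviewable.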
A “trust fabric” could connect data providers and users while strengthening cybersecurity, enabling near real-time transparency about who is accessing what, for what purpose, and under which regulatory or ethical constraints. Public trust is essential to the future of AI-enabled biotechnology. Even if datasets cannot be fully open, transparency in how they are collected, curated, and governed must be non-negotiable. Researchers, policymakers, and the public need to understand not only how data are secured, but how AI models trained on them are validated and applied to real-world decisions that affect health, food, and environmental outcomes.
The portal must also ensure that model training can occur across multiple institutions while preserving data confidentiality. This can be achieved by integrating practices like federated learning, differential privacy, homomorphic encryption, and multiparty computation. Additionally, each dataset should include rich metadata following the Findable, Accessible, Interoperable, Reusable (FAIR) principles for data sharing, as well as the “FAIR-AI” extensions.
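Of the techniques listed above, federated learning is the most direct fit for “bring the algorithm to the data.” The following minimal sketch, with a hypothetical one-parameter toy model and made-up site data, shows the core idea: each institution trains locally on its private data, and only model weights—never raw records—are shared and averaged.

```python
# Minimal federated-averaging sketch: each site computes a local update
# on its private (x, y) pairs; only the model weight leaves the site.
# Toy model: y = w * x, fit by gradient descent. All data are illustrative.

def local_update(w, data, lr=0.01, steps=50):
    """Gradient descent on one site's private data; returns only the weight."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, sites):
    """One round: each site trains locally; the server averages the weights."""
    local_weights = [local_update(w_global, site_data) for site_data in sites]
    return sum(local_weights) / len(local_weights)

# Three hypothetical institutions, each holding data near y = 3x.
sites = [
    [(1.0, 3.1), (2.0, 6.0)],
    [(1.5, 4.4), (2.5, 7.6)],
    [(3.0, 9.1), (0.5, 1.4)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, sites)
print(round(w, 1))  # converges toward ~3.0 without pooling raw data
```

Production systems would add differential-privacy noise to the shared updates and secure aggregation so the server never sees any single site's weights, but the data-stays-local structure is the same.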
Finally, the portal should integrate explainable AI frameworks, allowing model developers to quantify data sufficiency and bias before deployment or publication. Automated metadata validation would flag underrepresented biological classes or geographic regions, helping guide future data-collection priorities. This approach would not only improve efficiency by reducing redundant data collection, but also establish a continuous feedback loop between data users and data stewards. The resulting infrastructure would position the United States as both a trusted custodian of global biodata and a leader in secure, transparent AI model development across all biological domains.
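The automated metadata validation described above could be as simple as a representation check over dataset metadata. This sketch uses hypothetical field names and an assumed threshold; real FAIR metadata schemas would be richer.

```python
# Hypothetical metadata-validation sketch: flag biological classes or
# geographic regions whose share of records falls below a threshold,
# so data stewards can prioritize future collection.
from collections import Counter

def flag_underrepresented(records, field, min_share=0.10):
    """Return values of `field` whose share of records is below min_share."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total < min_share)

# Illustrative metadata records (field names are assumptions, not a standard).
records = (
    [{"taxon": "maize", "region": "midwest"}] * 60
    + [{"taxon": "wheat", "region": "plains"}] * 35
    + [{"taxon": "sorghum", "region": "southeast"}] * 5
)

print(flag_underrepresented(records, "taxon"))   # ['sorghum']
print(flag_underrepresented(records, "region"))  # ['southeast']
```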
Targeted Data Generation: Commissioned Biological Datasets for AI
Infrastructure alone is insufficient without high-quality data designed specifically for AI. The United States should commission biological datasets from U.S. national laboratories, the National Institute of Standards and Technology (NIST), and core academic facilities that are built for model training. This would differ from the current approach to biodata generation, in which hypothesis-led research agendas—or niche biological questions—determine the types of data collected. Commissioned datasets would broaden the focus of the data, helping to increase the quality and diversity of biodata and create higher performing, more predictive models.
Commissioned datasets should aim to capture enough biological diversity to support generalizable models and expand beyond biomedical applications. Integrating genomic, transcriptomic, and high-frequency phenotypic data from plants exposed to variable climates, for instance, could enable AI models capable of forecasting crop resilience and guiding gene-editing strategies for climate adaptation. Likewise, datasets intentionally designed to link currently unconnected data silos could power AI systems that optimize fermentation and scale-up for bio-based materials and fuels. These examples, though not exhaustive, demonstrate how carefully designed datasets can translate into practical innovations across multiple sectors.
Data should be generated under standardized, auditable conditions, comparable to Good Manufacturing Practice (GMP) specifications. Access should then be provided through secure, tightly controlled agreements to vetted U.S. researchers and companies. Data generation must also avoid unnecessary redundancy. For example, repeatedly including the same protein from thousands of well-known species adds little predictive value, while costing just as much as producing novel, non-redundant data.
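The redundancy problem above is routinely handled with sequence filtering before new data are commissioned. This toy sketch removes only exact duplicates; real pipelines cluster by similarity (for example, at 90 percent sequence identity) with tools built for that purpose. The example sequences are fabricated fragments.

```python
# Sketch of redundancy filtering before commissioning new data: drop
# sequences already represented so budget goes to novel entries.
# Real pipelines cluster by similarity; this version is exact-match only.

def nonredundant(sequences):
    """Keep the first occurrence of each unique sequence, preserving order."""
    seen = set()
    kept = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept

# Hypothetical proposed protein fragments, with duplicates.
proposed = ["MKTAYIAK", "MKTAYIAK", "GSHMLEDP", "MKTAYIAK", "AVLQSGFR"]
print(nonredundant(proposed))  # ['MKTAYIAK', 'GSHMLEDP', 'AVLQSGFR']
```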
Targeted data generation would help the United States overcome the security, structure, and interoperability challenges that prevent researchers and engineers from fully capitalizing on the AI-biotechnology nexus. Commissioning datasets specifically for AI can help the United States promote innovation across the entire spectrum of biotechnology, ensuring U.S. global leadership in this critical field.
The Stakes
Biodata is the fuel of biotechnology innovation and security in the twenty-first century. Without deliberate investment in AI-ready biodata, collected at scale to ensure diversity, curated for quality, and protected by design, the United States risks losing its competitive edge in both biotechnology and AI to adversaries. To maintain its lead and unlock the benefits of AI-enabled biotech, the United States must treat biodata as critical national infrastructure—secure, standardized, and strategically governed across health and non-health domains alike.
– Michelle Holko, Sam Howell, and John Wilbanks, Published courtesy of Lawfare.