
We are working diligently to deliver PALO ALTO RESEARCH services to clients; please check this site frequently.
Palo Alto Research connects over 5,000 senior engineers, researchers and experts to serve our clients with research, development, design, analysis, consulting and engineering services in the ICT (information and communications technology) field, as well as business experts in account management, channel sales, presales engineering, technical architecture and training across various business sectors. Palo Alto Research provides a one-stop solution for clients to build their platform ecosystems in the industry, and a solid foundation for the mission of developing cutting-edge IP and AI solutions for our clients.

Task Force for AI-Data Networking-Protocol (TF-AID-NP)
Prof. Willie W. LU, Chair and Principal Investigator, Palo Alto Research
Industry experts and executives are joining forces to upgrade the national infrastructure to support seamless AI data flow with trust across all networking nodes, including the wireline backbone and wireless transport, and to optimize AI data processing, including training and inference, among multiple data centers, between individual data centers and distributed edge acceleration nodes, and over the wireless links between mobile wireless users and the connecting AI processing nodes.
AI-Data Networking-Protocol (AID-NP) for National AI-Data Training and Inference super-Pool Infrastructure (both wireline backbone and wireless transport)
Prof. Willie W. LU, Principal Investigator and Chief Architect, Palo Alto Research
Research project mainly funded by West Lake® education and research services
Research project objective: upgrade the national infrastructure to support seamless AI data flow with trust across all networking nodes, including the wireline backbone and wireless transport, and optimize AI data processing, including training and inference, among multiple data centers and between individual data centers and distributed edge processing nodes, covering the entire infrastructure, both wireline backbone and wireless transport.
-----------------------------------
Tentative structure of the white paper for the AID-NP project from the China expert meeting
Chapter 1: SUMMARY
With the AI revolution rapidly accelerating, networking infrastructure lies at its core. Up to 80% of GPU consumption today is by the major cloud providers building massive AI clusters with hundreds of thousands of accelerators. This requires a transformational shift in AI-data-friendly networking capabilities to enable the immense bandwidth, ultra-low latency and seamless AI-optimized transport required for distributed AI training and inference among different datacenters, through different networks, across different locations. The demands of AI will continue to drive network throughput both within the data center and for AI data in networking nodes in the years ahead. Most people think NVIDIA = GPUs, but modern AI training is actually a networking problem. A single A100 can only hold ~50B parameters. Training large models requires splitting them across hundreds or thousands of GPUs in geographically distributed locations across Wide Area Networking (WAN) infrastructure.
In distributed AI training, GPUs constantly synchronize gradients, so no considerable end-to-end latency between GPUs can be tolerated. Furthermore, another major issue for the AI data flow is the wireless link between mobile wireless users and the connecting AI data flow servers, such as AI datacenters and/or distributed AI acceleration edge nodes sitting in local computer servers, virtual mobile servers or other processing units. At the mobile user side, the wireless transport between mobile devices and the data centers or edge processing nodes needs redefinition and redevelopment to support the ultra-low latency of AI Data Flow with Trust, where the innovative Open Wireless Architecture (OWA) Virtualization Platform is utilized to secure performance and efficiency. This AI-Native OWA Wireless Virtualization for the wireless link of mobile users is part of the subject AID-NP platform and infrastructure. The subject AID-NP also supports PET (Privacy Enhanced Technology), promoted by the OECD member states for finance, health and governmental information platforms. Palo Alto Research has a research project on this AI-Data Networking-Protocol (AID-NP) development among multiple datacenters, between individual datacenters and distributed edge processing nodes, and between wireless mobile devices and said datacenters or distributed edge processing nodes, especially for the National AI-Data Training and Inference super-Pool Infrastructure (NAID-TIPI). We hold a monthly expert panel discussion at Prof. Willie Lu's Cupertino house, or in a designated hillside park, with world-class networking technology experts and scientists rooted in the San Francisco Bay Area (aka Silicon Valley). The panel discussion is normally held in the afternoon of the first Sunday of the month, except when Prof. Lu is out of town on travel. For more information, send email to tf6g+subscribe@googlegroups.com.
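As a rough illustration of why gradient synchronization dominates, the back-of-envelope sketch below estimates per-step all-reduce traffic for a data-parallel job; the model size, gradient precision, link speed and RTT values are illustrative assumptions of this sketch, not figures from this report.

```python
# Back-of-envelope estimate of gradient-synchronization cost in
# data-parallel training. All numbers are illustrative assumptions.

def allreduce_bytes(params: int, bytes_per_grad: int = 2) -> int:
    # A ring all-reduce moves roughly 2x the gradient volume per GPU per step.
    return 2 * params * bytes_per_grad

def sync_cost(params: int, link_gbps: float, rtt_s: float) -> tuple[float, float]:
    # Returns (transfer seconds, per-round-trip latency seconds) for one sync.
    transfer = allreduce_bytes(params) * 8 / (link_gbps * 1e9)
    return transfer, rtt_s

# Assumed: 70B-parameter model, fp16 gradients, 400 Gb/s links.
t_lan, l_lan = sync_cost(70_000_000_000, link_gbps=400, rtt_s=10e-6)  # intra-DC
t_wan, l_wan = sync_cost(70_000_000_000, link_gbps=400, rtt_s=30e-3)  # cross-region
print(f"transfer: {t_lan:.2f} s; intra-DC RTT: {l_lan * 1e6:.0f} us; WAN RTT: {l_wan * 1e3:.0f} ms")
```

Holding the transfer time constant, every synchronization round trip over a WAN pays milliseconds instead of microseconds, which is why end-to-end latency between GPUs becomes the binding constraint.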
CHAPTER 2: TRADITIONAL TCP/IP PROTOCOL WAS NOT DESIGNED FOR AI-DATA TRANSPORT
Due to the unique demands of AI data workloads, which require extremely low latency, accurate synchronization and high throughput, current network protocols like traditional TCP/IP are neither sufficient nor effective in terms of transmission and transport performance, leading to a need for new, optimized protocols specifically designed for AI data transport. AI tokens require very low latency because in most AI applications, especially those involving real-time interactions among multiple datacenters in different locations, such as live translation, live understanding and live inference, a quick response time and accurate synchronization among the training and inference engines or agents located in multiple datacenters or edge acceleration nodes are crucial for a seamless user experience and optimal performance. Low latency also ensures that the AI model can process and generate responses to user inputs rapidly from different AI engines and agents, whether in different datacenters or different edge acceleration nodes, minimizing perceived delays and token loss, and maintaining a natural AI data flow. Last but not least, the packets transported by traditional TCP/IP are all human-generated data flows, including file data, email data, web data, and other user data as well as control data, signaling data and other network maintenance data. They do not need any verification of the source of the data, which is all produced by Internet users.
However, in the era of AI data workloads, large amounts of data are generated by AI engines, agents and accelerators through AI training and inference models, and through AI data flow transport among different datacenters or edge acceleration nodes. Hence, Data Flow with Trust by Humans (DFTH) becomes extremely essential in both private and public information transport, especially for government information infrastructure. TCP/IP has no way to support the DFTH mechanism. TCP/IP also requires round-trip acknowledgments of packet transmission, causing long latency and low networking efficiency. The traditional Internet infrastructure focuses on reliable data delivery rather than seamless AI training and inference, and TCP/IP was developed for that purpose. Though other protocols such as UDP were developed to support real-time packet applications, their performance remains far from the system requirements of the AI training and inference infrastructure. Hence, TCP (and UDP) dramatically slows the rate of AI data transfer.
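The acknowledgment round trips described above can be quantified with the classic window/RTT bound: a TCP flow's throughput cannot exceed its window size divided by the round-trip time, regardless of link capacity. The window size, RTT and link rate below are illustrative assumptions of this sketch.

```python
# Illustrative: TCP throughput is bounded by window / RTT, so over long
# round trips even a very fast link is underused without huge windows.

def tcp_max_gbps(window_bytes: int, rtt_s: float) -> float:
    # At most one full window can be in flight per round trip.
    return window_bytes * 8 / rtt_s / 1e9

# Assumed: a 4 MiB window over a 30 ms inter-datacenter RTT.
window = 4 * 1024 * 1024
print(f"max throughput: {tcp_max_gbps(window, 0.030):.2f} Gb/s")  # ~1.12 Gb/s

# Window needed to keep an assumed 400 Gb/s link busy at the same RTT:
needed = 400e9 * 0.030 / 8
print(f"window needed for 400 Gb/s: {needed / 1e9:.1f} GB")  # ~1.5 GB
```

Even before considering loss recovery, the ack-clocked design leaves almost all of a high-capacity WAN link idle unless buffers grow to gigabytes per flow.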
CHAPTER 3: EXISTING IMPROVED NETWORK PROTOCOLS ARE ALSO FAR FROM MEETING RAPIDLY DEVELOPING AI DATA FLOW AND TRANSPORT
"RDMA (remote direct memory access) over Converged Ethernet (RoCE)" and the emerging "Ultra Ethernet Transport (UET)" protocol developed by the Ultra Ethernet Consortium (UEC) are popularly proposed alternatives that attempt to support the AI data transport infrastructure. However, RDMA transmits data in chunks of large flows, and these large flows can cause unbalanced and over-burdened links. RDMA is also not designed for long transmission paths. UET, meanwhile, remains inside the walls of the Ethernet architecture: the Ethernet protocol is primarily designed for Local Area Networks (LANs) and is not optimally suited for large geographic areas, making it unsuitable for Wide Area Networks (WANs) due to limitations in many technical mechanisms, including but not limited to: transmission error correction, long medium-access latency, speeds decreasing with increased traffic, low reliability, capacity constraints, degradation through network switches and routers, bottlenecks in the underlying transmission modulation, high packet loss over long-distance transmission, and low signal-to-noise ratio over long-distance transmission.
CHAPTER 4: AI-DRIVEN IOT (INTERNET OF THINGS) NEEDS NEW NETWORKING PROTOCOL TO CONNECT BILLIONS OF IOT NODES
The integration of AI and IoT is driving the need for new networking protocols to efficiently connect and manage billions of IoT devices. This emerging paradigm, often referred to as AIoT (Artificial Intelligence of Things), presents unique challenges that traditional networking protocols struggle to address effectively.
Challenges with Current Networking Protocols
Existing networking protocols face several limitations when it comes to supporting AI-driven IoT environments:
CHAPTER 5: CONSIDERATION IN DEVELOPING NEW NETWORKING PROTOCOL FOR AIOT DATA
When developing new networking approaches and protocols, we need to consider:
Private Connectivity Fabric (PCF)
PCF is an innovative architecture designed to meet the demands of AI-driven networks:
AI-Enhanced Network Management
AI is being leveraged to improve network management and performance:
Adaptive Policies
To enhance security and performance in AIoT networks:
Blockchain Integration
Blockchain technology is being explored as a potential solution for enhancing security and privacy in AIoT environments:
Other Considerations
As AIoT continues to evolve, we can expect further developments in networking protocols:
The convergence of AI and IoT is driving significant changes in networking technologies. As these systems become more prevalent, new protocols and architectures will continue to emerge, addressing the unique challenges posed by connecting billions of intelligent devices. The future of AIoT networking will likely involve a combination of innovative technologies, including AI-driven management, blockchain integration, and adaptive security measures, to create more efficient, secure, and scalable networks.
CHAPTER 6: CONSIDERATION IN DEVELOPING NEW NETWORKING PROTOCOL FOR AI DATA BETWEEN MULTIPLE DATACENTERS
The development of new networking protocols for AI data transport between multiple datacenters is a critical area of focus as AI workloads continue to grow in scale and complexity. Several key considerations and approaches are emerging to address this challenge:
High-Bandwidth, Low-Latency Interconnects
A fundamental requirement for AI data transport between datacenters is extremely high bandwidth and low latency. This is driving innovations in fiber optic networking technology:
Hierarchical Synchronization
Given the varying distances between datacenters, a hierarchical approach to synchronizing AI model training across sites is being adopted:
Asynchronous and Decentralized Training
New AI training approaches are being developed to work more effectively across distributed infrastructure:
Intelligent Traffic Management
AI itself is being applied to optimize data flows between datacenters:
Enhanced Security Protocols
As AI data moves between datacenters, robust security is critical:
Edge Computing Integration
Edge datacenters are being incorporated into AI networking architectures:
The development of these new networking protocols and architectures is an active area of research and innovation. As AI models and datasets continue to grow, the ability to efficiently distribute training and inference across multiple datacenters will be crucial for scaling AI capabilities. This is driving significant investment in next-generation datacenter interconnect technologies and intelligent networking systems optimized for AI workloads.
CHAPTER 7: STATE OF THE ART OF AI-DATA INTERNETWORKING PROTOCOL FOR AI-DATA TRANSPORT BETWEEN DATACENTERS IN DIFFERENT LOCATIONS
The rise of AI applications has created new challenges for data center interconnects and networking protocols. To address the unique requirements of AI workloads, several advancements are being made in data center networking and interconnect technologies:
High-Speed Interconnects
The demand for higher bandwidth between data centers is driving the adoption of faster interconnect technologies:
These high-speed interconnects aim to reduce latency and increase throughput for AI data transport between geographically distributed data centers.
Energy-Efficient Designs
New transceiver designs are emerging to improve energy efficiency:
These innovations help data centers scale up their networking capabilities while managing power constraints.
AI-Optimized Networking Protocols
While not a single new protocol, several optimizations are being made to existing networking stacks, including improved congestion control algorithms tailored for bursty AI traffic patterns.
Edge Computing Integration
To reduce latency for certain AI applications, edge computing architectures are being incorporated:
Scalable Network Architectures
Data center network designs are evolving to better support AI workloads:
While there isn't a single new "AI-Data internetworking protocol" per se, the industry is adapting existing technologies and developing new optimizations to meet the unique demands of AI workloads. The focus is on increasing bandwidth, reducing latency, improving energy efficiency, and enhancing scalability across geographically distributed data centers.
CHAPTER 8: CURRENT WIDE AREA NETWORK (WAN) SYSTEMS DO NOT SUPPORT AI DATA TRANSPORT BETWEEN GEOGRAPHICALLY DISTRIBUTED DATA CENTERS
Current wide area network (WAN) switching and routing systems face real limitations when it comes to supporting AI data transport between geographically distributed data centers.
Current WAN Limitations for AI Workloads
Traditional WAN architectures were not designed with AI workloads in mind, which can lead to several challenges:
Emerging Solutions to the Current Problems
While current WAN systems have limitations, the networking industry is rapidly evolving to address these challenges:
AI-Driven SD-WAN
Software-defined WAN (SD-WAN) enhanced with AI capabilities is emerging as a potential solution:
Cloud-Native Networking
Cloud providers are developing specialized networking solutions optimized for AI workloads:
AI-Optimized Hardware
Network equipment manufacturers are developing hardware specifically designed to handle AI traffic:
AI-Optimized Broadband Wireless Access (BWA)
TCP/IP-oriented packet-switching networks do not support AI-data traffic due to serious latency and packet-loss issues. Traditional wireless-oriented circuit-switched transmission is instead the optimal way to support AI-data traffic, thanks to its low latency and better SNR over the air. The proposed BWA based on the OWA (Open Wireless Architecture) platform is the optimal path to such an AI-optimized BWA solution.
Future Outlook
The future of WAN for AI looks promising:
While current WAN systems may not fully support AI data transport between geographically distributed data centers, the rapid pace of innovation in networking technology is quickly closing this gap. As AI becomes increasingly central to business operations, we can expect to see continued advancements in WAN technologies specifically tailored to meet the unique demands of AI workloads.
CHAPTER 9: RF SOLUTIONS TO INTERCONNECT GEOGRAPHICALLY DISTRIBUTED DATA CENTERS FOR AI DATA TRANSPORT
RF over Fiber (RFoF) technology transmits radio frequency (RF) signals over optical fiber by converting analog RF signals into optical signals, transmitting them over fiber, and then converting them back to RF signals. In an RF-over-fiber architecture, a data-carrying, high-frequency RF signal is imposed on a lightwave signal before being transported over the optical link. RFoF solutions are built with open architectures that align to open standard suites such as the LCA and CMOSS. An RFoF solution comprises the following blocks:
1. RFoF high-SFDR links supporting 20 GHz and 40 GHz instantaneous bandwidths.
2. RF-to-optical conversion modules with optional signal-level control functionality.
3. An optical matrix for fast routing with n*M ports, enabling switching between any of the n optical inputs (and combinations thereof) and the M optical antenna outputs with reliable multi-fiber interfaces.
4. Optical-to-RF conversion antenna modules, each with managed RF power amplifiers to produce the desired RF level at each antenna port.
5. A scalable modular design which allows the number of antennas M and the number of signals n to be upgraded with minimal changes to the system architecture.
6. A state-of-the-art management and monitoring system based on popular standard protocols.
7. Optional optical delay-line and modulation capabilities for the n input signals.
CHAPTER 10: AI-NATIVE OPEN WIRELESS ARCHITECTURE (OWA) WIRELESS TRANSPORT
EXISTING TELECOM INFRASTRUCTURE DOES NOT SUPPORT AI DATA FLOW WITH TRUST OVER WIRELESS LINKS
The existing wireless communication infrastructure was developed for a people-to-people communications topology in which human beings generally do not occupy the wireless transmission channels 24/7 (due to standby and sleep time, etc.), and so the wireless infrastructure is based on the Erlang model. Second, the existing wireless communication infrastructure was developed entirely within the traditional telecom infrastructure, which demands closed-architecture base stations, BSCs, MSCs, extended gateways and other network equipment, in full conflict with the evolving open computing architecture, open software architecture and open networking architecture. Since Steve Jobs launched the iPhone, the traditional telecom infrastructure has faced tremendous challenges in opening up its transmission nodes, networking nodes and system architecture. To meet the rapid challenges rising from the computer and software industries, the telecom industry has had no choice but to couple multiple existing infrastructures together to support open architecture, causing increasing complexity and low efficiency in system and infrastructure implementation. The wireless service model has been shifting rapidly from the people-to-people communication model to the new models of the Internet of Vehicles (IoV), the Internet of Things (IoT) and large-model AI data transport. These new models demand full utilization of the wireless transmission resources 24/7, as well as very low latency in real-time, strictly synchronized data flow over wireless links, which further challenges the existing telecom infrastructure in terms of performance and networking capability.
Open Wireless Architecture (OWA) was introduced to deliver open-architecture solutions in wireless local area networks and cellular wireless networks by constructing an independent OWA Virtualization Layer upon the various existing Radio Transmission Technology (RTT) radio interfaces, in order to create an open and compact platform for AI data transport over the wireless links of mobile users' devices.
OWA WIRELESS TRANSPORT TO SUPPORT ULTRA-LOW LATENCY OF AI DATA FLOW WITH TRUST OVER WIRELESS LINKS
Steve Jobs's most important contribution to the world was to open up the mobile device architecture, from the traditional carrier-centric, closed-architecture telecom device to an open platform converged with open computing architecture and open software architecture, so that developers of different mobile services and applications can build upon such open-architecture platforms for mobile users through their mobile devices. Apple's iPhone totally changed the rules of the game in the mobile device industry and kicked off the new era of open architecture in the wireless industry. At the mobile user side, the wireless transport between mobile devices and the data centers or edge processing nodes needs redefinition and redevelopment to support the ultra-low latency of AI Data Flow with Trust over the air, where the innovative Open Wireless Architecture (OWA) Virtualization Platform is utilized to secure performance and efficiency. This AI-Native OWA Wireless Virtualization of AI Data Flow for mobile devices is part of the subject AID-NP platform and infrastructure. As more and more AI data migrate from desktop computers to mobile devices (mobile phones, pads and laptops), efficient wireless transport of the AI data flow between mobile devices and the AI agents in datacenters or edge acceleration nodes becomes extremely important.
The current wireless network infrastructure, including cellular mobile networks and wireless local area networks, is not designed or optimized for AI data flow, which requires ultra-low latency. Over 90% of existing 4G and 5G cellular mobile networks and 100% of existing wireless local area networks are based on a packet-switching transmission mechanism. The packet-switched data are transported hop-by-hop across the entire Internet or among numerous routing nodes throughout the wide area networking infrastructure, causing lengthy delays and high latency in wireless transmission performance. Open Wireless Architecture (OWA) Virtualization is built upon the MAC/PHY layers of the underlying wireless transmission resources to separate the various Radio Transmission Technologies (RTTs) from the higher layers of data transport and service sessions, in full convergence of Open Computer Architecture (OCA), Open Network Architecture (ONA), Open Software Architecture (OSA) and Open Data Architecture (ODA), for the new generation of mobile device architectures including smart mobile phones, pads and laptops. The OWA Wireless Virtualization Platform manages the various RTTs in a cost-effective and spectrum-efficient way to optimize performance for the service sessions of wireless data transmission. OWA also employs a Virtual Mobile Server (VMS) for Mobile AI and Telecom GPT processing as an edge acceleration node performing AI-native calculating, processing, programming and computing tasks for open wireless transmission, signal processing and wireless networking among managed mobile devices and the VMS hosting server. The VMS connects to backbone AI datacenters and/or AI edge acceleration nodes through the innovative AI-Data Networking Protocol (AID-NP) to facilitate AI data flow with ultra-low latency.
This enables end-to-end low-latency AI data flow with trust between local and remote mobile devices across wireless and wireline networks, among the multiple datacenters and/or edge acceleration nodes of the entire wide area network of AI data flows. OWA effectively maps the available wireless transmission resources into two blocks:
1. the circuit-switched wireless block (CSWC), and
2. the packet-switched wireless block (PSWC).
OWA then converts the above CSWC and PSWC blocks into respective OWA Virtual Frames based on the defined Quality of Wireless transmission (QoW), in terms of data transmission latency and other wireless transmission parameters. The OWA Virtualization Platform then drives the underlying OWA Wireless Adaptation layer to port the specific RTTs, either circuit-switched or packet-switched, for the specific wireless data flow, whether AI data flow and transport or TCP/IP data flow and transport, across the available wireless networks in the area. Both circuit-switched and packet-switched wireless data transmissions are administered by the managed VMS AI server. A Circuit-Switched Optimizer (CSO) and a Packet-Switched Optimizer (PSO) sit above the OWA Virtualization Platform to ensure that the data flow is trustworthy and reliable. There are two main reasons to set up the CSO and PSO:
The CSO is extremely important because we buy ultra-low latency at the cost of low-efficiency wireless transmission in order to support quality AI data flow across the AI data infrastructure. Meanwhile, we still maintain highly efficient wireless transmission of TCP/IP data flow, with tolerable higher latency, through the PSO controller. Further, the CSO and PSO utilize different Error-Correction Mechanisms (ECMs) for data flow transmission over the wireless air links, which will be discussed in detail in the OWA training course.
OWA pushed the traditional telecom industry to open up its wireless infrastructure from a carrier-centric platform to a user-centric platform supporting open AI data flow and open IoT data flow across various RTT air interfaces, a revolutionary approach for the industry. OWA evolved from Software Defined Radio (SDR) back in the 2000s, but has been greatly improved to support wireless transport for the emerging Internet of Vehicles (IoV), Internet of Things (IoT) and AI data flow with trust (AI-DFT) through the innovative OWA Wireless Virtualization platform for mobile devices and the mobile wireless infrastructure in the era of AI and IoT. The OWA Wireless Virtualization platform is a new wireless access and adaptation layer supporting billions of wireless nodes for the emerging AI and IoT data flows. It also supports AI-native PETs (Privacy Enhanced Technologies) for finance, health and governmental information platforms. For further details of OWA Access Control and OWA Adaptation Control, please join the OWA technology training course scheduled twice a year in the heart of Silicon Valley in the San Francisco Bay Area. The OWA research and development has been very active throughout China, the U.S. and other countries under the leadership of Prof. Willie W. Lu, the PI and Chief Architect of the OWA platform and senior expert and delegate of OECD Missions for technology, regulations and policies in the sectors of ICT, Cybersecurity, AI, IoT and PET.
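As a sketch of the CSO/PSO split described above, the toy classifier below routes latency-critical AI flows onto circuit-switched wireless capacity and ordinary TCP/IP flows onto packet-switched capacity. The flow attributes and the latency threshold are assumptions of this sketch, not part of any OWA specification.

```python
# Toy sketch of the OWA CSO/PSO split: latency-critical AI data flows
# go to the circuit-switched block (CSWC), ordinary TCP/IP flows to the
# packet-switched block (PSWC). The threshold below is an assumption.
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    is_ai_data: bool
    latency_budget_ms: float

def assign_block(flow: Flow, cswc_threshold_ms: float = 5.0) -> str:
    # CSO handles AI flows with tight latency budgets; PSO handles the rest.
    if flow.is_ai_data and flow.latency_budget_ms <= cswc_threshold_ms:
        return "CSWC"  # circuit-switched: low latency, lower spectral efficiency
    return "PSWC"      # packet-switched: efficient, tolerable higher latency

flows = [
    Flow("gradient-sync", True, 1.0),
    Flow("web-browsing", False, 200.0),
]
for f in flows:
    print(f.name, "->", assign_block(f))
```

The two-way split mirrors the trade described in the text: the CSWC path buys latency with spectral efficiency, while the PSWC path keeps TCP/IP traffic efficient at the cost of delay.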
Chapter 11: Development of an Advanced AI-Data Networking Protocol
1. Scope and Objective
This extended report develops a detailed technical blueprint for an
Advanced AI‑Data Networking Protocol (AID‑NP), reflecting:
• The architectural vision and requirements described by the Task Force for AI-Data Networking-Protocol (TF-AID-NP) at Palo Alto Research (NAID-TIPI, AI-Data Switching/Routing/Interconnecting, OWA, RFoF, DFTH) [1].
• The current state of the art in relevant standardization bodies: UEC – Ultra Ethernet Consortium (UE Specification v1.0.x) [2]; OCP – Open Compute Project (ESUN 1.0, SUE-T/UALink ecosystem) [3]; IETF – Internet Engineering Task Force (AIIP, AIPREF) [4][5][9]; MCP – Model Context Protocol (specification and ecosystem) [6]; IEEE 802 / IEEE 802.1 – Nendica AICN & lossless DCN work [7][8].
The goal is not just to summarize these, but to turn them into an
actionable design and roadmap for AID‑NP, indicating what can be
standardized now, what requires experimental work, and how to align
multiple consortia.
2. AI‑Data Networking vs Traditional Networking: Detailed
Requirements
2.1 Traffic Characteristics of AI Workloads
Modern AI training and inference pipelines differ significantly from
traditional transaction or bulk‑transfer workloads:
• Collective Communication Patterns
• Token-Centric Data
• High Burstiness with Predictable Phases
• Geographically Distributed Training
2.2 Constraints of Current Protocols (TCP/IP, RoCE, InfiniBand‑over‑Ethernet)
1. TCP
2. UDP
3. RDMA (RoCE, iWARP, InfiniBand)
4. WAN Protocols
TF-AID-NP's position is that this gap cannot be bridged by "patching TCP/IP" alone, and calls for a new AI-Data Networking Protocol suite tailored to AI's token-centric, trust-centric requirements [1].
3. AID‑NP Conceptual Stack (NAID‑TIPI Context)
3.1 NAID‑TIPI: National AI‑Data Training & Inference Super‑Pool
NAID‑TIPI (National AI‑Data Training and Inference super‑Pool
Infrastructure) [1]:
• Interconnects: campus and metro AI clusters; regional and national data centers; edge computing sites (micro-DCs, base stations, VMS nodes).
• Covers: wireline backbone – DWDM, RFoF, hollow-core fiber; wireless transport – via Open Wireless Architecture (OWA).
AID‑NP provides a common AI‑native control and data plane across
this environment, supporting both training and inference.
3.2 AID‑NP Components (as described in TF‑AID‑NP and extended here)
4. AI‑Data Switching Protocol (AI‑SP) 每 Detailed Design
4.1 Goals
• Sub-10 µs end-to-end latency per hop for AI token-flows.
• Strictly lossless semantics for designated AI classes (e.g., gradient flows).
• Header minimalism: replace multi-layer IP/UDP/TCP headers with a compact AI-aware header.
• Hardware-friendly: implementable in current switch silicon oriented toward UEC/ESUN.
4.2 Header Format (Aligned with ESUN / UEC 4‑Byte Header)
Recent OCP ESUN 1.0 work defines a 4‑byte ESUN Header (EH) that
replaces the standard IP/UDP stack and carries:
• EH-ECN – congestion feedback.
• EH-CoS – Class of Service.
• ESUN Flow Label – used for load-balancing.
• TTL – Time-to-Live loop detection [3].
AID-NP can directly adopt and extend this layout as the AI-SP header.
This header is inserted after the Ethernet MAC header, but before
any higher‑layer payload (RDMA‑like operations, AI‑collective PDUs,
or specialized transport).
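A minimal sketch of packing such a 4-byte AI-SP header follows; the field widths (2-bit ECN, 3-bit CoS, 19-bit flow label, 8-bit TTL) are illustrative assumptions of this sketch, since the actual ESUN/UEC bit layout is defined by the OCP and UEC specifications.

```python
# Sketch: pack/unpack a 4-byte AI-SP header. Field widths (ECN 2, CoS 3,
# flow label 19, TTL 8 bits = 32) are illustrative assumptions; the real
# ESUN header layout is defined by the OCP/UEC specifications.
import struct

def pack_aisp_header(ecn: int, cos: int, flow_label: int, ttl: int) -> bytes:
    assert 0 <= ecn < 4 and 0 <= cos < 8
    assert 0 <= flow_label < (1 << 19) and 0 <= ttl < 256
    word = (ecn << 30) | (cos << 27) | (flow_label << 8) | ttl
    return struct.pack("!I", word)  # network byte order, 4 bytes

def unpack_aisp_header(raw: bytes) -> tuple[int, int, int, int]:
    (word,) = struct.unpack("!I", raw)
    return word >> 30, (word >> 27) & 0x7, (word >> 8) & 0x7FFFF, word & 0xFF

hdr = pack_aisp_header(ecn=1, cos=6, flow_label=0x1234, ttl=64)
print(len(hdr), unpack_aisp_header(hdr))
```

The packed word would sit immediately after the Ethernet MAC header in place of the IP/UDP stack, as described above.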
4.3 Lossless Behavior and Congestion Control
ESUN 1.0 (aligned with UEC) requires support for PFC, CBFC, and LLR
[3][2]:
• PFC (Priority-based Flow Control) – per-priority pause frames; critical to lossless semantics for specific CoS.
• CBFC (Credit-Based Flow Control) – finer-grained buffer management; switch egress ports track credits for upstream senders.
• LLR (Link-Level Retry) – retransmits frames locally on a link upon error detection, without propagating loss to the fabric.
AI‑SP defines:
• One "lossless AI class" of service (e.g., CoS 6–7) with: PFC enabled and carefully tuned; CBFC enabled to avoid buffer starvation; LLR mandatory on all links carrying the class.
• One or more "elastic classes" for inference and best-effort AI traffic with standard ECN-based congestion management.
Integration with IEEE 802.1 AICN's congestion telemetry and enhanced signaling (e.g., CSIG) allows:
• Fine-grained observation of queue depths.
• Packets marked with quantized congestion levels.
• Feedback loops with AI-SP endpoints that can adjust send rates per flow label.
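The feedback loop in the last bullet can be sketched as a simple per-flow-label rate controller driven by quantized congestion levels; the quantization scale and the increase/decrease factors are assumptions of this sketch, not CSIG or UEC parameters.

```python
# Sketch of a per-flow-label rate controller reacting to quantized
# congestion feedback (0 = no congestion .. 3 = severe). The adjustment
# factors are illustrative assumptions, not values from CSIG or UEC.

class FlowRateController:
    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.rates: dict[int, float] = {}  # flow label -> current rate, Gb/s

    def on_feedback(self, flow_label: int, congestion_level: int) -> float:
        rate = self.rates.get(flow_label, self.line_rate)
        if congestion_level == 0:
            rate = min(self.line_rate, rate * 1.25)          # probe upward
        else:
            rate *= (1.0, 0.9, 0.7, 0.5)[congestion_level]   # graded decrease
        self.rates[flow_label] = rate
        return rate

ctl = FlowRateController(line_rate_gbps=400.0)
print(ctl.on_feedback(flow_label=7, congestion_level=3))  # 200.0
print(ctl.on_feedback(flow_label=7, congestion_level=0))  # 250.0
```

Because feedback is quantized rather than binary, the endpoint can back off proportionally to how deep the queues are, rather than oscillating between full rate and pause.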
5. AI‑Data Routing Protocol (AI‑RP) 每 Routing for AI Fabrics
5.1 Distinguishing Features
AI‑RP is not a simple rebrand of IP routing. Core differences:
1. Flow-Centric Routing
2. AI-Topology Awareness
3. Hierarchical Zones
4. Congestion-Aware Path Re-selection
5.2 Control‑Plane Operation
• AI-RP builds on a PCF-style SDN controller (see Section 7) that installs forwarding rules mapping Flow Label → next hop, with different paths per zone, and can pre-compute disjoint paths for resilience.
• Integration with AIIP naming: ai:// identifiers representing AI services/models (e.g., ai://gpu-cluster-12.regionX/naid-tipi) are resolved to manifests containing endpoint and policy information [4]. PCF translates these to AI-RP route instantiation decisions.
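A toy sketch of the name-to-route flow described above: resolve an ai:// identifier to a manifest, then derive a forwarding rule from it. The registry contents, manifest fields and rule shape are assumptions of this sketch; the AIIP draft defines the real resolution scheme.

```python
# Toy sketch: resolve an ai:// identifier to a manifest, then derive an
# AI-RP forwarding rule. The registry, manifest fields and rule shape
# are assumed for illustration only.
from urllib.parse import urlparse

# Assumed registry mapping ai:// names to endpoint/policy manifests.
REGISTRY = {
    "ai://gpu-cluster-12.regionX/naid-tipi": {
        "endpoint": "10.12.0.1",
        "zone": "regionX",
        "lossless_class": True,
    },
}

def resolve(name: str) -> dict:
    manifest = REGISTRY[name]
    host = urlparse(name).netloc  # e.g. "gpu-cluster-12.regionX"
    return {"host": host, **manifest}

def route_rule(manifest: dict, flow_label: int) -> dict:
    # PCF would install this mapping: flow label -> next hop within a zone.
    return {
        "match_flow_label": flow_label,
        "next_hop": manifest["endpoint"],
        "zone": manifest["zone"],
        "cos": 6 if manifest["lossless_class"] else 2,
    }

m = resolve("ai://gpu-cluster-12.regionX/naid-tipi")
print(route_rule(m, flow_label=0x1234))
```

The point of the indirection is that routing decisions key off the service identity and its policy, not off a bare IP address.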
6. AI-Data Interconnecting Protocol (AI-IP / AI-DC-P)
6.1 Function
AI-IP (or AI-DC-P in NAID-TIPI terminology) handles:
• Inter-DC training: cross-region gradient exchange.
• Federated learning: periodic model updates from edge sites.
• Distributed inference: spanning multiple DCs for resiliency and load.
6.2 Hierarchical Synchronization
TF-AID-NP emphasizes hierarchical synchronization [1]:
• Campuses (<1 km) – near-synchronous training with microsecond granularity.
• Regions (<100 km) – slightly relaxed synchronous semantics, tolerant of millisecond latency.
• Continents – asynchronous SGD and federated learning; infrequent but larger update packets.
AI-IP introduces specialized control messages:
• SYNC-BEACON – establishes epoch boundaries across tiers.
• MODEL-DELTA – compressed parameter updates.
• STALENESS-HINT – indicates the gradient staleness tolerated from a region.
These messages also carry AIPREF preference references (e.g., content's allowed use for training vs. output) to ensure cross-region updates respect content owners' policies [5][9].
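The three control messages and the AIPREF gating could take roughly the following shape. The field names, the preference vocabulary (`"allow"`/`"deny"`), and the `admissible` helper are all illustrative assumptions; AID-NP does not yet publish wire formats for these messages.

```python
# Minimal sketch of the AI-IP control messages described above, with an
# AIPREF preference reference carried on MODEL-DELTA. Field names and
# the preference vocabulary are assumptions, not a published format.

from dataclasses import dataclass, field

@dataclass
class SyncBeacon:
    epoch: int                 # epoch boundary across tiers
    tier: str                  # "campus" | "region" | "continent"

@dataclass
class StalenessHint:
    region: str
    max_staleness_epochs: int  # tolerated gradient staleness

@dataclass
class ModelDelta:
    epoch: int
    region: str
    payload: bytes             # compressed parameter update
    aipref: dict = field(default_factory=dict)  # e.g. {"train": "allow"}

def admissible(delta: ModelDelta, use: str) -> bool:
    """Gate a cross-region update on its carried AIPREF preference;
    default-deny when no preference is present for the requested use."""
    return delta.aipref.get(use, "deny") == "allow"

d = ModelDelta(epoch=42, region="regionX", payload=b"\x00" * 16,
               aipref={"train": "allow", "output": "deny"})
```

The default-deny choice in `admissible` is a design assumption; it matches the spirit of respecting content owners' policies, but AIPREF's actual defaults are set by the working group [5].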
6.3 Physical Realization
Per TF-AID-NP [1]:
• RF-over-Fiber (RFoF) modules: high Spurious Free Dynamic Range (SFDR) 20/40 GHz links with modular optical matrices.
• DWDM: multi-wavelength, high-capacity optical trunks.
• Hollow-core fiber: reduces latency versus traditional fiber.
Together, these technologies provide the bandwidth and latency targets for AI-IP's interconnect.
7. Private Connectivity Fabric (PCF) – AI-Aware SDN Layer
PCF is the unifying control fabric that ties together AI-SP, AI-RP, and AI-IP.
7.1 Features
1. API-Driven Network Slice Creation
2. AI-Enhanced Network Management
3. Adaptive Policies
4. Blockchain-Backed Identity
7.2 Relationship with IEEE 802.1 AICN
The AICN study item [7] and the earlier intelligent lossless DCN report [8] supply:
• Models of AI throughput vs. latency vs. availability.
• Identification of scale, efficiency, and availability challenges.
• Proposed technologies for congestion control, PFC storm mitigation, and buffer/headroom optimization.
PCF uses these findings to:
• Decide how many lossless classes to configure.
• Tune congestion algorithms for AI flows vs. background traffic.
• Optimize headroom based on measured queue utilization.
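The headroom-optimization point can be made concrete with the standard worst-case PFC headroom bound. The sketch below combines the textbook calculation (in-flight bytes during the pause round trip, plus a frame in transit each way) with a margin scaled by measured peak utilization; the 25% margin factor and the signal-propagation constant are illustrative assumptions, not values from [8].

```python
# Hedged sketch: sizing PFC headroom from link parameters plus measured
# peak queue utilization, in the spirit of the intelligent-buffer work
# in [8]. The margin factor (0.25) is an illustrative assumption.

def pfc_headroom_bytes(link_rate_gbps, cable_m, mtu_bytes=9216,
                       peak_util=0.0):
    """Worst-case bytes that can arrive after a PAUSE is sent, plus a
    margin scaled by the measured peak queue utilization (0.0..1.0)."""
    prop_delay_s = 2 * cable_m / 2e8      # round trip at ~2/3 c in fiber
    inflight = link_rate_gbps * 1e9 / 8 * prop_delay_s
    base = inflight + 2 * mtu_bytes       # + one MTU in transit each way
    margin = base * 0.25 * peak_util      # measurement-driven margin
    return int(base + margin)

# 400 Gb/s link, 100 m cable, 80% observed peak utilization:
h = pfc_headroom_bytes(link_rate_gbps=400, cable_m=100, peak_util=0.8)
```

Real deployments also account for switch processing delay and per-priority quanta, which this sketch omits for brevity.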
8. Wireless Edge: AI-Native Open Wireless Architecture (OWA)
8.1 Motivation
AI-enabled applications require mobile and IoT endpoints to stream sensor data and receive inference results under strict latency and security constraints:
• AR/VR.
• Industrial robotics.
• Smart-city sensing.
• Health monitoring.
OWA partitions wireless resources into:
1. Circuit-Switched Wireless Channels (CSWC) – reserved capacity for AI flows.
2. Packet-Switched Wireless Channels (PSWC) – best-effort channels for legacy traffic.
8.2 Architecture
• The OWA Platform & Virtualization Layer abstracts heterogeneous RATs (5G, Wi-Fi, private LTE, mmWave) into a common interface [1].
• Virtual Mobile Server (VMS) nodes act as edge compute, bridging wireless access to NAID-TIPI AI data centers.
• CSO (Circuit-Switched Optimizer):
  - Determines optimal spectrum and time allocation for AI data bursts.
  - Minimizes over-provisioning by predicting AI burst windows.
• PSO (Packet-Switched Optimizer):
  - Manages remaining capacity using classic packet scheduling and ECN.
• Error-Correction Mechanisms (ECM):
  - CSWC uses more aggressive FEC/ARQ to reduce residual error and latency.
  - PSWC uses conventional radio error correction.
OWA is a natural target for future IEEE 802.11/3GPP contributions; however, as of 2026 it is primarily at the architecture/prototype stage within TF-AID-NP [1].
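Since OWA is still at the architecture stage, the CSO/PSO split can only be sketched under stated assumptions. Below, burst prediction is a naive moving average and the 80% reservation cap is invented to keep PSWC from starving; a real CSO would be model-driven, as the architecture implies.

```python
# Illustrative sketch of the CSO/PSO capacity split: reserve CSWC
# capacity for predicted AI bursts, hand the remainder to the PSWC
# scheduler. The moving-average predictor and the 80% cap are
# assumptions, not part of the OWA design in [1].

from collections import deque

class CircuitSwitchedOptimizer:
    def __init__(self, channel_mhz=100.0, history=8):
        self.channel_mhz = channel_mhz
        self.bursts = deque(maxlen=history)  # recent AI burst demands (MHz)

    def observe_burst(self, demand_mhz):
        self.bursts.append(demand_mhz)

    def reserve(self):
        """CSWC reservation = recent average demand, capped at 80% of
        the channel so PSWC always retains some capacity."""
        if not self.bursts:
            return 0.0
        avg = sum(self.bursts) / len(self.bursts)
        return min(avg, 0.8 * self.channel_mhz)

    def pswc_capacity(self):
        """Whatever the CSO does not reserve is PSWC (best-effort)."""
        return self.channel_mhz - self.reserve()

cso = CircuitSwitchedOptimizer()
for demand in (30.0, 50.0, 40.0):
    cso.observe_burst(demand)
```

After observing bursts of 30, 50, and 40 MHz, this CSO reserves the 40 MHz average for CSWC and leaves 60 MHz to the PSO.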
9. Application and Policy Layers: AIIP, AIPREF, MCP
9.1 AIIP – AI Internet Protocol
AIIP is an IETF draft that defines:
AID-NP Use:
9.2 AIPREF – AI Preferences Working Group
The AIPREF WG is standardizing:
AID-NP Use:
9.3 MCP – Model Context Protocol
MCP is an open standard for connecting LLMs/agents to external tools and data [6]:
AID-NP Use:
Combined with AIIP:
• AIIP → naming (ai://service).
• MCP → invocation (structured RPC).
• AIPREF → policy gating.
• AID-NP → efficient, AI-aware transport.
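The four-layer division of labor above can be sketched as one call path. Every function body here is a stand-in assumption: the manifest shape, the preference check, and the RPC envelope are invented for illustration and are not the real AIIP, AIPREF, or MCP APIs.

```python
# Sketch of the layering: AIIP names the service, AIPREF gates the
# request, an MCP-style structured call carries it, and AID-NP (elided
# here) would transport it. All bodies are illustrative stubs.

def aiip_resolve(name):
    # Naming: ai://service -> endpoint + policy manifest (stub values).
    return {"endpoint": "10.0.0.9", "aipref": {"train": "allow"}}

def aipref_permits(manifest, use):
    # Policy gating: default-deny unless the use is explicitly allowed.
    return manifest["aipref"].get(use) == "allow"

def mcp_call(endpoint, tool, args):
    # Invocation: a structured RPC envelope, loosely MCP-shaped.
    return {"endpoint": endpoint, "tool": tool, "arguments": args}

def invoke(name, tool, args, use="train"):
    manifest = aiip_resolve(name)
    if not aipref_permits(manifest, use):
        raise PermissionError(f"AIPREF denies use '{use}'")
    request = mcp_call(manifest["endpoint"], tool, args)
    # AID-NP would carry `request` over the AI-aware transport here.
    return request

req = invoke("ai://gpu-cluster-12.regionX/naid-tipi",
             tool="embed", args={"text": "hello"})
```

The point of the sketch is the ordering: resolution before policy, policy before invocation, transport last.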
10. IEEE 802.1 AICN and Lossless DCN – Grounding Fabric Behavior
10.1 AICN Study Item
The AI Computing Network (AICN) Nendica study item [7]:
• Analyzes AI workloads: parallelism (data/model/pipeline) and collective communication.
• Identifies key metrics: scale (number of accelerators), efficiency (network utilization), and availability.
Key challenges:
• Scale: limited bandwidth on long-distance links; deadlock risk due to link-level flow control.
• Efficiency: load balancing, ECMP issues, packet spray, and congestion-control complexities [7].
10.2 Lossless DCN Report (Pre-Draft DCN Work)
The earlier IEEE 802 Nendica Lossless DCN report [8] covers:
• RDMA use (RoCE, iWARP, InfiniBand) in data centers.
• PFC storms and deadlock scenarios.
• Congestion-control tuning (ECN, QCN evolutions).
• Intelligent buffer management and headroom optimization.
How this informs AID-NP:
• AI-SP leverages these insights to avoid PFC storms (intelligent thresholds, deadlock detection), implement advanced ECN marking policies, and dimension buffers for AI training phases.
• AICN's final report is expected to recommend specific IEEE 802.1 amendments, which AID-NP can adopt as normative behavior for PCF and AI-SP/AI-RP.
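One of the "advanced ECN marking policies" mentioned above can be illustrated with a RED-style ramp that uses different thresholds per traffic class: marking the lossless AI class early keeps queues short enough that PFC rarely triggers. The threshold values below are illustrative assumptions, not figures from [7] or [8].

```python
# Hedged sketch: RED/ECN-style marking with per-class thresholds, in
# the spirit of the tuning discussed in [8]. Threshold values are
# illustrative assumptions.

import random

THRESHOLDS = {
    # (k_min, k_max): queue depth in KB where marking starts / saturates.
    "lossless-ai": (50, 200),    # mark early, before PFC would trigger
    "best-effort": (200, 800),   # tolerate deeper queues
}

def ecn_mark_probability(cls, queue_kb):
    """Linear marking ramp between the class's two thresholds."""
    k_min, k_max = THRESHOLDS[cls]
    if queue_kb <= k_min:
        return 0.0
    if queue_kb >= k_max:
        return 1.0
    return (queue_kb - k_min) / (k_max - k_min)

def should_mark(cls, queue_kb, rng=random.random):
    """Probabilistically decide whether to set CE on this packet."""
    return rng() < ecn_mark_probability(cls, queue_kb)
```

At a 125 KB queue, the AI class is already marking half its packets while best-effort traffic is not yet marked at all, which is the class separation the bullet list above calls for.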
11. End-to-End Example – Distributed Training Across NAID-TIPI
Scenario: a 2,000-GPU model-training job runs across:
• 4 data centers (500 GPUs each).
• 20 edge VMS nodes supplying streaming data.
• WAN segments using RFoF and DWDM.
Step-by-step:
1. Data Admission & Policy Check
2. Endpoint Discovery
3. Path & Slice Setup (PCF)
4. Training Iteration
5. Edge Participation via OWA
6. Model Update & Storage
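The six steps above can be sketched as a thin orchestration skeleton. Each step is a labeled stub; the sketch only shows the ordering and which earlier layers (AIPREF, ai:// naming, PCF, OWA) each step touches. Everything here is illustrative.

```python
# Illustrative skeleton of the six-step training workflow. Each entry
# is a stub standing in for the real mechanism named in the section.

def run_training_job(gpus=2000, dcs=4, edge_nodes=20):
    log = []
    log.append("1. admission: dataset AIPREF policies checked")
    log.append("2. discovery: ai:// endpoints resolved to manifests")
    log.append(f"3. slice: PCF installs lossless paths across {dcs} DCs")
    log.append(f"4. iterate: {gpus} GPUs exchange gradients per epoch")
    log.append(f"5. edge: {edge_nodes} OWA VMS nodes stream data on CSWC")
    log.append("6. update: MODEL-DELTA committed to model store")
    return log

steps = run_training_job()
```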
12. Development and Standardization Roadmap
12.1 Near Term (0–2 Years)
1. Profile AID-NP over Existing Standards
2. Leverage AICN Outputs
3. Integrate AIIP + MCP
4. Formalize DFTH & PET Semantics
12.2 Mid Term (2–4 Years)
1. Standards Proposals
2. AID-NP Standardization
3. Certification Programs
12.3 Long Term (4+ Years)
1. Unified AID-NP Standard Suite
2. Global Interoperable NAID-TIPI Deployments
13. Actionable Recommendations
1. For Network Operators / Cloud Providers
2. For Standards Contributors
3. For System Architects / Vendors
4. For TF-AID-NP
For more information about the Task Force for AI-Data Networking-Protocol, please visit:
https://paloaltoresearch.org/anp.htm
References
[1] AI-Data Networking Protocol (AID-NP) – Task Force for AI-Data Networking-Protocol (TF-AID-NP). https://paloaltoresearch.org/anp.htm
[2] Ultra Ethernet™ Specification v1.0.2. https://ultraethernet.org/wp-content/uploads/sites/20/2026/01/UE-Specification-1.0.2-1.pdf
[3] OCP ESUN – Network Operator Requirements Base Specification Rev 1.0 & ESUN Header Overview. https://www.opencompute.org/documents/ocp-esun-network-operator-requirements-base-specification-rev-1-0-final-pdf
[4] Architecture for the Artificial Intelligence Internet Protocol (AIIP). https://www.ietf.org/archive/id/draft-sogomonian-aiip-architecture-00.html
[5] AI Preferences (AIPREF) Working Group – Overview. https://datatracker.ietf.org/wg/aipref/about/
[6] Model Context Protocol – Specification & Introduction. https://modelcontextprotocol.io/
[7] Nendica Study Item: AI Computing Networks (AICN). https://1.ieee802.org/nendica-aicn/
[8] IEEE 802 Nendica Report: Intelligent Lossless Data Center Networks (Pre-Draft DCN Report). https://mentor.ieee.org/802.1/dcn/20/1-20-0030-00-ICne-pre-draft-dcn-report.pdf
[9] Evolution of the AI Preference Vocabulary during the Zürich Interim – AIPREF Working Group Note. https://openfuture.eu/wp-content/uploads/2026/01/251020Evolution_AI_preference_vocabulary_Zurich.pdf
About Prof. Willie W. LU, PI of the AID-NP and OWA projects
A former U.S. DARPA expert, former U.S. FCC expert, and former Stanford professor, Prof. Lu now leads Palo Alto Research and its research and development programs in advanced wireless technology, AI deep research, AI data flow, AI data networking, and cybersecurity. He is a renowned expert in wireless communications and the chief inventor of Open Wireless Architecture (OWA) technology. Chief Wireless Architect for over 25 years, Prof. Lu expanded his ICT expertise into AI data networking and infrastructure in 2008 and, after 15 years of intensive research on the subject, launched the present task force for the AI-data networking protocol.
Career and achievements: over three decades in Information and Communication Technologies (ICT), including:
1) Consulting professor at Stanford University, in charge of the Open Wireless Architecture (OWA) research program
2) Member of the Federal Communications Commission (FCC) Technological Advisory Council
3) Member of the DARPA Expert Committee on Advanced Wireless Technology
4) Member and delegate of the U.S. Delegation for OECD missions on technology, IP, and policy in AI data flow, wireless, cybersecurity, and IoT
5) Visiting professor at the Chinese University of Hong Kong
6) Chair professor at Zhejiang University, China
7) Chief Architect and Corporate Vice President at Infineon Technologies AG, and Chief Representative of Infineon China
8) CEO of the U.S. Center for Wireless Communications (USCWC, now merged into Palo Alto Research), Palo Alto, California
9) Chairman and CEO, Palo Alto Research, United States
Prof. Lu has also served as a senior technical advisor to 25 wireless communication authorities in more than ten countries.
The TF-AID-NP is independently organized and administered by West Lake education and research services, a division of Palo Alto Research.