Title: Deep Research: Is it Feasible to Crowdfund an AI Model with Cryptocurrency Incentives?
Author: Jeff Amico
Translator: TechFlow by Pionex
Introduction
During the COVID-19 pandemic, Folding@home reached a major milestone: the project amassed 2.4 exaFLOPS of computing capacity, contributed by 2 million volunteer devices around the world. That was fifteen times the processing power of the world's largest supercomputer at the time, and it allowed scientists to simulate COVID protein dynamics at scale. The work advanced our understanding of the virus and its pathology, particularly in the early stages of the pandemic.

Global distribution of Folding@home users, 2021
Folding@home has a long history of volunteer computing, using crowd-sourced computing resources to tackle large-scale problems. This concept gained widespread attention in the 1990s with SETI@home, which gathered over 5 million volunteer computers to search for extraterrestrial life. Since then, this idea has been applied to multiple fields, including astrophysics, molecular biology, mathematics, cryptography, and gaming. In each case, collective power enhanced the capabilities of individual projects far beyond what they could achieve alone. This drove progress, enabling research to be conducted in a more open and collaborative manner.
Many wonder whether this crowdsourcing model can be applied to deep learning. In other words, can we train a large neural network on a crowd of devices? Frontier model training is one of the most computationally intensive tasks in human history. As with many @home projects, the costs have grown beyond what anyone but the largest players can bear. This may hold back future progress, as we rely on fewer and fewer companies to make new breakthroughs, and it concentrates control of our AI systems in the hands of a few. Whatever your view of this technology, that is a future worth paying attention to.
Most critics dismiss decentralized training as incompatible with current training techniques. That view is increasingly outdated. New techniques have emerged that reduce the communication required between nodes, allowing efficient training even on devices with poor network connections. These include DiLoCo, SWARM Parallelism, lo-fi, and Decentralized Training of Foundation Models in Heterogeneous Environments, among others. Many of them are fault tolerant and support heterogeneous compute. There are also new architectures designed specifically for decentralized networks, including DiPaCo and Decentralized Mixture-of-Experts models.
We also see various cryptographic primitives beginning to mature, enabling networks to coordinate resources globally. These technologies support applications such as digital currencies, cross-border payments, and prediction markets. Unlike early volunteer projects, these networks can aggregate astonishing computing power, often several orders of magnitude larger than the largest envisioned cloud training clusters.
Together, these pieces make up a new paradigm for model training, one that takes full advantage of the world's computing resources, including the huge pool of edge devices that could be put to work if connected. It would lower the cost of most training workloads by introducing new competition. It could also unlock new forms of training, making model development collaborative and modular rather than siloed and monolithic. Models could source compute and data from the crowd, learning in real time. Individuals could own a piece of the models they help create. And researchers could share novel findings openly, without needing to monetize their discoveries to cover enormous compute budgets.
This report examines the current state and related costs of large-scale model training. It reviews past distributed computing efforts—from SETI to Folding to BOINC—as inspiration to explore alternative paths. The report discusses the historical challenges of decentralized training and turns to the latest breakthroughs that may help overcome these challenges. Finally, it summarizes the opportunities and challenges for the future.
Current State of Advanced Model Training
The cost of training frontier models is already out of reach for all but the largest players. This trend is not new, but it is becoming more acute as leading labs continue to push scaling assumptions. OpenAI reportedly spent over $3 billion on training this year. Anthropic predicts that $10 billion training runs will begin in 2025, and that $100 billion models are not far behind.

This trend has driven industry concentration, since only a handful of companies can afford to participate. It raises a central policy question for the future: are we comfortable with a world where all leading AI systems are controlled by one or two companies? It also limits the pace of progress, a fact that is evident in the research community, where smaller labs cannot afford the compute needed to scale their experiments. Industry leaders have pointed this out repeatedly:
Joe Spisak of Meta: To truly understand the capabilities of [model] architectures, you have to explore at scale, and I think that's what's missing in the current ecosystem. If you look at academia—academia has a lot of great talent, but they lack access to compute, and that's a problem because they have these great ideas but no way to pursue them at the level required.
Max Ryabinin of Together: The need for expensive hardware puts a lot of pressure on the research community. Most researchers cannot take part in developing large neural networks because the necessary experiments are far too costly for them. If we keep increasing model size by scaling up, eventually only a handful of organizations will be able to compete in this research.
Francois Chollet of Google: We know that large language models (LLMs) have not yet achieved artificial general intelligence (AGI). Meanwhile, progress toward AGI has stalled. The limitations we face with LLMs today are exactly the same ones we faced five years ago. We need new ideas and breakthroughs. I think the next breakthrough is likely to come from an outside team, while all the large labs are busy training ever-larger language models.
Some are skeptical of these concerns, believing that hardware improvements and cloud capital expenditure will solve the problem. But that seems unrealistic. On the one hand, by the end of this decade new generations of Nvidia chips will deliver far more FLOPs, perhaps 10 times today's H100, cutting the price per FLOP by 80-90%. Likewise, total FLOP supply is expected to grow roughly 20-fold by the end of the decade, alongside improvements in networking and related infrastructure. All of this will raise training efficiency per dollar.

Source: SemiAnalysis AI Cloud TCO Model
At the same time, total FLOP demand will also rise sharply as labs push to scale further. If the decade-long trend in training compute continues, a frontier training run is expected to require about 2e29 FLOPs by 2030. Training at that scale would take roughly 20 million H100-equivalent GPUs, based on current training runtimes and utilization rates. Assuming multiple frontier labs remain in the race, total FLOP demand will be a multiple of that figure, since overall supply would be split among them. EpochAI projects that we will need around 100 million H100-equivalent GPUs by then, roughly 50 times 2024 shipments. SemiAnalysis makes a similar forecast, expecting frontier training demand and GPU supply to grow roughly in lockstep over the period.
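As a rough sanity check on these figures, the sketch below recomputes the GPU count implied by a 2e29 FLOP run. The peak throughput, utilization rate, and run length are assumptions chosen for illustration, not numbers taken from EpochAI or SemiAnalysis.
```python
# Back-of-the-envelope check of the ~20 million H100-equivalent figure.
# Assumed values (not from the sources above): ~1e15 FLOPS peak per H100 in
# half precision, ~40% sustained utilization, and a ~300-day training run.
PEAK_FLOPS_PER_H100 = 1e15
UTILIZATION = 0.40
RUN_SECONDS = 300 * 24 * 3600

target_flops = 2e29                                # projected frontier run by 2030
flops_per_gpu = PEAK_FLOPS_PER_H100 * UTILIZATION * RUN_SECONDS
print(f"H100-equivalents needed: {target_flops / flops_per_gpu:,.0f}")   # ~19 million
```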
Capacity could become constrained for any number of reasons. Manufacturing bottlenecks routinely delay expected shipment schedules. We may fail to generate enough energy to power the data centers, or struggle to connect that energy to the grid. Growing scrutiny of capital expenditure could force the industry to scale back, and so on. At best, our current approach lets only a handful of companies continue to push research forward, and even that may not be enough.

Clearly, we need a new approach. Instead of endlessly expanding data centers, capital expenditure, and energy consumption in search of the next breakthrough, this approach would make efficient use of our existing infrastructure and scale flexibly as demand fluctuates. It would allow far more experimentation in research, because training runs would no longer need to justify returns on multimillion-dollar computing budgets. Freed from that constraint, we could move beyond the current large language model (LLM) paradigm, something many believe is necessary to reach artificial general intelligence (AGI). To understand what this alternative might look like, we can draw inspiration from past distributed computing efforts.
Crowd Computing: A Brief History
SETI@home popularized the concept in 1999, allowing millions of participants to analyze radio signals in search of extraterrestrial intelligence. SETI collected electromagnetic data from the Arecibo telescope, split it into batches, and sent it to users over the internet. Users analyzed the batches during their computers' idle time and sent back the results. No communication between users was required, and batches could be audited independently, making the work highly parallel. At its peak, SETI@home had over 5 million participants and more processing power than the largest supercomputer of its time. It finally shut down in March 2020, but its success inspired the volunteer computing movement that followed.
Folding@home continued this concept in 2000, using edge computing to simulate protein folding in diseases such as Alzheimer's, cancer, and Parkinson's. Volunteers simulated protein folding on their personal computers during idle time, helping researchers study how proteins misfold and lead to diseases. At different times in its history, its computing power exceeded that of the largest supercomputers, including in the late 2000s and during COVID, when it became the first distributed computing project to exceed one exaFLOPS. Since its inception, Folding's researchers have published over 200 peer-reviewed papers, each relying on volunteer computing power.
The Berkeley Open Infrastructure for Network Computing (BOINC) popularized this concept in 2002, providing a crowd computing platform for various research projects. It supported multiple projects including SETI@home and Folding@home, as well as new projects in astrophysics, molecular biology, mathematics, and cryptography. By 2024, BOINC listed 30 ongoing projects and nearly 1,000 published scientific papers generated by its computing network.
Outside of the research field, volunteer computing has been used to train game engines for Go (LeelaZero, KataGo) and chess (Stockfish, LeelaChessZero). LeelaZero was trained through volunteer computing from 2017 to 2021, enabling it to play over ten million games against itself, making it one of the strongest Go engines today. Similarly, Stockfish has been continuously trained on volunteer networks since 2013, making it one of the most popular and powerful chess engines.
Challenges in Deep Learning
But can we apply this model to deep learning? Can we network edge devices around the world to create a low-cost public training cluster? Consumer hardware, from Apple laptops to Nvidia gaming cards, keeps getting better at deep learning. In many cases these devices even beat data center GPUs on performance per dollar.

However, to effectively utilize these resources in a distributed environment, we need to overcome various challenges.
First, current distributed training techniques assume frequent communication between nodes.
The most advanced models have become so large that training must be split among thousands of GPUs. This is achieved through various parallelization techniques, often splitting the model, dataset, or both among available GPUs. This typically requires a high-bandwidth, low-latency network, or else nodes will be idle, waiting for data to arrive.
For example, Distributed Data Parallel (DDP) distributes the dataset across GPUs, with each GPU training the complete model on its specific data shard and then sharing its gradient updates to generate new model weights at each step. This requires relatively limited communication overhead, as nodes only share gradient updates after each backward pass, and collective communication operations can partially overlap with computation. However, this approach is only suitable for smaller models, as it requires each GPU to store the entire model's weights, activations, and optimizer state in memory. For example, GPT-4 requires over 10TB of memory during training, while a single H100 has only 80GB.
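To make the pattern concrete, here is a minimal PyTorch sketch of data-parallel training. The model, dimensions, and launch command are placeholders for illustration; a real run would use the NCCL backend on GPUs and a proper data loader with per-rank shards.
```python
# Minimal data-parallel sketch: every process holds a full model replica,
# trains on its own shard of data, and gradients are all-reduced each step.
# Launch with e.g. `torchrun --nproc_per_node=2 ddp_sketch.py` (file name
# is illustrative); use backend="nccl" on a real GPU cluster.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")       # CPU backend for the sketch
    model = torch.nn.Linear(512, 512)             # stand-in for a real model
    ddp_model = DDP(model)                        # hooks in the gradient all-reduce
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.randn(32, 512)                  # each rank would load its own shard
        loss = ddp_model(x).pow(2).mean()
        loss.backward()                           # gradients averaged across ranks here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```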
To address this, various techniques have been developed to split the model itself across GPUs. For example, tensor parallelism splits the weights within a single layer across GPUs; each GPU performs its share of the computation and passes its output to the others. This reduces each GPU's memory requirements but demands constant communication between them, and therefore high-bandwidth, low-latency connections to stay efficient.
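The toy example below illustrates the column-wise weight split that tensor parallelism relies on, using plain tensors on one machine. In a real system each shard would live on a different GPU and the concatenation step would be a cross-device all-gather.
```python
# Toy illustration of (column-wise) tensor parallelism for one linear layer.
# Two "workers" each hold half of the weight columns; in a real system each
# shard lives on a different GPU and the outputs are exchanged over the network.
import torch

d_in, d_out = 8, 6
x = torch.randn(4, d_in)                    # a batch of activations
W = torch.randn(d_in, d_out)                # full weight matrix, for reference

W_shard_0, W_shard_1 = W.chunk(2, dim=1)    # split columns across two workers

y0 = x @ W_shard_0                          # computed on "worker 0"
y1 = x @ W_shard_1                          # computed on "worker 1"
y = torch.cat([y0, y1], dim=1)              # the all-gather of partial outputs

assert torch.allclose(y, x @ W, atol=1e-6)  # matches the unsharded layer
```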
Pipeline parallelism allocates the layers of the model to different GPUs, with each GPU performing its work and sharing updates with the next GPU in the pipeline. While this requires less communication than tensor parallelism, it may lead to "bubbles" (i.e., idle time), where GPUs further down the pipeline wait for information from preceding GPUs to start their work.
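Below is a toy sketch of pipeline parallelism with two stages and micro-batching, again on one machine for readability. In practice each stage runs on a different GPU or server, the hand-off is a network transfer, and the idle time while later stages wait is exactly the "bubble" described above.
```python
# Toy pipeline-parallel sketch: the model's layers are split into two stages
# and micro-batches flow from stage 0 to stage 1. In a real deployment each
# stage sits on a different GPU or server, and "bubbles" appear while later
# stages wait for their first micro-batch to arrive.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # first half of the layers
stage1 = nn.Sequential(nn.Linear(32, 4))               # second half of the layers

batch = torch.randn(8, 16)
micro_batches = batch.chunk(4)            # smaller chunks shrink the idle time

outputs = []
for mb in micro_batches:
    hidden = stage0(mb)                   # would be sent over the network
    outputs.append(stage1(hidden))        # the next stage consumes it
result = torch.cat(outputs)
print(result.shape)                       # torch.Size([8, 4])
```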
To address these challenges, further techniques have been developed. For example, ZeRO (Zero Redundancy Optimizer) is a memory optimization that trades additional communication for lower memory usage, allowing larger models to be trained on a given set of devices. ZeRO cuts memory requirements by partitioning model parameters, gradients, and optimizer state across GPUs, but relies on heavy communication so that devices can access the partitioned data. It forms the basis of popular techniques such as Fully Sharded Data Parallelism (FSDP) and DeepSpeed.
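The sketch below shows the ZeRO-style sharding idea through PyTorch's FSDP wrapper. It assumes a multi-GPU node and uses placeholder model sizes; it is meant to illustrate the memory/communication trade-off rather than reproduce any particular training setup.
```python
# ZeRO-style sharding via PyTorch's FSDP wrapper. Assumes a multi-GPU node,
# launched with e.g. `torchrun --nproc_per_node=2 fsdp_sketch.py` (file name
# is illustrative). Sizes are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).cuda()

    # Each rank stores only a shard of parameters, gradients, and optimizer
    # state; full parameters are gathered on the fly during forward/backward.
    sharded = FSDP(model)
    opt = torch.optim.AdamW(sharded.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = sharded(x).pow(2).mean()
    loss.backward()                    # gradients are reduce-scattered back to shards
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```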
These techniques are often combined in large model training to maximize resource utilization efficiency, known as 3D parallelism. In this configuration, tensor parallelism is typically used to allocate weights within a single server to various GPUs, as a significant amount of communication is required between each split layer. Then, pipeline parallelism is used to allocate layers between different servers (but within the same island in the data center), as it requires less communication. Subsequently, data parallelism or Fully Sharded Data Parallelism is used to split the dataset between different server islands, as it can adapt to longer network latencies through asynchronous sharing of updates and/or gradient compression. Meta uses this combined approach to train Llama 3.1, as shown in the diagram below.
These methods pose core challenges for decentralized training networks, which rely on devices connected over (slower and more variable) consumer-grade internet. In that setting, communication costs quickly outweigh the benefits of edge computing, because devices sit idle waiting for data to arrive. As a simple illustration, distributed data parallel training of a 1-billion-parameter model in half precision requires each GPU to share 2GB of data at every optimization step. Over typical internet bandwidth (say, 1 gigabit per second), and assuming computation and communication do not overlap, transmitting the gradient update takes at least 16 seconds, producing significant idle time. Techniques like tensor parallelism, which need far more communication, would fare even worse.
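The estimate above is easy to reproduce; the short calculation below spells out the arithmetic.
```python
# Reproducing the communication estimate from the paragraph above.
params = 1e9                                  # 1-billion-parameter model
bytes_per_param = 2                           # half precision (FP16/BF16)
payload_bits = params * bytes_per_param * 8   # gradient payload per step
link_bps = 1e9                                # 1 Gbps consumer connection

print(f"{payload_bits / link_bps:.0f} seconds just to transmit gradients")  # 16 s
```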
Secondly, current training techniques lack fault tolerance. Like any distributed system, training clusters become more failure-prone as they grow. The problem is more severe in training, however, because our current techniques are largely synchronous: GPUs must work in lockstep to complete a run. The failure of a single GPU among thousands can halt the entire process, forcing the rest to restart from the most recent checkpoint. In some cases GPUs do not fail outright but slow down for various reasons, dragging down thousands of other GPUs in the cluster. At today's cluster sizes, that can mean tens of millions to billions of dollars in added cost.
Meta detailed these issues during its Llama training run, reporting over 400 unexpected interruptions, roughly eight per day. These were mainly attributed to hardware problems, such as GPU or host failures, and contributed to a GPU utilization of only 38-43%. OpenAI fared even worse while training GPT-4, at just 32-36% utilization, likewise due to frequent failures during the run.
In other words, frontier labs struggle to exceed 40% utilization even in fully optimized environments, with homogeneous, state-of-the-art hardware, networking, power, and cooling. The main culprits are hardware failures and network issues, which would be even more severe in edge training environments, where processing power, bandwidth, latency, and reliability are far more uneven. And decentralized networks must also contend with malicious actors who may try to disrupt the project or cheat on specific workloads for various reasons. Even SETI@home, a purely volunteer network, saw cheating by some participants.
Thirdly, training cutting-edge models requires massive computing power. While projects like SETI and Folding achieved impressive scales, they pale in comparison to the computing power required for today's cutting-edge training. GPT-4 was trained on a cluster of 20,000 A100s, with a peak throughput of 6.28 ExaFLOPS in half-precision. This is three times more powerful than Folding@home at its peak. Llama 405b was trained using 16,000 H100s, with a peak throughput of 15.8 ExaFLOPS, seven times more powerful than Folding's peak. With multiple labs planning to build clusters of over 100,000 H100s, this gap will only widen further, with each cluster's computing power reaching a staggering 99 ExaFLOPS.

This makes sense, as @home projects are volunteer-driven. Contributors donate their memory and processor cycles and bear the associated costs. This naturally limits their scale compared to commercial projects.
Recent Progress
While these issues have historically plagued decentralized training efforts, they seem no longer insurmountable. New training techniques have emerged that can reduce the communication requirements between nodes, enabling efficient training on internet-connected devices. Many of these techniques originate from large labs aiming to scale up model training, thus requiring efficient communication technologies across data centers. We have also seen advancements in fault-tolerant training methods and incentive systems that can support larger-scale training in edge environments.
Efficient Communication Technologies
DiLoCo is recent work from Google that reduces communication overhead by having each device optimize its model state locally before exchanging updates. The method (building on earlier federated learning research) matches the performance of traditional synchronous training while reducing communication between nodes by 500 times. It has since been replicated by other researchers and extended to larger models (over 10 billion parameters). It has also been extended to asynchronous training, meaning nodes can share updates at different times rather than all at once, which better suits the varying processing power and network speeds of edge hardware.
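A single-process sketch of the inner/outer structure behind DiLoCo-style training is shown below. The hyperparameters, model, and data are placeholders, and the outer Nesterov optimizer settings are assumptions for illustration; this mirrors the shape of the algorithm, not the authors' exact recipe.
```python
# Single-process sketch of a DiLoCo-style inner/outer loop. Each "worker"
# takes many local optimizer steps; only the averaged weight delta (a
# pseudo-gradient) is communicated and applied by an outer optimizer.
# Hyperparameters, model, and data are placeholders.
import copy
import torch

global_model = torch.nn.Linear(32, 1)
workers = [copy.deepcopy(global_model) for _ in range(4)]
outer_opt = torch.optim.SGD(global_model.parameters(),
                            lr=0.7, momentum=0.9, nesterov=True)  # assumed settings

for outer_round in range(5):
    for w in workers:
        w.load_state_dict(global_model.state_dict())    # start from shared weights
        inner_opt = torch.optim.AdamW(w.parameters(), lr=1e-3)
        for _ in range(50):                              # many local steps, no communication
            x = torch.randn(16, 32)
            loss = (w(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

    # Communicate once per round: average the weight deltas across workers
    # and treat the result as a gradient for the outer optimizer.
    for name, p in global_model.named_parameters():
        delta = torch.stack([dict(w.named_parameters())[name].data - p.data
                             for w in workers]).mean(0)
        p.grad = -delta
    outer_opt.step()
    outer_opt.zero_grad()
```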
Other data-parallel methods, such as lo-fi and DisTrO, aim to cut communication costs further. Lo-fi proposes fully local fine-tuning: nodes train independently and only exchange weights at the end. The approach matches baseline performance when fine-tuning language models with over 10 billion parameters while eliminating communication overhead entirely. In a preliminary report, DisTrO claims a new family of distributed optimizers that it believes can reduce communication requirements by four to five orders of magnitude, though the results have yet to be independently confirmed.
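The lo-fi idea is even simpler to sketch: nodes fine-tune completely independently, and the weights are averaged once at the end. The model, data, and hyperparameters below are placeholders, not those used in the paper.
```python
# Sketch of the lo-fi idea: nodes fine-tune fully independently and only the
# final weights are communicated and averaged once. Model, data, and
# hyperparameters are placeholders, not those used in the paper.
import copy
import torch
import torch.nn.functional as F

base = torch.nn.Linear(64, 2)
finetuned = []
for node in range(4):
    m = copy.deepcopy(base)                    # each node starts from the base model
    opt = torch.optim.SGD(m.parameters(), lr=1e-2)
    for _ in range(100):                       # zero communication while training
        x = torch.randn(8, 64)
        y = torch.randint(0, 2, (8,))
        loss = F.cross_entropy(m(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    finetuned.append(m)

# One-shot communication at the end: average the weights of all nodes.
merged = copy.deepcopy(base)
with torch.no_grad():
    for name, p in merged.named_parameters():
        p.copy_(torch.stack([dict(m.named_parameters())[name]
                             for m in finetuned]).mean(0))
```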
New model-parallel methods have also emerged that enable much larger scale. DiPaCo (also from Google) partitions a model into modules, each containing different expert blocks to be trained on specific tasks. Training data is then sharded by "path", the sequence of experts associated with each sample. Given a shard, each worker can train its paths almost entirely independently, apart from the communication needed for the modules shared across paths, which is handled by DiLoCo. This architecture cuts training time for a billion-parameter model by more than half.
SWARM Parallelism and Decentralized Training of Foundation Models in Heterogeneous Environments (DTFMHE) also propose model-parallel methods for large-scale training in heterogeneous environments. SWARM found that the communication constraints of pipeline parallelism become less binding as models grow, making it possible to train larger models effectively over lower bandwidth and higher latency. To apply this idea in a heterogeneous setting, they use temporary "pipeline connections" between nodes that can be updated on the fly at each iteration. A node can dynamically reroute its output to any peer in the next pipeline stage, so if one peer is faster than the others, or if a participant disconnects, the output is rerouted to keep training going, as long as each stage has at least one active participant. They used this approach to train a billion-parameter model on low-cost heterogeneous GPUs with slow interconnects (as shown in the figure below).
DTFMHE likewise proposes a novel scheduling algorithm, together with pipeline and data parallelism, to train large models on devices spread across 3 continents. Even though its network links are 100 times slower than a standard data-center fabric, the method is only 1.7 to 3.5 times slower than running standard DeepSpeed in a data center. As with SWARM, DTFMHE shows that communication costs can be effectively hidden as models grow, even in geographically distributed networks. Weaker links between nodes can be compensated for with techniques such as increasing hidden-layer sizes and adding more layers to each pipeline stage.
Fault Tolerance
Many of the aforementioned data parallel methods inherently have fault tolerance, as each node stores the entire model in memory. This redundancy typically means that nodes can continue to work independently even if other nodes fail. This is crucial for decentralized training, as nodes are often unreliable, heterogeneous, and may even be malicious. However, as mentioned earlier, pure data parallel methods are only suitable for smaller models, so the model size is constrained by the smallest node's memory capacity in the network.
To address the above issues, some have proposed fault-tolerant techniques applicable to model parallel (or mixed parallel) training. SWARM addresses peer node failures by prioritizing stable peer nodes with lower latency and rerouting pipeline stage tasks in case of failures. Other methods, such as Oobleck, employ a similar approach by creating multiple "pipeline templates" to provide redundancy to cope with partial node failures. While tested in data centers, Oobleck's approach provides strong reliability guarantees, which are equally applicable in decentralized environments.
We have also seen new model architectures, such as Decentralized Mixture-of-Experts (DMoE), designed specifically for fault-tolerant training in decentralized environments. Like traditional mixture-of-experts models, DMoE consists of multiple independent "expert" networks distributed across a set of worker nodes. It uses a distributed hash table to track and integrate asynchronous updates in a decentralized way. This mechanism (also used in SWARM) is robust to node failures: if some nodes fail or are unresponsive, their experts can simply be excluded from the averaging.
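A toy sketch of the fault-tolerance mechanism is shown below: a router picks the top-k experts for each input, and experts hosted on failed or unreachable nodes are simply excluded from the weighted average. The DHT-based peer discovery used by the real systems is replaced here by a static availability mask.
```python
# Toy sketch of fault-tolerant expert routing in the spirit of DMoE: tokens
# are routed to the top-k available experts, and experts hosted on failed or
# unreachable nodes are dropped from the weighted average. The DHT-based
# peer discovery of the real system is replaced by a static availability mask.
import torch
import torch.nn.functional as F

n_experts, d = 8, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
gate = torch.nn.Linear(d, n_experts)
alive = torch.tensor([1, 1, 0, 1, 1, 1, 0, 1], dtype=torch.bool)  # two nodes offline

def moe_forward(x, k=2):
    scores = gate(x)                                      # (batch, n_experts)
    scores = scores.masked_fill(~alive, float("-inf"))    # exclude dead experts
    weights, idx = scores.topk(k, dim=-1)
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for b in range(x.shape[0]):                           # mix the surviving experts
        for j in range(k):
            out[b] += weights[b, j] * experts[idx[b, j].item()](x[b])
    return out

print(moe_forward(torch.randn(4, d)).shape)               # torch.Size([4, 16])
```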
Scalability
Finally, the cryptographic incentive systems used by networks like Bitcoin and Ethereum can help reach the required scale. These networks crowdsource computation by paying contributors in a native asset that appreciates as adoption grows. The design rewards early contributors handsomely, and once the network reaches minimum viable scale, the rewards can taper off.
Indeed, this mechanism has various pitfalls that need to be avoided. The most significant pitfall is over-incentivizing supply without corresponding demand. Additionally, insufficient decentralization of the underlying network could lead to regulatory issues. However, when designed properly, decentralized incentive systems can achieve considerable scale over an extended period.
For example, Bitcoin's annual power consumption is roughly 150 terawatt-hours (TWh), about two orders of magnitude more than the largest AI training clusters currently envisioned (100,000 H100s running at full load for a year). For reference, OpenAI's GPT-4 was trained on 20,000 A100s, and Meta's flagship Llama 405B model on 16,000 H100s. Ethereum's power consumption at its peak was similarly around 70 TWh, spread across millions of GPUs. Even accounting for the rapid growth of AI data centers in the coming years, incentivized compute networks like these would still exceed their scale many times over.
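The rough arithmetic behind the "two orders of magnitude" comparison is spelled out below; the per-GPU power draw and datacenter overhead are assumptions, not figures from the text.
```python
# Rough arithmetic behind the "two orders of magnitude" comparison. The
# per-GPU power draw and datacenter overhead (PUE) are assumed values.
H100_WATTS = 700                 # assumed board power per H100
PUE = 1.3                        # assumed datacenter overhead
HOURS_PER_YEAR = 24 * 365

cluster_twh = 100_000 * H100_WATTS * PUE * HOURS_PER_YEAR / 1e12  # Wh -> TWh
print(f"100k-H100 cluster: ~{cluster_twh:.1f} TWh/yr")            # ~0.8 TWh
print(f"Bitcoin at ~150 TWh/yr is ~{150 / cluster_twh:.0f}x larger")
```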
Of course, not all compute is interchangeable, and training has unique requirements relative to mining that need to be considered. Nonetheless, these networks demonstrate the scale that such incentive mechanisms can achieve.
The Road Ahead
Connecting these pieces together, we can see the beginning of a new path forward.
Soon, new training technologies will enable us to surpass the limitations of data centers, as devices no longer need to be colocated to be effective. This will take time, as our current decentralized training methods are still at a smaller scale, primarily in the range of 1 to 2 billion parameters, much smaller than models like GPT-4. We need further breakthroughs to scale up these methods without sacrificing critical attributes like communication efficiency and fault tolerance. Alternatively, we need new model architectures that are different from today's large monolithic models—possibly smaller, more modular, and running on edge devices rather than in the cloud.
Regardless, it is reasonable to expect further progress in this direction. Our current methods are unsustainable, providing strong market incentives for innovation. We have already seen this trend, with manufacturers like Apple building more powerful edge devices to run more workloads locally rather than relying on the cloud. We also see increasing support for open-source solutions—even within companies like Meta—to promote more decentralized research and development. These trends will only accelerate over time.
At the same time, we need new network infrastructure to connect edge devices so they can be used this way. These devices include laptops, gaming desktops, and eventually even smartphones with high-performance GPUs and large memory. This would allow us to build a "global cluster" of low-cost, always-on computing power that can process training tasks in parallel. It is a challenging problem that requires progress across several domains.
We need better scheduling techniques for training in heterogeneous environments. Today there is no way to automatically parallelize a model for optimal training, especially when devices can disconnect or join at any time. This is a critical next step for optimizing training while retaining the scale advantages of edge-based networks.
We also need to address the general complexity of decentralized networks. To maximize scale, networks should be built as open protocols—a set of standards and instructions governing interactions between participants, much like TCP/IP for machine learning computation. This would allow any device following specific specifications to connect to the network, regardless of owner and location. It also ensures the network remains neutral, allowing users to train the models they prefer.
Maximizing scale in this way also requires a mechanism to verify the correctness of all training work without relying on a single entity. This is crucial, since there are inherent incentives to cheat, such as claiming payment for a training task that was never actually performed. It is made harder by the fact that different devices often execute machine-learning operations differently, which makes straightforward replication checks unreliable. Solving this properly requires deep work in cryptography and other disciplines.
Fortunately, we continue to see progress across all of these areas. The challenges no longer look as insurmountable as they did a few years ago, and they are small relative to the opportunity. Google put it best in the DiPaCo paper, noting the potential for decentralized training to break this negative feedback loop:
Progress in distributed training of machine learning models may facilitate simpler infrastructure, ultimately leading to more widely available computational resources. Today, infrastructure is designed around the standard approach of training large monolithic models, and machine learning model architectures are in turn designed to exploit the current infrastructure and training methods. This feedback loop may trap the community in a misleading local minimum, where computational resources are constrained more than is actually necessary.
Perhaps most excitingly, there is a growing enthusiasm in the research community to address these challenges. Our team at Gensyn is building the aforementioned network infrastructure. Teams like Hivemind and BigScience are applying many of these technologies in practice. Projects like Petals, sahajBERT, and Bloom showcase the capabilities of these technologies, along with the growing interest in community-based machine learning. Many others are also pushing research progress, aiming to build a more open, collaborative model training ecosystem. If you are interested in this work, please contact us to get involved.