GU Yonggen, LI Zhi, WU Xiaohong, TAO Jie
2026, 54(3): 285-295.
Federated learning is a machine learning paradigm designed to meet the needs of distributed data storage and privacy protection, in which clients collaboratively train a shared model. In practical applications, however, client data are often not independent and identically distributed (Non-IID). Federated learning randomly selects a subset of clients for training in each round, and the sampled clients typically fail to represent the global data distribution. This exacerbates the impact of Non-IID data on training, slowing global model training and reducing accuracy. To address this issue, a clustering-based federated learning client sampling algorithm (FedCG) is proposed. The core idea of the method is to increase the diversity of the training data in each round, thereby improving the training efficiency and accuracy of the model. First, the “representative gradient” of each client is used to calculate the inverse cosine similarity between clients, and this similarity is used for hierarchical clustering of the clients. Next, clients are grouped based on the clustering results. Finally, experiments are conducted on the standard datasets Fashion-MNIST, CIFAR-10, EMNIST-Letters, and EMNIST-Balanced. Compared with three baseline algorithms (FedAvg, FedProx, and CSS), FedCG significantly improves the test accuracy of the global model, with a maximum increase of 12%.
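The clustering and grouping steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: "inverse cosine similarity" is read as the arccosine of the cosine similarity (an angular distance), the hierarchical clustering uses average linkage, each client's representative gradient is a nonzero vector supplied by the caller, and one client is drawn uniformly at random from each cluster per round. The names `angular_distance`, `cluster_clients`, and `sample_clients` are hypothetical.

```python
import math
import random


def angular_distance(u, v):
    """Arccosine of the cosine similarity of two nonzero gradient vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos)))


def cluster_clients(gradients, num_clusters):
    """Agglomerative clustering with average linkage (assumed); returns lists of client indices."""
    n = len(gradients)
    dist = [[angular_distance(gradients[i], gradients[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per client
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


def sample_clients(clusters, rng):
    """Draw one client per cluster so each round covers every cluster (assumed policy)."""
    return [rng.choice(members) for members in clusters]


# Toy representative gradients: three pairs of clients pointing in three directions.
gradients = [
    [1.0, 0.1], [0.9, 0.2],     # clients 0 and 1
    [0.1, 1.0], [0.2, 0.9],     # clients 2 and 3
    [-1.0, -0.1], [-0.9, 0.0],  # clients 4 and 5
]
clusters = cluster_clients(gradients, num_clusters=3)
selected = sample_clients(clusters, random.Random(0))
```

In this toy setting, clients with similar gradient directions end up in the same cluster, so drawing one client from each cluster yields a round whose participants differ from one another more than a uniformly random draw would on average.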