Abstract
Multimodal Federated Learning (FL) combines two important research directions in IoT scenarios: exploiting complementary multimodal data to improve downstream inference performance, and training in a decentralized manner to preserve privacy. However, existing studies mainly apply FL methods after multimodal feature fusion and do not address multimodal FL across both the feature space and the sample space. In addition, a tension persists between the computational demands of processing multimodal information and the limited computing resources of IoT systems. To tackle this challenge, we propose a Joint Horizontal and Vertical (JHV) FL algorithm tailored for multimodal IoT systems. JHV employs vertical FL to distribute computing tasks across multimodal IoT devices (feature space) and horizontal FL to allocate tasks across multiple silos (sample space). Experimental results on two public multimodal datasets show that JHV outperforms three baseline methods, demonstrating its effectiveness for multimodal IoT systems, particularly for downstream tasks such as classification and prediction that must be both fast and accurate.