The performance of a tracking system is greatly increased if multiple types of sensors are combined to achieve the objective of the tracking instead of relying on single type of sensor. To conduct the multi-modal tracking, we have previously developed a multi-modal sensor-based tracking model where acoustic sensors mainly track the objects and visual sensors compensate the tracking errors [1]. In this paper, we find a network synchronization problem appearing in the developed tracking system. The problem is caused by the different location and traffic characteristics of multi-modal sensors and non-synchronized arrival of the captured sensor data at a processing server. To effectively deliver the sensor data, we propose a time-based packet aggregation algorithm where the acoustic sensor data are aggregated based on the sampling time and sent to the server. The delivered acoustic sensor data is then compensated by visual images to correct the tracking errors and such a compensation process improves the tracking accuracy in ideal case. However, in real situations, the tracking improvement from visual compensation can be severely degraded due to the aforementioned network synchronization problem, the impact of which is analyzed by simulations in this paper. To resolve the network synchronization problem, we differentiate the service level of sensor traffic based on Weight Round Robin (WRR) scheduling at the routers. The weighting factor allocated to each queue is calculated by a proposed Delay-based Weight Allocation (DWA) algorithm. From the simulations, we show the traffic differentiation model can mitigate the non-synchronization of sensor data. Finally, we analyze expected traffic behaviors of the tracking system in terms of acoustic sampling interval and visual image size.