If I set experiment_mode to a value other than "simulation" (for example, "standalone"), FedScale fails to run. The log from a run with my femnist_cluster.yml is:
2023-12-27 14:39:19.056225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:19.152964: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:19.480238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480272: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:19 INFO [aggregator.py:44] Job args Namespace(adam_epsilon=1e-08, backbone='./resnet50.pth', backend='gloo', batch_size=20, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cfg_file='./utils/rcnn/cfgs/res101.yml', clf_block_size=32, clip_bound=0.9, clip_threshold=3.0, clock_factor=2.4368231046931412, conf_path='~/dataset/', connection_timeout=60, cuda_device=None, cut_off_util=0.05, data_cache='', data_dir='/home/whr/code/FedScale/benchmark/dataset/data/femnist', data_map_file='/home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv', data_set='femnist', decay_factor=0.98, decay_round=10, device_avail_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_behave_trace', device_conf_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_device_capacity', dump_epoch=10000000000.0, embedding_file='glove.840B.300d.txt', engine='pytorch', epsilon=0.9, eval_interval=10, executor_configs='192.168.124.104:[1]=192.168.124.105:[1]=192.168.124.106:[1]', experiment_mode='standalone', exploration_alpha=0.3, exploration_decay=0.98, exploration_factor=0.9, exploration_min=0.3, filter_less=21, filter_more=1000000000000000.0, finetune=False, gamma=0.9, gradient_policy=None, hidden_layers=7, hidden_size=256, input_dim=0, input_shape=[1, 3, 28, 28], job_name='femnist_cluster', labels_path='labels.json', learning_rate=0.05, line_by_line=False, local_steps=5, log_path='/home/whr/code/FedScale/benchmark', loss_decay=0.2, malicious_factor=1000000000000000.0, max_concurrency=10, max_staleness=5, memory_capacity=2000, min_learning_rate=5e-05, mlm=False, mlm_probability=0.15, model='resnet18', model_size=65536, model_zoo='torchcv', n_actions=2, n_states=4, noise_dir=None, noise_factor=0.1, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=62, num_classes=35, num_executors=3, num_loaders=2, num_participants=3, output_dim=0, overcommitment=1.0, overwrite_cache=False, pacer_delta=5, 
pacer_step=20, proxy_mu=0.1, ps_ip='192.168.124.102', ps_port='29500', qfed_q=1.0, rnn_type='lstm', round_penalty=2.0, round_threshold=30, rounds=1000, sample_mode='random', sample_rate=16000, sample_seed=233, sample_window=5.0, save_checkpoint=True, spec_augment=False, speed_volume_perturb=False, target_delta=0.0001, target_replace_iter=15, task='cv', test_bsz=20, test_manifest='data/test_manifest.csv', test_output_dir='./logs/server', test_ratio=1.0, test_size_file='', this_rank=0, time_stamp='1227_143917', train_manifest='data/train_manifest.csv', train_size_file='', train_uniform=False, use_cuda=True, vocab_tag_size=500, vocab_token_size=10000, wandb_token='', weight_decay=0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.9, yogi_beta2=0.99, yogi_eta=0.003, yogi_tau=1e-08)
(12-27) 14:39:20 INFO [aggregator.py:164] Initiating control plane communication ...
(12-27) 14:39:20 INFO [aggregator.py:188] %%%%%%%%%% Opening aggregator server using port [::]:29500 %%%%%%%%%%
(12-27) 14:39:20 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:20 INFO [aggregator.py:967] Start monitoring events ...
2023-12-27 14:39:31.090474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:31.169358: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:31.478808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478838: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:31 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:31 INFO [executor.py:77] (EXECUTOR:1) is setting up environ ...
(12-27) 14:39:32 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:32 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:32 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:32 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:32 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:32 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:32 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:32 INFO [aggregator.py:318] Received executor 1 information, 1/3
(12-27) 14:39:32 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:32 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 2799, 'total_num_samples': 637858}
2023-12-27 14:39:33.925569: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:34.012208: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:34.334770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:34 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:34 INFO [executor.py:77] (EXECUTOR:2) is setting up environ ...
2023-12-27 14:39:35.087146: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:35.167337: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(12-27) 14:39:35 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:35 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:35 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
2023-12-27 14:39:35.479452: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:35 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:35 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:35 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:35 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:35 INFO [aggregator.py:318] Received executor 2 information, 2/3
(12-27) 14:39:35 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:35 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 5598, 'total_num_samples': 1275716}
(12-27) 14:39:35 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:35 INFO [executor.py:77] (EXECUTOR:3) is setting up environ ...
(12-27) 14:39:36 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:36 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:36 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:36 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:36 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:36 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:36 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:36 INFO [aggregator.py:318] Received executor 3 information, 3/3
(12-27) 14:39:36 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:36 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 8397, 'total_num_samples': 1913574}
(12-27) 14:39:36 INFO [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO [aggregator.py:605] Selected participants to run: []
Apparently, no participants are selected to run (the log reports 0 clients online and 8397 clients offline), and the program is stuck at this point.
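For anyone trying to confirm the same symptom in their own run, here is a minimal sketch that pulls the online/offline client counts and the selected participant list out of a log. The regular expressions are based only on the two log lines shown above; the function name `summarize` and the `sample` string are my own and not part of FedScale.

```python
import re

# Patterns modeled on the log lines above (client_manager.py:195 and aggregator.py:605).
ONLINE_RE = re.compile(r"(\d+) clients online, (\d+) clients offline")
SELECTED_RE = re.compile(r"Selected participants to run: \[(.*)\]")


def summarize(log_text):
    """Return (online, offline, selected) taken from the last matching lines.

    Each value is None if no matching line was found in the log.
    """
    online = offline = selected = None
    for line in log_text.splitlines():
        m = ONLINE_RE.search(line)
        if m:
            online, offline = int(m.group(1)), int(m.group(2))
        m = SELECTED_RE.search(line)
        if m:
            body = m.group(1).strip()
            # An empty list in the log means no participant was chosen.
            selected = [s.strip() for s in body.split(",")] if body else []
    return online, offline, selected


# Two lines copied from the log in this report.
sample = (
    "(12-27) 14:39:36 INFO [client_manager.py:195] Wall clock time: 0, "
    "0 clients online, 8397 clients offline\n"
    "(12-27) 14:39:36 INFO [aggregator.py:605] Selected participants to run: []"
)
print(summarize(sample))
```

If the first value is 0 and the list is empty, the run is in the same state as the one reported here.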
Versions / Dependencies
FedScale: 7ec441c
Python: 3.7.16
OS: Ubuntu 20.04
Reproduction script
I put the aforementioned yml under $WORKDIR, so the starting command is:
python $WORKDIR/docker/driver.py submit $WORKDIR/femnist_cluster.yml
Issue Severity
None