
Reproducing the original UNISON paper

Before I understood the actual workflow

Small Test

The original test, following README.md:

Bash
./ns3 build dctcp-example dctcp-example-mtp
time ./ns3 run dctcp-example
time ./ns3 run dctcp-example-mtp

This takes roughly 4-5 minutes for dctcp-example and 1-2 minutes for dctcp-example-mtp.

Local Machine + Single Thread

Bash
 lscpu
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   12
  On-line CPU(s) list:    0-11
Vendor ID:                Apple
  Model:                  0
  Thread(s) per core:     1
  Core(s) per cluster:    12
  Socket(s):              -
  Cluster(s):             1
  Stepping:               0x0
  CPU max MHz:            2000.0000
  CPU min MHz:            2000.0000
  BogoMIPS:               48.00
  Flags:                  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb
                           dcpodp flagm2 frint
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Since lscpu reports Thread(s) per core: 1, hyper-threading (SMT) is not active on the local machine.
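The same check can be scripted so that every machine is classified the same way before a run. This is a minimal sketch, assuming lscpu-style text on standard input; the function names threads_per_core and smt_state are made up for illustration.

```shell
#!/bin/bash
# threads_per_core: read lscpu-style text on stdin and print the value of the
# "Thread(s) per core" field (prints nothing if the field is missing).
threads_per_core() {
  awk -F: '/Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# smt_state: print "off" when each core exposes exactly one hardware thread,
# "on" for any other value, and "unknown" when the field is absent.
smt_state() {
  local tpc
  tpc=$(threads_per_core)
  case "$tpc" in
    "") echo "unknown" ;;
    1)  echo "off" ;;
    *)  echo "on" ;;
  esac
}

# Usage on a Linux machine:
#   lscpu | smt_state
```

Reading from stdin rather than calling lscpu inside the function keeps it usable on saved lscpu dumps, such as the one pasted above.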

Bash
%% dctcp-example %%
real    5m15.132s
user    5m14.952s
sys 0m0.162s

%% dctcp-example-mtp %%
real    1m34.903s
user    6m16.114s
sys 0m1.078s

This is consistent with the data reported in the original paper.
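To compare runs across machines, it helps to turn the real lines printed by time into a single speedup number. The sketch below assumes the 5m15.132s format shown above; to_seconds and speedup are illustrative helper names, not part of ns-3 or UNISON.

```shell
#!/bin/bash
# to_seconds: convert a duration in the format printed by `time`
# (minutes, "m", seconds, "s", e.g. 5m15.132s) into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup BASELINE PARALLEL: ratio of the two wall-clock times,
# i.e. how many times faster the second run finished.
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

# Local machine, "real" times from the runs above:
speedup 5m15.132s 1m34.903s
```

Only the real (wall-clock) value is meaningful for speedup; the user time of the multi-threaded run is summed over all threads and is therefore larger than that of the single-threaded run.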

Sugon Server + Hyper Thread

Bash
%% dctcp-example %%
real    14m57.122s
user    14m56.322s
sys 0m0.853s

%% dctcp-example-mtp %%
real    5m14.145s
user    20m49.211s
sys 0m3.165s

Sugon Server + Single Thread

Bash
%% dctcp-example %%
real    15m54.774s
user    15m54.431s
sys     0m0.386s

%% dctcp-example-mtp %%
real    4m49.496s
user    19m12.724s
sys     0m1.938s

Medium Tests

Sugon, hyper-threading on

fat-tree-mtp-k2-c2: 46.0808s

fat-tree-mtp-k2-c4: 50.2891s

fat-tree-mtp-k2-c8: 53.13s

fat-tree-mtp-k2-c16: 77.3042s

fat-tree-ori-k2-c2: 91.5243s

fat-tree-ori-k2-c4: 156.968s

fat-tree-ori-k2-c8: 156.317s

fat-tree-ori-k2-c16: 202.152s

k    cluster    speedup (ori time / mtp time)
2    2          1.986
2    4          3.12
2    8          2.942
2    16         2.61
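The speedup column is the ori time divided by the mtp time for the same k and cluster. A small sketch that recomputes it from lines in the "name: time" form used in these notes follows; speedup_table is a hypothetical helper, and the input format is assumed to be exactly as written above.

```shell
#!/bin/bash
# speedup_table: read lines such as "fat-tree-mtp-k2-c2: 46.0808s" on stdin
# and print "k cluster speedup" as soon as both the mtp and the ori time
# for a (k, cluster) pair have been seen. speedup = ori time / mtp time.
speedup_table() {
  awk '
    /^fat-tree-(mtp|ori)-k[0-9]+-c[0-9]+:/ {
      split($1, part, "-")          # fat, tree, mode, kN, cN:
      mode = part[3]
      k = substr(part[4], 2)        # drop the leading "k"
      c = substr(part[5], 2)        # drop the leading "c"
      sub(/:$/, "", c)              # drop the trailing ":"
      t[mode, k, c] = $2 + 0        # "+ 0" keeps the number, drops the "s"
      if ((("mtp", k, c) in t) && (("ori", k, c) in t))
        printf "%s %s %.3f\n", k, c, t["ori", k, c] / t["mtp", k, c]
    }'
}

# Example with two of the measurements listed above:
printf 'fat-tree-mtp-k2-c2: 46.0808s\nfat-tree-ori-k2-c2: 91.5243s\n' | speedup_table
```

Lines that do not match the naming pattern, including blank lines, are ignored, so the measurement list can be piped in unchanged.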

Local machine, hyper-threading off

fat-tree-mtp-k2-c2: 12.2736s

fat-tree-mtp-k2-c4: 13.3277s

fat-tree-mtp-k2-c8: 14.8488s

fat-tree-mtp-k2-c16: 19.2544s

fat-tree-ori-k2-c2: 24.6369s

fat-tree-ori-k2-c4: 39.1419s

fat-tree-ori-k2-c8: 43.6409s

fat-tree-ori-k2-c16: 54.4671s

k    cluster    speedup (ori time / mtp time)
2    2          2.00
2    4          2.937
2    8          2.940
2    16         2.83

Sugon, hyper-threading off

fat-tree-mtp-k2-c2: 47.7066s

fat-tree-mtp-k2-c4: 56.5749s

fat-tree-mtp-k2-c8: 58.9219s

fat-tree-mtp-k2-c16: 74.1024s

fat-tree-ori-k2-c2: 105.104s

fat-tree-ori-k2-c4: 162.615s

fat-tree-ori-k2-c8: 181.431s

fat-tree-ori-k2-c16: 229.165s

k    cluster    speedup (ori time / mtp time)
2    2          2.203
2    4          2.874
2    8          3.079
2    16         3.09

Large Tests

Configure command possibly wrong

Go to the ns-3-msccl repo.

In the script set, use fat-tree-mtp.sh and fat-tree.sh.

k flow_t result
4 1 4.72
8 1
12 1
16 1

This test corresponds to our actual goal.

However, I now suspect that its ./ns3 configure command is wrong. We can collect these results for now and rerun with a different configure command later.

Configure command correct

k flow_t result
4 1
8 1
12 1
16 1

After I understood it

By this point I had learned the principles behind UNISON.

So I will try to reproduce the experiments of the original UNISON paper on the Sugon machine with hyper-threading off.

In the unison branch

I followed exp.py in Unison-evaluations-for-mtp and wrote corresponding shell scripts (ori + mtp) in the unison branch.

test-mtp.sh

Bash
#!/bin/bash

mkdir -p ./fat-tree-data/ori-data
mkdir -p ./fat-tree-data/mtp-data

# Note: this time mtp and ori scripts are totally different.
# Differ: 1) configuration cmd 2) --thread
echo "Cleaning ns3 for safety..."
./ns3 clean
echo "Configuring ns3 (mtp mode)..."
./ns3 configure -d optimized --enable-modules applications,flow-monitor,mpi,mtp,nix-vector-routing,point-to-point --enable-mtp --enable-examples
echo "Building fat-tree (mtp mode)..."
./ns3 build fat-tree-mtp

for k in 8 16; do
  for c in 8 16; do
    cmd="./ns3 run \"fat-tree-mtp \
    --k=$k \
    --cluster=$c \
    --delay=3000 \
    --bandwidth=100Gbps \
    --flow=false \
    --incast=1 \
    --victim=$(seq -s- 0 $((k*k/4-1))) \
    --time=0.1 \
    --interval=0.01 \
    --flowmon=false \
    --thread=$c\" \
    2>&1 | tee \"./fat-tree-data/mtp-data/fat-tree-mtp-k${k}-c${c}.txt\""

    echo "Running test with k=$k, cluster=$c"
    eval $cmd
    echo "Completed test with k=$k, cluster=$c"

    sleep 2
  done
done

echo "All fat-tree-mtp tests completed!"
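The --victim value in the script comes from seq -s- 0 $((k*k/4-1)), which joins the indices 0 to k*k/4-1 with dashes. The sketch below isolates that expression and also prints the argument list one option per line, which is one way to assemble the command without eval. The helper names victim_list and fat_tree_args are made up for illustration; the option names and values are copied from test-mtp.sh, and GNU seq is assumed.

```shell
#!/bin/bash
# victim_list K: the same expression the script uses for --victim,
# i.e. the indices 0 .. K*K/4-1 joined by "-" (GNU seq assumed).
victim_list() {
  local k=$1
  seq -s- 0 $((k * k / 4 - 1))
}

# fat_tree_args K CLUSTER: print the program arguments for one run,
# one per line. Values are the ones used in test-mtp.sh.
fat_tree_args() {
  local k=$1 c=$2
  printf '%s\n' \
    "--k=$k" "--cluster=$c" "--delay=3000" "--bandwidth=100Gbps" \
    "--flow=false" "--incast=1" "--victim=$(victim_list "$k")" \
    "--time=0.1" "--interval=0.01" "--flowmon=false" "--thread=$c"
}

victim_list 4    # prints 0-1-2-3 with GNU seq
```

With these helpers a run could be started as ./ns3 run "fat-tree-mtp $(fat_tree_args "$k" "$c" | tr '\n' ' ')", which keeps the quoting in one place instead of building a string for eval.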

test-ori.sh

Bash
#!/bin/bash

mkdir -p ./fat-tree-data/ori-data
mkdir -p ./fat-tree-data/mtp-data

# Note: this time mtp and ori scripts are totally different.
# Differ: 1) configuration cmd 2) --thread
echo "Cleaning ns3 for safety..."
./ns3 clean
echo "Configuring ns3 (ori mode)..."
./ns3 configure -d optimized --enable-modules applications,flow-monitor,mpi,mtp,nix-vector-routing,point-to-point --enable-examples
echo "Building fat-tree (ori mode)..."
./ns3 build fat-tree-ori

for k in 8 16; do
  for c in 8 16; do
    cmd="./ns3 run \"fat-tree-ori \
    --k=$k \
    --cluster=$c \
    --delay=3000 \
    --bandwidth=100Gbps \
    --flow=false \
    --incast=1 \
    --victim=$(seq -s- 0 $((k*k/4-1))) \
    --time=0.1 \
    --interval=0.01 \
    --flowmon=false\" \
    2>&1 | tee \"./fat-tree-data/ori-data/fat-tree-ori-k${k}-c${c}.txt\""

    echo "Running test with k=$k, cluster=$c"
    eval $cmd
    echo "Completed test with k=$k, cluster=$c"

    sleep 2
  done
done

echo "All fat-tree-ori tests completed!"

  1. The mtp and ori variants must be kept distinct. They differ in two places:
    1. In fat-tree-mtp.cc and fat-tree-ori.cc: the mtp::enable(numTh) call.
    2. In the scripts: 1) the configure command and 2) the --thread argument.
  2. The corresponding code lives in UNISON-for-ns-3/tree/unison/src/mtp/examples, so --enable-examples must be added to the configure command.
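The script-level differences listed above can be captured in one place, so the two scripts cannot drift apart. This is a sketch only: configure_cmd and thread_arg are hypothetical helpers, and the flags are the ones already used in test-mtp.sh and test-ori.sh in the unison branch.

```shell
#!/bin/bash
# Module list copied from the two scripts.
MODULES="applications,flow-monitor,mpi,mtp,nix-vector-routing,point-to-point"

# configure_cmd MODE: print (do not run) the ./ns3 configure command.
#   MODE = mtp  adds --enable-mtp
#   MODE = ori  leaves it out
configure_cmd() {
  local mode=$1
  local cmd="./ns3 configure -d optimized --enable-modules $MODULES"
  case "$mode" in
    mtp) cmd="$cmd --enable-mtp" ;;
    ori) ;;
    *)   echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo "$cmd --enable-examples"
}

# thread_arg MODE CLUSTER: only the mtp run receives --thread.
thread_arg() {
  if [ "$1" = "mtp" ]; then
    echo "--thread=$2"
  fi
}

configure_cmd mtp
configure_cmd ori
```

Printing the command instead of executing it makes it easy to compare the two modes side by side before committing to a long rebuild.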

However, this is not exactly what exp.py does, because exp.py does not pass --enable-examples.

We need an exact reproduction, so switch to the Unison-evaluations-for-mtp branch.

We can still collect the data generated here, though :)

k_fat cluster result
8 8
8 16
16 8
16 16

In Unison-evaluations-for-mtp Branch

This commit is about it.

test-mtp.sh:

Bash
#!/bin/bash

mkdir -p ./fat-tree-data/ori-data
mkdir -p ./fat-tree-data/mtp-data

# Note: this time mtp and ori scripts are totally different.
# Differ: 1) configuration cmd 2) --thread
echo "Cleaning ns3 for safety..."
./ns3 clean
echo "Configuring ns3 (mtp mode)..."
./ns3 configure -d optimized --enable-modules applications,flow-monitor,mpi,mtp,nix-vector-routing,point-to-point --enable-mtp
echo "Building fat-tree (mtp mode)..."
./ns3 build fat-tree

for k in 8 16; do
  for c in 8 16; do
    cmd="./ns3 run \"fat-tree \
    --k=$k \
    --cluster=$c \
    --delay=3000 \
    --bandwidth=100Gbps \
    --flow=false \
    --incast=1 \
    --victim=$(seq -s- 0 $((k*k/4-1))) \
    --time=0.1 \
    --interval=0.01 \
    --flowmon=false \
    --thread=$c\" \
    2>&1 | tee \"./fat-tree-data/mtp-data/fat-tree-mtp-k${k}-c${c}.txt\""

    echo "Running test with k=$k, cluster=$c"
    eval $cmd
    echo "Completed test with k=$k, cluster=$c"

    sleep 2
  done
done

echo "All fat-tree (mtp mode) tests completed!"

test-ori.sh

Bash
#!/bin/bash

mkdir -p ./fat-tree-data/ori-data
mkdir -p ./fat-tree-data/mtp-data

# Note: this time mtp and ori scripts are totally different.
# Differ: 1) configuration cmd 2) --thread
echo "Cleaning ns3 for safety..."
./ns3 clean
echo "Configuring ns3 (ori mode)..."
./ns3 configure -d optimized --enable-modules applications,flow-monitor,mpi,mtp,nix-vector-routing,point-to-point
echo "Building fat-tree (ori mode)..."
./ns3 build fat-tree

for k in 8 16; do
  for c in 8 16; do
    cmd="./ns3 run \"fat-tree \
    --k=$k \
    --cluster=$c \
    --delay=3000 \
    --bandwidth=100Gbps \
    --flow=false \
    --incast=1 \
    --victim=$(seq -s- 0 $((k*k/4-1))) \
    --time=0.1 \
    --interval=0.01 \
    --flowmon=false\" \
    2>&1 | tee \"./fat-tree-data/ori-data/fat-tree-ori-k${k}-c${c}.txt\""

    echo "Running test with k=$k, cluster=$c"
    eval $cmd
    echo "Completed test with k=$k, cluster=$c"

    sleep 2
  done
done

echo "All fat-tree (ori mode) tests completed!"

The main change is that the --enable-examples argument is removed from the configure command. The build and run target is also fat-tree rather than fat-tree-mtp / fat-tree-ori.

One thing to note: the simulation code lives in UNISON-FOR-NS-3/scratch, e.g. scratch/fat-tree.cc (which is what we run here).

The top-level ./CMakeLists.txt shows how the scratch directory is pulled into the build:

Bash
# Build scratch/simulation scripts
add_subdirectory(scratch)

Note that the fat-tree topology here is completely different from the one in the unison branch!

Micro-benchmark

k_fat cluster result
8 8 1.23
8 16 1.59
16 8 0.71
16 16 1.46

Medium-benchmark

k_fat cluster result
8 24 1.54
8 48 X
8 72 X

Tips

1) Don't run scripts in different branches at the same time!

Experiments in unison and Unison-evaluations-for-mtp cannot run at the same time. Both are git branches of the same checkout, so they write to the same result files, which leads to strange errors.
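One way to enforce this rule is a simple lock that both scripts take before touching the build. The sketch below uses mkdir, which either creates the directory or fails, as the lock primitive. The lock path and the function names are assumptions for illustration, not something the UNISON scripts provide.

```shell
#!/bin/bash
# acquire_lock DIR: succeeds only if DIR did not exist yet, so at most one
# script can hold the lock at a time.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# release_lock DIR: remove the lock directory again.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Hypothetical usage at the top of test-mtp.sh and test-ori.sh:
#   LOCK=/tmp/ns3-experiment.lock      # assumed path, shared by all runs
#   acquire_lock "$LOCK" || { echo "another experiment is running" >&2; exit 1; }
#   trap 'release_lock "$LOCK"' EXIT
```

A separate checkout per branch (for example with git worktree) would be another option, at the cost of one full build directory per branch.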

2) Consider thread interaction!

We have to add a barrier in the scripts so that no run starts before configuration has finished. A simple way is to add sleep 20.
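A fixed sleep only guesses how long the previous step takes. As a complement, each stage can be made to stop the script when it fails, so that a run never starts on a half-configured build. This is a sketch under the assumption that the stages run sequentially inside one script; run_step is a hypothetical helper.

```shell
#!/bin/bash
# run_step NAME CMD...: run one stage, report it, and return non-zero if the
# command fails so the caller can abort before the next stage.
run_step() {
  local name=$1
  shift
  echo "[$name] starting"
  if ! "$@"; then
    echo "[$name] failed" >&2
    return 1
  fi
  echo "[$name] done"
}

# Hypothetical usage with commands from the scripts above:
#   run_step clean ./ns3 clean            || exit 1
#   run_step build ./ns3 build fat-tree   || exit 1
```

This does not replace the sleep when several scripts run in parallel from different sessions; it only guarantees ordering within one script.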

3) tmux sessions are per account, not per folder.

4) All experiments that feed the same figure should be run on the same hardware configuration.

If different figures require the same experiment, that experiment only needs to be run once.

MORE important issue!!!!!!

unison-mpi

The tests in the paper:

  • k=8, cluster=48
  • k=8, cluster=72
  • k=8, cluster=96

They all correspond to fat-tree-distributed.

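Prerequisites

Problem 1: The experiment requires coordinated operation across a distributed system.

Problem 2: The experiment needs a server with at least 144 cores; sugon currently has only 72.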
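The core-count requirement (144 needed, 72 available on sugon) can be checked up front so that a long job is not started on a machine that is too small. A minimal sketch: enough_cores is a made-up helper, and nproc from GNU coreutils is assumed when no explicit count is given.

```shell
#!/bin/bash
# enough_cores REQUIRED [AVAILABLE]: succeed if the machine has at least
# REQUIRED logical CPUs. AVAILABLE defaults to the output of nproc.
enough_cores() {
  local required=$1
  local available="${2:-$(nproc)}"
  [ "$available" -ge "$required" ]
}

# Numbers from the notes: 144 cores required, 72 on sugon.
if enough_cores 144 72; then
  echo "ok"
else
  echo "not enough cores: need 144, have 72"
fi
```

Passing the available count explicitly makes it possible to check a remote machine from its lscpu output without logging in to run nproc there.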

Bash
bxhu@sugon:~/UNISON-for-ns-3$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   72
  On-line CPU(s) list:    0-71