Install

To make training large language models easier, we developed the LMTuner system. Just follow the steps below to set up the environment on a server.

How to Install the Required Environment for LMTuner from Scratch on a Server



Install GPU Driver

          # preparing environment
          sudo apt-get install gcc
          sudo apt-get install make
          wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda_11.5.1_495.29.05_linux.run
          sudo sh cuda_11.5.1_495.29.05_linux.run
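After the driver installs, you can verify that its version meets the minimum listed in the system requirements (460.32.03). A minimal sketch, with the installed version hardcoded for illustration — in practice, read it from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
# Compare the installed driver version against the required minimum.
min_ver="460.32.03"
driver_ver="495.29.05"   # hardcoded example; obtain from nvidia-smi in practice
# sort -V orders version strings; if the minimum sorts first, the driver is new enough.
if [ "$(printf '%s\n%s\n' "$min_ver" "$driver_ver" | sort -V | head -n1)" = "$min_ver" ]; then
    echo "driver OK"
else
    echo "driver too old"
fi
```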
          
Install Conda and Python

            wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
            chmod +x Miniconda3-latest-Linux-x86_64.sh
            bash Miniconda3-latest-Linux-x86_64.sh

            conda create -n LMTuner python=3.9
            conda activate LMTuner
          
Install Python Libraries

            pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
            pip install tqdm transformers scikit-learn pandas numpy accelerate sentencepiece wandb SwissArmyTransformer jieba rouge_chinese datasets
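It is worth confirming that pip actually installed the CUDA 11.7 build of torch rather than a CPU-only wheel. A sketch, with the version string hardcoded for illustration — in practice, read it with `python -c 'import torch; print(torch.__version__)'`:

```shell
# Check that the installed torch wheel carries the +cu117 local build tag.
torch_ver="1.13.0+cu117"   # hardcoded example; query torch.__version__ in practice
case "$torch_ver" in
    *+cu117) build_ok=1; echo "torch CUDA 11.7 build OK" ;;
    *)       build_ok=0; echo "unexpected torch build: $torch_ver" ;;
esac
```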

          

If you have confirmed that a sufficiently recent GCC (> 5.0.0) is installed, you can continue by installing apex and DeepSpeed:
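Before running the apex build, the GCC version requirement can be checked with a quick script. A sketch, with the version hardcoded for illustration — in practice, read it from `gcc -dumpversion`:

```shell
# Check that GCC is strictly newer than 5.0.0 before building apex.
gcc_ver="9.4.0"   # hardcoded example; obtain with: gcc -dumpversion
min_gcc="5.0.0"
if [ "$(printf '%s\n%s\n' "$min_gcc" "$gcc_ver" | sort -V | head -n1)" = "$min_gcc" ] \
   && [ "$gcc_ver" != "$min_gcc" ]; then
    echo "GCC $gcc_ver OK"
else
    echo "GCC $gcc_ver too old"
fi
```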


            git clone https://github.com/NVIDIA/apex
            cd apex
            pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
            pip install deepspeed
          
Install LMTuner

            git clone https://github.com/WENGSYX/LMTuner
            cd LMTuner
            pip install .
          

System Requirements:

 
  1. Ubuntu 14.04+, Debian 8+, CentOS 6+, or Fedora 27+
  2. An NVIDIA GPU with driver version >= 460.32.03, or an AMD GPU with ROCm >= 4.0



Errors in apex installation:

 
  1. If prompted "Pytorch binaries were compiled with Cuda", the PyTorch build may not match the installed CUDA version. This does not prevent apex from installing: replace line 32 of apex's setup.py, "if (bare_metal_version != torch_binary_version):", with "if 0:"
  2. No module named 'packaging': pip install packaging
  3. ninja: build stopped: subcommand failed.: pip install ninja
  4. subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.: in torch/utils/cpp_extension.py, change ['ninja', '-v'] to ['ninja', '--version']
  5. g++: error: /home/wengyixuan/apex/build/temp.linux-x86_64-cpython-39/csrc/fused_dense.o: No such file or directory: the Python, PyTorch, and apex versions are likely incompatible. It is recommended to switch versions, e.g. python=3.8 with torch 1.7/1.8, or an older release of apex (newer apex releases may assume newer toolchains)
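Fix 1 above can be applied with a one-line sed. This is demonstrated on a stand-in file — run it against apex's setup.py in practice, and match on the line content rather than the line number, since the number may shift between apex versions:

```shell
# Create a stand-in file containing the line to patch (use apex/setup.py in practice).
printf 'if (bare_metal_version != torch_binary_version):\n' > setup_snippet.py
# Disable the CUDA/PyTorch version check by replacing the condition with "if 0:".
sed -i 's/if (bare_metal_version != torch_binary_version):/if 0:/' setup_snippet.py
cat setup_snippet.py
```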


Common Issues

 
  1. OOM (out of memory): Reduce batch size, use gradient accumulation or adjust model parameters.
  2. Gradient explosion/vanishing: Use gradient clipping, adjust learning rate properly.
  3. Overfitting: Add more regularization (e.g. L2 regularization, dropout), or use a larger dataset.
  4. Low GPU utilization: Check data loading and batch sizes, avoid small batches.
  5. Mixed precision training issues: Ensure mixed precision enabled, check numerical stability, use gradient scaling.
  6. Network communication bottleneck: Optimize data loading, use dedicated libs (like NCCL) and high-speed interconnects.
  7. Unable to use single-node multi-GPU: may be due to P2P transmission; set NCCL_P2P_DISABLE=1
  8. Learning rate is 0 during training: may indicate vanishing gradients; check the model and optimizer settings, and use Gaussian initialization for weights.
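For issue 7 above, the environment variable must be set in the shell that launches the training script, for example:

```shell
# Disable NCCL peer-to-peer transfers before launching training (issue 7).
export NCCL_P2P_DISABLE=1
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```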