Checking Your System Configuration
To begin, verify your system's configuration with these commands:
Check the NVIDIA driver version and the kernel it was built against:
cat /proc/driver/nvidia/version
View installed NVIDIA packages:
cat /var/log/dpkg.log | grep nvidia
List all NVIDIA drivers installed:
sudo dpkg --list | grep nvidia
Problem 1: NVML Initialization Error in Docker
When running nvidia-smi inside a Docker container, you encounter:
Failed to initialize NVML: Unknown Error
Solution: When starting the Docker container, include the --gpus all flag:
sudo docker run --gpus all -it ubuntu18_torch1.6:v0.3
Problem 2: CUDA Not Detected in PyTorch
After installing nvidia-docker, nvidia-driver, CUDA, cuDNN, and PyTorch with CUDA support, torch.cuda.is_available() returns False inside Docker.
Solution: Add environment variables when running the container:
sudo docker run --gpus all -it -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all ubuntu18_torch1.6:v0.3
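To confirm the fix from inside the container, a short diagnostic can be run. This is a sketch (the function name is illustrative); it degrades gracefully when PyTorch or a GPU is absent:

```python
def cuda_report():
    """Return a small dict describing CUDA visibility from PyTorch's
    point of view; safe to run even without PyTorch or a GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    info = {"torch_installed": True,
            "cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        # Only query the device name when CUDA is actually visible.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info

print(cuda_report())
```

If `cuda_available` is still False after adding the environment variables, the problem usually lies with the driver or the image's CUDA build rather than the container flags.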
Problem 3: PyTorch Not Compiled with GPU Support
When running PyTorch code in PyCharm, you get: RuntimeError: Not compiled with GPU support.
Solution: Delete the entire build folder in the benchmark directory, then recompile the package:
python setup.py build develop
After successful compilation, save the container as a new Docker image (-a records the author, -m an optional commit message):
sudo docker commit -a "author" -m "message" container_id image_name:image_tag
Then configure PyCharm to use the new Docker image.
Problem 4: Docker Daemon Connection Error in PyCharm
When setting up Docker in PyCharm 2020.3 (Settings > Build, Execution, Deployment > Docker), you see: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Solution: Give your user ownership of the Docker socket:
sudo chown your-username /var/run/docker.sock
Note that this can be undone when the daemon restarts; the persistent fix is to add your user to the docker group (sudo usermod -aG docker your-username) and log in again.
Problem 5: Tensor Type ID Error
When running project code in Docker, you get: RuntimeError: Unrecognized tensor type ID: AutogradCUDA.
Cause: The project was compiled with PyTorch 1.6 + torchvision 0.7, but then updated to PyTorch 1.7 + torchvision 0.8.
Solution: Recompile the project with the current versions:
python setup.py build develop
Problem 6: PyTorch Installation Timeout
When upgrading PyTorch in Docker with pip install torch-1.7.1-*.whl, you encounter timeout errors while pip downloads the wheel's dependencies.
Solution: Install with the --no-deps flag so pip skips dependency resolution:
pip install --no-deps torch-1.7.1-*.whl
Problem 7: Port Conflict in Multi-GPU Training
When configuring NUM_WORKERS=2 for multi-GPU training, you get:
RuntimeError: Address already in use
Cause: the TCP port used for inter-process communication in distributed training is already occupied.
Solution 1: Specify a custom port when running the program:
python train.py --master_port 29501
Solution 2: Find the process occupying the port, then terminate it (substitute the PID reported by netstat):
netstat -nltp
kill -9 PID
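Instead of hunting for a free port by hand, you can ask the OS for one before launching and pass the result to --master_port. A minimal stdlib sketch (a small race between picking the port and the trainer binding it remains possible):

```python
import socket

def find_free_port():
    """Bind to port 0 so the kernel picks any unused TCP port,
    then report which port it chose."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(port)  # pass this value as --master_port when launching training
```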
Problem 8: Type Comparison Error
The following code is wrong because == compares the tensor's value to the class object itself, rather than checking its type:
if box1 == torch.Tensor:
    box1 = box1.cpu().numpy()
Correction: Compare the object's type instead (isinstance(box1, torch.Tensor) is the more idiomatic form):
if type(box1) == torch.Tensor:
    box1 = box1.cpu().numpy()
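The underlying pitfall is generic Python, not PyTorch-specific: comparing an instance to a class with == never performs a type check. A quick stdlib illustration, using list in place of torch.Tensor:

```python
x = [1, 2, 3]

# The original bug: comparing an instance's value to a class object.
print(x == list)            # False -- never a type check

# The correction above: comparing the object's type to the class.
print(type(x) == list)      # True

# The idiomatic form, which also accepts subclasses.
print(isinstance(x, list))  # True
```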
Problem 9: CUDA to NumPy Conversion Error
When encountering: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Solution: Explicitly convert tensors to CPU before NumPy conversion:
if type(box1) == torch.Tensor:
    box1 = box1.cpu().numpy()
if type(box2) == torch.Tensor:
    box2 = box2.cpu().numpy()
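When this pattern repeats for many variables, it can be centralized in a small duck-typed helper. This is a hypothetical sketch, not part of the original project; it assumes only that torch tensors expose the documented .cpu() and .numpy() methods, so it needs no torch import itself:

```python
def to_numpy(x):
    """Convert a (possibly CUDA-resident) torch.Tensor to a NumPy array;
    pass any other object through unchanged."""
    if hasattr(x, "cpu"):      # torch.Tensor: copy to host memory first
        x = x.cpu()
    if hasattr(x, "numpy"):    # CPU tensor -> ndarray
        x = x.numpy()
    return x
```

Usage then becomes box1 = to_numpy(box1), and the helper is a no-op for inputs that are already plain arrays or lists.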
Problem 10: Multi-GPU Training Convergence Issues
Single GPU training converges normally, but multi-GPU training fails to converge.
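The original gives no fix here, but one frequent cause is that the effective (global) batch size grows with the number of GPUs while the learning rate stays tuned for one GPU. A sketch of the linear scaling rule heuristic (function name and parameters are illustrative, and a warmup schedule is usually paired with it):

```python
def scale_lr(base_lr, num_gpus, per_gpu_batch, base_batch):
    """Scale a single-GPU learning rate in proportion to the
    effective batch size (linear scaling rule heuristic)."""
    effective_batch = num_gpus * per_gpu_batch
    return base_lr * effective_batch / base_batch

# lr tuned as 0.01 at batch 16 on one GPU; two GPUs keep per-GPU batch 16.
print(scale_lr(0.01, num_gpus=2, per_gpu_batch=16, base_batch=16))  # 0.02
```

This is only a starting point; convergence on multiple GPUs can also depend on gradient synchronization, batch-norm statistics, and data sharding.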
For deeper understanding of NVIDIA Docker CUDA containerization principles, refer to: