Regression Algorithms: A Practical Guide to XGBoost, LightGBM, SVR, and Random Forest

LightGBM Parameters

Official documentation:

  • English: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor
  • Chinese: https://lightgbm.cn/docs/6/

The LGBMRegressor constructor accepts the following parameters:

lightgbm.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=None, importance_type='split', **kwargs)

Recommended parameter ranges:

  • Learning rate (learning_rate): [0.01, 0.15]
  • Maximum depth (max_depth): [3, 25]
  • Feature fraction (colsample_bytree): [0.5, 1]
  • Bagging fraction (subsample): [0.5, 1]
  • lambda_l1 (reg_alpha): try 0, values in 0.01-0.1, and 1
  • lambda_l2 (reg_lambda): try 0, 0.1, 0.5, 1
  • min_gain_to_split (min_split_gain): try 0, values in 0.05-0.1, then 0.3, 0.5, 0.7, 0.9, 1
  • min_sum_hessian_in_leaf (min_child_weight): try 1, 3, 5, 7
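
As a concrete starting point, the sketch below instantiates an LGBMRegressor with values picked from these ranges. The synthetic data and the specific parameter values are illustrative assumptions, not tuned results.

# A minimal sketch: fitting an LGBMRegressor with values picked from the
# ranges above. The data and the exact values are placeholders.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data (stand-in for a real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LGBMRegressor(
    learning_rate=0.05,      # from [0.01, 0.15]
    max_depth=8,             # from [3, 25]
    colsample_bytree=0.8,    # feature fraction, from [0.5, 1]
    subsample=0.8,           # bagging fraction, from [0.5, 1]
    subsample_freq=1,        # subsample only takes effect when freq > 0
    reg_alpha=0.1,           # lambda_l1 candidate
    reg_lambda=0.5,          # lambda_l2 candidate
    min_split_gain=0.1,      # min_gain_to_split candidate
    min_child_weight=3,      # min_sum_hessian_in_leaf candidate
    n_estimators=200,
    random_state=42,
)
model.fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))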

XGBoost

Official documentation: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

Advantages

  • Regularization
  • Parallel processing
  • Customizable optimization objectives and evaluation metrics
  • Built-in handling of missing values
  • Built-in tree pruning (trees are grown to max_depth, then splits whose gain falls below gamma are pruned back)

Key Parameters

  • learning_rate: Shrinks the contribution of each new tree, making the model more robust. Typical values: 0.01-0.2.
  • min_child_weight: Minimum sum of instance weight needed in a child. Prevents overfitting. Default: 1.
  • max_depth: Maximum depth of a tree. Prevents overfitting. Typical values: 3-10. Default: 6.
  • gamma: Minimum loss reduction required to make a further partition. Default: 0.
  • subsample: Subsample ratio of the training instances. Prevents overfitting. Typical values: 0.5-1. Default: 1.
  • colsample_bytree: Subsample ratio of columns when constructing each tree. Typical values: 0.5-1. Default: 1.
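
A minimal sketch wiring these parameters into an XGBRegressor; the values are illustrative picks from the typical ranges above, and the synthetic data is a placeholder.

# A minimal sketch: XGBRegressor configured with the key parameters above.
# Values are illustrative picks from the typical ranges, not tuned results.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = XGBRegressor(
    learning_rate=0.1,       # typical range 0.01-0.2
    min_child_weight=1,      # default
    max_depth=6,             # typical range 3-10
    gamma=0,                 # minimum loss reduction for a further split
    subsample=0.8,           # typical range 0.5-1
    colsample_bytree=0.8,    # typical range 0.5-1
    n_estimators=100,
)
model.fit(X, y)
print(model.predict(X[:5]))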

Visualization

import timeit

import matplotlib.pyplot as plt
import xgboost as xgb

# Feature importance plot (assumes a trained model named `model`)
xgb.plot_importance(model)
plt.show()

# Tree visualization (requires the graphviz package)
xgb.plot_tree(model, num_trees=2)
plt.show()

# Timing execution
start = timeit.default_timer()
# Code to be timed
end = timeit.default_timer()
print(f"Execution time: {end - start} seconds")

Support Vector Regression (SVR) Parameters

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

When optimizing hyperparameters for SVR, it is often more effective to tune C and epsilon than C and gamma; in practice, the former pair tends to yield larger performance gains.
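
A minimal sketch of that strategy, using scikit-learn's GridSearchCV over C and epsilon; the grid values and synthetic data are illustrative assumptions.

# A minimal sketch: grid search over C and epsilon for an RBF-kernel SVR.
# The grid values are illustrative, not recommendations.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# SVR is sensitive to feature scale, so standardize first
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
param_grid = {
    'svr__C': [0.1, 1, 10, 100],
    'svr__epsilon': [0.01, 0.1, 0.5, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print("Best parameters:", search.best_params_)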

Random Forest

Key Parameters

  • max_features: Increasing this value generally makes each individual tree stronger but reduces diversity between trees, so the net effect on the forest should be validated.
  • n_estimators: Number of trees in the forest. Higher values usually improve performance at the cost of slower training and prediction.
  • min_samples_leaf: An important parameter. For small datasets try values in 1-50; for large datasets, around 200-300.
  • min_samples_split: Typical range: 2-30.

Random Forest is relatively robust to parameter changes, often producing good results even with default parameters.
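
A minimal sketch showing these parameters in scikit-learn's RandomForestRegressor; the values are illustrative picks, not tuned results.

# A minimal sketch: RandomForestRegressor with the key parameters above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

model = RandomForestRegressor(
    n_estimators=300,        # more trees: better but slower
    max_features='sqrt',     # features considered per split
    min_samples_leaf=5,      # small dataset, so a small value
    min_samples_split=10,    # typical range 2-30
    n_jobs=-1,
    random_state=7,
)
model.fit(X, y)
print("Feature importances:", model.feature_importances_.round(3))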

Bayesian Optimization with Tree-structured Parzen Estimator

Resources:

  • https://optunity.readthedocs.io/en/latest/user/solvers/TPE.html#hyperopt
  • https://github.com/WillKoehrsen/hyperparameter_optimization/blob/master/Introduction%20to%20Bayesian%20Optimization%20with%20Hyperopt.ipynb

Parameter Space Definition Functions

  • hp.pchoice(label, p_options): Returns one of the options, where p_options is a list of (probability, option) pairs.
  • hp.uniform(label, low, high): Uniform distribution between low and high.
  • hp.quniform(label, low, high, q): Returns round(uniform(low, high) / q) * q, i.e. a uniform draw quantized to multiples of q.
  • hp.loguniform(label, low, high): Value whose log is uniformly distributed.
  • hp.randint(label, upper): Random integer in [0, upper).
  • hp.normal(label, mu, sigma): Normal distribution with mean mu and standard deviation sigma.
  • hp.qnormal(label, mu, sigma, q): Quantized normal distribution.
  • hp.lognormal(label, mu, sigma): Log-normal distribution.
  • hp.qlognormal(label, mu, sigma, q): Quantized log-normal distribution.
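
For example, a search space over a few of the LightGBM-style parameters from earlier could be defined as follows; the parameter names and bounds are illustrative assumptions.

# A minimal sketch: defining a hyperopt search space with the functions above.
import numpy as np
from hyperopt import hp

space = {
    # log is uniform over [log(0.01), log(0.15)], so values span 0.01-0.15
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.15)),
    'max_depth': hp.quniform('max_depth', 3, 25, 1),   # integers 3..25 (as floats)
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'boosting_type': hp.pchoice('boosting_type',
                                [(0.7, 'gbdt'), (0.3, 'dart')]),
}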

Optimization Algorithms

  • Random search (hyperopt.rand.suggest)
  • Simulated annealing (hyperopt.anneal.suggest)
  • TPE algorithm (hyperopt.tpe.suggest)
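
Putting it together, a minimal optimization loop with the TPE algorithm might look like the sketch below; the objective here is a placeholder that would normally train a model and return a validation loss.

# A minimal sketch: minimizing a toy objective with hyperopt's TPE algorithm.
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Placeholder loss; replace with cross-validated model error
    return (params['x'] - 3) ** 2

trials = Trials()
best = fmin(
    fn=objective,
    space={'x': hp.uniform('x', -10, 10)},
    algo=tpe.suggest,       # swap in rand.suggest or anneal.suggest to compare
    max_evals=100,
    trials=trials,
)
print("Best:", best)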

One-Hot Encoding Example

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Example input: string-valued categorical features
X = np.array([['red', 'small'],
              ['blue', 'large'],
              ['red', 'large'],
              ['green', 'small']])

# Encode each string column as integers, then one-hot encode it
encoded_features = None
for i in range(X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    # Note: scikit-learn >= 1.2 uses sparse_output=False; older versions
    # use sparse=False instead
    encoder = OneHotEncoder(sparse_output=False)
    feature = encoder.fit_transform(feature)

    if encoded_features is None:
        encoded_features = feature
    else:
        encoded_features = np.concatenate((encoded_features, feature), axis=1)

print(f"Encoded features shape: {encoded_features.shape}")

Data Loading Example

import pandas as pd

# Load data from URL
dataset = pd.read_csv('https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv')
print(dataset.head())
