1. Background
1.1 Problem Description
During performance testing, the cluster showed intermittent performance drops and write failures, with latencies exceeding 100 seconds.
1.2 Root Cause Identification
The RGW logs showed that the system had triggered automatic bucket resharding because a single bucket held a large number of objects.
[root@node113 ~]# cat /var/log/ceph/ceph-client.rgw.node113.7480.log | grep reshard
2020-09-16 04:51:50.239505 7fe71d0a7700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000009 ret=-16
2020-09-16 06:11:56.304955 7fe71d0a7700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000013 ret=-16
...
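On Linux, ret=-16 corresponds to -EBUSY, which suggests the reshard lock was already held when the failing thread tried to take it. To see which reshard log objects are contended, the lock failures can be grouped per object. The following is a minimal sketch; the function name count_reshard_lock_failures is illustrative and not part of Ceph, and it assumes the log line layout shown above.

```shell
# Count "failed to acquire lock" messages per reshard log object.
# Reads the files given as arguments, or standard input if none are given.
count_reshard_lock_failures() {
    awk '/RGWReshardLock::lock failed to acquire lock/ {
             # The object name is the field that follows the word "on".
             for (i = 1; i < NF; i++)
                 if ($i == "on") count[$(i + 1)]++
         }
         END { for (name in count) print name, count[name] }' "$@" | sort
}

# Sample input: the two log lines from the excerpt above.
count_reshard_lock_failures <<'EOF'
2020-09-16 04:51:50.239505 7fe71d0a7700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000009 ret=-16
2020-09-16 06:11:56.304955 7fe71d0a7700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000013 ret=-16
EOF
# prints:
# reshard.0000000009 1
# reshard.0000000013 1
```

Against a live node, the same function can be pointed at the RGW log file, for example count_reshard_lock_failures /var/log/ceph/ceph-client.rgw.node113.7480.log.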
The RGW configuration showed a limit of 100,000 objects per shard and a default of 8 index shards per bucket, so resharding was triggered once the bucket exceeded 8 × 100,000 = 800,000 objects.
[root@node111 ~]# ceph --show-config | grep rgw_dynamic_resharding
rgw_dynamic_resharding = true
[root@node111 ~]# ceph --show-config | grep rgw_max_objs_per_shard
rgw_max_objs_per_shard = 100000
[root@node111 ~]# ceph --show-config | grep rgw_override_bucket_index_max_shards
rgw_override_bucket_index_max_shards = 8
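The 800,000-object trigger point follows directly from these two values. As a rough sketch (the helper name reshard_threshold is hypothetical, and the calculation assumes the threshold is simply the shard count multiplied by the per-shard limit, as described above):

```shell
# Object count above which dynamic resharding is triggered:
# number of index shards ($1) times rgw_max_objs_per_shard ($2).
reshard_threshold() {
    echo $(( $1 * $2 ))
}

# Values taken from the configuration output above.
reshard_threshold 8 100000   # prints 800000
```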
1.3 Key Configuration Parameters
- rgw_dynamic_resharding: Enables automatic bucket resharding when the number of objects per index shard exceeds rgw_max_objs_per_shard.
- rgw_override_bucket_index_max_shards: Number of index shards for newly created buckets. The default is 0 (no sharding); the maximum is 7877.
- rgw_max_objs_per_shard: Maximum number of objects per shard (default: 100,000).
- rgw_reshard_thread_interval: Interval at which the resharding thread scans for buckets to reshard (default: 10 minutes).
1.4 Bucket Sharding Overview
RGW maintains an index for each bucket containing metadata of all objects. A single index object can become a bottleneck for performance and reliability, especially under high write loads.
Bucket Sharding
Starting with the Hammer release, Ceph supports bucket sharding, which distributes index data across multiple RADOS objects and so allows a bucket to hold more objects. The shard count must be configured before the bucket is created.
Dynamic Bucket Sharding
Introduced in the Luminous release, this feature reshards a bucket automatically as its object count grows. However, resharding locks the bucket for writes, which can cause performance drops and write failures while it runs.
2. Solutions
2.1 When Object Count Is Predictable
Disable dynamic resharding and pre-calculate the number of shards from the expected object count (expected objects divided by rgw_max_objs_per_shard, rounded up).
[root@node45 ~]# cat /etc/ceph/ceph.conf
[global]
rgw_dynamic_resharding = false
rgw_override_bucket_index_max_shards = 30
Restart the RGW service after updating the configuration:
[root@node45 ~]# systemctl restart ceph-radosgw.target
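The shard count can be derived by ceiling division of the expected object count by the per-shard limit. The sketch below is illustrative only: shards_needed is a hypothetical helper, and the figure of 3,000,000 expected objects is an assumption chosen to match the value of 30 in the example configuration, not a number stated in the test report.

```shell
# Smallest shard count that keeps every shard at or below the per-shard limit.
# $1 = expected object count, $2 = rgw_max_objs_per_shard.
shards_needed() {
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

shards_needed 3000000 100000   # prints 30 (assumed expected object count)
shards_needed 3000001 100000   # prints 31 (one extra object needs another shard)
```

It may be worth adding some headroom on top of the computed value, since with dynamic resharding disabled the shard count no longer grows on its own.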
2.2 When Object Count Is Unpredictable
Keep dynamic resharding enabled and set a reasonable default shard count:
[root@node45 ~]# cat /etc/ceph/ceph.conf
[global]
rgw_override_bucket_index_max_shards = 8
Restart the RGW service:
[root@node45 ~]# systemctl restart ceph-radosgw.target