Introduction to GlusterFS
GlusterFS is an open-source distributed file system composed of storage servers, clients, and optional NFS/Samba storage gateways. Unlike traditional distributed file systems that rely on metadata servers, GlusterFS operates without a metadata server component, enhancing performance, reliability, and stability.
Traditional distributed file systems typically use metadata servers to store directory information and structure. This makes directory browsing efficient, but it introduces a single point of failure: if the metadata server fails, the entire storage system becomes unavailable regardless of how much redundancy the storage nodes have. By dispensing with the metadata server, GlusterFS provides good horizontal scalability, high reliability, and efficient storage.
GlusterFS serves as the core of Gluster, a scale-out storage solution, offering powerful horizontal scalability for data storage. It can scale to support petabyte-level storage capacity and handle thousands of clients. GlusterFS aggregates physically distributed storage resources using TCP/IP or InfiniBand RDMA networks, providing unified storage services through a global namespace.
Key Features
- Scalability and High Performance
  - Scale-out architecture allows increasing storage capacity and performance by simply adding storage nodes (disk, computing, and I/O resources can be independently scaled). Supports high-speed network interconnects like 10GbE and InfiniBand.
  - Gluster's Elastic Hash eliminates dependency on metadata servers, addressing single points of failure and performance bottlenecks, enabling truly parallel data access. The algorithm intelligently locates any data shard across the storage pool without requiring index lookups or metadata server queries.
- High Availability
  - GlusterFS automatically replicates files through mirroring or multiple copies, ensuring data accessibility even during hardware failures.
  - The self-healing feature restores data to a consistent state when inconsistencies occur. Data repair is performed incrementally in the background with minimal performance impact.
  - Supports all storage types by using standard operating system file systems (EXT3, XFS, etc.) rather than proprietary formats. Data can be accessed using traditional disk access methods.
- Global Unified Namespace
  - Integrates all node namespaces into a unified namespace, combining all node storage capacities into a large virtual storage pool accessible to frontend hosts for data read/write operations.
- Elastic Volume Management
  - Data is stored in logical volumes independently partitioned from a logical storage pool. The logical storage pool can be expanded or removed online without service interruption. Logical volumes can be grown or shrunk online with load balancing across multiple nodes. Filesystem configurations can be modified and applied in real-time to adapt to workload changes or performance tuning.
- Standard Protocol Support
  - Gluster storage services support NFS, CIFS, HTTP, FTP, SMB, and Gluster native protocols, fully compatible with POSIX standards. Existing applications can access Gluster data without modifications, and dedicated APIs are also available.
Terminology
- Brick: The basic storage unit in GlusterFS: a dedicated partition (export directory) that a server in the trusted pool provides for physical storage. Format: SERVER:EXPORT (e.g., 192.168.232.10:/data/mydir/).
- Volume: A collection of Bricks. A logical device for data storage, similar to LVM logical volumes. Most Gluster management operations are performed on volumes.
- FUSE: Filesystem in Userspace, a kernel module that allows users to create their own file systems without modifying kernel code.
- VFS: Virtual File System, the kernel layer that gives user-space programs a uniform interface for accessing different file systems.
- Glusterd: The background management process that runs on each node in the storage cluster.
- Stripe: A data distribution technique that splits files into fixed-size blocks and stores them sequentially across multiple bricks. The stripe size is configurable through the cluster.stripe-block-size volume option, with a default of 128 KB.
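The SERVER:EXPORT brick format can be illustrated with a small parsing sketch. `parse_brick` is a hypothetical helper written for this example, not a gluster command, and it assumes the server part contains no colon (so it would not handle a bare IPv6 address).

```shell
#!/bin/bash
# Split a brick spec of the form SERVER:EXPORT into its two parts.
# parse_brick is a hypothetical helper, not part of the gluster CLI.
parse_brick() {
    local spec="$1"
    # Require a colon followed by an absolute path.
    case "$spec" in
        *:/*) ;;
        *) echo "invalid brick spec: $spec" >&2; return 1 ;;
    esac
    local server="${spec%%:*}"     # text before the first colon
    local export_dir="${spec#*:}"  # text after the first colon
    echo "server=$server export=$export_dir"
}

parse_brick "192.168.232.10:/data/mydir/"
```

A check like this can catch a malformed brick argument before it is passed to `gluster volume create`.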
Workflow
- Clients or applications access data through GlusterFS mount points.
- The Linux kernel receives and processes requests via the VFS API.
- VFS hands the request to the FUSE kernel module, with which the GlusterFS client has registered a file system. FUSE, acting as a proxy, delivers the data to the user-space GlusterFS client through the /dev/fuse device file.
- The GlusterFS client processes data according to configuration settings.
- Processed data is transmitted over the network to remote GlusterFS servers and written to server storage devices.
Volume Types
Distributed Volume
Files are distributed across all brick servers using a hash algorithm. This is GlusterFS's default volume type. Each file is stored whole on a single brick chosen by the hash, so the volume expands disk space without adding redundancy. It is comparable to file-level RAID0: there is no fault tolerance, but files are not split. Access performance does not improve and may even decrease because of network communication overhead.
Characteristics:
- Files distributed across different servers without redundancy.
- Easier and cheaper to expand volume size.
- A single brick or server failure causes loss of the files stored on it.
- Dependent on underlying data protection.
Example:
#!/bin/bash
# Create a distributed volume named 'dist-vol'
# Files will be distributed based on HASH across:
# server1:/storage/dir1, server2:/storage/dir2, server3:/storage/dir3
gluster volume create dist-vol server1:/storage/dir1 server2:/storage/dir2 server3:/storage/dir3 force
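To make the idea of hash-based placement concrete, here is a deliberately simplified sketch. It is not GlusterFS's actual elastic hash, which assigns hash ranges to bricks per directory. It only shows that a file's location can be computed from its name alone, with no lookup table. `pick_brick` is a hypothetical helper, and `cksum` modulo the brick count is an assumption chosen for illustration.

```shell
#!/bin/bash
# Toy placement: checksum the file name and take it modulo the brick count.
# This is NOT the real GlusterFS algorithm; it only illustrates that the
# target brick is derived from the name without consulting any index.
pick_brick() {
    local name="$1"; shift          # remaining arguments are the bricks
    local sum
    sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    shift $(( sum % $# ))           # skip to the selected brick
    echo "$1"
}

for f in a.log b.log c.log; do
    echo "$f -> $(pick_brick "$f" server1:/storage/dir1 server2:/storage/dir2 server3:/storage/dir3)"
done
```

The same name always maps to the same brick, which is why any client can find a file without asking a metadata server.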
Stripe Volume
Similar to RAID0: files are split into fixed-size blocks (stripes) that are stored across multiple brick servers in round-robin fashion. Because storage is block-based, large files can be read with higher throughput, but there is no redundancy. The stripe size is configurable and defaults to 128 KB. Striped volumes were deprecated and have been removed from recent GlusterFS releases, so the stripe commands in this article apply only to older versions.
Characteristics:
- Data is split into smaller blocks that are distributed across the bricks of the stripe set.
- Spreading the blocks reduces the load on each server, and the smaller block size speeds up access.
- No data redundancy.
Example:
#!/bin/bash
# Create a stripe volume named 'stripe-vol' with 2 stripes
# Files will be split and stored across:
# server1:/storage/dir1 and server2:/storage/dir2
gluster volume create stripe-vol stripe 2 transport tcp server1:/storage/dir1 server2:/storage/dir2 force
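The round-robin placement described above comes down to simple arithmetic: block number = offset / stripe size, and brick index = block number mod stripe count. The sketch below works through that calculation. The 128 KiB stripe size and the `brick_for_offset` helper are assumptions made for illustration.

```shell
#!/bin/bash
# Round-robin striping arithmetic (illustrative values, not read from a volume).
STRIPE_SIZE=$((128 * 1024))   # bytes per stripe block (assumed)
STRIPE_COUNT=2                # number of bricks in the stripe set

# Print the zero-based index of the brick holding the given byte offset.
brick_for_offset() {
    local offset="$1"
    local block=$(( offset / STRIPE_SIZE ))
    echo $(( block % STRIPE_COUNT ))
}

for offset in 0 131072 262144 393216; do
    echo "offset $offset -> brick index $(brick_for_offset "$offset")"
done
```

With two bricks, consecutive blocks alternate between index 0 and index 1, which is why each brick ends up holding about half of a large file.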
Replica Volume
Files are synchronized across multiple Bricks, creating multiple file copies equivalent to file-level RAID1 with fault tolerance. Data distribution across multiple Bricks significantly improves read performance but decreases write performance. Replica volumes provide redundancy, ensuring normal data usage even if one node fails. However, disk utilization is lower due to replica storage.
Characteristics:
- Every brick in the volume holds a complete copy of the data.
- The replica count is set when the volume is created; for a pure replica volume it must equal the number of bricks.
- Requires at least two brick servers.
- Provides redundancy.
Example:
#!/bin/bash
# Create a replica volume named 'rep-vol' with 2 replicas
# Files will be stored as copies across:
# server1:/storage/dir1 and server2:/storage/dir2
gluster volume create rep-vol replica 2 transport tcp server1:/storage/dir1 server2:/storage/dir2 force
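The lower disk utilization of replica volumes can be quantified: usable capacity is roughly the raw capacity divided by the replica count. A minimal sketch, assuming equally sized bricks; the 100 GB brick size and the `usable_capacity_gb` helper are hypothetical.

```shell
#!/bin/bash
# Usable capacity of a (distributed) replica volume with equal-size bricks.
# Arguments: number of bricks, size of each brick in GB, replica count.
usable_capacity_gb() {
    local bricks="$1" brick_gb="$2" replica="$3"
    if [ $(( bricks % replica )) -ne 0 ]; then
        echo "brick count must be a multiple of the replica count" >&2
        return 1
    fi
    echo $(( bricks * brick_gb / replica ))
}

echo "2 bricks, replica 2: $(usable_capacity_gb 2 100 2) GB usable"
echo "4 bricks, replica 2: $(usable_capacity_gb 4 100 2) GB usable"
```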
Distributed Stripe Volume
The number of bricks must be a multiple of the stripe count (the number of bricks across which each file's blocks are spread). This type combines the features of distributed and stripe volumes and is used primarily for large file access. It requires at least 4 bricks.
Example layout:
File1 and File2 are placed on Server1 and Server2 respectively by the distributed volume functionality. On Server1, File1 is split into 4 segments, with segments 1 and 3 in Server1's exp1 directory and segments 2 and 4 in Server1's exp2 directory. On Server2, File2 is also split into 4 segments, with segments 1 and 3 in Server2's exp3 directory and segments 2 and 4 in Server2's exp4 directory.
Example:
#!/bin/bash
# Create a distributed stripe volume named 'dist-stripe-vol'
# When creating distributed stripe volumes, the number of storage servers in the
# volume's Bricks must be a multiple of the stripe count (≥2x)
# With 4 Bricks (server1:/storage/dir1, server2:/storage/dir2, server3:/storage/dir3,
# server4:/storage/dir4) and stripe count of 2
gluster volume create dist-stripe-vol stripe 2 transport tcp server1:/storage/dir1 server2:/storage/dir2 server3:/storage/dir3 server4:/storage/dir4 force
When creating volumes, if the number of bricks equals the stripe or replica count, a stripe or replica volume is created. If the number of bricks is a multiple (2x or more) of the stripe or replica count, a distributed stripe or distributed replica volume is created.
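The rule above can be written as a small decision function. This is only a sketch of the rule as stated in the text; `classify_volume` is a hypothetical helper and does not call gluster.

```shell
#!/bin/bash
# Decide which volume type results from a stripe/replica count and a brick count.
# Arguments: kind ("stripe" or "replica"), count, number of bricks.
classify_volume() {
    local kind="$1" count="$2" bricks="$3"
    if [ "$count" -lt 2 ] || [ "$bricks" -lt "$count" ] || [ $(( bricks % count )) -ne 0 ]; then
        echo "invalid: brick count must be a multiple of the $kind count" >&2
        return 1
    fi
    if [ "$bricks" -eq "$count" ]; then
        echo "$kind volume"
    else
        echo "distributed $kind volume"
    fi
}

classify_volume stripe 2 2
classify_volume replica 2 4
```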
Distributed Replica Volume
The number of Brick Servers is a multiple of the replica count (number of data copies). Combines features of distributed and replica volumes, primarily for scenarios requiring redundancy.
Example layout:
File1 and File2 are placed in different replica sets by the distributed volume functionality. File1 is stored as two identical copies, one in Server1's exp1 directory and another in Server2's exp2 directory. File2 is also stored as two identical copies, one in Server3's exp3 directory and another in Server4's exp4 directory.
Example:
#!/bin/bash
# Create a distributed replica volume named 'dist-replica-vol'
# When creating distributed replica volumes, the number of storage servers in the
# volume's Bricks must be a multiple of the replica count (≥2x)
# With 4 Bricks (server1:/storage/dir1, server2:/storage/dir2, server3:/storage/dir3,
# server4:/storage/dir4) and replica count of 2
gluster volume create dist-replica-vol replica 2 transport tcp server1:/storage/dir1 server2:/storage/dir2 server3:/storage/dir3 server4:/storage/dir4 force
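In a distributed replica volume, the order of the bricks on the command line matters: consecutive bricks, taken in groups of the replica count, form the replica sets. The sketch below prints that grouping for the four bricks used above; `replica_sets` is a hypothetical helper for illustration.

```shell
#!/bin/bash
# Group bricks into replica sets in command-line order.
# Arguments: replica count, followed by the bricks.
replica_sets() {
    local replica="$1"; shift
    local n=1 members i
    while [ $# -gt 0 ]; do
        members=""
        i=0
        # Take the next <replica> bricks as one set.
        while [ "$i" -lt "$replica" ] && [ $# -gt 0 ]; do
            members="$members $1"
            shift
            i=$((i + 1))
        done
        echo "set $n:$members"
        n=$((n + 1))
    done
}

replica_sets 2 server1:/storage/dir1 server2:/storage/dir2 server3:/storage/dir3 server4:/storage/dir4
```

Listing bricks so that each set spans different servers keeps both copies of a file from landing on the same machine.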
Stripe Replica Volume
Similar to RAID10, combines features of stripe and replica volumes.
Distributed Stripe Replica Volume
A composite volume of the three basic types. When writing files, the system first splits files into stripes. These stripes are assigned to different storage nodes, with replicas created on other nodes. During reading, the system can retrieve different stripes from multiple nodes in parallel and reconstruct the file.
GlusterFS Deployment
Environment Setup
| Node | Disks | Mount Points |
|---|---|---|
| node1/192.168.232.10 | /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | /data/sdb1 /data/sdc1 /data/sdd1 /data/sde1 |
| node2/192.168.232.20 | /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | /data/sdb1 /data/sdc1 /data/sdd1 /data/sde1 |
| node3/192.168.232.30 | /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | /data/sdb1 /data/sdc1 /data/sdd1 /data/sde1 |
| node4/192.168.232.40 | /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | /data/sdb1 /data/sdc1 /data/sdd1 /data/sde1 |
Initial Preparation
Disable Firewall
#!/bin/bash
# Stop the firewall service
systemctl stop firewalld
# Switch SELinux to permissive mode until the next reboot
setenforce 0
Configure Hostname
#!/bin/bash
# Run on node1; use node2, node3, and node4 on the other hosts
hostnamectl set-hostname node1
Refresh Disk Interfaces
#!/bin/bash
echo "- - -" > /sys/class/scsi_host/host0/scan
echo "- - -" > /sys/class/scsi_host/host1/scan
echo "- - -" > /sys/class/scsi_host/host2/scan
Verify that disks have been added on all four machines.
Disk Partitioning, Formatting, and Mounting
#!/bin/bash
# Script for disk partitioning, formatting, and mounting
PART_SCRIPT="/opt/disk_setup.sh"
cat << 'EOF' > "$PART_SCRIPT"
#!/bin/bash
# Discover new disk devices
DISKS=$(ls /dev/sd* | grep -o 'sd[b-z]' | uniq)
for DISK in $DISKS
do
    # Partition the disk
    echo -e "n\np\n\n\n\nw\n" | fdisk "/dev/$DISK" &> /dev/null
    # Format with XFS
    mkfs.xfs "/dev/${DISK}1" &> /dev/null
    # Create mount point
    mkdir -p "/data/${DISK}1" &> /dev/null
    # Add to fstab
    echo "/dev/${DISK}1 /data/${DISK}1 xfs defaults 0 0" >> /etc/fstab
done
# Mount all filesystems
mount -a &> /dev/null
EOF
# Make script executable and run it
chmod +x "$PART_SCRIPT"
"$PART_SCRIPT"
# Verify mounting
df -h
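One caveat with the script above is that running it twice appends duplicate lines to /etc/fstab. A possible guard is sketched below: it appends an entry only when the exact line is not already present. `add_fstab_entry` is a hypothetical helper, and the demo writes to a scratch file rather than the real /etc/fstab.

```shell
#!/bin/bash
# Append an XFS fstab entry only if the identical line is not already there.
# Arguments: device, mount point, fstab path (defaults to /etc/fstab).
add_fstab_entry() {
    local dev="$1" mnt="$2" fstab="${3:-/etc/fstab}"
    local line="$dev $mnt xfs defaults 0 0"
    grep -qxF "$line" "$fstab" 2>/dev/null || echo "$line" >> "$fstab"
}

# Demo against a temporary file: the second call is a no-op.
demo_fstab=$(mktemp)
add_fstab_entry /dev/sdb1 /data/sdb1 "$demo_fstab"
add_fstab_entry /dev/sdb1 /data/sdb1 "$demo_fstab"
cat "$demo_fstab"
rm -f "$demo_fstab"
```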
Prepare Yum Repository
#!/bin/bash
# Set up a local yum repository
cd /etc/yum.repos.d/
mkdir repo.bak
mv *.repo repo.bak
# Create repository configuration
cat << 'EOF' > /etc/yum.repos.d/glfs.repo
[glfs]
name=GlusterFS Repository
baseurl=file:///opt/gfsrepo
gpgcheck=0
enabled=1
EOF
# Clean and update repository cache
yum clean all && yum makecache
Install Server Software
#!/bin/bash
# Install GlusterFS packages
yum -y install glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
# Enable and start glusterd service
systemctl enable --now glusterd.service
# Verify service status
systemctl status glusterd.service
Add Nodes to Storage Trust Pool
Add nodes to the trust pool from any single node. If using domain names, ensure they're specified in the hosts file.
#!/bin/bash
# Add all nodes to the trust pool
gluster peer probe node1
gluster peer probe node2
gluster peer probe node3
gluster peer probe node4
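Probing by name only works if every node can resolve node1 through node4. As a sketch, the entries for the hosts file can be generated from the addresses in the environment table. `hosts_entries` is a hypothetical helper; it only prints the lines, so appending them to /etc/hosts is left as a separate step performed as root.

```shell
#!/bin/bash
# Print /etc/hosts entries for the four nodes in the environment table
# (192.168.232.10 node1 ... 192.168.232.40 node4).
hosts_entries() {
    local i
    for i in 1 2 3 4; do
        echo "192.168.232.${i}0 node${i}"
    done
}

hosts_entries
# To apply on a node (as root): hosts_entries >> /etc/hosts
```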
Create Volumes
| Volume Name | Volume Type | Bricks |
|---|---|---|
| dist-vol | Distributed Volume | node1(/data/sdb1), node2(/data/sdb1) |
| stripe-vol | Stripe Volume | node1(/data/sdc1), node2(/data/sdc1) |
| rep-vol | Replica Volume | node3(/data/sdb1), node4(/data/sdb1) |
| dist-stripe-vol | Distributed Stripe Volume | node1(/data/sdd1), node2(/data/sdd1), node3(/data/sdd1), node4(/data/sdd1) |
| dist-replica-vol | Distributed Replica Volume | node1(/data/sde1), node2(/data/sde1), node3(/data/sde1), node4(/data/sde1) |
Distributed Volume
#!/bin/bash
# Create a distributed volume
gluster volume create dist-vol node1:/data/sdb1 node2:/data/sdb1 force
# Start the volume
gluster volume start dist-vol
# View volume information
gluster volume info dist-vol
# List all volumes
gluster volume list
Mounting: Mount the distributed volume on each client, and add an entry to /etc/fstab so that the mount persists across reboots.
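A minimal sketch of the mounting step, assuming the client has the glusterfs and glusterfs-fuse packages installed and can resolve node1. The mount point /mnt/dist_test matches the test script in this section; the `run`, `fstab_line`, and `mount_volume` helpers are hypothetical. The block runs in dry-run mode by default and only prints the commands; set DRY_RUN=0 to execute them as root.

```shell
#!/bin/bash
# Dry-run wrapper: print commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build the /etc/fstab line for a persistent GlusterFS mount.
# _netdev delays the mount until the network is available.
fstab_line() {
    local server="$1" volume="$2" mount_point="$3"
    echo "$server:/$volume $mount_point glusterfs defaults,_netdev 0 0"
}

# Create the mount point, mount the volume, and show the fstab entry to add.
mount_volume() {
    local server="$1" volume="$2" mount_point="$3"
    run mkdir -p "$mount_point"
    run mount -t glusterfs "$server:/$volume" "$mount_point"
    echo "fstab entry: $(fstab_line "$server" "$volume" "$mount_point")"
}

mount_volume node1 dist-vol /mnt/dist_test
```

Any node in the trusted pool can be named as the server in the mount command; node1 is used here only as an example.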
Testing:
#!/bin/bash
# Test file creation and storage
# Assumes dist-vol is already mounted at this mount point
MOUNT_POINT="/mnt/dist_test"
mkdir -p "$MOUNT_POINT"
# Create five 40 MB test files
for i in 1 2 3 4 5; do
    dd if=/dev/zero of="$MOUNT_POINT/testfile$i.log" bs=1M count=40
done
Stripe Volume
#!/bin/bash
# Create a stripe volume with 2 stripes
gluster volume create stripe-vol stripe 2 node1:/data/sdc1 node2:/data/sdc1 force
# Start the stripe volume
gluster volume start stripe-vol
# View stripe volume information
gluster volume info stripe-vol
Testing: Each file is split evenly, so each of the two bricks holds 50% of its data, with no replicas or redundancy.
Replica Volume
#!/bin/bash
# Create a replica volume with 2 replicas
gluster volume create rep-vol replica 2 node3:/data/sdb1 node4:/data/sdb1 force
# Start the replica volume
gluster volume start rep-vol
# View replica volume information
gluster volume info rep-vol
Verification: Confirm that every file written to the volume appears in full on both bricks (node3 and node4).
Distributed Stripe Volume
#!/bin/bash
# Create a distributed stripe volume
# Specifying type as stripe with value 2, followed by 4 Brick Servers (2x the stripe count)
# creates a distributed stripe volume
gluster volume create dist-stripe-vol stripe 2 node1:/data/sdd1 node2:/data/sdd1 node3:/data/sdd1 node4:/data/sdd1 force
# Start the distributed stripe volume
gluster volume start dist-stripe-vol
# View volume information
gluster volume info dist-stripe-vol
Distributed Replica Volume
#!/bin/bash
# Create a distributed replica volume
# Specifying type as replica with value 2, followed by 4 Brick Servers (2x the replica count)
# creates a distributed replica volume
gluster volume create dist-replica-vol replica 2 node1:/data/sde1 node2:/data/sde1 node3:/data/sde1 node4:/data/sde1 force
# Start the distributed replica volume
gluster volume start dist-replica-vol
# View volume information
gluster volume info dist-replica-vol
Additional Maintenance Commands
#!/bin/bash
# List all GlusterFS volumes
gluster volume list
# View information for all volumes
gluster volume info
# Check status of all volumes
gluster volume status
# Stop a volume
gluster volume stop dist-stripe-vol
# Delete a volume (must be stopped first)
gluster volume delete dist-stripe-vol
# Set access control for a volume
# Deny a specific IP
gluster volume set dist-replica-vol auth.deny 192.168.80.100
# Allow all IPs in the 192.168.80.0 network to access dist-replica-vol
# (the pattern is quoted so the shell does not expand the *)
gluster volume set dist-replica-vol auth.allow "192.168.80.*"