Big Data Processing Overview
Big data involves analyzing massive datasets to extract valuable insights for organizational decision-making. Core processing stages include:
- Data acquisition
- Data processing
- Result visualization
Hadoop Framework
Hadoop provides distributed processing capabilities for large datasets across computer clusters. Its architecture addresses two fundamental challenges:
Distributed Storage (HDFS)
Files too large for a single node are split into blocks (128 MB by default) that are distributed across DataNodes and replicated for fault tolerance. A master NameNode manages file metadata and coordinates access.
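To make the block arithmetic concrete, the sketch below computes how many blocks a file would occupy. The helper name `hdfs_blocks` and the example file size are illustrative assumptions, not part of Hadoop; only the 128 MB block size comes from the text above.

```shell
# Sketch: how a file is split into fixed-size HDFS blocks.
# hdfs_blocks is a hypothetical helper, not a Hadoop command.
hdfs_blocks() {
    # $1 = file size in MB, $2 = block size in MB (defaults to 128)
    size_mb=$1
    block_mb=${2:-128}
    # Ceiling division: a partial final block still counts as one block.
    echo $(( (size_mb + block_mb - 1) / block_mb ))
}

# A 1000 MB file needs 8 blocks: seven full 128 MB blocks plus one 104 MB block.
hdfs_blocks 1000
```

The final block only occupies as much disk space as the data it holds, so the 104 MB remainder does not waste the rest of a 128 MB slot.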
Distributed Computation (MapReduce)
Computational tasks are divided into smaller units executed in parallel across cluster nodes. The YARN resource manager handles task scheduling and resource allocation.
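The map, shuffle, and reduce phases can be imitated on a single machine with an ordinary Unix pipeline. This is only an analogy under that assumption: the function name `word_count` is hypothetical, and real MapReduce jobs run these phases as distributed tasks scheduled by YARN rather than as local processes.

```shell
# Sketch: the map -> shuffle -> reduce flow imitated with a local pipeline.
# This runs on one machine; Hadoop runs the same phases across many nodes.
word_count() {
    tr -s '[:space:]' '\n' |     # map: emit one word (key) per line
        sort |                   # shuffle: bring identical keys together
        uniq -c |                # reduce: count each group of identical keys
        awk '{ print $2 "\t" $1 }'
}

printf 'big data big cluster\ndata node\n' | word_count
```

For the sample input this reports a count of 2 for `big` and `data`, and 1 for `cluster` and `node`.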
Core Components
- HDFS: Distributed file system for high-throughput data access
- MapReduce: Parallel processing framework for large datasets
- YARN: Resource management and job scheduling
- Common Utilities (Hadoop Common): Foundational libraries and tools shared by the other modules
Cluster Environment Configuration
Infrastructure Planning
| IP | Hostname | Roles |
|---|---|---|
| 192.168.174.100 | node01 | NameNode, ResourceManager, ZK |
| 192.168.174.120 | node02 | DataNode, NodeManager, ZK |
| 192.168.174.130 | node03 | DataNode, NodeManager, ZK |
System Configuration
- Disable firewalls:
```
systemctl stop firewalld
systemctl disable firewalld
```
- Disable SELinux (the config change takes effect after a reboot):
```
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
```
- Configure hostnames and network mappings
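The hostname step has no command in the list above, so here is one possible sketch. It prints the `/etc/hosts` entries for the three nodes in the planning table; the helper name `hosts_entries` is an assumption for illustration, and the function only prints, it does not modify any file.

```shell
# Sketch: build the /etc/hosts entries for the nodes in the planning table.
# hosts_entries is a hypothetical helper; it only prints and changes nothing.
hosts_entries() {
    cat <<'EOF'
192.168.174.100 node01
192.168.174.120 node02
192.168.174.130 node03
EOF
}

hosts_entries
# To apply on each node (as root), append the output to /etc/hosts and set
# that node's own name, for example on the first machine:
#   hosts_entries >> /etc/hosts
#   hostnamectl set-hostname node01
```

Keeping the same three entries on every node means each machine can resolve the others by name, which the SSH, ZooKeeper, and Hadoop settings later in this guide rely on.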
SSH Key Authentication
- Generate keys on all nodes:
```
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
```
- Distribute public keys to every node, including the local one:
```
ssh-copy-id node01
ssh-copy-id node02
ssh-copy-id node03
```
Time Synchronization
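- Install NTP and open the crontab editor:
```
yum install -y ntp
crontab -e
```
- Add an entry that synchronizes the clock every minute:
```
*/1 * * * * /usr/sbin/ntpdate pool.ntp.org
```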
Software Installation
Java Development Kit
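```
# Unpack the JDK into /opt (creates /opt/jdk1.8.0_221)
tar -zxvf jdk-8u221-linux-x64.tar.gz -C /opt/
# Persist JAVA_HOME and PATH for all users
echo 'export JAVA_HOME=/opt/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
# Load the new variables into the current shell
source /etc/profile
```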
Zookeeper Cluster
- Configure zoo.cfg:
```
dataDir=/data/zookeeper
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
```
- Initialize node identities:
```
# On node01:
echo 1 > /data/zookeeper/myid
# On node02:
echo 2 > /data/zookeeper/myid
# On node03:
echo 3 > /data/zookeeper/myid
```
- Start the ensemble (run on every node):
```
zkServer.sh start
```
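Each `myid` value must match the `server.N` index for that host in zoo.cfg. As a sketch, the ID could be derived from the hostname instead of typed by hand on each machine. The helper `myid_for_host` is hypothetical and assumes the `nodeNN` naming used in this guide, where the numeric suffix equals the server index.

```shell
# Sketch: derive the ZooKeeper myid from a hostname such as node02.
# myid_for_host is a hypothetical helper; it assumes hostnames of the form
# nodeNN whose numeric suffix matches the server.N index in zoo.cfg.
myid_for_host() {
    suffix=${1#node}                             # node02 -> 02
    suffix=$(echo "$suffix" | sed 's/^0*//')     # 02 -> 2
    echo "$suffix"
}

myid_for_host node02   # prints 2
# On a real node this could replace the per-host echo commands:
#   myid_for_host "$(hostname -s)" > /data/zookeeper/myid
```

With three servers, the ensemble keeps a quorum as long as any two of them are running.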
Hadoop Cluster Deployment
Configuration Files
core-site.xml
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```
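hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- Only two DataNodes (node02, node03) are planned, so a factor
         above 2 would leave every block under-replicated -->
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
</configuration>
```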
yarn-site.xml
```
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```
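Two files are not shown in the listings above but are usually needed for this layout: a `workers` file naming the DataNode/NodeManager hosts, and a `mapred-site.xml` that tells MapReduce to run on YARN. The sketch below generates both. It assumes Hadoop 3.x (where the host list is called `workers`) and writes to a scratch directory by default; the `CONF_DIR` variable is an illustrative name, and `$HADOOP_HOME/etc/hadoop` is the assumed real location.

```shell
# Sketch: generate two config files the listings above do not show.
# CONF_DIR defaults to a scratch directory so nothing real is overwritten;
# on a cluster it would typically be $HADOOP_HOME/etc/hadoop (an assumption).
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# workers: one DataNode/NodeManager host per line (Hadoop 3.x file name).
printf '%s\n' node02 node03 > "$CONF_DIR/workers"

# mapred-site.xml: run MapReduce jobs on YARN instead of the local runner.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/workers and $CONF_DIR/mapred-site.xml"
```

The same configuration directory should be copied to all three nodes so that every daemon reads identical settings.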
Cluster Initialization
- Format HDFS (run once, on node01 only; reformatting erases existing NameNode metadata):
```
hdfs namenode -format
```
- Start services (on node01):
```
start-dfs.sh
start-yarn.sh
mapred --daemon start historyserver
```
Verification
- HDFS Web UI: http://node01:9870
- YARN Web UI: http://node01:8088
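Besides the web UIs, the running Java daemons on each node can be checked with `jps`. The sketch below compares jps-style output against the daemons a node is expected to run according to the planning table. The helper `missing_daemons` is hypothetical, and the sample PIDs are made up for illustration.

```shell
# Sketch: compare `jps` output with the daemons a node should be running.
# missing_daemons is a hypothetical helper; it reads jps-style lines
# ("<pid> <ClassName>") on stdin and prints every expected name not found.
missing_daemons() {
    seen=$(awk '{ print $2 }')
    for daemon in "$@"; do
        echo "$seen" | grep -qx "$daemon" || echo "$daemon"
    done
}

# Sample jps output for node01 (PIDs are made up). All expected daemons
# are present, so this prints nothing.
printf '2101 NameNode\n2350 ResourceManager\n1822 QuorumPeerMain\n2689 Jps\n' |
    missing_daemons NameNode ResourceManager QuorumPeerMain
# On a live worker node:
#   jps | missing_daemons DataNode NodeManager QuorumPeerMain
```

ZooKeeper appears in `jps` as `QuorumPeerMain`, which is why that name stands in for the ZK role from the table.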