Increase Hadoop Distributed Storage Dynamically

Setup Hadoop Cluster And Increase Storage Without Downtime

Harshet Jain
4 min read · Apr 15, 2021

Today, I discuss a great topic: how we can increase our storage on the fly without any disruption. As you know, nowadays almost every industry uses a distributed storage system, because big data arrives every day and companies are also concerned about cost and performance.

There is a lot of software on the market that provides distributed storage, such as Hadoop and Splunk.

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

In a distributed storage system, there are two ways to increase the storage.

  1. Horizontal scaling
  2. Vertical scaling

First, we create a Hadoop cluster where we set up the master and slave nodes. Then, I will talk about the storage part.

Hadoop Master Node

We need the Hadoop software; I use Hadoop 1.2.1, which you can download from anywhere. We also need Java, because Hadoop is built on top of Java, so download a compatible Java version. Now, configure the master node.

Go to /etc/hadoop/ and edit the core-site.xml and hdfs-site.xml files.

core-site.xml
hdfs-site.xml

In hdfs-site.xml, we write the name of the directory we want to share inside the value tag.
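My screenshots show the exact files, but roughly the master's configuration looks like this (the port 9001 and the /namenode_dir directory are just example values, not necessarily the ones from my setup):

core-site.xml (NameNode):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

hdfs-site.xml (NameNode):

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/namenode_dir</value>
  </property>
</configuration>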

Note: In Hadoop, the master node is also called the NameNode and the slave node is called the DataNode.

Format the master node.

hadoop namenode -format

Start the services

hadoop-daemon.sh start namenode
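You can confirm the daemon is running with the jps command, which lists the running Java processes and should show NameNode.

jps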

Hadoop Slave Node

As with the master, you need the Hadoop and Java software again. Now, configure the slave node. The configuration is nearly the same; just change some of the values.

core-site.xml

Here, write your NameNode's IP in the value tag.
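For illustration, if the NameNode's IP were 192.168.1.10 (just an assumed address), the slave's core-site.xml would look roughly like:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>
  </property>
</configuration>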

Now the interesting part comes up. Before we share our storage, we attach a separate hard disk and mount it onto that directory, so that if the storage gets full we can add more hard disks without any other changes.

We use the LVM concept to fulfill the above condition. First, attach a hard disk. I attach two hard disks of 1 GB each.

Hard disks

You can check them using the command fdisk -l.

fdisk -l

We need to create a PV (physical volume) and a VG (volume group) for LVM (Logical Volume Management). First, create the PV using the command pvcreate {disk name}. After that, you can check it using the command pvdisplay.

pvdisplay
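For example, if the two new disks show up as /dev/sdb and /dev/sdc (the device names on your machine may differ), the commands would be:

pvcreate /dev/sdb /dev/sdc
pvdisplay /dev/sdb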

Create a VG from the PVs with vgcreate {vg name} {pv name} and check it with vgdisplay {vg name}.

vgdisplay
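Continuing with the assumed device names, and using hadoop_vg as an arbitrary VG name:

vgcreate hadoop_vg /dev/sdb /dev/sdc
vgdisplay hadoop_vg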

Now, create an LV of whatever size you want, as long as it fits within the VG size.

lvcreate --size {size}G --name {name} {vg name}

Check it with lvdisplay {vg name}/{lv name}

lvdisplay
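For instance, to carve out a 1 GB LV named hadoop_lv (an arbitrary name) from the assumed hadoop_vg volume group:

lvcreate --size 1G --name hadoop_lv hadoop_vg
lvdisplay hadoop_vg/hadoop_lv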

Then, format that LV:

mkfs.ext4 {lv path}

Now it's ready for use; just mount it onto the directory:

mount {lv path} {folder name}
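With the assumed names from above, formatting and mounting the LV onto a /datanode_dir directory (an example path, pick your own) would look like:

mkfs.ext4 /dev/hadoop_vg/hadoop_lv
mkdir /datanode_dir
mount /dev/hadoop_vg/hadoop_lv /datanode_dir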

Finally, put the directory name in the Hadoop configuration file.

hdfs-site.xml
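A minimal sketch of the DataNode's hdfs-site.xml, assuming the /datanode_dir mount point from above:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/datanode_dir</value>
  </property>
</configuration>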

Start the services

hadoop-daemon.sh start datanode

Now your cluster is ready for use. You can check the details of the DataNodes:

hadoop dfsadmin -report
Nodes report

Now let's get to our main topic: if our storage gets full and we need to add more on the fly without any disruption, there are two ways:

  1. We can add one more DataNode, the same way as above; this is called horizontal scaling. But it is a somewhat lengthy process.
  2. We can add more storage to the existing Hadoop slave node; this is called vertical scaling.

In vertical scaling, there are also two ways to add storage:

  1. Increase the LV size if extra storage is available in the VG (a concrete sketch of both options follows this list).
lvextend --size +{size}G {lv path}
resize2fs {lv path}
Added data

  2. Add a new hard disk, create a PV from it, and attach it to the VG. Then increase the size of the LV just as in option 1.

vgextend {vg name} {pv name}
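Here is a concrete sketch of both options with the names assumed earlier (hadoop_vg, hadoop_lv, and a new disk appearing as /dev/sdd are all illustrative):

# Option 1: grow the LV from free space already in the VG
lvextend --size +500M /dev/hadoop_vg/hadoop_lv
resize2fs /dev/hadoop_vg/hadoop_lv   # grows the ext4 filesystem online, no unmount needed

# Option 2: add a new disk to the VG first, then grow the LV
pvcreate /dev/sdd
vgextend hadoop_vg /dev/sdd
lvextend --size +1G /dev/hadoop_vg/hadoop_lv
resize2fs /dev/hadoop_vg/hadoop_lv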

That's all…

Let's consider a scenario where we need storage urgently: we can also reduce the storage of another LV and add that space to the LV that needs it, but the only condition is that both LVs are in the same VG.
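A rough sketch of that, assuming a donor LV called other_lv mounted on /other_dir in the same hadoop_vg volume group (all names and sizes here are made up). Note that ext4 can only be shrunk offline, so the donor LV has to be unmounted first:

# shrink the donor LV to free up space in the VG
umount /other_dir
e2fsck -f /dev/hadoop_vg/other_lv
resize2fs /dev/hadoop_vg/other_lv 500M
lvreduce --size 500M /dev/hadoop_vg/other_lv
mount /dev/hadoop_vg/other_lv /other_dir

# grow the Hadoop LV with the freed space
lvextend --size +500M /dev/hadoop_vg/hadoop_lv
resize2fs /dev/hadoop_vg/hadoop_lv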

Thank you for reading!
