What is Sharding?

MongoDB sharding provides a scalable solution for storing a large amount of data across a number of servers rather than on a single server. In practical terms, it is not feasible to store exponentially growing data on a single machine. Querying a huge amount of data stored on a single server can lead to high resource utilization and may not provide satisfactory read and write throughput. Broadly, two types of scaling methods exist to handle growing data:

- Vertical
- Horizontal

Vertical scaling enhances the performance of a single server by adding more powerful processors, upgrading RAM, or adding more disk space. However, there are practical limits to vertical scaling with existing technology and hardware configurations. Horizontal scaling adds more servers and distributes the load across them. Since each machine handles a subset of the whole dataset, it provides a more efficient and cost-effective solution than deploying high-end hardware, though it requires maintaining a more complex infrastructure with a large number of servers.

Sharding components

To achieve sharding in MongoDB, the following components are required:

Shard is a mongod instance that handles a subset of the original data. Shards are required to be deployed as a replica set.

Mongos is a mongo instance that acts as an interface between a client application and the sharded cluster. It works as a query router to the shards.

Config Server is a mongod instance that stores metadata information and configuration details of the cluster. MongoDB requires the config servers to be deployed as a replica set.

Sharding Architecture

A MongoDB sharded cluster consists of a number of replica sets, each with a minimum of three mongo instances. A sharded cluster may consist of multiple shard instances, and each shard instance works within a shard replica set. The application interacts with mongos, which in turn communicates with the shards; in sharding, applications never interact directly with the shard nodes. The query router distributes subsets of the data among the shard nodes based upon the shard key.

Sharding Implementation

Follow the below steps for sharding.

Step 1

Start the config servers in a replica set and enable replication between them.

mongod --configsvr --port 27019 --replSet rs0 --dbpath C:\data\data1 --bind_ip localhost
mongod --configsvr --port 27018 --replSet rs0 --dbpath C:\data\data2 --bind_ip localhost
mongod --configsvr --port 27017 --replSet rs0 --dbpath C:\data\data3 --bind_ip localhost

Step 2

Initialize the replica set on one of the config servers.

rs.initiate(
   {
      _id: "rs0",
      configsvr: true,
      members: [
         { _id: 0, host: "IP:27017" },
         { _id: 1, host: "IP:27018" },
         { _id: 2, host: "IP:27019" }
      ]
   }
)

Step 3

Start the shard servers in a replica set and enable replication between them.

mongod --shardsvr --port 27020 --replSet rs1 --dbpath C:\data\data4 --bind_ip localhost
mongod --shardsvr --port 27021 --replSet rs1 --dbpath C:\data\data5 --bind_ip localhost
mongod --shardsvr --port 27022 --replSet rs1 --dbpath C:\data\data6 --bind_ip localhost

Step 4

Initialize the replica set on one of the shard servers.

rs.initiate(
   {
      _id: "rs1",
      members: [
         { _id: 0, host: "IP:27020" },
         { _id: 1, host: "IP:27021" },
         { _id: 2, host: "IP:27022" }
      ]
   }
)

Step 5

Start mongos for the sharded cluster.

mongos --port 40000 --configdb rs0/localhost:27019,localhost:27018,localhost:27017

Step 6

Connect to the mongos route server.

mongo --port 40000

Now, add the shard servers.

sh.addShard("rs1/localhost:27020,localhost:27021,localhost:27022")

Step 7

On the mongo shell, enable sharding on the DB and collections.

Enable sharding on the DB:

sh.enableSharding("geekFlareDB")

Step 8

To shard the collection, a shard key (described later in this article) is required.

Syntax: sh.shardCollection("dbName.collectionName", { "key" : 1 })
Note: if the collection doesn't exist, create it first (a combined example follows Step 10).

Step 9

Insert data into the collection. Mongo logs will start growing, indicating that the balancer is in action and trying to balance the data among the shards.

Step 10

The last step is to check the status of the sharding. The status can be checked by running the below command on the mongos route node.
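Putting Steps 8-10 together, here is a minimal end-to-end sketch in the mongo shell; the collection name geekFlareCol and the field key are illustrative assumptions, not fixed names:

use geekFlareDB
// Creating an index on the intended shard key also creates the collection if it doesn't exist.
db.geekFlareCol.createIndex({ "key": 1 })
// Shard the collection on that key.
sh.shardCollection("geekFlareDB.geekFlareCol", { "key": 1 })
// Insert sample documents so the balancer has data to distribute.
for (var i = 0; i < 1000; i++) { db.geekFlareCol.insert({ key: i, name: "user" + i }) }
// Check the sharding status from the mongos shell.
sh.status()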

Sharding status

Check the sharding status by running the below command on the mongos route node.

sh.status()

Data distribution

The mongos router distributes the load among the shards based upon the shard key, and the balancer comes into action to distribute the data evenly. The key components for distributing data among the shards are the balancer and the shard key.

A balancer plays a role in balancing the subset of data among the shard nodes. The balancer runs when the mongos server starts distributing load among the shards; once started, the balancer distributes the data more evenly. To check the state of the balancer, run sh.status(), sh.getBalancerState(), or sh.isBalancerRunning().

After inserting the data, we can notice some activity in the mongos daemon stating that it is moving some chunks to specific shards, i.e., the balancer is in action trying to balance the data across the shards. A running balancer can lead to performance issues; hence, it is suggested to run the balancer within a defined balancer window, as sketched below.
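One documented way to do this is to set the balancer's active window in the config database from the mongos shell; the start and stop times below are illustrative:

use config
// Allow the balancer to run only between 23:00 and 06:00 (illustrative off-peak window).
db.settings.update(
   { _id: "balancer" },
   { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
   { upsert: true }
)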

The shard key determines the logic for distributing documents of a sharded collection among the shards. The shard key is an indexed field or indexed compound field that must be present in all documents inserted into the collection. Data is partitioned into chunks, and each chunk is associated with a range of shard key values. On the basis of that range, the query router decides which shard will store a given chunk.
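For example, a collection can be sharded on an indexed compound field as follows; the orders collection and its fields are hypothetical, not part of the setup above:

use geekFlareDB
// Index backing the compound shard key (hypothetical collection and fields).
db.orders.createIndex({ "customerId": 1, "orderDate": 1 })
// Documents are partitioned into chunks by ranges of (customerId, orderDate).
sh.shardCollection("geekFlareDB.orders", { "customerId": 1, "orderDate": 1 })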

A shard key can be selected by considering five properties, briefly illustrated after the list:

- Cardinality
- Write distribution
- Read distribution
- Read targeting
- Read locality
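For instance, cardinality alone can make or break a shard key; the events collection and its fields below are hypothetical:

use geekFlareDB
db.events.createIndex({ "userId": 1, "timestamp": 1 })
// A low-cardinality field such as a boolean would be a poor shard key, since all
// documents fall into at most two shard key values and chunks cannot split well:
// sh.shardCollection("geekFlareDB.events", { "isActive": 1 })
// A high-cardinality compound key distributes writes across chunks and lets reads
// that filter on userId be targeted to a single shard:
sh.shardCollection("geekFlareDB.events", { "userId": 1, "timestamp": 1 })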

Removing the shard node

Before removing shards from the cluster, the user is required to ensure the safe migration of data to the remaining shards. MongoDB takes care of safely draining data to other shard nodes before the removal of the required shard node. Run the below commands to remove the required shard.

Step 1

First, we need to determine the hostname of the shard to be removed. The below command will list all the shards present in the cluster along with the state of each shard.

db.adminCommand( { listShards: 1 } )

Step 2

Issue the below command to remove the required shard from the cluster. Once issued, the balancer takes care of removing chunks from the draining shard node and then balances the distribution of the remaining chunks among the rest of the shard nodes.

db.adminCommand( { removeShard: "shardedReplicaNodes" } )

Step 3

To check the status of the draining shard, issue the same command again.

db.adminCommand( { removeShard: "rs1/127.0.0.1:27020,127.0.0.1:27021,127.0.0.1:27022" } )

We need to wait until the draining of the data is complete. The msg and state fields will show whether the draining has been completed, as shown in the abridged sample output after these steps. We can also check the status with the command sh.status(): once removed, the shard node will not be reflected in the output, but while draining is ongoing, the shard node will appear with a draining status of true.

Step 4

Keep checking the status of the draining with the same command above until the required shard is removed completely. Once completed, the output of the command will reflect the message and state as completed.

Step 5

Finally, we need to check the remaining shards in the cluster. To check the status, enter sh.status() or db.adminCommand( { listShards: 1 } ). Here, we can see the removed shard is no longer present in the list of shards.
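For reference, the removeShard command reports its progress in the msg and state fields; the abridged, illustrative responses below show the general shape (exact fields vary by MongoDB version):

// While draining is ongoing:
{ "msg" : "draining ongoing", "state" : "ongoing", ... }
// Once draining has finished:
{ "msg" : "removeshard completed successfully", "state" : "completed", ... }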

Benefits of Sharding over Replication

- In replication, the primary node handles all write operations, whereas secondary servers are required to maintain backup copies or serve read-only operations. In sharding along with replica sets, the load gets distributed among a number of servers.
- A single replica set is limited in the number of nodes it can have (50 in recent MongoDB versions), but there is no restriction on the number of shards.
- Replication requires high-end hardware or vertical scaling for handling large datasets, which is too expensive compared to adding additional servers in sharding.
- In replication, read performance can be enhanced by adding more slave/secondary servers, whereas in sharding, both read and write performance are enhanced by adding more shard nodes.

Sharding limitations

- A sharded cluster doesn't support unique indexes across the shards unless the unique index is prefixed with the full shard key (see the sketch after this list).
- All update operations for a sharded collection, whether on one or many documents, must contain the shard key or the _id field in the query.
- Existing collections can be sharded only if their size doesn't exceed the specified threshold. This threshold can be estimated on the basis of the average size of all shard keys and the configured chunk size.
- Sharding imposes operational limits on the maximum collection size and the number of splits.
- Choosing the wrong shard key can lead to performance implications.
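As a sketch of the unique-index rule, sh.shardCollection() accepts a unique flag as its third argument, which works only when the unique index is the shard key or is prefixed by it; the users collection and email field are hypothetical:

use geekFlareDB
// Unique index that is also the shard key (hypothetical namespace and field).
db.users.createIndex({ "email": 1 }, { unique: true })
// Passing true as the third argument enforces uniqueness on the shard key.
sh.shardCollection("geekFlareDB.users", { "email": 1 }, true)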

Conclusion

MongoDB offers built-in sharding to implement large databases without compromising performance. I hope the above helps you set up MongoDB sharding. Next, you may want to get familiar with some of the commonly used MongoDB commands.
