Document Splitting

Follow

MongoDB has a document size limitation, which is tied to BSON object size limitation and it is 16Mb per document.

Even if you have not reached the limit, depending on the amount of writes to you make to the document, when document becomes large, writes become much slower.

So there is a document splitting technique, which is similar to MongoDB sharding, but works locally by splitting data logically.

We are using this technique on all metrics collections, as well as events and users collections. Since we are calculating totals for each metric or user, we need to make sure that same metric and same user would always go into the same document.

So for example, when a value for metric carrier comes in, we take the value, hash it using md5, encode it in base64 and get the first symbol.

Then we append this symbol to the id of the collection we will put this metric in. That way we almost evenly divide incoming metric into 64 split documents.

var crypto = require("crypto");

var carrier = "AT&T";

var postfix = crypto.createHash("md5").update(carrier).digest('base64')[0];

var id = app_id+"_2016:0_"+postfix;

common.db.collection("carriers").update({_id:id}, {$set:someUpdate}, {upsert:true}, function(){});

The structure of each document is exactly the same as it would be when used normally in single document. Then only difference is that the same data is split across multiple documents and each write is happening to a smaller document.

Of course now when fetching data, we need to fetch all 64 documents and merge the data together. It still justifies the time writes take for large documents.

var base_id = app_id+"_2016:0";
var docs = [];

//common.base64 is just an array of all possible symbols from 64 alphabet, as
//["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z","+","/"];

for(var i = 0; i < common.base64.length; i++){
	docs.push(base_id+"_"+common.base64[i]);
}

common.db.collection("carriers").find({_id:{$in: docs}}).toArray(function(err, res){
	//merge data
});

And we just merge them by summing up all properties.

Looking for help?