Here are my answers for the M101J final exam, worked out while solving it myself. I hope you use them wisely. My point in discussing each question is that everyone can check where they are going wrong, and perhaps someone can show me a better solution than mine. So please use this only as an extra check after you have solved each question on your own, so that the explanations benefit you the most.
Question 1 :
Here we need to query the Enron dataset and calculate the number of messages sent by Andrew Fastow, the CFO, to Jeff Skilling, the president. Andrew Fastow's email address was andrew.fastow@enron.com and Jeff Skilling's was jeff.skilling@enron.com.
For this, first download the Enron dump, extract it, and import it into a MongoDB database named enron with a collection named messages. The import command is:
mongoimport -d enron -c messages enron.json
Now switch to the mongo shell:
use enron
db.messages.find({"headers.From":"andrew.fastow@enron.com","headers.To":"jeff.skilling@enron.com"}).count()
(Note the direction: Fastow is the sender, so he goes in headers.From, and Skilling goes in headers.To.) This produces the answer: 3
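Since this is the Java course, here is a rough Java equivalent using the legacy 2.x driver (the class and variable names are mine, just for illustration):
import com.mongodb.*;
import java.io.IOException;

public class Question1 {
    public static void main(String[] args) throws IOException {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection messages = client.getDB("enron").getCollection("messages");

        // Count messages where Fastow is the sender and Skilling appears in the recipient list
        BasicDBObject query = new BasicDBObject("headers.From", "andrew.fastow@enron.com")
                .append("headers.To", "jeff.skilling@enron.com");
        System.out.println(messages.count(query));   // should print 3
        client.close();
    }
}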
Question 2:
Please use the Enron dataset you imported for the previous problem. For this question you will use the aggregation framework to figure out pairs of people that tend to communicate a lot. To do this, you will need to unwind the To list for each message.
The mongo shell command that retrieves the desired answer is:
db.messages.aggregate([
    { $project: { from: "$headers.From", to: "$headers.To" } },
    { $unwind: "$to" },
    { $group: { _id: { _id: "$_id", from: "$from", to: "$to" } } },
    { $group: { _id: { from: "$_id.from", to: "$_id.to" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 2 }
])
This gives the top 2 communicating pairs; check the topmost one, which turns out to be:
"result" : [
{
"_id" : {
"from" : "susan.mara@enron.com",
"to" : "jeff.dasovich@enron.com"
},
"count" : 750
},
{
"_id" : {
"from" : "soblander@carrfut.com",
"to" : "soblander@carrfut.com"
},
"count" : 679
}
],
"ok" : 1
So, it clearly shows the answer is "susan.mara@enron.com" to "jeff.dasovich@enron.com"
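For anyone who prefers to run the same pipeline from Java, here is a sketch using the legacy 2.x driver (it assumes a driver version that accepts the pipeline as a List, roughly 2.12 or newer; the class name is mine):
import com.mongodb.*;
import java.io.IOException;
import java.util.Arrays;

public class Question2 {
    public static void main(String[] args) throws IOException {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection messages = client.getDB("enron").getCollection("messages");

        // Keep only the sender and the recipient list of each message
        DBObject project = new BasicDBObject("$project",
                new BasicDBObject("from", "$headers.From").append("to", "$headers.To"));
        // One document per (message, recipient)
        DBObject unwind = new BasicDBObject("$unwind", "$to");
        // First group removes duplicate recipients within the same message
        DBObject dedupe = new BasicDBObject("$group", new BasicDBObject("_id",
                new BasicDBObject("_id", "$_id").append("from", "$from").append("to", "$to")));
        // Second group counts messages per (from, to) pair
        DBObject count = new BasicDBObject("$group", new BasicDBObject("_id",
                new BasicDBObject("from", "$_id.from").append("to", "$_id.to"))
                .append("count", new BasicDBObject("$sum", 1)));
        DBObject sort = new BasicDBObject("$sort", new BasicDBObject("count", -1));
        DBObject limit = new BasicDBObject("$limit", 2);

        AggregationOutput out = messages.aggregate(
                Arrays.asList(project, unwind, dedupe, count, sort, limit));
        for (DBObject pair : out.results()) {
            System.out.println(pair);
        }
        client.close();
    }
}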
Question 3:
In this problem you will update a document in the Enron dataset to illustrate your mastery of updating documents from the shell. Please add the email address "mrpotatohead@10gen.com" to the list of addresses in the "headers.To" array for the document with "headers.Message-ID" of "<8147308.1075851042335.JavaMail.evans@thyme>"
For this, a simple update in the mongo shell does the job:
db.messages.update({"headers.Message-ID":"<8147308.1075851042335.JavaMail.evans@thyme>"},{$addToSet:{"headers.To":"mrpotatohead@10gen.com"}})
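If you would rather do the same thing from Java, here is a minimal sketch with the legacy driver (again, the class name is just for illustration):
import com.mongodb.*;
import java.io.IOException;

public class Question3 {
    public static void main(String[] args) throws IOException {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection messages = client.getDB("enron").getCollection("messages");
        // $addToSet appends the address only if it is not already in headers.To
        messages.update(
                new BasicDBObject("headers.Message-ID", "<8147308.1075851042335.JavaMail.evans@thyme>"),
                new BasicDBObject("$addToSet", new BasicDBObject("headers.To", "mrpotatohead@10gen.com")));
        client.close();
    }
}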
Then run the validation script, which gives the validation code: 897h6723ghf25gd87gh28
Question 4:
Enhancing the Blog to support viewers liking certain comments.
Here you need to work on BlogPostDAO.java, in the area marked XXXXXX:
postsCollection.update(new BasicDBObject("permalink", permalink), new BasicDBObject("$inc", new BasicDBObject("comments." + ordinal + ".num_likes", 1)));
In the line above we look up the post in the posts collection by its permalink and increment the like counter by one on the comment that was clicked, addressed by its ordinal (its position in the comments array). This guarantees that the like count goes up only for the comment that was actually liked.
With this change in place you can see that the Like button starts working.
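For context, the surrounding DAO code ends up looking roughly like the sketch below. Only the update call itself comes from the answer above; the method name, signature and constructor around it are my guesses at the starter code, so treat this as an illustration rather than the exact file:
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class BlogPostDAO {
    private final DBCollection postsCollection;

    public BlogPostDAO(final DBCollection postsCollection) {
        this.postsCollection = postsCollection;
    }

    // Sketch of the method containing the XXXXXX block
    public void likePost(final String permalink, final int ordinal) {
        // Find the post by its permalink and bump num_likes on the ordinal-th comment
        postsCollection.update(
                new BasicDBObject("permalink", permalink),
                new BasicDBObject("$inc",
                        new BasicDBObject("comments." + ordinal + ".num_likes", 1)));
    }
}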
Now run the validator and you will get the code: 983nf93ncafjn20fn10f
Question 5 :
In this question a set of indexes is given and we have to select the indexes that might be used in executing:
db.fubar.find({'a':{'$lt':10000}, 'b':{'$gt': 5000}}, {'a':1, 'c':1}).sort({'c':-1})
The find portion filters on a and b, the projection returns a and c, and the sort is on c in descending order.
_id_ -- This index is not used by either the find or the sort part of the operation.
a_1_b_1 -- This index can be used for the find, since the query filters on both a and b.
a_1_c_1 -- This index can also be used: its prefix a helps with the filter on a, and c can serve the sort.
c_1 -- This index is also usable; an index does not have to help the find at all, it can be picked purely for the sort({'c':-1}).
a_1_b_1_c_1 -- This covers all three fields a, b and c, so it is a valid candidate as well.
The sketch after this list shows one way to confirm which index the optimizer actually picks.
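To check which of the indexes above the optimizer chooses, you can run the query with explain(), optionally hinting a candidate index and comparing the outputs. A sketch with the legacy Java driver (it assumes the fubar collection lives in a database called test, and that the listed indexes exist):
import com.mongodb.*;
import java.io.IOException;

public class Question5Explain {
    public static void main(String[] args) throws IOException {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection fubar = client.getDB("test").getCollection("fubar");

        DBObject query = new BasicDBObject("a", new BasicDBObject("$lt", 10000))
                .append("b", new BasicDBObject("$gt", 5000));
        DBObject projection = new BasicDBObject("a", 1).append("c", 1);

        // Let the optimizer choose, then look at "cursor" and "scanAndOrder" in the output
        DBObject explain = fubar.find(query, projection)
                .sort(new BasicDBObject("c", -1))
                .explain();
        System.out.println(explain);

        // Or force one particular candidate, e.g. a_1_c_1, and compare the explain output
        DBObject hinted = fubar.find(query, projection)
                .sort(new BasicDBObject("c", -1))
                .hint("a_1_c_1")
                .explain();
        System.out.println(hinted);

        client.close();
    }
}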
Question 6
Suppose you have a collection of students of the following form:
{
"_id" : ObjectId("50c598f582094fb5f92efb96"),
"first_name" : "John",
"last_name" : "Doe",
"date_of_admission" : ISODate("2010-02-21T05:00:00Z"),
"residence_hall" : "Fairweather",
"has_car" : true,
"student_id" : "2348023902",
"current_classes" : [
"His343",
"Math234",
"Phy123",
"Art232"
]
}
Now suppose that basic inserts into the collection, which only include the last name, first name and student_id, are too slow. What could potentially improve the speed of inserts? Check all that apply.
Add an index on last_name, first_name if one does not already exist.
Set w=0, j=0 on writes
Remove all indexes from the collection
Provide a hint to MongoDB that it should not use an index for the inserts
Build a replica set and insert data into the secondary nodes to free up the primary nodes.
Option 1 - Adding an index helps reads, not writes; in fact every additional index has to be maintained on each insert, so this does not speed up inserts.
Option 2 - This is valid: with w=0 and j=0 the driver does not wait for any acknowledgement of the write at all, the data is simply sent off without confirmation, which speeds up the inserts (see the sketch after this list).
Option 3 - Removing all indexes would actually help, as the server no longer has to update index entries on every insert, which speeds up writes.
Option 4 - This makes no sense; inserts do not use an index to locate anything, so there is no index hint to give.
Option 5 - This is not possible, because writes can only go to the primary of a replica set, not to the secondary nodes.
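As a small illustration of option 2, this is how w=0 writes look with the legacy Java driver. The document shape matches the question; the database and collection names and the class name are assumptions of mine:
import com.mongodb.*;
import java.io.IOException;

public class UnacknowledgedInserts {
    public static void main(String[] args) throws IOException {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection students = client.getDB("school").getCollection("students");

        // WriteConcern.UNACKNOWLEDGED is w=0: the driver fires the insert and does not
        // wait for the server to confirm it, so inserts return as fast as possible
        // (at the cost of never seeing errors such as duplicate keys).
        students.setWriteConcern(WriteConcern.UNACKNOWLEDGED);

        BasicDBObject doc = new BasicDBObject("first_name", "John")
                .append("last_name", "Doe")
                .append("student_id", "2348023902");
        students.insert(doc);

        client.close();
    }
}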
Question 7
You have been tasked to cleanup a photosharing database. The database consists of two collections, albums, and images. Every image is supposed to be in an album, but there are orphan images that appear in no album. Here are some example documents (not from the collections you will be downloading).
When you are done removing the orphan images from the collection, there should be 90,017 documents in the images collection.
In order to remove the orphan images I wrote a Java program:
import com.mongodb.*;
import java.io.IOException;

/**
 *
 * @author Ankur Gupta
 */
public class Test {
    public static void main(String[] args) throws IOException {
        MongoClient c = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DB db = c.getDB("finaltask");
        DBCollection albums = db.getCollection("albums");
        DBCollection images = db.getCollection("images");

        // Walk every image and delete the ones that no album references
        DBCursor cur = images.find();
        while (cur.hasNext()) {
            DBObject img = cur.next();
            Object id = img.get("_id");
            // An image is an orphan if no album lists its _id in its images array
            DBCursor inAlbum = albums.find(new BasicDBObject("images", id));
            if (!inAlbum.hasNext()) {
                images.remove(new BasicDBObject("_id", id));
            }
        }
        c.close();
    }
}
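One thing worth noting: the program above issues one albums query per image document, so it is slow unless those lookups are indexed. Creating a multikey index on the images array of albums first (not part of the program above, just an assumption that speeds it up) turns each orphan check into an index seek:
import com.mongodb.*;
import java.io.IOException;

public class CreateAlbumsImagesIndex {
    public static void main(String[] args) throws IOException {
        MongoClient c = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection albums = c.getDB("finaltask").getCollection("albums");
        // Multikey index on the images array so find({images: <id>}) is an index seek
        albums.createIndex(new BasicDBObject("images", 1));
        c.close();
    }
}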
To verify, after removing the orphans:
db.albums.aggregate({$unwind:"$images"},{$group:{_id:null,sum:{$sum:"$images"},count:{$sum:1}}})
The result looks like:
"result" : [
{
"_id" : null,
"sum" : NumberLong("4501039268"),
"count" : 90017
}
],
"ok" : 1
To prove you did it correctly, what is the total number of images with the tag 'sunrises' after the removal of orphans?
db.images.find({"tags":"sunrises"}).count()
This will fetch the final answer as
45044
Question 8:
Suppose we executed the following Java code. How many animals will be inserted into the "animals" collection?
import com.mongodb.*;
import java.io.IOException;

public class Question8 {
public static void main(String[] args) throws IOException {
MongoClient c = new MongoClient(new MongoClientURI("mongodb://localhost"));
DB db = c.getDB("test");
DBCollection animals = db.getCollection("animals");
BasicDBObject animal = new BasicDBObject("animal", "monkey");
animals.insert(animal);
animal.removeField("animal");
animal.append("animal", "cat");
animals.insert(animal);
animal.removeField("animal");
animal.append("animal", "lion");
animals.insert(animal);
}
}
When you run the above, an error is thrown for a duplicate _id. We keep modifying and re-inserting the same BasicDBObject, and the driver adds an _id field to that object on the first insert. The first insert therefore succeeds with { _id: ..., "animal": "monkey" }, but when ("animal", "cat") is pushed the _id is still the same, so the second insert fails with a duplicate key error and the third insert is never reached. So the answer is that only one document gets inserted.
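To see the mechanism, here is a small sketch of how all three inserts could be made to succeed: either remove the "_id" the driver added before re-inserting, or build a fresh BasicDBObject each time. This is just an illustration of the driver behaviour, not part of the exam question:
import com.mongodb.*;
import java.io.IOException;

public class Question8Fixed {
    public static void main(String[] args) throws IOException {
        MongoClient c = new MongoClient(new MongoClientURI("mongodb://localhost"));
        DBCollection animals = c.getDB("test").getCollection("animals");

        BasicDBObject animal = new BasicDBObject("animal", "monkey");
        animals.insert(animal);           // the driver adds an _id to 'animal' here

        animal.removeField("_id");        // drop the _id so the next insert gets a fresh one
        animal.removeField("animal");
        animal.append("animal", "cat");
        animals.insert(animal);

        // Or simply start from a brand-new object
        animals.insert(new BasicDBObject("animal", "lion"));

        c.close();
    }
}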
Question 9:
Imagine an electronic medical record database designed to hold the medical records of every individual in the United States. Because each person has more than 16MB of medical history and records, it's not feasible to have a single document for every patient. Instead, there is a patient collection that contains basic information on each person and maps the person to a patient_id, and a record collection that contains one document for each test or procedure. One patient may have dozens or even hundreds of documents in the record collection.
We need to decide on a shard key to shard the record collection. What's the best shard key for the record collection, provided that we are willing to run scatter gather operations to do research and run studies on various diseases and cohorts? That is, think mostly about the operational aspects of such a system.
patient_id
_id
primary care physician (your principal doctor)
date and time when medical record was created
patient first name
patient last name
Among the options given, the most favourable shard key is patient_id: there is a very large number of distinct patient_id values, so the records spread well across shards, and all of one patient's records end up on the same shard, which keeps the day-to-day operational queries (pulling up a single patient's history) targeted at one shard. The research queries across diseases and cohorts will be scatter-gather, which the question says we are willing to accept.
The other options are not suitable shard keys for this workload.
Question 10:
Understanding the output of explain. We perform the following query on the enron dataset:
db.messages.find({'headers.Date':{'$gt': new Date(2001,3,1)}},{'headers.From':1, _id:0}).sort({'headers.From':1}).explain()
and get the following explain output.
{
"cursor" : "BtreeCursor headers.From_1",
"isMultiKey" : false,
"n" : 83057,
"nscannedObjects" : 120477,
"nscanned" : 120477,
"nscannedObjectsAllPlans" : 120581,
"nscannedAllPlans" : 120581,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 250,
"indexBounds" : {
"headers.From" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "Andrews-iMac.local:27017"
}
The query did not utilize an index to figure out which documents match the find criteria.
The query used an index for the sorting phase.
The query returned 120,477 documents
The query performed a full collection scan
Here is how the options work out:
Option 1 is correct: the cursor is "BtreeCursor headers.From_1" and the indexBounds run from $minElement to $maxElement, i.e. the whole headers.From index was walked. The index did nothing to narrow down which documents match the headers.Date criterion in the find clause.
Option 2 is also correct: the index that was walked, headers.From_1, is on the field we sort by, and "scanAndOrder" : false confirms no in-memory sort was needed, so the index served the sorting phase.
Option 3 is wrong: the query returned n = 83,057 documents, not 120,477 (120,477 is the number of documents scanned).
Option 4: nscannedObjects is 120,477, which is every document in the collection, so effectively all documents were examined (even though the scan walked the headers.From_1 index rather than using a BasicCursor), so I marked this one as correct too.
Hope the explanations above prove helpful. Please leave your comments and suggestions if you know a better way to do any of the questions.