MongoDB in Action, Chapter 6 (Part 2)

Reshaping documents

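The MongoDB aggregation pipeline provides many functions for transforming documents into reshaped output documents.
They are typically used with the $project operator, but they can also be used when defining the _id for the $group operator.
(List of aggregation framework reshaping functions: https://docs.mongodb.com/manual/reference/operator/aggregation/group/)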

db.users.aggregate([
    {$match: {username: 'kbanker'}},
    // Build a name subdocument with two fields, first and last
    {$project: {name: {first: '$first_name',
                       last: '$last_name'}}
    }
])

{ "_id" : ObjectId("4c4b1476238d3b4dd5000001"),
      "name" : { "first" : "Kyle",
                 "last" : "Banker" }
}

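String functions
- $concat: concatenates two or more strings into a single string
- $strcasecmp: case-insensitive string comparison; returns a number
- $substr: creates a substring of a string (deprecated in v3.4; use $substrBytes instead)
- $toLower: converts to lowercase
- $toUpper: converts to uppercase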

db.users.aggregate([
    {$match: {username: 'kbanker'}},
    {$project: {
        // Join first_name and last_name with a space
        name: {$concat: ['$first_name', ' ', '$last_name']},
        // firstInitial is the first character of the first name
        firstInitial: {$substr: ['$first_name', 0, 1]},
        // Convert the username to uppercase
        usernameUpperCase: {$toUpper: '$username'}
    }}
])

{ "_id" : ObjectId("4c4b1476238d3b4dd5000001"),
  "name" : "Kyle Banker",
  "firstInitial" : "K",
  "usernameUpperCase" : "KBANKER"
}

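Arithmetic functions
- $add: adds an array of numbers
- $divide: divides the first number by the second
- $mod: remainder of dividing the first number by the second
- $multiply: multiplies an array of numbers
- $subtract: subtracts the second number from the first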

Date/time functions
- $dayOfYear: day of the year, 1 ~ 366
- $dayOfMonth: day of the month, 1 ~ 31
- $dayOfWeek: day of the week, 1 ~ 7 (1: Sunday)
- $year: year of the date
- $month: month of the date (1 ~ 12)
- $week: week of the date (0 ~ 53)
- $hour: hour (0 ~ 23)
- $minute: minute (0 ~ 59)
- $second: second (0 ~ 59)
- $millisecond: millisecond portion of the time (0 ~ 999)
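
The date operators number their results differently from JavaScript's own Date methods, which matters when the same grouping is written once as a pipeline and once as shell JavaScript. The sketch below is plain JavaScript, not MongoDB code: dateParts is a hypothetical helper that mirrors the values the operators would return, assuming the operators' default of evaluating dates in UTC.

```javascript
// Hypothetical helper (not a MongoDB API): mirrors the values the
// aggregation date operators would produce for a date, evaluated in UTC.
function dateParts(date) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  const startOfDay = Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
  return {
    year: date.getUTCFullYear(),
    month: date.getUTCMonth() + 1,    // $month is 1-12, getUTCMonth() is 0-11
    dayOfMonth: date.getUTCDate(),
    dayOfWeek: date.getUTCDay() + 1,  // $dayOfWeek is 1 (Sunday) to 7, getUTCDay() is 0-6
    dayOfYear: Math.round((startOfDay - startOfYear) / msPerDay) + 1,
    hour: date.getUTCHours()
  };
}

// 1 March 2014, 13:05 UTC (a Saturday)
const parts = dateParts(new Date(Date.UTC(2014, 2, 1, 13, 5)));
console.log(parts);
```

The two off-by-one adjustments (month and day of week) are the usual source of mismatches between pipeline output and values computed with JavaScript Date methods.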
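
Logical functions
- $and: true if all values are true
- $cmp: compares two values and returns the result: 0 if they are equal, -1 if the first is smaller, 1 if it is larger
- $cond: if .. then .. else conditional logic
- $eq: true if the two values are equal
- $gt: >
- $gte: >=
- $ifNull: returns the first value if it is not null; otherwise returns the given replacement value
- $lt: <
- $lte: <=
- $ne: true if the two values are not equal
- $not: returns the opposite of the given condition: false for true, true for false
- $or: true if any value is true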
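
As a rough illustration of $cond and $ifNull inside $project, the sketch below builds a pipeline as a plain JavaScript value. The field names nickname and age are assumptions made up for this example and are not part of the users documents shown earlier; $ifNull here substitutes first_name whenever nickname is null or missing.

```javascript
// Field names nickname and age are hypothetical, chosen for illustration.
const pipeline = [
  {$project: {
    // $ifNull: use nickname when present, otherwise fall back to first_name
    displayName: {$ifNull: ['$nickname', '$first_name']},
    // $cond: if age >= 18 then 'adult' else 'minor'
    ageGroup: {$cond: [{$gte: ['$age', 18]}, 'adult', 'minor']}
  }}
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  db.users.aggregate(pipeline).forEach(printjson);
}
```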
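
Set functions
- $setEquals: true if the two sets contain exactly the same elements
- $setIntersection: returns an array of the elements common to both sets
- $setDifference: returns the elements of the first set that are not in the second set
- $setUnion: the union of the two sets
- $setIsSubset: true if the first set is a subset of the second set
- $anyElementTrue: true if any element of the set is true
- $allElementsTrue: true if all elements of the set are true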

{ "_id" : ObjectId("4c4b1476238d3b4dd5003981"),
   "productName" : "Extra Large Wheel Barrow",
   "tags" : [ "tools", "gardening", "soil" ]}
{ "_id" : ObjectId("4c4b1476238d3b4dd5003982"),
  "productName" : "Rubberized Work Glove, Black",
  "tags" : [ "gardening" ]}


testSet1 = ['tools']
db.products.aggregate([
    {$project: {
        productName: '$name',
        tags: 1,
        setUnion: {$setUnion: ['$tags', testSet1]}  // union of tags and testSet1 (['tools'])
    }}
])



{   "_id" : ObjectId("4c4b1476238d3b4dd5003981"),
     "productName" : "Extra Large Wheel Barrow",
     "tags" : ["tools", "gardening", "soil"],
     "setUnion" : ["gardening","tools","soil"]     // union of tools with tools, gardening, soil
}
{   "_id" : ObjectId("4c4b1476238d3b4dd5003982"),
     "productName" : "Rubberized Work Glove, Black",
     "tags" : ["gardening"],
     "setUnion" : ["tools", "gardening"]       // union of tools with gardening
}
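
To make the set semantics concrete, here is a plain-JavaScript sketch of four of the set operators applied to the same tags. The helper functions are hypothetical stand-ins written for this note, not MongoDB APIs. Like the operators they ignore duplicates, but they keep first-seen order, whereas MongoDB does not specify the order of elements in the result (compare the setUnion output above).

```javascript
// Hypothetical plain-JavaScript stand-ins for the set operators.
const setUnion = (a, b) => [...new Set([...a, ...b])];
const setIntersection = (a, b) => [...new Set(a)].filter(x => b.includes(x));
const setDifference = (a, b) => [...new Set(a)].filter(x => !b.includes(x));
const setIsSubset = (a, b) => a.every(x => b.includes(x));  // is a a subset of b?

const testSet1 = ['tools'];
const barrowTags = ['tools', 'gardening', 'soil'];
const gloveTags = ['gardening'];

console.log(setUnion(gloveTags, testSet1));          // gardening, tools
console.log(setIntersection(barrowTags, testSet1));  // tools
console.log(setDifference(barrowTags, testSet1));    // gardening, soil
console.log(setIsSubset(testSet1, barrowTags));      // true
console.log(setIsSubset(testSet1, gloveTags));       // false
```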

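Miscellaneous functions
- $meta: accesses information related to text search
- $size: returns the size of an array => useful for checking whether an array is empty
- $map: applies an expression to each member of an array => useful for transforming the contents of an array without using $unwind
- $let: defines variables used within the scope of an expression => lets you use temporary variables without adding multiple $project stages
- $literal: returns the value of an expression without evaluating it => avoids the problems that arise when initializing a field value to 0, 1, or a string starting with $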
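
A minimal sketch of $size, $map, $let, and $literal in one $project stage follows. The price and discount fields are assumptions invented for the example, and the stage assumes every document has a tags array, since $size raises an error on a missing or non-array field.

```javascript
// Fields price and discount are hypothetical; tags is assumed to be an array.
const pipeline = [
  {$project: {
    // $size: number of elements in the tags array
    tagCount: {$size: '$tags'},
    // $map: upper-case every tag without unwinding the array
    upperTags: {$map: {input: '$tags', as: 'tag', in: {$toUpper: '$$tag'}}},
    // $let: name an intermediate value instead of adding another $project stage
    finalPrice: {$let: {
      vars: {net: {$subtract: ['$price', '$discount']}},
      in: {$multiply: ['$$net', 1.1]}
    }},
    // $literal: emit the constant 1 rather than meaning "include this field"
    version: {$literal: 1}
  }}
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  db.products.aggregate(pipeline).forEach(printjson);
}
```

Variables introduced by $map and $let are referenced with a double dollar sign ($$tag, $$net), which distinguishes them from document fields.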

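Understanding aggregation pipeline performance

Factors that can significantly affect aggregation pipeline performance:

- Reduce the number and size of documents as early as possible in the pipeline.
- Indexes can be used only by $match and $sort operations, and only before the pipeline uses any other operator.
- With sharding (for very large collections), the $match and $project operators run on the individual shards. Once any other operator is used, the rest of the pipeline runs on the primary shard.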
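
The ordering advice above might look like the following pipeline, sketched against the orders collection used elsewhere in these notes. It assumes an index on purchase_data already exists; without one, the leading $match still reduces the input but has to scan the collection.

```javascript
const pipeline = [
  // 1. Filter first: a $match at the head of the pipeline can use an index
  //    (assumption: an index such as {purchase_data: 1} exists).
  {$match: {purchase_data: {$gte: new Date(2010, 0, 1)}}},
  // 2. Keep only the fields later stages need, shrinking each document.
  {$project: {purchase_data: 1, sub_total: 1}},
  // 3. Only now run the more expensive grouping on the reduced input.
  {$group: {_id: {$year: '$purchase_data'}, total: {$sum: '$sub_total'}}}
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== 'undefined') {
  db.orders.aggregate(pipeline).forEach(printjson);
}
```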

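Aggregation pipeline options
Options that can be passed as the second parameter to the aggregate() function:

explain – runs the pipeline and returns only the pipeline process details
allowDiskUse – uses disk for intermediate results
cursor – specifies the initial batch size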

// Form for passing options
db.collection.aggregate(pipeline, additionalOptions) // additionalOptions: an optional JSON object that can be passed to aggregate()

// Format of additionalOptions
{explain: true, allowDiskUse: true, cursor: {batchSize: n} }

The aggregation pipeline's explain() function
It is similar to EXPLAIN in SQL. It reveals which indexes a query used, which helps developers diagnose slow operations.

> db.numbers.find({num: {"$gt": 19995 }}).explain("executionStats")

 {
     "queryPlanner" : {
         "plannerVersion" : 1,
         "namespace" : "tutorial.numbers",
         "indexFilterSet" : false,
         "parsedQuery" : {
             "num" : {
                 "$gt" : 19995
             }
         },
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "num" : 1
                },
                "indexName" : "num_1", // uses the num_1 index
                "isMultiKey" : false,
                "direction" : "forward",
                "indexBounds" : {
                    "num" : [
                        "(19995.0, inf.0]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 4,                // 4 documents returned
        "executionTimeMillis" : 0,    // much faster
        "totalKeysExamined" : 4,
        "totalDocsExamined" : 4,    // only 4 documents scanned
            "executionStages" : {
              "stage" : "FETCH",
              "nReturned" : 4,
              "executionTimeMillisEstimate" : 0, // much faster
              "works" : 5,
              "advanced" : 4,
              "needTime" : 0,
              "needFetch" : 0,
              "saveState" : 0,
              "restoreState" : 0,
              "isEOF" : 1,
              "invalidates" : 0,
              "docsExamined" : 4,
              "alreadyHasObj" : 0,
              "inputStage" : {
                 "stage" : "IXSCAN",
                 "nReturned" : 4,
                 "executionTimeMillisEstimate" : 0, // much faster
                 "works" : 4,
                 "advanced" : 4,
                 "needTime" : 0,
                 "needFetch" : 0,
                 "saveState" : 0,
                 "restoreState" : 0,
                 "isEOF" : 1,
                 "invalidates" : 0,
                 "keyPattern" : {
                    "num" : 1
                 },
                 "indexName" : "num_1",            // uses the num_1 index
                 "isMultiKey" : false,
                 "direction" : "forward",
                 "indexBounds" : { 
                    "num" : [
                         "(19995.0, inf.0]"
                     ]
                 },
                 "keysExamined" : 4,
                 "dupsTested" : 0,
                 "dupsDropped" : 0,
                 "seenInvalidated" : 0,
                 "matchTested" : 0
             }
         }
     },
     "serverInfo" : {
         "host" : "rMacBook.local",
         "port" : 27017,
         "version" : "3.0.6",
         "gitVersion" : "nogitversion"
    },
     "ok" : 1
   }

> countsByRating = db.reviews.aggregate([
...     {$match : {'product_id': product['_id']}},   // run $match first
...     {$group : { _id:'$rating',
...                 count:{$sum:1}}}
... ],{explain:true})                                // explain option set to true


{
    "stages" : [
         {
             "$cursor" : { 
                "query" : {
                         "product_id" : ObjectId("4c4b1476238d3b4dd5003981")
                 },
                 "fields" : {
                         "rating" : 1,
                         "_id" : 0
                 },
                 "plan" : {
                      "cursor" : "BtreeCursor ",    // uses BtreeCursor, an index-based cursor
                       "isMultiKey" : false,
                       "scanAndOrder" : false,
                       "indexBounds" : {
                       "product_id" : [                // range used for a single product
                               [
                                 ObjectId("4c4b1476238d3b4dd5003981"),
                                 ObjectId("4c4b1476238d3b4dd5003981")
                             ]
                         ]
                     },
                     "allPlans" : [
                         ... 
                    ]
             "$group" : { 
                "_id" : "$rating",
                 "count" : {
                         "$sum" : {
                                 "$const" : 1
                         }
                 }
             }
         }
     ],
     "ok" : 1 }

This output shows whether an index was used and whether a range scan was performed within that index.

allowDiskUse option
In general, using the allowDiskUse option can slow down the pipeline.

assert: command failed: {
         "errmsg" : "exception: Exceeded memory limit for $group,
         but didn't allow  external sort. Pass allowDiskUse:true to opt in.",
         "code" : 16945,
         "ok" : 0 
} : aggregate failed

This is the error raised when a pipeline stage exceeds the 100 MB RAM limit that MongoDB allows.

db.orders.aggregate([
     {$match: {purchase_data: {$gte: new Date(2010, 0, 1)}}}, // use $match first to reduce the documents to process
     {$group: {
         _id: {year : {$year :'$purchase_data'},
               month: {$month :'$purchase_data'}},
         count: {$sum:1},
         total: {$sum:'$sub_total'}}},
     {$sort: {_id:-1}}
 ], {allowDiskUse:true});      // lets MongoDB use disk for intermediate storage

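Aggregation cursor option
Before v2.6, the result of a pipeline was a single document limited to 16 MB; from v2.6 on, the default is to return a cursor.
A cursor lets you stream large amounts of data (avoid the toArray() and pretty() methods, which read all results into memory at once).
If all you need is a count, the $group pipeline operator can count the output documents without sending each one to your program.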

countsByRating = db.reviews.aggregate([
      {$match : {'product_id': product['_id']}},
      {$group : { _id:'$rating',
            count:{$sum:1}}}
 ],{cursor:{}})   // returns a cursor

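cursor.hasNext(): checks whether the result has a next document
cursor.next(): returns the next document in the result
cursor.toArray(): returns the entire result as an array
cursor.forEach(): runs a function for each row of the result
cursor.map(): runs a function for each row of the result and returns an array of the function's return values
cursor.itcount(): returns the number of items (for testing only)
cursor.pretty(): displays an array of formatted results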
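
As a sketch of streaming with hasNext() and next(), the function below consumes a cursor one document at a time instead of calling toArray(). The sumCounts and fakeCursor names are hypothetical; fakeCursor is only an array-backed stand-in so the example can run outside the shell, and in the mongo shell the cursor returned by aggregate() would be passed instead.

```javascript
// Consumes a cursor one document at a time using only hasNext() and next(),
// so the full result set never has to be held in memory at once.
function sumCounts(cursor) {
  let total = 0;
  while (cursor.hasNext()) {
    total += cursor.next().count;
  }
  return total;
}

// Hypothetical array-backed stand-in for a shell cursor (illustration only).
function fakeCursor(docs) {
  let i = 0;
  return {hasNext: () => i < docs.length, next: () => docs[i++]};
}

const total = sumCounts(fakeCursor([
  {_id: 5, count: 3},
  {_id: 4, count: 2}
]));
console.log(total);  // 5
```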

Other aggregation capabilities

.count()
product = db.products.findOne({'slug': 'wheelbarrow-9092'})
reviews_count = db.reviews.count({'product_id': product['_id']}) // count the reviews for a product
.distinct(): results are limited to 16 MB, the maximum document size
db.orders.distinct('shipping_address.zip')

Map-reduce: generally much slower than the aggregation framework, but more flexible.

// Aggregation pipeline that produces a sales summary
db.orders.aggregate([
     {"$match": {"purchase_data":{"$gte" : new Date(2010, 0, 1)}}},
     {"$group": {
         "_id": {"year" : {"$year" :"$purchase_data"},
                "month" : {"$month" : "$purchase_data"}},
         "count": {"$sum":1},
         "total": {"$sum":"$sub_total"}}},
     {"$sort": {"_id":-1}}]);

The same summary can be expressed with map-reduce as follows.

// 1. Write a map function that defines the key to group on and packages all the data needed for the calculation
map = function() {
     var shipping_month = (this.purchase_data.getMonth()+1) +   // month the order was created
         '-' + this.purchase_data.getFullYear();

     var tmpItems = 0;
     this.line_items.forEach(function(item) {
          tmpItems += item.quantity;
     });

     // emit(): the first argument is the key to group by; the second is usually a document of the values to be reduced
     emit(shipping_month, {order_total: this.sub_total,
                           items_total: tmpItems});
 };

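// 2. The corresponding reduce function should make this clearer
// reduce must aggregate the values together in the desired way and return them as a single value.
// It may be called more than once for the same key, so its result must have the same shape as the values emitted by map.
reduce = function(key, values) {   // reduce receives a key and an array of one or more values
     var result = { order_total: 0, items_total: 0 };
     values.forEach(function(value){
              result.order_total += value.order_total;
              result.items_total += value.items_total;
     });
     return ( result );
 };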
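
To see why reduce has to tolerate being called more than once, the following plain-JavaScript sketch emulates the map and reduce steps on a few in-memory orders. It is not how MongoDB implements map-reduce: runMapReduce and its chunkSize parameter are hypothetical, the sample orders are invented, and emit is passed to map as an argument here instead of being a global as it is on the server.

```javascript
const map = function (emit) {
  const month = (this.purchase_data.getMonth() + 1) + '-' + this.purchase_data.getFullYear();
  let items = 0;
  this.line_items.forEach(item => { items += item.quantity; });
  emit(month, {order_total: this.sub_total, items_total: items});
};

const reduce = function (key, values) {
  const result = {order_total: 0, items_total: 0};
  values.forEach(value => {
    result.order_total += value.order_total;
    result.items_total += value.items_total;
  });
  return result;
};

// Hypothetical driver: groups emitted values by key, reduces them in chunks,
// then re-reduces the partial results. A key with a single value skips reduce.
function runMapReduce(docs, mapFn, reduceFn, chunkSize) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach(doc => mapFn.call(doc, emit));
  const out = {};
  groups.forEach((values, key) => {
    if (values.length === 1) { out[key] = values[0]; return; }
    const partials = [];
    for (let i = 0; i < values.length; i += chunkSize) {
      partials.push(reduceFn(key, values.slice(i, i + chunkSize)));
    }
    out[key] = partials.length > 1 ? reduceFn(key, partials) : partials[0];
  });
  return out;
}

// Invented sample data: three April 2014 orders and one August 2014 order.
const orders = [
  {purchase_data: new Date(2014, 3, 10), sub_total: 100, line_items: [{quantity: 1}]},
  {purchase_data: new Date(2014, 3, 12), sub_total: 250, line_items: [{quantity: 2}, {quantity: 1}]},
  {purchase_data: new Date(2014, 3, 20), sub_total: 50, line_items: [{quantity: 4}]},
  {purchase_data: new Date(2014, 7, 1), sub_total: 75, line_items: [{quantity: 1}]}
];
const totals = runMapReduce(orders, map, reduce, 2);
console.log(totals);
```

With a chunk size of 2, the three April values are reduced as two partial results and then reduced again, which only works because reduce returns a document with the same fields that map emits.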

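Adding a query filter and saving the results

// filter selects only certain orders
// map emits key-value pairs, usually one per input document
// reduce receives the array of values emitted by map for a key and produces the output
filter = {purchase_data: {$gte: new Date(2010, 0, 1)}};
db.orders.mapReduce(map, reduce, {query: filter, out: 'totals'});  // save the results in a collection named totals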

// Query the map-reduce results collection
> db.totals.find()
 { "_id" : "11-2014", "value" : {"order_total" : 4897, "items_total" : 1 } }
 { "_id" : "4-2014", "value" : {"order_total" : 4897, "items_total" : 1 } }
 { "_id" : "8-2014", "value" : {"order_total" : 11093, "items_total" : 4 } }
