Druid


These are my personal notes on pydruid.

Introduction

Druid is an efficient data query system, mainly aimed at aggregation queries over large volumes of time-series data. Data can be ingested in real time and becomes queryable as soon as it enters Druid, and once ingested it is almost immutable. The data is typically time-series fact events: once a fact happens and lands in Druid, external systems can query it.
Working with Druid from Python is also very convenient; here are two links to the official pydruid documentation:
Documentation 1
Documentation 2

Basic usage in Python:

    from pydruid.client import *
    from pydruid.utils.filters import Dimension
    import pydruid.utils.aggregators as aggregators

    # Connect to the Druid broker; the second argument is the query endpoint.
    query = PyDruid('http://druid-broker-public.bdwg.xxxx.hadoop:8082/', 'xxxxx')

    # One method per Druid query type; their arguments are shown in the sections below.
    top_langs = query.topn()              # TopN query
    top_langs = query.groupby()           # GroupBy query
    top_langs = query.segment_metadata()  # segment-metadata query
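
After any of these methods runs, the returned query object holds the parsed result. Below is a minimal sketch of running a small TopN and inspecting the rows, assuming the broker URL and datasource above are reachable; .result and export_pandas() are available on pydruid query objects in recent versions.

    # Minimal sketch: run a small query and look at the rows.
    top = query.topn(
        datasource="babel_rap_product_82",
        granularity="all",
        intervals="2019-02-24T00:00/2019-02-25T00",
        dimension="rap_dim_group",
        aggregations={"count_sum": aggregators.longsum("count")},
        metric="count_sum",
        threshold=5,
    )

    print(top.result)          # parsed JSON result: a list of dicts, one per granularity bucket
    df = top.export_pandas()   # flatten into a pandas DataFrame (requires pandas)
    print(df.head())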

1. TopN Query

    top_langs = query.topn(
        datasource="babel_rap_product_82",           # FROM <datasource>
        granularity="day",                           # day, hour, minute, etc.
        intervals="2019-02-24T00:00/2019-03-22T00",  # time range
        dimension="rap_dim_group",                   # dimension to GROUP BY
        filter=(
            (Dimension("rap_dim_t") == 20)
            & ((Dimension("rap_dim_rseat") == "longvideo_click") | (Dimension("rap_dim_rseat") == "longvideo_click_blank")
               | (Dimension("rap_dim_rseat") == "long_shortvideo") | (Dimension("rap_dim_rseat") == "poster_shortvideo"))
            & (Dimension("rap_dim_rpage") == "qy_home")
            & (Dimension("rap_dim_block") == "qy_return_video")
            & ((Dimension("rap_dim_t") == 21)
               | ((Dimension("rap_dim_t") == 20)
                  & ((Dimension("rap_dim_rseat") == "shortvideo_click") | (Dimension("rap_dim_rseat") == "replay_shortvideo")
                     | (Dimension("rap_dim_rseat") == "poster_shortvideo") | (Dimension("rap_dim_rseat") == "long_shortvideo")
                     | (Dimension("rap_dim_rseat") == "longvideo_click_blank") | (Dimension("rap_dim_rseat") == "longvideo_click"))))
        ),
        aggregations={"count_sum": aggregators.longsum("count")},  # aggregation metric
        metric="count_sum",  # ORDER BY this metric
        threshold=100        # LIMIT N
    )

The above query returns, for each day from 2019-02-24 to 2019-03-22, the total count of every group that matches the filter, sorted by count_sum in descending order, with at most 100 groups returned per day.
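
The long chain of rap_dim_rseat equality checks in the filter above can usually be collapsed with Druid's "in" filter. This is only a sketch, assuming your pydruid version supports the generic Filter builder with type="in"; the dimension and values are the same as above.

    from pydruid.utils.filters import Filter

    # Equivalent to OR-ing six selector filters on rap_dim_rseat (assumes "in" filter support).
    rseat_filter = Filter(
        type="in",
        dimension="rap_dim_rseat",
        values=["shortvideo_click", "replay_shortvideo", "poster_shortvideo",
                "long_shortvideo", "longvideo_click_blank", "longvideo_click"],
    )

    # Combine with the t == 21 / t == 20 branches as in the original filter.
    combined = (Dimension("rap_dim_t") == 21) | ((Dimension("rap_dim_t") == 20) & rseat_filter)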

2. GroupBy Query


    top_langs = query.groupby(
        datasource="babel_rap_product_150",          # FROM <datasource>
        granularity="all",                           # day, hour, minute, all, etc.
        intervals="2019-04-12T00:00/2019-04-15T00",  # time range
        dimensions=["rap_dim_recall_tag"],           # dimensions to GROUP BY
        filter=(Dimension("rap_dim_t") == 21) & (Dimension("rap_dim_deviceId") == deviceId),  # deviceId is defined elsewhere
        aggregations={"count_sum": aggregators.longsum("count")},  # aggregation metric
        limit_spec={
            "type": "default",
            "limit": 10,
            "columns": [{"dimension": "count_sum", "direction": "descending"}]
        }
    )

The above query returns the number of impressions for each tag between 2019-04-12 and 2019-04-15, sorted in descending order by count_sum. The limit_spec is optional; it caps the number of rows returned.
When granularity is "all", the results are not split into time buckets: the whole interval is aggregated together and returned at once.
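
For further analysis it is convenient to pull the GroupBy result into pandas. A small sketch, assuming pandas is installed and using the top_langs query object returned above:

    df = top_langs.export_pandas()   # one row per group (timestamp + rap_dim_recall_tag)
    print(df.sort_values("count_sum", ascending=False).head(10))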