Druid
These are my personal notes on pydruid.
Introduction
Druid is an efficient data query system aimed mainly at aggregation queries over large volumes of time-series data. Data can be ingested in real time and is queryable as soon as it lands in Druid, and once written it is almost immutable. The data is typically time-series fact events: after a fact happens and is ingested into Druid, external systems can query it.
Working with Druid from Python is also very convenient. Here are two links to the official pydruid documentation:
Doc 1
Doc 2
Usage in Python:
from pydruid.client import PyDruid
from pydruid.utils.filters import Dimension
import pydruid.utils.aggregators as aggregators

# PyDruid(broker_url, endpoint): the first argument is the broker address, the second is the query endpoint
query = PyDruid('http://druid-broker-public.bdwg.xxxx.hadoop:8082/', 'xxxxx')

# Each query type has its own method on the client; all of them take keyword arguments (full examples below)
top_langs = query.topn()
top_langs = query.groupby()
top_langs = query.segment_metadata()
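topn and groupby are covered in the two sections below. segment_metadata is useful for checking what a datasource actually contains before writing queries; the following is only a minimal sketch, assuming the same broker connection as above:

# Ask the broker to describe the segments of a datasource over an interval
meta = query.segment_metadata(
    datasource = "babel_rap_product_82",
    intervals = "2019-02-24T00:00/2019-03-22T00"
)
print(meta.result)  # parsed JSON: columns, their types, segment sizes, etc.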
1. TopN query
top_langs = query.topn(
    datasource = "babel_rap_product_82",           # the datasource to query (SQL FROM)
    granularity = "day",                           # day, hour, minute, etc.
    intervals = "2019-02-24T00:00/2019-03-22T00",  # time range
    dimension = "rap_dim_group",                   # the dimension to aggregate over (SQL GROUP BY)
    filter =
        ((Dimension("rap_dim_t") == 20) &
         ((Dimension("rap_dim_rseat") == "longvideo_click") |
          (Dimension("rap_dim_rseat") == "longvideo_click_blank") |
          (Dimension("rap_dim_rseat") == "long_shortvideo") |
          (Dimension("rap_dim_rseat") == "poster_shortvideo"))) &
        (Dimension("rap_dim_rpage") == "qy_home") &
        (Dimension("rap_dim_block") == "qy_return_video") &
        ((Dimension("rap_dim_t") == 21) |
         ((Dimension("rap_dim_t") == 20) &
          ((Dimension("rap_dim_rseat") == "shortvideo_click") |
           (Dimension("rap_dim_rseat") == "replay_shortvideo") |
           (Dimension("rap_dim_rseat") == "poster_shortvideo") |
           (Dimension("rap_dim_rseat") == "long_shortvideo") |
           (Dimension("rap_dim_rseat") == "longvideo_click_blank") |
           (Dimension("rap_dim_rseat") == "longvideo_click")))),
    aggregations = {"count_sum": aggregators.longsum("count")},  # the metric to compute
    metric = "count_sum",  # the metric to order by (SQL ORDER BY)
    threshold = 100        # at most N rows per time bucket (SQL LIMIT N)
)
The query above returns, for each day from Feb 24 to Mar 22, the total count of every group that matches the filter, ordered by count_sum from largest to smallest, with at most 100 groups returned per day.
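The long OR chains over rap_dim_rseat can usually be written more compactly with pydruid's generic Filter class and a Druid "in" filter. This is only a sketch, assuming your pydruid and Druid versions support the "in" filter type; it reproduces just the first clause of the filter above:

from pydruid.utils.filters import Dimension, Filter

# One "in" filter instead of OR-ing several equality (selector) filters on rap_dim_rseat
rseat_filter = Filter(
    type="in",
    dimension="rap_dim_rseat",
    values=["longvideo_click", "longvideo_click_blank",
            "long_shortvideo", "poster_shortvideo"],
)
first_clause = (Dimension("rap_dim_t") == 20) & rseat_filter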
2. GroupBy query
top_langs = query.groupby(
    datasource = "babel_rap_product_150",          # the datasource to query (SQL FROM)
    granularity = "all",                           # day, hour, minute, all, etc.
    intervals = "2019-04-12T00:00/2019-04-15T00",  # time range
    dimensions = ["rap_dim_recall_tag"],           # the dimensions to group by (SQL GROUP BY)
    filter =
        (Dimension("rap_dim_t") == 21) &
        (Dimension("rap_dim_deviceId") == deviceId),  # deviceId is a variable defined elsewhere
    aggregations = {"count_sum": aggregators.longsum("count")},  # the metric to compute
    limit_spec = {
        "type": "default",
        "limit": 10,
        "columns": [{"direction": "descending", "dimension": "count_sum"}]
    }
)
The query above returns, for the interval Apr 12 to Apr 15, how many times each tag was shown, ordered from largest to smallest. The limit_spec is optional; it caps the number of rows returned.
When granularity is "all", results are not split into time buckets; the whole interval is aggregated and returned as a single result set.
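To consume the results, the parsed response sits on the returned query object, and the client can export the most recent result to a pandas DataFrame. A minimal sketch, assuming pandas is installed and that this pydruid version exposes result and export_pandas():

# Parsed JSON of the last query, as a list of result rows
rows = top_langs.result

# Export the most recent query's result to a pandas DataFrame for further analysis
df = query.export_pandas()
print(df.sort_values("count_sum", ascending=False).head(10))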