A Strategy of Efficient and Accurate Cardinality Estimation Based on Query Result

Jintao Gao; Zhanhuai Li; Wenjie Liu

doi:10.1051/jnwpu/20183640768

All issues

Volume 36 / No 4 (August 2018)

JNWPU, 36 4 (2018) 768-777

Abstract

Open Access

Issue		JNWPU Volume 36, Number 4, August 2018


Page(s)		768 - 777
DOI		https://doi.org/10.1051/jnwpu/20183640768
Published online		24 October 2018

JNWPU 2018, 36(4): 768-777

A Strategy of Efficient and Accurate Cardinality Estimation Based on Query Result

一种高效准确的基于查询结果的基数估计策略

Jintao Gao (高锦涛), Zhanhuai Li (李战怀) and Wenjie Liu (刘文洁)

School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China

Received: 8 May 2017

Abstract

Cardinality estimation is an important component of query optimization. Its accuracy and efficiency directly decide effect of query optimization. Traditional cardinality estimation strategy is based on original table or sample to collect statistics, then inferring cardinality by collected statistics. It will be low-efficiency when handling big data; Statistics exist update latency and are gotten by inferring, which can not guarantee correctness; Some strategies can get the actual cardinality by executing some subqueries, but they do not keep the result, leading to low efficiency of fetching statistics. Against these problems, this paper proposes a novel cardinality estimation strategy, called cardinality estimation based on query result(CEQR). For keeping correctness of cardinality, CEQR directly gets statistics from query results, which is not related with data size; we build a cardinality table to store the statistics of basic tables and middle results under specific predicates. Cardinality table can provide cardinality services for subsequent queries, and we build a suit of rules to maintain cardinality table; To improve the efficiency of fetching statistics, we introduce the source aware strategy, which hashes cardinality item to appropriate cache. This paper gives the adaptability and deviation analytic of CEQR, and proves that CEQR is more efficient than traditional cardinality estimation strategy by experiments.

摘要

基数估计是查询优化的重要组成部分，其高效性、准确性直接影响查询优化效果。传统基数估计策略基于原表或原表样本进行统计信息收集，然后利用收集好的统计信息推导出基数。该策略在数据量大时，统计信息收集效率低；统计信息存在延迟，并且基数通过推导得到，准确度无法保证；一些策略通过子查询的反馈信息得到基数，但结果没有保存，基数获取效率低。为解决这些问题，提出了一种高效准确的基于查询结果的基数估计策略（cardinality estimation based on query result，CEQR），特点是统计信息来源为查询执行结果，不需要进行推导，保证基数的准确度，并且收集效率与原表数据量无关；建立一种基数表，保存基本表和中间结果在某种谓词下的统计信息，为后续查询提供服务，并建立基数维护规则，合理管理基数表；建立资源感知策略，将基数项映射到缓存，加快统计信息获取效率。给出了基于CEQR策略的适应性以及误差分析，并通过实验得出CEQR策略在效率上优于传统基数估计策略。

Key words: big data / cardinality estimation / query optimization / query result / efficient / accurate

关键字 : 大数据 / 基数估计 / 查询优化 / 查询结果 / 高效 / 准确

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.