Bounded Evaluation: Querying Big Data with Bounded Resources

Yang Cao Wen-Fei Fan Teng-Fei Yuan

Yang Cao, Wen-Fei Fan, Teng-Fei Yuan. Bounded Evaluation: Querying Big Data with Bounded Resources[J]. International Journal of Automation and Computing, 2020, 17(4): 502-526. doi: 10.1007/s11633-020-1236-1
Citation: Yang Cao, Wen-Fei Fan, Teng-Fei Yuan. Bounded Evaluation: Querying Big Data with Bounded Resources[J]. International Journal of Automation and Computing, 2020, 17(4): 502-526. doi: 10.1007/s11633-020-1236-1

doi: 10.1007/s11633-020-1236-1

Bounded Evaluation: Querying Big Data with Bounded Resources

More Information
    Author Bio:

    Yang Cao received the B. Sc. degree from Beihang University, China. He received the Ph.D. degree from University of Edinburgh, UK. He is a faculty member in the School of Informatics, University of Edinburgh, UK. He is the recipient of SIGMOD Research Highlight ward 2018, SIGMOD Best Paper ward 2017, and Microsoft Research Asia Fellowship. His research has been invited to publish in TODS special issues on “Best of SIGMOD 2017” and “Best of PODS 2016”, and in the Computer Journal special issue on “Best of BICOD 2015”. His research interests include query processing, graph data management and distributed databases. E-mail: yang.cao@ed.ac.uk (Corresponding author) ORCID iD: 0000-0001-7984-3219

    Wen-Fei Fan received the B. Sc. degree and M.Sc. degree from Peking University China. He received the Ph. D. degree from University of Pennsylvania, USA. He is the Chair of Web Data Management at the University of Edinburgh, UK, the Chief Scientist of Shenzhen Institute of Computing Science, and the Chief Scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, China. He is a Fellow of the Royal Society (FRS), a Fellow of the Royal Society of Edinburgh (FRSE), a Member of the Academy of Europe (MAE), an ACM Fellow (FACM), and a Foreign Member of Chinese Academy of Sciences. He is a recipient of Royal Society Wolfson Research Merit Award in 2018, ERC Advanced Fellowship in 2015, the Roger Needham Award, UK in 2008, Yangtze River Scholar, China in 2007, the Outstanding Overseas Young Scholar Award, China in 2003 , the Career Award, USA in 2001, and several Test-of-Time and Best Paper Awards USA (Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, Best Paper Awards for SIGMOD 2017, VLDB 2010, ICDE 2007 and Computer Networks 2002). His research interests include database theory and systems, in particular big data, data quality, data sharing, distributed query processing, query languages, recommender systems and social media marketing. E-mail: wenfei@inf.ed.ac.uk ORCID iD: 0000-0001-5149-2656

    Teng-Fei Yuan received the B.Eng. degree from Shandong University China. He is Ph.D. degree cadidate in LFCS, School of Informatics, University of Edinburgh UK. His research interest is development of BEAS, a system for bounded evaluation of SQL queries. E-mail: tengfei.yuan@ed.ac.uk

  • 1See [1] for more about complexity classes, e.g., NP and PSPACE.
    2Facebook users can check-in to locations via “check-ins”.
    3To reduce notations, we assume w.l.o.g. that selection conditions are conjunctive, i.e., $ \lor $ in selection predicates is reduced using $ \cup $.
    4Following query optimizer in DBMS, $ \lambda_{\sigma}(C) $, $ \lambda_{\Join}(C) $, $ \lambda_{\pi}(X) $ are coefficients that can be estimated from database statistics as a priori.
  • Figure  1.  Effective syntax $ {\cal L}_{\cal B} $ for $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}} $ queries

    Figure  2.  ${ {\mathsf{{BEAS}}}}$: Bounded evaluation on ${ {\mathsf{{DBMS}}}}$

    Figure  3.  Reduced graphs for Example 9

    Figure  4.  Effectiveness of bounded evaluation for bounded and unbounded queries

    Figure  5.  Bounded plans under $ {\cal B} $ (set $ \varXi_{{\cal B}} $)

    Table  1.   Notations and definitions

    Notation Definition
    A (or R[A]) An attribute or an aggregate field in Q
    X, Y Sets of attributes in Q
    |Q| Number of attributes and aggregate fields in Q
    ||Q|| Number of relation atoms in Q
    ${ X_{Q} }$ Set of all attributes and aggregate fields in Q
    ${ X_{Q}^{c} }$ Set of attributes A of Q in constant selections ${ \sigma_{A=c} }$
    ${ \Sigma_{Q} }$ Set of equality predicates in selections/joins of Q
    ${ \Sigma_{Q}\!\vdash\!A\!=\!B }$ ${ A=B }$ can be deduced from ${ \Sigma_{Q} }$ via equality transitivity
    ${ Z_{Q} }$ Set of attributes and aggregate fields of the output of Q
    nontr. attr. Attributes participated in algebra operations of Q
    ${ {\cal A} }$/${ {\cal B} }$ Access schema/bag access schema
    ${ |{\cal B}| }$ Total length of bag constraints in ${ {\cal B} }$
    ${ |\!|{{{\cal B}}}|\!| }$ The number of bag constraints in ${ {\cal B} }$
    ${ \phi }$/${ \varphi }$ Access constraint/bag access constraint
    ${ {\cal L}_{\cal B} }$ Effective syntax for ${ { {\\{ {{\rm{RA}}} } } }_{ { {\rm{ {aggr} } } } } }$ queries
    bounded. eval. under ${ {\cal B} }$
    ${ {\cal L}_{\cal B}^{\mathbb{L}} }$ Effective syntax for ${ \mathbb{L} }$-queries bounded. eval. under ${ {\cal B} }$
    下载: 导出CSV

    Table  2.   $ {{c}}() $ with parameters $ \varGamma_{\Join/-/\cup/ { {\rm{{fetch}}}}/{\rm{{gpBy}}}} $

    ${ { { \rm{ { bLAP } } } } }\;$${ {{\xi_{R}}} }$${ {{c}}({{\xi_{R}}}) }$of ${ {{\xi_{R}}} }$
    ${ \{c\} }$or ${ \emptyset }$1
    ${ \sigma_{C}(\xi') }$${ {{c} }(\xi')\times{\lambda_{\sigma} }(C) }$
    ${ \xi_{1}\Join_C \xi_{2} }$${ {\varGamma_{\Join} }( {{c} }(\xi_{1}), {{c} }(\xi_{2}))\times\lambda_\Join(C) }$
    ${ \xi_{1}-\xi_{2} }$${ {\varGamma_{-}}( {{c}}(\xi_{1}), {{c}}(\xi_{2})) }$
    ${ \xi_{1}\cup \xi_{2} }$${ {\varGamma_{\cup}}( {{c}}(\xi_{1}), {{c}}(\xi_{2})) }$
    ${ \pi_{Y}(\xi') }$${ {{c}}(\xi') }$
    ${ { \rm{ { gpBy } } }(\xi', X, \overline{ {\\rm{ {agg} } }(V)}) }$${ \varGamma_{ {\rm{ {gpBy} } } }( {{c} }(\xi'), \lambda_\pi(X)) }$
    ${ { { \rm{ { fetch } } } }(\xi', \varphi) }$with ${ \varphi = R (\!| X \rightarrow Y, N |\!) }$${ {\varGamma_{ { {\rm{ {fetch} } } } } }( {{c} }(\xi'), N) }$
    下载: 导出CSV

    Table  3.   TPCH query evaluation time on 16 GB (ms)

    QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11
    PostgreSQL ${ t_{ {\rm{ {P} } } } }$8.16 × 1053.68 × 1041.02 × 1051.56 × 1041.50 × 1052.07 × 1049.53 × 1043.71 × 1045.03 × 1042.28 × 1057.05 × 104
    BEAS@PG ${ t_{ {\rm{ {B} } } } }$5.72 × 1041.67 × 1042.97 × 1041.45 × 1041.43 × 1054.25 × 1037.65 × 1043.43 × 1043.24 × 1048.19 × 1043.46 × 103
    Speedup ${ t_{ {\rm{ {P} } } }/t_{ {\rm{ {B} } } } }$14.282.213.441.071.054.881.241.081.552.7920.39
    QueriesQ12Q13Q14Q15Q16Q17Q18Q19Q20Q21Q22
    PostgreSQL ${ t_{ {\rm{ {P} } } } }$6.42 × 1041.79 × 1051.74 × 1045.10 × 1044.46 × 1046.04 × 1032.83 × 1057.27 × 1031.06 × 1052.66 × 1057.74 × 103
    BEAS@PG ${ t_{ {\rm{ {B} } } } }$8.58 × 1031.20 × 1051.10 × 1042.20 × 1043.02 × 1041.49 × 1022.38 × 1051.29 × 1032.70 × 1031.04 × 1055.70 × 103
    Speedup ${ t_{ {\rm{ {P} } } }/t_{ {\rm{ {B} } } } }$7.481.491.582.321.4840.461.195.6539.162.551.36
    下载: 导出CSV

    Table  4.   Average query template evaluation time on AIRCA (ms)

    QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10
    PostgreSQL ${ t_{{\rm{{P}}}} }$7.00 × 1037.2 × 1032.06 × 1049.04 × 1044.44 × 1049.41 × 1041.36 × 1051.44 × 1059.41 × 1041.38 × 105
    BEAS@PG ${ t_{{\rm{{B}}}} }$1.221.190.642.632.616.344.286.696.024.98
    Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$5.75 × 1036.04 × 1033.21 × 1043.44 × 1041.70 × 1041.48 × 1043.19 × 1042.15 × 1041.57 × 1042.77 × 104
    QueriesQ11Q12Q13Q14Q15Q16Q17Q18Q19Q20
    PostgreSQL ${ t_{{\rm{{P}}}} }$1.46 × 1059.46 × 1041.49 × 1051.56 × 10510.63 × 1044.46 × 1044.65 × 1044.22 × 1044.25 × 1049.29 × 104
    BEAS@PG ${ t_{{\rm{{B}}}} }$7.2411.426.558.511.6724.9624.32.844.63 × 1027.26 × 102
    Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$2.01 × 1048.29 × 1032.29 × 1041.84 × 1049.11 × 1031.79 × 1031.91 × 1031.48 × 10491.861.29 × 102
    QueriesQ21Q22Q23Q24Q25Q26Q27Q28Q29Q30
    PostgreSQL ${ t_{{\rm{{P}}}} }$9.30 × 1048.98 × 1049.44 × 1049.41 × 1041.39 × 1051.47 × 1059.58 × 1041.75 × 1051.85 × 1051.31 × 105
    BEAS@PG ${ t_{{\rm{{B}}}} }$5.43 × 1024.61 × 1027.25 × 1025.23 × 1021.4 × 1031.74 × 1031.36 × 1032.67 × 1043.06 × 1042.62 × 104
    Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$1.71 × 1021.94 × 1021.31 × 1021.80 × 10299.5284.4970.356.546.065.02
    下载: 导出CSV

    Table  5.   Average query template evaluation time on UKMOT (ms)

    QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10
    PostgreSQL $ t_{{\rm{{P}}}} $1.37 × 1056.47 × 1047.80 × 1044.62 × 1054.65 × 1053.86 × 1055.8 × 1053.91 × 1055.72 × 1055.99 × 105
    BEAS@PG $ t_{{\rm{{B}}}} $0.552.730.6216.1375.25.621.50 × 1023.78 × 10254.791.89 × 102
    Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $2.52 × 1052.38 × 1041.25 × 1052.86 × 1046.18 × 1036.88 × 1043.88 × 1031.04 × 1031.04 × 1043.18 × 103
    QueriesQ11Q12Q13Q14Q15Q16Q17Q18Q19Q20
    PostgreSQL $ t_{{\rm{{P}}}} $4.09 × 1055.91 × 1057.80 × 1055.91 × 1057.74 × 1051.61 × 1051.59 × 1051.53 × 1053.97 × 1053.93 × 105
    BEAS@PG $ t_{{\rm{{B}}}} $3.89 × 10255.421.89 × 1023.94 × 10296.2430.4965.092.517.4 × 1037.41 × 103
    Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $1.06 × 1031.07 × 1044.13 × 1031.49 × 1038.05 × 1035.28 × 1032.44 × 1036.10 × 10453.5553.14
    QueriesQ21Q22Q23Q24Q25Q26Q27Q28Q29Q30
    PostgreSQL $ t_{{\rm{{P}}}} $3.93 × 1056.01 × 1053.94 × 1055.98 × 1056.26 × 1054.13 × 1056.19 × 1057.04 × 1054.86 × 1056.96 × 105
    BEAS@PG $ t_{{\rm{{B}}}} $9.25 × 1037.51 × 1037.72 × 1039.34 × 1037.62 × 1037.82 × 1037.38 × 1036.42 × 1047.65 × 1046.26 × 104
    Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $42.4880.0751.1264.0882.1452.8783.8110.956.3711.12
    下载: 导出CSV
  • [1] C. H. Papadimitriou. Computational Complexity, Reading, USA: Addison-Wesley, 1994.
    [2] S. Abiteboul, R. Hull, V. Vianu. Foundations of Databases, Boston, USA: Addison Wesley, 1995.
    [3] R. Horak. Telecommunications and Data Communications Handbook, New York, USA: Wiley, 2007.
    [4] W. F. Fan, X. Wang, Y. H. Wu, D. Deng. Distributed graph simulation: Impossibility and possibility. Proceedings of the VLDB Endowment, vol. 7, no. 12, pp. 1083–1094, 2014. DOI:  10.14778/2732977.2732983.
    [5] W. F. Fan, F. Geerts, Y. Cao, T. Deng, P. Lu. Querying big data by accessing small data. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ACM, Melbourne, Victoria, Australia, pp. 173–184, 2015.
    [6] Y. Cao, W. F. Fan. An effective syntax for bounded relational queries. In Proceedings of 2016 International Conference on Management of Data, ACM, San Francisco, USA, 2016.
    [7] The University of Edinburgh. Huawei deal to advance expertise in data science, [Online], Available: https://www.ed.ac.uk/news/2017/huawei-deal-to-advance-expertise-in-data-science, June 14, 2017.
    [8] Facebook. Introducing graph search beta, [Online], Available: https://about.fb.com/news/2013/01/introducing-graph-search-beta/, January 15, 2013.
    [9] I. Grujic, S. Bogdanovic-Dinic, L. Stoimenov. Collecting and analyzing data from e-government Facebook pages. In ICT Innovations, Ohrid, Macedonia, pp. 86–96, 2014.
    [10] Facebook. Newsroom, [Online], Available: http://newsroom.fb.com.
    [11] R. Ramakrishnan, J. Gehrke. Database Management Systems, 2nd ed., New York, USA: McGraw-Hill Education, 2000.
    [12] J. D. Ullman. Principles of Database Systems, 2nd ed., Computer Science Press, 1982.
    [13] A. P. Stolboushkin, M. A. Taitslin. Finite queries do not have effective syntax. In Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM, San Jose, USA, pp. 277–285, 1995.
    [14] A. van Gelder, R. W. Topor. Safety and translation of relational calculus queries. ACM Transactions on Database Systems, vol. 16, no. 2, pp. 235–278, 1991. DOI:  10.1145/114325.103712.
    [15] TPC. TPC-H, [Online], Available: http://www.tpc.org/tpch/.
    [16] W. F. Fan. Making Big Data Small, UK: British Royal Society, 2019. DOI:  10.1098/rspa.2019.0034.
    [17] Y. Cao, W. F. Fan. Data driven approximation with bounded resources. Proceedings of the VLDB Endowment, vol. 10, no. 9, pp. 973–984, 2017. DOI:  10.14778/3099622.3099628.
    [18] Y. Cao, W. F. Fan, T. F. Yuan. Block as a value for SQL over NoSQL. Proceedings of the VLDB Endowment, vol. 12, no. 10, pp. 1153–1166, 2019. DOI:  10.14778/3339490.3339498.
    [19] W. F. Fan, F. Geerts, L. Libkin. On scale independence for querying big data. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, Snowbird, USA, 2014.
    [20] D. Abadi, P. A. Boncz, S. Harizopoulos, S. Idreos, S. Madden. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, vol. 5, no. 3, pp. 197–280, 2013. DOI:  10.1561/1900000024.
    [21] Microsoft SQL server columnstore indexes: Overview, [Online], Available: https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15.
    [22] TPC. TPC-DS, [Online], Available: http://www.tpc.org/tpcds/.
    [23] M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco, USA: W. H. Freeman, 1979.
    [24] M. L. Fredman, R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, vol. 34, no. 3, pp. 596–615, 1987. DOI:  10.1145/28869.28874.
    [25] Bureau of Transportation Statistics. The carrier on-time performance database, [Online], Available: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120.
    [26] Bureau of Transportation Statistics. The air carrier statistics database, [Online], Available: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=110.
    [27] Department for Transport. Anonymised mot tests and results, [Online], Available: http://data.gov.uk/dataset/anonymised_mot_test, January 11, 2019.
    [28] Department for Transport. Roadside survey of vehicle observations, [Online], Available: https://data.gov.uk/dataset/52e1e2ab-5687-489b-a4d8-b207cd5d6767/roadside-survey-of-vehicle-observations.
    [29] Y. Huhtala, J. Kärkkainen, P. Porkka, H. Toivonen. Tane: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, vol. 42, no. 2, pp. 100–111, 1999. DOI:  10.1093/comjnl/42.2.100.
    [30] M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, H. Oh. Scads: Scale-independent storage for social computing applications. In Proceedings of the 4th Biennial Conference on Innovative Data Systems Research, Asilomar, USA, 2009.
    [31] M. Armbrust, S. Tu, A. Fox, M. J. Franklin, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna. PIQL: A performance insightful query language. In Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, Indiana, USA, pp. 1207–1210, 2010.
    [32] Y. Cao, W. F. Fan, F. Geerts, P. Lu. Bounded query rewriting using views. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ACM, San Francisco, USA, pp. 107–119, 2016.
    [33] Y. Cao, W. F. Fan, T. Y. Wo, W. Y. Yu. Bounded conjunctive queries. Proceedings of the VLDB Endowment, vol. 7, no. 12, pp. 1231–1242, 2014. DOI:  10.14778/2732977.2732996.
    [34] Y. Cao, W. F. Fan, Y. H. Wang, T. F. Yuan, Y. C. Li, L. Y. Chen. BEAS: Bounded evaluation of SQL queries. In Proceedings of ACM International Conference on Management of Data, ACM, Chicago, USA, pp. 1667–1670, 2017.
    [35] S. Acharya, P. B. Gibbons, V. Poosala. Congressional samples for approximate answering of group-by queries. In Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, Dallas, Texas, USA, pp. 487–498, 2000.
    [36] Y. E. Ioannidis, V. Poosala. Histogram-based approximation of set-valued query-answers. In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, pp. 174–185, 1999.
    [37] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, T. Suel. Optimal histograms with quality guarantees. In Proceedings of the 24rd International Conference on Very Large Data Bases, New York City, USA, pp. 275–286, 2009.
    [38] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, K. Shim. Approximate query processing using wavelets. The VLDB Journal, vol. 10, no. 2–3, pp. 199–223, 2001.
    [39] G. Cormode, M. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, ACM, Trondheim, Norway, 2005.
    [40] B. Babcock, S. Chaudhuri, G. Das. Dynamic sample selection for approximate query processing. In Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, San Diego, USA, pp. 539–550, 2003.
    [41] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, B. Ding. Quickr: Lazily approximating complex AdHoc queries in BigData clusters. In Proceedings of International Conference on Management of Data, ACM, San Francisco, USA, pp. 631–646, 2016.
    [42] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, I. Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, SCM, Prague, Czech Republic, pp. 29–42, 2013.
    [43] C. Li. Computing complete answers to queries in the presence of limited access patterns. The VLDB Journal, vol. 12, no. 3, pp. 211–227, 2003. DOI:  10.1007/s00778-002-0085-6.
    [44] M. Benedikt, J. Leblay, B. ten Cate, E. Tsamoura. Generating Plans from Proofs: Synthesis Lectures on Data Maragement, vol.8, no.1, pp. 1–205, 2016.
    [45] A. Nash, B. Ludäscher. Processing first-order queries under limited access patterns. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, Paris, France, pp. 307–318, 2004.
    [46] M. S. Kester, M. Athanassoulis, S. Idreos. Access path selection in main-memory optimized data systems: Should I scan or should I probe? In Proceedings of ACM International Conference on Management of Data, ACM, Chicago, USA, pp. 715–730, 2017.
    [47] T. Neumann. Query simplification: Graceful degradation for join-order optimization. In Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, Rhode Island, USA, pp. 403–414, 2009.
    [48] M. Eich, P. Fender, G. Moerkotte. Faster plan generation through consideration of functional dependencies and keys. Proceedings of the VLDB Endowment, vol. 9, no. 10, pp. 756–767, 2016. DOI:  10.14778/2977797.2977802.
    [49] B. L. Ding, S. Das, R. Marcus, W. T. Wu, S. Chaudhuri, V. R. Narasayya. AI meets AI: Leveraging query executions to improve index recommendations. In Proceedings of International Conference on Management of Data, ACM, Amsterdam, The Netherlands, pp. 1241–1258, 2019.
    [50] T. Kraska, A. Beutel, E. H. Chi, J. Dean, N. Polyzotis. The case for learned index structures. In Proceedings of International Conference on Management of Data, ACM, Houston, USA, pp. 489–504, 2018.
    [51] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, T. Kraska. Fiting-tree: A data-aware index structure. In Proceedings of International Conference on Management of Data, ACM, Amsterdam, The Netherlands, pp. 1189–1206, 2019.
    [52] R. C. Marcus, P. Negi, H. Z. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, N. Tatbul. Neo: A learned query optimizer. Proceedings of the VLDB Endowment, vol. 12, no. 11, pp. 1705–1718, 2019. DOI:  10.14778/3342263.3342644.
    [53] J. Sun, G. Li. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment, vol. 13, no. 3, pp. 307–319, 2019. DOI:  10.14778/3368289.3368296.
    [54] I. Trummer, J. Wang, D. Maram, S. Moseley, S. Jo, J. Antonakakis. Skinnerdb: Regret-bounded query evaluation via reinforcement learning. https://arxiv.org/abs/1901.05152v1, 2019.
  • [1] Hai-Qiang Zhang, Hai-Rong Fang, Bing-Shan Jiang, Shuai-Guo Wang.  Dynamic Performance Evaluation of a Redundantly Actuated and Over-constrained Parallel Manipulator . International Journal of Automation and Computing, 2019, 16(3): 274-285. doi: 10.1007/s11633-018-1147-6
    [2] Amit Kumar, Aparajita Ojha.  Experimental Evaluation of Certain Pursuit and Evasion Schemes for Wheeled Mobile Robots . International Journal of Automation and Computing, 2019, 16(4): 491-510. doi: 10.1007/s11633-018-1151-x
    [3] Yu Hao, Zhi-Jie Xu, Ying Liu, Jing Wang, Jiu-Lun Fan.  Effective Crowd Anomaly Detection Through Spatio-temporal Texture Analysis . International Journal of Automation and Computing, 2019, 16(1): 27-39. doi: 10.1007/s11633-018-1141-z
    [4] Piyapong Niamsup, Vu N. Phat.  Robust Finite-time H Control of Linear Time-varying Delay Systems with Bounded Control via Riccati Equations . International Journal of Automation and Computing, 2018, 15(3): 355-363. doi: 10.1007/s11633-016-1018-y
    [5] Wen-Dong Ding, Zheng-Tao Zhang, Da-Peng Zhang, De Xu, Hai-Bing Lv, Xin-Xiang Miao, Guo-Rui Zhou, Hao Liu.  An Effective On-line Surface Particles Inspection Instrument for Large Aperture Optical Element . International Journal of Automation and Computing, 2017, 14(4): 420-431. doi: 10.1007/s11633-017-1079-6
    [6] S. Nagaraju, Manish Kashyap, Mahua Bhattachraya.  An Effective Density Based Approach to Detect Complex Data Clusters Using Notion of Neighborhood Difference . International Journal of Automation and Computing, 2017, 14(1): 57-67. doi: 10.1007/s11633-016-1038-7
    [7] Xiao-Yu Zhang, Qing Fang.  Numerical Evaluation of External Effects on Interspecific Interacting Populations . International Journal of Automation and Computing, 2016, 13(2): 133-141. doi: 10.1007/s11633-015-0938-2
    [8] Shu Liang, Yi-Heng Wei, Jin-Wen Pan, Qing Gao, Yong Wang.  Bounded Real Lemmas for Fractional Order Systems . International Journal of Automation and Computing, 2015, 12(2): 192-198. doi: 10.1007/s11633-014-0868-4
    [9] Xiao-Ming Tang,  Bao-Cang Ding.  Design of Networked Control Systems with Bounded Arbitrary Time Delays . International Journal of Automation and Computing, 2012, 9(2): 182-190. doi: 10.1007/s11633-012-0632-6
    [10] Jing Sun,  Ying-Jie Xing.  An Effective Image Retrieval Mechanism Using Family-based Spatial Consistency Filtration with Object Region . International Journal of Automation and Computing, 2010, 7(1): 23-30. doi: 10.1007/s11633-010-0023-9
    [11] Hui-Fang Deng, Wen Deng, Han Li, Hong-Ji Yang.  Authentication and Access Control in RFID Based Logistics-customs Clearance Service Platform . International Journal of Automation and Computing, 2010, 7(2): 180-189. doi: 10.1007/s11633-010-0180-x
    [12] Qing-Jin Peng,  Xiu-Mei Kang,  Ting-Ting Zhao.  Effective Virtual Reality Based Building Navigation Using Dynamic Loading and Path Optimization . International Journal of Automation and Computing, 2009, 6(4): 335-343. doi: 10.1007/s11633-009-0335-9
    [13] Subramanian Appavu Alias Balamurugan, Ramasamy Rajaram.  Effective and Efficient Feature Selection for Large-scale Data Using Bayes’ Theorem . International Journal of Automation and Computing, 2009, 6(1): 62-71. doi: 10.1007/s11633-009-0062-2
    [14] Spyridon K. Gardikiotis,  Nicos Malevris.  A Two-folded Impact Analysis of Schema Changes on Database Applications . International Journal of Automation and Computing, 2009, 6(2): 109-123. doi: 10.1007/s11633-009-0109-4
    [15] Some Remarks on the Boundedness and Convergence Properties of Smooth Sliding Mode Controllers . International Journal of Automation and Computing, 2009, 6(2): 154-158. doi: 10.1007/s11633-009-0154-z
    [16] Ying Zhang,  Adrian R L Travis.  Creation and Evaluation of a Multi-sensory Virtual Assembly Environment . International Journal of Automation and Computing, 2008, 5(2): 163-173. doi: 10.1007/s11633-008-0163-3
    [17] Computational Intelligence Determines Effective Rationality . International Journal of Automation and Computing, 2008, 5(1): 63-66. doi: 10.1007/s11633-008-0063-6
    [18] Marcello Bonfè, Paolo Castaldi, Walter Geri, Silvio Simani.  Design and Performance Evaluation of Residual Generators for the FDI of an Aircraft . International Journal of Automation and Computing, 2007, 4(2): 156-163. doi: 10.1007/s11633-007-0156-7
    [19] An Evaluation of the Reliability of Complex Systems Using Shadowed Sets and Fuzzy Lifetime Data . International Journal of Automation and Computing, 2006, 3(2): 145-150. doi: 10.1007/s11633-006-0145-2
    [20] Matthias Maisch, Bernd Bertsche, Ralf Hettich.  An Approach to Online Reliability Evaluation and Prediction of Mechanical Transmission Components . International Journal of Automation and Computing, 2006, 3(2): 207-214. doi: 10.1007/s11633-006-0207-5
  • 加载中
图(5) / 表(5)
计量
  • 文章访问数:  26
  • HTML全文浏览量:  26
  • PDF下载量:  9
  • 被引次数: 0
出版历程
  • 收稿日期:  2020-02-21
  • 录用日期:  2020-04-20
  • 网络出版日期:  2020-07-04
  • 刊出日期:  2020-08-01

Bounded Evaluation: Querying Big Data with Bounded Resources

doi: 10.1007/s11633-020-1236-1
    作者简介:

    Yang Cao received the B. Sc. degree from Beihang University, China. He received the Ph.D. degree from University of Edinburgh, UK. He is a faculty member in the School of Informatics, University of Edinburgh, UK. He is the recipient of SIGMOD Research Highlight ward 2018, SIGMOD Best Paper ward 2017, and Microsoft Research Asia Fellowship. His research has been invited to publish in TODS special issues on “Best of SIGMOD 2017” and “Best of PODS 2016”, and in the Computer Journal special issue on “Best of BICOD 2015”. His research interests include query processing, graph data management and distributed databases. E-mail: yang.cao@ed.ac.uk (Corresponding author) ORCID iD: 0000-0001-7984-3219

    Wen-Fei Fan received the B. Sc. degree and M.Sc. degree from Peking University China. He received the Ph. D. degree from University of Pennsylvania, USA. He is the Chair of Web Data Management at the University of Edinburgh, UK, the Chief Scientist of Shenzhen Institute of Computing Science, and the Chief Scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, China. He is a Fellow of the Royal Society (FRS), a Fellow of the Royal Society of Edinburgh (FRSE), a Member of the Academy of Europe (MAE), an ACM Fellow (FACM), and a Foreign Member of Chinese Academy of Sciences. He is a recipient of Royal Society Wolfson Research Merit Award in 2018, ERC Advanced Fellowship in 2015, the Roger Needham Award, UK in 2008, Yangtze River Scholar, China in 2007, the Outstanding Overseas Young Scholar Award, China in 2003 , the Career Award, USA in 2001, and several Test-of-Time and Best Paper Awards USA (Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, Best Paper Awards for SIGMOD 2017, VLDB 2010, ICDE 2007 and Computer Networks 2002). His research interests include database theory and systems, in particular big data, data quality, data sharing, distributed query processing, query languages, recommender systems and social media marketing. E-mail: wenfei@inf.ed.ac.uk ORCID iD: 0000-0001-5149-2656

    Teng-Fei Yuan received the B.Eng. degree from Shandong University China. He is Ph.D. degree cadidate in LFCS, School of Informatics, University of Edinburgh UK. His research interest is development of BEAS, a system for bounded evaluation of SQL queries. E-mail: tengfei.yuan@ed.ac.uk

English Abstract

Yang Cao, Wen-Fei Fan, Teng-Fei Yuan. Bounded Evaluation: Querying Big Data with Bounded Resources[J]. International Journal of Automation and Computing, 2020, 17(4): 502-526. doi: 10.1007/s11633-020-1236-1
Citation: Yang Cao, Wen-Fei Fan, Teng-Fei Yuan. Bounded Evaluation: Querying Big Data with Bounded Resources[J]. International Journal of Automation and Computing, 2020, 17(4): 502-526. doi: 10.1007/s11633-020-1236-1
    • Querying big data can be prohibitively costly. As an indicator, it is NP-hard to decide whether a tuple is in the answer $ Q( {\cal D}) $ in a dataset $ {\cal D} $ to an SPC (select, project, Cartesian product) query Q, and it is PSPACE-hard when Q is a query in relational algebra (denoted by RA)[2]. It takes days to join two tables with 10 million tuples each[3]. One might be tempted to think that parallel computation could do the job. However, there exist computational problems for which parallel scalability is beyond reach, i.e., no matter how many machines are used, the parallel runtime of algorithms for such problems may not be reduced[4]. Worse still, small businesses typically have constrained resources and may not afford large-scale parallel computation.

      Is querying big data beyond the reach of small companies, or is it just a privilege of big companies? Is it possible to extend DBMS with an immediate capacity to answer common queries over big datasets under constrained resources?

      One approach to tackling the challenge has recently been studied, based on bounded evaluation[5, 6]. To answer a query Q on a dataset $ {\cal D} $, the idea is to look at only a “bounded” fraction $ {\cal D}_Q $ of $ {\cal D} $ that suffices to compute $ Q( {\cal D}) $, instead of at the entire $ {\cal D} $. This is doable by using an access schema $ {\cal A} $, which is a combination of cardinality constraints and associated indices. Under $ {\cal A} $, Q is boundedly evaluable if for all datasets $ {\cal D} $ that conform to $ {\cal A} $, one can identify $ {\cal D}_Q \subseteq {\cal D} $ by reasoning about the cardinality constraints, and fetch $ {\cal D}_Q $ by using the indices of $ {\cal A} $, such that (a) $ Q( {\cal D}_{Q}) = Q( {\cal D}) $ and (b) $ {\cal D}_Q $ is determined by $ {\cal A} $ and Q only. In other words, if Q is boundedly evaluable under $ {\cal A} $, query Q can then be exactly answered via bounded evaluation, by accessing only DQ of size bounded by the cardinalities in $ {\cal A} $.

      The theory has been tested in industry and is found to “improve the performance by orders of magnitude”[7].

      Example 1. Consider query Q1 from Facebook Graph Search[8]: Find all my friends who have check-ins in UK. The query is posed on dataset $ {\cal D}_{1} $ with two relations: (a) friend (uid, fid), stating that person fid is a friend of uid, and (b) checkin (uid, loc, cty, date), stating that person uid checked in at location loc in country cty on date. Written as an RA query, Q1 is as follows (u0 denotes “me”):

      $ {Q_1}(x) = {\rm{{friend}}} (u_{0}, x) \Join {\rm{{checkin}}} (x, {\rm{{loc}}}, {\rm{“UK”}}, {\rm{{date}}}) .$

      Here dataset $ {\cal D}_1 $ is big, with trillions of friend links and check-ins[9]. It is costly to compute $ Q_{1}( {\cal D}_{1}) $ directly.

      Now consider a set $ {\cal A}_1 $ of real-life cardinality constraints:

      $ \circ \phi_{1} $: $ {\rm{{friend}}} ({\rm{{pid}}} \rightarrow {\rm{{fid}}}, 5\,000) $;

      $ \circ \phi_{2} $: $ {\rm{{checkin}}} ({\rm{{uid}}} \rightarrow {\rm{{country}}}, 193) $.

      Here constraint $\phi_{1}$ specifies a Facebook policy[10]: a limit of 5 000 friends per user; and $\phi_{2}$ states that each user can check-in at most 193 countries. Indices can be built on $ {\cal D}_{1} $ based on $\phi_{1}$ such that given a person, it returns the ids of all her friends by accessing at most 5 000 friend tuples; similarly for $\phi_{2}$. Taken together, these constraints and their associated indices are called access constraints[5].

      Using $ {\cal A}_{1} $, we can compute $ Q_{1}( {\cal D}_{1}) $ by accessing at most 970 000 tuples from $ {\cal D}_{1} $, instead of trillions. (1) We fetch T1 of at most 5000 fid′s of friend tuples with uid = u0, by using $ \phi_{1} $. (2) For each fid f in T1, we fetch T2 of at most 193 country values with $\phi_{2}$. (3) We return the set of fid′s in T1 with country = UK in T2. The plan fetches at most 5 000 + 193 × 5 000 tuples to compute $ Q_{1}( {\cal D}_{1}) $, no matter how big $ {\cal D}_{1} $ is. Hence, Q1 is boundedly evaluable under $ {\cal A}_{1} $.

      As shown in Example 1, bounded evaluation answers a query Q over a big dataset by accessing a set $ {\cal D}_{Q} $ of data values with bounded size. It does this by retrieving values (i.e., partial tuples) using indices associated with cardinality constraints that correlate attributes. One might think that this can also be carried out by conventional index-only plans for query optimization[11]. However, the two are different problems as indicated by their complexity bounds: deciding whether an SPC query can be answered with a “bounded” query plan is EXPSPACE-hard[5], while it is in PTIME to decide whether it has an index-only plan[11].

      While bounded evaluation is promising, more work has to be done, from theory to systems. Bounded evaluation has only been studied for RA queries under the set semantics[5, 6]. In the real world, queries are often expressed in RAaggr, i.e., RA extended with aggregation under the bag semantics. RAaggr can express all SQL (structured query language) queries that do not carry arithmetic expressions. This makes bounded evaluation more intriguing.

      Example 2. Recall query Q1 and access schema $ {\cal A}_{1} $ from Example 1. Consider query Q2 to find the number of UK check-ins from each of my friends. Written in RAaggr, Q2 is:

      Q2 = gpBy(Q3, uid, count(cty)), where

      ${Q_3} = \pi_{{\rm{{uid}}}, {\rm{{cty}}}}({\rm{{friend}}}(u_{0}, x) \Join {\rm{{checkin}}}(x, {\rm{{loc}}}, {\rm{"UK"}}, {\rm{{date}}}).$

      Here gpBy(Q3, uid, count(cty)) groups the results of Q3 by attribute uid and calculates count(cty) for each group (see Section 2.1 for more details about gpBy operator). In contrast to Q1, $ {\cal A}_{1} $ does not help us answer Q2. Using φ2, we can fetch a set of distinct countries for each friend x. However, x may have multiple UK check-ins. Access schema no longer suffices for RAaggr under the bag semantics.

      For practical use to emerge from the study, it is necessary to extend bounded evaluation from RA to RAaggr (SQL). This gives rise to several questions. How should we extend the access schema of [5, 6] to support the bag semantics? We will see that the problem for checking whether an SQL query is boundedly evaluable is undecidable. Given the negative result, is bounded evaluation beyond reach in practice? More specifically, is there any practical and decidable special case? Is it possible to develop a systematic method that allows us to efficiently check the bounded evaluability of SQL queries? In addition, after determining that an SQL query is boundedly evaluable, how can we generate and optimize a query plan to carry out its bounded evaluation?

      Contributions. This paper answers these questions by extending the study to RAaggr, from theory to practice.

      (1) Bounded evaluation for SQL. We extend bounded evaluation from RA to RAaggr, i.e., SQL (without arithmetic) to support arbitrarily nested aggregate sub-queries and group-by clauses. We introduce bag access schemas, an extension of the access schema of [5, 6] to support the bag semantics. We also formulate bounded query plans for RAaggr.

      (2) Complexity of bounded evaluation. Not surprisingly, bounded evaluability is undecidable for SQL since it is already undecidable for RA[5]. We identify practical conditions that cover a number of real-life queries, for which the bounded evaluability can be efficiently determined. These conditions tell us what makes queries bounded.

      (3) Effective syntax. To accommodate the undecidability, we develop an effective syntax $ {\cal L}_{\cal B} $ for boundedly evaluable RAaggr queries. We show that under a bag access schema $ {\cal B} $, (a) an RAaggr query Q is boundedly evaluable if and only if it is equivalent to a query $ Q'\in {\cal L}_{\cal B} $; and (b) it is in PTIME (polynomial time) to check whether Q is in $ {\cal L}_{\cal B} $. That is, $ {\cal L}_{\cal B} $ is a core subclass of bounded evaluable RAaggr queries that are syntactically checkable without sacrificing the expressive power. This is along the same lines as how commercial database systems (DBMS) deal with safe relational calculus queries, which are undecidable to decide[12-14].

      (4) Extending DBMS with bounded evaluation. We present a framework, referred to as BEAS (bounded evaluable SQL) to provide commercial DBMS with the capability of bounded evaluation of RAaggr queries. Given an RAaggr query Q and a bag access schema $ {\cal B} $, BEAS first checks whether Q is boundedly evaluable under $ {\cal B} $. If so, it generates a query plan for Q to compute $ Q( {\cal D}) $ by accessing a bounded small fraction $ {\cal D}_{Q} $ of $ {\cal D} $ using $ {\cal B} $. Otherwise, it leverages access schema and generates a partially bounded plan, to bound sub-queries of Q. We develop algorithms underlying BEAS.

      (5) Experimental study. As proof of concept, we extend PostgreSQL with bounded evaluation, denoted by BEAS@PG. Using TPCH benchmark[15] and real-life datasets, we evaluate the performance of BEAS@PG compared to PostgreSQL. We find that BEAS@PG improves PostgreSQL by up to 4 orders of magnitude for boundedly evaluable queries.

      Querying big data under bounded resources. This work is a component of a framework for querying big data. As outlined in [16], the framework works as follows: given an SQL query Q posed on a big dataset $ {\cal D} $,

      (1) it first checks whether Q is boundedly evaluable;

      (2) if so, it computes the exact answer $ Q( {\cal D}) $ by accessing a bounded fraction DQ of $ {\cal D} $ via bounded evaluation;

      (3) otherwise, it computes approximate answers to Q in $ {\cal D} $ also by accessing a bounded amount of data, and provides deterministic accuracy ratios[17].

      In the entire process, it only accesses a bounded fraction of data and can be conducted under bounded resources. Hence it is feasible to provide small businesses with a capacity of querying big data despite constrained resources.

      To simplify the discussion, we focus on row-oriented DBMS (a.k.a. row stores) in this paper. Nonetheless, as will be seen in Section 2, the model of bounded evaluation subsumes column stores. Moreover, bounded evaluation can be readily extended to parallel and distributed systems[18].

      Organization. The remainder of the paper is organized as follows. Section 2 defines bag access constraints and formulates boundedly evaluable RAaggr queries. Section 3 studies the complexity of bounded evaluation for RAaggr queries. Section 4 proposes effective syntax $ {\cal L}_{\cal B} $ for boundedly evaluable RAaggr queries. Section 5 introduces BEAS and develops its underlying algorithms. The experimental study is presented in Section 6. We discuss related work in Section 7, and identifies topics for future work in Section 8.

    • We first define bag access schema (Section 2.1) and then formulate bounded evaluation of RAaggr queries, aggregate or not, under the bag semantics (Section 2.2).

    • We start with a review of RAaggr, an extension of RA with a group-by construct and nested aggregate sub-queries.

      RAaggr queries. An RAaggr query is an expression defined in terms of RA operators (i.e., select $ \sigma_C $, project $ \pi_Y $, Cartesian-product $ \times $ or join $ \Join_C $, renaming $ \rho_{A \rightarrow B} $, union $ \cup $ and set difference -), and additionally a group-by aggregate operator

      $$ { {\rm{{gpBy}}}}(Q, X, { {\rm{{agg}}}}_{1}(V_{1}), \cdots, { {\rm{{agg}}}}_{m}(V_{m}))$$

      where (a) Q is an RAaggr query, (b) X is a set of attributes for group-by, (c) aggi is one of aggregate functions max, min, count, sum, avg, and (d) V1, ··· , Vm are attributes such that $ X\cup \bigcup_{i = 1}^{m}\{{ {\rm{{agg}}}}_{i}(V_{i})\} $ forms the output relation of Q. We refer to aggi(Vi) as an aggregate field on attribute Vi. We write the operator as ${ {\rm{{gpBy}}}}(Q, X, {\overline{{ {\rm{{agg}}}}(V})})$ when it is clear from the context. Since Q may include aggregate operators itself, aggregations in an RAaggr query may be arbitrarily nested.

      In SQL syntax, the operator can be written as

      ${ {\rm{{select}}}} X , { {\rm{{agg}}}}_{1}(V_{1}), \cdots, { {\rm{{agg}}}}_{m}(V_{m})$

      from Q

      group by X

      As a special case, when $ X = \emptyset $, ${ {\rm{{gpBy}}}}(Q, X, {\overline{{ {\rm{{agg}}}}(V})})$ does not have a group-by construct. We write it simply as agg(Q).

      Example 3. Query Q2 in Example 2 is an RAaggr query.

      As another example, an RAaggr query with nested aggregation is Q4 over relations R(A, B, C) and S(E, F, W):

      $ {Q_4} = {{ {\rm{{avg}}}}}(\pi_{z}(Q_{5}(y) \Join S(y, 1, z))) $, where

      $ {Q_5}(y) = {{ {\rm{{sum}}}}}(\pi_{y}(R(w, 1, u)\Join R(w, x, y))) $.

      Bag access schema. To support the bag semantics, we extend the access schema of [5, 19] to bag access schema. Over a database schema $ {\cal R} $, a bag access schema $ {\cal B} $ is a set of bag access constraints of the form:

      $$ \varphi = R \left(\!| X \rightarrow Y, N |\!\right)$$

      where R is a relation schema in $ {\cal R} $, X and Y are sets of attributes of R, and N is a positive integer.

      To define the semantics of bag access constraints, we use the following notations. (1) Denote by $ {\cal D} $ a database of $ {\cal R} $, and by D an instance of relation schema R in $ {\cal R} $. (2) For an instance D of R, $ D_{Y}(X = \bar a) = \{t[Y]\ |\ t\in D, t[X] = \bar a\} $, i.e., $ D_{Y}(X = \bar a) $ denotes the set of $ Y $ values corresponding to X-value $ \bar a $. (3) For any XY-value $ \overline{ab} $ in D, $ m(D, \overline{ab}) $ denotes the cardinality of the bag (multiset) $ \{\!|t\in D\ |\ t[XY] = \overline{ab}|\!\} $, i.e., the number of occurrences of $ \overline{ab} $ as XY attributes in D; we refer to $ m(D, \overline{ab}) $ as the multiplicity of $ \overline{ab} $ in D.

      We say that D conforms to $ \psi $, denoted by $ D\models \varphi $, if

      (1) for any $ X $-value $ \bar a $ in $ D $, $ |{{D_{Y}(X = \bar a)}}| \leq N $, i.e., there exist at most $ N $ distinct associated $ Y $-values in $ D $; and

      (2) there exists an index for $ \varphi $ on D such that given any X-value $ \bar a $, by accessing at most N tuples, it retrieves (a) all associated distinct Y-values $ \bar b $ in D, and (b) for each such $ \bar b $, the multiplicity $ m(D, \overline{ab}) $.

      Intuitively, $ D\models \varphi $ if for any X-value, there exist at most N distinct corresponding Y values in D. Moreover, these Y-values (partial tuples) and their multiplicities in D are indexed by ψ and can be efficiently retrieved via the index.

      Example 4. Extending $ {\cal A}_{1} $ of Example 1, a bag access schema $ {\cal B}_{1} $ consists of the following bag access constraints:

      $ \circ $$ \varphi_{1 } = {\rm{{friend}}} (\!|{\rm{{pid}}} \rightarrow {\rm{{fid}}}, 5000 |\!) $;

      $ \circ $$ \varphi_{2} = {\rm{{checkin}}} (\!|{\rm{{uid}}} \rightarrow {\rm{{country}}}, 193 |\!) $.

      Here φ2 says that (a) for any uid u1, there exist at most 193 distinct country values, and (b) there exists an index built on the friend relation that given any uid u1, fetches all associated distinct countrys c and for each country c, the multiplicity of (u1, c) in the friend relation for country c; similarly for φ2.

      As another example, consider a bag access schema $ {\cal B}_{2} $ for Q4 of Example 3, which consists of bag access constraints:

      $ \circ \varphi_{3} = R (\!| A \rightarrow B, 1 |\!) $,

      $ \circ \varphi_{4} = R (\!| B \rightarrow C, 10 |\!) $, and

      $ \circ \varphi_{5} = S (\!| EF \rightarrow W, 10 |\!) $.

      We will see that Q4 can be efficiently answered with $ {\cal B}_{2} $.

      A database instance $ {\cal D} $ of $ {\cal R} $ conforms to a bag access schema $ {\cal B} $, denoted by $ {\cal D}\models {\cal B} $, if $ D\models \varphi $ for every $ \varphi\in {\cal B} $, where $ \varphi = R (\!| X \rightarrow Y, N |\!) $, and D is the instance of R in $ {\cal D} $.

      Intuitively, a bag access constraint φ extends an access constraint ψ of [5, 19] by incorporating multiplicity. Similar to ψ, given any X-value $ \bar a $ in D, φ enforces the cardinality constraint $ |{{D_{Y}(X = \bar a)}}| \leq N $ and returns distinct corresponding Y-values. In contrast to ψ, for each corresponding Y-value $ \bar b $, φ also returns the multiplicity of $ \overline{ab} $ in D. In other words, access constraints under the set semantics[5, 6] are a special case of bag access constraints when we only bound the cardinality and retrieve distinct values.

      Remark 1. When it is clear from the context, we also simply refer to bag access schema as access schema.

    • We next define bounded evaluation for RAaggr queries.

      Multiplicity relations. From an instance D of a relation schema R, we can use the index of a bag access constraint $ R (\!| X \rightarrow Y, N |\!) $ to retrieve a relation consisting of tuples $ (t[X, Y], m(t[X, Y])) $, where t is a tuple in D. It is a set that besides partial tuples t[X, Y] in D, carries multiplicity m(D, t[X, Y]), and is referred to as a multiplicity relation.

      The RAaggr operators can be readily extended to multiplicity relations. Take join operator $ \Join_{C} $ as an example. Given two multiplicity relations I1 and I2, the result of $ I_{1}\Join_{C} I_{2} $, denoted by Is, is a multiplicity relation as follows: (a) tuples in Is have the form (t, M), where t is a result tuple of $ I_{1}\Join_{C}I_{2} $ using the conventional join semantics (ignoring multiplicity), and (b) $M = m(I_{1}, t_{1}) \times m(I_{2}, t_{2}) $, where t is joined from $ t_{1}\in I_{1} $ and $ t_{2}\in I_{2} $, and m(Ii, ti) denotes the multiplicity of tuple ti in multiplicity relation $ I_{i} (i\in \{1, 2\}) $. Similarly, other RAaggr operators are defined on multiplicity relations.

      Bounded RAaggr plans. A bounded RAaggr plan $ \xi $ under a bag access schema $ {\cal B} $ is an algebra tree that extends conventional RAaggr query plans with a new operator fetch(T,φ), where $ \varphi = R (\!| X \rightarrow Y, N |\!) $ is a bag access constraint in $ {\cal B} $, and T is an intermediate multiplicity relation on attributes R[X] (see Appendix for a formal definition). Over an instance D of R that conforms to φ, fetch(T, φ)retrieves an intermediate relation $ S = \bigcup_{\bar a\in T}D_{XY}(X = \bar a) $ by using the index of φ on D, where each tuple t in S is annotated with multiplicity m(D, t) (also retrieved by φ).

      Intuitively, a bounded RAaggr plan starts with a set of constants (possibly $ \emptyset $), retrieves data from $ {\cal D} $ via the fetch operator, and applies RAaggr operators to the fetched data, except that it accesses data by employing the indices of the bag constraints in $ {\cal B} $ only, and allows group-by aggregate and operates on multiplicity relations.

      We denote by $ \xi( {\cal D}) $ the result of applying plan ξ to $ {\cal D} $.

      Example 5. Recall RAaggr query Q2 from Example 2 and bag access schema $ {\cal B}_{1} $ from Example 4. A bounded plan $ \xi_{Q_{2}} $ for Q2 under $ {\cal B}_{1} $, written in algebra expressions, is as follows:

      $T_{1}\;({\rm{{uid}}},\!{\rm{{fid}}})\! = \! { {\rm{{fetch}}}}(\{\!u_{0}\!\}, \varphi_{1})$,

      $T_{2}\;({\rm{{uid}}},\!{\rm{{cty}}})\! = \! { {\rm{{fetch}}}}(\pi_{{\rm{{fid}}}}T_{1}, \varphi_{2})$,

      $T_{3}\;({\rm{{uid}}}, {\rm{{cty}}}) = \sigma_{{\rm{{cty}}}}T_{2}$,

      $ T_{4} = {{\rm{{gpBy}}}}(T_{3}, {\rm{{uid}}}, {\rm{{count}}}({\rm{{cty}}})) $.

      As a more intriguing example, recall query Q4 from Example 3 and bag access schema $ {\cal B}_{2} $ from Example 4. By the cardinality constraint in φ3, for each A-value in any database instance $ {\cal D}\models {\cal B}_{2} $ for w of Q4, there exists at most 1 distinct B-value associated with w. Therefore, $Q_{4}\equiv_{{\cal B}_{2}}Q_{6}$, i.e., Q4 is equivalent to Q6 on every database $ {\cal D}\models {\cal B}_{2} $, where Q6 is

      $ {Q_6} = {\rm{{avg}}}(\pi_{z}(Q_{7}(y) \Join S(y, 1, z))) $, where

      $ {Q_7}\left( y \right) {\rm{{sum}}}(\pi_{y}R(w, 1, y)) $.

      Under $ {\cal B}_{2} $, Q6 (hence Q4) has a bounded query plan $ \xi_{Q_{4}} $:

      $ T_{1}(B, C) = { {\rm{{fetch}}}}(\{1\}, \varphi_{4}) $,

      $ T_{2} = { {\rm{{sum}}}}(\pi_{C}T_{1}) $,

      $ T_{3}(E, F, W) = { {\rm{{fetch}}}}(T_{2}\times \{1\}, \varphi_{5}) $,

      $ T_{4} = { {\rm{{avg}}}}(\pi_{W}T_{3}) $.

      Observe that $ \xi_{Q_{4}} $ does not explicitly use φ3. However, the correctness of $ \xi_{Q_{4}} $ relies on the cardinality of φ3. Moreover, $ \xi_{Q_{4}} $ propagates constants of Q4 via join and fetch, such that all values and their combinations that are needed for answering Q4 are fetched from $ {\cal D}\models {\cal B}_{2} $. In particular, in the presence of nested aggregation, answers to aggregate sub-queries can also be used by fetch, e.g., T3.

      Boundedly evaluable queries. Under access schema $ {\cal B} $, an RAaggr Q is boundedly evaluable if it has a plan ξ such that:

      $\circ\; \xi$ is a bounded RAaggr plan under $ {\cal B} $;

      $ \circ $ constants {c} in ξ are from selection conditions of Q;

      $ \circ $ moreover, for any database $ {\cal D}\models {\cal B} $, $ \xi( {\cal D}) = Q( {\cal D}) $.

      We write $ Q\equiv_{{\cal B}}\xi $ if Q has such a bounded query plan ξ.

      Here for any multiplicity relation D1 and a conventional bag (multiset) D2, we write D1 = D2 if D2 can be obtained from D1 by including m copies of each tuple $ (t, m) \in D_1 $.

      For example, query Q2 of Example 2 is boundedly evaluable under $ {\cal B}_{1} $ of Example 4, since it has a bounded plan $ \xi_{Q_{2}} $ given in Example 5. Similarly, Q4 of Example 3 is also boundedly evaluable under $ {\cal B}_{2} $ of Example 4.

      Observe the following about bounded RAaggr plans ξ.

      (1) Scale independence. Each fetch operation in ξ retrieves data with a cost that can be quantified by the bag constraint employed. Hence the cost of executing ξ is determined by bag access schema $ {\cal B} $ and query plan ξ only, not by the size of dataset $ {\cal D} $ as long as $ {\cal D} $ conforms to $ {\cal B} $. That is, under the bag semantics, bounded RAaggr plans preserve the scale independence of bounded evaluation for RA queries[5, 6].

      (2) Late bag semantics enforcement. Plan ξ fetches and operates on sets since fetch(T, ψ) returns a set. It defers the process of the bag semantics to a stage as late as possible. This reduces performance degradation caused by duplicated values in, e.g., joins, in which duplicates get inflated rapidly.

      (3) Subsuming column-stores. Bounded plan ξ can also express query evaluation over column-stores[20] or columnstore indices[21]. Indeed, in a column store (or a columnstore index), each column (or column index) on attribute A of a relation schema R is essentially a special case of bag access constraint of the form $ R (\!| \emptyset \rightarrow A, N |\!) $. Hence, column store and columnstore index are a special case of bag access schema and hence their evaluation plan can be expressed by a bounded plan ξ under such a bag access schema.

      Note that the efficiency of column stores mainly comes from its implementation-level optimization, e.g., column compression and vectorization[20]. While these optimization strategies can also be used to implement the indices of bag access schema, these are not the focus of this paper. We study query evaluation at the logical level, under generic constraints $ R (\!| X \rightarrow Y, N |\!) $ when X is not necessarily empty. Hence, this paper focuses on row-oriented databases as the underlying platform for implementing bag access schema.

    • In this section, we study the complexity of bounded evaluability and identify practical decidable cases.

      Bounded evaluability. The problem is stated as follows.

      $ \circ $ Input: A database schema $ {\cal R} $, a bag access schema $ {\cal B} $ over $ {\cal R} $, and an RAaggr query Q over $ {\cal R} $.

      $ \circ $ Question: Is Q boundedly evaluable under $ {\cal B} $?

      This bounded evaluability problem is to decide whether a query can be answered by accessing a bounded amount of data, and is underlying the first step of our framework for querying big data under bounded resources (Section 1).

      No matter how important, the problem is hard. To see why it is intriguing, let us consider Example 6.

      Example 6. Consider bag access schema $ {\cal B}_{3} $ and SPC query Q8 defined on relations T(A, B) and U(E, F):

      $ \circ {\cal B}_{3} $ consists of the following two access constraints:

      $ \varphi_{6} = T (\!| A \rightarrow B, N |\!) $,

      $ \varphi_{7} = U (\!| E \rightarrow F, 2 |\!) $;

      $ \circ $ query $ Q_{8} = Q_{9} - Q_{10} $, where

      $ Q_{9} = \pi_{y}(T(x, y) \Join U(w, 1) \Join U(w, x) \Join U(w, y)) $, and

      $ Q_{10} = \pi_{z}(T(z, z) \Join U(u, 1) \Join U(u, z)) $.

      At a first glance, none of Q9 and Q10 seems boundedly evaluable, and hence neither is Q8. Indeed, we cannot retrieve values for any of x, y or z using indices in $ {\cal B}_{3} $. However, putting together $U(w, 1) \Join U(w, x) \Join U(w, y)$ and φ7 of $ {\cal B}_{3} $, one can deduce that x must be equal to either 1 or y in all tuples retrieved from instance of T by any query plan for Q9. In other words, under $ {\cal B}_{3} $, Q9 reduces to SPCU $ Q_{9}^{1}\cup Q_{9}^{2} $, where $ Q_{9}^{1} = \pi_{y}(T(1, y) \Join U(w, 1) \Join U(w, , y) $ and $ Q_{9}^{2} = \pi_{y}(T(y, y) \Join U(w, 1) \Join U(w, y) $. Hence, $ Q_{8} = (Q_{9}^{1}\cup Q_{9}^{2}) - Q_{10} = Q_{9}^{1} $ since $ Q_{9}^{2} \equiv Q_{10} $. It is easy to see that $ Q_{9}^{1} $ is boundedly evaluable under $ {\cal B}_{3} $ and as a result, so is Q8.

      As shown above, it is often necessary to check query equivalence to decide whether a query is bounded. The use of union ($ \cup $) allows us to convert SPC to SPCU under a bag access schema (e.g., Q4 to $ Q_{4}^{1}\cup Q_{4}^{2} $), which may further interact with set difference (–) (e.g., Q3 and $ Q_{4}^{1} $).

      It is beyond reach in practice to check the equivalence of RA or RAaggr queries. Thus, the bounded evaluability problem is already undecidable for RA, a special case of RAaggr.

      Theorem 1.[5] The bounded evaluability problem is undecidable for RA queries.

      Theorem 1 was verified under access schema. As remarked in Section 2, access schema is a special case of bag access schema. Hence, the bounded evaluability problem remains undecidable for RA under a bag access schema. As an immediate corollary, the bounded evaluability problem is undecidable for RAaggr, which subsumes RA.

      Decidable cases. We next identify special cases when the bounded evaluability is decidable. The reason is twofold. (1) The special cases cover a large number of RAaggr queries used in practice, e.g., all SPC sub-queries of built-in benchmark queries in TPCH[15] and TPCDS[22]. (2) These cases reveal insight about why queries become boundedly evaluable. In Section 4, we will deal with generic RAaggr queries, by devising an effective syntax for boundedly evaluable RAaggr queries.sub> queries.

      (I) PTIME cases. The first special case, denoted by $ {\cal C}^{{\rm{{P}}}} $, consists of combinations of SPC queries and bag access schema for which the bounded evaluability can be checked in PTIME, covering all SPC sub-queries of TPCH and TPCDS.

      Class $ {\cal C}^{{\rm{{P}}}} $. For any bag access schema $ {\cal B} $ and SPC query Q over the same database schema $ {\cal R} $, $ ({\cal B}, Q)\in {\cal C}^{{\rm{{P}}}} $ if

      (a) for each bag constraint $ \varphi = R (\!| X \rightarrow Y, N |\!) $, $ N > |\!|{{Q}}|\!| $, where $ |\!|{{Q}}|\!| $ is the number of relation atoms in Q; and

      (b) Q has no self-join.

      Theorem 2. For any bag access schema $ {\cal B} $ and SPC Q,

      (1) it is in PTIME to decide whether $ ({\cal B}, Q) $ is in $ {\cal C}^{{\rm{{P}}}} $; and

      (2) it is in PTIME to decide whether Q is boundedly evaluable under $ {\cal B} $ if $ ({\cal B}, Q) $ is in $ {\cal C}^{{\rm{{P}}}} $.

      Proof. Statement (1) apparently holds. Below we prove (2) by giving a PTIME sufficient and necessary condition for checking the bounded evaluability of Q under $ {\cal B} $.

      The condition needs a notion of covered SPC queries from [6]. It is characterized by a set $ { {\rm{{cov}}}}(Q, {\cal B}) $, which consists of attributes whose values can be retrieved via fetch operations along with $ {\cal B} $, without directly accessing raw data in a database. More specifically, let $ X_{C}^{Q} $ be the set of attributes A in the constant selection predicates of Q, i.e., $ \sigma_{A = c} $ for a constant c. Then $ { {\rm{{cov}}}}(Q, {\cal B}) $ is inductively defined as:

      (a) $ X_{C}^{Q} \subseteq { {\rm{{cov}}}}(Q, {\cal B}) $;

      (b) if $ A\in { {\rm{{cov}}}}(Q, {\cal B}) $ and $ \Sigma_{Q} \vdash A = B $ (denoting that $ A = B $ can be deduced from selection predicates $ \Sigma_{Q} $ via the transitivity of equality), then $ B\in { {\rm{{cov}}}}(Q, {\cal B}) $;

      (c) if $ X\subseteq { {\rm{{cov}}}}(Q, {\cal B}) $ and $ R (\!| X \rightarrow Y, N |\!) \in {\cal B} $, then $ Y\subseteq { {\rm{{cov}}}}(Q, {\cal B}) $; and

      (d) $ { {\rm{{cov}}}}(Q, {\cal B}) $ contains nothing else.

      Denote by $ X_{R}^{Q} $ the set of attributes that are either in the selection or join predicates of Q, or are the top-level projection attributes. Then we show the following.

      Lemma 3. For any $ ({\cal B}, Q)\in {\cal C}^{{\rm{{P}}}} $, Q is boundedly evaluable under $ {\cal B} $ if and only if for each relation R in Q, there is $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $ such that $ X_{R}^{Q}\subseteq XY\subseteq { {\rm{{cov}}}}(Q, {\cal B}) $.

      From Lemma 3, Theorem 2(2) follows since one can simply check the condition of Lemma 3 in PTIME in the sizes of Q and $ {\cal B} $. Below we prove Lemma 3.

      ($ \Rightarrow $) Assume that Q is boundedly evaluable under $ {\cal B} $. Then there exists a bounded plan ξ for Q under $ {\cal B} $. Below we first inductively construct a query Qξ from ξ such that (a) $ \xi \equiv Q_{\xi} \equiv_{{\cal B}} Q $, where $ Q \equiv_{{\cal B}} Q' $ means that $ Q( {\cal D}) = Q'( {\cal D}) $ for any database $ {\cal D}\models {\cal B} $, and (b) Qξ satisfies the condition on Q in Lemma 3. We then show that Q also satisfies the condition when Qξ satisfies it, and thus Lemma 3 holds.

      Construction of Qξ. We construct query Qξ by induction on the structure of ξ as follows:

      $ \circ $ If ξ = {c}, then Qξ = {c}.

      $ \circ $ If $ \xi = \sigma_{C}(\xi') $ (resp. $ \pi_{Y}(\xi') $), then $ Q_{\xi} = \sigma_{C}(Q_{\xi'}) $ (resp. $ \pi_{Y}(Q_{\xi'}) $), where Qξ' is the query constructed for ξ'.

      $ \circ $ If $ \xi = { {\rm{{fetch}}}} (\xi', R (\!| X \rightarrow Y, N |\!)) $, then $ Q_{\xi} = \pi_{Z}(Q_{\xi'} \Join_{X} R(X, Y, Z)) $.

      $ \circ $ If $ \xi = \xi_{1}\times \xi_{2} $, then $ Q_{\xi} = Q_{\xi_{1}} \times Q_{\xi_{2}} $.

      By induction on the structure of ξ, one can readily verify that for each relation R in $ Q_{\xi} $, there exists a bag constraint $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $ such that $ X_{R}^{Q_{\xi}} \subseteq XY\subseteq { {\rm{{cov}}}}(Q_{\xi},{\cal B}) $, i.e., Qξ satisfies the condition of Lemma 3.

      Query Q satisfies the condition. We next show that Q satisfies the condition in Lemma 3 when Qξ does. Since for each bag constraint $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $, $ N>|\!|{{Q}}|\!| $, from $ Q_{\xi} \equiv_{{\cal B}} Q $, one can verify that $ Q_{\xi} \equiv Q $. Thus there exists a homomorphism ρ from Qξ to Q[2]. Moreover, since Q is self-join free, each relation schema R (i.e., relation atom) has at most one occurrence in Q. Then no relation atom in Q can be removed without changing Q. Thus, Q is minimal (an SPC query is minimal if it has no redundant relation atoms[2]). Hence for each relation R in Q, there must exist a relation R' in Qξ such that ρ(R') = R, and moreover, $ X_{R}^{Q} \subseteq \rho(X_{R'}^{Q_{\xi}}) $. Hence $ X_{R}^{Q} \subseteq \rho(X_{R'}^{Q_{\xi}}) \subseteq XY \subseteq \rho( { {\rm{{cov}}}}(Q_{\xi}, {\cal B})) \subseteq { {\rm{{cov}}}}(Q, {\cal B}) $. That is, Q also satisfies the condition of Lemma 3.

      ($ \Leftarrow $) Assume that Q satisfies the condition of Lemma 3. We construct a 3-step bounded query plan ξ under $ {\cal B} $ for Q:

      (a) it has a bounded sub-plan ξR for each relation R in Q that fetches all attribute values needed for answering Q;

      (b) it combines the attribute values for each relation R in Q via a bounded sub-plan $ \xi_{R}^{c} $ such that each (partial) tuple fetched and kept for R is guaranteed to draw values from the same tuple in D; and

      (c) it finally carries out operations in Q over $ \xi_{R}^{c} $ for each relation R in Q.

      To show such a plan ξ exists for Q under the condition of Lemma 3, we only need to prove the following two properties:

      (1) all necessary attribute values for answering Q from each relation R in Q can be retrieved by ξR in step (a), and

      (2) their combinations can be restored by $ \xi_{R}^{c} $ in step (b).

      Proof of (1). We prove (1) by constructing such a bounded plan ξR[A] for each attribute $ A\in X_{R}^{Q} $. Note that only attributes in $ X_{R}^{Q} $ are needed for answering Q. The plan ξR[A] is constructed by translating the proof that witnesses $ A\in { {\rm{{cov}}}}(Q, {\cal B}) $. More specifically, since the condition of Lemma 3 holds for Q and $ {\cal B} $, for any attribute A of $ X_{R}^{Q} $ such that $ A\in { {\rm{{cov}}}}(Q,{\cal B}) $, there must exist a sequence of applications of rules (a)–(c) that defines $ { {\rm{{cov}}}}(Q, {\cal B}) $ such that

      $$ \ell: { {\rm{{cov}}}}_{0} {\mathop {\longmapsto} \limits^{r_{1}}} { {\rm{{cov}}}}_{1} {\mathop {\longmapsto} \limits^{r_{2}}} \cdots {\mathop {\longmapsto} \limits^{r_{n}}}{ {\rm{{cov}}}}_{n} $$

      where $ { {\rm{{cov}}}}_{0} = \emptyset $, $ A\in { {\rm{{cov}}}}_{n} $, step i expands covi-1 by applying rule ri from one of the rules (a)–(c) for defining $ { {\rm{{cov}}}}(Q, {\cal B}) $ given earlier. We translate $ \ell $ into a bounded plan:

      $$ \xi: \xi_{0}, \cdots, \xi_{n} $$

      where ξ0 is empty; ξi is derived from ξ0, ···, ξi-1 based on step $ { {\rm{{cov}}}}_{i-1} {\mathop {\longmapsto} \limits^{r_{i}}} { {\rm{{cov}}}}_{i} $ as follows:

      (i) if ri is rule (a) for a constant c in $ X_{C}^{Q} $, then ξi is {c};

      (ii) if ri is rule (b) with A = B such that $ A\in { {\rm{{cov}}}}_{i-1} $, $ B\not\in { {\rm{{cov}}}}_{i-1} $ and $ B\in { {\rm{{cov}}}}_{i} $, then ξi = ξi-1;

      (iii) if ri is rule (c), then $\xi_{i} = { {\rm{{fetch}}}}(\xi_{j_{1}}\Join \cdots\; \xi_{j_{|X|}}, \varphi = R (\!| X \rightarrow Y, N |\!))$, where $ \xi_{j_{1}} $, ···, $ \xi_{j_{|X|}} $ are the bounded plans that fetch attributes in X ($j_{1}, \cdots, j_{|X|} < i$).

      By the construction, ξi is a bounded plan that fetches all A attribute values for Q. Note that each sub-plan ξi is bounded since it does not involve relation scans.

      Proof of (2). Plan $ \xi_{R}^{c} $ is constructed with $ \xi_{A} $ for each $ A\in X_{R}^{Q} $ by (i) $ T_{1} = \xi_{A_{1}}\Join \cdots \Join \xi_{A_{|X|}} $, where $ A_{i} $′s $ (i\in [1, |X|]) $ range over all attributes in X, and $ X_{A_{i}} $ is the plan generated above for fetching $ A_{i} $ values; and (ii) if $ X_{R}^{Q}\!\subseteq\! XY $ for $ \varphi \!=\! R (\!| X \rightarrow Y, N |\!)\in {\cal B} $, then $ \xi_{R}^{c} \!=\! { {\rm{{fetch}}}} (T_{1}, \varphi) $. Since $ \xi_{A} $ is a bounded plan for each $ A\in X_{R}^{Q} $, $ \xi_{R}^{c} $ is bounded under $ {\cal B} $.

      Hence, when the condition of Lemma 3 holds for Q and $ {\cal B} $, ξ constructed above is a bounded plan for Q under $ {\cal B} $.□

      (II) NP cases. One might want to lift the restriction of condition (b) on the queries in $ {\cal C}^{{\rm{{P}}}} $. This covers all SPC queries, including those with self-joins. However, the bounded evaluability analysis becomes harder unless P = NP.

      Class $ {\cal C}^{{\rm{{NP}}}} $. Denote by $ {\cal C}^{{\rm{{NP}}}} $ the set of pairs of bag access schemas $ {\cal B} $ and SPC queries Q such that for each bag constraint $ \varphi = R (\!| X \rightarrow Y, N |\!) $, $ N> |\!|{{Q}}|\!| $. We have the following.

      Theorem 4. For any bag access schema $ {\cal B} $ and SPC Q,

      (a) it is in PTIME to decide whether $ ({\cal B}, Q) $ is in $ {\cal C}^{{\rm{{NP}}}} $; and

      (b) it is NP-complete to decide whether Q is boundedly evaluable under $ {\cal B} $ if $ ({\cal B}, Q) $ is in $ {\cal C}^{{\rm{{NP}}}} $.

      Proof. Statement (a) is immediate. To prove statement (b), we first give a sufficient and necessary condition for query Q to be boundedly evaluable under $ {\cal B} $ for any $ ({\cal B}, Q)\in {\cal C}^{{\rm{{NP}}}} $. Based on the characterization, we then show that checking bounded evaluability for $ {\cal C}^{{\rm{{NP}}}} $ is NP-complete.

      Let Qm be the minimal equivalent query of Q, i.e., the minimized version of Q, which can be obtained by removing all redundant relations (see [2] for details). For an SPC query Q, there exists a unique minimal equivalent query up to isomorphism[2]. Along the same lines as the proof of Lemma 3, one can verify the following for cases in $ {\cal C}^{{\rm{{NP}}}} $.

      Lemma 5. For any $ ({\cal B}, Q)\in {\cal C}^{{\rm{{NP}}}} $, Q is boundedly evaluable under $ {\cal B} $ if and only if for each relation atom R in Qm, there exists $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $ such that $ X_{R}^{Q_{m}}\subseteq XY\subseteq { {\rm{{cov}}}}(Q_{m}, {\cal B}) $.

      Based on this, we prove that deciding whether Q is boundedly evaluable under $ {\cal B} $ is NP-complete for $ ({\cal B}, Q) $ in $ {\cal C}^{{\rm{{NP}}}} $.

      Upper bound. We give an NP algorithm as follows:

      (a) convert Q into its tableau representation (TQ, u)[2];

      (b) guess a sub-query Q' = (T', u) of Q such that $ T'\subseteq T_{Q} $, and a mapping ρ from TQ to T';

      (c) check (i) whether ρ is a homomorphism from (TQ, u) to (T', u) and (ii) whether for each relation atom R in Q', there exists $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $ such that $ X_{R}^{Q'}\subseteq XY\subseteq { {\rm{{cov}}}}(Q, {\cal B}) $; return “Yes” if so.

      The algorithm is correct since if conditions (i) and (ii) of step (c) hold on Q', they must also hold on the minimal equivalent query Qm of Q. Indeed, ρ also determines a homomorphism from (TQ, u) to $ (T_{Q_{m}}, u) $ since Qm is a minimal equivalent query of Q, i.e., $ T_{Q_{m}}\subseteq T_{Q}'\subseteq T_{Q} $; therefore, if condition (ii) holds on Q', by the homomorphism ρ it must also hold on Qm, i.e., the condition of Lemma 5 applies to Q and $ {\cal B} $. Thus, by Lemma 5, Q is boundedly evaluable. The algorithm is in NP since step (a) is in PTIME, and step (c) is in PTIME in |Q'|, |Q|, |ρ|, and $ |{\cal B}| $ while $ |Q'|\leq |Q| $ and ρ = O(|Q|). Here |Q| is the size of Q, i.e., the number of attributes and aggregate fields in Q; $ |{\cal B}| $ is the total length of bag constraints in $ {\cal B} $.

      Lower bound. To show that the problem is NP-hard, we consider the following problem, denoted by MINCQ.

      $ \circ $ Input: A relation schema R and an SPC query Q over R.

      $ \circ $ Question: Is Q minimized, i.e., is Q a minimal equivalent query of Q?

      It is easy to verify that MINCQ is coNP-complete by reduction from 3-COLORABILITY, which is NP-complete[23].

      Lemma 6. Problem MINCQ is NP-complete.

      We show that the bounded evaluability problem for $ {\cal C}^{{\rm{{NP}}}} $ is NP-hard by reduction from the complement of MINCQ.

      Given an instance of MINCQ, i.e., a relation schema $R(A_{1}, \cdots, A_{m})$ and an ${ {\rm{{SPC}}}}$ query Q over R, we construct a database schema $ {\cal R}' $, an SPC query Q' over $ {\cal R}' $ and a bag access schema $ {\cal B} $. We show that Q is not minimal if and only if Q' is boundedly evaluable under $ {\cal B} $.

      (1) Database schema $ {\cal R}' $ consists of a single relation schema $R'(A_{1}, \cdots, A_{m}, B_{1}, \cdots, B_{\frac{n(n-1)}{2}})$, where n is the number of relation atoms that appear in query Q.

      Intuitively, R' extends R with additional attributes $ B_{1} $, ··· , $ B_{\frac{n(n-1)}{2}} $. As will be shown later, together with Q', such new attributes will be used as join attributes to pairwisely connect the n relation atoms of Q in Q'.

      (2) Query Q' is derived from Q as follows:

      $ \circ $ query Q' retains the same number of joins and relation atoms as Q, such that each relation atom Ri (i.e., renaming of relation schema $ R\in {\cal R} $) is replaced with Ri' (i.e., renaming of relation schema $ R'\in {\cal R}' $); and

      $ \circ $ the selection (join) condition C of query $ Q' $ contains all selection predicates of Q, and in addition, the following predicates: for each pair of relations Ri and Rj in Q ($ i<j $), add equality $ R'_{i}[B_{p}] = R'_{j}[B_{p}] $ to C, where $ p = n(i-1) -\dfrac{i(i-1)}{2} + (j-i) $.

      Intuitively, C preserves all selection conditions of Q and additionally joins each pair of the n relation atoms on a dedicated attribute $ B_{i} (i\in [1, \dfrac{n(n-1)}{n}]) $: (a) for each Bk, there exist exactly two relation atoms Ri' and Rj' such that Ri'[Bk] = R'j[Bk]; and (b) for each Ri' and Rj', there exists exactly one attribute Bk such that Ri'[Bk] = R'j[Bk].

      (3) The bag access schema $ {\cal B} $ consists of $ \dfrac{n(n-1)}{2} - 1 $ constraints. Let W be the set of all attributes of A1, ··· , Am such that they appear in the selection/join conditions or the top-level projection attributes in Q. Then $ {\cal B} $ consists of

      $ \circ $ $\varphi_{1} = R' (\!|\emptyset \rightarrow WB_{2}B_{3}\cdots B_{\frac{n(n-1)}{2}}, N |\!)$,

      $ \circ $ $\varphi_{2} = R' (\!|\emptyset \rightarrow WB_{1}B_{3}\cdots B_{\frac{n(n-1)}{2}}, N |\!)$,

      ...

      $ \circ $ $\varphi_{\frac{n(n-1)}{2}} = R' (\!|\emptyset \rightarrow WB_{1}B_{2}\cdots B_{\frac{n(n-1)}{2}-1}, N |\!)$.

      We next show that query Q is not minimal if and only if Q is boundedly evaluable under $ {\cal B} $.

      $ (\Rightarrow) $ Assume that Q is not minimal. Then none of the $ n $ relation atoms in Q' can be removed by minimizing Q'. Hence all $ \dfrac{n(n-1)}{2} $ attributes $ B_{1} $, ···, $ B_{\frac{n(n-1)}{2}} $, together with W, have to be contained in XY for some $ \varphi = R'(X \rightarrow Y, N, m) $ in $ {\cal B} $ by Lemma 5. This yields $ |W| + \dfrac{n(n-1)}{2} $ attributes. This is impossible since for any $ \varphi_{i}\in {\cal B} $, φi contains $ |W| + \dfrac{n(n-1)}{2} - 1 $ attributes only by its definition above.

      $ (\Leftarrow) $ Assume that Q' is boundedly evaluable under $ {\cal B} $. Since each $ \varphi_{i}\in {\cal B} $ contains $ |W| + \dfrac{n(n-1)}{2} - 1 $ attributes, by Lemma 5, $ X_{R'_{i}}^{Q'_{m}} $ must contain at most $ |W| + \dfrac{n(n-1)}{2} - 1 $ attributes for each relation atom R'i in Q', where Q'm is the minimal equivalent query of Q'. Since $ X_{R'_{i}}^{Q'} $ contains $ |W| + \dfrac{n(n-1)}{2} $ attributes, query Q' is not minimal.□

      Remark 2. Despite its intractability, checking the bounded evaluability for $ {\cal C}^{{\rm{{NP}}}} $ is feasible in practice by Lemma 5. Indeed, there have been effective algorithms for minimizing SPC queries, i.e., computing Qm for Q[2], and the size of Q is typically small. Taking one of these algorithms as an oracle for computing Qm, one can still efficiently check the bounded evaluability of generic SPC queries: first minimize Q, yielding $ Q_{m} $, and then check whether Qm and $ {\cal B} $ satisfy the condition of Lemma 5 in PTIME in |Qm| and $ |{\cal B}| $.

    • In this section, we propose an effective method to check the bounded evaluability of generic RAaggr queries. We show that while the problem is undecidable (Theorem 1), there exists an effective syntax for boundedly evaluable RAaggr queries, which reduces the problem to syntactic checking (Section 4.1). In addition, we identify two practical subclasses of RAaggr queries and provide their effective syntax. In particular, we give one for RA and show that it covers more bounded queries than the one given in [6] (Section 4.2).

    • Under an access schema $ {\cal B} $, an effective syntax for boundedly evaluable queries of $ \mathbb{L} $ ($ \mathbb{L} $ refers to, e.g., RA or RAaggr) is a subclass $ {\cal L}_{\cal B}^{\mathbb{L}} $ of $ \mathbb{L} $ such that for any Q in $ \mathbb{L} $,

      (a) if Q is boundedly evaluable, then there exists a query $ Q' $ in $ {\cal L}_{\cal B}^{\mathbb{L}} $ such that $ Q\equiv_{{\cal B}} Q' $;

      (b) every query Q in $ {\cal L}_{\cal B}^{\mathbb{L}} $ is boundedly evaluable; and

      (c) it is in PTIME in the size $ |Q| $ of query Q and the length $ |{{{\cal B}}}| $ of constraints in $ {\cal B} $ to check whether $ Q\in {\cal L}_{\cal B}^{\mathbb{L}} $.

      Here $ Q\equiv_{{\cal B}} Q' $ if $ Q( {\cal D}) = Q'( {\cal D}) $ for all databases $ {\cal D}\models {\cal B} $.

      Intuitively, the effective syntax reduces the problem of deciding the bounded evaluability of $ \mathbb{L} $ queries to syntactic checking of $ {\cal L}_{\cal B}^{\mathbb{L}} $. Indeed, every boundedly evaluable $ \mathbb{L} $ query can find an equivalent query in $ {\cal L}_{\cal B}^{\mathbb{L}} $ under $ {\cal B} $. Hence, we can safely settle with queries in $ {\cal L}_{\cal B}^{\mathbb{L}} $, since $ {\cal L}_{\cal B}^{\mathbb{L}} $ can express, up to equivalence, all boundedly evaluable $ \mathbb{L} $ queries.

      Remark 3. To some extent, the development of effective syntax $ {\cal L}_{\cal B}^{\mathbb{L}} $ is analogous to the study of range-safe queries for relational calculus. Indeed, the problem for checking the “safety” of relational calculus queries is also undecidable[2]. Despite this, range-safe queries are supported by commercial DBMS, by making use of an effective syntax of range-safe relational calculus queries. We follow the same approach to dealing with the bounded evaluability of RAaggr queries.

      Below we develop such an effective syntax, denoted by $ {\cal L}_{\cal B} $, for RAaggr queries that are boundedly evaluable under $ {\cal B} $.

      The class $ {\cal L}_{\cal B} $. In a nutshell, we characterize $ {\cal L}_{\cal B} $ with three sets: $ { {\rm{{BA}}}}(Q,{\cal B}) $, $ { {\rm{{BR}}}}(Q,{\cal B}) $ and $ { {\rm{{BQ}}}}(Q,{\cal B}) $. Informally, under a bag access schema $ {\cal B} $, for an RAaggr query Q,

      $ \circ { {\rm{{BA}}}}(Q,{\cal B}) $ contains attributes (e.g., $ A $) and aggregates (e.g., $ { {\rm{{sum}}}}(A) $) of Q whose values can be fetched via $ {\cal B} $;

      $ \circ { {\rm{{BR}}}}(Q,{\cal B}) $ consists of relations in Q whose partial tuples that are needed to answer Q can be reconstructed from the fetched values for attributes in $ { {\rm{{BA}}}}(Q,{\cal B}) $; and

      $ \circ { {\rm{{BQ}}}}(Q,\!{\cal B}) $ contains boundedly evaluable sub-queries of Q.

      An RAaggr query Q is included in $ {\cal L}_{\cal B} $ if $ Q\in { {\rm{{BQ}}}}(Q,{\cal B}) $.

      Intuitively, these sets characterize RAaggr queries Q for which the values of all attributes necessary for answering Q can be “deduced” from constants in Q, via joins and fetch under access schema $ {\cal B} $. Such attributes participate in RAaggr operations of Q, and are referred as the nontrivial attributes of query Q. The class of such queries makes an effective syntax for boundedly evaluable RAaggr queries.

      More specifically, sets BA, BR and BQ are defined in a mutual recursive way using rules given in Fig. 1, with notations explained in Table 1. Intuitively, (1) rule γ1 of Fig. 1 includes constant attributes $ X^{c}_{Q} $ (see Table 1) in $ { {\rm{{BA}}}}(Q,{\cal B}) $; (2) γ2 propagates value from attributes and aggregate fields to join attributes; (3) γ3 specifies value propagation via fetch; (4) γ4 says that if a sub-query is boundedly evaluable, then its output attributes and aggregate fields can also be fetched; (5) γ5 adds a relation atom R to $ { {\rm{{BR}}}}(Q,{\cal B}) $ only when the partial tuples of R can be reconstructed from combinations of the fetched values; and (6) γ6 says that a sub-query is boundedly evaluable if all its relations can be correctly fetched.

      Figure 1.  Effective syntax $ {\cal L}_{\cal B} $ for $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}} $ queries

      Table 1.  Notations and definitions

      Notation Definition
      A (or R[A]) An attribute or an aggregate field in Q
      X, Y Sets of attributes in Q
      |Q| Number of attributes and aggregate fields in Q
      ||Q|| Number of relation atoms in Q
      ${ X_{Q} }$ Set of all attributes and aggregate fields in Q
      ${ X_{Q}^{c} }$ Set of attributes A of Q in constant selections ${ \sigma_{A=c} }$
      ${ \Sigma_{Q} }$ Set of equality predicates in selections/joins of Q
      ${ \Sigma_{Q}\!\vdash\!A\!=\!B }$ ${ A=B }$ can be deduced from ${ \Sigma_{Q} }$ via equality transitivity
      ${ Z_{Q} }$ Set of attributes and aggregate fields of the output of Q
      nontr. attr. Attributes participated in algebra operations of Q
      ${ {\cal A} }$/${ {\cal B} }$ Access schema/bag access schema
      ${ |{\cal B}| }$ Total length of bag constraints in ${ {\cal B} }$
      ${ |\!|{{{\cal B}}}|\!| }$ The number of bag constraints in ${ {\cal B} }$
      ${ \phi }$/${ \varphi }$ Access constraint/bag access constraint
      ${ {\cal L}_{\cal B} }$ Effective syntax for ${ { {\\{ {{\rm{RA}}} } } }_{ { {\rm{ {aggr} } } } } }$ queries
      bounded. eval. under ${ {\cal B} }$
      ${ {\cal L}_{\cal B}^{\mathbb{L}} }$ Effective syntax for ${ \mathbb{L} }$-queries bounded. eval. under ${ {\cal B} }$

      As shown in Table 1, $ \Sigma_Q $ denotes the set of equality predicates embedded in the selection or join conditions of Q, and $ \Sigma_Q \vdash A = B $ denotes that equality $ A = B $ can be deduced from $ \Sigma_Q $ by the transitivity of the equality relation.

      Example 7. Recall query Q2 from Example 2 and bag access schema $ {\cal B}_{1} $ from Example 4. We show that $ Q_{2}\in {\cal L}_{{\cal B}_{1}} $.

      (1) Initially, by rule γ1, $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ includes f.uid and c.cty, where f (resp. c) are shorthands for friend (resp. checkin).

      (2) By rule γ3 and φ1, $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ further includes f.fid.

      (3) Since $ \Sigma_{Q}\vdash {\rm{{f.fid}}} = {\rm{{c.uid}}} $, by $ {\rm{{f.fid}}}\in {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $, we have that $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ includes c.uid by rule γ2.

      (4) By γ5, $ {\rm{{f}}}\in { {\rm{{BR}}}}(Q_{2}, {\cal B}_{1}) $ because both f.uid and f.fid are in $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ and are attributes of φ1. Similarly, $ {\rm{{c}}}\in { {\rm{{BR}}}}(Q_{2}, {\cal B}_{1}) $ because of φ2. Note that only c.uid and c.cty are nontrivial attributes of Q in relation c and they are both in $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $.

      (5) By γ6, sub-query Q3 and query Q2 itself are in $ { {\rm{{BQ}}}}(Q, {\cal B}) $.

      As another example, recall query Q6 from Example 5 and $ {\cal B}_{2} $ from Example 4. We next show that $ Q_{6}\in {\cal L}_{{\cal B}_{2}} $.

      (1) One can readily deduce that $ {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $ includes B and C by using γ1 and γ3, and that $ { {\rm{{BR}}}}(Q_{6}, {\cal B}_{2}) $ includes R with $\gamma_{5} $.

      (2) By γ6, $ { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) $ includes sub-query Q7 of Q6.

      (3) Hence, further by γ4 we have that $ Z_{Q_{7}}\in {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $, where $ Z_{Q_{7}} $ is the output of Q7, i.e., an aggregate field.

      (4) By γ2, we know that $ {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $ includes $ F $ since $ \sigma_{Q_{6}}\vdash Z_{Q_{7}} = F $ and $ F $ is not an aggregate field.

      (5) Thus, by γ5 and γ6, $ S\in { {\rm{{BR}}}}(Q_{6}, {\cal B}_{2}) $ and $ Q_{6}\in { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) $.

      Note that $ Q_{4}\not \in {\cal L}_{{\cal B}_{2}} $. However, $ Q_{4}\equiv_{{\cal B}_{2}}Q_{6} $, $ Q_{6}\in {\cal L}_{{\cal B}_{2}} $ and Q6 is boundedly evaluable under $ {\cal B}_{2} $. Similarly, for Q8, $ Q_{9}^{1} $ and $ {\cal B}_{3} $ of Example 6, one can verify that $ Q_{8}\not\in {\cal L}_{{\cal B}_{3}} $ but $ Q_{9}^{1}\in {\cal L}_{{\cal B}_{3}} $ and $ Q_{9}^{1} $ is boundedly evaluable under $ {\cal B}_{3} $.

      We next show that $ {\cal L}_{\cal B} $ is indeed an effective syntax for boundedly evaluable RAaggr queries under $ {\cal B} $.

      Theorem 7. Under any bag access schema $ {\cal B} $, $ {\cal L}_{\cal B} $ is an effective syntax for boundedly evaluable RAaggr queries.

      Proof. We show below that $ {\cal L}_{\cal B} $ has properties (a) and (b) of an effective syntax, by proving the following lemmas. We will constructively prove property (c) in Section 5.2.

      (I) For any bounded plan ξ under $ {\cal B} $, there is $ Q\equiv_{{\cal B}} \xi $ in $ {\cal L}_{\cal B} $.

      (II) For any $ Q\in {\cal L}_{\cal B} $, there is plan $ \xi\equiv_{{\cal B}} Q $ bounded under $ {\cal B} $.

      These suffice. Indeed, for any Q that is bounded under $ {\cal B} $, by definition there must exist a plan ξQ bounded under $ {\cal B} $; hence by (I), there exists $ Q'\equiv_{{\cal B}}\xi_{Q} $ in $ {\cal L}_{\cal B} $. On the other hand, if $ Q\equiv_{{\cal B}} Q'\in {\cal L}_{\cal B} $, by (II), Q' has a bounded plan $ \xi'\equiv_{{\cal B}} Q'\equiv_{{\cal B}} Q $, i.e., Q is also boundedly evaluable under $ {\cal B} $. Hence, $ {\cal L}_{\cal B} $ has properties (a) and (b) of an effective syntax.

      We next prove the two lemmas.

      Proof of (I). We prove it by induction on the structure of $ \xi $.

      Base case. When $ \xi $ is $\{c\}$ or $ \emptyset $, by definition $ \xi $ itself is in $ {\cal L}_{\cal B} $.

      Induction. We consider the structure of $ \xi $.

      (i) $ \xi $ is $ {\rm{{gpBy}}}(\xi', X, \overline{{\rm{{agg}}}(V)}) $. By the induction hypothesis, there exists a query $ Q'\equiv_{{\cal B}} \xi' $ such that $ Q'\in {\cal L}_{\cal B} $. Consider $ Q = {\rm{{gpBy}}}(Q', X, \overline{{\rm{{agg}}}(V)}) $. By the definition of $ {\cal L}_{\cal B} $, all relations in $ Q' $ are in $ { {\rm{{BR}}}}(Q', {\cal B})\subseteq { {\rm{{BR}}}}(Q, {\cal B}) $ (since Q and $ Q' $ share the same nontrivial attributes). Hence $ Q\in { {\rm{{BQ}}}}(Q, {\cal B}) $ by rule $ \gamma_{6} $. That is, $ Q\in {\cal L}_{\cal B} $. Obviously, $ Q\equiv_{{\cal B}} \xi $.

      The cases for $ \xi = \pi_{Y}(\xi') $, $ \sigma_{C}(\xi') $, $ \xi_{1}\times \xi_{2} $, $ \xi_{1}\cup \xi_{2} $, $ \xi_{1}-\xi_{2} $ are similar and can be verified along the same lines.

      (ii) $ \xi $ is $ { {\rm{{fetch}}}}(\xi', \varphi) $ with $ \varphi = R (\!| X \rightarrow Y, N |\!) $. By the induction hypothesis, there exists query $ Q'\equiv_{{\cal B}} \xi' $ such that $ Q'\in {\cal L}_{\cal B} $. Consider $ Q = \pi_{R[XY]}(Q'\Join_{Z_{Q'} = R[X]} R) $. By the semantics of ${ {\rm{{fetch}}}}$, $ Q \equiv_{{\cal B}} \xi $. We next show that $ Q\in {\cal L}_{\cal B} $. Since $ Q'\in {\cal L}_{\cal B} $, $ Q'\in { {\rm{{BQ}}}}(Q', {\cal B}) $. Hence by rule $ \gamma_{4} $, $ Q'[X]\subseteq {\rm{{BA}}}(Q', {\cal B})\subseteq {\rm{{BA}}}(Q, {\cal B}) $. Further by $ \gamma_{2} $, $ R[X]\subseteq {\rm{{BA}}}(Q, {\cal B}) $. By $ \gamma_{3} $, $ R[Y]\subseteq {\rm{{BA}}}(Q, {\cal B}) $. Hence by $ \gamma_{5} $ and $ \gamma_{6} $, $ Q\in{ {\rm{{BQ}}}}(Q, {\cal B}) $. That is, $ \xi\in {\cal L}_{\cal B} $.

      Proof of (II). Since $ Q\in {\cal L}_{\cal B} $, by the definition of $ {\cal L}_{\cal B} $ there must exist a proof $ \ell_{Q} $ consisting of applications of rules in Fig. 1 that deduces $ Q\in { {\rm{{BQ}}}}(Q, {\cal B}) $, i.e., a sequence of the form

      $$ ({\rm{{BA}}}_{0}, { {\rm{{BR}}}}_{0}, { {\rm{{BQ}}}}_{0}) {\mathop {\longmapsto} \limits^{r_{1}}} \cdots {\mathop {\longmapsto} \limits^{r_{n}}} ({\rm{{BA}}}_{n}, { {\rm{{BR}}}}_{n}, { {\rm{{BQ}}}}_{n}) $$

      where (a) $ r_{i}(i\in [1, n]) $ is one of the rules in Fig. 1; (b) $ {\rm{{BA}}}_{0} = { {\rm{{BR}}}}_{0} = { {\rm{{BQ}}}}_{0} = \emptyset $; (c) for each step $ i $, only one of $ {\rm{{BA}}}_{i} $, $ { {\rm{{BR}}}}_{i} $ and $ { {\rm{{BQ}}}}_{i} $ is changed in $ {\rm{{BA}}}_{i+1} $, $ { {\rm{{BR}}}}_{i+1} $ and $ { {\rm{{BQ}}}}_{i+1} $, respectively; and (d) $ r_{n} $ is rule $ \gamma_{6} $ that deduces $ Q\in { {\rm{{BQ}}}}(Q, {\cal B}) $. We define the length of $ \ell_{Q} $ as the number $ n $ of rules applied in $ \ell_{Q} $.

      Induction hypothesis. We show that for a proof $ \ell $ of length $ i $,

      $ \circ $ if $ A\in {\rm{{BA}}}_{i+1} $ but $ A\not\in{\rm{{BA}}}_{i} $, values for $ A $ that are necessary for answering Q can be fetched via bounded plan under $ {\cal B} $;

      $ \circ $ if $ R\in { {\rm{{BR}}}}_{i+1} $ but $ R\not \in { {\rm{{BR}}}}_{i} $, then values from R necessary for Q can be fetched via bounded plans under $ {\cal B} $; and

      $ \circ $ if $ q\in { {\rm{{BR}}}}_{i+1} $ but $ q\not\in{ {\rm{{BQ}}}}_{i} $, then the exact answers to sub-query Q of Q can be answered via bounded plans under $ {\cal B} $.

      Note that for any $ Q\in {\cal L}_{\cal B} $, Q must have a proof $ \ell_{Q} $ ending by including $ Q\in { {\rm{{BQ}}}}(Q, {\cal B}) $. If the induction hypothesis holds, Q must have a bounded plan under $ {\cal B} $, which proves lemma (II).

      We next prove it by induction on the length $ l(\ell) $ of $ \ell $.

      Base case. When $ l(\ell) = 1 $, then rule r1 can only be either (i) γ1 of Fig. 1, i.e., to include $ A\in {\rm{{BA}}}_{1} $ from selection A = c of Q; or (ii) γ3 of Fig. 1, i.e., to include R[Y] in BA1 with $ \varphi = R (\!| \emptyset \rightarrow Y, N |\!)\in {\cal B} $. For (i), simply let ξA = {c}. Then ξA is a bounded plan that fetches all necessary values for A. For (ii), let $ \xi_{R[Y]} = { {\rm{{fetch}}}}(\emptyset, \varphi) $. Then by the semantics of fetch, all values for R[Y] that are necessary for answering Q are fetched by ξR[Y] (here we rename Q beforehand such that there exist no duplicated attribute names).

      Induction. Assume that the hypothesis holds for proofs of length at most k. Consider proof $ \ell $ of length k+1. We discuss the last step $ ({\rm{{BA}}}_{i}, { {\rm{{BR}}}}_{i}, { {\rm{{BQ}}}}_{i}) {\mathop {\longmapsto} \limits^{r_{i+1}}} ({\rm{{BA}}}_{i+1}, { {\rm{{BR}}}}_{i+1}, { {\rm{{BQ}}}}_{i+1}) $ of $ \ell $.

      (i) If rk+1 is rule γ1 with attribute A = c, then A can be fetched in exactly the same way as the base case.

      (ii) If rk+1 is rule γ2 that includes attribute B in BAi+1 with A = B, then attribute A must be included in BAi+1 at some steps prior to k+1. By the induction hypothesis, there must exist a bounded plan ξA that fetches all necessary values for answering Q except B. Hence ξB =ξA is also a bounded plan that fetches B for Q by the condition A = B.

      (iii) If rk+1 is rule γ3 that includes R[Y] in BAi+1 with $ R[X] \subseteq {\rm{{BA}}}(Q, {\cal B}) $ and constraint $ \varphi = R (\!| X \rightarrow Y, N |\!) $, then there exist steps $i_{1}, \cdots, i_{p}$ prior to k+1 that include R[X1], ··· , $ R[X_{i_{p}}] $ in BA such that $ R[X_{1}] \cup \cdots \cup R[X_{p}] = R[X] $. Hence by the induction hypothesis, $ R[X_{j}](j \in [1, p]) $ has bounded plan $ \xi_{R[X_{j}]} $. Let $ \xi_{R[X]} $ be $ \Join_{j = 1}^{p}\xi_{R[X_{j}]} $, then ξR[X] fetches all values for R[X] that are necessary for answering Q. Hence, further by the semantics of fetch, R[Y] has a plan fetch(ξR[X], φ) that retrieves all R[Y]-values needed for answering Q.

      (iv) If rk+1 is γ4 that includes $ Z_{Q_{s}} $ in BAi+1 from a subquery Qs of Q that is included in BQj at step $ j< i+1 $, then by the induction hypothesis, there exists a plan $ \xi_{Q_{s}} $ for Qs under $ {\cal B} $ that exactly answers Qs. Hence we can get values for its output attributes $ Z_{Q_{s}} $ simply by $ \xi_{Q_{s}} $. Note that since $ \xi_{Q_{s}} $ is an exact plan for Qs, the values for $ Z_{Q_{s}} $ are guaranteed correct even when Qs is an aggregate subquery.

      (v) If rk+1 is γ5 that includes R in BRi+1 with R[X] and $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $, then by the induction hypothesis, there exist plans $ \xi_{R[X_{1}]} $, ···, $ \xi_{R[X_{p}]} $ to fetch all necessary values for R[X1], ··· , R[Xp] for Q, respectively, such that $ R[X_{1}]\cup \cdots \cup R[X_{p}] = R[X] $. Hence R[X] has a bounded plan $ \xi_{R[X]} = \, \Join_{j = 1}^{p}\xi_{R[X_{j}]} $ that fetches all necessary $ R[X] $-values for Q. Since R[XY] covers all nontrivial attributes of R for Q, by $ \xi_{R[XY]} = { {\rm{{fetch}}}}(\xi_{R[X]}, \varphi) $, we can fetch all combinations of R[XY]-values that are needed for answering Q.

      (vi) If rk+1 is γ6 that includes Qs in BQi+1, then all relations R1, ··· , Rp of Qs have been included in {BR in prior steps. Hence by the induction hypothesis, there exist $ \xi_{R_{1}} $, ···, $ \xi_{R_{p}} $ that fetch all values from R1, ··· , Rp, respectively, which are necessary for Q. Now construct plan $ \xi_{Q_{s}} $ by replacing each relation $ R_{i}(i\in [1, p]) $ in Qs with $ \xi_{R_{i}} $. Then $ \xi_{Q_{s}} $ must be a query plan for Qs of Q since all necessary value combinations can be retrieved from D via $ \xi_{R_{i}}(i\in [1, p]) $, and Qs then filters and combines values exactly the same as on D.

      Hence the hypothesis holds for proofs of length k+1.□

    • These are two important sub-classes of RAaggr: (1) RA consists of RAaggr queries without aggregation; and (2) $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ is the class of RAaggr queries in which group-by aggregation, if it exists, only appears at the top-level (final operation). It is common to find RA and $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ in practice.

      We provide effective syntax for boundedly evaluable RA and $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries. Denote by $ {\cal L}_{\cal B}[{ {\rm{{RA}}}}] $ and $ {\cal L}_{\cal B}[ {\rm{RA}}_{{\rm{agg}}}^0] $ the class of ${ {\rm{{RA}}}}$ and the class of $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries that are in $ {\cal L}_{\cal B} $, respectively. They yield effective syntax for RA and $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $.

      Corollary 8. For any access schema $ {\cal B} $, (1) $ {\cal L}_{\cal B}{[{ {\rm{{RA}}}}]} $ is an effective syntax for ${ {\rm{{RA}}}}$ queries bounded by $ {\cal B} $; (2) ${{\cal L}_{\cal B}}[RA_{\rm{{agg}}}^0]$ is an effective syntax for $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries bounded by $ {\cal B} $.

      Proof. Since it is in PTIME to check whether a query is in $ {\cal L}_{\cal B} $ and every query in $ {\cal L}_{\cal B} $ has a bounded plan under $ {\cal B} $, to show that $ {\cal L}_{\cal B}[ { {\rm{{RA}}}}] $ and $ {\cal L}_{\cal B}[ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0}] $ are effective syntax for boundedly evaluable RA and $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries, respectively, it suffices to show that for any boundedly evaluable RA Q1 and $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ $ Q_{2} $, there exist $ Q_{1}' \in {\cal L}_{\cal B}[ { {\rm{{RA}}}}] $ and $ Q'_{2}\in {\cal L}_{\cal B}[ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0}] $ such that $ Q_{1} \equiv_{{\cal B}} Q_{1}' $ and $ Q_{2} \equiv_{{\cal B}} Q_{2}' $. This is verified along the same lines as the proof of Lemma (I) for Theorem 7 above, by showing that every bounded RA (resp. $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $) plan has an equivalent query in $ {\cal L}_{\cal B}[ { {\rm{{RA}}}}] $ (resp. $ {\cal L}_{\cal B}[ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0}] $).□

      There are close connections between $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ and RA regarding their effective syntax: any effective syntax for boundedly evaluable $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries also gives us an effective syntax for boundedly evaluable RA queries, and vice versa. For any class $ {\cal L} $ of $ { {\rm{{RA}}}}_{{ {\rm{{aggr}}}}}^{0} $ queries $ Q = {\rm{{gpBy}}}(Q', X, \overline{{{ {\rm{{agg}}}}}(V)}) $, denote by (a) $ {\cal L}^{0} $ the class of RA queries Q' that are sub-queries embedded in RAaggr queries Q in $ {\cal L} $; and (b) $ {\cal L}[ { {\rm{{RA}}}}] $ the class of RA queries in $ {\cal L} $. Then we have the following.

      Lemma 9. Under any bag access schema $ {\cal B} $,

      (1) RA is an effective syntax for boundedly evaluable RA if $ {\cal L} $ is an effective syntax for boundedly evaluable RAaggr;

      (2) $ {\cal L} $ is an effective syntax for boundedly evaluable RAaggr if $ {\cal L}^{0} $ is an effective syntax for boundedly evaluable RA.

      Proof. Lemma 9(1) can be verified by the definition of effective syntax. We focus on Lemma 9(2) here (the proof for Lemma 9(1) is simpler). By the definition of boundedly evaluable queries, it is easy to show the following lemma: for any RAaggr $ Q = {\rm{{gpBy}}}(Q', X, \overline{{\rm{{agg}}}(V)}) $, under $ {\cal B} $, Q is boundedly evaluable iff Q' is boundedly evaluable.

      We next use the lemma to prove Lemma 9(2). When $ {\cal L}^{0} $ is an effective syntax for boundedly evaluable RA queries under $ {\cal B} $, consider the associated class $ {\cal L} $ of $ {\cal L}^{0} $. (1) First observe that all queries in $ {\cal L} $ are also boundedly evaluable since RA queries in $ {\cal L}^{0} $ are. (2) For any RAaggr query $Q = {\rm{{gpBy}}}(Q_{1}, X, {\overline{{\rm{{agg}}}(V})})$ that is boundedly evaluable, by the lemma above, $ Q_{1} $ is also boundedly evaluable under $ {\cal B} $. Hence, there exists $ Q'_{1}\in {\cal L}^{0} $ such that $ Q_{1}\equiv_{{\cal B}} Q'_{1} $. Hence $Q' = {\rm{{gpBy}}}(Q_{1}', X{, \overline{{\rm{{agg}}}(V})}) \equiv_{{\cal B}} Q$ and $ Q'\in {\cal L} $ (since $ Q_{1}'\in {\cal L}^{0} $). (3) Moreover, it is in PTIME to check whether a query Q is in $ {\cal L} $ by checking whether its embedded RA query $ Q' $ is in $ {\cal L}^{0} $, in PTIME. From (1), (2) and (3) above, Lemma 9(2) follows. □

      By Lemma 9, one can easily extend an effective syntax for boundedly evaluable RA queries, e.g., covered RA in[6], to an effective syntax for boundedly evaluable RAaggr queries.

      One might think that such an extension is also possible for RAaggr. However, when group-by aggregation is nested with other RAaggr operators, a convenient extension is beyond reach. It is much harder for RAaggr to characterize propagation of values from aggregate sub-queries to other relations, or to cover all boundedly evaluable queries up to equivalence.

      Nonetheless, $ {\cal L}_{\cal B}[{ {\rm{{RA}}}}] $ is more expressive than the class of covered RA of [6], which is also an effective syntax for RA.

      Proposition 10. For any bag access schema $ {\cal B} $, the set of RA queries covered by $ {\cal B} $ is properly contained in $ {\cal L}_{\cal B}[{ {\rm{{RA}}}}] $.

      Proof. One can verify that covered RA queries[6] can be expressed in $ {\cal L}_{\cal B}[{ {\rm{{RA}}}}] $ without rule γ4. Hence it is a subclass of $ {\cal L}_{\cal B} $. To see it is a proper subclass, consider an $ {\cal L}_{\cal B}[ { {\rm{{RA}}}}] $ query Q over relations R(A, B) and S(C, D): $ Q = \pi_{D}((\sigma_{A = 1}R_{1} - \sigma_{B = 1} R_{2}) \Join_{A = C} S) $, where $ R_{1} $ and $ R_{2} $ rename R. Consider $ {\cal B} $ consisting of $ R (\!| A \rightarrow B, N_{1} |\!) $, $ R (\!| B \rightarrow A, N_{2} |\!) $ and $ S (\!| C \rightarrow D, N_{3} |\!) $. One can verify that Q is not covered by $ {\cal B} $ since subquery S is not covered (see [6]). However, $ Q\in {\cal L}_{\cal B}[{ {\rm{{RA}}}}] $.□

    • In this section, we show how to extend DBMS with the functionality of bounded evaluation. We first present such a framework (Section 5.1). We then provide algorithms underlying the framework, for checking the bounded evaluability (Section 5.2) and generating bounded plans (Section 5.3).

    • The framework, referred to as BEAS, is shown in Fig. 2. Given an application that involves queries over instances of a database schema $ {\cal R} $, BEAS works as follows.

      Figure 2.  ${ {\mathsf{{BEAS}}}}$: Bounded evaluation on ${ {\mathsf{{DBMS}}}}$

      Offline preprocessing. As shown in C1 of Fig. 2, as offline preprocessing, BEAS discovers a bag access schema $ {\cal B} $ from (sample) instances of $ {\cal R} $, builds indices of $ {\cal B} $ on the database $ {\cal D} $ of $ {\cal R} $ in use, and maintains $ {\cal B} $ in response to updates to $ {\cal D} $.

      Online processing. When a user poses an RAaggr> query Q on $ {\cal D} $, BEAS first checks whether Q is boundedly evaluable under $ {\cal B} $ (>C2). If so, it generates a bounded query plan ξ for Q under $ {\cal B} $ (C3), which is interpreted as an SQL query Qξ and hence can be directly executed by the underlying DBMS on a bounded dataset $ {\cal D}_Q $ identified by plan ξ (C4). If Q is not boundedly evaluable, it generates a query plan ξ' for Q that is partially bounded, to make maximal use of access constraints in $ {\cal B} $ (C5). The (partially) bounded plans are optimized and executed by DBMS (C4).

      Note that the BEAS framework does not need to change the underlying DBMS. Indeed, it interacts with the DBMS via SQL only. Hence, BEAS can be built on top of any existing DBMS, providing a bounded evaluation capacity.

      BEAS can also compute approximate answers to unbounded queries under constrained resources, and offers deterministic accuracy guarantees under access schema[17]. We focus on computing exact answers in this paper.

      Below we develop algorithms for components C2 and C3 of BEAS in Section 5.2 and Section 5.3, respectively.

    • We next develop a practical algorithm for component C2 of BEAS. Under a bag access schema $ {\cal B} $, given an RAaggr query Q, it decides whether Q is boundedly evaluable.

      To do this, we first checks whether Q and $ {\cal B} $ fall into the two classes of special cases, i.e., $ {\cal C}^{{\rm{{P}}}} $ or $ {\cal C}^{{\rm{{NP}}}} $, in PTIME. If so, their bounded evaluability can be decided efficiently as shown in the proofs of Theorems 2 and 4. Otherwise, we check whether Q is in the effective syntax $ {\cal L}_{\cal B} $ for RAaggr (Section 4). Below we give a PTIME algorithm for this.

      Algorithm BEChk. The algorithm, denoted by BEChk, is shown in Algorithm 1. It first sets $ {\rm{{BA}}}(Q,{\cal B}) $, $ { {\rm{{BR}}}}(Q,{\cal B}) $ and $ { {\rm{{BQ}}}}(Q,{\cal B}) $ to $ \emptyset $. It then iteratively updates $ {\rm{{BA}}}(Q, {\cal B}) $, $ { {\rm{{BR}}}}(Q, {\cal B}) $ and $ { {\rm{{BQ}}}}(Q, {\cal B}) $ using the rules in Fig. 1. In each iteration, it

      (a) first computes $ {\rm{{BA}}}(Q, {\cal B}) $ using γ1, γ2 and γ3 (line 3);

      (b) then updates $ { {\rm{{BR}}}}(Q, {\cal B}) $ using γ4 (line 4); and

      (c) it finally updates $ { {\rm{{BQ}}}}(Q, {\cal B}) $ using γ5 (line 5).

      The iteration continues until $ { {\rm{{BQ}}}}(Q, {\cal B}) $ can no longer be updated (line 6). It returns “Yes” if $ Q\in { {\rm{{BQ}}}}(Q, {\cal B}) $ and “No” otherwise (lines 7–8). In each iteration, steps (b) and (c) are straightforward. Below we discuss step (a) in more details.

      Algorithm 1. BEChk

      Input: Relational schema $ {\cal R} $, RAaggr Q and bag access schema $ {\cal B} $ over $ {\cal R} $.

      Output: “Yes” (“No”) if Q is (is not) boundedly evaluable under $ {\cal B} $.

      1 $ {\rm{{BA}}}(Q,{\cal B}) \leftarrow\emptyset $; $ { {\rm{{BR}}}}(Q,{\cal B}) \leftarrow\emptyset $; $ { {\rm{{BQ}}}}(Q, {\cal B}) \leftarrow \emptyset $; flag ← true;

      2 while flag = true do

      3  compute $ {\rm{{BA}}}(Q,{\cal B}) $ using γ1, γ2, γ3 // recall γi in Table 1

      4  compute $ { {\rm{{BR}}}}(Q,{\cal B}) $ using γ5;

      5  compute $ { {\rm{{BQ}}}}(Q,{\cal B}) $ using γ6;

      6  if $ { {\rm{{BQ}}}}(Q,{\cal B}) $ is not changed then flag ← false;

      7  if $ Q\in{ {\rm{{BQ}}}}(Q,{\cal B}) $ then return “Yes”;

      8 else return “No”;

      Computing $ {\rm{{BA}}}(Q, {\cal B}) $ (line 3 of Algorithm 1). In each iteration, $ {\rm{{BA}}}(Q, {\cal B}) $ is updated in two steps, as follows.

      (1) Building universal relation. We first build a “universal schema” UQ of Q w.r.t. $ { {\rm{{BQ}}}}(Q, {\cal B}) $, by mapping attributes and aggregate fields of Q to attributes of UQ, via a mapping function ρ. For any two attributes R[A] and S[B] of Q, ρ(R[A]) = ρ(S[B]) if and only if $ \Sigma_{Q}\vdash R[A] = S[B] $ is in the selection condition of Q. For aggregate field agg(A) and attribute R[B], agg(A)) = ρ(R[B]) only when agg(A) is in $ Z_{Q_{s}} $ (recall Table 1) for some $ Q_{s}\in { {\rm{{BQ}}}}(Q, {\cal B}) $. Accordingly, bag constraints in $ {\cal B} $ are also mapped on to UQ by ρ.

      (2) Computing fetch closure. We then reduce the computation of $ {\rm{{BA}}}(Q, {\cal B}) $ to the computation of fetch closures over UQ with $ {\cal B} $ w.r.t. ρ. For a set W of attributes of UQ, its fetch closure, denoted by $ W^{{\cal B}} $, is a set of attributes of UQ such that

      (i) $ W\subseteq W^{{\cal B}} $;

      (ii) if $ X'\subseteq W^{{\cal B}} $ and $ \varphi = R (\!| X \rightarrow Y, N |\!) \in {\cal B} $ such that $ \rho(R[X]) = X' $ and $ \rho(R[Y]) = Y' $, then $ Y'\subseteq W^{{\cal B}} $; and

      (iii) $ W^{{\cal B}} $ contains nothing else.

      Let $ W = \rho(X_{Q}^{c}) \cup \rho({\rm{{BA}}}(Q, {\cal B})) \cup \bigcup_{Q_{s}\in { {\rm{{BQ}}}}(Q, {\cal B})}\rho(Z_{Q_{s}}) $. We set $ {\rm{{BA}}}(Q, {\cal B}) = \{A\in X_{Q}\,|\, \rho(A)\in W^{{\cal B}}\} $ (see ZQ, XQ in Table 1).

      Example 8. Recall Q2 from Example 2 and bag access schema $ {\cal B}_1 $ from Example 4. Algorithm BEChk iteratively updates BA, BR and BQ for Q2 and $ {\cal B}_{1} $, which are all $ \emptyset $ initially.

      In the first iteration, BEChk starts by updating $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ (line 3). To do this, it builds a universal relation $ U_{Q_{2}} = \{{\rm{{f.uid}}}, {\rm{{f.fid}}}, {\rm{{c.cty}}}, {\rm{{c.date}}}\} $ via function ρ that maps c.uid to f.fid and keeps all other attributes intact (f and c stand for friend and checkin, respectively). Since $ \rho(X_{Q_{2}}^{c}) = \{{\rm{{f.uid}}}, {\rm{{c.cty}}}\} $, $ \rho({\rm{{BA}}}(Q_{2}, {\cal B}_{1})) = \emptyset $ and $ { {\rm{{BQ}}}}(Q_{2}, {\cal B}_{1}) = \emptyset $, BEChk sets W to $ \rho(X_{Q_{2}}^{c}) $ and computes the fetch closure $ W^{{\cal B}_{1}} $ of W, yielding $ W^{{\cal B}_{1}} = \{{\rm{{f.uid}}}, {\rm{{f.fid}}}, {\rm{{c.cty}}}\} $. Hence it updates $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ to {f.uid, f.fid, c.cty}. It then updates $ { {\rm{{BR}}}}(Q_{2}, {\cal B}_{1}) $ to {f, c} since all nontrivial attributes of f and c are already in $ {\rm{{BA}}}(Q_{2}, {\cal B}_{1}) $ (line 4). BEChk finally updates $ { {\rm{{BQ}}}}(Q_{2}, {\cal B}_{1}) $ to {Q3, Q2} (line 5) and terminates in next iteration and returns “Yes”.

      It gets more involved for Q6 from Example 5 and $ {\cal B}_{2} $ from Example 4. In the first iteration, BEChk builds a universal schema $ U_{Q_{6}} = \{A, B, C, E, F, W, F'\} $ via a mapping function $ \rho $ that keeps attributes of R and S and maps aggregate field (i.e., the output) sum(y) of Q7 to F'. Note that $ \rho $ does not map sum(y) to E although $ \Sigma_{Q_{6}} \vdash { {\rm{{sum}}}}(y) = E $ since $ Q_{7} \not\in { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) = \emptyset $ yet. BEChk then computes the fetch closure of $ X_{Q_{6}}^{C} $ over $ U_{Q_{6}} $ and sets $ {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $ to {B, C, F}. It then finds that all nontrivial attributes of R are in $ {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $ and hence updates $ { {\rm{{BR}}}}(Q_{6}, {\cal B}_{2}) $ to $ \{R\} $. Consequently, it sets $ { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) $ to Q7 as well. In the second iteration, BEChk builds an updated universal relation $ U_{Q_{6}} = \{A, B, C, E, F, W\} $ since $ Q_{7} \in { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) $ and $ \Sigma_{Q_{6}} \vdash { {\rm{{sum}}}}(y) = E $. It continues to update $ {\rm{{BA}}}(Q_{6}, {\cal B}_{2}) $ to {B, C, E, F, W}, $ { {\rm{{BR}}}}(Q_{6}, {\cal B}_{2}) $ to {R, S} and $ { {\rm{{BQ}}}}(Q_{6}, {\cal B}_{2}) $ to {Q7, Q6}. It terminates after the third iterations and returns “Yes” for Q6 under $ {\cal B}_{2} $.

      Correctness & Complexity. To see that BEChk correctly checks the effective syntax $ {\cal L}_{\cal B} $ of Fig. 1, observe the following. (1) For any fixed $ { {\rm{{BQ}}}}(Q, {\cal B}) $, the corresponding $ {\rm{{BA}}}(Q, {\cal B}) $ decided by rules γ1, ··· , γ4 is exactly the fetch closure $ W^{{\cal B}} $ (recall the definition of $ W^{{\cal B}} $ above). (2) The while loop propagates changes from BA to BR and to BQ, and finally to BA again, until reaching a fixed point w.r.t. the rules of Fig. 1.

      BEChk can be implemented in $ O(p_{Q}(|\!|{{Q}}|\!||{{{\cal B}}}| + |{{Q}}|)) $ time, where pQ is the number of sub-queries in Q, ||Q|| is the number of relation atoms in Q, |Q| is the number of attributes and aggregate fields in the relation atoms and predicates of Q, $ |\!|{{{\cal B}}}|\!| $ and $ |{{{\cal B}}}| $ are the number and total length of bag constraints in $ {\cal B} $, respectively (see Table 1). Indeed, computing the fetch closure can be implemented in $ O(|\!|{{Q}}|\!||\!|{{{\cal B}}}|\!|) $-time, and hence each while iteration is in $ O(|\!|{{Q}}|\!||\!|{{{\cal B}}}|\!| + |{{Q}}|) $ time; there are at most pQ iterations.

      Algorithm BEChk provides a constructive proof for property (c) of the effective syntax $ {\cal L}_{\cal B} $ in Theorem 7, i.e., it is in PTIME to check whether $ Q \in {\cal L}_{\cal B} $ for an RAaggr query Q.

      This also completes the proof of Theorem 7.

    • We next provide an algorithm underlying component C3 of BEAS, denoted by BPlan. Given a bag access schema $ {\cal B} $ and an RAaggr query Q that is determined to be boundedly evaluable under $ {\cal B} $ by BEChk of Section 5.2, BPlan generates a bounded RAaggr query plan for Q under $ {\cal B} $.

      Algorithm BPlan. Given a boundedly evaluable RAaggr query $ Q\in {\cal L}_{\cal B} $ (see Section 4), BPlan generates a bounded plan ξQ for Q under $ {\cal B} $ as follows: (1) fetch a bounded amount of data for each relation R that appears in Q, and (2) carry out operations of Q over fetched data. While step (2) is straightforward, step (1) is rather involved.

      To carry out step (1), BPlan generates bounded logical access paths (bLAPs). A bLAP ξR for a relation R in Q fetches all values (partial tuples) of R that are necessary for evaluating Q with $ {\cal B} $; moreover, ξR is a bounded RAaggr plan under $ {\cal B} $. Intuitively, bLAPs play the same role as conventional DBMS access paths. But instead of accessing complete tuples by scan or index, bLAPs fetches values (partial tuples) using $ {\cal B} $ such that the amount of data accessed is bounded.

      More specifically, we give an algorithm, denoted by BAP, as a sub-procedure of BPlan to find a bLAP ξR for R under $ {\cal B} $. While there may exist exponentially many such bLAPs, BAP aims at computing those with minimum cost.

      After BAP computes bLAP ξR for every relation R in Q, algorithm BPlan generates a bounded plan ξQ for Q under $ {\cal B} $, by replacing each R in Q with its bLAP ξR, and by carrying out RAaggr operations of Q on the data retrieved by ξR.

      In the rest of the section, we focus on algorithm BAP.

      Parametric cost measures. To evaluate the quality of bLAPs found by BAP, we start with a generic class of cost functions. Conventional access path measures assess the cost of physical table-access methods, e.g., sequential scan and index scan[11]. These metrics do not apply to bLAPs, which involve, e.g., fetch and joins. Hence, BAP employs a generic cost function c(ξR) that takes user specified functions as parameters, to express various cost measures over ξR as bLAPs, e.g., output size, data access, etc.

      The cost of a bLAP ξR for R under $ {\cal B} $, denoted by c(ξR), is inductively defined in Table 2, with five user configurable parameter functions $ \varGamma_{\Join} $, $ \varGamma_{U} $, $ \varGamma_{-} $, $ \varGamma_{ { {\rm{{fetch}}}}} $ and $ \varGamma_{{\rm{{gpBy}}}} $.

      Table 2.  $ {{c}}() $ with parameters $ \varGamma_{\Join/-/\cup/ { {\rm{{fetch}}}}/{\rm{{gpBy}}}} $

      ${ { { \rm{ { bLAP } } } } }\;$${ {{\xi_{R}}} }$${ {{c}}({{\xi_{R}}}) }$of ${ {{\xi_{R}}} }$
      ${ \{c\} }$or ${ \emptyset }$1
      ${ \sigma_{C}(\xi') }$${ {{c} }(\xi')\times{\lambda_{\sigma} }(C) }$
      ${ \xi_{1}\Join_C \xi_{2} }$${ {\varGamma_{\Join} }( {{c} }(\xi_{1}), {{c} }(\xi_{2}))\times\lambda_\Join(C) }$
      ${ \xi_{1}-\xi_{2} }$${ {\varGamma_{-}}( {{c}}(\xi_{1}), {{c}}(\xi_{2})) }$
      ${ \xi_{1}\cup \xi_{2} }$${ {\varGamma_{\cup}}( {{c}}(\xi_{1}), {{c}}(\xi_{2})) }$
      ${ \pi_{Y}(\xi') }$${ {{c}}(\xi') }$
      ${ { \rm{ { gpBy } } }(\xi', X, \overline{ {\\rm{ {agg} } }(V)}) }$${ \varGamma_{ {\rm{ {gpBy} } } }( {{c} }(\xi'), \lambda_\pi(X)) }$
      ${ { { \rm{ { fetch } } } }(\xi', \varphi) }$with ${ \varphi = R (\!| X \rightarrow Y, N |\!) }$${ {\varGamma_{ { {\rm{ {fetch} } } } } }( {{c} }(\xi'), N) }$

      By parameterizing these user configurable functions, we can support various measures for bLAPs. For example, to estimate the worst-case output size of ξ, we simply set (i) $ \varGamma_{\Join}(c_{1}, c_{2}) $ to $ c_{1}*c_{2} $, (ii) $ \varGamma_{ { {\rm{{fetch}}}}}(c', N) $ to $c'\times N$, (iii) $ \varGamma_{\cup}(c_{1}, c_{2}) $ to $ c_{1}+c_{2} $, (iv) $ \varGamma_{-}(c_{1}, c_{2}) $ to c1, and (v) $ \varGamma_{{\rm{{gpBy}}}}(c', c) $ to c' if $ c\neq 0 $ and to 1 otherwise (assume $ \lambda_{\pi}(X) = 0 $ when $ X = \emptyset $).

      Algorithm BAP. Algorithm BAP works in two steps:

      (1) it reduces bLAPs to proofs of $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $ and encodes all proofs with a directed graph $ {\cal G}(Q, {\cal B}) $ in PTIME; and

      (2) it searches $ {\cal G}(Q, {\cal B}) $ to find proofs with minimum cost, where a proof corresponds to a subgraph in the search trace.

      Here a proof of $ R \in { {\rm{{BR}}}}(Q, {\cal B}) $ is a sequence of applications of the rules given in Fig. 1. Each step of the proof corresponds to one or several operations in a bLAP ξR for R.

      Below we outline BAP (see Appendix for its pseudo code).

      (1) Reduction. It reduces the problem of generating bLAPs for R of Q under $ {\cal B} $ to finding proofs of $ R\in { {\rm{{BR}}}}(Q,{\cal B}) $. It encodes all proofs of $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $ (hence all bLAPs for R) in a weighted directed graph $ {\cal G}(Q, {\cal B}) $, where nodes encode (a) attributes R[X] and R[XY] in constraints $ R (\!| X \rightarrow Y, N |\!) \in {\cal B} $, and (b) relations and sub-queries of Q. Edges encode value propagation among them. It ensures that each proof of $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $ is encoded by a traversal from a dummy node $ u_{\emptyset} $ to node uR encoding R in $ G(Q, {\cal B}) $. Graph $ G(Q, {\cal B}) $ has at most $ 2|\!|{{B}}|\!| + |Q| $ nodes and $ |\!|{{B}}|\!|(|\!|{{B}}|\!|+|Q|) $ edges.

      We illustrate reduced graphs with the following example, and defer the construction details to Appendix.

      Example 9. Recall RAaggr queries Q2 of Example 2 and Q6 of Example 5, and bag access schemas $ {\cal B}_{1} $ and $ {\cal B}_{2} $ from Example 4. Graphs $ G(Q_{2}, {\cal B}_{1}) $ and $ G(Q_{6}, {\cal B}_{2}) $ are shown in Fig. 3. Here $ u_{\emptyset} $ is a dummy node connected to all constant attributes in Q. Edges with numeric weights are to encode deduction steps with rule γ3 of Fig. 1, where the weights are the cardinality N's of the corresponding access constraints.

      Figure 3.  Reduced graphs for Example 9

      As will be show below, proofs of a relation $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $ can be encoded as traversals from $ u_\emptyset $ to uR in $ G(Q, {\cal B}) $.

      (2) Conditional Dijkstra search. Algorithm BAP then adopts a Dijkstra-like search over $ G(Q, {\cal B}) $, from the dummy node $ u_{\emptyset} $ to the relation node uR encoding R, such that the trace of the search encodes a bLAP (i.e., proof) for R under $ {\cal B} $.

      It extends Dijkstra algorithm[24] as follows.

      (a) Conditional expansion. Denote by Uu the attribute, relation or sub-query encoded by a node u in $ G(Q, {\cal B}) $. Note that Uu may be deduced from attributes or sub-queries encoded by multiple predecessors of u as preconditions in the proof of $ R\in { {\rm{{BR}}}}(Q,{\cal B}) $. To capture this, BAP visits a new node u under the condition that Uu can be obtained from predecessors of u via, e.g., joins or fetch.

      With the condition, BAP ensures that for node u encoding relation $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $, a traversal in $ G(Q, {\cal B}) $ from $ u_\emptyset $ to u encodes a bLAP for relation atom R in query Q under $ {\cal B} $. To illustrate this, let us consider Example 10.

      Example 10. A proof of $ {\rm{{f}}}\in { {\rm{{BR}}}}(Q_{2}, {\cal B}_{1}) $ (here f denotes friend) consists of the deduction steps (1), (2) and (4) in Example 7. It is encoded by the unique path from $ u_{\emptyset} $ to uf in $ G(Q_{2}, {\cal B}_{1}) $; similarly for the proof of $ {\rm{{c}}}\in { {\rm{{BR}}}}(Q_{2}, {\cal B}_{1}) $.

      A more informative example is $ S\in { {\rm{{BR}}}}(Q_{6}, {\cal B}_{2}) $. Its proof is also described in Example 7. The proof is encoded by the traversal from $ u_{\emptyset} $ to uS, which is $ G(Q_{6}, {\cal B}_{2}) $ without the edge $ (u_{s}, u_{Q_{6}}) $ as shown in Fig. 3. Note that although there are two simple paths from $ u_{\emptyset} $ to uS, none of them is a valid traversal because of the conditional expansion.

      The bLAPs encoded by these proofs are exactly sub-plans of Q2 and Q6 given in Example 5. Indeed, the bLAP ξc for relation c of Q2 under $ {\cal B}_{1} $ is (T1, T2) of the bounded plan $ \xi_{Q_{2}} $ in Example 5, and the bLAP ξf for f is simply T1. Similarly, the bLAP ξR for R of Q6 under $ {\cal B}_{2} $ is simply T1 of $ \xi_{Q_{4}} $ in Example 5; and bLAP ξS for S is (T1, T2, T3) of $ \xi_{Q_{4}} $.

      Note that a bLAP may involve multiple relations via fetch and join, e.g., ξS for S of Q6. Hence its costs cannot be assessed by traditional access path measures since those methods are developed for evaluating the access method of a single relation via, e.g., sequential or index scan.

      (b) Search revision. Note that the output of an RAaggr sub-query can be used to fetch attributes that have already been deduced, possibly with a smaller cost reduced by c(ξR). To retain the optimality of the search, when visiting a node u that encodes a sub-query of Q, algorithm BAP checks whether this yields a better bLAP by starting a new search from u and marking all nodes as unvisited. It terminates if it cannot further improve the previously searched bLAPs.

      Example 11. Continuing with Example 9, assume that we use c(ξR) to express the worst-case output size of ξR (recall its parameter functions described earlier). Then BAP computes exactly the bLAPs for Q2 and Q6 under $ {\cal B}_{1} $ and $ {\cal B}_{2} $, respectively, as described in Example 10. In particular, $ {{c}}(\xi_{\rm{f}}) =5\;000 $ and$ {{c}}(\xi_{{\rm{{c}}}}) = 5\;000 \times 193 $for Q2 under $ {\cal B}_{1} $; $ {{c}}(\xi_{R}) = 10 $ and $ {{c}}(\xi_{S}) = 10 $ for $ Q_{6} $ under $ {\cal B}_{2} $. In this case, when computing $ \xi_{S} $ for relation S of $ Q_{6} $ under $ {\cal B}_{2} $, it restarts the search once due to the aggregate subquery Q7, which does not improve the bLAPs ξR and ξS.

      Correctness & Complexity. The correctness of algorithm BAP is warranted by the following: (1) each search trace of BAP encodes a proof of $ R\in { {\rm{{BR}}}}(Q,{\cal B}) $; and (2) a proof of $ R\in { {\rm{{BR}}}}(Q, {\cal B}) $ encodes a bLAP for R under $ {\cal B} $. BAP can be implemented in $ O(|Q||{\cal B}|(|\!|{{{\cal B}}}|\!| + |Q|\log(|Q| +2|\!|{{{\cal B}}}|\!|))) $-time (ignoring the complexity of parameter functions of c(ξR)). One can verify that BAP restarts at most N times, where N is the number of nodes in $ G(Q, {\cal B}) $.

      Optimality. Algorithm BAP is able to find optimal bLAPs for a large class of parameter functions for c(ξR). We defer detailed proofs of this optimality to Appendix.

    • We have developed BEAS@PG by extending PostgreSQL with bounded evaluation. Using a benchmark and two real-life datasets, we conducted four sets of experiments to evaluate (1) the overall performance of BEAS@PG vs PostgreSQL; and the effectiveness of bounded evaluation for (2) bounded queries and (3) unbounded queries.

      Experimental setting. We start with the setting.

      Bench mark. We used TPCH benchmark[15]. It uses TPCHdbgen to generate 8 relations with 61 attributes of different scales. It contains 22 built-in benchmark queries.

      Real-life datasets. We also used two real-life datasets.

      (a) US Air carriers (AIRCA) records flight and statistic data of US air carriers. It consists of Flight On-Time Performance Data[25] for departure and arrival information, and Carrier Statistic data[26] for airline market and segment data of the air carriers. It has 3 tables, 200 attributes, and about 16 GB of data with records from 1990 to 1997.

      (b) UK MOT data (UKMOT) integrates the anonymised data[27] that records MOT tests and outcomes, and the roadside survey of vehicle observations[28] that includes vehicles passing observation points in the UK. It has 3 tables with 42 attributes, about 16 GB of data from 2007 to 2011.

      Queries. To test the impact of query structures on the effectiveness of bounded evaluation, we designed a generator to generate queries with different structures over the two real-life datasets. More specifically, we manually created 30 query templates for each of the two datasets (Q1–Q15 are boundedly evaluable and Q16–Q30 are unbounded), with 0 to 4 joins. The generator populates these templates by randomly instantiating parameters in the templates with values from the datasets, yielding 150 queries for each real-life dataset.

      Access schema. We built access schemas with 59, 18 and 14 access constraints over TPCH, AIRCA and UKMOT, respectively. We extended TANE[29], an algorithm for discovering functional dependencies, to first find candidate constraints $ \varphi = R (\!| X \rightarrow Y, N |\!) $ on small sample datasets of 100 MB, and ranked them by their cardinalities N′s. We then checked whether their N′s are insensitive to the size of datasets D, by varying the size of D, e.g., 200 MB and 500 MB. We picked those access constraints with small and size-insensitive N′s, such that the total size of the indices is at most 3 times of the size of its D.

      Configuration. For DBMS, we used PostgreSQL 9.6 with all optimization enabled (BEAS@PG is built with PostgreSQL 9.6). In favor of PostgreSQL, besides indices for access constraints, we also built the following extra indices for PostgreSQL: (1) for each access constraint $ R (\!| X \rightarrow Y, N |\!) $, we built a B-tree index on attributes X over R as well; (2) we built all primary key and foreign key indices; and (3) we also built B-tree on numerical attributes. Note that these were only for PostgreSQL, not built for BEAS@PG. We set the cost measure parameters of BEAS@PG as the worst-case output size estimation (recall Section 5.3).

      The experiments were conducted on an Amazon EC2 Dense-storage instance m4.xlarge, with 16 GB of memory, 4 Intel Xeon E5-2676 vCPUs, and 500 GB of EBS SSD storage. Both the plan generation time and the execution time of the generated plans are included in evaluation time. All the experiments were run 3 times. The average is reported here.

      Experimental results. We next report our findings.

      Exp-1: Overall performance. We first report the evaluation time of 22 TPCH queries over 16 GB of TPCH data, and the 60 query templates over the entire AIRCA and UKMOT datasets, where evaluation time of a query template is the average of the evaluation time of its 5 instantiated queries.

      (1) Index size. The indices of all the access constraints over TPCH, AIRCA and UKMOT account for 2.98, 0.01 and 0.25 times of the size of the datasets, respectively; the additional indices built only for PostgreSQL (in favor of the conventional DBMS) are of size 2.21, 0.87 and 1.5 times of that of TPCH, AIRCA and UKMOT, respectively.

      (2) Query overview. None of the TPCH queries is boundedly evaluable under the access constraints selected. This is because the TPCH data generator scales cardinalities N's of almost all candidate access constraints $ R (\!| X \rightarrow Y, N |\!) $ due to its simple scaling up strategy. This rules out most of the candidate constraints when we scale up to larger datasets while using a fixed threshold for N. For the 60 query templates over AIRCA and UKMOT, 30 of them are boundedly evaluable under the access constraints used, 15 for each dataset. Note that one could build more access constraints to allow more bounded queries. We will evaluate the performance of BEAS@PG for bounded and unbounded queries in more details in Exp-2 and Exp-3, respectively.

      (3) Performance. The results for TPCH, AIRCA and UKMOT are reported in Tables 35, respectively.

      Table 3.  TPCH query evaluation time on 16 GB (ms)

      QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11
      PostgreSQL ${ t_{ {\rm{ {P} } } } }$8.16 × 1053.68 × 1041.02 × 1051.56 × 1041.50 × 1052.07 × 1049.53 × 1043.71 × 1045.03 × 1042.28 × 1057.05 × 104
      BEAS@PG ${ t_{ {\rm{ {B} } } } }$5.72 × 1041.67 × 1042.97 × 1041.45 × 1041.43 × 1054.25 × 1037.65 × 1043.43 × 1043.24 × 1048.19 × 1043.46 × 103
      Speedup ${ t_{ {\rm{ {P} } } }/t_{ {\rm{ {B} } } } }$14.282.213.441.071.054.881.241.081.552.7920.39
      QueriesQ12Q13Q14Q15Q16Q17Q18Q19Q20Q21Q22
      PostgreSQL ${ t_{ {\rm{ {P} } } } }$6.42 × 1041.79 × 1051.74 × 1045.10 × 1044.46 × 1046.04 × 1032.83 × 1057.27 × 1031.06 × 1052.66 × 1057.74 × 103
      BEAS@PG ${ t_{ {\rm{ {B} } } } }$8.58 × 1031.20 × 1051.10 × 1042.20 × 1043.02 × 1041.49 × 1022.38 × 1051.29 × 1032.70 × 1031.04 × 1055.70 × 103
      Speedup ${ t_{ {\rm{ {P} } } }/t_{ {\rm{ {B} } } } }$7.481.491.582.321.4840.461.195.6539.162.551.36

      Table 4.  Average query template evaluation time on AIRCA (ms)

      QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10
      PostgreSQL ${ t_{{\rm{{P}}}} }$7.00 × 1037.2 × 1032.06 × 1049.04 × 1044.44 × 1049.41 × 1041.36 × 1051.44 × 1059.41 × 1041.38 × 105
      BEAS@PG ${ t_{{\rm{{B}}}} }$1.221.190.642.632.616.344.286.696.024.98
      Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$5.75 × 1036.04 × 1033.21 × 1043.44 × 1041.70 × 1041.48 × 1043.19 × 1042.15 × 1041.57 × 1042.77 × 104
      QueriesQ11Q12Q13Q14Q15Q16Q17Q18Q19Q20
      PostgreSQL ${ t_{{\rm{{P}}}} }$1.46 × 1059.46 × 1041.49 × 1051.56 × 10510.63 × 1044.46 × 1044.65 × 1044.22 × 1044.25 × 1049.29 × 104
      BEAS@PG ${ t_{{\rm{{B}}}} }$7.2411.426.558.511.6724.9624.32.844.63 × 1027.26 × 102
      Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$2.01 × 1048.29 × 1032.29 × 1041.84 × 1049.11 × 1031.79 × 1031.91 × 1031.48 × 10491.861.29 × 102
      QueriesQ21Q22Q23Q24Q25Q26Q27Q28Q29Q30
      PostgreSQL ${ t_{{\rm{{P}}}} }$9.30 × 1048.98 × 1049.44 × 1049.41 × 1041.39 × 1051.47 × 1059.58 × 1041.75 × 1051.85 × 1051.31 × 105
      BEAS@PG ${ t_{{\rm{{B}}}} }$5.43 × 1024.61 × 1027.25 × 1025.23 × 1021.4 × 1031.74 × 1031.36 × 1032.67 × 1043.06 × 1042.62 × 104
      Speedup ${ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} }$1.71 × 1021.94 × 1021.31 × 1021.80 × 10299.5284.4970.356.546.065.02

      Table 5.  Average query template evaluation time on UKMOT (ms)

      QueriesQ1Q2Q3Q4Q5Q6Q7Q8Q9Q10
      PostgreSQL $ t_{{\rm{{P}}}} $1.37 × 1056.47 × 1047.80 × 1044.62 × 1054.65 × 1053.86 × 1055.8 × 1053.91 × 1055.72 × 1055.99 × 105
      BEAS@PG $ t_{{\rm{{B}}}} $0.552.730.6216.1375.25.621.50 × 1023.78 × 10254.791.89 × 102
      Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $2.52 × 1052.38 × 1041.25 × 1052.86 × 1046.18 × 1036.88 × 1043.88 × 1031.04 × 1031.04 × 1043.18 × 103
      QueriesQ11Q12Q13Q14Q15Q16Q17Q18Q19Q20
      PostgreSQL $ t_{{\rm{{P}}}} $4.09 × 1055.91 × 1057.80 × 1055.91 × 1057.74 × 1051.61 × 1051.59 × 1051.53 × 1053.97 × 1053.93 × 105
      BEAS@PG $ t_{{\rm{{B}}}} $3.89 × 10255.421.89 × 1023.94 × 10296.2430.4965.092.517.4 × 1037.41 × 103
      Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $1.06 × 1031.07 × 1044.13 × 1031.49 × 1038.05 × 1035.28 × 1032.44 × 1036.10 × 10453.5553.14
      QueriesQ21Q22Q23Q24Q25Q26Q27Q28Q29Q30
      PostgreSQL $ t_{{\rm{{P}}}} $3.93 × 1056.01 × 1053.94 × 1055.98 × 1056.26 × 1054.13 × 1056.19 × 1057.04 × 1054.86 × 1056.96 × 105
      BEAS@PG $ t_{{\rm{{B}}}} $9.25 × 1037.51 × 1037.72 × 1039.34 × 1037.62 × 1037.82 × 1037.38 × 1036.42 × 1047.65 × 1046.26 × 104
      Speedup $ t_{{\rm{{P}}}}/t_{{\rm{{B}}}} $42.4880.0751.1264.0882.1452.8783.8110.956.3711.12

      (a) BEAS@PG outperforms PostgreSQL on each and every query on all the three datasets, when all indices are enabled for PostgreSQL. It is 1.11 $\times\!\;10^4$ times faster on average.

      (b) Even though all TPCH queries are unbounded, over 16 GB of TPCH data, BEAS@PG is up to 40.46 times faster than PostgreSQL, and is on average 7.32 times faster. For unbounded queries over AIRCA and UKMOT, BEAS@PG is on average 1.32 $\times\!\;10^3$ and 4.61 $\times\!\;10^3$ times faster than PostgreSQL, respectively, up to 1.48 $\times\!\;10^4$ and 6.10 $\times\!\;10^4$ times.

      (c) For bounded queries, BEAS@PG is 1.79 $\times\!\;10^4$ and 3.66 $\times\!\;10^4$ times faster than PostgreSQL on AIRCA and UKMOT, respectively, up to 3.44 $\times\!\;10^4$ and 2.52 $\times\!\;10^5$ times.

      The results show that with a modest number (and size) of access constraints, BEAS@PG can speed up PostgreSQL on both bounded queries and unbounded queries, when all relevant indices are enabled for PostgreSQL, including those of access constraints and additional indices tailored for PostgreSQL. This verifies the effectiveness of bounded evaluation for generic queries, bounded or not, while the speedup is much larger for bounded queries, as expected.

      Below we report more in-depth evaluation results for BEAS@PG versus PostgreSQL (with additional indices) for bounded queries (Exp-2) and unbounded queries (Exp-3).

      Exp-2: Effectiveness for bounded queries. We next evaluated the impact of datasets D and queries Q on the evaluation time of BEAS@PG and PostgreSQL (with indices enabled), when queries Q are boundedly evaluable.

      Varying |D|. To evaluate the impact of |D|, we partitioned AIRCA and UKMOT datasets by their date attributes (year and month), yielding subsets of sizes from 1 GB to 16 GB, consistent with how we scale up TPCH datasets when testing unbounded queries below in Exp-3. We did not use TPCH here since it has no boundedly evaluable queries.

      As shown in Figs. 4(a) and 4(b), (a) the evaluation time of BEAS@PG is indifferent to the size of D, as expected for boundedly evaluable queries. (b) Bounded query plans work well with large D. Indeed, BEAS@PG took less than 11.67 ms and 3.94$ \times\!10^2 $ ms for all queries over all subsets of AIRCA and UKMOT, respectively, no matter how large the datasets were. In contrast, even on the subsets of AIRCA and UKMOT of size 8 GB, PostgreSQL took 8.45 $\times\!\;10^4$ ms and 3.88 $\times\!\;10^5$ ms, respectively, up to 1.58 $\times\!\;10^5$ ms and 7.80 $\times\!\;10^5$ ms over the full datasets. That is, PostgreSQL is 1.35 $\times\!\;10^4$ and 1.98 $\times\!\;10^3$ slower than BEAS@PG on AIRCA and UKMOT, respectively, even with all relevant indices built. The larger the dataset is, the bigger the gap between PostgreSQL and BEAS@PG is for bounded queries.

      Figure 4.  Effectiveness of bounded evaluation for bounded and unbounded queries

      Varying Q. To evaluate the impact of queries Q, we varied the complexity of Q, measured as the number #Q of joins in the query templates Q, from 0 to 4, while using the entire AIRCA and UKMOT datasets. Note that for each query template, we instantiated 5 queries by setting its parameters with different values (hence these queries share the same query structure and #Q). The evaluation time of each query template is the average of all its instantiated queries.

      The results are reported in Figs. 4(c) and 4(d). We find the following. (a) The complexity of Q has impacts on the performance of both BEAS@PG and PostgreSQL, as expected. They both take longer time for queries with more joins (i.e., #Q). However, (b) BEAS@PG scales much better with the number #Q of joins in Q than PostgreSQL (with indices). For instance, on average BEAS@PG found answers for all queries with #Q = 4 within 11.67 ms on full-sized AIRCA, while PostgreSQL takes 1.56 $\times\!\;10^5$ ms; that is, PostgreSQL is 1.34 $\times\!\;10^4$ times slower than BEAS@PG for large queries.

      Remark. We find that when queries Q incur joins on keys only, PostgreSQL with extra key/foreign key indices built is almost as fast as BEAS@PG (e.g., TPCH Q4). However, as long as Q involves non-key attributes, e.g., many of the AIRCA and UKMOT queries, PostgreSQL performs poorly on big tables, even provided with all indices. Indeed, on average BEAS@PG outperforms PostgreSQL by 8.98 $\times\!\;10^3$ times and 1.76 $\times\!\;10^4$ times for all bounded queries over all subsets of AIRCA and UKMOT, respectively. The gap gets larger when the number of non-key attributes increases.

      By looking into PostgreSQL′s plan and its EXPLAIN output, we find that this is partially due to the following reason. Given an access constraint $ R (\!| X \rightarrow Y, N |\!) $, BEAS@PG fetches only distinct values of the relevant XY attributes, but PostgreSQL fetches entire tuples with irrelevant attributes of R, although those attributes are not needed for answering Q at all, no matter what indices are provided. This led to duplicated (X,Y) values when X is not a key, and the duplication got inflated rapidly by joins, e.g., EXPLAIN output shows that PostgreSQL consistently accesses entire tables when there are non-key attributes.

      Exp-3: Effectiveness for unbounded queries. In the same setting as in Exp-2, we evaluated the impact of D and Q on the performance of unbounded queries by BEAS@PG and PostgreSQL with indices enabled for PostgreSQL.

      Varying |D|. The results on AIRCA, UKMOT and TPCH are in Fig. (4e), (4f) and (4g), respectively. Observe the following.

      (a) BEAS@PG is able to speed up PostgreSQL even for queries that are not bounded under the available access constraints. On average, BEAS@PG is 7.22 $\times\!\;10^2$, 2.29 $\times\!\;10^3$ and 3.43 times faster than PostgreSQL for unbounded queries on AIRCA, UKMOT and TPCH, respectively. This is because while not all relations in these queries are bounded, bounded evaluation can still speed up their “bounded” subqueries, and hence remains faster than PostgreSQL.

      (b) As opposed to evaluating bounded queries, both BEAS@PG and PostgreSQL are sensitive to the size of the datasets when evaluating unbounded queries. However, BEAS@PG scales much better than PostgreSQL, and their performance gap becomes larger when the dataset size increases. For example, when the dataset increases from 1 GB to 16 GB, the average processing time of BEAS@PG increases from 9.53 $\times\!\;10^2$ ms, 1.67 $\times\!\;10^3$ ms and 2.21 $\times\!\;10^3$ ms to 6.09 $\times\!\;10^3$ ms, 1.84 $\times\!\;10^4$ ms and 4.54 $\times\!\;10^4$ ms on AIRCA, UKMOT and TPCH, respectively. In contrast, PostgreSQL increases from 7.50 $\times\!\;10^3$ ms, 2.61 $\times\!\;10^4$ ms and 5.98 $\times\!\;10^3$ ms to 1.04 $\times\!\;10^5$ ms, 4.53 $\times\!\;10^5$ ms and 1.23 $\times\!\;10^5$ ms, respectively, even with all indices built and enabled.

      Note that the speedup for unbounded TPCH queries is not as good as for AIRCA and UKMOT queries. This is because (i) the N's of access constraints $ R (\!| X \rightarrow Y, N |\!) $ over TPCH scale linearly as the dataset gets larger, while those on AIRCA and UKMOT are more stable and independent of the dataset size; and (ii) joins in TPCH queries are mostly key/foreign key joins, and thus the extra key indices built for PostgreSQL can mimic bounded query plans used by BEAS@PG to some extent, reducing their performance gaps.

      Varying Q. Varying the number #Q of joins in the queries, the evaluation time of unbounded queries over AIRCA and UKMOT is reported in Figs. (4h) and (4i), respectively. The results tell us the following. (a) The processing time of BEAS@PG and PostgreSQL increases when the number of joins increases. However, (b) the gap between BEAS@PG and PostgreSQL becomes larger when #Q increases from 0 to 4. For instance, over AIRCA, on average BEAS@PG and PostgreSQL take 17.37 ms and 4.43 $\times\!\;10^4 $ ms, respectively, to answer queries with #Q = 0; and the two take 2.78 $\times\!\;10^4$ ms and 1.64 $\times\!\;10^5$ ms, respectively, when #Q = 4; the results over UKMOT are similar. Note that for bounded queries, the gap between the two is even larger (Exp-2).

      Summary. We find the following. (1) BEAS@PG (PostgreSQL with BEAS built on top) does better than PostgreSQL for each and every query in all cases, even with extra indices built for the latter. On average BEAS@PG improves PostgreSQL by 7.32, 9.58 $\times\!\;10^3$ and 2.06 $\times\!\;10^4$ times for TPCH benchmark of 16 GB, AIRCA and UKMOT, respectively, up to 40.46, 3.44 $\times\!\;10^4$, and 2.52 $\times\!\;10^5$ times in the best case. (2) For queries that are boundedly evaluable, BEAS@PG outperforms PostgreSQL by 1.9 $\times\!\;10^4$ and 3.6 $\times\!\;10^4$ times on AIRCA and UKMOT, respectively. (3) For queries with complicated joins, e.g., joins on non-key attributes (AIRCA and UKMOT queries), BEAS@PG is particularly effective, even for unbounded queries. For example, on average BEAS@PG improves PostgreSQL by 5.97 $\times\!\;10^2$ and 1.90 $\times\!\;10^3$ times for queries that are not boundedly evaluable over AIRCA and UKMOT, respectively. For cases where conventional DBMS does its best, e.g., table scan/aggregation and key-foreign key joins (most TPCH queries), BEAS@PG still does better than PostgreSQL. (4) The storage cost for indices of access schema is modest, accounting for 2.98, 0.01 and 0.25 times of the size of 16 GB TPCH, AIRCA and UKMOT, respectively.

    • The related work is categorized as follows.

      Bounded evaluation. The notion of bounded evaluation was introduced in [5], as an effort to formalize scale independence[19, 30, 31]. The latter aims to guarantee that a bounded amount of work is required to execute all queries in an application, regardless of the size of the underlying data. Under access schema proposed in [19], Fan et al.[5] defines boundedly evaluable RA queries. It establishes the complexity of deciding whether a query is boundedly evaluable, for queries in various fragments of RA, ranging from EXPSPACE-hard to undecidable. Bounded evaluation using views was studied in [32], focusing on its complexity bounds.

      To cope with the undecidability of the bounded evaluability problem, an effective syntax was given for RA in [6] under the set semantics. Based on the syntax, algorithms were developed[6] for checking the bounded evaluability of RA queries Q, and if affirmative, generating a bounded query plan for Q. These issues were also studied in [33] for SPC, using a restricted form of query plans. Based on [6], a prototype BEAS for RA queries was developed[34].

      This work extends the prior work in the following. (1) We define bag access schema, an extension of access schema of [5, 19] to support the bag semantics (Section 2). (2) We identify decidable special cases of the bounded evaluability problem that cover a variety of SQL queries commonly used in practice. (3) We develop an effective syntax for boundedly evaluable RAaggr queries under a bag access schema, supporting nested aggregations (Section 4). Moreover, the syntax allows us to make a larger class of RA queries bounded, improving the result of [6] for RA. (4) We extend BEAS[34] from RA to RAaggr, by seamlessly integrating bounded evaluation with DBMS query optimizers, which is quite different from [6, 34]. These extend DBMS with bounded evaluation, which was not studied in [5, 6, 33, 34].

      Query answering with constrained resources. The objective of this work is to make big data analytics accessible to small companies under constrained resources. For queries that are not boundedly evaluable, an approach is to compute approximate answers under available resources. Approximation techniques have been extensively studied, based on synopsis (e.g.,[35-39]) or dynamic sampling (e.g.,[40-42]). We have proposed a data-driven approximation scheme[17] that computes approximate answers $ Q( {\cal D}_Q) $ to an RAaggr query Q in a dataset $ {\cal D} $, by identifying a fraction $ {\cal D}_Q $ of $ {\cal D} $ under an extension of the access schema of [5]. It ensures a deterministic accuracy bound η: (a) for each tuple $ s \in Q( {\cal D}_Q) $, there exists an exact answer t that is within distance at most η from S, and (b) for each exact answer $ t \in Q( {\cal D}) $, there exists $ s \in Q( {\cal D}_Q) $ within distance η from t.

      This work differs from [17] in that we focus on computing exact answers instead of approximation. The techniques are hence quite different. In particular, a bag access schema carries the multiplicities of tuples to deal with the bag semantics, as opposed to distance bounds in access templates of [17]. This said, this work and [17] are complementary to each other. On one hand, the methods of [17] can be used to compute approximate answers to unbounded queries under constrained resources. On the other hand, the techniques developed in this work can be incorporated into the methods of [17], to improve the accuracy of approximate answers by making use of DBMS optimizers and bounded sub-plans.

      Indices. Hash-based or tree-based, DBMS indices are typically defined at the tuple level[11], to retrieve tuple IDs and fetch full tuples. In contrast, a bag constraint $ R (\!| X \rightarrow Y, N |\!) $ offers a value-based index. Bounded plans fetch distinct partial tuples( Y-values) for each input X-value, and thus reduce duplicated and unnecessary attributes in tuples fetched by DBMS, i.e., reduce data access and intermediate relations. The redundancies get inflated rapidly with joins. Moreover, the cardinality constraints in a bag access schema allow us to determine whether data access is bounded.

      Related to bag access schema is a notion of access patterns, which require a relation to be accessed only by providing certain combinations of attributes, e.g.,[43-45]. As opposed to access patterns, a bag access schema offers cardinality constraints, tuple multiplicity and indices. Moreover, it is not required to cover all the attributes of a relation and hence, allows us to fetch partial tuples and reduce redundancy. Further, this work studies bounded evaluation of RAaggr queries and its integration with DBMS, which were not considered in the prior work on query answering under access patterns.

      Query optimization. There has been a host of work on query optimization in DBMS, including access path selection[46], join optimization[47, 48] and recently, machine leaning methods[49-51]. These focus on access path cost models for, e.g., main-memory concurrent systems[46], heuristics for join[47] and group-by[48] re-ordering, learned indices[50, 51] or optimizers[49, 52-54]. Our algorithms and techniques are complementary to the prior work, to incorporate bounded evaluation into DBMS query optimization.

    • We have presented an approach to extending DBMS with bounded evaluation of SQL queries. The novelty of the work consists of (a) a notion of bag access schema to support the bag semantics of nested aggregations; (b) decidable special cases of the bounded evaluability of RAaggr queries; (c) an effective syntax to characterize boundedly evaluable RAaggr queries; and (d) a framework and its underlying algorithms for integrating bounded evaluation with DBMS. Our experimental study has verified that the approach is promising. Together with the approximation scheme of [17], we hope that this work provides small businesses with a capacity for querying big data under constrained resources.

      One topic for future work is to develop algorithms for discovering bag access schemas by incorporating machine learning techniques. Another topic is to extend bounded evaluation of SQL queries to column-oriented DBMS.

    • The set $ \varXi_{{\cal B}} $ of bounded query plans under a bag access scheme $ {\cal B} $ is inductively defined in Fig. 5.

      Figure 5.  Bounded plans under $ {\cal B} $ (set $ \varXi_{{\cal B}} $)

    • We provide (1) the construction of weighted directed graph $ G(Q, {\cal B}) $ for generating bLAPs, (2) algorithm BAP, and (3) justification for the optimality of bLAPs found by BAP.

      Constructing $ {{G}}({ Q}, {\cal { B}}) $. Let V and E be the node set and edge set of $ G(Q, {\cal B}) $, respectively. They are constructed as follows: (1) for each access constraint $R\left( {\left| {X \to Y,N} \right|} \right) \in {\cal B}$, (a) include uR[X], uR[XY] in V, referred to as BA-node; (b) include (uR[X], uR[XY]) in E, referred to as a fetch-edge; (2) for each relation S, include a node uS in V, referred to as a BR-node; (3) for each sub-query Qs of Q, include $ u_{Q_{s}} $ in V as a BQ-node and $ v_{Z[Q_{s}]} $ in V as a BA-node, where Z[Qs] includes the output attributes of Qs; (4) for each constant attribute A = c in Q, there is a BA-node uA in V, and an additional node $ u_{\emptyset} $; (5) for any two BA-nodes uX and uY, if uX (resp. uY) is not the head (resp. tail) of a fetch-edge, then (uX, uY) is an edge in E if there exist $ A\in X $ and $ B\in Y $ such that $ \Sigma_{Q}\vdash A = B $; (6) $ (u_{\emptyset}, u_{A}) $ is an edge in E for every constant attribute A in Q; (7) for any BA-node uX and BR-node uR, if uX is the tail of a fetch edge and X contains all nontrivial attributes of R, then (uX, uR) is an edge in E; and (8) for any BR-node uR and BQ-node $ u_{Q_{s}} $, $ (u_{R}, u_{Q_{s}}) $ is an edge in E if R is a relation of Qs.

      In graph $ G(Q, {\cal B}) $, fetch-edges carry cardinality N′s in their encoded access constraints $R\left( {\left| {X \to Y,N} \right|} \right) \in {\cal B}$ as their weights; each edge $ (u_{\emptyset}, u_{A}) $ for constant node uA has weight 1; and the other edges have no weight. Graph $ G(Q, {\cal B}) $ has at most $2\left\| B \right\| + |Q|$ nodes and $\left\| B \right\|(\left\| B \right\| + |Q|)$ edges.

      Encoding proofs using ${{G}}({{Q}}, {\cal { B})}$. Traversing a fetch-edge uR[X], uR[XY]) encodes fetch h(ξR[X], φ) for some $\varphi = R\left( {\left| {X \to Y,N} \right|} \right) \in {\cal B}$ that fetches values for Y, where ξX is the bLAP encoded by the traversal from $ u_{\emptyset} $ to uR[X], which retrieves values for R[X] necessary for answering Q; a traversal from a set S of BA-nodes $ u_{X_{1}} $, ···, $ u_{X_{m}} $ to a BA-node $ u_{Y} $ encodes a bLAP $ \xi_{Y} = \pi_{Y}(\xi_{X_{1}}\Join\cdots \Join \xi_{X_{m}}) $ for Y, where $ \xi_{X_{i}}(i\in [1, m]) $ is the bLAP encoded by the traversal from $ u_{\emptyset} $ to $ u_{X_{i}} $ that retrieves values for Xi. Other cases are similar.

      Algorithm 2. BAP

      Input: RAaggr query Q, bounded relation R of Q, bag access schema $ {\cal B} $ and cost function c().

      Output: A bLAP ξR for R under $ {\cal B} $.

      1 construct graph $ G(Q, {\cal B}) $;

      2 $ {{{P}}_{{Q}}}:=\emptyset $; $ {S_{{{vt}}}}:=\emptyset $; $ H:= \emptyset $; $ {{{D}}_{{T}}}[u_{\emptyset}] := 0 $; $ {{{L}}_{{T}}}[u_{\emptyset}] := \emptyset $; $ { {{\rm{GATE}}}}[u_{\emptyset}] := \emptyset $;

      3 for each u in $ G(Q, {\cal B}) $ do

      4 if $ u \neq u_{\emptyset} $ then $ {{{D}}_{{T}}}[u] := +\infty $; $ {{{L}}_{{T}}}[u] := \emptyset $; GATE[u] := L[u];

      5 $ {{{P}}_{{Q}}}.{ {\rm{{push}}}}(u_{\emptyset}) $; //Initialization

      6 while: $ {{{P}}_{{Q}}} \neq \emptyset $ do

      7 u := PQ.pop();//PQ pops out u with minimum DT[u] in PQ

      8 $ {S_{{{vt}}}} := {S_{{{vt}}}} \cup \{u\} $;

      9 for each neighbor v of u that is not in Svt do

      10  $ { {\rm{{GATE}}}}[v] := { {\rm{{GATE}}}}[v]\setminus L[u] $;

      11  if $ (u, v) $ is a fetch-edge and $ {{{D}}_{{T}}}[v] > \varGamma_{{ {\rm{{fetch}}}}}({{{D}}_{{T}}}[u],$ w(u, v)) then

      12   PQ.push(v);

      13   ${{{D}}_{{T}}}[v] :=\; \varGamma\;_{{ {\rm{{fetch}}}}}({{{D}}_{{T}}}[u], w(u,v))$; LT[v]:= {u}

      14  else if v is a BR-node then

      15   PQ.push(v), DT[v]:= DT[u]; LT[v]:={u}

      16  else if v is a BQ-node and $ { {\rm{{GATE}}}}[v] = \emptyset $ then

      17   PQ.push(v); DT[v]:= c(ξv);

          LT[v]:=pre(v) //ξv is the plan for Q encoded by v, composed from predecessors pre(v) of v following the structure of Q

      18  else if $ { {\rm{{GATE}}}}[v] = \emptyset $ then

         //attributes X of v are joined from those of $ { {\rm{{pre}}}}(v)\cap {S_{{{vt}}}} $

      19   PQ.push(v);

      20   $ ({{{D}}_{{T}}}[v],{{{L}}_{{T}}}[v], {S_{{{vt}}}}):={ {\rm{{SC}}}}(u, { {\rm{{pre}}}}(v)\cap {S_{{{vt}}}}, v) $;

          $d:=\varGamma_{\Join}({{{L}}_{{T}}}[v])$;

      21   if LT[v] contains BQ-node uq and DT[v] and $ u_{q}\not \in {{H}} $ then

      22   DT[v] := d; $ {S_{{{vt}}}} := \{u_{\emptyset}, v\} $; $ {{H}}:={{H}}\cup\{v\} $// restart the search

      23 return bLAP that is encoded by the traversal from $ u_{\emptyset} $ to uR recorded in LT[].

      Algorithm BAP. BAP is given as Algorithm 2. It uses (a) GATE[u] to record the condition for visiting u, e.g., for the head uX of a fetch-edge, GATE[u] = X; (b) L[u] to denote the attributes or relations u encoded; (c) DT[u] to denote the cost of the part of bLAP encoded by the search trace from $ u_{\emptyset} $ to u; (d) LT[u] to store the nodes to be visited before visiting u; (e) a priority queue PQ for nodes to explore; and (f) sets Svt of visited nodes and ${{H}}$ of nodes triggered restarts.

      Algorithm BAP starts the search from $ v_{\emptyset} $. It first initializes data structures (lines 2–5). It then iteratively explores nodes in PQ (lines 6–22). The search extends the Dijkstra search with conditional node expansion controlled by GATE[v] and types of the edges (lines 9–22), using c() to calculate the traversal cost. It restarts the search if the output of a sub-query is used as an input for a fetch (i.e., to visit the head of a fetch-edge; determined by SC given as Algorithm 3) with a reduced cost (lines 21–22). It returns bLAP encoded by the search trace if restarts cannot improve its cost (line 23).

      Algorithm 3. SC

      Input: visited vertex u and vertex set Svt, and vertex v to be visited.

      Output: DT[v], LT[v] and updated Svt.

      1 if $ { {\rm{{GATE}}}}[v] \neq \emptyset $ then $ {{{D}}_{{T}}}[v] := \lambda({{{D}}_{{T}}}[u], w(u, v)) $; $ {{{L}}_{{T}}}[v] := \{u\} $;

      2 return DT[v], LT[v], Svt

      3 $ W:=L[v] $; $ H_{v}:=\emptyset $;

      4 while: $ W\neq \emptyset $ do

      5 choose $ u'\in {S_{{{vt}}}}\cap { {\rm{{pre}}}}(v) $ with minimum $ \frac{g_{v}(\{{{\rm{D}}_{\rm{T}}}[v']\,|\,v'\in H_{v}\cup \{u'\}\})}{|W\cap L[u']|} $;

      6 $ W := W\setminus L[u'] $; $ H_{v} := H_{v}\cup \{u'\} $

      7 if (u, v) is a fetch-edge and $ \lambda({{{D}}_{{T}}}[u], w(u, v)) <$ $g_{v}(\{{{{D}}_{{T}}}[v']\,|\,v'\in H_{v}) $ then

      8 $ {{{D}}_{{T}}}[v] := \lambda({{{D}}_{{T}}}[u], w(u, v)) $; $ {{{L}}_{{T}}}[v] := \{u\} $

      9 else

      10 $ {{{D}}_{{T}}}[v] := g_{v}(\{{{{D}}_{{T}}}[v']\,|v'\in H_{v}) $; LT[v] := Hv;

      11 if Hv contains q-vertex then Svt := {v};

      12 return DT[v], LT[v], Svt);

      Optimality of BAP. We say that cost $ {{c}}(\xi_{R}) $ is regular if (a) all parameter functions are monotonically non-decreasing w.r.t. each of their arguments; and (b) $ \varGamma_{\Join} $, $ \varGamma_{-} $, $ \varGamma_{\cup} $, $ \varGamma_{{ {\rm{{gpBy}}}}} $ and $ \varGamma_{{ {\rm{{fetch}}}}} $ are commutative and associative.

      Denote ${\max _{\varphi = R\left( {\left| {X \to Y,N} \right|} \right) \in {\cal B}}}|Y|$ by $ \#{\cal B} $, where |Y| is the number of attributes in Y. Then we have the following.

      Proposition 11. Under any access schema B, BAP finds optimal bLAPs with regular cost functions if $ \#B \leq 1 $.

      Proof sketch. We discuss SPC Q first, followed by RAaggr.

      (1) Q is in SPC. We first prove the following lemmas.

      (a) For each node $ u\in {S_{{vt}}} $, DT[u] is the shortest distance from $ u_{\emptyset} $ to u; and for each unvisited node v, DT[v] is the shortest distance from $ u_{\emptyset} $ when traversing nodes in Svt only.

      (b) For each $ u\in {S_{{vt}}} $, the plan encoded by the traversal from $ u_{\emptyset} $ to u is of cost DT[u] when c() is regular and $ \#{\cal B} = 1 $.

      For if these hold, then when BAP terminates, the encoded plan from $ u_{\emptyset} $ to uR is an LAP for R with minimum c().

      Lemma (a) can be proved by induction on the number of nodes in Svt, along the same line as Dijkstra′s optimality proof, by observing that SC(v, S) always selects the subset $ S_{0}\subseteq S $ with minimum cost w.r.t. c() for visiting v when $ \#{\cal B} = 1 $. Lemma (b) can be verified by the same induction with (i) the observation that DT[u] mimics c() for the encoded bLAP from $ u_{\emptyset} $ to u, and (ii) the correspondence between proofs (traversals in $ G(Q, {\cal B}) $) and bLAPs under $ {\cal B} $ in the proof of Theorem 7 (part of the proof of Lemma (II)).

      (2) Q is in RAaggr. For RAaggr queries, we show that BAP preserves the optimality of SPC with the following lemmas:

      (c) Every sub-query Qs of Q can improve searched LAP once.

      (d) The order of sub-queries to restart does not change the final bLAP (up to equivalence).

      Both lemmas are proved by induction on the number of –, $ \cup $ and gpBy operations in Q, using the condition that $ \varGamma_{\times} $ is associative and commutative and c() is monotonic.□

      One may expect BAP to find optimal bLAP for more cases while remaining in PTIME. This is, however, beyond reach.

      Proposition 12. For each integer $ k > 1 $, it is NP-hard to decide, given any bag access schema $ {\cal B} $ with $ \#B = k $, RAaggr query $ Q\in {\cal L}_{{\cal B}} $, relation atom R of Q and number R, whether there exists a bLAP ξR for R with cost $ {{c}}(\xi_{R}) \leq r $, even when $ {{c}}(\xi_{R}) $ is regular.

      Proof. We show the NP-hardness by reduction from VERTEX COVER (VC), which is NP-complete[1]. An instance of VC consists of a graph G(V, E) and an integer n. Given G and n (in binary form), VC is to decide whether there exists a subset $ S\subseteq V $ such that (1) S covers G, i.e., for any edge $ e\in E $, there exists $ u\in S $ on $ e $; and (2) $ |S| \leq n $.

      Given an instance $G(V = \{v_{1}, \cdots, v_{p}\}, E = \{e_{1}, \cdots, e_{q}\})$ and n of VC, we construct a database schema $ {\cal R} $, a bag access schema $ {\cal B} $ over $ {\cal R} $ with $ \#{\cal B} = k $ ($ k\geq 2 $ is an integer), RAaggr Q over $ {\cal R} $, a bounded relation RB in Q, a real number R, a regular cost function c() such that there exists a bLAP ξ for RB under $ {\cal B} $ with $ {{c}}(\xi)\leq r $ if and only if G has a cover S with $ |S|\leq n $. More specifically, the reduction is given as follows.

      (1) Database schema $ {\cal R} $ consists of 3 relation schemas $ R(A_{0}, A) $, $S(A_{1}, \cdots, A_{q}, I)$, and $T(F_{1}, \cdots, F_{k})$. Here relation R is to encode the vertices of G, S is to represent edges of G, and T will be used to make $ {\cal B} $ with $ \#B = k $.

      (2) Query Q is defined as follows:

      $$\pi_{S[I]}(\sigma_{R_{1}[A_{0}] = 1}(R_{1})\Join \cdots \Join \sigma_{R_{p}[A_{0}] = p}(R_{p}) \Join S)$,$

      where the join condition is $ R_{i}[A] = S[A_{j}] $ if $ v_{i}\in V $ is an end point of $ e_{j} \in E $ in G. Here relation atom $ R_{i}(i\in[1, p]) $ is a renaming of relation schema R. Intuitively, the join condition encodes the edge relation of G, and $ \pi_{S[I]} $ is to ensure that every attribute in S is nontrivial.

      (3) The bag access schema $ {\cal B} $ consists of 3 access constraints:

      ${\varphi _R} = R\left( {\left| {{A_0} \to A,2} \right|} \right)$,

      ${\varphi _S} = S\left( {\left| {\{ {A_1}, \cdots ,{A_q}\} \to I,1} \right|} \right)$, and

      ${\varphi _T} = T\left( {\left| {\emptyset \to \{ {F_1}, \cdots ,{F_k}\} ,1} \right|} \right)$.

      Here φR is to fetch values of attribute R[A] (i.e., vertices of G), and φS is to fetch edges encoded by S[I] using $R[A_{1}, \cdots, A_{q}]$ (i.e., edges of G). Note that $ \#{\cal B} = k $.

      (4) The bounded relation RB is set to be R. One can easily verify that R is bounded under $ {\cal B} $.

      (5) We set r = 2n. Note that this is in PTIME since n in the VC instance is in binary.

      (6) Function c() is instantiated as follows: (i) $\varGamma_{\times}({{c}}(\xi_{1}), {{c}}(\xi_{2})) = c_{1}\times c_{2}$ if $ \xi_{1} \neq \xi_{2} $ and is $ \max({{c}}(\xi_{1}), {{c}}(\xi_{2})) $ otherwise. (ii) $\varGamma_{{ {\rm{{fetch}}}}}(c, N) = c\times N$; and (iii) we simply set all other functions as constants. Note that c() is regular.

      We show that G has a vertex cover S of size at most n if and only if R has a bLAP ξ with cost $ {{c}}(\xi)\leq r $

      $ \Rightarrow $ Assume that G has a vertex cover S with $ |S|\leq n $. We construct a bLAP ξ for R under $ {\cal B} $ as follows: $ \xi = { {\rm{{fetch}}}}(\Join_{i = 1}^{q}\xi_{A_{i}}, \varphi_{S}) $, where $ \xi_{A_{i}} = { {\rm{{fetch}}}}(\{j\}, \varphi_{R}) $ if ei is covered by $ v_{j}\in S $ (when ei is covered by multiple vertices in S, we pick one of them randomly). By the construction of Q and by that S covers all edges of G, ξ is a bLAP for S under $ {\cal B} $. From $|S|\leq n,$ we know that ${{c}}(\xi) = \varGamma_{\times}(\Join_{i = 1}^{q}\xi_{A_{i}})\times 1 \leq 2^{n} = r$.

      $ \Leftarrow $ Assume that R has a bLAP ξ under $ {\cal B} $ with cost $ {{c}}(\xi) \leq 2^{n} $. By the join condition of Q and φS, there exist at most n distinct fetch operations for fetching S[A1], ··· , S[Aq], i.e., S[A1], ··· , S[Aq] can be fetched from at most n numbers 1, 2, ···, p using φR. This gives us a cover S of G as follows: $ S = \{v_{i}(i\in [1, p])\,|\,\exists j\in [1, q], \xi_{A_{j}} = { {\rm{{fetch}}}} (\{i\}, \varphi_{R})\} $. By the construction of Q, S is a cover of G with $ |S|\leq n $.□

    • The authors are supported in part by Royal Society Wolfson Research Merit Award WRM/R1/180014, ERC 652976, EPSRC EP/M025268/1, Shenzhen Institute of Computing Sciences, and Beijing Advanced Innovation Center for Big Data and Brain Computing.

    • This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

      The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

      To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

参考文献 (54)

目录

    /

    返回文章
    返回