Proceedings of the ACM on Management of Data (PACMMOD), Vol. 2, No. 2 (PODS), 2024


PACMMOD Volume 2 Issue 2: Editorial

We are excited to announce the first issue dedicated to the PODS research track of the Proceedings of the ACM on Management of Data (PACMMOD) journal. In its current form, this new journal hosts a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific basis for methods, techniques, and solutions for the data management challenges that continually arise in our data-driven society. Articles for the PODS track of PACMMOD present principled contributions to modeling, application, system building, and both theoretical and experimental validation in the context of data management. Such articles might be based, among other things, on establishing theoretical results, developing new concepts and frameworks that deserve further exploration, providing experimental work that sheds light on the scientific foundations of the discipline, or rigorously analyzing both widely used and recently developed industry artifacts. At a time when computer science is increasingly data-centric, it is essential to promote an active exchange of tools and techniques between the principles of database systems and other communities focused on data management. The PODS track thus pays special attention to papers that help in the urgent process of integrating data management techniques within broader computer science. Articles published in this track will be invited for presentation at the ACM Symposium on Principles of Database Systems (PODS), which is held jointly with SIGMOD each year.

A Dichotomy in the Complexity of Consistent Query Answering for Two Atom Queries With Self-Join

We consider the dichotomy conjecture for consistent query answering under primary key constraints. It states that, for every fixed Boolean conjunctive query q, testing whether q is certain (i.e., whether it evaluates to true over all repairs of a given inconsistent database) is either in PTime or coNP-complete. This conjecture has been verified for self-join-free queries and path queries. We show that it also holds for queries with two atoms.
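
To make the objects concrete, here is a minimal sketch (our own illustration, not the paper's algorithm) of repairs and certainty for a single table whose primary key is its first attribute; a repair keeps exactly one fact per key group:

    from itertools import product

    def repairs(facts):
        # Group facts by key; a repair picks exactly one fact per key group.
        groups = {}
        for key, val in facts:
            groups.setdefault(key, []).append((key, val))
        return (set(choice) for choice in product(*groups.values()))

    def is_certain(facts, q):
        # q is certain iff it evaluates to true in every repair.
        return all(q(repair) for repair in repairs(facts))

    db = [(1, 'a'), (1, 'b'), (2, 'a')]   # key 1 violates the primary key
    print(is_certain(db, lambda r: any(v == 'a' for _, v in r)))   # True
    print(is_certain(db, lambda r: (1, 'a') in r))                 # False

Enumerating repairs is exponential in general; the dichotomy concerns which fixed queries q admit a polynomial-time shortcut.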

Conjunctive Queries with Negation and Aggregation: A Linear Time Characterization

In this paper, we study the complexity of evaluating Conjunctive Queries with negation (CQ¬). First, we present an algorithm with linear preprocessing time and constant-delay enumeration for a class of CQs with negation called free-connex signed-acyclic queries. We show that, subject to lower-bound conjectures, no other queries admit such an algorithm. Second, we extend our algorithm to Conjunctive Queries with negation and aggregation over a general semiring, which we call Functional Aggregate Queries with negation (FAQ¬). Such an algorithm achieves constant-delay enumeration for the same class of queries but with slightly increased preprocessing time, which includes an inverse Ackermann function. We show that this surprising appearance of the Ackermann function is probably unavoidable for general semirings, but can be removed when the semiring has a specific structure. Finally, we show an application of our results to computing the difference of CQs.

Consistent Query Answering for Primary Keys on Rooted Tree Queries

We study the data complexity of consistent query answering (CQA) on databases that may violate primary key constraints. A repair is a maximal subset of the database satisfying the primary key constraints. For a Boolean query q, the problem CERTAINTY(q) takes a database as input and asks whether every repair satisfies q. The computational complexity of CERTAINTY(q) has been established whenever q is a self-join-free Boolean conjunctive query, or a (not necessarily self-join-free) Boolean path query. In this paper, we take one more step towards a general classification for all Boolean conjunctive queries by considering the class of rooted tree queries. In particular, we show that for every rooted tree query q, CERTAINTY(q) is in FO, NL-hard ∩ LFP, or coNP-complete, and that it is decidable (in polynomial time), given q, which of the three cases applies. We also extend our classification to larger classes of queries with simple primary keys. Our classification criteria rely on query homomorphisms, and our polynomial-time fixpoint algorithm is based on a novel use of context-free grammars (CFGs).

Containment of Graph Queries Modulo Schema

With multiple graph database systems on the market and a new Graph Query Language standard on the horizon, it is time to revisit some classic static analysis problems. Query containment, arguably the workhorse of static analysis, has already received a lot of attention in the context of graph databases, but not so in the presence of schemas. We aim to change this. Because there is no universal agreement yet on what graph schemas should be, we rely on an abstract formalism borrowed from the knowledge representation community: we assume that schemas are expressed in a description logic (DL). We identify a suitable DL that captures both basic constraints on the labels of incident nodes and edges, and more refined schema features such as participation, cardinality, and unary key constraints. Building upon, and extending, the rich body of work on DLs, we solve the containment modulo schema problem for unions of conjunctive regular path queries (UCRPQs) and schemas whose descriptions do not mix inverses and counting. For two-way UCRPQs (UC2RPQs) we solve the problem under additional assumptions that tend to hold in practice: we restrict the use of concatenation in queries and participation constraints in schemas.

Enumeration for MSO-Queries on Compressed Trees

We present a linear preprocessing and output-linear delay enumeration algorithm for MSO-queries over trees that are compressed in the well-established grammar-based framework. Time bounds are measured with respect to the size of the compressed representation of the tree. Our result extends previous work on the enumeration of MSO-queries over uncompressed trees and on the enumeration of document spanners over compressed text documents.

From Shapley Value to Model Counting and Back

In this paper, we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation.
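
For reference, the contribution measure here is the standard game-theoretic Shapley value. With X the set of variables of the Boolean function and v a value function on coalitions S ⊆ X (in this setting, derived from the function's satisfying assignments), it reads:

\[
\mathrm{Sh}(v, x) \;=\; \sum_{S \subseteq X \setminus \{x\}} \frac{|S|! \, (|X| - |S| - 1)!}{|X|!} \, \bigl( v(S \cup \{x\}) - v(S) \bigr) .
\]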

We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in the case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form.

Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC

Despite considerable research on document spanners, little is known about the expressive power of generalized core spanners. In this paper, we use Ehrenfeucht-Fraïssé games to obtain general inexpressibility lemmas for the logic FC (a finite model variant of the theory of concatenation). Applying these lemmas gives inexpressibility results for FC that we lift to generalized core spanners. In particular, we give several relations that cannot be selected by generalized core spanners, thus demonstrating the effectiveness of the inexpressibility lemmas. As an immediate consequence, we also gain new insights into the expressive power of core spanners.

On Reporting Durable Patterns in Temporal Proximity Graphs

Finding patterns in graphs is a fundamental problem in databases and data mining. In many applications, graphs are temporal and evolve over time, so we are interested in finding durable patterns, such as triangles and paths, which persist over a long time. While there has been work on finding durable simple patterns, existing algorithms do not have provable guarantees and run in strictly super-linear time. The paper leverages the observation that many graphs arising in practice are naturally proximity graphs, or can be approximated as such, where nodes are embedded as points in some high-dimensional space and two nodes are connected by an edge if they are close to each other. We work with an implicit representation of the proximity graph, where nodes are additionally annotated by time intervals, and design near-linear-time algorithms for finding (approximately) durable patterns above a given durability threshold. We also consider an interactive setting where a client experiments with different durability thresholds in a sequence of queries; we show how to compute incremental changes to the result patterns efficiently, in time near-linear in the size of the changes.

Streaming Algorithms with Few State Changes

In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with both small space and a small number of memory writes are possible.

We first demonstrate that, for the fundamental F_p moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant-factor approximation must make Ω(n^{1-1/p}) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm that also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F_p moment estimation that uses a near-optimal ~O_ε(n^{1-1/p}) number of state changes, while simultaneously achieving near-optimal space: for p ∈ [1,2), our algorithm uses poly(log n, 1/ε) bits of space, while for p > 2, it uses ~O_ε(n^{1-1/p}) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
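
As a warm-up illustrating the metric (this is the classic Morris counter, not the paper's F_p algorithm), an approximate counter can process n increments while changing its single-register state only O(log n) times in expectation: a write happens only when a coin with the current bias comes up heads.

    import random

    def morris_count(stream):
        x, writes = 0, 0                       # x is the entire internal state
        for _ in stream:
            if random.random() < 2.0 ** (-x):  # no state change on most updates
                x += 1
                writes += 1
        return 2 ** x - 1, writes              # E[2^x] = n + 1, so the estimate is unbiased

    estimate, writes = morris_count(range(100_000))
    print(estimate, writes)                    # estimate near 100000; only ~17 writes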

The Complexity of Why-Provenance for Datalog Queries

Datalog is a powerful rule-based language that allows us to express complex recursive queries and has found numerous applications over the years. Explaining why a result of a Datalog query is obtained is an essential task towards explainable and transparent data-intensive applications that rely on Datalog. A standard way of explaining a query result is the so-called why-provenance, which provides information about the witnesses to a query result in the form of subsets of the input database that as a whole can be used to derive that result. Surprisingly, despite the fact that the notion of why-provenance for Datalog queries has been around for decades and intensively studied, its computational complexity remains unexplored. Our goal is to fill this gap in the why-provenance literature. Towards this end, we pinpoint the data complexity of why-provenance for Datalog queries and key subclasses thereof. The takeaway of our work is that why-provenance for recursive queries, even if the recursion is limited to be linear, is an intractable problem, whereas for non-recursive queries it is highly tractable.
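
As a minimal illustration of why-provenance (our brute-force example, not the paper's techniques), consider the transitive-closure program path(x,y) ← edge(x,y); path(x,y) ← edge(x,z), path(z,y). The why-provenance of a derived fact consists of the minimal sets of input facts that suffice to derive it:

    from itertools import combinations

    def derives(facts, goal):
        # Naive bottom-up evaluation of the transitive-closure program.
        path = set(facts)
        changed = True
        while changed:
            changed = False
            for (x, z) in facts:
                for (z2, y) in list(path):
                    if z == z2 and (x, y) not in path:
                        path.add((x, y))
                        changed = True
        return goal in path

    def why_provenance(facts, goal):
        # All subset-minimal sets of input facts deriving the goal.
        witnesses = []
        for k in range(1, len(facts) + 1):
            for sub in combinations(facts, k):
                if derives(set(sub), goal) and \
                        not any(set(w) <= set(sub) for w in witnesses):
                    witnesses.append(sub)
        return witnesses

    print(why_provenance([(1, 2), (2, 3), (1, 3)], (1, 3)))
    # [((1, 3),), ((1, 2), (2, 3))] -- two minimal witnesses for path(1,3)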

The Moments Method for Approximate Data Cube Queries

We investigate an approximation algorithm for various aggregate queries on partially materialized data cubes. Data cubes are interpreted as probability distributions, and cuboids from a partial materialization populate the terms of a series expansion of the target query distribution. Unknown terms in the expansion are simply assumed to be 0 in order to recover an approximate query result. We identify this method as a variant of approaches known from other fields of science, namely the Bahadur representation and, more generally, (biased) Fourier expansions of Boolean functions. Existing literature indicates a rich but intricate theoretical landscape. Focusing on the data cube application, we start by investigating worst-case error bounds. We build upon prior work to obtain provably optimal materialization strategies with respect to query workloads. In addition, we propose a new heuristic method governing materialization decisions. Finally, we show that well-approximated queries are guaranteed to have well-approximated roll-ups.

Tight Lower Bounds for Directed Cut Sparsification and Distributed Min-Cut

In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors.

The first problem is approximating cuts in balanced directed graphs. In this problem, we want to build a data structure that can provide (1 ± ε)-approximations of cut values on a graph with n vertices. For arbitrary directed graphs, such a data structure requires Ω(n^2) bits even for constant ε. To circumvent this, recent works study β-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most β times the total weight in the other direction. We consider the for-each model, where the goal is to approximate each cut with constant probability, and the for-all model, where all cuts must be preserved simultaneously. We improve the previous Ω(n√(β/ε)) lower bound in the for-each model to ~Ω(n√β/ε), and we improve the previous Ω(nβ/ε) lower bound in the for-all model to Ω(nβ/ε^2). This resolves the main open questions of (Cen et al., ICALP 2021).

The second problem is approximating the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We prove an Ω(min{m, m/(ε^2 k)}) lower bound for this problem, which improves the previous Ω(m/k) lower bound, where m is the number of edges, k is the minimum cut size, and we seek a (1+ε)-approximation. In addition, we show that existing upper bounds with minor modifications match our lower bound up to logarithmic factors.

The Weisfeiler-Leman Dimension of Conjunctive Queries

A graph parameter is a function f on graphs with the property that, for any pair of isomorphic graphs G1 and G2, f(G1) = f(G2). The Weisfeiler-Leman (WL) dimension of f is the minimum k such that, if G1 and G2 are indistinguishable by the k-dimensional WL-algorithm, then f(G1) = f(G2). The WL-dimension of f is ∞ if no such k exists. We study the WL-dimension of graph parameters characterised by the number of answers of a fixed conjunctive query in the graph. Given a conjunctive query φ, we quantify the WL-dimension of the function that maps every graph G to the number of answers of φ in G.
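
For readers unfamiliar with the algorithm, the k = 1 case is color refinement: vertex colors are iteratively refined by the multiset of neighbouring colors until the partition stabilises. The following standard sketch (ours, not from the paper) exhibits the classic failure case, where the 6-cycle and two disjoint triangles receive identical color histograms and are therefore not distinguished:

    def wl1_colors(adj):
        # 1-dimensional Weisfeiler-Leman (color refinement).
        colors = {v: 0 for v in adj}
        while True:
            sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                   for v in adj}
            table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
            new = {v: table[sig[v]] for v in adj}
            if new == colors:          # partition is stable
                return colors
            colors = new

    c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # 6-cycle
    tt = {0: [1, 2], 1: [0, 2], 2: [0, 1],                   # two disjoint triangles
          3: [4, 5], 4: [3, 5], 5: [3, 4]}
    print(sorted(wl1_colors(c6).values()) == sorted(wl1_colors(tt).values()))  # True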

The works of Dvořák (J. Graph Theory 2010), Dell, Grohe, and Rattan (ICALP 2018), and Neuen (arXiv 2023) have answered this question for full conjunctive queries, which are conjunctive queries without existentially quantified variables. For such queries φ, the WL-dimension is equal to the treewidth of the Gaifman graph of φ.

In this work, we give a characterisation that applies to all conjunctive queries. Given any conjunctive query φ, we prove that its WL-dimension is equal to the semantic extension width sew(φ), a novel width measure that can be thought of as a combination of the treewidth of φ and its quantified star size, an invariant introduced by Durand and Mengel (ICDT 2013) describing how the existentially quantified variables of φ are connected with the free variables. Using the recently established equivalence between the WL-algorithm and higher-order Graph Neural Networks (GNNs) due to Morris et al. (AAAI 2019), we obtain as a consequence that the function counting answers to a conjunctive query φ cannot be computed by GNNs of order smaller than sew(φ).

The majority of the paper is concerned with establishing a lower bound on the WL-dimension of a query. Given any conjunctive query φ with semantic extension width k, we consider a graph F of treewidth k obtained from the Gaifman graph of φ by repeatedly cloning the vertices corresponding to existentially quantified variables. Using a modification due to Fürer (ICALP 2001) of the Cai-Fürer-Immerman construction (Combinatorica 1992), we then obtain a pair of graphs χ(F) and χ̂(F) that are indistinguishable by the (k-1)-dimensional WL-algorithm, since F has treewidth k. Finally, in the technical heart of the paper, we show that φ has a different number of answers in χ(F) and χ̂(F). Thus, φ can distinguish two graphs that cannot be distinguished by the (k-1)-dimensional WL-algorithm, so the WL-dimension of φ is at least k.

Tight Bounds of Circuits for Sum-Product Queries

In this paper, we ask the following question: given a Boolean Conjunctive Query (CQ), what is the smallest circuit that computes the provenance polynomial of the query over a given semiring? We answer this question by giving upper and lower bounds. Notably, we show that any circuit F that computes a CQ over the tropical semiring must have size log |F| ≥ (1-ε) · daentw(q) for any ε > 0, where daentw(q) is the degree-aware entropic width of the query q. We show a circuit construction that matches this bound when the semiring is idempotent. The techniques we use combine several central notions in database theory: provenance polynomials, tree decompositions, and disjunctive Datalog programs. We extend our results to lower and upper bounds for formulas (i.e., circuits where each gate has outdegree one), and to bounds for non-Boolean CQs.

On Density-based Local Community Search

Local community search (LCS) finds a community in a given graph G local to a set R of seed nodes by optimizing an objective function. The objective function f(S) for an induced subgraph S combines the inclusion of the seed set R with a classic community measure of S, such as conductance or density. An ideal algorithm for optimizing f(S) is strongly local, that is, its complexity depends on R as opposed to G. This paper formulates a general form of objective functions for LCS using configurations and then focuses on a set C of density-based configurations, each corresponding to a density-based LCS objective function. The paper has two main results. i) A constructive classification of C: a configuration in C admits a strongly local algorithm for optimizing its corresponding objective function if and only if it is in C_L ⊆ C. ii) A linear programming-based general solution for density-based LCS that is strongly local and practically efficient. This solution differs from the existing strongly local LCS algorithms, which are all based on flow networks.

Verification of Unary Communicating Datalog Programs

We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.

Evaluating Datalog over Semirings: A Grounding-based Approach

Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally ordered semiring σ, what is the tightest possible runtime? To this end, our main contribution is a general two-phase framework for analyzing the data complexity of Datalog over σ: first, ground the program into an equivalent system of polynomial equations (i.e., a grounding), and then find the least fixpoint of the grounding over σ. We present algorithms that use structure-aware query evaluation techniques to obtain the smallest possible groundings. Next, efficient algorithms for fixpoint evaluation are introduced for two classes of semirings: (1) finite-rank semirings and (2) absorptive semirings of total order. Combining both phases, we obtain state-of-the-art and new algorithmic results. Finally, we complement our results with a matching fine-grained lower bound.
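
To sketch the two-phase framework on the smallest possible example (our simplified illustration): grounding a shortest-path Datalog program over the tropical semiring (min, +) yields one equation per node, whose least fixpoint can be found by naive iteration. Convergence here is guaranteed because the tropical semiring is absorptive and totally ordered:

    INF = float('inf')

    def least_fixpoint(edges, nodes, source):
        # Grounded equations: dist[y] = min(dist[y], dist[x] + w) per edge (x, y, w).
        dist = {v: INF for v in nodes}
        dist[source] = 0                # source fact: dist(source) = 0
        changed = True
        while changed:
            changed = False
            for (x, y, w) in edges:
                if dist[x] + w < dist[y]:
                    dist[y] = dist[x] + w
                    changed = True
        return dist

    edges = [('s', 'a', 1), ('a', 'b', 2), ('s', 'b', 5)]
    print(least_fixpoint(edges, ['s', 'a', 'b'], 's'))   # {'s': 0, 'a': 1, 'b': 3}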

When View- and Conflict-Robustness Coincide for Multiversion Concurrency Control

A DBMS allows trading consistency for efficiency through the allocation of isolation levels that are strictly weaker than serializability. The robustness problem asks whether, for a given set of transactions and a given allocation of isolation levels, every possible interleaved execution of those transactions that is allowed under the provided allocation is safe. In the literature, safe is interpreted as conflict-serializable (to which we refer here as conflict-robustness). In this paper, we study the view-robustness problem, interpreting safe as view-serializable. View-serializability is a more permissive notion that allows a greater number of schedules to be serializable and aligns more closely with the intuitive understanding of what it means for a database to be consistent. However, view-serializability is more complex to analyze (e.g., conflict-serializability can be decided in polynomial time, whereas deciding view-serializability is NP-complete). While conflict-robustness implies view-robustness, the converse does not hold in general. In this paper, we provide a sufficient condition on isolation levels guaranteeing that conflict- and view-robustness coincide, and we show that this condition is satisfied by the isolation levels occurring in Postgres and Oracle: read committed (RC), snapshot isolation (SI), and serializable snapshot isolation (SSI). It hence follows that for these systems, widening from conflict- to view-serializability does not allow more sets of transactions to become robust. Interestingly, the complexity of deciding serializability within these isolation levels is still quite different. Indeed, deciding conflict-serializability for schedules allowed under RC and SI remains in polynomial time, while we show that deciding view-serializability within these isolation levels remains NP-complete.

Expected Shapley-Like Scores of Boolean functions: Complexity and Applications to Probabilistic Databases

Shapley values, originating in game theory and increasingly prominent in explainable AI, have been proposed to assess the contribution of facts in query answering over databases, along with other similar power indices such as Banzhaf values. In this work, we adapt these Shapley-like scores to probabilistic settings, the objective being to compute their expected value. We show that the computations of expected Shapley values and of the expected values of Boolean functions are interreducible in polynomial time, thus obtaining the same tractability landscape. We investigate the specific tractable case where Boolean functions are represented as deterministic decomposable circuits, designing a polynomial-time algorithm for this setting. We present applications to probabilistic databases through database provenance, and an effective implementation of this algorithm within the ProvSQL system, which experimentally validates its feasibility over a standard benchmark.

Chase Termination Beyond Polynomial Time

The chase is a widely implemented approach to reason with tuple-generating dependencies (tgds), used in data exchange, data integration, and ontology-based query answering. However, it is merely a semi-decision procedure, which may fail to terminate. Many decidable conditions have been proposed for tgds to ensure chase termination, typically by forbidding some kind of "cycle" in the chase process. We propose a new criterion that explicitly allows some such cycles, yet still ensures termination of the standard chase under reasonable conditions. This leads to new decidable fragments of tgds that are not only syntactically more general but also strictly more expressive than the fragments defined by prior acyclicity conditions. Indeed, while known terminating fragments are restricted to PTime data complexity, our conditions yield decidable languages for any k-ExpTime. We further refine our syntactic conditions to obtain fragments of tgds for which an optimised chase procedure decides query entailment in PSpace or k-ExpSpace, respectively.

Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections

Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element, and the goal of the synthesizer is to incrementally update a synthetic dataset in a consistent way to capture a rich class of statistical properties. We give continual synthetic data generation algorithms that preserve two basic types of queries: fixed time window queries and cumulative time queries. We show nearly tight upper bounds on the error rates of these algorithms and demonstrate their empirical performance on realistically sized datasets from the U.S. Census Bureau's Survey of Income and Program Participation.

Postulates for Provenance: Instance-based provenance for first-order logic

Instance-based provenance is an explanation for a query result in the form of a subinstance of the database. We investigate different desiderata one may want to impose on these subinstances. Concretely, we consider seven basic postulates for provenance. Six of them relate subinstances to provenance polynomials, three-valued semantics, and Halpern-Pearl causality. Determinism of the provenance mechanism is the seventh basic postulate. Moreover, we consider the postulate of minimality, which can be imposed with respect to any set of basic postulates. Our main technical contribution is an analysis and characterisation of which combinations of postulates are jointly satisfiable. Our main conceptual contribution is an approach to instance-based provenance through three-valued instances, which makes it applicable to first-order logic queries involving negation.

Join Size Bounds using lp-Norms on Degree Sequences

Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees of input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries.

We introduce a significant extension of these upper bounds by incorporating lp-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. They are also based on information theory, come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple".
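
A standard two-relation instance of such a bound (our illustration, not the paper's general framework): for the join Q(x,y,z) = R(x,y) ⋈ S(y,z), write deg_R(y) for the number of tuples of R with a given y-value. Then, by the Cauchy-Schwarz inequality,

\[
|Q| \;=\; \sum_{y} \deg_R(y) \, \deg_S(y) \;\le\; \lVert \deg_R \rVert_2 \cdot \lVert \deg_S \rVert_2 ,
\]

while the familiar bounds |Q| ≤ |R| · max_y deg_S(y) and |Q| ≤ |S| · max_y deg_R(y) are the corresponding l1/l∞ combinations.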

Topology-aware Parallel Joins

We study the design and analysis of parallel join algorithms in a topology-aware computational model. In this model, the network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints and the link bandwidth. The computation proceeds in synchronous rounds, and the cost of each round is measured as the maximum cost over all edges in the network. Our main result is an asymptotically optimal join algorithm over symmetric tree topologies. The algorithm generalizes prior topology-aware protocols for set intersection and Cartesian product to a binary join over an arbitrary input distribution with possible data skew.

Fast Matrix Multiplication for Query Processing

This paper studies how to use fast matrix multiplication to speed up query processing. As has been observed, computing a two-table join and then projecting away the join attribute is essentially Boolean matrix multiplication, which can be sped up significantly with fast matrix multiplication. Moving beyond this basic two-table query, we introduce output-sensitive algorithms for general join-project queries using fast matrix multiplication. These algorithms achieve a polynomially large improvement over the classic Yannakakis framework. To the best of our knowledge, this is the first theoretical improvement for general acyclic join-project queries since 1981.
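
The basic observation fits in a few lines (our sketch with dense numpy matrices; the paper's algorithms substitute fast, output-sensitive matrix multiplication for the dense product): the join-project π_{x,z}(R(x,y) ⋈ S(y,z)) is exactly the Boolean product of the adjacency matrices of R and S.

    import numpy as np

    R = [(0, 1), (1, 1), (1, 2)]        # tuples (x, y)
    S = [(1, 0), (2, 2)]                # tuples (y, z)
    nx, ny, nz = 2, 3, 3                # attribute domain sizes

    A = np.zeros((nx, ny), dtype=int)
    B = np.zeros((ny, nz), dtype=int)
    for x, y in R: A[x, y] = 1
    for y, z in S: B[y, z] = 1

    C = (A @ B) > 0                     # C[x, z] is true iff some y joins x to z
    print([tuple(map(int, idx)) for idx in np.argwhere(C)])
    # [(0, 0), (1, 0), (1, 2)]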

Combined Approximations for Uniform Operational Consistent Query Answering

Operational consistent query answering (CQA) is a recent framework for CQA based on revised definitions of repairs, which are built by applying a sequence of operations (e.g., fact deletions) to an inconsistent database until we reach a database that is consistent w.r.t. the given set of constraints. It has recently been shown that there is an efficient approximation for computing the percentage of repairs that entail a given query when we focus on primary keys, conjunctive queries, and assuming the query is fixed (i.e., in data complexity). However, it has been left open whether such an approximation exists when the query is part of the input (i.e., in combined complexity). We show that this is the case when we focus on self-join-free conjunctive queries of bounded generalized hypertreewidth. We also show that it is unlikely that efficient approximation schemes exist once we give up one of the adopted syntactic restrictions, i.e., self-join-freeness or bounded generalized hypertreewidth. Towards the desired approximation, we introduce a counting complexity class, called SpanTL, show that each problem in it admits an efficient approximation scheme by using a recent approximability result about tree automata, and then place the problem of interest in SpanTL.

Distinct Shortest Walk Enumeration for RPQs

We consider the Distinct Shortest Walks problem. Given two vertices s and t of a graph database D and a regular path query, we want to enumerate all walks of minimal length from s to t that carry a label conforming to the query. Usual theoretical solutions turn out to be inefficient when applied to graph models that are closer to real-life systems, in particular because edges may carry multiple labels. Indeed, known algorithms may repeat the same answer exponentially many times. We propose an efficient algorithm for graph databases with multiple labels. The preprocessing runs in O(|D|·|A|) and the delay between two consecutive outputs is in O(λ·|A|), where A is a nondeterministic automaton representing the query and λ is the minimal length. The algorithm can handle ε-transitions in A or queries given as regular expressions at no additional cost.

Layered List Labeling

The list-labeling problem is one of the most basic and well-studied algorithmic primitives in data structures, with an extensive literature spanning upper bounds, lower bounds, and data management applications. The classical algorithm for this problem, dating back to 1981, has amortized cost O(log^2 n). Subsequent work has led to improvements in three directions: low-latency (worst-case) bounds, high-throughput (expected) bounds, and (adaptive) bounds for important workloads.

Perhaps surprisingly, these three directions of research have remained almost entirely disjoint. This is because, so far, the techniques that allow for progress in one direction have forced worsening bounds in the others. Thus there would appear to be a tension between worst-case, adaptive, and expected bounds. List labeling has been proposed for use in databases at least as early as PODS '99, but a database needs good throughput and response time and needs to adapt to common workloads (e.g., bulk loads), and no current list-labeling algorithm achieves good bounds for all three.

We show that this tension is not fundamental. In fact, with the help of new data-structural techniques, one can actually combine any three list-labeling solutions in order to cherry-pick the best worst-case, adaptive, and expected bounds from each of them.

On the Feasibility of Forgetting in Data Streams

In today's digital age, it is becoming increasingly prevalent to retain digital footprints in the cloud indefinitely. Nonetheless, there is a valid argument that entities should have the authority to decide whether their personal data remains within a specific database or is expunged. Indeed, nations across the globe are increasingly enacting legislation to uphold the "Right To Be Forgotten" for individuals. Investigating computational challenges, including the formalization and implementation of this notion, is crucial due to its relevance in the domains of data privacy and management.

This work introduces a new streaming model: the 'Right to be Forgotten Data Streaming Model' (RFDS model). The main feature of this model is that any element in the stream has the right to have its history removed from the stream. Formally, the input is a stream of updates of the form (a, Δ), where Δ ∈ {+, ⊥} and a is an element from a universe U. When the update Δ = + occurs, the frequency of a, denoted f_a, is incremented to f_a + 1. When the update Δ = ⊥ occurs, f_a is reset to 0. This feature, which represents the forget request, distinguishes the present model from existing data streaming models.
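
Transcribed directly into an exact (non-streaming) reference implementation, the model's semantics look as follows (the string 'forget' stands in for ⊥); the paper's challenge is to approximate quantities like F1 without storing every frequency:

    from collections import defaultdict

    def rfds_f1(stream):
        # Exact reference semantics: '+' increments f_a, 'forget' resets it to 0.
        f = defaultdict(int)
        for a, delta in stream:
            if delta == '+':
                f[a] += 1
            else:                       # the forget request
                f[a] = 0
        return sum(f.values())          # F1 = sum of all frequencies

    stream = [('x', '+'), ('x', '+'), ('y', '+'), ('x', 'forget'), ('x', '+')]
    print(rfds_f1(stream))              # 2: x was forgotten, then seen once more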

This work systematically investigates computational challenges that arise while incorporating the notion of the right to be forgotten. Our initial considerations reveal that even estimating F1 (the sum of the frequencies of all elements) of the stream is a non-trivial problem in this model. Based on these initial investigations, we focus on a modified model, which we call α-RFDS, where we limit the forget operations to at most an α fraction of the stream. In this modified model, we focus on estimating F0 (the number of distinct elements) and F1. We present algorithms and establish almost-matching lower bounds on the space complexity of these computational tasks.

Bag Semantics Conjunctive Query Containment. Four Small Steps Towards Undecidability.

The Query Containment Problem (QCP) is one of the most fundamental decision problems in database query processing and optimization.

The complexity of QCP for conjunctive queries has been fully understood since the 1970s. But, as Chaudhuri and Vardi noticed in their classical 1993 paper, this understanding is based on the assumption that query answers are sets of tuples, and it does not transfer to the situation where multiset (bag) semantics is considered.

Now, 30 years later, the decidability of QCP under bag semantics remains an open question, one of the most intriguing open questions in database theory.

In this paper we show a series of undecidability results for generalizations of this problem. We show, for example, that it is undecidable whether, for two given Boolean conjunctive queries φ_s and φ_b and a linear function F, the inequality F(φ_s(D)) ≤ φ_b(D) holds for every database instance D.
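
To see why bag semantics changes the picture, consider a textbook example (ours, not from the paper): under set semantics, Q1(x) ← R(x,y), R(x,z) is equivalent to Q2(x) ← R(x,y); under bag semantics, Q1 returns each x with multiplicity d(x)^2 while Q2 returns it with multiplicity d(x), where d(x) is the number of R-tuples with first component x. Hence the containment of Q1 in Q2 fails on every instance with some d(x) ≥ 2, and containment becomes an arithmetic statement about polynomials in the multiplicities.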

Minimally Factorizing the Provenance of Self-join Free Conjunctive Queries

We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find a formula that minimizes the number of variable repetitions. This problem is equivalent to solving the fundamental Boolean formula factorization problem for the restricted setting of provenance formulas of self-join-free queries. While general Boolean formula minimization is Σ^p_2-complete, we show that the problem is NP-complete in our case. Additionally, we identify a large class of queries that can be solved in PTIME, expanding beyond the previously known tractable cases of read-once formulas and hierarchical queries.

We describe connections between factorizations, Variable Elimination Orders (VEOs), and minimal query plans. We leverage these insights to create an Integer Linear Program (ILP) that can solve the minimal factorization problem exactly. We also propose a Max-Flow Min-Cut (MFMC) based algorithm that gives an efficient approximate solution. Importantly, we show that both the Linear Programming (LP) relaxation of our ILP and our MFMC-based algorithm are always correct for all currently known PTIME cases. Thus, we present two unified algorithms (ILP and MFMC) that both recover all known PTIME cases in PTIME, yet also solve NP-complete cases either exactly (ILP) or approximately (MFMC), as desired.

When is Shapley Value Computation a Matter of Counting?

The Shapley value provides a natural means of quantifying the contributions of facts to database query answers. In this work, we seek to broaden our understanding of Shapley value computation (SVC) in the database setting by revealing how it relates to Fixed-size Generalized Model Counting (FGMC), which is the problem of computing the number of sub-databases of a given size and containing a given set of assumed facts that satisfy a fixed query. Our focus will be on explaining the difficulty of SVC via FGMC, and to this end, we identify general conditions on queries which enable reductions from FGMC to SVC. As a byproduct, we not only obtain alternative explanations for existing hardness results for SVC, but also new complexity results. In particular, we establish FP-#P complexity dichotomies for constant-free unions of connected CQs and connected homomorphism-closed graph queries. We also consider some variants of the SVC problem, by disallowing assumed facts or quantifying the contributions of constants rather than facts.

Query Optimization by Quantifier Elimination

Query optimizers have a limited arsenal of techniques for optimizing nested queries. In this paper, we develop a new approach for query optimization based on quantifier elimination. Quantifier elimination is a well-established tool for proving the decidability of logical theories. Here, however, we show that it can be turned into an effective query optimization technique that may yield asymptotic improvements in query processing efficiency. In addition, the technique establishes a foundation for certain well-known but previously little-understood aggregation based techniques for optimizing nested queries.
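
A textbook instance of quantifier elimination, over dense linear orders without endpoints (our illustration, not the paper's rewriting rules):

\[
\exists x \, (a < x \wedge x < b) \;\equiv\; a < b .
\]

Read as a query, the nested block that searches for a witness strictly between a and b is replaced by the quantifier-free comparison a < b, so the inner query need not be evaluated at all; the paper develops this idea into an optimization technique for nested queries.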

Consistency of Relations over Monoids

The interplay between local consistency and global consistency has been the object of study in several different areas, including probability theory, relational databases, and quantum information. For relational databases, Beeri, Fagin, Maier, and Yannakakis showed that a database schema is acyclic if and only if it has the local-to-global consistency property for relations, which means that every collection of pairwise consistent relations over the schema is globally consistent. More recently, the same result has been shown under bag semantics. In this paper, we carry out a systematic study of local vs. global consistency for relations over positive commutative monoids, which is a common generalization of ordinary relations and bags. Let K be an arbitrary positive commutative monoid. We begin by showing that acyclicity of the schema is a necessary condition for the local-to-global consistency property for K-relations to hold. Unlike the case of ordinary relations and bags, however, we show that acyclicity is not always sufficient. After this, we characterize the positive commutative monoids for which acyclicity is both necessary and sufficient for the local-to-global consistency property to hold; this characterization involves a combinatorial property of monoids, which we call the transportation property. We then identify several different classes of monoids that possess the transportation property. As our final contribution, we introduce a modified notion of local consistency of K-relations, which we call pairwise consistency up to the free cover. We prove that, for all positive commutative monoids K, even those without the transportation property, acyclicity is both necessary and sufficient for every family of K-relations that is pairwise consistent up to the free cover to be globally consistent.
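
A classic counterexample (over ordinary relations; our illustration) shows what goes wrong on cyclic schemas such as {AB, BC, AC}: take R = {(0,0), (1,1)} over AB, S = {(0,0), (1,1)} over BC, and T = {(0,1), (1,0)} over AC. Every pair of these relations admits a witnessing relation over ABC with the correct projections, so the collection is pairwise consistent; yet global consistency would force a = b, b = c, and a ≠ c simultaneously, which is impossible.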

History-Independent Dynamic Partitioning: Operation-Order Privacy in Ordered Data Structures

A data structure is history independent if its internal representation reveals nothing about the history of operations beyond what can be determined from the current contents of the data structure. History independence is typically viewed as a security or privacy guarantee, with the intent being to minimize risks incurred by a security breach or audit. Despite widespread advances in history independence, there is an important data-structural primitive that previous work has been unable to replace with an equivalent history-independent alternative: dynamic partitioning. In dynamic partitioning, we are given a dynamic set S of ordered elements and a size parameter B, and the objective is to maintain a partition of S into ordered groups, each of size Θ(B). Dynamic partitioning is important throughout computer science, with applications to B-tree rebalancing, write-optimized dictionaries, log-structured merge trees, other external-memory indexes, geometric and spatial data structures, cache-oblivious data structures, and order-maintenance data structures. The lack of a history-independent dynamic-partitioning primitive has meant that designers of history-independent data structures have had to resort to complex alternatives. In this paper, we achieve history-independent dynamic partitioning. Our algorithm is asymptotically optimal against an oblivious adversary, processing each insert/delete with O(1) operations in expectation and O(B log N / log log N) operations with high probability, where N is the size of the set.

Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald

Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1, ..., x_n from a universe U with a total order, an additive-error quantile sketch M allows us to approximate the rank of any query y ∈ U up to additive εn error. In 2001, Greenwald and Khanna gave a deterministic algorithm (the GK sketch) that solves the ε-approximate quantile estimation problem using O(ε^{-1} log(εn)) space; recently, this algorithm was shown to be optimal by Cormode and Veselý in 2020. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1)) ε^{-1} log(εn) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the (11/2) ε^{-1} log(εn) space bound of the original GK sketch. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(log(1/ε) + log log(εn)) per element for the ordinary ε-approximate quantile estimation problem. Also, for the related "weighted" quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(log(1/ε) + log log(εW_n / w_min)), improving over the previous upper bound of Assadi et al. (2023).
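
To fix what the ε-approximate guarantee means, here is the naive offline construction (our illustration, not the GK sketch): sort the data and keep every ⌈εn⌉-th element together with its rank. This stores O(1/ε) elements and answers any rank query within additive εn; the GK sketch achieves a comparable guarantee in a single pass with O(ε^{-1} log(εn)) space.

    import bisect, math

    def offline_sketch(xs, eps):
        # Keep every ceil(eps * n)-th element of the sorted data, with its rank.
        xs = sorted(xs)
        step = max(1, math.ceil(eps * len(xs)))
        return [(x, i) for i, x in enumerate(xs)][::step]

    def approx_rank(sketch, y):
        keys = [x for x, _ in sketch]
        j = bisect.bisect_right(keys, y) - 1
        return sketch[j][1] if j >= 0 else 0    # off by at most eps * n

    sketch = offline_sketch(range(1000), eps=0.01)
    print(len(sketch), approx_rank(sketch, 500))   # 100 stored elements; rank 500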

TypeQL: A Type-Theoretic & Polymorphic Query Language

Relational data modeling can often be restrictive as it provides no direct facility for modeling polymorphic types, reified relations, multi-valued attributes, and other common high-level structures in data. This creates many challenges in data modeling and engineering tasks, and has led to the rise of more flexible NoSQL databases, such as graph and document databases. In the absence of structured schemas, however, we can neither express nor validate the intention of data models, making long-term maintenance of databases substantially more difficult. To resolve this dilemma, we argue that, parallel to the role of classical predicate logic for relational algebra, contemporary foundations of mathematics rooted in type theory can guide us in the development of powerful new high-level data models and query languages. To this end, we introduce a new polymorphic entity-relation-attribute (PERA) data model, grounded in type-theoretic principles and accessible through classical conceptual modeling, with a near-natural query language: TypeQL. We illustrate the syntax of TypeQL as well as its denotation in the PERA model, formalize our model as an algebraic theory with dependent types, and describe its stratified semantics.

A faster FPRAS for #NFA

Given a non-deterministic finite automaton (NFA) A with m states and a natural number n (presented in unary), the #NFA problem asks to determine the size of the set L(A,n) of words of length n accepted by A. While the corresponding decision problem of checking the emptiness of L(A,n) is solvable in polynomial time, the #NFA problem is known to be #P-hard. Recently, the long-standing open question of whether there is an FPRAS (fully polynomial time randomized approximation scheme) for #NFA was resolved by Arenas, Croquevielle, Jayaram, and Riveros in [ACJR19]. The authors demonstrated the existence of a fully polynomial randomized approximation scheme with a time complexity of ~O(m^17 n^17 · 1/ε^14 · log(1/δ)), for a given tolerance ε and confidence parameter δ.

Given the prohibitively high time complexity in terms of each of the input parameters, and considering the widespread application of approximate counting (and sampling) in various tasks in Computer Science, a natural question arises: is there a faster FPRAS for #NFA that can pave the way for the practical implementation of approximate #NFA tools? In this work, we answer this question in the positive. We demonstrate that significant improvements in time complexity are achievable, and propose an FPRAS for #NFA that is more efficient in terms of both time and sample complexity.

A key ingredient in the FPRAS due to Arenas, Croquevielle, Jayaram, and Riveros [ACJR19] is the inter-reducibility of sampling and counting, which necessitates a closer look at a more informative measure: the number of samples maintained for each pair of state q and length i ≤ n. In particular, the scheme of [ACJR19] maintains O(m^7 n^7 / ε^7) samples per pair of state and length. In the FPRAS we propose, we systematically reduce the number of samples required for each state to be only poly-logarithmically dependent on m, with significantly less dependence on n and ε, maintaining only ~O(n^4 / ε^2) samples per state. Consequently, our FPRAS runs in time ~O((m^2 n^10 + m^3 n^6) · 1/ε^4 · log^2(1/δ)). The FPRAS and its analysis use several novel insights. First, our FPRAS maintains a weaker invariant about the quality of the estimate of the number of samples for each state q and length i ≤ n. Second, our FPRAS only requires that the distribution of the maintained samples be close to the uniform distribution in total variation distance (instead of the maximum norm). We believe our insights may lead to further reductions in time complexity, and thus open up a promising avenue for future work towards the practical implementation of tools for approximate #NFA.

Counting Answers to Unions of Conjunctive Queries: Natural Tractability Criteria and Meta-Complexity

We study the problem of counting answers to unions of conjunctive queries (UCQs) under structural restrictions on the input query. Concretely, given a class C of UCQs, the problem #UCQ(C) takes as input a UCQ Ψ ∈ C and a database D, and the task is to compute the number of answers of Ψ in D.

Chen and Mengel [PODS'16] have shown that for any recursively enumerable class C, the problem #UCQ(C) is either fixed-parameter tractable or hard for one of the parameterised complexity classes W[1] or #W[1]. However, their tractability criterion is unwieldy in the sense that, given any concrete class C of UCQs, it is not easy to determine how hard it is to count answers to queries in C. Moreover, given a single specific UCQ Ψ, it is not easy to determine how hard it is to count answers to Ψ.

In this work, we address the question of finding a natural tractability criterion. The combined conjunctive query of a UCQ Ψ = φ_1 ∨ ... ∨ φ_l is the conjunctive query Ψ̂ = φ_1 ∧ ... ∧ φ_l. We show that, under natural closure properties of C, the problem #UCQ(C) is fixed-parameter tractable if and only if the combined conjunctive queries of UCQs in C, and their contracts, have bounded treewidth. A contract of a conjunctive query is an augmented structure that takes into account how the quantified variables are connected to the free variables. If all variables are free, then a conjunctive query is equal to its contract; in this special case, the criterion for fixed-parameter tractability of #UCQ(C) thus simplifies to the combined queries having bounded treewidth.

Finally, we give evidence that a closure property on C is necessary for obtaining a natural tractability criterion: we show that, even for a single UCQ Ψ, the meta-problem of deciding whether #UCQ({Ψ}) can be solved in time O(|D|^d) is NP-hard for any fixed d ≥ 1. Moreover, we prove that a known exponential-time algorithm for solving the meta-problem is optimal under assumptions from fine-grained complexity theory. As a corollary of our reduction, we also establish that approximating the Weisfeiler-Leman dimension of a UCQ is NP-hard.