SIGMOD 2024: Keynote Talks
Keynote Speaker 1: Ricardo Baeza-Yates, Institute for Experiential AI, Northeastern University
The Limitations of Data, ML & Us
Abstract
Machine learning (ML), particularly deep learning, is being used everywhere. However, it is not always used well, either ethically or scientifically. In this talk we first take a deep dive into the limitations of supervised ML and of data, its key component. We cover small data, datification, bias, predictive optimization issues, evaluating success instead of harm, and pseudoscience, among other problems. The second part is about our own limitations in using ML, including different types of human incompetence: cognitive biases, unethical applications, lack of administrative competence, copyright violations, misinformation, and the impact on mental health. In the final part we discuss regulation of the use of AI and responsible AI principles, which can mitigate the problems outlined above.
Bio
Ricardo Baeza-Yates is Director of Research at the Institute for Experiential AI of Northeastern University, as well as a part-time professor at the Dept. of Computer Science of the University of Chile. Previously, he was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from 2006 to 2016. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 1999 and 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was an elected member of the Board of Governors of the IEEE Computer Society, and between 2012 and 2016 he was an elected member of the ACM Council. In 2009 he was named an ACM Fellow and in 2011 an IEEE Fellow, among other awards and distinctions. He obtained a Ph.D. in CS from the University of Waterloo, Canada, and his areas of expertise are responsible AI, web search, and data mining, as well as data science and algorithms in general.
Keynote Speaker 2: Luna Dong, Meta Reality Labs
The Journey to a Knowledgeable Assistant with Retrieval-Augmented Generation (RAG)
Abstract
For decades, multiple communities (Database, Information Retrieval, Natural Language Processing, Data Mining, AI) have pursued the mission of providing the right information at the right time. Efforts span web search, data integration, knowledge graphs, and question answering. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in comprehending and generating human language, revolutionizing techniques on every front. However, their inherent limitations, such as factual inaccuracies and hallucinations, make LLMs less suitable for creating knowledgeable and trustworthy assistants.
This talk describes our journey in building a knowledgeable AI assistant by harnessing LLM techniques. We start with our findings from a comprehensive set of experiments that assess LLM reliability in answering factual questions and analyze performance variations across different knowledge types. Next, we describe our federated Retrieval-Augmented Generation (RAG) system, which integrates external information from both the web and knowledge graphs for trustworthy text generation on real-time topics like stocks and sports, as well as on torso-to-tail entities like local restaurants. Additionally, we briefly describe our explorations in extending our techniques towards multi-modal, contextualized, and personalized Q&A. We will share our techniques, our findings, and the path forward, highlighting how we are leveraging and advancing the decades of work in this area.
Bio
Xin Luna Dong is a Principal Scientist at Meta Reality Labs, leading the ML efforts in building an intelligent personal assistant. She has spent more than a decade building knowledge graphs, such as the Amazon Product Graph and the Google Knowledge Graph. She has co-authored the books "Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases" and "Big Data Integration". She was named an ACM Fellow and an IEEE Fellow for "significant contributions to knowledge graph construction and data integration", and received the VLDB Women in Database Research Award and the VLDB Early Career Research Contribution Award. She serves on the PVLDB advisory committee, was a member of the VLDB Endowment, and was a PC co-chair for the KDD’2022 ADS track, WSDM’2022, VLDB’2021, and SIGMOD’2018.
Keynote Speaker 3: Peter Boncz, CWI Amsterdam and MotherDuck
Making Data Management Better with Vectorized Query Processing
Abstract
Vectorized query processing is a query processing technique that was introduced 20 years ago in the Vectorwise database system, originally developed in the CWI database architectures research group. Nowadays, it is employed by most analytical data systems, from recent systems like DuckDB, ClickHouse, Velox, DataFusion and Polars to analytical cloud services such as Snowflake, BigQuery, Databricks and MotherDuck. This keynote will recap the original idea of vectorized query processing and describe how the technique evolved over two decades and how and why it was adopted by industry. The keynote will conclude with a reflection on the purpose of doing research ("making data management better").
Bio
Peter Boncz holds appointments as tenured researcher at CWI and professor at VU University Amsterdam. His academic background is in database systems, with the open-source column store MonetDB being the outcome of his PhD. He has a track record in bridging the gap between academia and commercial application, founding multiple startups. In 2008 he co-founded Vectorwise around the analytical database system of the same name, which pioneered vectorized query execution and lightweight data compression, both of which have been adopted broadly in analytical database systems. He created the Linked Data Benchmark Council (LDBC), a non-profit advancing graph database technology. He is currently on sabbatical at MotherDuck, a startup that is connecting DuckDB, the latest database system born at CWI, to the cloud.