Accepted Tutorials - PAKDD 2024

Tutorial T1 – Addressing Geographical Biases in Language Models: A Practical Tutorial

Speakers: Rémy Decoupes (Inrae, TETIS, France), Maguelonne Teisseire (Inrae, TETIS, France)

Abstract:
As language models become indispensable tools in professional domains such as writing, coding, and learning, it is paramount to address inherent biases. Within Natural Language Processing (NLP), biases stemming from data, annotation, representation, models, and research design have been extensively studied. This tutorial delves into the specific realm of biases related to geographical knowledge.

Focusing on the intersection of geography and language models, the study reveals the proclivity of language models to misrepresent spatial information, resulting in distortions in geographical distances. To assess these disparities, the tutorial introduces four indicators that compare both geographical and semantic distances. These indicators serve as practical metrics for evaluating the geographical biases present in language models.

In the practical session, ten widely used language models will be analyzed with the introduced indicators. The tutorial aims to equip practitioners and researchers with a comprehensive understanding of how geographical biases manifest in language models and to provide practical insights into addressing these biases for the sake of accurate and equitable linguistic representations. Through the exploration of real-world experiments, this tutorial offers actionable steps and methodologies to mitigate geographical distortions, promoting fairness and effectiveness in language model applications.

Participants will not have to install anything on their machines: all practical work will take place within notebooks via Google Colab links shared by the presenters.
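To make the idea of comparing geographical and semantic distances concrete, here is a minimal sketch. It is not one of the tutorial's four indicators, which are not specified in this abstract. It assumes a simple Pearson correlation between great-circle distances and cosine distances, and the embedding vectors are hypothetical toy values, not outputs of any real language model.

```python
import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# City coordinates (lat, lon) and HYPOTHETICAL toy embeddings (illustration only).
cities = {
    "Paris":  ((48.8566, 2.3522),   [0.9, 0.1, 0.2]),
    "London": ((51.5074, -0.1278),  [0.8, 0.2, 0.3]),
    "Tokyo":  ((35.6762, 139.6503), [0.1, 0.9, 0.4]),
}

geo, sem = [], []
for a, b in combinations(cities, 2):
    (lat_a, lon_a), emb_a = cities[a]
    (lat_b, lon_b), emb_b = cities[b]
    geo.append(haversine_km(lat_a, lon_a, lat_b, lon_b))
    sem.append(cosine_distance(emb_a, emb_b))

# A value near 1 would mean semantic distances track geographical ones.
correlation = pearson(geo, sem)
print(f"geo/semantic distance correlation: {correlation:.3f}")
```

In practice the embeddings would come from the language model under study, and a low or unstable correlation across regions would be one possible sign of geographical distortion.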

Tutorial T2 – Beyond Point Estimates: Theory & Applications of Uncertainty Estimations

Speakers & Organizers: Arunita Das (Amazon, Bengaluru, KA, India), Sanyog Dewani (Amazon, Gurgaon, HR, India)
Organizers: Srujana Merugu (Amazon, Bengaluru, KA, India), Gokul Swamy (Amazon, Seattle, WA, USA)

Abstract:
Most ML models are trained via maximum likelihood estimation (MLE), which yields point estimates that drive business-critical decisions. Despite being trained on sizeable data and performing well on validation samples, most of these models exhibit inordinately high variance for certain predictions, which leads to erroneous decision making and thereby reduces trust in the model. Modeling uncertainty and using uncertainty estimates alongside model predictions is a crucial step toward solving this issue and building robust ML models with improved performance. Uncertainty-augmented decision making can prevent bot or adversarial attacks and additionally creates avenues for human-in-the-loop decision making. In this tutorial, we will first build upon the fundamentals of uncertainty estimation and then present SOTA research on a) uncertainty estimation algorithms, b) evaluation metrics for uncertainty estimation methods, and c) applications of uncertainty estimates, with a special focus on supervised problems. Our emphasis is on providing pointers to the theoretical settings with utmost clarity on the intuition behind them. The tutorial will also include a hands-on session in which we will build uncertainty estimation models for different real-world ML problems using publicly available datasets.

Tutorial T3 – Machine Learning for Streaming Data

Speakers: Heitor Murilo Gomes (School of Engineering and Computer Science, Victoria University of Wellington, New Zealand), Yibin Sun (AI Institute, University of Waikato, New Zealand)

Abstract:
Machine learning for data streams (MLDS) has been a significant research area since the late 1990s, with increasing adoption in industry over the past few years. Despite commendable efforts in open-source libraries, a gap persists between pioneering research and accessible tools, presenting challenges for practitioners, including experienced data scientists, in implementing and evaluating methods in this complex domain. Our tutorial addresses this gap with a dual focus. We discuss advanced research topics such as partially delayed labeled streams while providing practical demonstrations of their implementation and assessment using Python. By catering to both researchers and practitioners, this tutorial aims to empower users in designing and conducting experiments and in extending existing methodologies. We will set up a webpage with the presentation and materials. Previous tutorials and slides by the organisers can be found at the link below. This particular tutorial has not been presented elsewhere yet.
http://heitorgomes.com/talks/

Tutorial T4 – Heterogeneity in Federated Learning

Speakers: Jiaqi Wang (The Pennsylvania State University, US), Fenglong Ma (The Pennsylvania State University, US)

Abstract:
Federated learning is a distributed machine learning paradigm that enables multiple participants to cooperate in training machine learning models without sharing data. Heterogeneity is one of the main challenges in federated learning. To address this challenge, this tutorial covers state-of-the-art federated learning techniques for handling heterogeneity. In particular, we focus on the following three aspects: (1) providing a comprehensive review of heterogeneity challenges in federated learning from three perspectives, including data heterogeneity, model heterogeneity, and system heterogeneity; (2) introducing cutting-edge techniques to solve the heterogeneity issue in federated learning from both algorithm and application perspectives; and (3) identifying open challenges and proposing convincing future research directions in heterogeneous federated learning. We believe this is an emerging and potentially high-impact topic in distributed machine learning, which will attract both researchers and practitioners from academia and industry.

Tutorial T5 – Effective Model Reduction for Edge Intelligence

Speakers: Ting-An Chen (Graduate Institute of Electrical Engineering, National Taiwan Univ., Taiwan / Institute of Information Science, Academia Sinica, Taiwan), De-Nian Yang (Institute of Information Science, Academia Sinica, Taiwan), Ming-Syan Chen (Department of Electrical Engineering, National Taiwan Univ., Taiwan)

Abstract:
The tutorial explores methods for effective model reduction of vision and language inference models, emphasizing the challenges posed by their large sizes when deployed on resource-constrained devices. We shall introduce effective model reduction strategies, such as structure simplification, pruning, and quantization, that make the models applicable to edge devices. Recent advancements in quantization techniques for non-iid data are highlighted for their contributions to real-world scenarios including domain shift, class imbalance, and streaming. It is envisioned that techniques for model reduction will be of growing importance as computing power increases and energy saving gains more emphasis in the years to come. The tutorial provides a comprehensive understanding of model reduction for edge intelligence on large models, along with insights into recent research advancements and practical applications.

Tutorial T6 – Data Mining in Big Dynamic Networks

Speaker: Prof. Ernst C. Wit (Università della Svizzera italiana, Switzerland)

Abstract:
Many automatic monitoring systems generate big dynamic network data, also called relational data: from invasive species diffusion across the globe (10-100K), bike-sharing rides between bike stations (100K-1M) to patent citations of novel technologies (10M-100M). The aim in analysing these data is typically to discover what drives the interactions in order to find effective strategies to, respectively, control invasive species, predict bike sharing at any location at any time, and develop technological innovation.

This workshop explores the advancements in relational event modelling (REM) within the context of time-stamped relational data, commonly generated by email exchanges and social media interactions.

The session begins with an introduction to REMs, emphasizing their application in identifying drivers of processes involving temporally ordered events. It delves into the extension of traditional network statistics in REMs, encompassing degree-based metrics and intensity-based counterparts, along with distinguishing short- and long-term network dynamics. The workshop progresses to mixed effect additive REMs, demonstrating how to integrate non-linear specifications and time-varying covariate influences. Reciprocity and triadic effects are revisited with a focus on dynamic structures, challenging assumptions of stability over time. Global covariates, previously challenging in traditional REMs, are addressed, allowing the inclusion of factors like weather or time-of-day. The workshop concludes by exploring strategies to efficiently apply REMs to huge datasets, overcoming computational complexities in ways that involve modern machine learning techniques.

Each session involves a concise explanatory segment followed by an extensive hands-on computer practical; participants are encouraged to bring their laptops with RStudio pre-installed.

Tutorial T7 – PAMI: An Open Source Python Library for Pattern Mining in Big Data

Speaker: RAGE Uday Kiran (The University of Aizu, Japan)

Abstract:
This tutorial offers a comprehensive exploration of pattern mining, focusing on the newly developed PAMI library, designed to unearth valuable insights hidden within big data. Beginning with an introduction to pattern mining’s significance and the variety of patterns it uncovers, the tutorial delves into database types and associated patterns, followed by an overview of existing tools like SPMF and FIMI. Subsequently, it introduces PAMI’s features, algorithms, and cross-platform compatibility, detailing its application in discovering patterns within diverse datasets, exemplified through the analysis of Japan’s air pollution data. Moreover, the tutorial illustrates how PAMI seamlessly integrates with prominent machine learning libraries such as TensorFlow, Scikit-learn, and PyTorch, showcasing its synergy with machine learning for enhanced knowledge discovery. Concluding with insights into future research directions, this tutorial equips participants with practical skills and insights to harness pattern mining for unlocking actionable intelligence from big data.