Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models Workshop

Overview

With the increasing use of machine-learning-driven algorithmic judgements to support or automate real-world systems and decision making, it is critical to understand model strengths and weaknesses beyond what traditional aggregate performance evaluations reveal. This is particularly critical for massive-scale, multimodal, or multipurpose (adapted or adaptable to multiple downstream tasks) models that are rapidly becoming the foundation for many applications in computational linguistics (e.g., BERT, GPT2, GPT3, CLIP). Especially as these large, pretrained models are adapted for use in, or serve as the foundation of models for, a wide variety of applications, we need evaluations beyond aggregate performance to support informed use and to identify and mitigate biases.

We need to understand not only how a model performs overall but also why or under what circumstances the model will perform unreliably, and how interpretable the model and its outputs are in the context of its use. Additionally, as new computational paradigms such as quantum computing are introduced, what new insights, challenges, and capabilities do models using these paradigms bring to model evaluations? Can we leverage these advances for more comprehensive evaluations?

This workshop will focus on:

Evaluations that encompass multiple principles of responsible AI/ML (e.g., robustness, accountability, fairness, transparency, interpretability) for multimodal, multipurpose, and massive-scale models; and
Research that examines the differences in evaluations for narrow (single modality, single task) and multipurpose or multimodal models, and nontraditional methods, measures, or metrics that can be leveraged.

The workshop will include keynote talks, a panel discussion on open challenges and pathways forward for trusted and responsible evaluations of massive-scale, multimodal, or multipurpose models, a series of contributed talks highlighting paper submissions, and a poster session to engage authors and attendees in more detailed discussions.

Organizers

Maria Glenski, Pacific Northwest National Laboratory
Yulia Tsvetkov, University of Washington
Megan Kohagen, Quantinuum
Vidhisha Balachandran, Carnegie Mellon University

Workshop Email: mmmpie.workshop@gmail.com

Call for Papers

We welcome submissions that focus on performance and interpretability analyses or evaluation methods for multimodal, multipurpose, or massive-scale models. Topics include but are not limited to:

Multimodal or Massive-Scale (Single Task Models): Analyses that examine differences in evaluations of narrow (single modality, single task) and multipurpose AI and the limitations/challenges of existing metrics/benchmarks.
Multipurpose Models or Emergent Behavior (Models supporting Multiple Tasks): Analyses that evaluate performance or behavior of multipurpose models (e.g., foundation models, neural platforms) that can support multiple tasks across single or multiple modalities. This includes zero-shot, few-shot, and finetuned tasks as well as emergent behavior.
Nontraditional Evaluation/Interpretability Methods: Nontraditional methods, measures, or metrics targeting limitations of existing evaluation and interpretability of multimodality, multiple tasks (multipurpose), or emergent behavior.

Submissions may incorporate the following:

Evaluations that encompass multiple principles of responsible AI/ML (e.g., robustness, accountability, fairness, transparency, interpretability) for multimodal, multipurpose, or massive-scale models
Domain-driven evaluations for performance or interpretability needs of different use cases (e.g., commercial, academic) and users (e.g., researchers, domain scientists, practitioners, students)

The workshop will be organized around the type of complexity (multimodal, multipurpose, or massive-scale) of the AI models under analysis, or of the models for which nontraditional evaluation methods are developed or traditional evaluation/interpretability methods are extended.

Special Themes for Position Papers “Multimodal, Multipurpose, or Massive-scale models and Beyond”

We are delighted to seek submissions for special themes that reflect progress and future directions in this area. These themes will include presentations and panel discussions around open questions, major obstacles, and integration of new techniques for NLP. These themes lend themselves to position papers; however, contributions with empirical evidence are encouraged as well.

Quantum Natural Language Processing: How does or will quantum natural language processing offer insight for evaluation and interpretability with NLP models? How does it align or not align with current evaluation and interpretability?
Interpretability-Performance Tradeoffs: How should developers, end users, or evaluators handle complementary or competing constraints when considering both performance and interpretability/explainability needs and requirements?

Topics

We encourage submissions from a wide range of topics which include but are not limited to:

Interpretability for large language models
Multi-modal applications and models
Efficiency in the large-model paradigm
Multi-faceted evaluation of large models
Quantum Natural Language Processing
Domain-specific adaptations of large language models

Submission Requirements

The workshop will accept ARR-reviewed papers as well as direct submissions to the SoftConf portal (https://www.softconf.com/coling2022/PIEM3SM/).

Any ARR-reviewed paper that has all of its reviews and meta-reviews available by the workshop submission deadline (July 24, 2022; originally July 15, 2022) can be committed to the workshop. Submissions from ARR cannot be modified except that they can be associated with an author response.
Any non-ARR paper can be directly submitted to the workshop’s submission portal by July 24, 2022 (originally July 15, 2022).

Format: Submissions must follow COLING 2022 formatting: up to nine (9) pages, excluding references, for long papers, and up to four (4) pages, excluding references, for short papers (including position papers submitted to address the special themes). For both long and short papers, the abstract should be no more than 200 words.

Important Dates:

All deadlines are midnight UTC-12 (Anywhere on Earth).

Papers Due (via Softconf): July 24, 2022 (originally Friday, July 15, 2022)
Commitment of ARR Reviews by: July 24, 2022 (originally Friday, July 15, 2022)
Notification of Acceptance: September 7, 2022 (Wednesday)
Camera-ready papers due: September 14, 2022 (Wednesday)
Workshop date: October 17, 2022
COLING Conference dates: October 12-17, 2022

Workshop Schedule

Pacific (PDT)	Eastern (EDT)	British Summer Time (BST)	Indian Standard Time (IST)	Korea Standard Time (KST)	Session
3:00 PM	6:00 PM	11:00 PM	3:30 AM	7:00 AM	Workshop Opening & Introductions
3:30 PM	6:30 PM	11:30 PM	4:00 AM	7:30 AM	Keynote
4:30 PM	7:30 PM	12:30 AM	5:00 AM	8:30 AM	Break (15 minutes)
4:45 PM	7:45 PM	12:45 AM	5:15 AM	8:45 AM	Panel on Challenges and Opportunities in Evaluation and Interpretability of Multimodal, Multipurpose, and Massive-Scale Models
5:45 PM	8:45 PM	1:45 AM	6:15 AM	9:45 AM	Break (15 minutes)
6:00 PM	9:00 PM	2:00 AM	6:30 AM	10:00 AM	Introduction of Accepted Papers Lightning Talks
6:10 PM	9:10 PM	2:10 AM	6:40 AM	10:10 AM	Lightning Talk - On the Effects of Video Grounding on Language Models
6:20 PM	9:20 PM	2:20 AM	6:50 AM	10:20 AM	Lightning Talk - Rethinking Task Sampling for Few-shot Vision-Language Transfer Learning
6:30 PM	9:30 PM	2:30 AM	7:00 AM	10:30 AM	Lightning Talk - Pixel-Level BPE for Auto-Regressive Image Generation
6:40 PM	9:40 PM	2:40 AM	7:10 AM	10:40 AM	Lightning Talk - Cost-Effective Language Driven Image Editing with LX-DRIM
6:50 PM	9:50 PM	2:50 AM	7:20 AM	10:50 AM	Lightning Talk - Shapes of Emotions: Multimodal Emotion Recognition in Conversations via Emotion Shifts
7:00 PM	10:00 PM	3:00 AM	7:30 AM	11:00 AM	Lightning Talk - Analyzing BERT Cross-lingual Transfer Capabilities in Continual Sequence Labeling
7:10 PM	10:10 PM	3:10 AM	7:40 AM	11:10 AM	Workshop Closing