Feb 25, 2024 10:00 AM

LLM Summarization: Easy Task, Impossible Evaluation

Join us for an engaging session on improving large language models (LLMs) for autonomous agents. We'll explore the challenge of making LLM summaries both accurate and trustworthy, a prerequisite for building smarter, more reliable agents. The discussion promises fresh insights into advancing AI's ability to evaluate and refine its own outputs, a key step towards more independent and effective autonomous systems.

Agenda:

πŸ‘‹ Quick Welcome (5 mins) - Kicking off the session with a warm greeting and an outline of what to expect.

πŸ§‘β€πŸ’Ό Brief Introduction from Participants (1 min each) - Participants share their name, location, occupation, and one expectation from the call.

πŸ€– Overview of Summarization as a Theme for the Session (5 mins) - Diving into why we're focusing on summarization and the challenges of evaluating large language model summaries.

πŸ“„ Overview of the Papers Discussed (5 mins) - Quick summary of the key papers we'll be exploring:

   πŸ“‘ Paper 1: "Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model"

   https://arxiv.org/abs/2402.12821

   🌐 Paper 2: "TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness" (a toy consistency sketch follows the agenda)

   https://arxiv.org/abs/2402.12545

   πŸ”„ Bonus Reading: "How to evaluate a summarization task" (see the scoring sketch after the agenda)

   https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization

πŸ—£οΈ Introduction by the Community Agent (5 mins) - A brief introduction to the session by the AI agent, setting the stage for the discussions.

❓ Q&A from the Research Agent (15 mins) - An interactive Q&A with questions prepared by the AI agent to stimulate discussion and deepen understanding.

πŸŽ‰ After-party (15 mins) - An informal chat to relax, network, and discuss AI, life, and everything in between.
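
For anyone who wants to experiment before the call, here is a toy, reference-free consistency check in the spirit of Paper 2's TrustScore idea: sample several answers to the same question and measure how much they agree. This is a sketch under stated assumptions, not the authors' implementation, and `generate_fn` is a hypothetical stand-in for your own model call.

```python
# Toy self-consistency check, loosely inspired by the reference-free idea
# in the TrustScore paper (arXiv:2402.12545). Not the authors' method.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise string similarity across sampled answers (0 to 1)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def trust_check(question: str, generate_fn, n_samples: int = 5) -> float:
    """Sample n answers from a model and score their mutual agreement."""
    # `generate_fn` is a hypothetical callable: prompt in, answer out.
    answers = [generate_fn(question) for _ in range(n_samples)]
    return consistency_score(answers)


# Canned answers stand in for real model samples so the sketch runs as-is.
samples = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is located in Paris, France.",
    "It is in Paris.",
]
print(f"agreement: {consistency_score(samples):.2f}")
```

Low agreement is only a rough proxy for trustworthiness; the paper itself goes well beyond string matching.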

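As a taste of the bonus reading, here is a minimal reference-based scoring sketch, assuming the `rouge_score` package (pip install rouge-score). The example texts are made up; the linked cookbook covers this family of metrics and model-based alternatives in more depth.

```python
# Minimal ROUGE evaluation of a candidate summary against a reference.
from rouge_score import rouge_scorer

reference = "The committee approved the budget after a two-hour debate."
candidate = "After two hours of debate, the committee passed the budget."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # (target, prediction)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```

Note that ROUGE rewards lexical overlap, which is exactly why reference-free and model-based evaluation, the theme of this session, is getting so much attention.
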
Open-source non-profit research lab
While you are here, join our community!
Evaluate autonomous agents, collaborate on real-world business process challenges, and drive the industry forward with benchmarks
πŸ› οΈ Evaluate your agent solutions against real-world problems instead of research datasets and fancy demos

πŸ’Έ We are a non-profit, but your agents will most likely be sponsored by real-world companies that end up using them in their operations

🎯 Win prizes for solving and automating challenging business processes funded by sponsors

🀝 Contribute to a non-profit, 501(c)(3) open-source solution for testing agent applications and support an independent ecosystem of tools