AI Use Cases · 18 Feb, 2025

AI Contract Analysis: Qwen Outperforms Llama in Obligation Identification

Contract analysis is a critical, yet often error-prone, process requiring meticulous attention to detail. While AI models promise to streamline this process, their effectiveness in accurately identifying contractual obligations remains a key concern.

  • Contract analysis is a high-stakes, detail-intensive task, and while AI models offer promise, accurately identifying contractual obligations remains a major challenge.

  • This study evaluates the performance of leading AI models like Qwen and Llama, uncovering both their strengths and limitations in handling complex legal language.

  • An analysis of over 100 contracts, using a standardized prompt covering 122 obligations, shows how models can misinterpret or overlook clauses and underlines the need for robust evaluation.

  • The findings point toward the potential of AI in contract analysis, while also identifying key areas for future improvement in obligation recognition and legal language comprehension.

The Innovation Breakthrough

Figure 1: Accuracy rates for each obligation

This research compared two leading AI models, Qwen2.5-72B-Instruct and Llama3.3-70B, on their ability to identify contractual obligations. Qwen achieved a fractional accuracy of 79.7% against Llama's 75.0%, as shown in Figure 1. Fractional accuracy is obtained by analyzing each obligation within a clause individually, rather than marking the whole clause simply correct or incorrect. Qwen also handled complex sentence structures and negative clauses better; these often confused Llama.

Figure 2: Binary accuracy comparison

Figure 2 compares the binary accuracy rates of Qwen and Llama. A value of 1 represents a correct answer and 0 an incorrect one. Here, a response scores 1 only if it is completely accurate.

  • Qwen Average Accuracy Rate: 63.3%

  • Llama Average Accuracy Rate: 60.1%
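The difference between the two metrics can be made concrete with a small sketch. The study does not publish its scoring formula, so the code below assumes that a clause's fractional score is the overlap between the annotated and the predicted obligation sets (intersection over union), and that the binary score is 1 only on an exact match. The clause data and obligation labels are invented for illustration.

```python
from statistics import mean


def fractional_score(expected: set[str], predicted: set[str]) -> float:
    """Partial credit: shared obligations divided by all obligations seen.

    Assumption: intersection over union. The study may weight missed and
    spurious obligations differently.
    """
    union = expected | predicted
    if not union:
        return 1.0  # nothing expected and nothing predicted
    return len(expected & predicted) / len(union)


def binary_score(expected: set[str], predicted: set[str]) -> int:
    """1 only when the prediction matches the annotation exactly."""
    return int(expected == predicted)


# Hypothetical (expected, predicted) pairs, one per clause.
clauses = [
    ({"declare_non_default", "declare_no_litigation"}, {"declare_no_litigation"}),
    ({"non_interference"}, {"non_interference"}),
    ({"payment"}, {"payment", "provide_information"}),
    ({"confidentiality"}, set()),
]

fractional_accuracy = mean(fractional_score(e, p) for e, p in clauses)
binary_accuracy = mean(binary_score(e, p) for e, p in clauses)

print(f"fractional: {fractional_accuracy:.1%}, binary: {binary_accuracy:.1%}")
```

Under these assumptions a clause can never score lower on the fractional metric than on the binary one, which is consistent with the fractional figures in this study being higher than the binary figures.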

Operational Benefits and Competitive Advantage

By correctly identifying obligations, Qwen can reduce the risk of non-compliance, which can lead to significant financial penalties. The improved accuracy makes contract review more efficient, potentially saving hours of manual review time, and Qwen's ability to identify multiple obligations within a single clause supports a more comprehensive understanding of contractual terms. Together, these gains offer a competitive edge through lower risk and better operational efficiency.

The study also recorded which of the two models' responses users preferred. These selections are meant to reveal user preference trends and to compare the effectiveness of the two models; the findings are visualized below.

Figure 3: Qwen vs Llama selection rates (Note: In the study, no selection was made when the models' responses were the same or very similar, or when both models produced incorrect answers.)
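Selection rates of this kind can be tallied as follows. This is a minimal sketch, not the study's actual tooling: it assumes each comparison is recorded as one of three hypothetical labels ("qwen", "llama", or "none" when no selection was made) and that rates are taken over all comparisons, including those without a selection. The sample decisions are invented.

```python
from collections import Counter

OUTCOMES = ("qwen", "llama", "none")  # hypothetical outcome labels


def selection_rates(choices: list[str]) -> dict[str, float]:
    """Share of comparisons won by each model, plus the share with no pick."""
    if not choices:
        raise ValueError("no comparisons recorded")
    unknown = set(choices) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    counts = Counter(choices)
    return {outcome: counts[outcome] / len(choices) for outcome in OUTCOMES}


# Hypothetical reviewer decisions, one per compared response pair.
choices = ["qwen", "qwen", "llama", "none", "qwen", "none", "llama", "none"]
rates = selection_rates(choices)
print(rates)
```

Keeping "none" as its own outcome makes it visible how often neither response was preferred, instead of silently shrinking the denominator.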

Technical Implementation

The study used a dataset of over 100 contracts, focusing on the models' ability to identify various types of obligations. Each obligation a model identified was analyzed in detail and compared against the actual contractual terms. The primary limitation of the study is that it focuses on Turkish-language contracts, so results may vary in other languages. Let’s discuss the problems encountered with Qwen and Llama:

Problems Encountered with Llama

Identifying Only One Obligation per Contract Clause: A single provision may contain more than one obligation. However, Llama usually identifies only one of them. For example:

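“...işbu Sözleşme tahtındaki borçlarını ve yükümlülüklerini ifa kabiliyetini olumsuz etkileyebilecek, Devralan'ın Devir tahtındaki menfaatlerine halel getirebilecek kendisi için bağlayıcı olan veya malvarlığını etkileyen bir anlaşma veya belge tahtında temerrüde düşmediğini veya bilgisi dahilinde bu nitelikte bir dava, icra veya iflas takibi bulunmadığını;”

(Approximate English rendering: “...that it is not in default under any agreement or document that is binding on it or affects its assets and that could adversely affect its ability to perform its debts and obligations under this Agreement or prejudice the Transferee's interests under the Transfer, and that, to its knowledge, there is no lawsuit, enforcement or bankruptcy proceeding of this nature;”)

This provision contains both an obligation to declare non-default and an obligation to declare the absence of litigation, yet Llama identified only the latter and disregarded the former.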

Stating Obligations Not Present in the List of Obligation Types: Llama may identify an obligation that is not found in the existing list of obligations. For example:

“bu minvalde Reklam Alanları etrafında yürütülmekte olan diğer inşaat faaliyetlerini engelleyici herhangi bir eylem ve davranışta bulunmayacaktır.”

(Approximate English rendering: “in this regard, it shall not engage in any act or conduct that would obstruct other construction activities being carried out around the Advertising Areas.”)

This provision specifies a non-interference obligation, but Llama reports an "obligation to not create obstacles". That label does not appear word-for-word in the list of obligations we defined.
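Outputs like this can be flagged automatically by checking each reported label against the fixed list of obligation types. The sketch below is illustrative only: the two allowed labels are hypothetical stand-ins for the study's full list, and the matching rule (case-insensitive, whitespace-normalized exact match) is an assumption.

```python
def normalize(label: str) -> str:
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.casefold().split())


def split_by_taxonomy(
    predicted: list[str], allowed: set[str]
) -> tuple[list[str], list[str]]:
    """Separate predicted labels into those on the allowed list and the rest."""
    allowed_norm = {normalize(a) for a in allowed}
    known: list[str] = []
    unknown: list[str] = []
    for label in predicted:
        target = known if normalize(label) in allowed_norm else unknown
        target.append(label)
    return known, unknown


# Hypothetical excerpt of the obligation list and of a model's output.
ALLOWED_TYPES = {"Non-interference obligation", "Obligation to provide information"}
predicted = ["non-interference  obligation", "Obligation to not create obstacles"]

known, unknown = split_by_taxonomy(predicted, ALLOWED_TYPES)
print("on the list:", known)
print("not on the list:", unknown)
```

An exact-match check only detects that a label is off-list; deciding whether an off-list label is a paraphrase of a listed obligation still needs human review or a separate mapping step.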

Identifying Obligations with Key Words: Llama focuses on the individual words of a provision when determining obligations. Apart from fixed expressions, it usually restates the obligation in the provision's own words without considering their meaning. For example:

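“Devir Hesabı Devralan nezdinde Devralan adına açılmış ve Devredilen Haklar ve Alacaklar'ın işbu Sözleşme Madde 2.1 (Devir) tahtında açıklandığı şekilde yatırılacağı…”

(Approximate English rendering: “The Transfer Account, opened with the Transferee in the Transferee's name, into which the Transferred Rights and Receivables will be deposited as described under Article 2.1 (Transfer) of this Agreement…”)

In the above provision, Llama interprets the text as an obligation regarding a transfer account simply because the sentence starts with "Devir Hesabı" (Transfer Account).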

Problems Encountered with Qwen

Not Being Proficient in Turkish Grammar: Qwen has difficulty understanding obligations in contracts written in Turkish. It states obligations without properly parsing complex sentences and negative suffixes. For example:

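“...bu Sözleşme'nin akdedilmesinden önce Devralan'a yazılı olarak bildirilmiş olan değişiklikler dışında tadil edilmediklerini…”

(Approximate English rendering: “...that they have not been amended, other than the changes notified in writing to the Transferee before the execution of this Agreement…”)

Here, Qwen states that there is an obligation not to amend the contract, whereas the provision declares that no amendments have been made beyond those already notified.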

Identifying Obligations with Key Words: As with Llama, Qwen focuses on words rather than meanings, and it makes this mistake more often than Llama does. For example:

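“İmza Tarihi itibarıyla Reklam Alanları'nı ve onun çevresini etkileyebilecek riskler ve diğer durumlara ilişkin her türlü gerekli bilgi ve veriyi denetlediğini, incelediğini ve bunlara yönelik gerekli araştırmaları yaptığını kabul, beyan ve taahhüt eder.”

(Approximate English rendering: “accepts, declares and undertakes that, as of the Signing Date, it has audited and examined all necessary information and data regarding the risks and other circumstances that may affect the Advertising Areas and their surroundings, and has carried out the necessary investigations regarding them.”)

In the example above, because the word "bilgi" (information) appears, Qwen interprets the provision as an obligation to provide information.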

Identifying Unrelated Obligations: Qwen sometimes extracts obligations that are entirely unrelated to the provision. For example:

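“...kendisi ile ilgili olarak İmza Tarihi'nde ve Sözleşme süresi boyunca, İşveren'e karşı yukarıda belirtilen beyan ve taahhütlerde bulunmaktadır ve söz konusu beyan ve taahhütlerin YİD'in Süresi boyunca geçerli olacağını kabul, beyan ve taahhüt etmektedir.”

(Approximate English rendering: “...makes, with respect to itself, the above representations and undertakings to the Employer on the Signing Date and throughout the term of the Agreement, and accepts, declares and undertakes that those representations and undertakings will remain valid throughout the Term of the YİD.”)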

Qwen interprets the relevant provision as an obligation to declare the absence of litigation.

Deriving Obligations from Incomplete Expressions: In some provisions, the model derives an obligation from a fragment that does not express a complete meaning. For example:

“herhangi bir durumda herhangi bir Ardıl Alacaklı'nın tazminat veya eski hale iade talebi;”

(Approximate English rendering: “in any case, any claim by any Subordinated Creditor for compensation or restitution;”)

Although this fragment does not express any obligation, the model responds, when asked, that there is an obligation to compensate.

Key Takeaways

  • Qwen outperformed Llama in contractual obligation identification, achieving higher accuracy in both fractional (79.7%) and binary (63.3%) metrics, showcasing stronger comprehension of complex legal language.

  • Qwen demonstrated better handling of multi-obligation clauses and negative sentence structures, while Llama frequently missed secondary obligations or relied too heavily on surface-level keywords.

  • Both models encountered notable limitations, including misinterpretations due to keyword dependence and issues with Turkish grammar, emphasizing the challenges of applying AI in multilingual legal contexts.

  • Qwen’s improved performance has real business implications, offering potential for reduced compliance risks, faster contract review processes, and more comprehensive analysis of legal documents.

  • This study reinforces the importance of model evaluation in domain-specific AI applications, particularly for high-stakes industries like legal and finance, and calls for continued research and development to enhance AI’s legal reasoning capabilities.

Our View

This study highlights the importance of rigorous evaluation of AI models in critical applications like contract analysis. While both models show promise, Qwen's superior performance indicates its potential for more reliable and accurate contract review. This research contributes to the field by providing a detailed analysis of AI model performance in a complex, real-world scenario, and underscores the need for continued development and refinement of AI models for legal and business applications.
