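Contents
1. Legal judgment prediction
2. General corpora
3. Other integrated projects
4. Reasoning
5. NLU
6. NLG
  1 QA
  2 Text summarization
7. Information extraction
  1 Named entity recognition
  2 Sentence boundary detection (sentence splitting)
  3 Argument mining
  4 Event extraction
8. Intelligent contract review
9. Other tasks
10. Fairness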
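1. Legal judgment prediction
Chinese:
CAIL2018 (criminal law)
Original papers: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction; Overview of CAIL2018: Legal Judgment Prediction Competition
Data download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip (for a detailed description of the data, besides the papers above, see thunlp/CAIL: Chinese AI & Law Challenge)
Task: (classification) predict the relevant law articles, the charges, and the prison term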
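The three prediction targets can be read from one record as sketched below. This is a minimal sketch, not the official loader: it assumes each line of the downloaded files is a JSON object with a `fact` field and a `meta` field containing `relevant_articles`, `accusation`, and `term_of_imprisonment`, as described in the thunlp/CAIL repository. The sample record, the `parse_record` helper, and the reading of `imprisonment` as months are illustrative assumptions to check against the real files.

```python
import json

# Hypothetical sample line; the field layout is assumed from the
# thunlp/CAIL description and should be verified on the real data.
SAMPLE_LINE = json.dumps({
    "fact": "(hypothetical case description)",
    "meta": {
        "relevant_articles": [234],
        "accusation": ["故意伤害"],
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 12,
        },
    },
}, ensure_ascii=False)


def parse_record(line):
    """Turn one JSON line into the three classification targets."""
    record = json.loads(line)
    meta = record["meta"]
    term = meta["term_of_imprisonment"]
    if term["death_penalty"]:
        sentence = "death"
    elif term["life_imprisonment"]:
        sentence = "life"
    else:
        sentence = term["imprisonment"]  # assumed to be a number of months
    return {
        "fact": record["fact"],
        "articles": [str(a) for a in meta["relevant_articles"]],
        "charges": list(meta["accusation"]),
        "term": sentence,
    }


example = parse_record(SAMPLE_LINE)
print(example["articles"], example["charges"], example["term"])
```

Keeping the death penalty and life imprisonment as separate labels avoids forcing them onto the numeric scale used for fixed terms.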
2. General corpora
Multilingual:
MultiLegalPile
Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
Data download: https://huggingface.co/datasets/joelito/Multi_Legal_Pile
Data included in the project:
  https://huggingface.co/datasets/joelito/eurlex_resources
  https://huggingface.co/datasets/joelito/legal-mc4
  Pile of Law
LeXFiles
Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
Spanish:
Spanish Legal Domain Corpora
Original paper: (2021) Spanish Legalese Language Model and Corpora
Data download: Spanish Legal Domain Corpora | Zenodo
English:
CaseHOLD: English Harvard Law case corpus (1965-2021)
Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
Pile of Law
Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Data download: https://huggingface.co/datasets/pile-of-law/pile-of-law
(Multinational) LeXFiles and LegalLAMA
Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
LeXFiles is a collection of corpora, while LegalLAMA is a benchmark for evaluating models (modeled on LAMA). Both are hosted on the Hugging Face Hub and load through the datasets library:
from datasets import load_dataset
# LeXFiles corpus, EU legislation subset
dataset = load_dataset('lexlms/lex_files', name='eu-legislation')
from datasets import load_dataset
# LegalLAMA benchmark, contract sections task
dataset = load_dataset('lexlms/legal_lama', name='contract_sections')
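Several entries in this post give a Hugging Face dataset page URL rather than the identifier that `load_dataset` expects. The sketch below derives one from the other; `hub_dataset_id` is a hypothetical helper written for this post, not part of the datasets library, and it only handles URLs of the form `https://huggingface.co/datasets/...`.

```python
from urllib.parse import urlparse


def hub_dataset_id(url):
    """Extract 'namespace/name' (or a bare 'name') from a dataset page URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "huggingface.co" or len(parts) < 2 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    # Keep at most two path components after 'datasets'.
    return "/".join(parts[1:3])


# URLs taken from this post.
URLS = [
    "https://huggingface.co/datasets/joelito/Multi_Legal_Pile",
    "https://huggingface.co/datasets/pile-of-law/pile-of-law",
    "https://huggingface.co/datasets/rcds/MultiLegalSBD",
]

for url in URLS:
    print(hub_dataset_id(url))
```

The resulting string is what goes in the first argument of `load_dataset`. Many of these repositories also require a configuration name, as with the `name=` arguments above, so check each dataset card before loading.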
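Chinese:
Legal consultation data from Hualv (华律网, "China law network") and the corpus used in the accompanying paper
Accompanying paper: Design and research of legal consultation text classification system (法律咨询文本分类系统设计与研究)
Data record: The legal consultation data and corpus of the thesis from China law network. Replication Data for: Design and research of legal consultation text classification system. - Data Driven Innovation Research Competition for University of China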
Portuguese:
https://github.com/alfaneo-ai/brazilian-legal-text-dataset (Brazil)
3. Other integrated projects
Multilingual:
LexGLUE
GitHub: coastalcph/lex-glue: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Original paper: (2021) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
LEXTREME
Original paper: (2023) LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
Data download: https://huggingface.co/datasets/joelito/lextreme
Not yet sorted through:
https://github.com/neelguha/legal-ml-datasets
4. Reasoning
LegalBench
Original paper: (2022) LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
Data download: https://github.com/HazyResearch/legalbench
English:
SARA: roughly, deciding whether a given statute applies to a described situation (covering 9 Sections of US tax law)
Original paper: (2020) A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering
5. NLU
SemEval 2023 Task 6: LegalEval - Understanding Legal Texts
Tasks: Rhetorical Roles Labeling, named entity recognition, explainable legal judgment prediction
MAUD
Original paper: (2023) MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding
Data download: https://drive.google.com/drive/folders/1RujOK2FZKdFSCJ15tqdyd42g8WLsYagj
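6. NLG
1 QA
Chinese:
JEC-QA, a dataset built from China's legal professional qualification exam (法考): https://jecqa.thunlp.org/
Original paper: (2020 AAAI) JEC-QA: A Legal-Domain Question Answering Dataset
Vietnamese:
(Traffic law) (2017 KSE) Question analysis for Vietnamese legal question answering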
2 Text summarization
English:
BillSum
Original paper: (2019 WS) BillSum: A Corpus for Automatic Summarization of US Legislation
Data download: billsum · Datasets at Hugging Face
VerbCL (one-sentence summarization / highlight extraction based on the case citation graph)
Original paper: (2021 CIKM) VerbCL: A Dataset of Verbatim Quotes for Highlight Extraction in Case Law
Data download: https://uvaauas.figshare.com/articles/dataset/VerbCL_Dataset/14798878/1
Multilingual:
EUR-Lex-Sum (24 official European languages)
Original paper: (2022 EMNLP) EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain
Data download: dennlinger/eur-lex-sum · Datasets at Hugging Face
Multi-LexSum
Original paper: (2022) Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
Dataset website: https://multilexsum.github.io/
7. Information extraction
1 Named entity recognition
Portuguese (Brazil):
CDJUR-BR
Original paper: (2023) CDJUR-BR – A Golden Collection of Legal Document from Brazilian Justice with Fine-Grained Named Entities
2 Sentence boundary detection (sentence splitting)
Multilingual:
MultiLegalSBD (English, Spanish, German, Italian, Portuguese, French)
Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
Data download: https://huggingface.co/datasets/rcds/MultiLegalSBD
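To see why legal text needs its own sentence boundary data, here is a small sketch of a naive punctuation-based splitter. It is not the method from the MultiLegalSBD paper; the `naive_split` function and the example sentence are made up for illustration.

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up example containing a case citation with the abbreviation "v."
text = "The court cited Smith v. Jones. The appeal was dismissed."
sentences = naive_split(text)
for s in sentences:
    print(repr(s))
```

The text contains two sentences, but the splitter returns three pieces because it treats the period in "v." as a sentence end. Citations and abbreviations like this are common in legal documents.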
3 Argument mining
English:
mining-legal-arguments
Original paper: (2023) Mining Legal Arguments in Court Decisions
Download: trusthlt/mining-legal-arguments: Mining Legal Arguments in Court Decisions - Data and software
4 Event extraction
Chinese:
DLEE
Original paper: (2024 Neural Computing and Applications) DLEE: a dataset for Chinese document-level legal event extraction
8. Intelligent contract review
English:
CUAD
Original paper: (2021 NeurIPS) CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
GitHub: https://github.com/TheAtticusProject/cuad
Data download: https://huggingface.co/datasets/theatticusproject/cuad-qa
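The QA version of CUAD frames contract review as extractive question answering. The sketch below shows how answer spans could be recovered from one record, assuming a SQuAD-style layout (`context`, `question`, and `answers` with parallel `text` and `answer_start` lists). That layout, the sample record, and the `answer_spans` helper are assumptions for illustration; check the field names against the dataset card.

```python
# Hypothetical record in an assumed SQuAD-style layout.
CONTEXT = "This Agreement shall be governed by the laws of the State of New York."
ANSWER = "governed by the laws of the State of New York"

record = {
    "context": CONTEXT,
    "question": "Which clause covers the governing law?",
    "answers": {"text": [ANSWER], "answer_start": [CONTEXT.index(ANSWER)]},
}


def answer_spans(rec):
    """Return (start, end, text) for each answer, checking it against the context."""
    spans = []
    answers = rec["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        if rec["context"][start:end] != text:
            raise ValueError("answer offset does not match the context")
        spans.append((start, end, text))
    return spans


spans = answer_spans(record)
print(spans)
```

A record with empty answer lists yields no spans. In a SQuAD 2.0 style dataset that is how a clause type absent from the contract would be represented.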
9. Other tasks
Structured:
DiscoveringTheRationaleOfDecisions (for extracting the rationale behind decisions; I have not yet looked into what exactly it does)
Original paper: (2021 ICAIL) Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning
Data download: see the official GitHub project, CorSteging/DiscoveringTheRationaleOfDecisions: Discovering the Rationale of Decisions
GENTLE (English out-of-domain evaluation; includes legal documents)
Original paper: (2023 ACL) GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
Download: gucorpling/gentle: Repository for the GENTLE corpus
10. Fairness
Multilingual:
FairLex
Original paper: (2022 ACL) FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
Data download: coastalcph/fairlex · Datasets at Hugging Face