首页    期刊浏览 2024年12月03日 星期二
登录注册

文章基本信息

  • 标题:Recursive Programs for Document Spanners
  • 本地全文:下载
  • 作者:Liat Peterfreund ; Balder ten Cate ; Ronald Fagin
  • 期刊名称:LIPIcs : Leibniz International Proceedings in Informatics
  • 电子版ISSN:1868-8969
  • 出版年度:2019
  • 卷号:127
  • 页码:1-18
  • DOI:10.4230/LIPIcs.ICDT.2019.13
  • 出版社:Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik
  • 摘要:A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
  • 关键词:Information Extraction; Document Spanners; Polynomial Time; Recursion; Regular Expressions; Datalog
国家哲学社会科学文献中心版权所有