
Document Question Answering

Document Question Answering (also known as Document Visual Question Answering) is the task of answering questions about document images. Document question answering models take a (document, question) pair as input and return an answer in natural language. Models usually rely on multi-modal features, combining the text, the positions of words (bounding boxes), and the image.

Inputs
Question: What is the idea behind the consumer relations efficiency team?

Document Question Answering Model

Output
Answer: Balance cost efficiency with quality customer service

About Document Question Answering

Use Cases

Document Question Answering models can be used to answer natural language questions about documents. Typically, document QA models consider textual, layout, and potentially visual information, which is useful when the question requires some understanding of the visual aspects of the document. Nevertheless, certain document QA models can work without document images. The task is therefore not limited to visually rich documents: users can also ask questions about spreadsheets, text PDFs, and similar files.

Document Parsing

One of the most popular use cases of document question answering models is the parsing of structured documents. For example, you can extract the name, address, and other information from a form. You can also use the model to extract information from a table, or even a resume.

Invoice Information Extraction

Another very popular use case is invoice information extraction. For example, you can extract the invoice number, the invoice date, the total amount, the VAT number, and the invoice recipient.
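The extraction described above can be sketched as one question per field. This is a minimal sketch, not a definitive implementation: the field names, the question wording, and the helper `extract_fields` are hypothetical, and `ask` stands for any callable that accepts `image=` and `question=` and returns a list of answer dictionaries with an "answer" key (the shape shown in the pipeline output on this page).

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical mapping from field name to question. The fields come from the
# text above; the question wording is an assumption for illustration.
INVOICE_QUESTIONS: Dict[str, str] = {
    "invoice_number": "What is the invoice number?",
    "invoice_date": "What is the invoice date?",
    "total_amount": "What is the total amount?",
    "vat_number": "What is the VAT number?",
    "recipient": "Who is the invoice recipient?",
}


def extract_fields(
    ask: Callable[..., Any],
    image: Any,
    questions: Dict[str, str] = INVOICE_QUESTIONS,
) -> Dict[str, Optional[str]]:
    """Ask one question per field and keep the first answer returned for each."""
    fields: Dict[str, Optional[str]] = {}
    for name, question in questions.items():
        result = ask(image=image, question=question)
        # Assumed result shape: a list of dicts; a single dict is also accepted.
        if isinstance(result, dict):
            result = [result]
        # No answer returned for this question: record the field as missing.
        fields[name] = result[0]["answer"] if result else None
    return fields
```

With a document QA pipeline object named `pipe` and a loaded image, the call would look like `extract_fields(pipe, image)`. Passing the callable in as a parameter keeps the helper independent of any particular model checkpoint.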

Inference

You can run inference with Document QA models with the 🤗 Transformers library using the document-question-answering pipeline. If no model checkpoint is given, the pipeline will be initialized with impira/layoutlm-document-qa. This pipeline takes question(s) and document(s) as input, and returns the answer.
👉 Note that the question answering task solved here is extractive: the model extracts the answer from a context (the document).

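from transformers import pipeline
from PIL import Image

# Load a document QA pipeline with an explicit checkpoint (Donut fine-tuned on DocVQA).
pipe = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")

question = "What is the purchase amount?"
# Replace with the path to your own document image.
image = Image.open("your-document.png")

# The pipeline returns a list of answer dictionaries.
pipe(image=image, question=question)

# Example output:
# [{'answer': '20,000$'}]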

Useful Resources

Would you like to learn more about Document QA? Awesome! Here are some curated resources that you may find helpful!

Notebooks

Documentation

The contents of this page are contributed by Eliott Zemour and reviewed by Kwadwo Agyapon-Ntra and Ankur Goyal.

Models for Document Question Answering

Note: A LayoutLM model for the document QA task, fine-tuned on DocVQA and SQuAD2.0.

Note: A special model for the OCR-free Document QA task. Donut model fine-tuned on DocVQA.

Datasets for Document Question Answering

Note: Dataset from the 2020 DocVQA challenge. The documents are taken from the UCSF Industry Documents Library.

Spaces using Document Question Answering

Note: A robust document question answering application.

Note: An application that can answer questions from invoices.

Metrics for Document Question Answering
anls
The evaluation metric for the DocVQA challenge is the Average Normalized Levenshtein Similarity (ANLS). It compares the predicted answer with the ground-truth answer and is tolerant of character recognition errors.
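To make the metric concrete, here is a minimal sketch of an ANLS computation in plain Python. Several details are assumptions rather than something stated on this page: the 0.5 threshold above which a normalized distance counts as a miss, the lowercasing and whitespace stripping before comparison, and taking the best score when a question has several accepted answers. The function names `levenshtein` and `anls` are illustrative, not part of any library.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity per question."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, accepted_answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in accepted_answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Assumed rule: distances at or above the threshold score zero.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

For example, a prediction of "20,0O0$" (letter O in place of a zero) against the reference "20,000$" has edit distance 1 over 7 characters, so it scores 6/7 rather than 0. This is how the metric stays tolerant of small recognition errors.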
exact-match
Exact Match is a metric based on a strict character-by-character match between the predicted answer and the correct answer. For answers predicted correctly, Exact Match is 1. If even one character differs, Exact Match is 0.
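The rule above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the comparison is deliberately strict (no lowercasing or whitespace trimming), which follows the description here but may differ from how a given benchmark normalizes answers.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the two strings are identical character for character, else 0."""
    return int(prediction == reference)


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Mean Exact Match over a set of (prediction, reference) pairs."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(predictions)
```

Under this strict rule, "20,000$" matches "20,000$" but not "20,000", even though only one character is missing.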