Tutorial: Validation of TEI files

Trafilatura can produce and validate XML documents according to the guidelines of the Text Encoding Initiative (XML-TEI).

Producing TEI files

In Python:

# load the necessary components
from trafilatura import fetch_url, extract

# download a file
downloaded = fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')

# extract information as XML TEI and validate the result
result = extract(downloaded, output_format='xmltei', tei_validation=True)
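Both fetch_url and extract can return None, for instance when the download fails or no content could be extracted, so it may be worth guarding against that before writing the result to disk. The following is a minimal sketch; the helper name save_tei and the output path are illustrative assumptions, not part of Trafilatura:

```python
from pathlib import Path


def save_tei(result, path):
    """Write an extraction result to disk.

    `save_tei` is a hypothetical helper, not a Trafilatura function.
    Returns False if there was nothing to write, True otherwise.
    """
    if result is None:
        # nothing was downloaded or extracted
        return False
    Path(path).write_text(result, encoding="utf-8")
    return True
```

With the variables from the example above, this could be called as save_tei(result, 'document-name.xml').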

From the command line:

trafilatura --xmltei --validate --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"

Validating existing files

The following code returns True if the document is valid; otherwise, it outputs a message describing the first error that prevents validation:

# load the necessary components
from lxml import etree
from trafilatura.xml import validate_tei

# open a file and parse it
mytree = etree.parse('document-name.xml')

# validate it
validate_tei(mytree)
# returns True if the document is valid; otherwise the first validation error is reported
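To illustrate the general mechanism behind this kind of check, here is a minimal sketch of RELAX NG validation with lxml. It is not Trafilatura's implementation: the schema is a deliberately tiny toy (not the TEI schema), and the function name check is an assumption made for this example.

```python
from lxml import etree

# Toy RELAX NG schema (hypothetical, NOT the TEI schema):
# a <TEI> root element containing exactly one <text> element
TOY_SCHEMA = b"""
<element name="TEI" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="text"><text/></element>
</element>
"""


def check(xml_bytes, schema_bytes=TOY_SCHEMA):
    """Return (is_valid, message); message is None for a valid document."""
    validator = etree.RelaxNG(etree.fromstring(schema_bytes))
    doc = etree.fromstring(xml_bytes)
    if validator.validate(doc):
        return True, None
    # report only the last entry of the validator's error log
    return False, str(validator.error_log.last_error)
```

Calling check(b"<TEI><text>hello</text></TEI>") should report a valid document, while check(b"<TEI><body/></TEI>") should return False together with an error message.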

For more information, please refer to the blog post "Validating TEI-XML documents with Python".