Core functions

Extraction

trafilatura.core.extract(filecontent, url=None, record_id='0001', no_fallback=False, include_comments=True, output_format='txt', csv_output=False, json_output=False, xml_output=False, tei_output=False, tei_validation=False, target_language=None, include_tables=True, include_formatting=False, date_extraction_params=None, with_metadata=False, url_blacklist={})

Main process for text extraction
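As a rough illustration of calling extract, the sketch below passes an HTML string and uses only keyword arguments listed in the signature above. The sample markup and the wrapper name extract_main_text are made up for this example, and the import is guarded so the snippet still runs (returning None) where the package is not installed.

```python
# Hypothetical sample document; any HTML string could be used instead.
SAMPLE_HTML = """<html><head><title>Example page</title></head>
<body><article>
<h1>Example page</h1>
<p>This paragraph stands in for the main content of a web page and is
long enough to look like running text rather than navigation.</p>
</article></body></html>"""


def extract_main_text(html):
    """Return the extracted text, or None if nothing could be extracted."""
    try:
        from trafilatura.core import extract
    except ImportError:
        # trafilatura is not installed; nothing to demonstrate.
        return None
    # Skip comment sections but keep tables, using keyword arguments
    # that appear in the signature above.
    return extract(html, include_comments=False, include_tables=True)


text = extract_main_text(SAMPLE_HTML)
print(text if text is not None else "no text extracted")
```

The wrapper treats None as a valid outcome, since very short or unusual documents may not yield any text.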

trafilatura.core.baseline(filecontent)

Use baseline extraction function targeting JSON metadata and/or text paragraphs

trafilatura.metadata.extract_metadata(filecontent, default_url=None, date_config=None)

Main process for metadata extraction
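A minimal sketch of reading a single field from the metadata follows. The return type is an assumption here: depending on the installed version the result may be a dictionary or an object with attributes, so the hypothetical helper get_title handles both, as well as a missing result. The sample markup is invented for the example.

```python
# Hypothetical sample document with a title in the HTML head.
SAMPLE_HTML = """<html><head><title>Example page</title></head>
<body><article><p>Some body text for the example document.</p></article>
</body></html>"""


def get_title(html):
    """Return the page title found in the metadata, or None."""
    try:
        from trafilatura.metadata import extract_metadata
    except ImportError:
        # trafilatura is not installed; nothing to demonstrate.
        return None
    meta = extract_metadata(html)
    if meta is None:
        return None
    # Assumption: older releases return a dict, newer ones an object.
    if isinstance(meta, dict):
        return meta.get("title")
    return getattr(meta, "title", None)


title = get_title(SAMPLE_HTML)
print(title)
```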

XML processing

trafilatura.xml.xmltotxt(xmloutput)

Convert to plain text format

trafilatura.xml.validate_tei(tei)

Check whether an XML document conforms to the guidelines of the Text Encoding Initiative

Helpers

trafilatura.utils.load_html(htmlobject)

Load object given as input and validate its type (accepted: LXML tree, bytestring and string)
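The following sketch shows one way load_html might be used with a string and with a bytestring, two of the accepted input types named above. It assumes the function returns an LXML element on success and None otherwise; the helper name root_tag is hypothetical.

```python
def root_tag(htmlobject):
    """Parse the input and return the tag name of the root element, or None."""
    try:
        from trafilatura.utils import load_html
    except ImportError:
        # trafilatura is not installed; nothing to demonstrate.
        return None
    tree = load_html(htmlobject)
    # Assumption: a failed parse is signalled by None.
    if tree is None:
        return None
    return tree.tag


markup = "<html><body><p>Hello, world. This is a short test page.</p></body></html>"
from_string = root_tag(markup)
from_bytes = root_tag(markup.encode("utf-8"))
print(from_string, from_bytes)
```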

trafilatura.utils.sanitize()

Convert text and discard incompatible and invalid characters
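The signature above lists no parameters, so the sketch below rests on an assumption: that sanitize takes the text to clean as its single positional argument and returns the cleaned string. The wrapper name clean_text and the sample input are invented for illustration.

```python
def clean_text(raw):
    """Return a cleaned copy of raw, or None if cleaning is unavailable."""
    try:
        from trafilatura.utils import sanitize
    except ImportError:
        # trafilatura is not installed; nothing to demonstrate.
        return None
    # Assumption: the text is passed as the only positional argument.
    return sanitize(raw)


cleaned = clean_text("  Some\u00a0text with   stray whitespace \n")
print(repr(cleaned))
```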