Uses & citations#

Trafilatura is widely recognized for its effectiveness in article extraction, earning accolades as the most efficient open-source library in various benchmarks and academic evaluations. As such, it is used by numerous institutions, integrated into other software packages and cited in research publications, across fields such as linguistics, natural language processing, social science, information science, and large language models. This page compiles projects and publications that reference the library.

It is versatile and can be useful for a wide range of applications. It supports language modeling by providing high-quality text data, aids data mining with efficient web data retrieval, and streamlines information extraction from unstructured content. In SEO and business analytics it gathers online data for insights and in information security, it monitors websites for threat detection.

If you wish to add further references, please edit this page and suggest changes (i.e. fork the repository and submit a pull request).

Notable projects using this software#

Trafilatura has been employed in a variety of contexts and projects. Some of the known uses include academic research (e.g. data-driven studies in linguistics and social sciences), refinement of large language models (LLMs), and further inclusion in other open source packages.

Known institutional users#

Various software repositories#

  • Benson, to turn a list of URLs into mp3s of the contents of each web page

  • CommonCrawl downloader, to derive massive amounts of language data

  • DataTrove, to process, filter and deduplicate text data

  • Ethical ad server on ReadTheDocs (hosting these doc pages)

  • GLAM Workbench for cultural heritage (web archives section)

  • llama-hub, a library of data loaders for large language models

  • LlamaIndex, a data framework for LLM applications

  • Obsei, a text collection and analysis tool

  • Vulristics, a framework for analyzing publicly available information about vulnerabilities

  • Website-to-Chatbot, a personalized chatbot

For more see this list of software using Trafilatura.

Citations in papers#

Trafilatura as a whole#

To reference this software in a publication please cite the following paper:

Reference DOI: 10.18653/v1/2021.acl-demo.15 Zenodo archive DOI: 10.5281/zenodo.3460969
@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

Date extraction (htmldate)#

The date extraction component htmldate is referenced in the following publication:

JOSS article Zenodo archive
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

Publications citing Trafilatura#

  • Alakukku, L. (2022). “Domain specific boilerplate removal from web pages with entropy and clustering”, Master’s thesis, University of Aalto.

  • Alexandrescu, A., & Butincu, C.N. (2023). Decentralized news-retrieval architecture using blockchain technology. Mathematics, 11(21), 4542.

  • Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). “DistilBERT-based Argumentation Retrieval for Answering Comparative Questions”, Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.

  • Bender, M., Bubenhofer, N., Dreesen, P., Georgi, C., Rüdiger, J. O., & Vogel, F. (2022). Techniken und Praktiken der Verdatung. Diskurse–digital, 135-158.

  • Bevendorff, J., Gupta, S., Kiesel, J., & Stein, B. (2023). An Empirical Comparison of Web Content Extraction Algorithms.

  • Book, L. (2023). Evaluating and comparing different key phrase-based web scraping methods for training domain-specific fasttext models, Master’s thesis, KTH Royal Institute of Technology.

  • Bozarth, L., & Budak, C. (2021). “An Analysis of the Partnership between Retailers and Low-credibility News Publishers”, Journal of Quantitative Description: Digital Media, 1.

  • Brandon, C., Doherty, A. J., Kelly, D., Leddin, D., & Margaria, T. (2023). HIPPP: Health Information Portal for Patients and Public. Applied Sciences, 13(16), 9453.

  • Braun, D. (2021). “Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts”, PhD Thesis, Technische Universität München.

  • Chen, X., Zeynali, A., Camargo, C., Flöck, F., Gaffney, D., Grabowicz, P., … & Samory, M. (2022). SemEval-2022 Task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1094-1106).

  • Crummett, L. T., & Aslam, M. H. (2023). Diabetes websites lack information on dietary causes, risk factors, and preventions for type 2 diabetes. Frontiers in Public Health, 11, 1159024.

  • De Cesare, A. M. (2023). Assessing the quality of ChatGPT’s generated output in light of human-written texts: A corpus study based on textual parameters. CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 10, 179-210.

  • Di Giovanni, M., Tasca, T., & Brambilla, M. (2022). DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1229-1234).

  • Dumitru, V., Iorga, D., Ruseti, S., & Dascalu, M. (2023). Garbage in, garbage out: An analysis of HTML text extractors and their impact on NLP performance. In 2023 24th International Conference on Control Systems and Computer Science (CSCS) (pp. 403-410). IEEE.

  • Fröbe, M., Hagen, M., Bevendorff, J., Völske, M., Stein, B., Schröder, C., … & Potthast, M. (2021). “The Impact of Main Content Extraction on Near-Duplicate Detection”. arXiv preprint arXiv:2111.10864.

  • Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., … & Leahy, C. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, arXiv preprint arXiv:2101.00027.

  • Gopalakrishnan, S., Chen, V. Z., Dou, W., Hahn-Powell, G., Nedunuri, S., & Zadrozny, W. W. (2023). Text to Causal Knowledge Graph: A Framework to Synthesize Knowledge from Unstructured Business Texts into Causal Graphs. Information, 14(7), 367.

  • Harrando, I., & Troncy, R. (2021). “Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph”, In 3rd Conference on Language, Data and Knowledge (LDK 2021). OpenAccess Series in Informatics, Dagstuhl Publishing.

  • Hartmann, S. (2023). Open Corpus Linguistics–or How to overcome common problems in dealing with corpus data by adopting open research practices.

  • Hunter, S. B., Mathews, F., & Weeds, J. (2023). Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation. Ecological Informatics, 75, 102076.

  • Hunter, S. B., Oedin, M., Weeds, J., & Mathews, F. (2024). Exploring the potential for online data sources to enhance species threat mapping through the case study of global bat exploitation. Conservation Biology, e14242.

  • Indig, B., Sárközi-Lindner, Z., & Nagy, M. (2022). Use the Metadata, Luke!–An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (pp. 47-52).

  • Johannsen, B. (2023). Fußball und safety: Eine framesemantische Perspektive auf Diskurse über trans Sportler* innen. Queere Vielfalt im Fußball, 176.

  • Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access, 10, 129958-129969

  • Jung, G., Cha, J. (2023). New Visual Features for HTML Main Content Extraction. Journal of Digital Contents Society.

  • Karabulut, M., & Mayda, İ. (2020). “Development of Browser Extension for HTML Web Page Content Extraction”, In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.

  • Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. “First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems”, In CMCL (pp. 90-101).

  • Kliche, F., Heid, U., Knackstedt, R., & Klupp, T. (2023, September). An educational Gamebook on computational linguistic methods for the development of taxonomies. In Proceedings of the 1st Workshop on Teaching for NLP (pp. 37-43).

  • Küehn, P., Relke, D. N., & Reuter, C. (2023). Common Vulnerability Scoring System Prediction based on Open Source Intelligence Information Sources. Computers & Security, 103286.

  • Kuehn, P., Schmidt, M., & Reuter, C. (2023). ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain. arXiv preprint arXiv:2304.11960.

  • Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., … & Pyysalo, S. (2020). “From Web Crawl to Clean Register-Annotated Corpora”, Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).

  • Laippala, V., Salmela, A., Rönnqvist, S., Aji, A. F., Chang, L. H., Dhifallah, A., … & Pyysalo, S. (2022). Towards better structured and less noisy Web data: Oscar with Register annotations. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022) (pp. 215-221).

  • Luukkonen, R., Komulainen, V., Luoma, J., Eskelinen, A., Kanerva, J., Kupari, H. M., … & Pyysalo, S. (2023). FinGPT: Large Generative Models for a Small Language. arXiv preprint arXiv:2311.05640.

  • Madrid-Morales, D. (2021). “Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries”, Global Media and China, 6(2), 129-151.

  • Meier-Vieracker, S. (2022). “Fußballwortschatz digital–Korpuslinguistische Ressourcen für den Sprachunterricht.” Korpora Deutsch als Fremdsprache (KorDaF), 2022/01 (pre-print).

  • Meng, K. (2021). “An End-to-End Computational System for Monitoring and Verifying Factual Claims” (pre-print).

  • Miquelina, N., Quaresma, P., & Nogueira, V. B. (2022). Generating a European Portuguese BERT Based Model Using Content from Arquivo. pt Archive. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 280-288). Springer, Cham.

  • Naira, A. M., & Benelallam, I. (2023). Evaluating ESG Impacts in African Cities through Topic-Level Sentiment Analysis. In 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM) (pp. 1-6). IEEE.

  • Nguyen, Q.C., et al. (2024). Rosie, a Health Education Question-and-Answer Chatbot for New Mothers: Randomized Pilot Study. JMIR Formative Research, 8(1), e51361.

  • Nissopoulou, T. X. (2023). Web content classification analysis, MSc thesis, International Hellenic University.

  • Nolda, A., Barbaresi, A., & Geyken, A. (2023). Korpora für die lexikographische Beschreibung diatopischer Variation in der deutschen Standardsprache. Korpora in der germanistischen Sprachwissenschaft: Mündlich, schriftlich, multimedial, 29.

  • Öhman, J., Verlinden, S., Ekgren, A., Gyllensten, A. C., Isbister, T., Gogoulou, E., … & Sahlgren, M. (2023). The Nordic Pile: A 1.2 TB Nordic Dataset for Language Modeling. arXiv preprint arXiv:2303.17183.

  • Paster, K., Santos, M. D., Azerbayev, Z., & Ba, J. (2023). Openwebmath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786.

  • Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Alobeidli, H., Cappelli, A., … & Launay, J. (2024). The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems, 36.

  • Piskorski, J., Stefanovitch, N., Da San Martino, G., & Nakov, P. (2023). Semeval-2023 task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) (pp. 2343-2361).

  • Pohlmann, J., Barbaresi, A., & Leinen, P. (2023). Platform regulation and “overblocking”–The NetzDG discourse in Germany. Communications, 48(3), 395-419.

  • Rastislav, K. (2024). Backend platformy pro sdílené ověřování faktů (Master’s thesis, České vysoké učení technické v Praze. Vypočetní a informační centrum.)

  • Reina, L. A. V. (2023). NLP Meets Agronomy: Document Classification for Plant Health Surveillance, Master’s thesis.

  • Robertson, F., Lagus, J., & Kajava, K. (2021). “A COVID-19 news coverage mood map of Europe”, Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).

  • Salmela, A. (2022). “Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks”, Master’s thesis, University of Turku.

  • Sawczyn, A., Binkowski, J., Janiak, D., Augustyniak, Ł., & Kajdanowicz, T. (2021). “Fact-checking: relevance assessment of references in the Polish political domain”, Procedia Computer Science, 192, 1285-1293.

  • Schamel, T., Braun, D., & Matthes, F. (2022). Structured Extraction of Terms and Conditions from German and English Online Shops. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5) (pp. 181-190).

  • Sutter, T., Bozkir, A. S., Gehring, B., & Berlich, P. (2022). Avoiding the Hook: Influential Factors of Phishing Awareness Training on Click-Rates and a Data-Driven Approach to Predict Email Difficulty Perception. IEEE Access, 10, 100540-100565.

  • Ter-Akopyan, B. (2022). “Identification of Political Leaning in German News”, Master’s thesis, Ludwig Maximilian University of Munich.

  • Varlamov, M., Galanin, D., Bedrin, P., Duda, S., Lazarev, V., & Yatskov, A. (2022). A Dataset for Information Extraction from News Web Pages. In 2022 Ivannikov Ispras Open Conference (ISPRAS) (pp. 100-106). IEEE.

  • Waheed, A., Qunaibi, S., Barradas, D., & Weinberg, Z. (2022). Darwin’s Theory of Censorship: Analysing the Evolution of Censored Topics with Dynamic Topic Models. In Proceedings of the 21st Workshop on Privacy in the Electronic Society (pp. 103-108).

  • Xu, Z., Liu, Z., Yan, Y., Liu, Z., Xiong, C., & Yu, G. (2024). Cleaner Pretraining Corpus Curation with Neural Web Scraping. arXiv preprint arXiv:2402.14652.

  • Zinn, J. O., & Müller, M. (2021). “Understanding discourse and language of risk”, Journal of Risk Research, 1-14.

Publications citing Htmldate#

See citation page of htmldate’s documentation.

Ports#

Go port

go-trafilatura