
An Algorithmic Framework for Precise Main Content Extraction from News Websites

Hamza Mohd Abdelkareem Salem
Innopolis University

This thesis presents the design, implementation, and evaluation of a novel, open-source algorithm for Main Content Extraction (MCE) from web pages. The algorithm operates on the Document Object Model (DOM) tree of an HTML document and uses a multi-criteria heuristic to identify the primary content node. It combines three structural signals: the node with the most direct children that contain text, the node holding the most text among nodes that have no text-bearing children, and the node closest to the middle depth of the DOM tree. Because the method relies on structural features rather than linguistic cues, it is language-agnostic, which makes it well suited to multilingual content and to languages with complex tokenization.
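The three structural signals can be illustrated with a minimal Python sketch built on the standard-library `html.parser`. The abstract does not say how the signals are weighted or combined, so the equal-weight sum in `score` is purely an assumption, and the names `Node`, `TreeBuilder`, `extract_main`, and `SAMPLE` are hypothetical. It is an illustration of the idea, not the thesis implementation.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # their contents are not visible text


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text = ""  # text placed directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#document")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        chunk = data.strip()
        if chunk and self.current.tag not in SKIP:
            self.current.text += (" " if self.current.text else "") + chunk


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def text_children(node):
    """Signal 1: number of direct children that contain text."""
    return sum(1 for child in node.children if child.text)


def leaf_text_len(node):
    """Signal 2: own text length, counted only if no child bears text."""
    return len(node.text) if text_children(node) == 0 else 0


def extract_main(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = [n for n in walk(builder.root) if n.parent is not None]
    mid = max(n.depth for n in nodes) / 2  # middle depth of the tree
    top_children = max(text_children(n) for n in nodes) or 1
    top_leaf = max(leaf_text_len(n) for n in nodes) or 1

    def score(n):
        # ASSUMPTION: equal-weight sum of the three normalized signals.
        closeness = 1 - abs(n.depth - mid) / mid  # signal 3
        return (text_children(n) / top_children
                + leaf_text_len(n) / top_leaf
                + closeness)

    return max(nodes, key=score)


SAMPLE = """<html><body>
<nav><a>Home</a><a>World</a></nav>
<div id="wrap"><article>
<h1>Title</h1>
<p>First paragraph of the story with some length.</p>
<p>Second paragraph of the story with more text.</p>
<p>Third paragraph closes the story.</p>
</article></div>
<footer><span>Copyright</span></footer>
</body></html>"""

print(extract_main(SAMPLE).tag)
```

Nothing in the sketch inspects words or characters beyond counting them, which is what keeps this kind of approach independent of the page language.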

The algorithm was evaluated against two established content extraction tools, Readability and Boilerpipe, using precision, recall, F1-score, and accuracy. The results show that the proposed MCE algorithm significantly outperforms both tools, achieving near-perfect scores. Beyond an accurate and efficient extraction tool, the work contributes a standardized benchmark dataset to support future research. The practical implications are substantial: the algorithm offers a cost-effective way to enhance Large Language Models (LLMs) and improves global information accessibility by accurately extracting content across diverse languages and web structures.
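For reference, the four reported metrics can be computed as in the short Python sketch below. The abstract does not state the granularity of the evaluation, so treating each text block of a page as one item labeled "content" or "boilerplate" is an assumption, and the function name `extraction_metrics` and the example sets are hypothetical.

```python
def extraction_metrics(predicted, gold, universe):
    """Precision, recall, F1 and accuracy for one page.

    ASSUMPTION: `universe` is the set of all text-block ids on the page,
    `gold` the ids annotated as main content, and `predicted` the ids the
    extractor returned; both are subsets of `universe`.
    """
    tp = len(predicted & gold)             # content correctly extracted
    fp = len(predicted - gold)             # boilerplate wrongly extracted
    fn = len(gold - predicted)             # content that was missed
    tn = len(universe - predicted - gold)  # boilerplate correctly skipped

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(universe) if universe else 0.0
    return precision, recall, f1, accuracy


# Made-up example: 10 blocks, 4 are main content, 3 of them are recovered.
blocks = set(range(10))
gold_blocks = {2, 3, 4, 5}
predicted_blocks = {3, 4, 5, 6}
print(extraction_metrics(predicted_blocks, gold_blocks, blocks))
```

Accuracy also rewards correctly ignored boilerplate, so on pages dominated by boilerplate it can stay high even when precision and recall drop, which is one reason to report all four metrics together.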

References:

  1. Salem, H., Salloum, H., Mazzara, M. (2024). Mathematical Model and Algorithm for Accurate Main Content Extraction from News Websites. IEEE Access, 12, 12345-12356.
  2. Salem, H., Salloum, H., Sabbagh, K., Mazzara, M. (2024). Enhancing News Articles: Automatic SEO Linked Data Injection for Semantic Web Integration. Applied Sciences, 15(3), 1262.
  3. Salem, H., Mazzara, M. (2020). Pattern Matching-Based Scraping of News Websites. Journal of Physics: Conference Series, 1694(1), 012011.
  4. Salem, H., Mazzara, M., Elnaffar, S. (2021). Automatically Injecting Semantic Annotations into Online Articles. In International Conference on Advanced Information Networking and Applications. Springer.
  5. Salem, H., Mazzara, M. (2023). Multi-Language Pattern Matching-Based Scraping of News and Articles Websites. In International Conference on Advanced Information Networking and Applications. Springer.
  6. Kohlschütter, C., Fankhauser, P., Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10).
  7. Mozilla. (2010). Readability.js [Open-source software].
  8. Gupta, S., Kaiser, G., Neistadt, D., Grimm, P. (2003). DOM-based Content Extraction of HTML Documents. In Proceedings of the 12th international conference on World Wide Web.
  9. Song, R., Liu, H., Wen, J. R., Ma, W. Y. (2004). Learning Block Importance Models for Web Pages. In Proceedings of the 13th international conference on World Wide Web.
  10. Finn, A., Kushmerick, N., Smyth, B. (2001). Fact or Fiction: Content Classification for Digital Libraries. In Proceedings of the Workshop on Document Analysis and Retrieval.

Talk slides

Talk video (YouTube)

Supported by Synthesis Group