sales@dataesperto.com +91 8860027901, 9810240168, +91 120 4180384

Contemporary Newspaper Digitization

Newspaper content is a valuable asset, and there is strong demand for article segmentation from both non-commercial readers and the media monitoring sector. Newspapers in searchable digital format are now accessible to many readers, and media monitoring companies require content at article level in a wide variety of formats.

Data Esperto has expertise and experience in working with contemporary newspapers and converting their content for use on a variety of digital platforms.

We have amassed a wealth of technical expertise in using state-of-the-art technology and in-house software to satisfy client demand.

Content from the daily press (broadsheets and tabloids), weekly press, weekend supplements, periodicals, and trade press is converted to digitized output in a variety of formats. For example, page-level PDF input files are converted to custom NITF XML files within 3 to 4 hours of receipt.

Our QA service ensures that critical classifications such as headline, sub-headline, byline (author), and body content are delivered with accuracy levels greater than 99.5%.
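As an illustration of what an accuracy figure of this kind can mean, the minimal sketch below compares delivered classifications with a hand-verified reference sample. How accuracy is actually measured is not stated here, so the per-field exact-match definition, the function name `field_accuracy`, and the sample data are all assumptions made for the example.

```python
# Assumption: accuracy = share of (article, field) pairs where the delivered
# value exactly matches a hand-verified reference. Field names are hypothetical.
FIELDS = ("headline", "sub_headline", "byline", "body")

def field_accuracy(delivered, reference, fields=FIELDS):
    """Return the fraction of field values that match the reference."""
    total = 0
    correct = 0
    for got, want in zip(delivered, reference):
        for field in fields:
            total += 1
            if got.get(field) == want.get(field):
                correct += 1
    return correct / total if total else 0.0

# Made-up sample: 250 articles x 4 fields = 1000 checks.
reference = [
    {"headline": f"H{i}", "sub_headline": f"S{i}", "byline": f"A{i}", "body": f"B{i}"}
    for i in range(250)
]
delivered = [dict(article) for article in reference]
for i in range(4):                      # introduce 4 byline errors
    delivered[i]["byline"] = "unknown"

accuracy = field_accuracy(delivered, reference)
meets_target = accuracy > 0.995         # the 99.5% threshold quoted above
print(f"accuracy={accuracy:.3%} meets_target={meets_target}")
```

With 4 mismatches in 1000 checks, the sample sits just above the 99.5% threshold; a fifth error would bring it down to exactly 99.5% and fail a strict "greater than" test.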

A typical conversion workflow is described below:

  • PDF files are downloaded from the publisher
  • Files are processed to extract all available text, image and structural data
  • Visual proofing tools are used to segment articles, classify elements, and correct anomalies identified in the extracted content
  • Article content is tagged and formatted in accordance with the output specification
  • Output is parsed and validated against the DTD
  • Output is checked for quality
  • Output is uploaded to the delivery destination
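
The tagging and validation steps above can be sketched roughly as follows. This is a minimal illustration, not the production tooling: the `article` record, the function names, and the list of required elements are hypothetical, and the element names (`hedline`, `hl1`, `hl2`, `byline`, `body.content`) come from the public NITF vocabulary rather than from any client-specific output specification. Python's standard library does not validate against a DTD, so a simple required-element check stands in for that step; real DTD validation would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical article record, as it might look after segmentation and proofing.
article = {
    "headline": "City Council Approves New Budget",
    "sub_headline": "Vote passes after lengthy debate",
    "byline": "By A. Reporter",
    "paragraphs": ["First paragraph of body text.", "Second paragraph of body text."],
}

def build_nitf(article):
    """Tag one segmented article as a minimal NITF-style element tree."""
    nitf = ET.Element("nitf")
    head = ET.SubElement(nitf, "head")
    ET.SubElement(head, "title").text = article["headline"]
    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    hedline = ET.SubElement(body_head, "hedline")
    ET.SubElement(hedline, "hl1").text = article["headline"]
    if article.get("sub_headline"):
        ET.SubElement(hedline, "hl2").text = article["sub_headline"]
    if article.get("byline"):
        ET.SubElement(body_head, "byline").text = article["byline"]
    content = ET.SubElement(body, "body.content")
    for text in article["paragraphs"]:
        ET.SubElement(content, "p").text = text
    return nitf

# Assumed minimum structure; a stand-in for the checks a DTD would enforce.
REQUIRED_PATHS = [
    "head/title",
    "body/body.head/hedline/hl1",
    "body/body.content/p",
]

def missing_elements(root):
    """Return the required paths that are absent or contain no text."""
    missing = []
    for path in REQUIRED_PATHS:
        node = root.find(path)
        if node is None or not (node.text or "").strip():
            missing.append(path)
    return missing

root = build_nitf(article)
xml_text = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(xml_text)   # raises ParseError if not well-formed
problems = missing_elements(reparsed)
print("OK" if not problems else f"Missing: {problems}")
```

Re-parsing the serialized output before checking it mirrors the "parsed and validated" step: a file that fails to parse never reaches the quality check or the upload.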