A preliminary study on similarity-preserving digital book identifiers

  • Klemo Vladimir
  • Marin Silic
  • Nenad Romic
  • Goran Delac
  • Sinisa Srbljic

    Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

    Abstract

    Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without an accepted standard) and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in the number of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.
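
The contrast drawn in the abstract, between a cryptographic hash that changes completely after a one-character edit and a compact fingerprint that stays close for near-duplicate texts, can be illustrated with a minimal sketch. The abstract does not name the fingerprinting scheme used in the study, so the code below substitutes MinHash over word shingles purely as an assumption; the function names (`shingles`, `minhash`, `similarity`) and the parameters `k` and `num_hashes` are hypothetical and chosen only for this illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text, num_hashes=64, k=3):
    """Return a fixed-size MinHash fingerprint: one minimum hash value per salt.

    Assumption: MinHash stands in for the study's fingerprint, which the
    abstract does not specify.
    """
    shingle_set = shingles(text, k)
    fingerprint = []
    for seed in range(num_hashes):
        # Each seed salts BLAKE2b differently, giving an independent hash function.
        salt = seed.to_bytes(8, "big")
        fingerprint.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return fingerprint


def similarity(fp_a, fp_b):
    """Fraction of matching positions; estimates Jaccard similarity of the shingle sets."""
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / len(fp_a)


if __name__ == "__main__":
    clean = ("it was the best of times it was the worst of times "
             "it was the age of wisdom it was the age of foolishness")
    # Same passage with one simulated OCR error ("wisdom" -> "wisdorn").
    noisy = clean.replace("wisdom", "wisdorn")

    # The cryptographic digests differ although the texts are nearly identical.
    print(hashlib.sha256(clean.encode()).hexdigest()[:16])
    print(hashlib.sha256(noisy.encode()).hexdigest()[:16])

    # The fingerprints still agree in most positions.
    print(similarity(minhash(clean), minhash(noisy)))
```

Each fingerprint here is a list of 64 integers regardless of book length, which is what makes it cheap to share and compare; a real system would tune the shingle size and number of hash functions against the kind of precision and recall measurements the abstract mentions.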
    Original language: English
    Title of host publication: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015
    Editors: Kalliopi A. Zervanou, Marieke van Erp, Beatrice Alex
    Number of pages: 6
    Place of Publication: Beijing
    Publisher: Association for Computational Linguistics (ACL)
    Publication date: 2015
    Pages: 78-83
    ISBN (Electronic): 978-1-941643-63-1
    Publication status: Published - 2015
    Event: 9th Socio-Economic Sciences and Humanities Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - SIGHUM 2015 - Peking, China
    Duration: 26.07.2015 - 30.07.2015
    Conference number: 9
    https://aclanthology.info/volumes/proceedings-of-the-9th-sighum-workshop-on-language-technology-for-cultural-heritage-social-sciences-and-humanities-latech
    https://sighum.wordpress.com/events/latech-2015/

    Bibliographical note

    Funding Information:
    This work was supported in part by the Croatian Science Foundation through the Recommender System for Service-oriented Architecture research project and in part by Leuphana Universität Lüneburg, DCRL Digital Cultures Research Lab. The authors would like to thank Robert M. Ochshorn and Goran Glavaš for their invaluable comments and suggestions and Project Gutenberg for their book collection.

    Publisher Copyright:
    © 2015 Proceedings of the Annual Meeting of the Association for Computational Linguistics.

    Research areas and keywords

    • Digital media
