License: GNU General Public License v3.0

qiqqa-revengin

reverse engineering the data stored by Qiqqa (bibtex database, etc.)

The original goal was to find ways to update and extend Qiqqa libraries using tools (scripts) external to Qiqqa itself, so as to provide features which I needed and Qiqqa lacked. Mostly these concern bulk operations: linking PDFs together in groups, updating BibTeX and other metadata for series of PDFs, performing specific analyses on the imported PDF collection and updating individual records accordingly, etc.

IMPORTANT UPDATE @ 2019/JULY/14

Qiqqa has now been published as (GPL3 licensed) open source on GitHub at https://github.com/jimmejardine/qiqqa-open-source as announced here.

This (private) reverse engineering work done in the years past is now public.

Do note that this repo is a bit, ah, "disorganized" in its root directory as this was a "when there's time, work on bloody Qiqqa as it crashed once again, darn it" labor of love. 😄

BASH shell scripts in this repo

Scripts are available to

  • DUMP the Qiqqa (BibTeX / metadata) database. This spits out the entire metadata content, not just what Qiqqa produces via the 'Export' option, and, most importantly for me at least, this still works when Qiqqa has already b0rked and is crashing repeatedly on the given library. [1]

That's probably the most useful part of this work, apart from another script, which

  • goes through the Qiqqa library and seeks out all encrypted PDFs and PDFs which are not properly readable

    The built-in Qiqqa OCR left a few things to be desired, so these PDFs are decrypted and then periodically fed through an external OCR batch process to produce equivalent PDFs which have the (properly) OCR'd text embedded for easier processing by Qiqqa. As both the original PDF and the decrypted/OCR'd PDF have the same filename (which is the SHA1 hash of the original as calculated by Qiqqa), these PDFs should be easy to relate to one another in the Qiqqa database. That bit of work has not been done yet, as it turned out to be easier to manage the database in other tools once Qiqqa had done the basic processing: my Qiqqa install and all re-installs failed to produce a proper search index for even smaller libraries. It seems Qiqqa was suffering from some sort of bit rot there, at least on my box. 😢

Database file format (as discovered via DB inspection)

Qiqqa uses a SQLite3 database with a very simple table structure: there's one table where each file's BibTeX and other info is dumped in a single row per file, using a JSON format which is verified against damage and tampering using hashes (a SHA1 fingerprint for the file and an MD5 checksum for the info blob, see below).

See also the Qiqqa database dumper script: /dump_qiqqa_sqlite_database.sh and the workhorse underlying it: /dump_qiqqa_sqlite_database.parse.js

Note that there are several files in the root dir of this repo with a SHA hash embedded in their name: those sample records have been extracted from a live Qiqqa DB and used to verify correct operation of the scripts.

Now that Qiqqa is open sourced, a few still open questions can be answered. 😄

Record format:

 SHA1_fingerprint|extension="metadata"|MD5_checksum|info|extra=NULL

where the info field WILL span multiple lines per record and is itself JSON formatted!
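A minimal sketch of reading such records with Python's standard sqlite3 module. Only the LibraryItem table and its 'data' column are named in this README; the other column names (fingerprint, extension, md5, extra) and the helpers dump_records and make_demo_db are assumptions for illustration, so check them against the real schema before pointing this at a live library.

```python
import hashlib
import sqlite3

def dump_records(conn):
    """Yield one 'fingerprint|extension|md5|info|extra' string per row."""
    # ASSUMPTION: every column name except 'data' is a guess.
    cur = conn.execute(
        "SELECT fingerprint, extension, md5, data, extra FROM LibraryItem"
    )
    for fingerprint, extension, md5, data, extra in cur:
        # The info blob is text; UTF-8 is assumed here.
        info = data.decode("utf-8") if isinstance(data, bytes) else data
        yield "|".join([
            fingerprint,
            extension,
            md5,
            info,
            "NULL" if extra is None else str(extra),
        ])

def make_demo_db():
    """Build a throwaway in-memory table with the assumed layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LibraryItem "
        "(fingerprint TEXT, extension TEXT, md5 TEXT, data BLOB, extra TEXT)"
    )
    blob = b'{\r\n  "FileType": "pdf"\r\n}'
    conn.execute(
        "INSERT INTO LibraryItem VALUES (?, ?, ?, ?, ?)",
        (
            "60835FB1D237D8F3ED73653CC9F935FDD7FA16B1",
            "metadata",
            hashlib.md5(blob).hexdigest().upper(),
            blob,
            None,
        ),
    )
    return conn

if __name__ == "__main__":
    for record in dump_records(make_demo_db()):
        print(record)
```

Because the info field spans several lines, each yielded record is a multi-line string; a consumer cannot simply split the dump on line breaks.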

The MD5_checksum field is the MD5 hash of the exact info blob contents. For example, the sample file 3dd7bdd1517ad2bd59c0f75aa290d9a3.blob (a binary copy of one info field, a.k.a. the 'data' column in the LibraryItem SQLite table) hashes to 3dd7bdd1517ad2bd59c0f75aa290d9a3. The MD5 hash is stored in uppercase.

(Note that the info blob is essentially a JSON formatted text, where line breaks are encoded as CRLF (not LF only!); this should also be observable in the example blob file 3dd7bdd1517ad2bd59c0f75aa290d9a3.blob when you inspect it with a hex/binary viewer.)
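The checksum rule can be sketched in a few lines of Python. The helper names blob_md5 and md5_matches are made up for this example; the point it illustrates is that the hash is taken over the raw bytes, so converting CRLF to LF changes the checksum.

```python
import hashlib

def blob_md5(blob: bytes) -> str:
    """MD5 of the exact blob bytes, in uppercase hex as stored in the DB."""
    return hashlib.md5(blob).hexdigest().upper()

def md5_matches(blob: bytes, stored_md5: str) -> bool:
    """True when the stored checksum matches the blob byte-for-byte."""
    return blob_md5(blob) == stored_md5

# The same JSON text with CRLF and with LF-only line breaks.
crlf_blob = b'{\r\n  "FileType": "pdf"\r\n}'
lf_blob = crlf_blob.replace(b"\r\n", b"\n")

if __name__ == "__main__":
    print(blob_md5(crlf_blob))
    print(blob_md5(lf_blob))  # differs from the line above
```

Reading the blob in text mode, or letting an editor normalize line endings, is therefore enough to invalidate the stored checksum.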

SHA1_fingerprint is the SHA1 hash of the related file contents (DownloadLocation field in the info blob JSON record). This fingerprint is echoed in the info JSON blob record in the Fingerprint field. At the time of this writing, we haven't yet tested what will happen to/with a record where these 'data columns' differ.
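Since the behavior for differing fingerprints is untested, it may be worth flagging such records before touching them. The sketch below only compares the two stored values; it does not recompute the fingerprint from the PDF. The function name fingerprint_mismatch and the UTF-8 decoding are assumptions.

```python
import json

def fingerprint_mismatch(column_fingerprint: str, info_blob: bytes):
    """Return None when both fingerprints agree, else the differing pair."""
    # ASSUMPTION: the info blob is UTF-8 encoded JSON.
    info = json.loads(info_blob.decode("utf-8"))
    embedded = info.get("Fingerprint")
    if embedded == column_fingerprint:
        return None
    return (column_fingerprint, embedded)

# Sample data taken from the info example in this README.
SAMPLE_FP = "60835FB1D237D8F3ED73653CC9F935FDD7FA16B1"
SAMPLE_BLOB = b'{\r\n  "Fingerprint": "' + SAMPLE_FP.encode("ascii") + b'"\r\n}'

if __name__ == "__main__":
    print(fingerprint_mismatch(SAMPLE_FP, SAMPLE_BLOB))
```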

Note that an incorrect MD5 hash causes QIQQA to DELETE the entire record upon restart of the application! In other words: one mistake in your encoding/hashing and your entire record will be nuked.

It also seems that QIQQA stores some sort of record count or other truncation number: after such an encoding/hashing error the database still exists (minus the nuked records), but QIQQA no longer shows its contents. This is under investigation at the time of this writing. Weird stuff... |:-S
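Given how unforgiving this is, any external edit has to recompute the checksum from the final bytes it writes. A hedged sketch follows: the helpers encode_info and update_tags are hypothetical, and the serialization details (2-space indent, UTF-8, non-ASCII kept as-is) are assumptions rather than a confirmed match for what Qiqqa writes. Whether Qiqqa accepts a blob re-serialized this way has not been tested here, so try it on a copy of the database only.

```python
import hashlib
import json

def encode_info(info: dict) -> bytes:
    """Serialize the info record with CRLF line breaks."""
    # ASSUMPTION: indent width and encoding are guesses.
    text = json.dumps(info, indent=2, ensure_ascii=False)
    # Only structural line breaks are real newlines here; newlines inside
    # string values are already escaped as backslash-n by json.dumps.
    return text.replace("\n", "\r\n").encode("utf-8")

def update_tags(blob: bytes, tags: str):
    """Return (new_blob, new_md5) with the Tags field replaced."""
    info = json.loads(blob.decode("utf-8"))
    info["Tags"] = tags
    new_blob = encode_info(info)
    # Hash the exact bytes that will be stored, in uppercase.
    new_md5 = hashlib.md5(new_blob).hexdigest().upper()
    return new_blob, new_md5

if __name__ == "__main__":
    old = b'{\r\n  "Tags": "help;manual"\r\n}'
    blob, md5 = update_tags(old, "help;manual;reviewed")
    print(md5)
    print(blob.decode("utf-8"))
```

The blob and the checksum are produced together so they cannot drift apart; both would have to be written to the row in a single transaction.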

The info field looks something like this:

{
  "FileType": "pdf",
  "Fingerprint": "60835FB1D237D8F3ED73653CC9F935FDD7FA16B1",
  "DateAddedToDatabase": "20170711004707645",
  "DateLastModified": "20170711004707645",
  "DownloadLocation": "C:\\Program Files (x86)\\Qiqqa\\The Qiqqa Manual - LOEX.pdf",
  "BibTex": "@article{qiqqatechmatters\n,\ttitle\t= {TechMatters: “Qiqqa” than you can say Reference Management: A Tool to Organize the Research Process}\n,\tauthor\t= {Krista Graham}\n,\tyear\t= {2014}\n,\tpublication\t= {LOEX Quarterly}\n,\tvolume\t= {40}\n,\tpages\t= {4-6}\n}",
  "Title": null,
  "Authors": null,
  "Year": null,
  "Tags": "help;manual",
  "Comments": null,
  "AutoSuggested_PDFMetadata": true,
  "TitleSuggested": "TechMatters: \"Qiqqa\" than you can say Reference Management: A Tool to Organize the Research Process",
  "AuthorsSuggested": "Krista Graham",
  "YearSuggested": "2013",
  "DateLastRead": null
}

Note the BibTex field in there, which is a JSON-encoded BIBTEX record as entered in QIQQA.

The BIBTEX record from the example above actually reads like this:

@article{qiqqatechmatters
,       title   = {TechMatters: “Qiqqa” than you can say Reference Management: A Tool to Organize the Research Process}
,       author  = {Krista Graham}
,       year    = {2014}
,       publication     = {LOEX Quarterly}
,       volume  = {40}
,       pages   = {4-6}
}

It has a nice format like this because that's what happens when you hand-edit a bibtex record in QIQQA itself.
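Getting from the info blob to the readable record is plain JSON decoding: the escaped \n and \t sequences in the BibTex field turn into real line breaks and tabs. A small sketch, with extract_bibtex as a made-up helper name and a shortened record based on the example above:

```python
import json

def extract_bibtex(blob: bytes):
    """Return the decoded BibTex field, or None when it is absent or null."""
    # ASSUMPTION: the info blob is UTF-8 encoded JSON.
    info = json.loads(blob.decode("utf-8"))
    return info.get("BibTex")

# Shortened version of the example record; \\n and \\t are the JSON escapes.
info_blob = (
    b'{\r\n  "BibTex": "@article{qiqqatechmatters\\n,\\tyear\\t= {2014}\\n}"\r\n}'
)

if __name__ == "__main__":
    print(extract_bibtex(info_blob))
```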

HOWEVER, an actual bibtex entry may be ANY TEXT, including stuff like this:

 @comment { BIBTEX_SKIP }

or even INVALID BIBTEX data! (Qiqqa versions before 0.79 crashed on some half-baked bibtex-alike entries, such as @delete() - see https://getsatisfaction.com/qiqqa/topics/qiqqa-crash-on-next-startup-after-manual-editing-of-one-or-more-bibtex-records)
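Any script that consumes the BibTex field therefore has to tolerate arbitrary text. The sketch below is not a BibTeX parser, just a rough triage; the category names and the two regular expressions are this example's own invention.

```python
import re

# "@comment" in any letter case, at the start of the text.
COMMENT_RE = re.compile(r"^\s*@\s*comment\b", re.IGNORECASE)
# "@type{key," : an entry type, an opening brace, a citation key, a comma.
ENTRY_RE = re.compile(r"^\s*@\s*[A-Za-z]+\s*\{\s*[^\s,{}]+\s*,")

def classify_bibtex(text):
    """Roughly sort a BibTex field value into one of four buckets."""
    if text is None or not text.strip():
        return "empty"
    if COMMENT_RE.match(text):
        return "comment"
    if ENTRY_RE.match(text):
        return "entry"
    # Everything else: free text, half-baked or invalid BibTeX.
    return "other"

if __name__ == "__main__":
    samples = [
        "@article{qiqqatechmatters\n,\tyear\t= {2014}\n}",
        " @comment { BIBTEX_SKIP }",
        "@delete()",
        None,
    ]
    for sample in samples:
        print(repr(sample), "->", classify_bibtex(sample))
```

Records landing in "other" are best passed through untouched rather than fed to a strict parser.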


Footnotes

  1. I've had many Qiqqa re-installs and re-imports of entire libraries over the years 😢 and the inability of Qiqqa to recover from a given library without losing at least part of the data is disheartening. (Of course, I could have run a backup/export every day, but that would mean another long-running task and severe storage requirements...)
