Originally reported on 2010-04-27
The citation dictionary is cached inside each WSGI Invenio daemon
process for speed. On the demo site it looks like this:
{18: [96],
74: [92],
77: [85, 86],
78: [79, 91],
79: [91],
81: [82, 83, 87, 89],
84: [85, 88, 91],
91: [92],
94: [80],
95: [77, 86]}
For bigger sites holding about 1M records with denser citation maps,
this dictionary can grow quite large; e.g. the WSGI daemon processes
of the INSPIRE instance consume about 1 GB of RAM.
It would be good to decrease the memory footprint of this citation
dictionary, especially since we are running on a 64-bit OS, where we
may easily consume many more bytes to store each list element
(logically an `unsigned mediumint', i.e. 3 bytes) than necessary.
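To illustrate the overhead, on 64-bit CPython a list holds an 8-byte
pointer per slot, each pointing at a full int object on the heap, so one
recID costs an order of magnitude more than the 3 bytes a mediumint
would need. A quick sanity check (exact sizes vary with the CPython
version):

```python
import sys

# A single small int object on 64-bit CPython typically takes ~28 bytes.
recid = 96
print(sys.getsizeof(recid))

# One citation list from the demo dict above.  getsizeof() counts only
# the list header and its pointer slots, not the int objects themselves.
cites = [82, 83, 87, 89]
print(sys.getsizeof(cites))
```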
We should investigate potential drop-in replacements for the list
structure, for example numpy.array. We can measure the memory
footprint of various data structures via sys.getsizeof() or via
ps auxw process sizes, aiming to find a more memory-efficient, yet
still fast enough, data structure to represent the citation dict.
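As a quick sketch of such a measurement, here using the standard-library
array module as a stand-in for numpy.array (the names and sizes below
are illustrative, not taken from the Invenio code base):

```python
import sys
from array import array

# Simulate one big citation list of 100,000 recIDs.
recids = list(range(100000))

# A plain Python list stores an 8-byte pointer per element (on 64-bit),
# plus a full int object per recID elsewhere on the heap.
list_size = sys.getsizeof(recids)

# array.array('I') packs the same values as raw 4-byte unsigned ints.
packed = array('I', recids)
packed_size = sys.getsizeof(packed)

print("list:   %d bytes (pointers only)" % list_size)
print("packed: %d bytes (values included)" % packed_size)
```

A numpy.array with dtype numpy.uint32 would give a similar saving plus
fast vectorized operations; ps auxw would additionally reveal the heap
cost of the int objects that sys.getsizeof() does not count.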
If needed, we could even create a dedicated intbitset-like C extension
capable of storing recID vectors in a memory-efficient way. This is
arguably the best micro-optimization technique that we could go for,
albeit it would represent a bit more work than reusing numpy.array or
some other such pre-existing module.
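Until such an extension exists, a pure-Python sketch of the idea is to
keep the dict keys but store each recID vector as a packed unsigned-int
array instead of a list (an illustration only, not the actual Invenio
implementation; the helper name is made up):

```python
from array import array

def pack_citation_dict(cit_dict):
    """Convert {recID: [recID, ...]} lists into packed 4-byte vectors."""
    return dict((recid, array('I', cited))
                for recid, cited in cit_dict.items())

demo = {18: [96],
        81: [82, 83, 87, 89],
        95: [77, 86]}

packed = pack_citation_dict(demo)
# The packed vectors still behave like sequences for lookups:
print(list(packed[81]))   # -> [82, 83, 87, 89]
```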
Note that this task is a micro-optimization only: the overall citation
indexer and searcher machinery stays unchanged, and only its internal
data structures change. Tests will show how much such a
micro-optimization is worth. Rethinking the overall citation dictionary
handling and the inherent memory-sharing procedures would be a separate
task; see some older musings at
[https://twiki.cern.ch/twiki/bin/view/CDS/InvenioScalability].