pmem / pmdk
Persistent Memory Development Kit
Home Page: https://pmem.io
License: Other
Hi!
I used the "manpage.c" example of librpmem. The client-side application executed successfully, but the memory pool it created has an invalid signature.
The client program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <librpmem.h>

#define POOL_SIZE (32 * 1024 * 1024)
#define NLANES 4

//static unsigned char pool[POOL_SIZE];

int
main(int argc, char *argv[])
{
	int ret;
	unsigned nlanes = NLANES;
	void *pool;
	size_t align = (size_t)sysconf(_SC_PAGESIZE);

	errno = posix_memalign(&pool, align, POOL_SIZE);
	if (errno) {
		perror("posix_memalign");
		return -1;
	}

	/* fill pool_attributes */
	struct rpmem_pool_attr pool_attr;
	memset(&pool_attr, 0, sizeof(pool_attr));

	/* create a remote pool */
	RPMEMpool *rpp = rpmem_create("[email protected]", "pool.set",
			pool, POOL_SIZE, &nlanes, &pool_attr);
	if (!rpp) {
		fprintf(stderr, "rpmem_create: %s\n", rpmem_errormsg());
		return 1;
	}

	/* store data on local pool */
	memset(pool, 0, POOL_SIZE);

	/* make local data persistent on remote node */
	ret = rpmem_persist(rpp, 0, POOL_SIZE, 0);
	if (ret) {
		fprintf(stderr, "rpmem_persist: %s\n", rpmem_errormsg());
		return 1;
	}

	/* close the remote pool */
	ret = rpmem_close(rpp);
	if (ret) {
		fprintf(stderr, "rpmem_close: %s\n", rpmem_errormsg());
		return 1;
	}

	return 0;
}
The debug messages:
$ ./manpage
<librpmem>: <1> [out.c:283 out_init] pid 20017: program: /root/nvml/src/examples/librpmem/manpage
<librpmem>: <1> [out.c:285 out_init] librpmem version 1.1
<librpmem>: <1> [out.c:289 out_init] src version:
<librpmem>: <3> [librpmem.c:63 librpmem_init]
<librpmem>: <3> [librpmem.c:68 librpmem_init] Libfabric is fork safe
<librpmem>: <3> [rpmem.c:454 rpmem_create] target [email protected], pool_set_name pool.set, pool_addr 0x7f4c5adaa000, pool_size 33554432, nlanes 0x7ffc0e6e3948, create_attr 0x7ffc0e6e38d0
<librpmem>: <3> [rpmem.c:361 rpmem_log_args] req create, target [email protected], pool_set_name pool.set, pool_addr 0x7f4c5adaa000, pool_size 33554432, nlanes 4
<librpmem>: <3> [rpmem.c:363 rpmem_log_args] create request:
<librpmem>: <3> [rpmem.c:364 rpmem_log_args] target: [email protected]
<librpmem>: <3> [rpmem.c:365 rpmem_log_args] pool set: pool.set
<librpmem>: <4> [rpmem.c:366 rpmem_log_args] pool addr: 0x7f4c5adaa000
<librpmem>: <4> [rpmem.c:367 rpmem_log_args] pool size: 33554432
<librpmem>: <3> [rpmem.c:368 rpmem_log_args] nlanes: 4
<librpmem>: <3> [rpmem.c:394 rpmem_check_args] pool_addr 0x7f4c5adaa000, pool_size 33554432, nlanes 0x7ffc0e6e3948
<librpmem>: <3> [rpmem.c:196 rpmem_common_init] target [email protected]
<librpmem>: <3> [rpmem.c:133 rpmem_get_provider] node 10.1.0.48
<librpmem>: <3> [rpmem.c:104 env_get_bool] name RPMEM_ENABLE_SOCKETS, valp 0x7ffc0e6e37ac
<librpmem>: <3> [rpmem.c:104 env_get_bool] name RPMEM_ENABLE_VERBS, valp 0x7ffc0e6e37a8
<librpmem>: <3> [rpmem.c:219 rpmem_common_init] provider: verbs
<librpmem>: <4> [rpmem.c:233 rpmem_common_init] establishing out-of-band connection
<librpmem>: <4> [rpmem_cmd.c:147 rpmem_cmd_log] executing command 'ssh -T -oBatchMode=yes [email protected] rpmemd'
<librpmem>: <4> [rpmem_ssh.c:319 rpmem_ssh_open] received status: 0
<librpmem>: <3> [rpmem.c:241 rpmem_common_init] out-of-band connection established
<librpmem>: <4> [rpmem_obc.c:494 rpmem_obc_create] sending create request message
<librpmem>: <3> [rpmem_obc.c:502 rpmem_obc_create] create request message sent
<librpmem>: <4> [rpmem_obc.c:503 rpmem_obc_create] receiving create request response
<librpmem>: <3> [rpmem_obc.c:512 rpmem_obc_create] create request response received
<librpmem>: <3> [rpmem.c:377 rpmem_log_resp] req create, resp 0x7ffc0e6e3860
<librpmem>: <3> [rpmem.c:379 rpmem_log_resp] create request response:
<librpmem>: <3> [rpmem.c:380 rpmem_log_resp] nlanes: 4
<librpmem>: <3> [rpmem.c:381 rpmem_log_resp] port: 53899
<librpmem>: <3> [rpmem.c:383 rpmem_log_resp] persist method: General Purpose Server Persistency Method
<librpmem>: <3> [rpmem.c:384 rpmem_log_resp] remote addr: 0x7fa440001000
<librpmem>: <3> [rpmem.c:288 rpmem_common_fip_init] rpp 0xc3f6d0, req 0x7ffc0e6e3880, resp 0x7ffc0e6e3860, pool_addr 0x7f4c5adaa000, pool_size 33554432, nlanes 0x7ffc0e6e3948
....
<librpmem>: <3> [rpmem.c:318 rpmem_common_fip_init] final nlanes: 4
<librpmem>: <4> [rpmem.c:319 rpmem_common_fip_init] establishing in-band connection
<librpmem>: <3> [rpmem.c:327 rpmem_common_fip_init] in-band connection established
<librpmem>: <3> [rpmem.c:177 rpmem_monitor_thread] arg 0xc3f6d0
<librpmem>: <3> [rpmem.c:616 rpmem_persist] rpp 0xc3f6d0, offset 0, length 33554432, lane 0
<librpmem>: <3> [rpmem.c:584 rpmem_close] rpp 0xc3f6d0
<librpmem>: <4> [rpmem.c:586 rpmem_close] closing out-of-band connection
<librpmem>: <4> [rpmem_obc.c:677 rpmem_obc_close] sending close request message
<librpmem>: <3> [rpmem_obc.c:685 rpmem_obc_close] close request message sent
<librpmem>: <4> [rpmem_obc.c:686 rpmem_obc_close] receiving close request response
<librpmem>: <3> [rpmem_obc.c:695 rpmem_obc_close] close request response received
<librpmem>: <3> [rpmem.c:596 rpmem_close] out-of-band connection closed
<librpmem>: <3> [rpmem.c:343 rpmem_common_fip_fini] rpp 0xc3f6d0
<librpmem>: <4> [rpmem.c:345 rpmem_common_fip_fini] closing in-band connection
<librpmem>: <3> [rpmem.c:349 rpmem_common_fip_fini] in-band connection closed
<librpmem>: <3> [rpmem.c:261 rpmem_common_fini] rpp 0xc3f6d0, join 1
<librpmem>: <3> [librpmem.c:80 librpmem_fini]
On the server side:
$ cat pool.set
PMEMPOOLSET
2G /mnt/pmem0/pool.obj
Checking the memory pool created:
$pmempool check /mnt/pmem0/pool.obj
invalid signature
pool.obj: not consistent
I would like to know whether this check result is expected for this example, or whether there was an error in how I used it?
Thank you,
There's no way to programmatically get information about a pmemobj pool set. I'm mostly interested in the pool size, but information about the number of replicas, their types (local/remote), and the number/paths of parts could also be useful.
There is a need for a function that can check all kinds of pools on Device DAX.
The zone metadata mostly consists of a fixed-size array of chunk headers. The contained data forms a linked list of variably-sized chunks. To verify the consistency of the data, it is enough to check that this list is correct and spans the entire zone. What is impossible, however, is correcting any errors found. To solve this problem, the proposal is to introduce a new chunk type that would store the chunk headers array, so that all operations on this array would be performed in N+1 places (depending on how many metadata backups the user wants).
The new chunk would be allocated/deallocated on demand, at the user's request. By default there would be no backups. The location of the chunk would be stored in all of the chunk header arrays, but because that is exactly the structure this chunk is intended to protect, it would also contain a header with a magic field and a checksum that allow locating it just by traversing all possible chunks.
Modifications to the chunk headers array would be applied first to the backups.
All of the backup chunks will be identified at startup and compared against the master copy. If any differences are found, a correct header array is identified and all of the array copies are updated to match it.
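The verification step described above amounts to a walk over the chunk headers. A minimal sketch, assuming a simplified hypothetical header that carries only a size_idx field (the real on-media layout has more fields): consistency means consecutive entries sum exactly to the zone's chunk count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* simplified stand-in for the on-media chunk header (hypothetical layout) */
struct chunk_hdr {
	uint32_t size_idx; /* number of chunks this entry spans */
};

/*
 * Returns 1 if the chunk list is well-formed: every list entry has a
 * non-zero size and the entries together span the entire zone; 0 otherwise.
 */
static int
zone_list_consistent(const struct chunk_hdr *hdrs, size_t nchunks)
{
	size_t i = 0;
	while (i < nchunks) {
		if (hdrs[i].size_idx == 0 || i + hdrs[i].size_idx > nchunks)
			return 0;
		i += hdrs[i].size_idx; /* jump to the next list entry */
	}
	return i == nchunks;
}
```

As the text notes, a check like this can detect a broken list but cannot say which header is wrong, which is what motivates the backup copies.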
New CTL API:
"heap.zone.metadata.ncopies"
Defines how many chunk header copies are required. Allocates or deallocates the backup chunks to match this number.
TBD
The pvector is the data structure used for undo logs. Its consistency can be verified by traversing its entire contents: there should be only one zero entry, at the end of the vector.
TBD
(should save the number of entries?)
(allocate 2x the vector slots and duplicate data? - would require API change and wouldn't be backward compatible)
(grab one extra lane?)
TBD
Most database software is designed from the ground up to work as a daemon with clients connecting to it to perform operations. This means that multiprocessing is effectively free. Our library is designed to be embedded into software, and it heavily uses memory mappings to provide no-copy (compared to traditional databases) interfaces to the user. That design choice makes it much more difficult to manage the state of the pool across multiple processes, because the library lacks a single controlling entity (the daemon) that could serialize access.
While keeping the above in mind, I believe it's a topic worth pursuing because it enables libpmemobj to be used in conjunction with MPI and many different areas of HPC.
There are several big ticket items that need to be addressed for multiprocess support, here are some I could think of:
PMEMobjpool is allocated from the transient heap and filled with relevant data. It might appear that this is a trivial change, but in reality we rely on PMEMobjpool pointing to the beginning of the memory pool in so many places that it might be a real difficulty. TBD
For certain pmempool transform operations to be feasible, and to facilitate other libpmempool sync/transform/check functionality, the proposal suggests adding the following fields in pool_hdr's unused space:
- unsigned char unused[3944]; /* must be zero */
+ /* fields utilizing space that was unused prior to version 1.3 */
+ uint64_t offset; /* offset of data part in the lot, as of 1.3 */
+ uint64_t size; /* size of data mapping in the lot, as of 1.3 */
+ uint64_t poolsize; /* size of a pool in the lot, as of 1.3 */
+ uint64_t alignment; /* data alignment, as of 1.3 */
+ unsigned char unused[3912]; /* must be zero */
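The arithmetic in the diff can be sanity-checked: the four new uint64_t fields consume 32 bytes, so the unused area shrinks from 3944 to 3912 bytes and the on-media header size is unchanged. A minimal sketch with illustrative struct names (only the tail of pool_hdr is modeled here):

```c
#include <stdint.h>

/* illustrative fragments of pool_hdr's tail, before and after the change */
struct hdr_tail_v12 {
	unsigned char unused[3944]; /* must be zero */
};

struct hdr_tail_v13 {
	uint64_t offset;            /* offset of data part, as of 1.3 */
	uint64_t size;              /* size of data mapping, as of 1.3 */
	uint64_t poolsize;          /* size of a pool, as of 1.3 */
	uint64_t alignment;         /* data alignment, as of 1.3 */
	unsigned char unused[3912]; /* must be zero */
};

/* the new fields must not change the on-media header size */
_Static_assert(sizeof(struct hdr_tail_v12) == sizeof(struct hdr_tail_v13),
		"pool_hdr layout size changed");
```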
Related issues:
pmem/issues#475
pmem/issues#476
In some cases it could be more convenient to specify the maximum time of benchmark execution instead of a fixed number of operations.
Let's add an option for pmembench to specify the maximum execution time of a single run (e.g. "-t 10s").
PMEMOBJ_CONF/pmemobj_ctl_[set|get] provides a nice, consistent interface for influencing pmemobj's behavior at runtime.
Please port it to the other libraries and use it instead of the multitude of environment variables.
See also: Glibc Tunables
Use the new memory protection keys feature of the CPU to prevent bugs from scribbling on large pmem pools. See: https://lwn.net/Articles/643797/
https://github.com/pmem/nvml/blob/master/src/common/util.c#L454
This if statement can only be true if (addr % 4) equals (csump % 4); there should be an assert for that condition.
Also, the function header says that the function assumes little-endian:
https://github.com/pmem/nvml/blob/master/src/common/util.c#L437
But later it uses the le32toh and htole64 functions, which convert between little-endian and the host's byte order. Doesn't this actually make the function support both endiannesses?
Interleaving can potentially improve performance.
If the user doesn't want to use hardware interleaving (because they want to interleave only some of the pools stored on pmem), or the hardware doesn't support it, pmemobj could implement it in software.
Example pool set file:
PMEMPOOLSET
INTERLEAVING 2MB
1G /mnt/pmem1/part1
1G /mnt/pmem2/part2
It could be mapped as:
[0, 2MB] of part1
[0, 2MB] of part2
[2MB, 4MB] of part1
[2MB, 4MB] of part2
[4MB, 6MB] of part1
[4MB, 6MB] of part2
...
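The mapping above is plain modular arithmetic. A minimal sketch, assuming equally-sized parts and a fixed stripe (interleaving) size, translating a byte offset in the interleaved pool into a part index and an offset within that part:

```c
#include <stddef.h>

/* Map a byte offset in the interleaved pool to (part, offset in part). */
static void
interleave_map(size_t pool_off, size_t stripe, size_t nparts,
		size_t *part, size_t *part_off)
{
	size_t stripe_idx = pool_off / stripe;      /* which stripe globally */
	*part = stripe_idx % nparts;                /* round-robin over parts */
	*part_off = (stripe_idx / nparts) * stripe  /* full stripes in part */
			+ pool_off % stripe;        /* offset inside stripe */
}
```

With stripe = 2MB and nparts = 2, pool offset 4MB lands in part1 at part offset 2MB, matching the third region in the example layout.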
https://github.com/pmem/pmdk/blob/dd14c39/src/common/pmemcommon.inc
assumes the build system is the deployment system, using uname to define the target OS.
Proposition: add an optional HOST parameter that allows passing the deployment system.
Issue reported by @raphaelcohn: https://groups.google.com/forum/#!msg/pmem/dSR5KOrli8I/rYWWrRVGEQAJ
It would be nice if printing info about multiple files were possible.
It would be especially useful when comparing information about multiple parts of the same pool.
E.g., as of now we have to call pmempool info X times to print info about X files:
$pmempool info pool.part1
$pmempool info pool.part2
The following call results in printing info about the first pool file only:
$pmempool info pool.part1 pool.part2
Consider concatenating the files before printing to stdout (as Unix's cat(1) does).
This is a super feature request for all the open issues imported from Trello. Not prioritized.
fchmod
Add support for resolving the $HOME environment variable in pool set files.
$cat pool.set
PMEMPOOLSET
20M $HOME/pool
$pmempool create obj pool.set
error: 'pool.set' -- pool.set [incorrect path (must be an absolute one):2]
error: creating pool file failed
Currently we build two sets of libraries from a single source tree: stripped libraries intended for production, and libraries with all debugging enabled.
We install release libraries to /usr/lib* and debug libraries to /usr/lib*/pmdk_debug.
Debug libraries are built with:
The existence of two sets of libraries simultaneously built from one tree is quite unusual.
The proposal is to build only one set of libraries, install them to /usr/lib*, expand release builds to include some of the more useful but still lightweight debugging features, and get rid of the pmdk_debug directory.
Release libraries would be built with all symbols (and distributions would move those symbols to dbginfo packages) and with some lightweight logging.
Debug builds would be developer-only, so distributions would provide only release packages.
The libpmemobj non-transactional API allows one to atomically allocate (and initialize), reallocate, change the type number of, and eventually free an object, but there is no simple way to atomically modify the content of an object without using transactions.
The proposal is to add a function that fills the gap in the existing set of atomic non-transactional operations by providing the ability to atomically modify/update the content of a single object without using transactions.
The new function may be implemented in a couple of ways:
int pmemobj_modify(PMEMobjpool *pop, PMEMoid *oidp,
void (*constructor)(PMEMobjpool *pop, void *ptr, void *arg), void *arg);
This would allow modifying the object's data, but not resizing it or changing its type. It could, however, also take a type_num argument if needed.
In practice, this would be more or less equivalent to the following code:
int
pmemobj_modify(PMEMobjpool *pop, PMEMoid *oidp,
	void (*constructor)(PMEMobjpool *pop, void *ptr, void *arg), void *arg)
{
	TX_BEGIN(pop) {
		/* only one object per tx */
		pmemobj_tx_add_range(*oidp, 0, pmemobj_alloc_usable_size(*oidp));
		void *ptr = pmemobj_direct(*oidp);
		constructor(pop, ptr, arg); /* constructor cannot be NULL */
	} TX_ONABORT {
		return -1;
	} TX_END

	return 0;
}
pmemobj_realloc():

int pmemobj_realloc(PMEMobjpool *pop, PMEMoid *oidp, size_t size, unsigned int type_num,
	void (*constructor)(PMEMobjpool *pop, void *ptr, void *arg), void *arg);

If the constructor is not NULL, the library would never attempt to reallocate in place, but would always allocate a new object, copy over the user data, and then invoke the user-defined constructor function, which can modify the old data and/or initialize the added memory.
Passing a size equal to the current allocation size, together with a non-NULL constructor pointer, would be equivalent to pmemobj_modify().
It would be useful to provide (or auto-generate) gnuplot scripts for NVML benchmarks. This would help to quickly visualize benchmark results.
Key requirements:
Sometimes, to investigate a build/test failure, it might be useful to have more info about the build/test environment.
build log:
test log (additional info may be optional / dumped only in case of test failure):
Acquiring a lock in a transaction currently requires iterating over all of the existing locks in order to ensure that a situation like this:
TX_BEGIN(pop) {
TX_BEGIN_LOCK(pop, &lock) {
} TX_END
TX_BEGIN_LOCK(pop, &lock) {
} TX_END
} TX_END
does not create a deadlock.
The container of the currently active locks must be changed in order to improve the performance of this operation.
Apart from the pvector undo logs, the only information stored in a transaction lane is the state field, which indicates whether the transaction is committed or aborted.
This proposal suggests adding another state field, located at the opposite side of the lane, roughly ~1 kilobyte away from the first one. Both of those fields would have to be updated to change the state of the transaction, where:
And this will have to be enforced by properly ordering changes to the transaction state fields.
"tx.lanes.state.duplicate"
Defines whether the transaction state field must be duplicated or not. The effects of the change are immediate and as such, this value should only be changed when there are no transactions running.
TBD
This is not an issue as of now, but it is something that can affect future discussions about portability.
The current implementations of PMEMmutex and PMEMrwlock:
https://github.com/pmem/nvml/blob/master/src/libpmemobj/sync.h#L50
assume that the platform-provided mutex/rwlock data structure fits in 56 bytes (64 bytes minus the runid).
This doesn't hold true with Apple's libpthread:
https://opensource.apple.com/source/libpthread/libpthread-218.1.3/sys/_pthread/_pthread_types.h.auto.html
In this case sizeof(pthread_mutex_t) is 64, and sizeof(pthread_rwlock_t) is 200.
One possible solution for such cases is to allocate space for the mutex/rwlock in volatile memory. The PMEMmutex struct would then contain a pointer instead of the actual mutex, looking somewhat like FreeBSD's pthread types:
https://github.com/freebsd/freebsd/blob/master/sys/sys/_pthreadtypes.h#L69
This might have drawbacks; e.g., such allocated mutexes can show up as memory leaks in the analysis of some tools.
We only have labels for OS:linux and OS:windows, so I didn't use any labels. We don't have a POSIX_portability label yet.
Instead of a simple check whether the required API version matches the actual library API version, the xxx_check_version() functions could provide more information about the library. E.g., they could be used to retrieve the source version (tag/commit), whether it is a debug or non-debug build, whether the library was compiled with Valgrind support enabled, etc.
All this data could be compiled into a static string that is returned when xxx_check_version() is called with the major/minor version numbers set to 0.
If one creates a poolset file (pool.set) like the following:
PMEMPOOLSET
AUTO /dev/dax0.0
AUTO /dev/dax1.0
REPLICA
AUTO /dev/dax2.0
AUTO /dev/dax3.0
OPTION SINGLEHDR
then:
Data read back from addresses (calculated using the saved offsets) of the resynced poolset is incorrect: not equal to the pattern written by pmemobj_memset_persist.
BUT if the data is written with an extra offset > 4k, everything works fine.
Found on: pmempool 1.4-rc1 & pmempool 1.4-rc1-71-g70eddc798
The only reliable information about the existence of an allocation is the bits in the run bitmap. Even the allocation header cannot be used to determine whether an object exists, because it isn't zeroed on free, and in some cases user data present in the location where an allocation header might have existed can look like a header.
This proposal suggests adding a chunk flag, valid only for runs, that indicates whether the bitmap is duplicated. The duplicate bitmap would be stored at the end of the run data and updated alongside the master copy in a redo log.
This would be disabled by default, and the flag would only be set for new chunks created after the option is set. Likewise, disabling the option wouldn't remove the flag from existing chunks, so they would still have their bitmaps duplicated.
"heap.zone.run.on_create.flags.duplicate"
Indicates whether the duplicate flag should be set on creation of new chunks.
Currently pmemobj_open fails immediately with errno set to EAGAIN when the pool file is locked by another process. An application that expects the pool file to be temporarily locked by another process must loop on pmemobj_open as long as it returns NULL with errno set to EAGAIN. This is not efficient.
I'd like to be able to tell pmemobj_open to wait on the lock instead of failing. It could be implemented as a new version of pmemobj_open that accepts flags, or as a new ctl.
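The busy-wait applications must write today looks roughly like the sketch below; try_open is a stub standing in for pmemobj_open (which returns NULL with errno == EAGAIN while the pool is locked), so the loop itself can be shown self-contained:

```c
#include <errno.h>
#include <stddef.h>

/* Stub standing in for pmemobj_open(): fails with EAGAIN a few times. */
static int attempts_left = 3;
static void *
try_open(const char *path)
{
	(void)path;
	if (attempts_left-- > 0) {
		errno = EAGAIN;
		return NULL;
	}
	static int pool; /* dummy handle */
	return &pool;
}

/* The retry loop every application currently has to implement itself. */
static void *
open_wait(const char *path)
{
	void *pop;
	while ((pop = try_open(path)) == NULL && errno == EAGAIN)
		; /* in real code: sleep or back off instead of spinning */
	return pop;
}
```

A blocking flag (or ctl) would move this loop, with proper waiting on the lock, inside the library.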
$cat pool.set
PMEMPOOLSET
10M /dev/shm/pool
REPLICA 192.168.0.182 rep.set
$cat rep.set
PMEMPOOLSET
10M /dev/shm/rep.1
10M /dev/shm/rep.2
$pmempool create obj pool.set
Local and remote pool files are created
Remove one part from remote replica:
$rm -f /dev/shm/rep.1
$pmempool rm pool.set
error: cannot remove 'rep.set' on '192.168.0.182': Invalid argument
error: removing 'pool.set' failed: Invalid argument
Part /dev/shm/rep.2 on the remote was removed, although the man page states that the -f switch is required to ignore nonexistent files and proceed:
$man pmempool-rm
-f, --force
Remove all specified files, ignore nonexistent files, never
prompt.
Found on 1.4-rc3-14-g53bdc41
I'd like to be able to tell pmemobj to fail opening the pool if the pool configuration includes remote replication.
I think there should be a pmemobj_open variant which accepts flags. Alternatively, it could be implemented as a ctl.
Why? In pmemfile we have a problem when we intercept a syscall from libc while libc is holding a lock, and something in the syscall handling calls back into libc, which requires taking the same lock. We can manage it for the things pmemfile, libpmemobj and libpmem do, but it's impossible to handle for libfabric, libibverbs and anything below them.
There is a need for more accurate documentation of the unit tests. The current documentation is missing:
· the number of DAX devices needed to run all tests
· the number of nodes needed to run all tests
· the order of DAX devices or nodes
· the requirements for HW (master and slaves) and SW
Introduce an API function which allows retrieving the replica size from a remote poolset file.
This functionality is required when one wants to get the size of a damaged remote replica: currently, we cannot get the size of a remote pool without opening it.
Related issues:
pmem/issues#360
The user currently has no control over the persistent allocations happening inside a transaction. They happen automatically and from the global pool, which might interfere with the user's allocation classes and induce additional fragmentation. The initial idea I had was to simply allow the user to remap the entire internal allocation class map, so that it would be possible to substitute custom classes for metadata allocations. That had two problems: a) the allocation class map is common to all allocations, meaning that having custom classes only for metadata might be difficult, and b) the user never really knows what the sizes of metadata allocations are, so this would force applications to fill the entire alloc class map, which might not be needed for anything else.
The new idea is to allow the user to substitute the metadata allocator for a transaction. Each transaction would take an optional argument: an instance of a transaction metadata suballocator. If provided, each allocation/deallocation would go through it. This includes pvector arrays, snapshot caches and huge snapshots.
The simplest suballocator would be a passthrough to pmalloc with a custom allocation class (and other flags), whereas a slightly more complicated and useful one would be a fixed-size linear allocator that resets when the transaction ends.
There would be a new structure that defines the suballocator:
struct pobj_tx_alloc {
	int (*alloc)(struct pobj_tx_alloc *alloc, uint64_t *offset, size_t size);
	void (*dealloc)(struct pobj_tx_alloc *alloc, uint64_t *offset);
	void (*on_tx_begin)(struct pobj_tx_alloc *alloc);
	void (*on_tx_end)(struct pobj_tx_alloc *alloc);
};
It could include a type of the thing being allocated and/or a constructor. TBD
The user would be responsible for instantiating the allocator, which can be then passed to pmemobj_tx_begin as one of the varargs, like so:
struct my_super_tx_allocator *mallocator = ...;
pmemobj_tx_begin(pop, NULL, POBJ_PARAM_ALLOCATOR, mallocator);
The user structure must include struct pobj_tx_alloc at offset 0:

struct my_super_tx_allocator {
	struct pobj_tx_alloc base;
	uint64_t *offset;
	char *data;
};
There's a problem with the fact that we internally do not use PMEMoids, which means that the user would not be able to simply implement the alloc function as pmemobj_xalloc(..., &oid, ...), but would need to use the reserve/publish API and the pmemobj_set_value() function to set the offset. Not sure if this is a big issue.
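The "fixed-size linear allocator" variant mentioned above can be sketched against the proposed interface. Note that struct pobj_tx_alloc is part of this proposal, not an existing libpmemobj API, and the offsets here index a plain buffer rather than a pool; this is a minimal sketch, not an implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* the proposed interface, repeated here for self-containment */
struct pobj_tx_alloc {
	int (*alloc)(struct pobj_tx_alloc *alloc, uint64_t *offset, size_t size);
	void (*dealloc)(struct pobj_tx_alloc *alloc, uint64_t *offset);
	void (*on_tx_begin)(struct pobj_tx_alloc *alloc);
	void (*on_tx_end)(struct pobj_tx_alloc *alloc);
};

/* fixed-size linear (bump) allocator that resets when the tx ends */
struct linear_tx_alloc {
	struct pobj_tx_alloc base; /* must be at offset 0 */
	uint64_t used;
	uint64_t capacity;
};

static int
linear_alloc(struct pobj_tx_alloc *a, uint64_t *offset, size_t size)
{
	struct linear_tx_alloc *l = (struct linear_tx_alloc *)a;
	if (l->used + size > l->capacity)
		return -1; /* out of reserved space */
	*offset = l->used;
	l->used += size;
	return 0;
}

static void
linear_dealloc(struct pobj_tx_alloc *a, uint64_t *offset)
{
	(void)a; /* individual frees are no-ops in a linear allocator */
	*offset = 0;
}

static void
linear_on_tx_begin(struct pobj_tx_alloc *a) { (void)a; }

static void
linear_on_tx_end(struct pobj_tx_alloc *a)
{
	((struct linear_tx_alloc *)a)->used = 0; /* reset when tx ends */
}

static struct linear_tx_alloc
linear_tx_alloc_init(uint64_t capacity)
{
	struct linear_tx_alloc l = {
		{ linear_alloc, linear_dealloc,
		  linear_on_tx_begin, linear_on_tx_end },
		0, capacity
	};
	return l;
}
```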
RUNTEST has the nice feature of loading per-test-group configuration files (which was done for remote tests). However, in the odd case where you want to run a test not through RUNTEST (which you probably shouldn't), this functionality is absent. The question is: do we want it there for consistency? And there is still the issue of having this on Windows.
pmempool and libpmempool must provide support for checking consistency and repair of pmemobj pools.
Let's consider the following case: one creates an obj pool set with remote replica(s). Both the local (pool.set) and remote (remote.set) poolset files include the SINGLEHDR option.
If, after creating the poolsets, one wants to use pmempool info on remote.set, the command returns the following message:
error: opening poolset failed
remote.set: Invalid argument
To get any info about the remote poolset parts, one can run pmempool info directly on a single part. But since the SINGLEHDR option is active, it is possible to retrieve info about the first part only.
Shouldn't pmempool info run on a remote poolset file?
pmemblk_check returns 1 if the BTT Info header is corrupted. The problem is that pmemblk_check calls pmemblk_map_common, which writes a new layout if the BTT Info header consistency check fails. The new layout is not written to the file because of the rdonly flag. However, the runtime laidout variable is set to false after writing the new layout, and in consequence btt_check immediately reports that the BTT is consistent.
Below is simple backtrace when the write_layout is called.
#0 write_layout (bttp=0x6150a0, lane=0, write=0) at ../btt.c:714
#1 0x00007ffff7bd136f in read_layout (bttp=0x6150a0, lane=0) at ../btt.c:990
#2 0x00007ffff7bd17e7 in btt_init (rawsize=1073733632, lbasize=512, parent_uuid=0x10000000018 "\201\067\276\305\361\034C\327Q\f\305E\327X\310pGT", maxlane=8, ns=0x10000000000, ns_cbp=0x7ffff7dd9d80 <ns_cb>) at ../btt.c:1117
#3 0x00007ffff7bceaca in pmemblk_map_common (fd=13, bsize=512, rdonly=1) at ../blk.c:374
#4 0x00007ffff7bcf316 in pmemblk_check (path=0x7fffffffe0a3 "./blk.pool") at ../blk.c:590
E.g., there is no way to check the consistency of a pool set with remote replicas using the pmempool check CLI or the libpmempool API. An "Invalid argument" error is returned now.
Build output:
alpine-musl-:~/pmdk# make
make -C src all
make[1]: Entering directory '/root/pmdk/src'
make -C libpmem
make[2]: Entering directory '/root/pmdk/src/libpmem'
cc -MD -c -o ../nondebug/libpmem/os_linux.o -std=gnu99 -Wall -Werror -Wmissing-prototypes -Wpointer-arith -Wsign-conversion -Wsign-compare -Wconversion -Wunused-macros -Wmissing-field-initializers -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -std=gnu99 -fno-common -pthread -DSRCVERSION=\"1.3+b2-504-g0124a9715\" -I../include -I../common/ -fPIC ../../src/../src/common/os_linux.c
../../src/../src/common/os_linux.c: In function 'os_getenv':
../../src/../src/common/os_linux.c:243:9: error: implicit declaration of function 'secure_getenv' [-Werror=implicit-function-declaration]
return secure_getenv(name);
^~~~~~~~~~~~~
../../src/../src/common/os_linux.c:243:9: error: return makes pointer from integer without a cast [-Werror=int-conversion]
return secure_getenv(name);
^~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [../Makefile.inc:292: ../nondebug/libpmem/os_linux.o] Error 1
make[2]: Leaving directory '/root/pmdk/src/libpmem'
make[1]: *** [Makefile:166: libpmem] Error 2
make[1]: Leaving directory '/root/pmdk/src'
make: *** [Makefile:82: all] Error 2
Distribution: Alpine Linux 3.7.0
Found on 1.3+b2-504-g0124a9715
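The failure comes from musl not providing glibc's secure_getenv. One conventional workaround, sketched here with an illustrative helper name (not PMDK's actual fix), is to fall back to getenv only when the process is not running with elevated privileges, which is the property secure_getenv is meant to guarantee:

```c
#define _GNU_SOURCE /* for secure_getenv() on glibc */
#include <stdlib.h>
#include <unistd.h>

/*
 * Fallback for libcs without secure_getenv(): refuse to read the
 * environment when effective IDs differ from real IDs (setuid/setgid).
 */
static char *
os_secure_getenv(const char *name)
{
#ifdef __GLIBC__
	return secure_getenv(name);
#else
	if (getuid() != geteuid() || getgid() != getegid())
		return NULL;
	return getenv(name);
#endif
}
```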
When an application uses pmemobj on pool A and anywhere in a transaction calls another module or library that operates on pool B, all transactions on pool B abort immediately.
A prominent example is an application using pmemobj for pool A and logging something through pmemfile on pool B; in such a case logging will always fail.
pmemobj_tx_begin on a different pool should push the current transaction onto a stack and create a new transaction on the specified pool; pmemobj_tx_end should pop the last transaction from the stack.
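The proposed push/pop semantics can be sketched with a small per-thread stack of active pool contexts; pool handles here are opaque pointers standing in for PMEMobjpool, and this is only a model of the requested behavior, not libpmemobj code:

```c
#include <assert.h>
#include <stddef.h>

#define TX_STACK_MAX 16

/* per-thread stack of (pool, tx) contexts; pools are opaque handles here */
static const void *tx_stack[TX_STACK_MAX];
static size_t tx_depth;

/* tx_begin on another pool suspends the current tx and pushes a new one */
static int
tx_begin(const void *pool)
{
	if (tx_depth == TX_STACK_MAX)
		return -1;
	tx_stack[tx_depth++] = pool;
	return 0;
}

/* tx_end pops the innermost tx, resuming the enclosing one (or none) */
static const void *
tx_end(void)
{
	assert(tx_depth > 0);
	--tx_depth;
	return tx_depth ? tx_stack[tx_depth - 1] : NULL;
}
```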
Consider the following local pool set file:
PMEMPOOLSET
20M /root/part.0
20M /root/part.1
REPLICA 192.168.0.181 /root/remotePool.set
and remotePool.set
PMEMPOOLSET
20M /root/remotePart.0
20M /root/remotePart.1
Create the above poolsets, then remove the whole remote replica and change the layout of remotePool.set to:
PMEMPOOLSET
15M /root/remotePart.0
15M /root/remotePart.1
Then call pmempool sync with the -d flag on the pool set file. The return code is 0, which is not correct, because we cannot synchronize replicas when one of them is smaller than the other replicas.
Found on: 1.2-rc2
Consider the following text file "textfile.txt":
"some text"
Then calling
pmempool_rm("textfile.txt", 0)
will delete the file. pmempool_rm() should check whether the target is a pool file or a poolset before attempting any action.
Found on: 1.2+wtp1-227-gac3524e
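A minimal sketch of the kind of check pmempool_rm could perform before deleting anything: read the first bytes of the file and require either the "PMEMPOOLSET" header of a set file or a pool signature. The signature prefixes below are illustrative; the real pool_hdr signatures differ per library.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the file starts with a poolset header or a pool signature. */
static int
looks_like_pool(const char *path)
{
	char buf[16] = { 0 };
	FILE *f = fopen(path, "rb");
	if (f == NULL)
		return 0;
	size_t n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	if (n == 0)
		return 0;
	/* illustrative checks; real code would validate the full pool_hdr */
	return strncmp(buf, "PMEMPOOLSET", 11) == 0 ||
		strncmp(buf, "PMEM", 4) == 0;
}
```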
A commit group operation, allowing individual transactions to be opened in multiple threads and then committed all at the same time (either all transactions commit or none of them do).
Potential variants:
Create a document with NVML glossary ("pool", "pool set", etc...). Could be a section in the man pages, or a separate man page.
Also, it looks like some terms are not used consistently in the code and documentation; we need to unify that.
rpmem_create and rpmem_open accept pool_addr, which can be an address obtained by mmapping a Device DAX. If the pointer is not aligned to the internal Device DAX alignment, rpmem_create and rpmem_open can fail.
Please investigate whether we can prevent the user from failing badly.
Presently, the libpmempool/obj/blk/log libraries operate on a pool either through a poolset file (a text file describing the structure of a pool) or directly on a file containing a pool. Some of the implications are:
The proposed change is to:
Some benefits:
Currently there's no way to check how much memory is available, what the fragmentation is, how much memory is wasted on metadata, etc. If there's a (persistent) memory leak, there's no way to quickly evaluate it.
Please add a tool which will help with these tasks, and an API which will let applications get pool statistics (leaks can already be debugged thanks to the pmemobj_first/pmemobj_next API).
Bonus points for interactive readline-based tool.
I'd like pmemobj to have an API for suspending its operation (giving up the file lock) and restoring it with as much runtime state retained as possible. From the caller's perspective it would basically be a fast pool close & open API. It doesn't have to be thread-safe (the caller would have to deal with that).
This feature can be used to implement application-level "process switching" (partial/simplified multi-process).
The non-transactional persistent atomic list API in libpmemobj is not entirely thread-safe. The mutex is only used by some of the API (insert, remove, move), but not by things like the POBJ_LIST_FOREACH macro. All list operations need to be made thread-safe. This would be a big win for programmers using this particular set of NVML functions. Re: https://groups.google.com/forum/#!topic/pmem/cWu9lMqGY6g