sashafrey / topicmod
This project has been moved to https://github.com/bigartm/bigartm
License: Other
This is a regression, introduced in the alfrey_rpcz branch. If one presses Ctrl+C to abort cpp_client during ongoing processing, the following error message appears:
Assertion failed: Successful WSASTARTUP not yet performed (......\src\signaler.cpp:328)
As far as I can see, there is no simple way to solve this on Windows. On Linux it can be handled by calling the install_signal_handler() method from src\rpcz\include\rpcz\connection_manager.hpp.
I hope this is an annoying but harmless error. However, there might be some consequences - for example, leaking socket connections when clients crash. So I suggest we ignore this ticket for now, but keep it open as a reminder.
This operation should store in-memory partitions to disk (to form the "index"). It is an important step towards big data: processing should not require keeping the whole collection in memory.
This is quite a big and complex fix, which requires at least the following things:
CMake should save us a lot of trouble. We already have many issues with our current build system:
There are several problems:
There is a critical bug in RPCZ which causes a crash in certain scenarios. Given that it is intermittent, it is hard for me to provide repro steps, but the problem is quite straightforward. To explain it, I need three pieces of code.
// Any stub method, generated by the RPCZ compiler.
void XXXStub::XXXMethod(...) {
  ::rpcz::rpc rpc;
  channel_->call_method(..., &rpc, ...);  // (1)
  rpc.wait();                             // (3)
}
// File rpcz\rpc_channel_impl.cc
void rpc_channel_impl::handle_client_response(...) {
  response_context.rpc_->set_status(status::OK);  // (2)
  response_context.rpc_->sync_event_->signal();   // (4)
}
Assume that the threads were scheduled so that these calls happened in the order (1), (2), (3), (4).
Now, the last piece of code, the rpc::wait() method:
int rpc::wait() {
  status_code status = get_status();
  CHECK_NE(status, status::INACTIVE) << "Request must be sent before calling wait()";
  if (status != status::ACTIVE) return get_status();  // THIS LINE SHOULD BE REMOVED
  sync_event_->wait();
  return 0;
}
If the threads were scheduled as I described, the status of the rpc will already be OK, and rpc::wait() will exit immediately. Therefore, by the moment handle_client_response() reaches sync_event_->signal(), the sync_event_ might already be disposed, and we get a crash.
I believe the fix is simply to remove the following line from the rpc::wait() method:
if (status != status::ACTIVE) return get_status();
Currently perplexity is accumulated across all iterations. This is an unnecessary limitation. ARTM should support perplexity calculation for the last iteration only.
Search for this code:
boost::this_thread::sleep(boost::posix_time::milliseconds(1));
It happens in many places. Everywhere, the 1 ms value should be configurable via the config object. Keep 1 ms as the default value.
Yep, it's silly, but for some reason the Makefile under src/topicmd/ first creates .depends files, just to clean them afterwards. Should be an easy fix :)
Branch https://github.com/sashafrey/topicmod/tree/alfrey_data_loader_streams introduces two new operations: DataLoader.WaitIdle() and Instance.WaitIdle(). A better solution would be to
Devise normalizations for each regularizer such that a coefficient value of tau=1 is naturally interpreted as "normal"; so that there is a formulation (as in physics) of what happens to the model "per unit of regularization"; and so that tau=0.1 means "weaken the regularization by a factor of 10", while tau=10 means "strengthen the regularization by a factor of 10".
topicmd is designed as a library that you may use from any other language (at least from Python and Java, most likely C#, and perhaps Matlab).
We should implement an interface from Python, since Python is a very nice environment to conduct research experiments. Everything that c_client does nowadays should be doable from Python (push collection, create model, fetch back the results).
Integration with Python should be done via the Python library called "ctypes".
You may find an example of such integration here:
http://sourceforge.net/p/vft11ccas/code/HEAD/tree/Source/alfrey/Python/
Ask sashafrey for more details.
Currently the Merger and Processors will tune models forever (in the background).
Ideally we should have an option to stop tuning once perplexity (or any other measure) reaches a sufficient level. I don't have a clear vision of how to achieve this; you are welcome to suggest your own design!
DataLoader currently keeps all the data in memory. The suggested change is to
I suggest we use google test framework
https://code.google.com/p/googletest/
(and change the title)
Currently c_interface and cpp_interface require the caller to provide a model_id. In fact, it should be Instance::UpdateModel that keeps track of existing model_ids and generates a new model_id. Such a function should be named CreateModel. UpdateModel should be renamed to ReconfigureModel (to match the names in c_interface and cpp_interface).
RPCZ (https://code.google.com/p/rpcz/) is introduced in the following branch:
https://github.com/sashafrey/topicmod/compare/alfrey_rpcz
For now we only use the C++ version of RPCZ, but it would be nice to learn how to call RPCZ services from Python.
Currently the concept of 'Topic' is barely represented in the code. Basically, there is a number of topics associated with each model, and beyond that a topic is just an index into some arrays/matrices. This makes it difficult to associate extra information with each topic (for example, a flag 'subject topic' vs 'background topic').
rdkl:
Maybe replace
namespace artm {
enum ExitCodes {
  SUCCESS,
  FIRST_ERROR_TYPE,
  SECOND_ERROR_TYPE
};
}  // namespace artm
Auto-completion.
alfrey:
I have to think about this :)
error LNK1104: cannot open file 'libprotobuf.lib' in VS2010, in the topicmd_tests, topicmd_dll and c_client projects.
Root cause: for some reason the $(VisualStudioVersion) macro doesn't exist.
Workaround: replace "vc$(VisualStudioVersion)" in the linker path settings with the explicit string "vc10.0".
The ArtmRequestBatchTopics method of c_interface is not yet implemented. The goal of this method is to classify a collection of documents (e.g. a Batch) with a specific topic model.
Implement this feature:
"Each testing document d was further randomly split into two equal parts: the first one was used to estimate parameters θd, and the second one was used in perplexity evaluation."
An uncaught exception in a non-main thread will cause terminate(). Such situations are very difficult to debug, because terminate() aborts all threads in an uncontrolled manner, causing null references, access violations, and all sorts of weirdness that is impossible to debug.
To avoid this, we should review all threads created in topicmd and make sure we catch all exceptions (and log a proper error message).
Currently the Merger keeps accumulating counters. It never resets them to zero, and does not "exponentially decrease" the values corresponding to old iterations. This is suboptimal: on the first steps the Phi matrix was inaccurate, so we accumulated some noise.
A common approach for online algorithms is to exponentially decrease the counters. You are welcome to suggest actual implementation details for this feature, and implement it :)
Goal: it should be possible to set up several streams inside the data loader
(for example, one stream for training items, and one for test items).
Models (and quality measures such as perplexity) specify which stream to use for tuning (or evaluation).
Currently the token-topic matrix uses 'float' as its datatype. This should perhaps be a separate typedef ('artm_float'), to make sure we can switch to double whenever better precision is needed.
I'll have to think about some details (for example, how we deal with protobuf messages).
Currently the number of processors is hardcoded to one. We need to support multicore machines by adding more processors to the instance. The fix should be fairly simple: just instantiate more Processor objects, connect them to the same queues (processor_queue_ and merger_queue_), and it should work! :)
With the current implementation the processor tunes Theta (the distribution of items into topics) from scratch each time, even though it doesn't differ much between iterations (N) and (N+1).
The idea is to ask DataLoader to save the thetas, and then pass them back to the Processor.
From "An Architecture for Parallel Topic Models" by A. Smola and S. Narayanamurthy
To store n(t,w) we use the same memory layout as Mallet. That is, we maintain a list of (topic, count) pairs for each word "w", sorted in order of decreasing counts. This allows us to implement a [Gibbs] sampler efficiently (with high probability we do not reach the end of the list), since the most likely topics occur first.
Currently the merger resets the whole model whenever it sees a new generation. There are many better ways to do this.
Add a timeout to prevent operations from hanging forever.
For internal documentation it would be nice to have two diagrams:
I suggest taking the one from LinearSampling:
http://sourceforge.net/p/vft11ccas/code/HEAD/tree/Source/alfrey/Cuda/LinearSampling/Log.h
IMPORTANT: the logger originates from http://stackoverflow.com/questions/5028302/small-logger-class. Make sure to talk with the author and figure out whether he is OK with re-using his code in an open-source project.
Currently request_model_topics in c_interface.cc is implemented via one global variable (a string message). This holds the data only for the last request. Ideally the result of a request should be stored until dispose_request is invoked.
The task is to implement proper request manager (similar to instance_manager).
Fix .gitignore to avoid adding src/Win32/Debug (and other intermediate files on Windows)
Describe the build process (prerequisites, etc.), branching and integration. A gentle introduction to git would also be nice.
Document code style