Comments (10)
Good point. So, temporary file then? I think kenlm does something that deletes the file right away, and we should specify the temporary directory with -T, assuming "." by default.
from marian-dev.
Assuming you don't care about Windows support (which I got from LGPL code), making a temporary file is pretty easy.
// This function has a Windows port, hence the separation.
int mkstemp_and_unlink(char *tmpl) {
  int ret = mkstemp(tmpl);
  if (ret != -1) {
    // The descriptor stays valid after unlink; the file vanishes once it is closed.
    UTIL_THROW_IF(unlink(tmpl), ErrnoException, "while deleting " << tmpl);
  }
  return ret;
}

int MakeTemp(const std::string &base) {
  std::string name(base);
  name += "XXXXXX";
  name.push_back(0);
  int ret;
  UTIL_THROW_IF(-1 == (ret = mkstemp_and_unlink(&name[0])), ErrnoException, "while making a temporary based on " << base);
  return ret;
}

// If it's a directory, add a /. This lets users say -T /tmp without creating
// /tmpAAAAAA
void NormalizeTempPrefix(std::string &base) {
  if (base.empty()) return;
  if (base[base.size() - 1] == '/') return;
  struct stat sb;
  // It's fine for it to not exist.
  if (-1 == stat(base.c_str(), &sb)) return;
  if (
#if defined(_WIN32) || defined(_WIN64)
    sb.st_mode & _S_IFDIR
#else
    S_ISDIR(sb.st_mode)
#endif
  ) base += '/';
}
from marian-dev.
I've run into this issue already. It took me a while to figure out why those two systems decided not to converge...
from marian-dev.
It's high up on my TODO list.
from marian-dev.
I took a look at the code. It's easy to get a FILE * or an int fd, but harder to get an iostream, as a temporary deleted file. That said, it's just reading and writing lines, so it doesn't need iostreams. Would you be happy if this changed to FILE * APIs?
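As a rough illustration of the FILE * route, here is a minimal POSIX-only sketch that wraps a deleted temporary file in a FILE * via fdopen and then writes and reads lines through it. OpenDeletedTemp and RoundTrip are hypothetical names made up for this example, not functions from kenlm or Marian, and the plain std::runtime_error stands in for whatever exception type the real code would use.

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>  // POSIX: unlink, close

#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: create a temporary file named prefix + 6 random
// characters, delete its name right away, and return it as a FILE *.
// `prefix` plays the role of the -T argument, e.g. "/tmp/" or "./shuffle".
FILE *OpenDeletedTemp(const std::string &prefix) {
  std::string name(prefix + "XXXXXX");
  int fd = mkstemp(&name[0]);  // rewrites the XXXXXX part in place
  if (fd == -1) throw std::runtime_error("mkstemp failed for " + prefix);
  unlink(name.c_str());        // data stays reachable through fd until it is closed
  FILE *f = fdopen(fd, "w+");
  if (!f) {
    close(fd);
    throw std::runtime_error("fdopen failed for " + prefix);
  }
  return f;
}

// Hypothetical demo: write lines, rewind, and read them back with FILE * calls only.
// Lines longer than the 4095-character buffer would come back in pieces.
std::vector<std::string> RoundTrip(const std::vector<std::string> &lines,
                                   const std::string &prefix) {
  FILE *f = OpenDeletedTemp(prefix);
  for (const std::string &line : lines) {
    fputs(line.c_str(), f);
    fputc('\n', f);
  }
  rewind(f);
  std::vector<std::string> out;
  char buf[4096];
  while (fgets(buf, sizeof buf, f)) {
    std::string s(buf);
    if (!s.empty() && s.back() == '\n') s.pop_back();
    out.push_back(s);
  }
  fclose(f);  // the last reference goes away here, so the file is freed
  return out;
}
```

Calling RoundTrip({"a", "b c"}, "/tmp/") should hand back the same two lines, and no file is left behind in /tmp even if the process crashes halfway through.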
from marian-dev.
For what it's worth, it's faster if you size the input (or read everything into a buffer that doubles in size as needed) and sort string_view.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
#include <boost/utility/string_view.hpp>
//#include <string_view>

// Split `in` into views of its lines; each view keeps its trailing '\n'.
void SplitLines(boost::string_view in, std::vector<boost::string_view> &out) {
  out.clear();
  while (!in.empty()) {
    boost::string_view::size_type found = in.find('\n');
    // A final line without '\n' gives npos; take the rest instead of looping forever.
    boost::string_view::size_type len =
        (found == boost::string_view::npos) ? in.size() : found + 1;
    out.push_back(in.substr(0, len));
    in.remove_prefix(len);
  }
}

int main() {
  const std::streamsize kBuf = 1048576;
  // static: a 1 MiB array can overflow small default stacks.
  static char buf[kBuf];
  std::cin.read(buf, kBuf);
  std::streamsize got = std::cin.gcount();
  if (got >= kBuf) {
    std::cerr << "Too big\n";
    return 1;
  }
  std::vector<boost::string_view> strings;
  SplitLines(boost::string_view(buf, got), strings);
  std::random_device rd;
  std::shuffle(strings.begin(), strings.end(), std::mt19937(rd()));
  for (const boost::string_view &i : strings) {
    std::cout << i;
  }
}
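The example above gives up on input larger than its fixed buffer. A minimal sketch of the other option mentioned, reading everything into a buffer that grows, might look like the following. It assumes the growth policy is plain doubling, and ReadAll and its `initial` parameter are hypothetical names, not part of kenlm or Marian.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>

// Hypothetical helper: read all of `in` into one contiguous buffer,
// doubling the buffer whenever it fills up.
std::string ReadAll(std::istream &in, std::size_t initial = 4096) {
  std::string buf(initial ? initial : 1, '\0');
  std::size_t used = 0;
  while (true) {
    in.read(&buf[used], static_cast<std::streamsize>(buf.size() - used));
    used += static_cast<std::size_t>(in.gcount());
    if (!in) break;  // read() came up short: end of input or an error
    // The read filled the buffer exactly, so double it and keep going.
    buf.resize(buf.size() * 2);
  }
  buf.resize(used);  // drop the unused tail
  return buf;
}
```

The returned string owns the bytes, so views built over `data.data()` and `data.size()` stay valid for as long as the string is alive and unmodified.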
from marian-dev.
I use the file streams because they are there; I'm not too attached to them in that particular case. I would like to keep them around, though, for the filtering stream functionality.
from marian-dev.
@kpu Kenneth, what are the includes for this?
void NormalizeTempPrefix(std::string &base) {
  if (base.empty())
    return;
  if (base[base.size() - 1] == '/')
    return;
  struct stat sb;
  // It's fine for it to not exist.
  if (-1 == stat(base.c_str(), &sb))
    return;
  if (S_ISDIR(sb.st_mode))
    base += '/';
}
The stat struct and S_ISDIR cause compilation errors, I assume missing headers?
from marian-dev.
#include <string>
#include <sys/types.h>
#include <sys/stat.h>
#if defined(__MINGW32__)
#include <windows.h>
#include <unistd.h>
#elif defined(_WIN32) || defined(_WIN64)
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif
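Putting the headers and the function together, a POSIX-only sketch that should compile on its own looks like this. Only <string> and the two sys/ headers are needed for stat and S_ISDIR; the Normalized wrapper is a hypothetical convenience added for this example, and the Windows branch is left out.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/stat.h>  // struct stat, stat(), S_ISDIR

// If base names an existing directory, append '/' so that a later
// base + "XXXXXX" template lands inside it rather than next to it.
void NormalizeTempPrefix(std::string &base) {
  if (base.empty()) return;
  if (base[base.size() - 1] == '/') return;
  struct stat sb;
  // It's fine for it to not exist.
  if (-1 == stat(base.c_str(), &sb)) return;
  if (S_ISDIR(sb.st_mode)) base += '/';
}

// Hypothetical wrapper that returns the result instead of editing in place.
std::string Normalized(std::string base) {
  NormalizeTempPrefix(base);
  return base;
}
```

On a typical Linux system, Normalized("/tmp") gives "/tmp/", while a prefix that does not exist yet is returned unchanged.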
from marian-dev.
@kpu thanks.
from marian-dev.