Comments (10)

emjotde commented on July 28, 2024

Good point. So, a temporary file then? I think kenlm does something that deletes the file right away, and we should specify the temporary directory with -T, assuming "." by default.

from marian-dev.

kpu commented on July 28, 2024

Assuming you don't care about Windows support (which I got from LGPL code), making a temporary file is pretty easy.

// This function has a windows port, hence the separation.
int
mkstemp_and_unlink(char *tmpl) {
  int ret = mkstemp(tmpl);
  if (ret != -1) {
    UTIL_THROW_IF(unlink(tmpl), ErrnoException, "while deleting " << tmpl);
  }
  return ret;
}
int MakeTemp(const std::string &base) {
  std::string name(base);
  name += "XXXXXX";
  name.push_back(0);
  int ret;
  UTIL_THROW_IF(-1 == (ret = mkstemp_and_unlink(&name[0])), ErrnoException, "while making a temporary based on " << base);
  return ret;
}
// If it's a directory, add a /.  This lets users say -T /tmp without creating
// /tmpAAAAAA
void NormalizeTempPrefix(std::string &base) {
  if (base.empty()) return;
  if (base[base.size() - 1] == '/') return;
  struct stat sb;
  // It's fine for it to not exist.
  if (-1 == stat(base.c_str(), &sb)) return;
  if (
#if defined(_WIN32) || defined(_WIN64)
    sb.st_mode & _S_IFDIR
#else
    S_ISDIR(sb.st_mode)
#endif
    ) base += '/';
}

jgwinnup commented on July 28, 2024

I've run into this issue already - took me a bit to figure out why those two systems decided not to converge...

emjotde commented on July 28, 2024

It's high up on my TODO list.

kpu commented on July 28, 2024

I took a look at the code. It's easy to get a FILE * or an int fd but harder to get an iostream backed by a deleted temporary file. That said, it's just reading and writing lines, so it doesn't need iostreams. Would you be happy if this changed to FILE * APIs?
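
As a rough sketch of what the FILE * route could look like, assuming POSIX only (no Windows port): the descriptor from mkstemp is unlinked right away and then wrapped with fdopen. The names OpenDeletedTemp and RoundTrip are hypothetical and are not taken from marian or kenlm.

```cpp
#include <stdio.h>    // FILE, fdopen, fputs, fgets, rewind
#include <stdlib.h>   // mkstemp
#include <unistd.h>   // unlink, close
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: create a temporary file named prefix + 6 random
// characters, unlink it immediately, and return it as a read/write FILE *.
FILE *OpenDeletedTemp(const std::string &prefix) {
  std::string name = prefix + "XXXXXX";
  int fd = mkstemp(&name[0]);  // rewrites the XXXXXX suffix in place
  if (fd == -1) throw std::runtime_error("mkstemp failed for " + prefix);
  unlink(name.c_str());        // the file lives on until the fd is closed
  FILE *f = fdopen(fd, "w+");
  if (!f) {
    close(fd);
    throw std::runtime_error("fdopen failed for " + name);
  }
  return f;
}

// Hypothetical usage: write lines out, rewind, and read them back.
std::vector<std::string> RoundTrip(const std::vector<std::string> &lines,
                                   const std::string &prefix) {
  FILE *f = OpenDeletedTemp(prefix);
  for (const std::string &line : lines) {
    fputs(line.c_str(), f);
    fputc('\n', f);
  }
  rewind(f);  // flushes pending writes and seeks to the start
  std::vector<std::string> out;
  std::string cur;
  char buf[4096];
  while (fgets(buf, sizeof buf, f)) {
    cur += buf;  // long lines arrive in several chunks
    if (!cur.empty() && cur.back() == '\n') {
      cur.pop_back();
      out.push_back(cur);
      cur.clear();
    }
  }
  fclose(f);  // closing the last descriptor releases the disk space
  return out;
}
```

Because the file is already unlinked, nothing is left behind if the process crashes between writing and reading.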

kpu commented on July 28, 2024

For what it's worth, it's faster if you size the input (or read everything into a buffer whose size doubles as needed) and sort string_view.

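#include <algorithm>
#include <iostream>
#include <random>
#include <boost/utility/string_view.hpp>
//#include <string_view>
#include <vector>

void SplitLines(boost::string_view in, std::vector<boost::string_view> &out) {
  out.clear();
  while (!in.empty()) {
    boost::string_view::size_type found = in.find('\n');
    // No trailing newline on the last line: keep the remainder and stop,
    // otherwise found + 1 wraps to 0 and the loop never advances.
    if (found == boost::string_view::npos) {
      out.push_back(in);
      break;
    }
    out.push_back(in.substr(0, found + 1));
    in.remove_prefix(found + 1);
  }
}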

int main() {
  const std::streamsize kBuf = 1048576;
  char buf[kBuf];
  std::cin.read(buf, kBuf);
  std::streamsize got = std::cin.gcount();
  if (got >= kBuf) {
    std::cerr << "Too big\n";
    return 1;
  }
  std::vector<boost::string_view> strings;
  SplitLines(boost::string_view(buf, got), strings);

  std::random_device rd;
  std::shuffle(strings.begin(), strings.end(), std::mt19937(rd()));
  for (const boost::string_view &i : strings) {
    std::cout << i;
  }
}

emjotde commented on July 28, 2024

I use the filestreams because they are there; I'm not too attached to them in that particular case. I would like to keep them around due to the filtering stream functionality.

emjotde commented on July 28, 2024

@kpu Kenneth, what are the includes for this?

    void NormalizeTempPrefix(std::string &base) {
      if (base.empty())
        return;
      if (base[base.size() - 1] == '/')
        return;
      struct stat sb;
      // It's fine for it to not exist.
      if (-1 == stat(base.c_str(), &sb))
        return;
      if (S_ISDIR(sb.st_mode))
        base += '/';
    }

The stat struct and S_ISDIR cause compilation errors, I assume missing headers?

kpu commented on July 28, 2024

#include <string>
#include <sys/types.h>
#include <sys/stat.h>
#if defined(__MINGW32__)
#include <windows.h>
#include <unistd.h>
#elif defined(_WIN32) || defined(_WIN64)
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif
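
For the non-Windows branch only, the pieces fit together roughly as follows. This is a sketch assuming a POSIX system, where struct stat, stat() and S_ISDIR all come from <sys/stat.h>; the function body is the one quoted in the question.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/stat.h>  // struct stat, stat(), S_ISDIR

// If base names an existing directory, append '/' so that temporary files
// are created inside it rather than next to it.
void NormalizeTempPrefix(std::string &base) {
  if (base.empty())
    return;
  if (base[base.size() - 1] == '/')
    return;
  struct stat sb;
  // It's fine for it to not exist.
  if (-1 == stat(base.c_str(), &sb))
    return;
  if (S_ISDIR(sb.st_mode))
    base += '/';
}
```

With this, a prefix of "." becomes "./", while a prefix that does not exist on disk is left unchanged.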

emjotde commented on July 28, 2024

@kpu thanks.
