Vectorize over first argument (cld2, closed)

dselivanov commented on August 29, 2024
Vectorize over first argument

from cld2.

Comments (3)

kbenoit commented on August 29, 2024

Agreed - and it should detect the language of each vector element individually. I'd say a very common use case is detecting the language of each text in a collection where every text is in a different language. Example:

# read in some multi-language texts
require(readtext)
DATA_DIR <- system.file("extdata/", package = "readtext")
(rt7 <- readtext(paste0(DATA_DIR, "pdf/UDHR/*.pdf"), 
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language")))
## readtext object consisting of 11 documents and 2 docvars.
## # data.frame [11 x 4]
##             doc_id                          text document language
##              <chr>                         <chr>    <chr>    <chr>
## 1 UDHR_chinese.pdf "\"世界人权宣言\n联合国\"..."     UDHR  chinese
## 2   UDHR_czech.pdf           "\"VŠEOBECNÁ \"..."     UDHR    czech
## 3  UDHR_danish.pdf           "\"Den 10. de\"..."     UDHR   danish
## 4 UDHR_english.pdf           "\"Universal \"..."     UDHR  english
## 5  UDHR_french.pdf           "\"Déclaratio\"..."     UDHR   french
## 6   UDHR_greek.pdf           "\"ΟΙΚΟΥΜΕΝΙΚ\"..."     UDHR    greek
## # ... with 5 more rows

Encoding(rt7$text)
## [1] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  


require(cld2)
detect_language(rt7$text)
## [1] NA
detect_language_multi(rt7$text)
## $guess
##   language code latin probability
## 1    GREEK   el FALSE        0.18
## 2 Japanese   ja FALSE        0.14
## 3  RUSSIAN   ru FALSE        0.11
## 
## $bytes
## [1] 19722
## 
## $reliabale
## [1] FALSE

jeroen commented on August 29, 2024

OK, we can do that. I usually prefer to use vapply myself instead of having everything vectorised by default, but perhaps it makes sense here. In the meantime you can apply the function to each element:

sapply(as.list(rt7$text), detect_language)
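
For reference, the vapply variant mentioned above might look like the following. This is only a sketch: the `texts` vector is a made-up stand-in for `rt7$text`, and it assumes `detect_language()` returns a single string (or `NA`) for a single input string.

```r
library(cld2)

# Hypothetical sample inputs, not taken from this thread
texts <- c(
  "To be or not to be, that is the question.",
  "Ceci n'est pas une pipe, c'est une phrase en français.",
  "Dies ist ein kurzer Satz in deutscher Sprache."
)

# Call detect_language() once per element. FUN.VALUE = character(1)
# makes vapply stop with an error if any call returns something other
# than a single character value, which sapply would silently accept.
langs <- vapply(texts, detect_language, character(1), USE.NAMES = FALSE)
```

Unlike sapply, this always yields a character vector of the same length as the input, even when the input is empty.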

jeroen commented on August 29, 2024

Fixed in 232bd0e.
