
utf8.lua's Introduction

utf8.lua

Pure-Lua implementation of the Lua 5.3 regex (pattern) functions for Lua 5.3, Lua 5.1, and LuaJIT.

This library provides a simple way to add UTF-8 support to your application.

Example:

local utf8 = require('.utf8'):init()
for k,v in pairs(utf8) do
  string[k] = v
end

local str = "пыщпыщ ололоо я водитель нло"
print(str:find("(.л.+)н"))
-- 8	26	ололоо я водитель

print(str:gsub("ло+", "보라"))
-- пыщпыщ о보라보라 я водитель н보라	3

print(str:match("^п[лопыщ ]*я"))
-- пыщпыщ ололоо я

Usage:

This library can be used as a drop-in replacement for the vanilla string library. It exports all vanilla functions under the raw sub-object.

local utf8 = require('.utf8'):init()
local str = "пыщпыщ ололоо я водитель нло"
utf8.gsub(str, "ло+", "보라")
-- пыщпыщ о보라보라 я водитель н보라	3
utf8.raw.gsub(str, "ло+", "보라")
-- пыщпыщ о보라보라о я водитель н보라	3

It also provides all functions from the Lua 5.3 UTF-8 module except utf8.len (s [, i [, j]]). If you need to validate your strings, use utf8.validate(str, byte_pos) or iterate over them with utf8.validator.

Please note that the library assumes regexes are valid UTF-8 strings. If you need to manipulate individual bytes, use the vanilla functions under utf8.raw.
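
The difference between the utf8.gsub and utf8.raw.gsub results above comes down to characters versus bytes. A minimal sketch that uses only the vanilla string library (utf8.lua is not involved; the strings are written with decimal byte escapes so the result does not depend on the file encoding):

```lua
-- "л" is U+043B (bytes 208 187), "о" is U+043E (bytes 208 190).
local l, o = "\208\187", "\208\190"
local str = l .. o .. o              -- "лоо": 3 characters, 6 bytes

-- The vanilla library counts bytes.
print(#str)                          -- 6

-- Counting every byte that is not a continuation byte (128..191)
-- gives the number of characters in a valid UTF-8 string.
local _, chars = str:gsub("[^\128-\191]", "")
print(chars)                         -- 3

-- In a byte-oriented pattern, "+" applies only to the last byte of "о",
-- so the match stops after the first "о" (byte 4) instead of at byte 6.
local first, last = string.find(str, l .. o .. "+")
print(first, last)                   -- 1	4
```

This is why the raw result leaves a stray "о" behind after the second replacement.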

Installation:

Download the repository to your project folder. (No rockspecs yet.)

The examples assume the library is placed under a subfolder named utf8, not utf8.lua.

As of Lua 5.3, the default utf8 module takes precedence over a user-provided one. In this case you can specify the full module path (.utf8).

Configuration:

The library is highly modular. You can provide your own implementation for almost any function used. The library already has several back-ends.

Probably the most interesting customizations are utf8.config.loadstring and utf8.config.cache, if you want to precompile your regexes.

local utf8 = require('.utf8')
utf8.config = {
  cache = my_smart_cache,
}
utf8:init()

For the lower and upper functions to work in environments where ffi cannot be used, you can specify substitution tables (data example):

local utf8 = require('.utf8')
utf8.config = {
  conversion = {
    uc_lc = utf8_uc_lc,
    lc_uc = utf8_lc_uc
  },
}
utf8:init()

Customization must be done before initialization. You can change the configuration after init if you want, and it might work for everything except modules, which must all be reloaded.

Issue reporting:

Please provide an example script that causes the error, together with a description of your environment and the debug output. Debug output can be obtained like this:

local utf8 = require('.utf8')
utf8.config = {
  debug = utf8:require("util").debug
}
utf8:init()
-- your code

The default logger is io.write. It can be changed by specifying logger = my_logger in the configuration.
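
A minimal sketch of such a logger, assuming it is called the same way as io.write (any number of values, no separators added). The names my_logger and log_lines are only illustrative, and the configuration fragment is shown as a comment because it needs the library to be installed:

```lua
-- Hypothetical logger with the calling convention of io.write:
-- it receives any number of values and joins them without separators.
local log_lines = {}
local function my_logger(...)
  local parts = {}
  for i = 1, select("#", ...) do
    parts[i] = tostring((select(i, ...)))
  end
  log_lines[#log_lines + 1] = table.concat(parts)
end

-- Configuration fragment in the style used above (not executed here):
-- utf8.config = {
--   debug = utf8:require("util").debug,
--   logger = my_logger,
-- }

my_logger("matched at ", 3, "\n")
print(log_lines[1])
```

Collecting the messages in a table instead of printing them makes it easier to attach the debug output to an issue report.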

utf8.lua's People

Contributors

starius, stepets, tst2005

utf8.lua's Issues

Help a Lua novice?

Hi,

I'm a Lua novice. I'm using Lua 5.4.3.

In my project folder I used git clone https://github.com/Stepets/utf8.lua. This created a subfolder called utf8.lua.

Near the beginning of my Lua code, I have local utf8 = require('.utf8'):init(). When I try to execute the code I get:

~/src/test$ ./new.lua
/usr/local/bin/lua: ./new.lua:5: module '.utf8' not found:
        no field package.preload['.utf8']
        no file '/usr/local/share/lua/5.4//utf8.lua'
        no file '/usr/local/share/lua/5.4//utf8/init.lua'
        no file '/usr/local/lib/lua/5.4//utf8.lua'
        no file '/usr/local/lib/lua/5.4//utf8/init.lua'
        no file './/utf8.lua'
        no file './/utf8/init.lua'
        no file '/usr/local/lib/lua/5.4//utf8.so'
        no file '/usr/local/lib/lua/5.4/loadall.so'
        no file './/utf8.so'
        no file '/usr/local/lib/lua/5.4/.so'
        no file '/usr/local/lib/lua/5.4/loadall.so'
        no file './.so'
stack traceback:
        [C]: in function 'require'
        ./new.lua:5: in main chunk
        [C]: in ?

I suspect I need to configure package.path, but I'm not sure how. Can you help?
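
The error message lists the file names Lua tried. A rough sketch of how require derives them from the module name and the package.path templates (candidate_paths is a hypothetical helper and is simplified: the real require also consults package.preload and package.cpath and uses the platform's directory separator):

```lua
-- Hypothetical helper: list the Lua files that require(name) would look for,
-- given a package.path-style string of ";"-separated templates.
local function candidate_paths(name, path)
  local fname = name:gsub("%.", "/")          -- dots become directory separators
  local out = {}
  for template in path:gmatch("[^;]+") do
    -- every "?" in the template is replaced by the converted module name
    out[#out + 1] = (template:gsub("%?", function() return fname end))
  end
  return out
end

local paths = candidate_paths(".utf8", "./?.lua;./?/init.lua")
for _, p in ipairs(paths) do print(p) end
-- .//utf8.lua
-- .//utf8/init.lua
```

So Lua looks for a folder named utf8, while the clone above created a folder named utf8.lua. Assuming the script is run from the project folder, cloning into a folder named utf8 (git clone https://github.com/Stepets/utf8.lua utf8), as the installation notes suggest, should let the second candidate resolve without touching package.path.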

Error when working with strict.lua

stepets\utf8\functions\lua53.lua:37: variable 'plain' is not declared

This is a minor issue, more serious ones will hopefully follow.
Each next bug is getting harder to find.

Incorrect matching case

The following example gives different results between string and utf8.lua libraries:

local u = (require 'stepets.utf8'):init()

local subj = "items"
local patt = "^(.-)([^.]*)$"

print(      u.match(subj, patt) ) --> nil
print( string.match(subj, patt) ) -->         items

basic test cases

I have been using this library recently since I don't need the performance of C modules and much prefer the simplicity.

I imagine adapting existing test cases won't be too difficult (some have the same API), if you don't have time, I might consider working on it :)

A better module name ?

Hello,

You probably know that since Lua 5.3 a module named utf8 is already in use ( https://www.lua.org/manual/5.3/manual.html#6.5 ).
I like your project. I think it should be more widely used to support string operations on UTF-8 data.

  • Do you plan to make a rockspec and publish your module on luarocks?

  • Would you consider finding a better repository name?

I'm working on:

  1. making a simple rockspec
  2. making an all-in-one module file (my tool for that usually uses the rockspec file)
    In my current attempts, I use stepets_utf8 as the module name. (require "stepets_utf8")

Another name could be utf8string, because you mainly implement all Lua string.* features with UTF-8 support.
The utf8string package name is not yet used on luarocks.

What is your opinion?

Question about the require syntax

I want to ask what's the difference between

local utf8 = require('.utf8'):init()

and

local utf8 = require('utf8'):init()

I placed this library in my package path in the folder utf8 (using LuaJIT). When I run my test, utf8.codepoint is undefined. Where should I place this library to get .utf8 to work?
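
One way to see which module a given require call actually returned is to inspect the table. The sketch below assumes that the library's module table has an init function (as the README's require('.utf8'):init() implies) and relies on the fact that the built-in Lua 5.3 utf8 module has char and charpattern but no init. The helper name describe_utf8_module is hypothetical:

```lua
-- Hypothetical helper: guess which "utf8" module a table is.
local function describe_utf8_module(mod)
  if type(mod) ~= "table" then
    return "not loaded"
  elseif type(mod.init) == "function" then
    return "library (has init)"
  elseif type(mod.char) == "function" and mod.charpattern ~= nil then
    return "built-in utf8 module"
  end
  return "unknown"
end

-- On Lua 5.3+ the built-in module is already registered under this name,
-- so require('utf8') returns it without searching package.path.
-- On Lua 5.1 and LuaJIT there is no built-in module of that name.
print(describe_utf8_module(package.loaded["utf8"]))
```

On Lua 5.3 and later, the leading dot in '.utf8' gives the module a name that differs from the built-in one, so the search through package.path actually takes place.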

Amalgam without require?

Hello,

I am trying to use this library in a Lua "sandbox" environment (within an application), where I have:

  • Lua 5.1 based interpreter,
  • No require,
  • No access to the native platform, so no ffi, no os, etc.
  • No external dependencies, unless I can also bring them in as pure Lua code and they work with 5.1 without any native platform or require access.

Multiple files are handled by a well-defined load order configured within the application, but I would also prefer to bring in everything "statically" into a single file.

What is the minimum set of .lua files from this project that I would need to make a working version of utf8.lua without any external dependencies? Should I include the modifiers or do they all have external dependencies?

My goal is to define a global utf8 table that contains all of the functionality of this library in a single file. I would prefer if it supports lazy init, because sometimes this code might be "loaded" but not actually executed, and if the initialization takes time, I would rather skip that init time when the utf8 code isn't needed.

I was able to use squish (a tool written in Lua that combines multiple Lua files into one) to generate an amalgamated file, but it still depends on require to do what you would expect because of utf8:require all throughout the code. squish is not smart enough to translate "through" the dynamic require wrapper in utf8; if they were static requires it would be able to follow them and amalgamate the code intelligently.

I was able to get all of the code to load (uninitialized) in my environment, so I know that the Lua interpreter in my environment was able to syntactically validate all of the amalgamated code from squish; but the problem was that as soon as it hits utf8:require's native require call-through, it dies because require is nil in this environment.

Any ideas how best to address this?
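
One possible direction, sketched under the assumption that utf8:require ultimately calls the global require with a module name: provide a small require replacement backed by a table of loader functions. Each source file is wrapped in one loader, and a loader only runs the first time its module is requested, which also gives lazy initialization. The module name "utf8.util" and the loader body are placeholders, not the library's real layout:

```lua
local preload, loaded = {}, {}

-- Minimal stand-in for require: run a registered loader once, cache the result.
local function mini_require(name)
  local cached = loaded[name]
  if cached ~= nil then return cached end
  local loader = preload[name]
  if not loader then
    error("module '" .. tostring(name) .. "' is not registered", 2)
  end
  local result = loader(name)
  if result == nil then result = true end   -- same convention as require
  loaded[name] = result
  return result
end

-- Each source file becomes one loader function (hypothetical module name).
local load_count = 0
preload["utf8.util"] = function()
  load_count = load_count + 1
  return { debug = function() end }
end

-- In the sandbox this could be installed as the global the library expects:
-- require = mini_require

local a = mini_require("utf8.util")
local b = mini_require("utf8.util")
print(a == b, load_count)   -- true	1
```

Whether this is enough depends on how the library builds the names it passes to require, which would need to be checked against the registered keys.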

lower() and upper() support for utf8data.lua

This is related to the closed issue lower() and upper() do not work on UTF-8 strings (Lua 5.1, Lua5.3)

In utf8.lua/README.md under Usage it says: "This library can be used as drop-in replacement for vanilla string library."

In utf8.lua/primitives/dummy.lua it says:

-- If utf8data.lua (containing the lower<->upper case mappings) is loaded, these
-- additional functions are available:
-- * utf8upper(s)
-- * utf8lower(s)

utf8data.lua mapping tables can be obtained from github.com/artemshein/luv and elsewhere.

There appear to be no utf8 drop-in replacements for string.upper or string.lower functions.
Even if require("utf8data") is used with a copy of utf8data.lua then no additional functions are available.

Earlier versions of the utf8 library had code such as the script below, which seems to offer the above functions but now appears to be missing.
See Id: utf8.lua 147 2007-01-04 00:57:00Z pasta lines 244 - 293
Is it possible to incorporate those functions back into the utf8 library, please?

-- replace UTF-8 characters based on a mapping table
local function utf8replace (s, mapping)
	-- argument checking
	if type(s) ~= "string" then
		error("bad argument #1 to 'utf8replace' (string expected, got ".. type(s).. ")")
	end
	if type(mapping) ~= "table" then
		error("bad argument #2 to 'utf8replace' (table expected, got ".. type(mapping).. ")")
	end

	local pos = 1
	local bytes = s:len()
	local charbytes
	local newstr = ""

	while pos <= bytes do
		charbytes = utf8charbytes(s, pos)
		local c = s:sub(pos, pos + charbytes - 1)

		newstr = newstr .. (mapping[c] or c)

		pos = pos + charbytes
	end

	return newstr
end

-- identical to string.upper except it knows about unicode simple case conversions
local function utf8upper (s)
	return utf8replace(s, utf8_lc_uc)
end

-- install in the string library
if not string.utf8upper and utf8_lc_uc then
	string.utf8upper = utf8upper
end

-- identical to string.lower except it knows about unicode simple case conversions
local function utf8lower (s)
	return utf8replace(s, utf8_uc_lc)
end

-- install in the string library
if not string.utf8lower and utf8_uc_lc then
	string.utf8lower = utf8lower
end

BTW:
The utf8replace function seems overly complex; its body could use the utf8.charpattern as below:

	local newstr = s:gsub( "([\1-\x7F\xC2-\xF4][\x80-\xBF]*)", mapping )
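
A self-contained sketch of that idea, with two caveats. Plain Lua 5.1 does not support \x escapes in string literals, so the pattern is written with decimal escapes here, and gsub returns two values, so the call is wrapped in parentheses to keep only the string. The tiny lc_uc table stands in for the real utf8data.lua mapping and covers just three characters:

```lua
-- One UTF-8 character: a head byte followed by any continuation bytes.
-- Decimal escapes keep this valid on Lua 5.1 as well as 5.3 and LuaJIT.
local CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Replace every character that has an entry in `mapping`;
-- gsub keeps the original text when the table lookup yields nil.
local function utf8replace(s, mapping)
  return (s:gsub(CHAR, mapping))
end

-- Illustrative stand-in for utf8_lc_uc (not the real data file).
local lc_uc = {
  ["\208\187"] = "\208\155",  -- л -> Л
  ["\208\190"] = "\208\158",  -- о -> О
  a = "A",
}

local upper = utf8replace("\208\187\208\190 a!", lc_uc)
print(upper == "\208\155\208\158 A!")   -- true
```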

Regular expressions can contain invalid utf8 byte sequences.

For now the library assumes that regular expressions and text strings are valid UTF-8 and, as an optimization, looks only at a character's head byte to determine where the next character begins.

It doesn't work with raw bytes. While the purpose of this library is to hide the underlying byte processing, this approach makes it incompatible with the vanilla string library.

I suppose that working with broken UTF-8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but I will provide insights on how it can be fixed.

One of the core functions of this library is utf8next. It takes a text and a byte index into it and returns the head byte index of the following UTF-8 character. It uses utf8charbytes, which works without validating the UTF-8 character.

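-- byte length of the character whose head byte is at position bs
local function utf8charbytes(str, bs)
  return head_table[byte(str, bs) or 256]
end
-- position of the head byte of the following character
local function utf8next(str, bs)
  return bs + utf8charbytes(str, bs)
end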
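
For readers without the library source at hand, here is a self-contained sketch of the same head-byte idea. It is not the library's code: char_bytes is a hypothetical stand-in for the head_table lookup, and its handling of unexpected bytes (advance by one) is an assumption.

```lua
-- Hypothetical stand-in for the head_table lookup: the length of a UTF-8
-- character judged from its head byte alone, with no validation.
local function char_bytes(str, bs)
  local b = str:byte(bs)
  if not b then return 1 end                     -- past the end: still advance
  if b >= 0xF0 and b <= 0xF4 then return 4 end
  if b >= 0xE0 and b <= 0xEF then return 3 end
  if b >= 0xC2 and b <= 0xDF then return 2 end
  return 1                                       -- ASCII, or a byte we do not check
end

local function next_char(str, bs)
  return bs + char_bytes(str, bs)
end

-- "a" (1 byte), "л" (2 bytes: 208 187), "€" (3 bytes: 226 130 172)
local s = "a\208\187\226\130\172"
local starts, pos = {}, 1
while pos <= #s do
  starts[#starts + 1] = pos
  pos = next_char(s, pos)
end
print(table.concat(starts, " "))   -- 1 2 4
```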

There is also a utf8validate function, which uses utf8validator as its iterator function.

local function utf8validate(str, byte_pos)
  local result = {}
  -- collect every position where the validator reports a broken sequence
  for nbs, bs, part, code in utf8validator, str, byte_pos do
    if bs then
      result[#result + 1] = { pos = bs, part = part, code = code }
    end
  end
  return #result == 0, result
end

utf8validator takes a text and a byte index into it and determines the supposed length of the UTF-8 character. Then it checks byte after byte and returns either the head byte position of the following UTF-8 character or the position of the byte that breaks the UTF-8 sequence. So utf8validator might be usable in place of utf8next as is (needs testing).

Next is configuration. I think it could be just a flag named something like utf8_valid_strings. utf8.next should be set according to the value of this flag:

utf8.next = utf8next
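
A rough, self-contained sketch of how such a flag could select between a fast and a checking step function. None of this is the library's code: fast_next, checked_next, and pick_next are hypothetical names, the return convention of checked_next (nil plus the offending byte position) is an assumption that differs from utf8validator, and overlong or surrogate encodings are not detected.

```lua
-- Expected character length from the head byte, or nil for an invalid head.
local function expected_len(b)
  if b < 0x80 then return 1 end
  if b >= 0xC2 and b <= 0xDF then return 2 end
  if b >= 0xE0 and b <= 0xEF then return 3 end
  if b >= 0xF0 and b <= 0xF4 then return 4 end
  return nil
end

-- Trusts the head byte only (strings are assumed to be valid UTF-8).
local function fast_next(str, bs)
  return bs + (expected_len(str:byte(bs) or 0) or 1)
end

-- Also checks that every following byte is a continuation byte (0x80..0xBF).
-- Returns the next head position, or nil and the position of the bad byte.
local function checked_next(str, bs)
  local b = str:byte(bs)
  local len = b and expected_len(b)
  if not len then return nil, bs end
  for i = bs + 1, bs + len - 1 do
    local c = str:byte(i)
    if not c or c < 0x80 or c > 0xBF then return nil, i end
  end
  return bs + len
end

-- The flag decides which function is used.
local function pick_next(utf8_valid_strings)
  if utf8_valid_strings then return fast_next end
  return checked_next
end

print(pick_next(true)("\208a", 1))    -- 3
print(pick_next(false)("\208a", 1))   -- nil	2
```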

Yet another incorrect matching case

local u = (require "stepets.utf8"):init()

local subj = "ab"
local patt = "a"
local repl = "%1"

print( string.gsub(subj, patt, repl) ) --> ab      1
print(      u.gsub(subj, patt, repl) )
   --> attempt to concatenate field '?' (a nil value)
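
For reference, the vanilla behaviour can be checked without the library: when the pattern has no captures, the string library treats %1 in the replacement as the whole match. A short sketch using only string.gsub:

```lua
-- No captures in the pattern: %1 stands for the whole match.
local r1, n1 = string.gsub("ab", "a", "%1")
print(r1, n1)   -- ab	1

local r2, n2 = string.gsub("ab", "a", "[%1]")
print(r2, n2)   -- [a]b	1

-- With an explicit capture the result is the same.
local r3, n3 = string.gsub("ab", "(a)", "[%1]")
print(r3, n3)   -- [a]b	1
```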

Error in Lua 5.3 with pattern longer than 1

The example does not work in Lua 5.3.
It seems any pattern longer than 1 character causes an error.
Testing with Lua 5.1 is fine.
The file is encoded in UTF-8.

local utf8 = require('.utf8'):init()
print(utf8.find('abc',"a"))--ok
print(utf8.find('abc',"ab"))--crash

Error message like:
C:\ProgramFiles\ZeroBraneStudio\bin\lua53.exe: ...gramFiles\ZeroBraneStudio\lualibs/\utf8\regex_parser.lua:77: [string "12"]:24: unexpected symbol near '.0'
stack traceback:
        [C]: in function 'assert'
        ...gramFiles\ZeroBraneStudio\lualibs/\utf8\regex_parser.lua:77: in function '.utf8.regex_parser'
        ...mFiles\ZeroBraneStudio\lualibs/\utf8\functions\lua53.lua:16: in function 'get_matcher_function'
        ...mFiles\ZeroBraneStudio\lualibs/\utf8\functions\lua53.lua:24: in function '.utf8.functions.lua53.find'
        ...sers\RobertLin\Documents\Lua\LuaAgain2015\PathSearch.lua:16: in main chunk
        [C]: in ?

Another incorrect matching case

local u = (require "stepets.utf8"):init()

local subj = "ab.?"
local patt = "%?"
local repl = "123"

print(      u.gsub(subj, patt, repl) ) --> 123a123b123.123?123     5
print( string.gsub(subj, patt, repl) ) --> ab.123  1
