rrthomas / recode

This project forked from pinard/recode

Charset converter tool and library

License: GNU General Public License v3.0

Makefile 1.25% Shell 16.96% Python 18.19% M4 0.44% Emacs Lisp 2.31% C 54.24% Lex 4.36% Cython 2.26%

recode's People

Contributors

github-cygwin, jpopelka, kugland, maelan, pinard, ppisar, rrthomas, shlomif, thetechrobo

recode's Issues

FSF address in copyright notice

I noticed that src/main.c prints the copyright notice including FSF's address for GNU GPLv3.

Just a minor thing: the GNU GPLv3 notice no longer includes the FSF's postal address; it was replaced with a link to the license on the GNU website. Maybe you want to do the same?

So, instead of:

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

there is now:

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

See: https://www.gnu.org/licenses/gpl-3.0.html#howto

Providing bootstrapped release sources

I wonder if it would be possible to also provide bootstrapped sources.

After the local directory is bootstrapped, make dist can be executed, which produces a tarball containing everything needed to run configure and make.

Once the release sources have been downloaded and extracted, the generated tarball does not require any additional sources to be downloaded.

Steps:

$ git clone https://github.com/rrthomas/recode.git && cd recode
$ ./bootstrap
$ ./configure
$ make
$ make dist
$ ls
... recode-3.7.tar.gz ...

Please let me know your thoughts. Having these sources at least alongside the existing ones (which don't include the gnulib sources) might be useful for Linux distributions.

UTF-32BE to UTF-8 conversion

What's the expected input? Most sites that I can find use something like...

$: echo "U00003072" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "U 00 00 30 72" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "00 00 30 72" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "00003072" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'
$: echo "" | recode UTF-32BE..UTF-8
recode: Invalid input in step 'UTF-32BE..UTF-8'

... but nothing works, as you can see.
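For what it's worth, the attempts above all pass ASCII text (hex digits plus a newline), whereas a UTF-32BE decoder expects the raw four-byte code units. A minimal Python sketch of the difference follows; it uses Python's own codecs rather than recode, so it illustrates the encoding, not recode's behavior.

```python
# U+3072 as UTF-32BE is the four raw bytes 00 00 30 72, not the text "00003072".
raw = bytes.fromhex("00003072")
text = raw.decode("utf-32-be")
utf8 = text.encode("utf-8")
print(utf8.hex(" "))  # e3 81 b2

# The ASCII string from the attempts above: 8 digit bytes plus the newline
# that echo appends. Its first four bytes (0x30303030) are not a valid
# code point, so a UTF-32BE decoder rejects it.
ascii_attempt = b"00003072\n"
try:
    ascii_attempt.decode("utf-32-be")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```

On the command line, the raw bytes could presumably be produced with something like printf '\000\000\060\162' | recode UTF-32BE..UTF-8 (not tested here).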

possible regression in 3.7

When running the tests for enca, I get 2 test failures (see nijel/enca#30), one of them in test-recode.sh.

The tests pass when I have recode-3.6 installed, but fail with recode-3.7.

Does not compile with glibc >= 2.28

Changes in glibc 2.28 break the build of this application:

make all-recursive
make[3]: Entering directory '/usr/src/RPM/BUILD/recode-3.7/lib'
make[4]: Entering directory '/usr/src/RPM/BUILD/recode-3.7/lib'
/bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I.. -O2 -g -pipe -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector --param=ssp-buffer-size=4 -D_REENTRANT -fPIC -MT fseterr.lo -MD -MP -MF .deps/fseterr.Tpo -c -o fseterr.lo fseterr.c
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -O2 -g -pipe -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector --param=ssp-buffer-size=4 -D_REENTRANT -fPIC -MT fseterr.lo -MD -MP -MF .deps/fseterr.Tpo -c fseterr.c -fPIC -DPIC -o .libs/fseterr.o
fseterr.c: In function ‘fseterr’:
fseterr.c:78:3: error: #error "Please port gnulib fseterr.c to your platform! Look at the definitions of ferror and clearerr on your system, then report this to bug-gnulib."
#error "Please port gnulib fseterr.c to your platform! Look at the definitions of ferror and clearerr on your system, then report this to bug-gnulib."
^~~~~

Rename fails in some circumstances on WSL

I am running Ubuntu on WSL.

I was running the following to find files with Windows-1252 encoding and convert them to UTF-8:

LC_ALL=C.UTF-8 find . -type f \( -name '*txt' -or -name '*html' -or -name '*htm' \) -exec grep -laxv '.*' {} + | xargs uchardet | grep WINDOWS-1252 | cut -d: -f1 | xargs -n 1 recode -t 'windows-1252..UTF-8'

Unfortunately, it seems that recode does not work properly with NTFS filesystems at all.

I ended up with hundreds of these messages:

recode: chmod (./path/path/rec4308.tmp): Operation not permitted
recode: chmod (./path/path/rec4309.tmp): Operation not permitted
recode: chmod (./path/path/rec4310.tmp): Operation not permitted
recode: chmod (./path/path/rec4311.tmp): Operation not permitted
recode: chmod (./path/path/rec4312.tmp): Operation not permitted
recode: chmod (./path/path/rec4313.tmp): Operation not permitted
recode: chmod (./path/path/rec4314.tmp): Operation not permitted
recode: chmod (./path/path/rec4315.tmp): Operation not permitted
recode: chmod (./path/path/rec4316.tmp): Operation not permitted
recode: chmod (./path/path/rec4317.tmp): Operation not permitted
recode: chmod (./path/path/rec4318.tmp): Operation not permitted
recode: chmod (./path/path/rec4319.tmp): Operation not permitted
recode: chmod (./path/path/rec4320.tmp): Operation not permitted
recode: chmod (./path/path/rec4321.tmp): Operation not permitted

All of the original files (and all filename information) are gone.

The .tmp files are in fact UTF-8, but there's no way to know what the original filenames were, so the files are effectively useless.

Even if I had the original filenames, there's no way to know which .tmp file corresponds to which original filename.

It's not uncommon for Linux programs to not work perfectly on NTFS, but I've never encountered anything this bad before.

I "lost" nearly 400 files, and it would have been more if I hadn't noticed the errors and aborted the job.

Here's an example using a single file:

$ file testfile
testfile: HTML document, ASCII text, with very long lines, with LF, NEL line terminators
$ uchardet testfile
WINDOWS-1250
$ recode -t 'windows-1250..UTF-8' testfile
recode: chmod (rec5087.tmp): Operation not permitted
$ ls testfile
ls: cannot access 'testfile': No such file or directory
$ ls *.tmp
rec5087.tmp
$ file *.tmp
rec5087.tmp: HTML document, UTF-8 Unicode text, with very long lines
$ uchardet *.tmp
WINDOWS-1250
$

With a single file it's not a big deal to rename the .tmp file back to the original filename (as long as you have the original filename) but when many files are affected it seems impossible to recover from, especially if you don't have the original filenames.

I verified that the same thing happens even without -t.

I verified that this happens on both NTFS and exFAT but does NOT happen on FAT32.

This issue might or might not be specific to WSL; a pure Linux system with an NTFS or exFAT filesystem mounted might behave differently, but I'm unable to test this.
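The failure mode described above (original gone, temporary file left behind after chmod fails) can be avoided by a replace-in-place sequence that treats a failed chmod as non-fatal and only drops the original through an atomic rename. A minimal Python sketch of that ordering follows. It is not recode's actual implementation, and `recode_in_place` and its parameters are hypothetical names used for illustration.

```python
import os
import shutil
import tempfile


def recode_in_place(path, src_encoding, dst_encoding):
    """Re-encode `path` without ever leaving it missing (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Convert first: if decoding fails, nothing on disk has been touched.
    with open(path, "rb") as f:
        converted = f.read().decode(src_encoding).encode(dst_encoding)

    # Write the converted bytes to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(converted)
        try:
            # Copy permission bits; some mounts (for example NTFS or exFAT)
            # may refuse this, which should not abort the job.
            shutil.copymode(path, tmp_path)
        except OSError:
            pass
        # Replace the original in one step; until this succeeds the
        # original file is still present under its own name.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Until something along these lines is in place, a cautious workaround is to run recode as a filter (reading standard input, writing standard output to a new file) and rename the result manually.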

librecode can hang for invalid conversions

When testing AnyMeal on Fedora 38, I found that the program below

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <recodext.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  char input_buffer[10] = "Äpfel";
  char *output_buffer;
  size_t output_length;
  size_t input_length = 6;
  size_t output_allocated;
  bool success;

  success = recode_scan_request (request, "latin1..ascii");
  request->verbose_flag = true;
  RECODE_TASK task = recode_new_task(request);
  task->input.buffer = input_buffer;
  task->input.cursor = input_buffer;
  task->input.limit = input_buffer + input_length;
  task->output.buffer = output_buffer;
  task->output.cursor = output_buffer;
  task->output.limit = output_buffer + output_allocated;
  printf("Starting task\n");
  success = recode_perform_task (task);
  printf("Task complete\n");
  output_buffer = task->output.buffer;
  output_length = task->output.cursor - task->output.buffer;
  output_allocated = task->output.limit - task->output.buffer;
  recode_delete_task (task);
  printf("%s\n",output_buffer);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}

hangs and does not print "Task complete". If one changes Äpfel to Apfel, it works. Attempting the same conversion from the command line, however, produces an error message saying that the input is untranslatable. This is with release 3.7.14.

Support for the ZOS_UNIX surface for EBCDIC encodings

For the end-of-line handling, the only documented surfaces so far are CR and CR-LF. (Doc node "Representation for end of lines")

The Unicode Standard https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf explains (section 5.8 "Newline Guidelines") that for EBCDIC encodings there are two end-of-line mapping conventions in use (see table 5-1):

This is the summary; more details in the thread that starts at https://lists.gnu.org/archive/html/bug-gnu-libiconv/2023-04/msg00002.html .

GNU libiconv now makes use of the concept and syntax of a recode "surface":

  • When an encoding such as IBM-1047 is specified (AFAIU, that's the default encoding for many people on z/OS), the newline 0x15 maps to U+0085.
  • When an encoding is specified as IBM-1047/ZOS_UNIX, the newline 0x15 maps to U+000A, and 0x25 maps to U+0085, as shown in table 5-1.

I would suggest that recode support the same surface ZOS_UNIX with the same name and the same semantics (swap 0x15 and 0x25).
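The proposed semantics amount to a byte-level swap layered on top of the regular EBCDIC table. A minimal Python sketch of that swap alone follows; `zos_unix_swap` is a hypothetical name, and the sketch does not implement the IBM-1047 table itself.

```python
# Identity table with EBCDIC bytes 0x15 and 0x25 exchanged,
# as described for the ZOS_UNIX surface.
_table = bytearray(range(256))
_table[0x15], _table[0x25] = 0x25, 0x15
SWAP_TABLE = bytes(_table)


def zos_unix_swap(data: bytes) -> bytes:
    """Exchange 0x15 and 0x25; applying it twice restores the input."""
    return data.translate(SWAP_TABLE)


sample = bytes([0x15, 0x25, 0x40])
print(zos_unix_swap(sample).hex(" "))  # 25 15 40
```

Because the swap is its own inverse, the same step would serve for both applying and removing the surface.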

To understand how this works in practice, with GNU libiconv, see this unit test:
https://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=blob;f=tests/check-ebcdic;h=62dfd61437d008af1f3f47ae69baeba692e01792;hb=19b6af5e5efe306bc1b2da87ba054b7391360ca2

BUILDSTDERR: clang-7: error: no such file or directory: './.libs/librecode.so'

Can't build it. Here are more logs
http://file-store.openmandriva.org/api/v1/file_stores/777c895b65d45922bd80dfc0c2e28f24edec286e.log?show=true

make[3]: Entering directory '/builddir/build/BUILD/recode-3.7.1/src'
/bin/sh ../libtool  --tag=CC   --mode=link /usr/bin/clang  -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4  -fPIC  -D_REENTRANT -fPIC --rtlib=compiler-rt -D_REENTRANT -fPIC  -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4  -fPIC  -D_REENTRANT -fPIC --rtlib=compiler-rt -Wl,-O2  -Wl,--no-undefined   -o recode main.o mixed.o librecode.la 
libtool: link: /usr/bin/clang -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fPIC -D_REENTRANT -fPIC --rtlib=compiler-rt -D_REENTRANT -fPIC -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fPIC -D_REENTRANT -fPIC --rtlib=compiler-rt -Wl,-O2 -Wl,--no-undefined -o .libs/recode main.o mixed.o  ./.libs/librecode.so
make[3]: Leaving directory '/builddir/build/BUILD/recode-3.7.1/src'
BUILDSTDERR: clang-7: error: no such file or directory: './.libs/librecode.so'
BUILDSTDERR: make[3]: *** [Makefile:1668: recode] Error 1
BUILDSTDERR: make[2]: *** [Makefile:2157: recode.1] Error 2
BUILDSTDERR: make[2]: *** Waiting for unfinished jobs....

tests fail on macOS when recode is not installed

The tests fail on macOS if recode has not been installed, because it can't find the just-built shared library. tests/Makefile sets LD_LIBRARY_PATH to the library build directory, which is great for non-Mac operating systems, but macOS uses DYLD_LIBRARY_PATH for that purpose.
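One way to picture the fix is a test runner that picks the loader variable by platform. The Python sketch below only illustrates that choice; `library_path_env` is a hypothetical helper, not part of recode's test suite, and the real change would presumably go into tests/Makefile.

```python
import os
import sys


def library_path_env(build_lib_dir, platform=sys.platform, environ=None):
    """Return a copy of the environment with the build directory prepended
    to the platform's shared-library search variable."""
    environ = dict(os.environ if environ is None else environ)
    # macOS consults DYLD_LIBRARY_PATH; Linux and most other Unix systems
    # consult LD_LIBRARY_PATH.
    var = "DYLD_LIBRARY_PATH" if platform == "darwin" else "LD_LIBRARY_PATH"
    existing = environ.get(var)
    if existing:
        environ[var] = build_lib_dir + os.pathsep + existing
    else:
        environ[var] = build_lib_dir
    return environ
```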

make clean does not remove files in tests/__pycache__

Hello. While trying to update the Debian package I found that

./configure; make; make check; make clean

leaves a bunch of unwanted files in tests/__pycache__ that should probably be removed
by the clean target.

Thanks.

ISO-10646-UCS-2 gives wrong endianness in 3.7

My familiar recode ..dump is broken with 3.7.

$ printf '\ua3' | recode -v ..dump
Request: UTF-8..:iconv:..ISO-10646-UCS-2..dump-with-names
Shrunk to: UTF-8..ISO-10646-UCS-2..dump-with-names
UCS2   Mne   Description

A300         syllabe yi nzup
$

The UCS2 column should be 00A3. It seems the problem is the byte order of ISO-10646-UCS-2 output. With 3.6:

$ printf A | recode UTF-8..ISO-10646-UCS-2 | sed -n l
\000A$

3.7:

$ printf A | recode UTF-8..ISO-10646-UCS-2 | sed -n l
A\000$

The Info manual for both versions says

By default, when producing an 'UCS-2' file, Recode always outputs the
high order byte before the low order byte.

That's my expectation, and dump's, and 3.6 does it. 3.7 doesn't and this means dump gives bogus results.
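The mix-up can be reproduced outside recode. A small Python sketch follows, using the UTF-16 codecs as a stand-in for UCS-2 (the two coincide for U+00A3 and other BMP code points outside the surrogate range).

```python
pound = "\u00a3"

big_endian = pound.encode("utf-16-be")     # high-order byte first, as documented
little_endian = pound.encode("utf-16-le")  # low-order byte first, as reported for 3.7

# A consumer that expects big-endian input but receives the little-endian
# bytes sees a different code point.
misread = int.from_bytes(little_endian, "big")

print(big_endian.hex(), little_endian.hex(), f"{misread:04X}")  # 00a3 a300 A300
```

This matches the A300 shown in the UCS2 column of the dump output above.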

Installing recode on termux in android

Hi rrthomas

Could you take a look at this issue with building recode on Termux (details here)? I'm trying to install tuxi. It looks like it cannot be installed without root permission. Is there any way to work around it?

compiling recode bleeding edge: autoconf tools seem to have blind spots

While reporting #31 a few hours ago, I noticed I could not compile recode on my Debian 10 (stable) system, which has very few development tools installed.

The bootstrap script detected that I was missing some tools, and I had to install the following packages to satisfy it:

m4 libtool help2man texinfo autopoint

However the build step still failed after that (error message is quoted in #31).

@rrthomas kindly explained to me that the empty command name should have been "msgfmt", which gave me the clue to install yet another Debian package:

 gettext

Now I get another error message, probably later in the build process:

Completing strip-data.c
Writing strip-pool.c
echo '#include "config.h"' > merged.c
(cd . && cat ascilat1.l iso5426lat1.l ansellat1.l ltexlat1.l btexlat1.l txtelat1.l ) \
| /usr/bin/python3 ./mergelex.py > merged.tm1
: -t -8 -Plibrecode_yy merged.tm1 > merged.tm2
grep -av '^# *line [0-9]' merged.tm2 >> merged.c
make[2]: *** [Makefile:2159: merged.c] Error 1
make[2]: Leaving directory '/tmp/recode/src'
make[1]: *** [Makefile:1509: all-recursive] Error 1
make[1]: Leaving directory '/tmp/recode'
make: *** [Makefile:1441: all] Error 2

so I believe I am missing yet another command or package.

May I suggest that the bootstrap script test for gettext and whatever else I am still missing at this point?

Error handling with //IGNORE (iconv)

(See #3.) The iconv(1) man page says:

If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.

This is indeed what happens with both iconv and recode, currently. Arguably, recode should not emit an error unless in --strict mode in this case. I think this means we want to ignore an EILSEQ return code from iconv unless we're in strict mode.
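One reading of that proposal, written out as a small decision function. This is a sketch of the policy only, in Python rather than recode's C, and `conversion_failed` and its parameters are hypothetical names.

```python
import errno


def conversion_failed(iconv_errno, ignore_requested, strict):
    """Decide whether an iconv error code should be reported as a failure.

    iconv_errno      -- errno left by the iconv call, or 0 on success
    ignore_requested -- True if //IGNORE was appended to the target encoding
    strict           -- True if recode runs in --strict mode
    """
    if iconv_errno == 0:
        return False
    # Proposed rule: with //IGNORE, an illegal-sequence error is only
    # reported when strict mode is on.
    if iconv_errno == errno.EILSEQ and ignore_requested and not strict:
        return False
    return True
```

Other error codes (for example EINVAL for a truncated sequence) would still be reported under this rule.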

Set soname from version

I'm upgrading recode in Fedora from 3.6 to 3.7.1, and I noticed that the librecode library removed the module_libiconv() function and that the public recode_outer structure lost some members. These changes break the ABI of the library, and hence the library should change its soname. However, that did not happen, and it is still "librecode.so.0".

Was the ABI change intentional? Would you mind changing the soname (-version-info argument in src/Makefile.am)? Or do you think that 3.7 has existed for so long that changing the soname now would be counterproductive?
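For context, on Linux (ELF) libtool derives the soname from the -version-info triple current:revision:age: the major number is current minus age. A small Python sketch of that mapping follows; `elf_soname` is a hypothetical helper, and the triples below are illustrative, not the values in src/Makefile.am.

```python
def elf_soname(name, current, revision, age):
    """Map a libtool -version-info triple to the soname and file name
    that libtool produces on Linux."""
    major = current - age
    soname = f"lib{name}.so.{major}"
    filename = f"lib{name}.so.{major}.{age}.{revision}"
    return soname, filename


# An incompatible ABI change calls for incrementing current and
# resetting revision and age to 0, which changes the soname.
print(elf_soname("recode", 0, 0, 0))  # ('librecode.so.0', 'librecode.so.0.0.0')
print(elf_soname("recode", 1, 0, 0))  # ('librecode.so.1', 'librecode.so.1.0.0')
```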

Old COPYING-LIB

I have two issues with COPYING-LIB:

The license text quotes a Free Software Foundation postal address that is no longer valid. The current one can be found at https://www.gnu.org/licenses/old-licenses/lgpl-2.1.txt. Please update the license wording to deliver an up-to-date address to your users.

The license file carries an LGPLv2 text, but there is actually no source file with that license declaration. E.g. src/ucs.c and src/recode.h are LGPLv3+. I recommend replacing the COPYING-LIB file with the LGPLv3 license text https://www.gnu.org/licenses/lgpl-3.0.txt. That will be more accurate and less confusing.

gcc-12.2.0 build warning

While packaging recode 3.7.14 for nixpkgs/unstable, building with -Werror and gcc-12.2.0 produces this:

recode> request.c: In function 'scan_charset':
recode> request.c:989:24: warning: dereference of NULL 'options_pointer' [CWE-476] [-Wanalyzer-null-dereference]
recode>   989 |       *options_pointer = charset_options; 
recode>       |       ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~                                                                                             
recode>   'recode_scan_request': events 1-2                                                                                                          
recode>     |                     
recode>     | 1146 | recode_scan_request (RECODE_REQUEST request, const char *string)                                                                
recode>     |      | ^~~~~~~~~~~~~~~~~~~                                  
recode>     |      | |                                                    
recode>     |      | (1) entry to 'recode_scan_request'
recode>     |......                                                       
recode>     | 1149 |     decode_request (request, string)                 
recode>     |      |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                 
recode>     |      |     |                                                
recode>     |      |     (2) calling 'decode_request' from 'recode_scan_request'                                                                     
recode>     |                                                                                                                                        
recode>     +--> 'decode_request': events 3-8                                                                                                        
recode>            |                                                      
recode>            | 1050 | decode_request (RECODE_REQUEST request, const char *string)                          
recode>            |      | ^~~~~~~~~~~~~~                                
recode>            |      | |                                             
recode>            |      | (3) entry to 'decode_request'     
recode>            |......                                                
recode>            | 1055 |   if (!ALLOC (request->scanned_string, strlen (string) + 1, char))       
recode>            |      |      ~
recode>            |      |      |
recode>            |      |      (4) following 'false' branch...
recode>            | 1056 |     return false;               
recode>            | 1057 |   request->sequence_length = 0;            
recode>            |      |   ~~~~~~~                                     
recode>            |      |   |
recode>            |      |   (5) ...to here                    
recode>            | 1058 |
recode>            | 1059 |   if (*request->scan_cursor)                  
recode>            |      |      ~                                        
recode>            |      |      |   
recode>            |      |      (6) following 'true' branch...
recode>            | 1060 |     {                                                                                                                    
recode>            | 1061 |       if (!scan_request (request))                                                                                       
recode>            |      |       ~~   ~~~~~~~~~~~~~~~~~~~~~~                                                                                        
recode>            |      |       |    |                                                                                                             
recode>            |      |       |    (8) calling 'scan_request' from 'decode_request'                     
recode>            |      |       (7) ...to here       
recode>            |                                                      
recode>            +--> 'scan_request': events 9-10   
recode>                   |                                               
recode>                   | 1011 | scan_request (RECODE_REQUEST request)
recode>                   |      | ^~~~~~~~~~~~
recode>                   |      | |
recode>                   |      | (9) entry to 'scan_request'
recode>                   |......
recode>                   | 1015 |   RECODE_SYMBOL charset = scan_charset (request, NULL, NULL, &options);
recode>                   |      |                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
recode>                   |      |                           |
recode>                   |      |                           (10) calling 'scan_charset' from 'scan_request'
recode>                   |                                               
recode>                   +--> 'scan_charset': events 11-16     
recode>                          |                                        
recode>                          |  902 | scan_charset (RECODE_REQUEST request,
recode>                          |      | ^~~~~~~~~~~~
recode>                          |      | |
recode>                          |      | (11) entry to 'scan_charset'
recode>                          |......
recode>                          |  916 |   if (!alias)
recode>                          |      |      ~                    
recode>                          |      |      |   
recode>                          |      |      (12) following 'false' branch (when 'alias' is non-NULL)...                                           
recode>                          |  917 |     return NULL; 
recode>                          |  918 |   charset = alias->symbol;
recode>                          |      |   ~~~~~~~    
recode>                          |      |   |   
recode>                          |      |   (13) ...to here
recode>                          |  919 |                                                                                                            
recode>                          |  920 |   if (before)
recode>                          |      |      ~                                                                                                     
recode>                          |      |      |                                                                                                     
recode>                          |      |      (14) following 'false' branch (when 'before' is NULL)...                                              
recode>                          |......                                                                                                             
recode>                          |  989 |       *options_pointer = charset_options;                                                                  
recode>                          |      |       ~                         
recode>                          |      |       |                                                                                                    
recode>                          |      |       (15) ...to here                                                                                      
recode>                          |......                                                                                                             
recode>                          |  993 |           if (!scan_unsurfacers (request))                                                                 
recode>                          |      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~                                 
recode>                          |      |                |                                                                                           
recode>                          |      |                (16) calling 'scan_unsurfacers' from 'scan_charset'                                         
recode>                          |
recode>                          +--> 'scan_unsurfacers': events 17-18
recode>                                 |
recode>                                 |  832 | scan_unsurfacers (RECODE_REQUEST request)
recode>                                 |      | ^~~~~~~~~~~~~~~~                                                                                    
recode>                                 |      | |                                                                                                   
recode>                                 |      | (17) entry to 'scan_unsurfacers'                      
recode>                                 |......                                                                                                      
recode>                                 |  861 |   if (surface && surface->unsurfacer)                                                               
recode>                                 |      |      ~                                                                                              
recode>                                 |      |      |                   
recode>                                 |      |      (18) following 'false' branch (when 'surface' is NULL)...                                      
recode>                                 |                                 
recode>                               'scan_unsurfacers': event 19
recode>                                 |     
recode>                                 |cc1:               
recode>                                 | (19): ...to here                
recode>                                 |                       
recode>                          <------+     
recode>                          |                                        
recode>                        'scan_charset': events 20-21
recode>                          |                                        
recode>                          |  993 |           if (!scan_unsurfacers (request))
recode>                          |      |              ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~
recode>                          |      |              | |
recode>                          |      |              | (20) returning to 'scan_charset' from 'scan_unsurfacers'
recode>                          |      |              (21) following 'true' branch...
recode>                          |
recode>                        'scan_charset': event 22                                                                                              
recode>                          |
recode>                          |cc1:                        
recode>                          | (22): ...to here
recode>                          |                                        
recode>                   <------+
recode>                   |                                               
recode>                 'scan_request': events 23-26
recode>                   |                                               
recode>                   | 1015 |   RECODE_SYMBOL charset = scan_charset (request, NULL, NULL, &options);
recode>                   |      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
recode>                   |      |                           |
recode>                   |      |                           (23) returning to 'scan_request' from 'scan_charset'
recode>                   | 1016 |                                        
recode>                   | 1017 |   if (!charset)
recode>                   |      |      ~                     
recode>                   |      |      |       
recode>                   |      |      (24) following 'false' branch (when 'charset' is non-NULL)...
recode>                   |......
recode>                   | 1020 |   if (request->scan_cursor[0] == '.' && request->scan_cursor[1] == '.')
recode>                   |      |   ~~ ~       
recode>                   |      |   |  |                   
recode>                   |      |   |  (26) following 'true' branch...
recode>                   |      |   (25) ...to here            
recode>                   |
recode>                 'scan_request': event 27                
recode>                   |
recode>                   |cc1:                                           
recode>                   | (27): ...to here      
recode>                   |          
recode>                 'scan_request': events 28-30
recode>                   |                                                                                                                          
recode>                   | 1021 |     while (request->scan_cursor[0] == '.' && request->scan_cursor[1] == '.')
recode>                   |      |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
recode>                   |      |                                           |
recode>                   |      |                                           (28) following 'true' branch...
recode>                   | 1022 |       {             
recode>                   | 1023 |         request->scan_cursor += 2;
recode>                   |      |         ~~~~~~~    
recode>                   |      |         |                   
recode>                   |      |         (29) ...to here
recode>                   | 1024 |         charset = scan_charset (request, charset, options, NULL);
recode>                   |      |                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~        
recode>                   |      |                   |
recode>                   |      |                   (30) calling 'scan_charset' from 'scan_request'        
recode>                   |                                               
recode>                   +--> 'scan_charset': events 31-36                                                                                          
recode>                          |                                        
recode>                          |  902 | scan_charset (RECODE_REQUEST request,
recode>                          |      | ^~~~~~~~~~~~  
recode>                          |      | |                     
recode>                          |      | (31) entry to 'scan_charset'
recode>                          |......                        
recode>                          |  916 |   if (!alias)    
recode>                          |      |      ~                                                                                                     
recode>                          |      |      |
recode>                          |      |      (32) following 'false' branch (when 'alias' is non-NULL)...
recode>                          |  917 |     return NULL;
recode>                          |  918 |   charset = alias->symbol;
recode>                          |      |   ~~~~~~~
recode>                          |      |   |
recode>                          |      |   (33) ...to here
recode>                          |  919 |
recode>                          |  920 |   if (before)
recode>                          |      |      ~
recode>                          |      |      |
recode>                          |      |      (34) following 'false' branch (when 'before' is NULL)...
recode>                          |......
recode>                          |  989 |       *options_pointer = charset_options;
recode>                          |      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
recode>                          |      |       |                |
recode>                          |      |       |                (36) dereference of NULL 'options_pointer'
recode>                          |      |       (35) ...to here
recode>                          |

If you have nix installed, this should be reproducible with:

git clone https://github.com/jcumming/nixpkgs.git
cd nixpkgs ; git checkout recode-3-7-14
nix build .#recode

recode-3.7.1 BUILDSTDERR: libtool: error: 'html.lo' is not a valid libtool object

Hi, I can't build recode on OpenMandriva Lx.

Full logs here http://file-store.openmandriva.org/api/v1/file_stores/d18b8a782db67b2b93c7ca101ecbd876ee80c32a.log?show=true

make[3]: Leaving directory '/builddir/build/BUILD/recode-3.7.1/src'
BUILDSTDERR: clang-7: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
make[3]: Entering directory '/builddir/build/BUILD/recode-3.7.1/src'
/bin/sh ../libtool  --tag=CC   --mode=link /usr/bin/clang  -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4  -fPIC  -D_REENTRANT -fPIC --rtlib=compiler-rt -D_REENTRANT -fPIC -version-info 0:0:0 -Os -gdwarf-4 -Wstrict-aliasing=2 -pipe -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4  -fPIC  -D_REENTRANT -fPIC --rtlib=compiler-rt -Wl,-O2  -Wl,--no-undefined   -o librecode.la -rpath /usr/lib64 charname.lo combine.lo fr-charname.lo iconv.lo names.lo outer.lo recode.lo request.lo strip-pool.lo task.lo african.lo afrtran.lo atarist.lo bangbang.lo cdcnos.lo ebcdic.lo ibmpc.lo iconqnx.lo lat1asci.lo lat1iso5426.lo lat1ansel.lo java.lo mule.lo strip-data.lo testdump.lo ucs.lo utf16.lo utf7.lo utf8.lo varia.lo vn.lo flat.lo html.lo lat1ltex.lo lat1btex.lo lat1txte.lo rfc1345.lo texinfo.lo base64.lo dump.lo endline.lo permut.lo quoted.lo      ../lib/libgnu.la libmerged.la 
make[3]: Leaving directory '/builddir/build/BUILD/recode-3.7.1/src'
BUILDSTDERR: libtool:   error: 'html.lo' is not a valid libtool object
BUILDSTDERR: make[3]: *** [Makefile:1664: librecode.la] Error 1
BUILDSTDERR: make[2]: *** [Makefile:2157: recode.1] Error 2
BUILDSTDERR: make[2]: *** Waiting for unfinished jobs....
make[2]: Entering directory '/builddir/build/BUILD/recode-3.7.1/src'

Conversion from java to utf-8 fails for certain characters.

Recoding a java-encoded file containing \u00dc (which corresponds to the character Ü) from java to utf-8 fails at the step utf16..utf8.

Steps to reproduce:

  1. Create a file containing just "\u00dc": echo '\u00dc' > myfile.
  2. Issue recode -v java..utf8 myfile.
  3. See it fail with Recoding myfile... failed: Invalid input in step 'UTF-16..UTF-8'.

The same thing happens if you create a file containing "Ü", recode it with UTF8..java (this works) and then recode it back again (this fails).

A workaround is rerouting over ISO-10646-UCS-2, which apparently was the default for UTF16..UTF8 in recode 3.6.

recode -v java..ISO-10646-UCS-2,ISO-10646-UCS-2..UTF8 myfile

Ü was the only character I could find that would fail here. ÄÖäöüß all work fine.
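
For reference, the bytes that java..utf8 is expected to produce for this input can be checked outside recode. The sketch below is not recode's implementation: it uses Python's unicode_escape codec as a stand-in for the java surface, it assumes the input is pure ASCII apart from the \uXXXX escapes, and the function name is hypothetical. Escaped surrogate pairs are not handled.

```python
def java_escapes_to_utf8(data: bytes) -> bytes:
    """Decode \\uXXXX escapes in ASCII input and return UTF-8 bytes.

    Assumption: the input is pure ASCII apart from the escapes. This relies
    on Python's unicode_escape codec, not on recode's java charset.
    """
    text = data.decode("unicode_escape")
    return text.encode("utf-8")


# Same content as the file created with: echo '\u00dc' > myfile
print(java_escapes_to_utf8(b"\\u00dc\n"))  # b'\xc3\x9c\n'
```

The two bytes 0xC3 0x9C are the UTF-8 encoding of Ü, which is what a successful run would be expected to write.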

Better documentation for //IGNORE with iconv

(copied from pinard#14)

Sometimes recode dies with 'Invalid input'. An --ignore-invalid flag would do whatever needed to skip over junk bytes in the input, recovering whatever valid text can be found. Of course, there is more than one way to decide what to skip when decoding a multibyte encoding, so it would have to pick something broadly sensible.

I'm not envisaging a fully specified decoding for all possible junk input sequences in all possible encodings, just a best effort to extract whatever usable text remains. For UTF-8, having just read an invalid byte sequence, it could discard the first byte of the sequence and try again.
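
The UTF-8 strategy described above (drop the first byte of an invalid sequence, then try again) could look roughly like this. It is only a sketch of the idea in Python, not a proposal for recode's C code, and the function name is hypothetical.

```python
def decode_utf8_skipping_junk(data: bytes) -> str:
    """Best-effort UTF-8 decode: on an invalid sequence, discard its first
    byte and resume decoding from the next byte."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice that was being decoded.
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            pos += err.start + 1  # skip exactly one junk byte
    return "".join(pieces)


print(decode_utf8_skipping_junk(bytes([128, 65, 206, 177, 66])))  # AαB
```

Skipping a single byte at a time keeps as much of the following text as possible, at the cost of re-scanning after every junk byte.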

Add URL encoding: https://en.wikipedia.org/wiki/URL_encoding

_AZ="激光, 這兩個字是甚麼意思"

_AZ=$(echo "${_AZ}" | recode html..utf-8)
echo "// __ $_AZ: |${_AZ}|"
// __ $_AZ: |激光, 這兩個字是甚麼意思|

_AZ="t%C3%AAte-%C3%A0-t%C3%AAte"
...
// __ $_AZ: |t%C3%AAte-%C3%A0-t%C3%AAte|
it should be: "tête-à-tête"

How do you make recode output UTF-8 regardless of the encoding of the input string (which should be easy to figure out from the patterns in the input string)?
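
To make the expected result for the percent-encoded example concrete, here is a small check using Python's standard library rather than recode. It only illustrates what a URL-encoding surface would need to produce; it says nothing about how recode would implement it.

```python
from urllib.parse import quote, unquote

encoded = "t%C3%AAte-%C3%A0-t%C3%AAte"

# Percent-decoding yields bytes, which are then interpreted as UTF-8.
decoded = unquote(encoded, encoding="utf-8")
print(decoded)  # tête-à-tête

# Encoding the text again gives back the original string.
assert quote(decoded) == encoded
```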

(recode 3.6) windows-1252: U+017E LATIN SMALL LETTER Z WITH CARON is at byte 0x9e, not byte 0x8f

Observed on recode 3.6 (Debian stable).

NB: I tried to reproduce this with bleeding edge recode but compilation fails at this step:

make[3]: Entering directory '/tmp/recode/po'
rm -f be.gmo && : -c --statistics --verbose -o be.gmo be.po
mv: cannot stat 't-be.gmo': No such file or directory
make[3]: *** [Makefile:164: be.gmo] Error 1
make[3]: Leaving directory '/tmp/recode/po'
make[2]: *** [Makefile:202: stamp-po] Error 2
make[2]: Leaving directory '/tmp/recode/po'
make[1]: *** [Makefile:1509: all-recursive] Error 1
make[1]: Leaving directory '/tmp/recode'

Wikipedia and various other online resources think U+017E is at byte 0x9e:

https://en.wikipedia.org/wiki/Windows-1252#Character_set

However, recode 3.6 thinks this character is at byte 0x8f and byte 0x9e is invalid:

$ perl -e 'print "\x8f\n"' | recode windows-1252..html
&#382;
$ perl -e 'print "\x9e\n"' | recode windows-1252..html
recode: Untranslatable input in step `CP1252..ISO-10646-UCS-2'

This document

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

found on the IANA website:

https://www.iana.org/assignments/charset-reg/windows-1252

has the following regarding those two bytes:

0x8F	      	#UNDEFINED
0x9E	0x017E	#LATIN SMALL LETTER Z WITH CARON

so I would be tempted to believe recode 3.6 is wrong on this.
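
As one more independent data point, Python's built-in cp1252 codec can be queried for the same two bytes. This is a quick cross-check against a different implementation, not a statement about recode's tables; the helper name is hypothetical.

```python
def cp1252_to_codepoint(byte_value: int):
    """Return the Unicode code point Python's cp1252 codec assigns to a
    byte, or None if the codec treats the byte as undefined."""
    try:
        return ord(bytes([byte_value]).decode("cp1252"))
    except UnicodeDecodeError:
        return None


for b in (0x8F, 0x9E):
    cp = cp1252_to_codepoint(b)
    print(hex(b), "undefined" if cp is None else f"U+{cp:04X}")
```

Python agrees with the Unicode mapping file quoted above: 0x8F is undefined and 0x9E maps to U+017E.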

Regards,

“Too many open files” when low open file ulimit

Recode fails with Too many open files when given more files than the soft limit on open file descriptors. Isn’t it possible to close the descriptors after recoding each file?

% ulimit -Sn 100
% seq 1 100 | xargs touch
% recode latin1..utf8 *   
pipe (): Too many open files
zsh: segmentation fault (core dumped)  recode latin1..utf8 *

Memory usage increases monotonically when recoding thousands of files because of this.
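
The behaviour being asked for is the usual "open, convert, close" loop, where at most one descriptor is held at any time. A minimal sketch of that pattern in Python (not recode's code; the function name and the default charsets are assumptions made for illustration):

```python
from pathlib import Path


def recode_files_one_at_a_time(paths, src="latin-1", dst="utf-8"):
    """Convert each file in place, closing it before moving on to the next."""
    for path in map(Path, paths):
        with open(path, "rb") as handle:
            text = handle.read().decode(src)
        # The read handle is closed here, before the file is reopened.
        with open(path, "wb") as handle:
            handle.write(text.encode(dst))
```

With this structure the number of files on the command line does not affect the number of open descriptors, so a low ulimit -Sn would not matter.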

Test suite / make check runs no tests

make check gives:

make[4]: Entering directory '/home/shlomif/Download/unpack/to-del/recode/tests'
make[5]: Entering directory '/home/shlomif/Download/unpack/to-del/recode/tests'
============================================================================
Testsuite summary for recode 3.7.1
============================================================================
# TOTAL: 0
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[5]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/tests'
make[4]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/tests'
make[3]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/tests'
make[2]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/tests'
make[1]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/tests'
Making check in contrib
make[1]: Entering directory '/home/shlomif/Download/unpack/to-del/recode/contrib'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode/contrib'
make[1]: Entering directory '/home/shlomif/Download/unpack/to-del/recode'
make[1]: Leaving directory '/home/shlomif/Download/unpack/to-del/recode'
[shlomif@telaviv1 recode]$ 

I'm on mageia v7 x86-64.

double free in recode_file_to_file

When using recode_file_to_file, a double free occurs because the function closes the file handles provided by the caller.

I think recode_file_to_file should not close file handles given by the caller.
Opening and closing them is the caller's job.

Regression introduced in 2.7.13 in 951bdbc

Reverting this commit fixes the issue.

Found when running PHP recode extension.

(gdb) bt
#0  0x00007ffff7621b94 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff75d0aee in raise () from /lib64/libc.so.6
#2  0x00007ffff75b987f in abort () from /lib64/libc.so.6
#3  0x00007ffff75ba60f in __libc_message.cold () from /lib64/libc.so.6
#4  0x00007ffff762bac5 in malloc_printerr () from /lib64/libc.so.6
#5  0x00007ffff762dea5 in _int_free () from /lib64/libc.so.6
#6  0x00007ffff76304ee in free () from /lib64/libc.so.6
#7  0x00007ffff7609cd7 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#8  0x00005555557aef19 in php_stdiop_close ()
#9  0x00005555557a9a07 in _php_stream_free ()
#10 0x00005555557339e3 in zif_fclose ()
#11 0x000055555586710a in execute_ex ()
#12 0x000055555586e6a0 in zend_execute ()
#13 0x00005555557fbc4b in zend_execute_scripts ()
#14 0x000055555579470a in php_execute_script ()
#15 0x00005555558e8306 in do_cli ()
#16 0x000055555563fac5 in main ()
(gdb) quit

The segfault happens when PHP closes the stream that it had opened before calling recode_file_to_file
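
The ownership convention argued for here (whoever opens a stream also closes it) can be sketched as follows. This is a Python analogy, not the librecode API: recode_stream is a hypothetical name, and the chunked decode assumes a single-byte source charset such as latin-1.

```python
import io


def recode_stream(src, dst, src_charset="latin-1", dst_charset="utf-8"):
    """Copy src to dst while converting the charset.

    The function never closes either stream; that is left to the caller.
    Assumption: src_charset is single-byte, so chunk boundaries are safe.
    """
    for chunk in iter(lambda: src.read(4096), b""):
        dst.write(chunk.decode(src_charset).encode(dst_charset))


# The caller opens the streams, so the caller closes them, exactly once.
src = io.BytesIO(b"caf\xe9")
dst = io.BytesIO()
recode_stream(src, dst)
assert not src.closed and not dst.closed
result = dst.getvalue()
src.close()
dst.close()
```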

Restore transliteration by iconv

This is Debian 10 (Buster):

$ recode --version
Free recode 3.6
Written by Franc,ois Pinard <[email protected]>.

Copyright (C) 1990, 92, 93, 94, 96, 97, 99 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ echo 'ľščť' | recode -v utf8..iso-8859-1
Request: UTF-8..:libiconv:..ISO-8859-1
Shrunk to: UTF-8..ISO-8859-1
lsct

and this is Arch Linux:

$ recode --version

recode 3.7.6
Written by François Pinard <[email protected]>.

Copyright (C) 1990-2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ echo 'ľščť' | recode -v utf8..iso-8859-1
Request: UTF-8..:iconv:..ISO-8859-1
Shrunk to: UTF-8..ISO-8859-1
recode: Untranslatable input in step `UTF-8..ISO-8859-1'

Now, I understand that the Debian maintainers patched version 3.6 heavily, but I would still like to learn where the inconsistency comes from.
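
For comparison, the "lsct" output above is what plain diacritic stripping gives. The sketch below reproduces it with Unicode decomposition in Python. This is a swapped-in technique for illustration, not how iconv or the patched recode 3.6 performs transliteration, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_diacritics("ľščť")
print(plain)                       # lsct
print(plain.encode("iso-8859-1"))  # b'lsct'
```

Without such a step, none of the four characters exists in ISO-8859-1, which matches the "Untranslatable input" error from 3.7.6.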

Migrate tests to python3

The package uses Python for its tests ('make check'). Is there any chance of moving the tests to Python 3 (or away from Python altogether, since recode does not otherwise need Python, as far as I can see)? At the moment they do not work with Python 3. Python 2 is reaching EOL in some Linux distributions, and running 'make check' in an automated build system is becoming problematic.

Recode produces invalid UTF-8 given invalid UTF-8

See #3. Recode versions 3.6, 3.7.9 and 3.7.11 all produce the same invalid output given invalid input:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | recode UTF-8..UTF-8
\200AαB

Since the behaviour is clearly not new it will require some study to see why it behaves as it does (is it a long-standing bug? or deliberate? or a deep-seated design problem?).
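
To pin down where the test input above is invalid, a strict decoder can report the offset of the first bad byte. The sketch uses Python's UTF-8 codec as the reference validator; the function name is hypothetical.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8,
    or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


sample = bytes([128, 65, 206, 177, 66])
print(first_invalid_utf8_offset(sample))      # 0
print(first_invalid_utf8_offset(sample[1:]))  # None
```

Only the leading byte 128 (0x80, a continuation byte with no lead byte) is invalid; the remaining bytes are the valid text "AαB".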

Don't use iconv by default

See #6.

Desired behavior: only use iconv if it's needed for a particular recoding. This will require some work. An easier half-way house is to not use iconv by default, and have a command-line switch to turn it on.

help2man error on mageia v7 x86-64

make[1]: Entering directory '/home/shlomif/recode'
Making all in doc
make[2]: Entering directory '/home/shlomif/recode/doc'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/shlomif/recode/doc'
Making all in lib
make[2]: Entering directory '/home/shlomif/recode/lib'
make  all-recursive
make[3]: Entering directory '/home/shlomif/recode/lib'
make[4]: Entering directory '/home/shlomif/recode/lib'
make[4]: Nothing to be done for 'all-am'.
make[4]: Leaving directory '/home/shlomif/recode/lib'
make[3]: Leaving directory '/home/shlomif/recode/lib'
make[2]: Leaving directory '/home/shlomif/recode/lib'
Making all in src
make[2]: Entering directory '/home/shlomif/recode/src'
make  recode
make[3]: Entering directory '/home/shlomif/recode/src'
make[3]: 'recode' is up to date.
make[3]: Leaving directory '/home/shlomif/recode/src'
if ( touch recode.1.w && rm -f recode.1.w; ) >/dev/null 2>&1; then \
  ../build-aux/missing --run /usr/bin/help2man --locale=en_US.UTF-8 \
        --name="converts files between character sets" \
        --output=recode.1 ./recode; \
fi
help2man: no locale support (Locale::gettext required)
`help2man' generates a man page out of `--help' and `--version' output.

Usage: help2man [OPTION]... EXECUTABLE

 -n, --name=STRING       description for the NAME paragraph
 -s, --section=SECTION   section number for manual page (1, 6, 8)
 -m, --manual=TEXT       name of manual (User Commands, ...)
 -S, --source=TEXT       source of program (FSF, Debian, ...)
 -L, --locale=STRING     select locale (default "C")
 -i, --include=FILE      include material from `FILE'
 -I, --opt-include=FILE  include material from `FILE' if it exists
 -o, --output=FILE       send output to `FILE'
 -p, --info-page=TEXT    name of Texinfo manual
 -N, --no-info           suppress pointer to Texinfo manual
 -l, --libtool           exclude the `lt-' from the program name
     --help              print this help, then exit
     --version           print version number, then exit

EXECUTABLE should accept `--help' and `--version' options and produce output on
stdout although alternatives may be specified using:

 -h, --help-option=STRING     help option string
 -v, --version-option=STRING  version option string
 --version-string=STRING      version string
 --no-discard-stderr          include stderr when parsing option output

Report bugs to <[email protected]>.
make[2]: *** [Makefile:2175: recode.1] Error 255
make[2]: Leaving directory '/home/shlomif/recode/src'
make[1]: *** [Makefile:1518: all-recursive] Error 1
make[1]: Leaving directory '/home/shlomif/recode'
make: *** [Makefile:1450: all] Error 2

after ./bootstrap and ./configure.
