
jdunck / python-unicodecsv


Python2's stdlib csv module is nice, but it doesn't support unicode. This module is a drop-in replacement which *does*. If you prefer python 3's semantics but need support in py2, you probably want https://github.com/ryanhiebert/backports.csv

License: Other

Python 100.00%

python-unicodecsv's Introduction

unicodecsv

unicodecsv is a drop-in replacement for Python 2.7's csv module which supports unicode strings without a hassle. Supported versions are Python 2.6, 2.7, 3.3, 3.4, 3.5, and PyPy 2.4.0.

More fully

Python 2's csv module doesn't easily deal with unicode strings, leading to the dreaded "'ascii' codec can't encode characters in position ..." exception.

You can work around it by encoding everything just before calling write (or just after read), but why not add support to the serializer?

>>> import unicodecsv as csv
>>> from io import BytesIO
>>> f = BytesIO()
>>> w = csv.writer(f, encoding='utf-8')
>>> _ = w.writerow((u'é', u'ñ'))
>>> _ = f.seek(0)
>>> r = csv.reader(f, encoding='utf-8')
>>> next(r) == [u'é', u'ñ']
True

Note that unicodecsv expects a bytestream, not unicode -- so there's no need to use codecs.open or similar wrappers. Plain open(..., 'rb') will do.
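The same round trip works against a real file on disk; a minimal sketch (the filename is hypothetical):

```python
import unicodecsv as csv

# Writing: open in binary mode ('wb'); unicodecsv encodes each row for you.
with open('example.csv', 'wb') as f:
    w = csv.writer(f, encoding='utf-8')
    w.writerow((u'é', u'ñ'))

# Reading: binary mode again -- no codecs.open wrapper needed.
with open('example.csv', 'rb') as f:
    r = csv.reader(f, encoding='utf-8')
    assert next(r) == [u'é', u'ñ']
```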

(Version 0.14.0 dropped support for python 2.6, but 0.14.1 added it back. See c0b7655248c4249 for the mistaken, breaking change.)

python-unicodecsv's People

Contributors

brent-hoover, dgilman, endeepak, erikrose, jdotjdot, jdunck, ludovic-gasc, nowells, pombredanne, ryanhiebert, svisser, toabctl



python-unicodecsv's Issues

DictReader doesn't work with io.StringIO (Python 2.7)

As described in this SO question, I am getting the following error with unicodecsv.DictReader:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)

Here's a simplified version of my code:

from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL

data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,[email protected]\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,[email protected]\r'
)

unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)

class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True

rows = DictReader(unicode_data, dialect=CustomDialect)

for row in rows:
    print row

If I replace StringIO with BytesIO, the encoding works, but I can no longer pass the newline argument, and then I get:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
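One workaround along those lines (a sketch with illustrative data, not the reporter's original): feed DictReader raw bytes via BytesIO and let unicodecsv do the decoding, normalizing the bare \r terminators by hand since BytesIO has no newline argument.

```python
from io import BytesIO
from unicodecsv import DictReader

# Illustrative data only; the reader decodes the UTF-8 bytes itself.
data = (
    b'first_name,last_name\r'
    b'Elmer,Fudd\r'
    b'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo\r'
)

# BytesIO has no newline= argument, so normalize \r to \n before handing
# the stream over -- this sidesteps the "new-line character seen" error.
stream = BytesIO(data.replace(b'\r', b'\n'))
names = [row[u'first_name'] for row in DictReader(stream, encoding='utf-8')]
```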

No isinstance unicode in Python3

The name unicode used in this isinstance check does not exist in Python 3:

    if isinstance(s, unicode):

Here's how I fixed it in my version, but I'm still not sure it's the best way to test for unicode, so I won't open a pull request unless someone can confirm it's correct.

# Example ONLY
if sys.version_info.major == 2:
    if isinstance(value, (str, unicode)):  # noqa
        return value

elif sys.version_info.major == 3:
    if isinstance(value, str):
        return value
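A version-agnostic sketch of the same idea (the names here are illustrative, not the library's actual code): alias the text type once at import time, then check against it.

```python
import sys

# On Python 3 the name `unicode` does not exist; define the text type once.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 -- Python 2 only

def ensure_text(value):
    """Return value unchanged if it is already a text string."""
    if isinstance(value, text_type):
        return value
    raise TypeError('expected a text string, got %r' % type(value))
```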

Python3 error

Traceback (most recent call last):
  File "/home/denis/Github/ACAPS-tools/parser.py", line 166, in <module>
    export('Guinea_Locations_Export.csv', ['google', 'bing', 'osm'])
  File "/home/denis/Github/ACAPS-tools/parser.py", line 124, in export
    writer.writeheader()
  File "/usr/local/lib/python3.4/dist-packages/unicodecsv/__init__.py", line 158, in writeheader
    self.writerow(header)
  File "/usr/lib/python3.4/csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/usr/local/lib/python3.4/dist-packages/unicodecsv/__init__.py", line 86, in writerow
    return self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
  File "/usr/local/lib/python3.4/dist-packages/unicodecsv/__init__.py", line 51, in _stringify_list
    return [_stringify(s, encoding, errors) for s in iter(l)]
  File "/usr/local/lib/python3.4/dist-packages/unicodecsv/__init__.py", line 51, in <listcomp>
    return [_stringify(s, encoding, errors) for s in iter(l)]
  File "/usr/local/lib/python3.4/dist-packages/unicodecsv/__init__.py", line 41, in _stringify
    if isinstance(s, unicode):
NameError: name 'unicode' is not defined

working with streams

I'm writing a MapReduce script (and thus am working with input/output streams).

If I use the unicodecsv module:

#!/usr/bin/python
import sys
import unicodecsv as csv


def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        print line

Then I get the error:

Traceback (most recent call last):
  File "scripts/streaming/adwords/mapper.py", line 30, in <module>
    mapper()
  File "scripts/streaming/adwords/mapper.py", line 10, in mapper
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
    row = self.reader.next()
_csv.Error: line contains NULL byte

If I read the file with pandas:

data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')

then everything works like a charm.

I don't know how to resolve this issue. I tried almost everything, even opening and re-saving the file (as UTF-8) with LibreOffice, but that can't be a solution because my csv files are too big for LibreOffice.

If I open/save the file with LibreOffice as UTF-8 and run the script again, the strings in the lines are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.

Preferably I want to read the (unicode, I guess) input stream, map it line by line (encoding it to UTF-8), and write it like writer.writerow((line[0] + line[2], line[5])) so that my reducer.py doesn't have to hassle with encodings.

Any help would be deeply appreciated.
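The "line contains NULL byte" error is characteristic of UTF-16 input (which the pandas call above confirms), since Python 2's csv machinery cannot scan NUL bytes. One workaround (a sketch; the mapper name and buffering strategy are illustrative) is to decode the UTF-16 stream and re-encode it as UTF-8 before handing it to unicodecsv:

```python
import codecs
import io

import unicodecsv as csv

def mapper(binary_in, binary_out):
    # Decode the UTF-16 bytes to text, then re-encode as UTF-8: the
    # underlying csv machinery chokes on the NUL bytes UTF-16 contains.
    text = codecs.getreader('utf-16')(binary_in).read()
    reader = csv.reader(io.BytesIO(text.encode('utf-8')),
                        delimiter='\t', encoding='utf-8')
    writer = csv.writer(binary_out, delimiter='\t', encoding='utf-8')
    for row in reader:
        writer.writerow(row)
```

In the streaming job you would pass sys.stdin and sys.stdout (on Python 3, their .buffer attributes). Note that this sketch buffers the whole input in memory, which may matter for very large files.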

unicodecsv is kind of slow; but maybe unavoidable?

Thank you so much for unicodecsv, it's been a big help for me in Python2. Not to sound ungrateful, but...

unicodecsv seems fairly slow. Some benchmarking suggests it's about 5-6x slower than the plain Py2 csv module. Of course it's doing more work, decoding bytes to strings! But for comparison the Py3 csv module (which does decoding) is only 2-3x slower than Py2. Is there room for improvement in unicodecsv?

I did some profiling and code reading and didn't see any obvious way unicodecsv could be made faster. So maybe there's no real way to optimize it. But wanted to file the issue both to document what I learned and get a second opinion.

My benchmark code and results are at https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/

Unbreak Python 2.6 after version 0.13.0?

pywikibot-core is currently using unicodecsv on Python 2.7 and 2.6, and dropping Python 2.6 broke it for us. It's not that much of a problem, as users could install an older version. But I wanted to ask whether you are interested in supporting 2.6 again by using OrderedDict from a third-party library like ordereddict or future. If you want, I could work on a pull request; there still seem to be a lot of Python 2.6 users, especially those using Red Hat Linux.

If you are not interested, you can just close this bug and we'll pin the version to 0.13.0 on Python 2.6.

python-unicodecsv 0.14.1 broke read/write on text-mode files on Py3

Hi guys,

Unit tests on the python-cliff project are failing in the py34 env. This is due to the recent upgrade to python-unicodecsv 0.14.1. When tested with python-unicodecsv 0.14.0 (from pip) they pass. The commit that broke it is c68ae77.

What's worrying is that the version bump to 0.14.1 (b5301ad) came before that commit. Using unicodecsv from the bump commit itself, the tests actually pass. So it looks like the build was made from the tip of master at some point.

Simple test:

(unicodecsv34)nlm:~/tmp$ pip install unicodecsv
You are using pip version 6.0.8, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting unicodecsv
  Using cached unicodecsv-0.14.1.tar.gz
Installing collected packages: unicodecsv
  Running setup.py install for unicodecsv
Successfully installed unicodecsv-0.14.1
(unicodecsv34)nlm:~/tmp$ 
(unicodecsv34)nlm:~/tmp$ python
Python 3.4.2 (default, Jan 12 2015, 12:13:20) 
[GCC 4.9.2 20150107 (Red Hat 4.9.2-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodecsv
>>> import sys
>>> writer = unicodecsv.writer(sys.stdout)
>>> writer.writerow(('a', 'b'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zarnovic/venv/unicodecsv34/lib/python3.4/site-packages/unicodecsv/py3.py", line 28, in writerow
    return self.writer.writerow(row)
  File "/home/zarnovic/venv/unicodecsv34/lib/python3.4/site-packages/unicodecsv/py3.py", line 15, in write
    return self.binary.write(string.encode(self.encoding, self.errors))
TypeError: must be str, not bytes
>>> 
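The underlying cause appears to be that, after that commit, unicodecsv on Python 3 encodes rows to bytes, so it needs a binary file object; sys.stdout is a text stream on Python 3. A sketch using an in-memory binary buffer (for real stdout, sys.stdout.buffer is the binary layer):

```python
import io

import unicodecsv

# unicodecsv encodes each row to bytes, so give it a binary-mode stream.
# For stdout on Python 3 that means sys.stdout.buffer, not sys.stdout.
buf = io.BytesIO()
writer = unicodecsv.writer(buf, encoding='utf-8')
writer.writerow(('a', 'b'))
assert buf.getvalue() == b'a,b\r\n'
```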

Does not handle None properly

In _stringify, if s is None, it is turned into str(None), i.e. the string 'None'. It should remain None instead.
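A guard along these lines (a sketch of the requested behaviour, not the library's actual code; stringify here is a hypothetical stand-in for _stringify) would let None pass through:

```python
def stringify(s, encoding='utf-8', errors='strict'):
    """Encode a value to bytes for the underlying csv writer.

    Unlike unconditionally calling str(s), this passes None through
    untouched, so empty cells stay empty instead of becoming 'None'.
    """
    if s is None:
        return None
    if isinstance(s, bytes):
        return s
    return str(s).encode(encoding, errors)

assert stringify(None) is None
assert stringify(u'é') == b'\xc3\xa9'
```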

Problem in Saving File

As shown in the README, how do I save f to a file, say "test.csv"? That is, how do I store the BytesIO() contents in a file? Please mention that in the README too.
Thanks in advance.

DictWriter.writerows throws AttributeError

Trying to use .writerows with DictWriter throws an error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "...", line 168, in makeCSV
    writer.writerows(rows)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 153, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 140, in _dict_to_list
    if self.extrasaction == "raise":
AttributeError: DictWriter instance has no attribute 'extrasaction'

I will try to look into this when I have more time tomorrow, and submit a patch.

Null character not supported

>>> import unicodecsv
>>> from cStringIO import StringIO
>>> f = StringIO('"\0"')
>>> f.seek(0)
>>> r = unicodecsv.reader(f, encoding='utf-8')
>>> row = r.next()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/unicodecsv/__init__.py", line 106, in next
    row = self.reader.next()
Error: line contains NULL byte

Pip can't find the source for unicodecsv

When I try to install unicodecsv I get the following error:

(rs3) chris@fen-desktop2(rs3)$ pip install -v unicodecsv
Downloading/unpacking unicodecsv
  Could not find any downloads that satisfy the requirement unicodecsv
No distributions at all found for unicodecsv

The log looks like this (abridged):

Downloading/unpacking unicodecsv
  Getting page http://pypi.python.org/simple/unicodecsv
  URLs to search for versions for unicodecsv:
  * http://pypi.python.org/simple/unicodecsv/
  Getting page https://github.com/jdunck/python-unicodecsv
  Analyzing links from page http://pypi.python.org/simple/unicodecsv/
    Skipping link https://github.com/jdunck/python-unicodecsv (from http://pypi.python.org/simple/unicodecsv/); not a file
  Analyzing links from page https://github.com/jdunck/python-unicodecsv
    Skipping link https://github.com/opensearch.xml (from https://github.com/jdunck/python-unicodecsv); unknown archive format: .xml
...
    Skipping link https://github.com/jdunck/python-unicodecsv.git (from https://github.com/jdunck/python-unicodecsv); unknown archive format: .git
    Skipping link git://github.com/jdunck/python-unicodecsv.git (from https://github.com/jdunck/python-unicodecsv); unknown archive format: .git
...
  Could not find any downloads that satisfy the requirement unicodecsv
No distributions at all found for unicodecsv

The problem seems to be that there are no links to .tar.gz files on that page (the project home page), and pip doesn't like to install from git without further configuration.

I think if you added a direct link to a .tar.gz release to the PyPI page at http://pypi.python.org/simple/unicodecsv/, that might be enough to enable pip to find and install the package?

As a workaround for anyone who needs to put this in their requirements.txt, you can add the following line instead:

git+git://github.com/jdunck/python-unicodecsv.git

Cheers, Chris.

Remove 2.6 from the classifiers

Just a minor note, but as you don't support 2.6 anymore, it would be nice if you could remove it from the classifiers in setup.py. Otherwise the PyPI page gives the impression you still support it.

Backport of Python 3's csv module

The working name for this module is xcsv. Currently, I plan to make it a pure-Python implementation. It'll be slow, but it'll be API-compatible with Python 3's csv module while also working on Python 2.

Work in Progress implementation at https://github.com/ryanhiebert/python-unicodecsv/blob/xcsv/unicodecsv/xcsv/py2.py. That's on the xcsv branch on my fork.

I copied the test suite from Python 3, and got it running on Python 2.6. Right now most of the tests except for tests involving the reader are passing under Python 2.6. What I'm planning to do with the reader is make a pure-python port of the C code from the _csv module in Python 3.


So the question is, should this be a separate package, or start out as a subpackage of unicodecsv? Either way, we could do the separate package later if it was appropriate, and unicodecsv can certainly have a dependency on it if it proves useful.

I think that this could be a solution to #59 , which breaks because of encodings having null bytes. However, it would also be significantly slower, since it would be a pure-python implementation. We'd probably want to watch for encodings that break the Python 2 version, and then swap out the slower implementation.

Example of how to open a file for reading

Hi,

Thanks for writing this.

It might be useful in your docs to show how to open a file for reading. I was opening the file using:

fp = codecs.open(csv_path, encoding='utf-8')

Evidently that's the wrong thing to do. When I opened it the normal way:

fp = open(csv_path, 'r')

everything worked.

Thanks,
Chuck

problem in encode string

>>> yelp_soup=BeautifulSoup(yelp_r.text,'html.parser')
>>> print(yelp_soup.prettify())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 33349: character maps to <undefined>
>>>

utf-8 file with BOM passes BOM as part of first header NAME

If my file/stream starts with the UTF-8 BOM (3 char \xef\xbb\xbf), it is passed through as part of the first header (or first data value on the first row).

Should unicodecsv handle this (remove it), or should the user sniff for and skip over it before instantiating the UnicodeReader class?

What do you think is the best way to handle this? FWIW, I'm using Python 2.7.
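In the meantime, decoding with the utf-8-sig codec consumes a leading BOM if one is present; a sketch (the data here is illustrative):

```python
import io

import unicodecsv

# \xef\xbb\xbf is the UTF-8 byte order mark; utf-8-sig strips it on decode.
stream = io.BytesIO(b'\xef\xbb\xbfname,value\r\nalpha,1\r\n')
reader = unicodecsv.DictReader(stream, encoding='utf-8-sig')
assert reader.fieldnames == [u'name', u'value']
```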

Receiving UnicodeDecodeError on CSV file

Hello, I am trying to go through a huge data file using machine learning software that requires the text to be in Unicode. I have been working on this problem for a number of hours and it is quite urgent. Here is my code:

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = unicodecsv.DictReader(csvfile)
    for row in reader:
        TEST_SENTENCES.append(row["Tweet"])
    for x in [TEST_SENTENCES]:
        codecs.encode(x, 'utf-8')

Here is the error I am receiving:

Traceback (most recent call last):
  File "C:\Users\pjame\Desktop\DeepMoji-master\examples\score_texts_emojis.py", line 25, in <module>
    for row in reader:
  File "C:\Python27\lib\site-packages\unicodecsv\py2.py", line 217, in next
    row = csv.DictReader.next(self)
  File "C:\Python27\lib\csv.py", line 108, in next
    row = self.reader.next()
  File "C:\Python27\lib\site-packages\unicodecsv\py2.py", line 128, in next
    for value in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 254: unexpected end of data

The error is always the same, and is always in the same position no matter what data I use. It seems like it could be a problem with the copying and pasting of large amounts of data into the spreadsheet, but I am not sure.
Does anyone have any idea how I can get around this error? I see that for some reason csv is being called in one of the error messages, but I am not sure why. It could be in the parts of the file I did not write which can be found here: (https://github.com/bfelbo/DeepMoji/blob/master/examples/score_texts_emojis.py)

If anyone has any ideas, or fixes to this, I would be so so appreciative. Even just being able to find and delete the lines in the CSV file that are errors would be super useful.

Unsupported encodings?

I wanted to parse a UTF-16 CSV file, so I did something like this:

r = unicodecsv.reader(f, encoding='UTF-16')

Unfortunately, this just raises an exception when I try to read from it. I looked at the unicodecsv source code, and I don't think the unicodecsv approach can ever work for this case: it loads the input stream as 8-bit characters and then decodes each cell value. Python's csv module can't handle NUL bytes, which are common in UTF-16, so this fails.

I think the answer is that the unicodecsv library only works for encodings like UTF-8 or Latin-1 which are supersets of ASCII and don't use 0x00 bytes. Is this true? We should put it in the documentation.

(Also, I think this means I should really upgrade to Python 3!)
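Indeed, on Python 3 the stdlib csv module handles this case directly once the stream is decoded at the io layer; a self-contained sketch:

```python
import csv
import io

# Python 3's csv operates on text, so any codec the io layer supports is
# fine -- including UTF-16 with its embedded NUL bytes.
raw = io.BytesIO(u'a,b\r\nc,d\r\n'.encode('utf-16'))
text = io.TextIOWrapper(raw, encoding='utf-16', newline='')
rows = list(csv.reader(text))
assert rows == [['a', 'b'], ['c', 'd']]
```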

First quoted fieldname keeps quotes in key

A UTF-8 csv file (in my case: utf-8-with-signature-dos, according to emacs) with quoted header fields like

"column1", "column2", "column3"

generates a fieldnames list like

[u'"column1"', u'column2', u'column3']

I would expect all of the field names to have the quote character stripped.

I'm guessing this is due to the BOM being present and messing up the process that reads the headers.

Choking on unknown encoding

I am not sure if this is actually a problem in python-unicodecsv or if I am unreasonably expecting it to work with what I am feeding it.

I have a CSV file (exported by MS Excel on Windows). It happens to have some weirdly-encoded text in it on one line.

I loop over the rows in the file:

rows = unicodecsv.reader(csv_file)
for row in rows:
    ...

As soon as we get to the line with the offending characters:

Environment:


Request Method: POST
Request URL: http://v029.medcn.uwcm.ac.uk:8003/upload/

Django Version: 1.4.10
Python Version: 2.7.2
Installed Applications:
('arkestra_utilities',
 'cms',
 'menus',
 'cms.plugins.text',
 'cms.plugins.snippet',
 'sekizai',
 'contacts_and_people',
 'vacancies_and_studentships',
 'news_and_events',
 'links',
 'arkestra_utilities.widgets.combobox',
 'arkestra_image_plugin',
 'video',
 'housekeeping',
 'publications',
 'symplectic',
 'arkestra_clinical_studies',
 'polymorphic',
 'semanticeditor',
 'mptt',
 'easy_thumbnails',
 'typogrify',
 'filer',
 'widgetry',
 'south',
 'form_designer',
 'form_designer.contrib.cms_plugins.form_designer_form',
 'treeadmin',
 'inspector',
 'django_easyfilters',
 'pagination',
 'debug_toolbar',
 'inspector',
 'chained_selectbox',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'django.contrib.admin',
 'django.contrib.admindocs',
 'django.contrib.humanize',
 'django.contrib.staticfiles',
 'django.contrib.redirects',
 'django.contrib.markup')
Installed Middleware:
(u'debug_toolbar.middleware.DebugToolbarMiddleware',
 'django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'cms.middleware.page.CurrentPageMiddleware',
 'cms.middleware.user.CurrentUserMiddleware',
 'cms.middleware.toolbar.ToolbarMiddleware',
 'django.contrib.redirects.middleware.RedirectFallbackMiddleware',
 'pagination.middleware.PaginationMiddleware')


Traceback:
File "/home/daniele/dev-14-05-21/local/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  111.                         response = callback(request, *callback_args, **callback_kwargs)
File "/home/daniele/dev-14-05-21/local/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
  20.                 return view_func(request, *args, **kwargs)
File "/home/daniele/dev-14-05-21/src/arkestra-publications/publications/views.py" in upload
  81.             for row in rows:
File "/home/daniele/dev-14-05-21/local/lib/python2.7/site-packages/unicodecsv/__init__.py" in next
  112.                  unicode_(value, encoding, encoding_errors)) for value in row]

Exception Type: UnicodeDecodeError at /upload/
Exception Value: 'utf8' codec can't decode byte 0x91 in position 52: invalid start byte

The standard csv library handles the file without difficulty. The problem text is ‘app’ (those are left and right single quotation marks).
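For what it's worth, byte 0x91 is the left curly quote in Windows-1252, which suggests the file is cp1252 rather than UTF-8. A sketch of two ways out, assuming the reader's encoding and errors parameters (the latter shows up in the traceback above as encoding_errors); the data is illustrative:

```python
import io

import unicodecsv

raw = b'name\r\n\x91app\x92\r\n'  # \x91/\x92: curly quotes in cp1252

# Option 1: tell the reader the file's real encoding.
rows_cp1252 = list(unicodecsv.reader(io.BytesIO(raw), encoding='cp1252'))
assert rows_cp1252[1] == [u'\u2018app\u2019']

# Option 2: keep utf-8 but degrade gracefully on undecodable bytes.
rows_replace = list(unicodecsv.reader(io.BytesIO(raw),
                                      encoding='utf-8', errors='replace'))
assert rows_replace[1] == [u'\ufffdapp\ufffd']
```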

DictReader chokes on Python3

Hi there,

I installed 0.10.1 with:

$ pip install https://github.com/jdunck/python-unicodecsv/zipball/master#egg=unicodecsv-0.10.1

This code instantiating DictReader works fine on 2.7.5 but produces an error on Python >= 3:

from __future__ import print_function
from __future__ import unicode_literals
import unicodecsv
from unicodecsv import DictReader
import sys
print(sys.version)
try:
    import StringIO as io
except ImportError:
    pass  # Python >= 3
from io import StringIO
d = DictReader(StringIO('a,b,c\n1,2,3'))
print('yay')

Here is the output:

$ python test.py 
3.3.3 (default, Jan 14 2014, 20:39:06) 
[GCC 4.6.3]
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    d = DictReader(StringIO('a,b,c\n1,2,3'))
  File "~/.pyenv/versions/3.3.3/lib/python3.3/site-packages/unicodecsv-0.10.1-py3.3.egg/unicodecsv/__init__.py", line 187, in __init__
  File "~/.pyenv/versions/3.3.3/lib/python3.3/csv.py", line 96, in fieldnames
    self._fieldnames = next(self.reader)
TypeError: 'UnicodeReader' object is not an iterator

Is anyone able to reproduce this?

Failed to install on windows

Python 3.7.2 Windows x86 embeddable zip file
Windows 10
pip 19.0.1

I have already discovered that the failure occurs because the unicodecsv package directory is not in sys.path, but I do not know how to fix the installation.

unicodecsv was declared inside a requirements.txt file.

Installation command:

python.exe -m pip install -r requirements.txt --no-cache-dir

Collecting unicodecsv (from -r requirements.txt (line 74))
  Using cached https://files.pythonhosted.org/packages/6f/a4/691ab63b17505a26096608cc309960b5a6bdf39e4ba1a793d5f9b1a53270/unicodecsv-0.14.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\genio\AppData\Local\Temp\pip-install-ckob5edl\unicodecsv\setup.py", line 5, in <module>
        version = __import__('unicodecsv').__version__
    ModuleNotFoundError: No module named 'unicodecsv'

Any idea how to solve it?

csv reader behaves different for python 2.6 and 2.7

There is a record with an extra quote in it:

rec = u'3,Abhinav,3000,"કેમ છો""'

Running it through the reader:

result = csv.reader([rec.encode("utf-8")], delimiter=',').reader.next()

gives:

In Python 2.7.5: [3,Abhinav,3000,'"કેમ છો""']
In Python 2.6.6: an exception - newline inside string

Calling writer.writerow on a list of strings in python 3.5 results in TypeError: string argument expected, got 'bytes'

> /home/ben/.envs/rss/lib/python3.5/site-packages/unicodecsv/py3.py(29)writerow()
     28         import ipdb; ipdb.set_trace()
---> 29         return self.writer.writerow(row)
     30 

ipdb> p row
['http://localhost:10081/content/HiMFYeCsAfJxP9Jc/allianceearth.org', 'South African team may have solved solar puzzle even Google couldn’t crack', 'ARL', 'ephemeral', 'rss_feeder', '2015-11-30 21:53:07 ', '', '']
ipdb> n
TypeError: string argument expected, got 'bytes'
> /home/ben/.envs/rss/lib/python3.5/site-packages/unicodecsv/py3.py(29)writerow()
     28         import ipdb; ipdb.set_trace()
---> 29         return self.writer.writerow(row)
     30 

ipdb> pinfo self.writer.writerow
Docstring:
writerow(iterable)

Construct and write a CSV record from an iterable of fields.  Non-string
elements will be converted to string.
Type:      builtin_function_or_method
ipdb> [type(x) for x in row]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]

To make extra certain that all of the items in the row I was submitting were unicode, I also called the following function on each of them:

def to_unicode(v, encoding='utf8'):
    """
    Convert a value to Unicode string (or just string in Py3). This function
    can be used to ensure string is a unicode string. This may be useful when
    input can be of different types (but meant to be used when input can be
    either bytestring or Unicode string), and desired output is always Unicode
    string.
    The ``encoding`` argument is used to specify the encoding for bytestrings.
    """
    if isinstance(v, str):
        return v
    try:
        return v.decode(encoding)
    except (AttributeError, UnicodeEncodeError):
        return str(v)

I'm not sure if I'm providing an incorrect argument, but I'd be happy to provide any further information to help determine what exactly is going wrong.

How do I put the filename ???

Hello,

The README.MD contains a simple sample.

But where/how do I choose the filename and folder?

>>> import unicodecsv as csv
>>> from io import BytesIO
>>> f = BytesIO()
>>> w = csv.writer(f, encoding='utf-8')
>>> _ = w.writerow((u'é', u'ñ'))
>>> _ = f.seek(0)
>>> r = csv.reader(f, encoding='utf-8')
>>> next(r) == [u'é', u'ñ']
True

Receive error when accessing the 'fieldnames' attribute of unicodecsv.DictReader

>>> import unicodecsv
>>> f = open('/home/brent/Projects/pyprojects/pmtool/csv/products.csv', 'r')
>>> pr = unicodecsv.DictReader(f, encoding='utf_8', dialect='excel')
>>> pr.fieldnames
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/csv.py", line 88, in fieldnames
    if self._fieldnames is None:
AttributeError: DictReader instance has no attribute '_fieldnames'

LICENSE file is missing from the source tarball

[root@2562116f031a feedstocks]# curl -LO https://pypi.io/packages/source/u/unicodecsv/unicodecsv-0.14.1.tar.gz                                                                                                     
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   122    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   281  100   281    0     0   1196      0 --:--:-- --:--:-- --:--:--  3649
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 10267  100 10267    0     0  19546      0 --:--:-- --:--:-- --:--:-- 19546

[root@2562116f031a feedstocks]# tar -tf unicodecsv-0.14.1.tar.gz 
unicodecsv-0.14.1/
unicodecsv-0.14.1/MANIFEST.in
unicodecsv-0.14.1/PKG-INFO
unicodecsv-0.14.1/README.rst
unicodecsv-0.14.1/setup.cfg
unicodecsv-0.14.1/setup.py
unicodecsv-0.14.1/unicodecsv/
unicodecsv-0.14.1/unicodecsv/__init__.py
unicodecsv-0.14.1/unicodecsv/py2.py
unicodecsv-0.14.1/unicodecsv/py3.py
unicodecsv-0.14.1/unicodecsv/test.py
unicodecsv-0.14.1/unicodecsv.egg-info/
unicodecsv-0.14.1/unicodecsv.egg-info/dependency_links.txt
unicodecsv-0.14.1/unicodecsv.egg-info/PKG-INFO
unicodecsv-0.14.1/unicodecsv.egg-info/SOURCES.txt
unicodecsv-0.14.1/unicodecsv.egg-info/top_level.txt

Unicode input raises UnicodeEncodeError exception

>>> import unicodecsv
>>> unicodecsv.reader([u"Lowis,Löwis"]).next()

/tmp/python-unicodecsv/unicodecsv/__init__.pyc in next(self)
    107 
    108     def next(self):
--> 109         row = self.reader.next()
    110         encoding = self.encoding
    111         encoding_errors = self.encoding_errors

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 7: ordinal not in range(128)

The possible workaround is to encode input unicode into string:

>>> unicodecsv.reader([u"Lowis,Löwis".encode('utf8')]).next()
[u'Lowis', u'L\xf6wis']

Reader not iterable on Python3

This works on Python 2 and with the built-in module.

Python 3.4.3 (default, Mar 25 2015, 17:13:50) 
[GCC 4.9.2 20150304 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodecsv
>>> import io
>>> r = unicodecsv.reader(io.BytesIO(b"hello,world"))
>>> list(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type 'UnicodeReader'

0.9.4 on pypi and github (tag) are different

It seems the tar.gz files from PyPI and from the GitHub tag for version 0.9.4 are different: their sha256sums differ, and the release on PyPI doesn't contain the runtests program, so when I try to run the tests I get:

./setup.py test
running test
running egg_info
writing unicodecsv.egg-info/PKG-INFO
writing top-level names to unicodecsv.egg-info/top_level.txt
writing dependency_links to unicodecsv.egg-info/dependency_links.txt
reading manifest file 'unicodecsv.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'unicodecsv.egg-info/SOURCES.txt'
running build_ext
Traceback (most recent call last):
  File "./setup.py", line 26, in <module>
    'Programming Language :: Python :: Implementation :: CPython',],
  File "/usr/lib/python2.7/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 138, in run
    self.with_project_on_sys_path(self.run_tests)
  File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 118, in with_project_on_sys_path
    func()
  File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 164, in run_tests
    testLoader = cks
  File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python2.7/unittest/main.py", line 149, in parseArgs
    self.createTests()
  File "/usr/lib/python2.7/unittest/main.py", line 158, in createTests
    self.module)
  File "/usr/lib/python2.7/unittest/loader.py", line 128, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python2.7/unittest/loader.py", line 91, in loadTestsFromName
    module = __import__('.'.join(parts_copy))
ImportError: No module named runtests

Sniffer returns unicode but Reader expects bytes

Hi,

unicodecsv.Sniffer returns unicode delimiters when fed a unicode string, but that makes unicodecsv.reader choke with this message:

>       self.reader = csv.reader(f, dialect, **kwds)
E       TypeError: "delimiter" must be string, not unicode

Here is a sample test that shows the problem:

import io
import unicodecsv
from StringIO import StringIO  # Python 2: accepts unicode input

def testUnicodeDelimiters(tmpdir):
    csv_filename = str(tmpdir / u'input.csv')

    with io.open(csv_filename, u'wb') as csv_file:
        csv_writer = unicodecsv.writer(csv_file, encoding=u'utf-8')
        csv_writer.writerow([u'Sandstone, no scavenging', u'4.4', u'3.3'])
        csv_writer.writerow([u'Sandstone, low scavenging', u'5.5', u'6.6'])

    with io.open(csv_filename, u'r', encoding=u'utf-8') as csv_file:
        data = csv_file.read()

    sniffer = unicodecsv.Sniffer()
    dialect = sniffer.sniff(data)
    #dialect.delimiter = bytes(dialect.delimiter)
    #dialect.quotechar = bytes(dialect.quotechar)
    reader = unicodecsv.reader(StringIO(data), dialect=dialect)
    contents = list(reader)
    assert contents == [
        [u'Sandstone, no scavenging', u'4.4', u'3.3'],
        [u'Sandstone, low scavenging', u'5.5', u'6.6'],
    ]
    assert {type(s) for s in contents[0]} == {unicode}
    assert {type(s) for s in contents[1]} == {unicode}

It seems the problem only happens when the delimiter character appears inside one of the input strings ("Sandstone, no scavenging").

Please note that uncommenting the two commented-out lines above makes the test pass.

Kind Regards,
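
For comparison, on Python 3, where unicodecsv delegates to the stdlib and everything is text, the sniff-then-read round trip this test expects works directly. A minimal stdlib-only sketch:

```python
import csv
from io import StringIO

# Write two rows whose first field contains the delimiter character,
# the case the report above identifies as problematic.
buf = StringIO()
w = csv.writer(buf)
w.writerow(['Sandstone, no scavenging', '4.4', '3.3'])
w.writerow(['Sandstone, low scavenging', '5.5', '6.6'])
data = buf.getvalue()

# Sniff the dialect from the text and read it back with the same types.
dialect = csv.Sniffer().sniff(data)
rows = list(csv.reader(StringIO(data), dialect=dialect))
```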

Failing to detect header row on unicode csv file

Using the Sniffer class to detect whether a file contains a header row fails with:

has_header = unicodecsv.Sniffer().has_header(csvfile.read(4096))
Error: line contains NULL byte

This is the same error that the csv module from the standard library throws.

Python 3 - Writing "ç" converted to ç in csv file

list_data = ['ë ï', 'Française de Mécanique', 'ü', 'ç']
f = open('C:\\MFiles\\M.csv', 'wb')
cs = csv.writer(f, delimiter='§', encoding='utf-8')
cs.writerow(list_data)
f.close()

When I look into M.csv I see the text below:

  1. ë ï§
  2. Franç
  3. aise de Mécanique§
  4. ü§
  5. ç

But the list I wrote has only 4 strings.

I want the ç character to be written as ç instead of turning into §, because the delimiter is also §.

Please help me figure out how to fix this.
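
What is likely happening here is an encoding clash rather than a csv bug: in UTF-8, 'ç' is the bytes 0xC3 0xA7 and the delimiter '§' is 0xC2 0xA7, so they share the trailing byte 0xA7 (which is also '§' in Latin-1). Any tool that splits the raw file on that byte, or views it as Latin-1, will cut 'ç' in half, which matches the Franç/aise split above. A quick demonstration:

```python
# In UTF-8, 'ç' and the delimiter '§' share their final byte 0xA7,
# which reads as '§' when the file is viewed as Latin-1.
assert 'ç'.encode('utf-8') == b'\xc3\xa7'
assert '§'.encode('utf-8') == b'\xc2\xa7'
assert b'\xa7'.decode('latin-1') == '§'

# Splitting the raw bytes on 0xA7 cuts 'Française' inside the 'ç':
parts = 'Française de Mécanique'.encode('utf-8').split(b'\xa7')
print(parts)
```

In other words, the file content itself is likely fine; opening it as UTF-8 (and avoiding a delimiter whose encoding overlaps the data's bytes) shows the intended characters.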

unicodecsv.DictReader still throws "UnicodeEncodeError: 'ascii' " with unicode StringIO

Here's some sample code and the error I'm getting:

import unicodecsv as csv
import codecs
from StringIO import StringIO

handle = codecs.open('a_utf8_encoded_file.txt', 'r', 'utf-8')
data = handle.read()
handle.close()

f = StringIO(data)
reader = csv.DictReader(f, delimiter=',')

for line in reader:
  print line

And the error I get:

Traceback (most recent call last):
  File "create_and_refresh_test_copy.py", line 125, in <module>
    createTests(gauth,formFileName,localTestName,localKeyName,remoteTestName,remoteKeyName)
  File "create_and_refresh_test_copy.py", line 60, in createTests
    tc = TestCreator(temp_csv_name)
  File "/home/matt/class_practice_tests/testcreator.py", line 27, in __init__
    self._parseFileString(csvFile)
  File "/home/matt/class_practice_tests/testcreator.py", line 56, in _parseFileString
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 218, in next
    row = csv.DictReader.next(self)
  File "/usr/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 118, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 244: ordinal not in range(128)
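
The likely root cause: codecs.open decodes the file to unicode, and feeding unicode to unicodecsv, which expects bytes, makes the underlying Python 2 csv module re-encode each row with the default ascii codec. Keeping the stream as bytes (plain open(..., 'rb') plus an encoding='utf-8' argument to the reader) avoids that round trip. The implicit-ascii failure itself is easy to reproduce:

```python
# Encoding a non-ASCII character (here U+2019, a right single quote,
# the character from the traceback) with the ascii codec raises the
# same UnicodeEncodeError.
try:
    '\u2019'.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)
```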

Allow headers to differ from fieldnames

In DictWriter the header is currently defined as:

header = dict(zip(self.fieldnames, self.fieldnames))

This means the keys in the rows have to match the headers of the file. But that's inconvenient because the keys are often things like dateOfOrder whereas we'd like a human-readable header in the CSV file: "Date of order".
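
Until such a feature exists, one workaround is to write the human-readable header yourself and let DictWriter handle only the rows. A sketch using the Python 3 stdlib csv, with a hypothetical headers mapping standing in for whatever labels you want:

```python
import csv
from io import StringIO

fieldnames = ['dateOfOrder', 'total']                         # row dict keys
headers = {'dateOfOrder': 'Date of order', 'total': 'Total'}  # display labels

buf = StringIO()
# Write the human-readable header with a plain writer...
csv.writer(buf).writerow([headers[f] for f in fieldnames])
# ...then let DictWriter handle rows keyed by the machine-readable names.
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writerow({'dateOfOrder': '2015-03-25', 'total': '9.99'})
print(buf.getvalue())
```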

version 0.11.1 breaks when writing Unicode CSV headers

Hey, I think a recent change caused a bug. Here's a demo program: https://gist.github.com/NelsonMinar/aacf7d6dfe4e40b36c16

Long story short, if the CSV header contains Unicode strings it now throws an error.

Unicode CSV version 0.11.1
/usr/lib/python2.7/csv.py:145: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  wrong_fields = [k for k in rowdict if k not in self.fieldnames]
Traceback (most recent call last):
  File "testUCSV.py", line 13, in <module>
    writer.writeheader()
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py", line 159, in writeheader
    self.writerow(header)
  File "/usr/lib/python2.7/csv.py", line 152, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/usr/lib/python2.7/csv.py", line 148, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'unicode\xe2\x98\x83'

I wouldn't be surprised if you don't have any tests for non-ASCII headers; it's unusual. But I have some live Japanese government street address data like this. I got a bit confused looking at the git history, but I think this is related to calling _stringify_list in writeheader. Version 0.11.0 didn't do that.
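
That diagnosis fits the traceback: 'unicode\xe2\x98\x83' is the UTF-8 encoding of the fieldname u'unicode☃', and if writeheader encodes the header keys to bytes (as the _stringify_list theory suggests), they no longer compare equal to the unicode fieldnames, so every key looks wrong; on Python 2 that bytes/unicode comparison is also what triggers the UnicodeWarning. The mismatch, shown with Python 3 types:

```python
# The fieldname and its UTF-8 encoding -- the byte string that
# appears in the ValueError message above.
fieldname = 'unicode\u2603'            # 'unicode☃'
encoded = fieldname.encode('utf-8')    # b'unicode\xe2\x98\x83'
# Encoded bytes never compare equal to the original text, so a
# fieldname membership check done after encoding cannot match.
print(encoded == fieldname)
```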

How can I write code that work in Python 2 and 3?

unicodecsv's API differs depending on whether you're on Python 2 or 3, because on Python 3 it delegates to the stdlib. Unfortunately, even ignoring the use of cStringIO, the README example doesn't work on Python 3: reader (and writer) take no encoding argument there, since the Python 3 csv module just reads and writes strings (unicode).

I'm not sure what the best path forward is to write code that works on both Python 2 and 3. Perhaps it's to write a wrapper that gives an API that works on both. I'm not yet sure. Ideas welcome.
