olemb / dbfread
Read DBF Files with Python
License: MIT License
First of all, what a wonderfully-coded and -documented project!
Second: it would be awesome if non-seekable files (FIFOs, for example) could be used as input files; that way dbfread could be used in unix-style command-line filters.
It seems like there are not too many seek calls, so hopefully this is not too difficult; if it isn't quick for you, I may be able to give it a try.
Hi
I'm trying to work with a large file (over 6 million records), but dbfread appears to stop before the end of the file. There are no exceptions; it just stops as though it has reached the end of the file. The system that produces the .dbf file is 'quirky': occasionally records get saved out of order when they shouldn't, and this appears to be the point at which dbfread stops.
If I delete and then reinstate the records that cause the stop using DBF Viewer 2000, dbfread will continue to the next set of quirky records; if I delete and reinstate all the records in the file, it reads everything.
Ideally I don't want to have to use DBF Viewer 2000 to do this.
Has anyone got any pointers as to how I might resolve this?
pkg_resources.DistributionNotFound: The 'pip==8.1.0' distribution was not found and is required by the application
This seems broken... No one should require exactly one particular version of pip.
I forgot to mention this was when trying to install it with python3. Python2 seems OK.
How would I import DBF data into a specific schema? For example, if the table is people.dbf, I want to import it not into the default public schema but into import.people.
Dictionaries are already ordered in CPython 3.6 and guaranteed to be ordered from 3.7 and up, so there's no need to return an OrderedDict anymore. (Should it also return dict for 3.6? Can we assume that people are using CPython?)
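A small sketch of what the switch would mean, assuming the existing recfactory argument (dbfread builds each record by calling recfactory on a list of (name, value) pairs); the field names below are made up:

```python
# Plain dicts preserve insertion order on Python 3.7+, so a recfactory
# of dict would keep field order without the OrderedDict overhead.
# dbfread constructs each record roughly as recfactory(items), where
# items is a list of (field_name, value) tuples:
items = [('NAME', 'jack'), ('CARD_NUMBER', 2132)]
record = dict(items)

assert record == {'NAME': 'jack', 'CARD_NUMBER': 2132}
assert list(record) == ['NAME', 'CARD_NUMBER']  # insertion order kept

# Hypothetical opt-in with current dbfread versions:
#   from dbfread import DBF
#   table = DBF('people.dbf', recfactory=dict)
```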
I use dbfread to read a .DBF file and have run into a strange problem.
For example, dbfread reads:
code: 0001, name: jack, card_number: 2132
code: 0001, name: jack, card_number: 2400
but when I open the same DBF file in FoxPro, I only see:
code: 0001, name: jack, card_number: 2132
The row whose card_number is 2400 is missing.
Has anyone seen this? It seems dbfread returns a row that does not appear in FoxPro.
Hello Guys,
I got this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 16: ordinal not in range(128)
from dbfread import DBF

for item in DBF('/home/chris/Desktop/ITEM.DBF'):
    print item
I tried to put encoding
from dbfread import DBF

for item in DBF('/home/chris/Desktop/ITEM.DBF', encoding='utf-8'):
    print item
Still get this error
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/dbf.py", line 304, in _iter_records
for field in self.fields]
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/field_parser.py", line 75, in parse
return func(field, data)
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/field_parser.py", line 83, in parseC
return decode_text(data.rstrip(b'\0 '), self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 16: invalid start byte
Any ideas?
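Not an official diagnosis, but byte 0xAB is valid text in the legacy 8-bit code pages DBF files usually use, which suggests trying one of those instead of UTF-8; the encoding choices in the comments are untested guesses:

```python
# 0xAB decodes fine in common legacy DBF code pages, unlike UTF-8,
# where a lone 0xAB is an invalid start byte:
assert b'\xab'.decode('cp1252') == '\u00ab'   # '«' in Windows-1252
assert b'\xab'.decode('cp850') == '\u00bd'    # '½' in DOS code page 850

try:
    b'\xab'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert not utf8_ok

# Hypothetical fixes to try with dbfread:
#   DBF('/home/chris/Desktop/ITEM.DBF', encoding='cp1252')
#   DBF('/home/chris/Desktop/ITEM.DBF', encoding='cp850')
```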
Many thanks for the dbfread package, it saves me time. But now it consumes a lot of my computer's time :-)
The API I use is:
for record in DBF(filepath):
    # something really simple
Reading the file takes about 1 minute, and the DBF is only about 12 MB.
How can I read it faster?
And if I am only interested in the incremental part (new records), how can I read just those?
Thank you in advance!
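Before tuning anything it helps to see where the minute actually goes; this is a generic stdlib profiling sketch (the DBF call in the comment is the hypothetical target, and the profiled workload here is a stand-in):

```python
import cProfile
import io
import pstats

def profile(fn):
    """Run fn under cProfile and return the stats report as a string."""
    pr = cProfile.Profile()
    pr.enable()
    fn()
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(10)
    return buf.getvalue()

# Hypothetical: profile the real loop to see if field parsing dominates:
#   report = profile(lambda: sum(1 for record in DBF(filepath)))
report = profile(lambda: sum(i * i for i in range(100_000)))
assert 'function calls' in report
```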
I'm trying to read a database produced by an ancient version of the ACDsee photo manager program (don't ask).
When I try to read it simply as:
table = DBF('asset.dbf')
for record in table:
    print(record)
I get ValueError: Unknown field type: '7'.
I followed the advice in another issue and created a field parser as:
from dbfread import DBF, FieldParser

class TestFieldParser(FieldParser):
    def parse7(self, field, data):
        return data

table = DBF('asset.dbf', parserclass=TestFieldParser)
for record in table:
    print(record)
This produces the stack trace below. Googling for the error suggests that maybe the file is being read with the wrong encoding. Is there an easy way to try reading e.g. with UTF-8?
Traceback (most recent call last):
File "/mnt/acdsee/ACDsee/./dumpdb.py", line 10, in <module>
for record in table:
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 314, in _iter_records
items = [(field.name,
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 315, in <listcomp>
parse(field, read(field.length))) \
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.9/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35: character maps to <undefined>
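On the encoding question: yes, DBF takes an encoding argument, and the failing byte itself narrows things down. A stdlib sketch of why cp1252 chokes on 0x81 (the dbfread calls in the comments are untested suggestions):

```python
# cp1252 leaves a handful of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
# undefined, which is exactly why charmap_decode raises on 0x81:
try:
    b'\x81'.decode('cp1252')
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
assert not cp1252_ok

# latin-1 maps every byte to a character, so it never raises
# (at the risk of mojibake if the guess is wrong):
assert b'\x81'.decode('latin1') == '\x81'

# Hypothetical dbfread calls to try, reusing the parser from above:
#   DBF('asset.dbf', parserclass=TestFieldParser, encoding='utf-8')
#   DBF('asset.dbf', parserclass=TestFieldParser, encoding='latin1')
#   DBF('asset.dbf', parserclass=TestFieldParser,
#       char_decode_errors='replace')
```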
I am reading a memo file where I don't retrieve the full length of my field, because fields are terminated in different ways.
There is a TODO for this in dbfread/memo.py (DB4MemoFile):
# Todo: fields are terminated in different ways.
# \x1a is one of them
# \x1f seems to be another (dbase_8b.dbt)
return data.split(b'\x1f', 1)[0]
If I split on \x1a instead, or return all the data, the result seems to be correct.
So why are we splitting the output on \x1f?
It would be great if the API would take file-like objects as input, rather than a file-path string. In my case, I load DBF file data directly from an API into an io.BytesIO object and want to operate on that data directly. To use dbfread I'd have to save it to a temp file.
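Until such an API exists, a stopgap is to spill the buffer to a named temporary file; this is a hedged sketch, not part of dbfread:

```python
import os
import tempfile

def dbf_path_from_bytes(data: bytes) -> str:
    """Write in-memory DBF bytes to a temp file and return its path,
    since dbfread currently expects a filename."""
    tmp = tempfile.NamedTemporaryFile(suffix='.dbf', delete=False)
    try:
        tmp.write(data)
    finally:
        tmp.close()
    return tmp.name

# Hypothetical usage with an io.BytesIO buffer from an API response:
#   path = dbf_path_from_bytes(buffer.getvalue())
#   table = DBF(path)
path = dbf_path_from_bytes(b'\x03 dummy bytes')
assert open(path, 'rb').read() == b'\x03 dummy bytes'
os.remove(path)  # caller is responsible for cleanup
```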
Can I install Python 3.6 and will it run smoothly?
Traceback (most recent call last):
File "", line 1, in
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/dbf.py", line 314, in _iter_records
items = [(field.name,
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/dbf.py", line 315, in
parse(field, read(field.length)))
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'60.00\x00\x00'
I'm trying to write Python code to convert a .dbf file into an .xlsx file, but it throws an exception (naming the column in question, which doesn't help me understand it) when the column name has a comma. It doesn't throw one otherwise:
from openpyxl import Workbook, load_workbook
from dbfread import DBF
wb = Workbook()
table = DBF("DBRADON.DBF", load = True, ignore_missing_memofile=True)
length = len(table)
ws1 = wb.create_sheet("RFD", 0)
ws2 = wb.create_sheet("RAC")
ws1['A1'] = "Date"
ws1['E1'] = "Measurement Start"
ws1['F1'] = "Measurement End"
ws1['G1'] = "Exposition Length"
ws1['H1'] = "Measurement Length"
ws1['I1'] = "Radon activity"
ws1['J1'] = "Radon activity +-"
ws1['K1'] = "RFD"
ws1['L1'] = "RFD+-"
ws1['O1'] = "SK-13"
E = []
for e in range(2, len(table)):
    E.append('E' + str(e))
#print(table.records[12]['A214BI']) #This doesn't throw an exception
#print(table.records[12]['BGTM,N,2,0']) #This throws an exception
wb.save('radon.xlsx')
File in question is attached:
DBRADON.zip
It's getting hard to keep track of all the special cases of field formats in different DBF files.
We need better tests for these special cases.
Hi, you have developed an excellent package for reading .dbf files, but I have had many issues reading one of them. I want to keep just 2 columns of the .dbf, but first I have to load the whole file before I can delete the other columns, and at that step I get a ValueError for one of the records in a column I actually want to delete.
In fact, at first I got a ValueError with strange characters; after modifying the 'parseN' function, the strange character was converted to 'nz'.
Thanks for helping me with this issue; I hope to get your answer.
I've recently tried to use in2csv on a GIS dBase IV dbf file and was unable to do so until I amended (in a very ugly way) in2csv.py to add an additional --encoding-dbf command-line argument (which mirrors --encoding-xls in passing an encoding argument to agate.Table).
Since this is probably not a very promising solution (I imagine in the future an --encoding-xxx for each possible data file), wouldn't it be better to just have an --input-encoding parameter? Would that break anything else?
A DBF file contains a value of '0.00**' in an int field, and when I read it I get the following error. I understand that the problem is partly due to data impurity in the DBF files I am using, but thought you'd like to be aware of the issue.
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\dbf.py", line 310, in _iter_records
for field in self.fields]
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\dbf.py", line 310, in
for field in self.fields]
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\field_parser.py", line 75, in parse
return func(field, data)
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\field_parser.py", line 164, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'**'
Hello, and thank you so much for the super useful library!
I'm trying to import data from a very old and buggy VFP application. The DB is so dirty it even has \x00 characters in a numeric row. This causes dbfread to try to interpret it as a float, which fails as such:
Traceback (most recent call last):
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 168, in parseN
return int(data)
ValueError: invalid literal for int() with base 10: b'\x00'
During handling of the above exception, another exception occurred:
...
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'\x00'
I'm not worried about trying to retrieve useful data from that row - it's empty anyway. The problem is I don't see a way to skip that row during parsing without monkey-patching dbfread. Can you please give any guidance on this matter?
I suppose worst case scenario, I could delete the row from the input file somehow without using dbfread... no idea how, though.
Many thanks!
Oh, my code is basically just iterating through the entire dbf: a for row in dbf_file kind of thing.
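One way to skip such rows without monkey-patching is dbfread's parserclass hook. The cleanup logic below is a guess at what this data needs (NUL padding, '*' padding, comma decimals); the dbfread wiring is left in comments so the sketch stays self-contained:

```python
def clean_numeric(data: bytes):
    """Parse a DBF 'N' field, returning None for blank or NUL-padded
    values instead of raising ValueError."""
    data = data.replace(b'\x00', b'').strip().strip(b'*')
    if not data:
        return None
    try:
        return int(data)
    except ValueError:
        # some writers use ',' as the decimal separator
        return float(data.replace(b',', b'.'))

assert clean_numeric(b'\x00') is None           # the row that crashed
assert clean_numeric(b'60.00\x00\x00') == 60.0  # NUL-padded float
assert clean_numeric(b'  42') == 42

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class TolerantFieldParser(FieldParser):
#       def parseN(self, field, data):
#           return clean_numeric(data)
#   for row in DBF('file.dbf', parserclass=TolerantFieldParser):
#       ...
```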
It is not clear whether, when you do
for row in DBF('xxx.dbf'):
    ...
the rows are accessed/read in sorted order or in random order.
Hello, thanks for your work.
I am trying to read a dbf file, but halfway through I get this error.
The memo fields are probably the cause; in this case it would be the email field.
It crashes at line 685 of CLI_ENTI.DBF.
Do you have time to help me? Thanks.
File "C:\Python39\lib\site-packages\dbfread\dbf.py", line 314, in _iter_records
items = [(field.name,
File "C:\Python39\lib\site-packages\dbfread\dbf.py", line 315, in <listcomp>
parse(field, read(field.length))) \
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 157, in parseM
return self.decode_text(memo)
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
File "C:\Python39\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 65: character maps to <undefined>
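One knob worth knowing: DBF accepts a char_decode_errors argument that is passed on to bytes.decode. A stdlib illustration of what 'replace' and 'ignore' do with the offending 0x8D byte (the dbfread call is a suggestion, not tested against this file):

```python
# 0x8D is undefined in cp1252; 'replace' substitutes U+FFFD and
# 'ignore' drops the byte, instead of raising UnicodeDecodeError:
sample = b'mail\x8d@example'
assert sample.decode('cp1252', errors='replace') == 'mail\ufffd@example'
assert sample.decode('cp1252', errors='ignore') == 'mail@example'

# Hypothetical: DBF('CLI_ENTI.DBF', char_decode_errors='replace')
```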
This will make it easier to write tools to analyze files, and to add support to the library for reading from an open file or pipe.
This is not something we need to support directly in the code, but it would be nice to have a documented way to create dataclasses from records.
I think it should be possible to create dataclasses dynamically by inspecting the DBF object, but perhaps it is best after all to create the dataclass manually.
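A minimal sketch of the dynamic route using dataclasses.make_dataclass; the field names are hard-coded stand-ins for what DBF(...).field_names would supply:

```python
from dataclasses import asdict, make_dataclass

# Hypothetical: with dbfread these would come from the table itself,
#   field_names = DBF('people.dbf', lowernames=True).field_names
field_names = ['name', 'birthdate']
Record = make_dataclass('Record', field_names)

# Each dbfread record is a mapping, so it unpacks straight in:
rec = Record(**{'name': 'Alice', 'birthdate': '1987-03-01'})
assert rec.name == 'Alice'
assert asdict(rec) == {'name': 'Alice', 'birthdate': '1987-03-01'}

# Hypothetical bulk conversion:
#   records = [Record(**r) for r in DBF('people.dbf', lowernames=True)]
```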
Hello,
I'm unsuccessfully trying to open dbf
from dbfread import DBF, FieldParser
class TestFieldParser(FieldParser):
    def parse00(self, field, data):
        print(field.name, data)
        return data

    def parseR(self, field, data):
        print(field.name, data)
        return data

dbf = DBF('c:/base/75.dbf', encoding='cp1251', parserclass=TestFieldParser)
for rec in dbf:
    pass
Traceback (most recent call last):
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\Transit.py", line 13, in
dbf = DBF('c:/base/75.dbf', encoding='cp1251', parserclass=TestFieldParser)
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\dbfread\dbf.py", line 123, in __init__
self._check_headers()
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\dbfread\dbf.py", line 257, in _check_headers
raise ValueError(message.format(field.length))
ValueError: Field type I must have length 4 (was 0)
What does this error mean and how to fix it?
Thanks everyone
Traceback (most recent call last):
File "<ipython-input-129-655019aebcfc>", line 1, in <module>
for rec in dbf:
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\dbf.py", line 316, in _iter_records
for field in self.fields]
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\dbf.py", line 316, in <listcomp>
for field in self.fields]
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00
Through reading other issues I have found a way to fix the problem, but I'm not sure of the best way to implement it.
class TestFieldParser(FieldParser):
    def parseN(self, field, data):
        """Parse numeric field (N)

        Returns int, float or None if the field is empty.
        """
        # In some files * is used for padding.
        data = data.strip().strip(b'*')
        try:
            return int(data)
        except ValueError:
            if not data.strip():
                return None
            elif isinstance(data, (bytes, bytearray)):  # I added these 2 lines
                return int.from_bytes(data, byteorder='big', signed=True)
            else:
                # Account for , in numeric fields
                return float(data.replace(b',', b'.'))
This works in my instance, but there may be a better way to implement it. I'm not sure of the best way to check whether the data object is a byte literal; I had also made the comparison as:
data == b'\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Hope this helps
Some files created with FoxPro 9 can not be opened with dbfread.DBF("file.dbf")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/dbfread/dbf.py", line 120, in __init__
self._check_headers()
File "~/dbfread/dbf.py", line 259, in _check_headers
raise ValueError('Unknown field type: {!r}'.format(field.type))
ValueError: Unknown field type: 'V'
It seems the DBF format was slightly updated:
https://msdn.microsoft.com/en-us/library/st4a0s68%28VS.80%29.aspx
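Per that documentation, 'V' is Visual FoxPro's Varchar type, which stores text; until the library learns it, a parserclass that treats it like a character field might do. The helper below mirrors what a parseC-style parser does, with the dbfread wiring left as hedged comments:

```python
def parse_varchar(data: bytes, encoding: str = 'cp1252') -> str:
    """Decode a Varchar ('V') payload like a character field:
    strip trailing NUL/space padding, then decode."""
    return data.rstrip(b'\x00 ').decode(encoding)

assert parse_varchar(b'hello \x00\x00') == 'hello'

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class VFP9FieldParser(FieldParser):
#       def parseV(self, field, data):
#           return self.parseC(field, data)  # reuse character parsing
#   table = DBF('file.dbf', parserclass=VFP9FieldParser)
```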
I am trying to open a simple dBase III file and I am getting:
File "C:/Users/user/Documents/Python/SAG/dbftest.py", line 4, in <module>
test = DBF('SystemRecord.dbf')
File "C:\Python36-32\lib\site-packages\dbfread\dbf.py", line 122, in __init__
self._read_field_headers(infile)
File "C:\Python36-32\lib\site-packages\dbfread\dbf.py", line 224, in _read_field_headers
field = DBFField.unpack(sep + infile.read(DBFField.size - 1))
File "C:\Python36-32\lib\site-packages\dbfread\struct_parser.py", line 36, in unpack
items = zip(self.names, self.struct.unpack(data))
struct.error: unpack requires a bytes object of length 32
I'm assuming there is some field/index in my file that is larger/shorter than some maximum.
Any ideas what's causing this?
Currently, the latest version of dbfread available on PyPI is 2.0.7 (from November 2016). That version does not register its supported versions of Python and it doesn't include any trove classifiers. It would be great to make the newest version available via pip install --upgrade dbfread.
Also, I think it could be helpful to add a check-list that could be used when publishing releases. This is something I do for my own projects and it helps formalize the process and reduce mistakes.
If you like the idea, I can submit it as a pull request. But for now, here is a release-checklist.rst file I would suggest as a good starting point:
Release Checklist
=================
#. In ``dbfread/version.py``, make sure the correct version number is defined
for this release.
#. Make sure that information about supported Python versions is consistent:
* In the call to ``setup()``, check the versions defined by the
*python_requires* argument (see the "Version specifiers" section of
PEP-440 for details).
* In the call to ``setup()``, check the trove classifiers in the
*classifiers* argument (see https://pypi.org/classifiers/ for values).
* In ``README.rst``, check the versions listed in the "Main Features" and
"Installing" sections.
* In ``docs/installing.rst``, check the versions listed in the
"Requirements" section.
#. Make sure the *description* argument in ``setup.py`` matches the project
description on GitHub (in the "About" section).
#. Check that *packages* argument of ``setup()`` is correct. Check that the
value matches what ``setuptools.find_packages()`` returns:
>>> import setuptools
>>> sorted(setuptools.find_packages('.', exclude=['tests']))
Defining this list explicitly (rather than using ``find_packages()``
directly in ``setup.py`` file) is needed when installing on systems
where ``setuptools`` is not available.
#. Make final updates to ``docs/changes.rst`` file.
#. Commit and push final changes to the upstream development repository:
Prepare version info, documentation, and README for version N.N.N release.
#. In the upstream repository, make sure that all of the tests and checks
are passing.
#. Make sure the packaging tools are up-to-date:
pip install -U twine wheel setuptools check-manifest
#. Check the manifest against the project's root folder:
check-manifest .
#. Remove any existing files from the ``dist/`` folder.
#. Build new distributions:
python setup.py sdist bdist_wheel
#. Upload distributions to TestPyPI:
twine upload --repository testpypi dist/*
#. View the package's web page on TestPyPI and verify that the information
is correct for the "Project links" and "Meta" sections:
* https://test.pypi.org/project/dbfread
If you are testing a pre-release version, make sure to use the URL returned
by twine in the previous step (the default URL shows the latest *stable*
version).
#. Test the installation process from TestPyPI:
python -m pip install --index-url https://test.pypi.org/simple/ dbfread
If you're testing a pre-release version, make sure to use the "pip install"
command listed at the top of the project's TestPyPI page.
#. Upload source and wheel distributions to PyPI:
twine upload dist/*
#. Double check PyPI project page and test installation from PyPI:
python -m pip install dbfread
#. Make sure the documentation version reflects the new release:
* https://dbfread.readthedocs.io/
If the documentation was not automatically updated, you may need to
login to https://readthedocs.org/ and start the build process manually.
#. Publish update announcement to relevant mailing lists:
* [email protected]
Hi,
when I was trying to read a dbf file, I found that when the dbf file has a wrong field type definition, the data cannot be read correctly.
E.g. blfm2 is actually filled with floats, but it was mistakenly defined as 'C' type. It is parsed with the following method (dbf.py line 122):
self._read_field_headers(infile)
def _read_field_headers(self, infile):
    while True:
        sep = infile.read(1)
        if sep in (b'\r', b'\n', b''):
            # End of field headers
            break

        field = DBFField.unpack(sep + infile.read(DBFField.size - 1))
        field.type = chr(ord(field.type))

        # For character fields > 255 bytes the high byte
        # is stored in decimal_count.
        if field.type in 'C':
            field.length |= field.decimal_count << 8
            field.decimal_count = 0

        # Field name is b'\0' terminated.
        field.name = self._decode_text(field.name.split(b'\0')[0])
        if self.lowernames:
            field.name = field.name.lower()

        self.field_names.append(field.name)
        self.fields.append(field)
It should produce a DBFField like:
DBFField(name='blfm2', type='C', address=196, length=17, decimal_count=12,
But it turns out to be:
DBFField(name='blfm2', type='C', address=196, length=3072, decimal_count=0
I think the problem is here:
# For character fields > 255 bytes the high byte
# is stored in decimal_count.
if field.type in 'C':
    field.length |= field.decimal_count << 8
    field.decimal_count = 0
After I removed these lines, I finally got the correct result.
Hi! I tried your library to open a dbf file but get the following error:
import dbfread
file_ = '/home/gijs/.wine/drive_c/OLMSoft/BrouwVisie/Data/Recepten.Dbf'
t = dbfread.dbf.DBF(file_, load=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-e461b614b7c2> in <module>()
----> 1 t = dbfread.dbf.DBF(file_, load=True)
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in __init__(self, filename, encoding, ignorecase, lowernames, parserclass, recfactory, load, raw, ignore_missing_memofile)
131
132 if load:
--> 133 self.load()
134
135 @property
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in load(self)
166 """
167 if not self.loaded:
--> 168 self._records = list(self._iter_records(b' '))
169 self._deleted = list(self._iter_records(b'*'))
170
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in _iter_records(self, record_type)
308 items = [(field.name,
309 parse(field, read(field.length))) \
--> 310 for field in self.fields]
311
312 yield self.recfactory(items)
/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.pyc in parse(self, field, data)
73 raise ValueError('Unknown field type: {!r}'.format(field.type))
74 else:
---> 75 return func(field, data)
76
77 def parse0(self, field, data):
/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.pyc in parseN(self, field, data)
162 else:
163 # Account for , in numeric fields
--> 164 return float(data.replace(b',', b'.'))
165
166 def parseO(self, field, data):
ValueError: could not convert string to float: F
$ file Recepten.Dbf
Recepten.Dbf: FoxBase+/dBase III DBF, 1 record * 13734, update-date 115-8-16, codepage ID=0x37, with index file .MDX, at offset 8577 1st record " FMijn eig"
While opening a dbf I got this error:
ValueError: invalid date b'24/01/20'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/dbfread/field_parser.py in parseD(self, field, data)
91 try:
---> 92 return datetime.date(int(data[:4]), int(data[4:6]), int(data[6:8]))
93 except ValueError:
ValueError: invalid literal for int() with base 10: b'24/0'
The parser is apparently splitting the date at the wrong positions.
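The stock parser assumes the standard YYYYMMDD layout, while this file apparently stores DD/MM/YY text. A hedged helper for that layout (the century pivot is a guess), with the parserclass hookup in comments:

```python
import datetime

def parse_slashed_date(data: bytes) -> datetime.date:
    """Parse 'DD/MM/YY' dates; two-digit years below 70 are taken
    as 20xx (an assumption, adjust the pivot for your data)."""
    day, month, year = (int(part) for part in data.decode('ascii').split('/'))
    year += 2000 if year < 70 else 1900
    return datetime.date(year, month, day)

assert parse_slashed_date(b'24/01/20') == datetime.date(2020, 1, 24)

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class SlashedDateParser(FieldParser):
#       def parseD(self, field, data):
#           return parse_slashed_date(data)
#   table = DBF('file.dbf', parserclass=SlashedDateParser)
```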
I have this error opening a DBF file:
Traceback (most recent call last):
File "hwmnucheck.py", line 486, in main
checker.loaddb()
File "hwmnucheck.py", line 301, in loaddb
self._loaddb_turni()
File "hwmnucheck.py", line 236, in _loaddb_turni
table = dbfread.DBF(self._db_turnimensa_f)
File "\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\dbfread\dbf.py", line 123, in __init__
self._check_headers()
File "WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\dbfread\dbf.py", line 265, in _check_headers
raise ValueError('Unknown field type: {!r}'.format(field.type))
ValueError: Unknown field type: '\x00'
The way tests are run is not working anymore.
This will require internal changes to the library, and perhaps some API changes.
Relevant issues and pull requests:
I suspect the best solution to handle all of these cases is to break most of the code into functions that can be composed into different APIs. For example, there could be a function that reads the headers from a file object, and a generator that reads records from a file object, and we could write a new backwards-compatible DBF class or another, leaner API on top of these.
These smaller functions would also be useful when writing debugging tools.
I ran a benchmark dumping a DBF to CSV 1,000 times using dbfread 2.0.7 and Pandas 0.24.1 and comparing it to https://github.com/yellowfeather/dbf2csv.
The file I used was a 1.5 MB DBF in 'FoxBASE+/Dbase III plus, no memory' format with 4,522 records and 40 columns made up of 36 numeric fields, 2 char fields and 2 date fields. I used CPython 2.7.12 for dbfread and I compiled dbf2csv using GCC 5.4.0 and libboost v1.58.0. I ran the test on a t3.2xlarge instance on AWS.
The disk can be written to at 115 MB/s according to:
dd if=/dev/zero of=./test bs=1G count=1 oflag=dsync
dbfread and Pandas managed to write the CSV 1,000 times in 26 minutes while dbf2csv took just under 74 seconds, a 21x difference in performance.
These were the steps I took during the benchmark:
$ vi run_1
for i in {1..1000}; do
./dbf2csv -c FILE.DBF > FILE.csv
done
$ time bash run_1 # 1m13.699s, 21x faster
$ vi run_2.py
from dbfread import DBF
import pandas as pd
dbf = DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')
pd.DataFrame(iter(dbf)).to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')
$ vi run_2
for i in {1..1000}; do
python run_2.py
done
$ time bash run_2 # 25m59.877s
The DBF would have sat in the page cache but nonetheless represented 1,489 MB of data read. The resulting CSV represented 638 MB. So this works out to 0.95 MB/s read and 0.41 MB/s written.
Do you know of any way I could see improved performance in this workload using dbfread?
Would it make sense for dbfread to support compressed DBF files (DBC files)?
It would be similar to what the read.dbc package does in R. I couldn't find a package that does the same in Python. I've tried dbfread and it currently only reads uncompressed DBF files.
If you need DBC files for testing, there are lots of them at the DATASUS website (official public healthcare system statistics from Brazil).
I'm having problems with tables containing unicode. CNTR_AT_2013 comes from the Eurostat db. Below is my attempt to read the data and what is expected (parsed via the Unix dbfview tool).
The problem occurs in both Python 2.7 and 3.5.
>>> for record in dbfread.DBF('CNTR_AT_2013.dbf'):
...     print record
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 316, in _iter_records
for field in self.fields]
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
>>> table = dbfread.read("CNTR_AT_2013.dbf")
/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py:47: UserWarning: dbfread.read() has been replaced by DBF(load=True) and will be removed in 2.2.
warnings.warn("dbfread.read() has been replaced by DBF(load=True)"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py", line 49, in read
return DeprecatedDBF(filename, load=True, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 136, in __init__
self.load()
File "/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py", line 18, in load
self[:] = self._iter_records(b' ')
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 316, in _iter_records
for field in self.fields]
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Cntr id : QN
Cntr name : Puerto Rico and Virgin Islands of the United States
Name asci : Puerto Rico and Virgin Islands of the United States
Name engl : Puerto Rico and Virgin Islands of the United States
Name fren : PUERTO RICO ET LES ÎLES VIERGES DES ÉTATS-UNIS
Poli org c : 99.00000000
Name gaul :
Iso3 code :
Svrg un : US Territory
Capt :
Cntr code : UA
Eu stat : F
Efta stat : F
Cc stat : F
Cntr id : QO
Cntr name : Guadeloupe and Martinique
Name asci : Guadeloupe and Martinique
Name engl : GUADELOUPE AND MARTINIQUE
Name fren : GUADELOUPE ET MARTINIQUE
Poli org c : 99.00000000
Name gaul :
Iso3 code :
Svrg un : FR Territory
Capt :
Cntr code : UA
Eu stat : F
Efta stat : F
Cc stat : F
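The undecodable byte 0xC3 at the start of a multi-byte run is the classic UTF-8 signature ('Î' encodes as 0xC3 0x8E), and dbfread falls back to ASCII when the file doesn't declare a code page. A quick stdlib check of that guess, with the assumed fix in a comment:

```python
# The French name in the expected output contains 'Î' and 'É';
# their UTF-8 encodings start with the failing byte 0xC3:
name = 'PUERTO RICO ET LES \u00ceLES VIERGES DES \u00c9TATS-UNIS'
raw = name.encode('utf-8')
assert b'\xc3\x8e' in raw  # 'Î'
assert b'\xc3\x89' in raw  # 'É'
assert raw.decode('utf-8') == name

# Hypothetical fix: dbfread.DBF('CNTR_AT_2013.dbf', encoding='utf-8')
```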
Hi, I've been using this and it generally supports most FoxPro .dbf files, but there is one file that can't be read for some reason; it says "unpack requires a string argument of length 32". What might cause this?
Here is the file: https://drive.google.com/file/d/0BweenIzZNEAtSWdoM1M1d3Z5YjQ/view?usp=sharing
UnicodeEncodeError: 'utf-8' codec can't encode characters. I get this error when using utf-8 encoding.
The file has a value like this:
ABCâ XYZ 123
dbf = DBF(input_path, encoding="utf-8")
When I use encoding="iso-8859-1" I don't get an error, but the value gets converted to:
ABCÂ XYZ 123
dbf = DBF(input_path, encoding="iso-8859-1")