olemb / dbfread
Read DBF Files with Python
License: MIT License
First of all, what a wonderfully-coded and -documented project!
Second: it would be awesome if non-seekable files (FIFOs, for example) could be used as input files; that way dbfread could be used in unix-style command-line filters.
It seems like there are not too many seek calls, so hopefully this is not too difficult; if it isn't quick for you, I may be able to give it a try.
Hi
I'm trying to work with a large file (over 6 million records), but dbfread appears to stop before the end of the file. There are no exceptions; it just stops as though it has reached the end of the file. The system that produces the .dbf file is 'quirky': occasionally records get saved out of order when they shouldn't, and this appears to be the point at which dbfread stops.
If I delete and then reinstate the records that cause the stop using DBF Viewer 2000, dbfread will continue to the next set of quirky records; if I delete and reinstate all the records in the file, it reads everything.
Ideally I don't want to have to use DBF Viewer 2000 to do this.
Has anyone got any pointers as to how I might resolve this?
pkg_resources.DistributionNotFound: The 'pip==8.1.0' distribution was not found and is required by the application
This seems broken... No one should require exactly one particular version of pip.
I forgot to mention this was when trying to install it with python3. Python2 seems OK.
How would I import DBF data into a specific schema? For example, if the table is people.dbf, I want to import it not into the default public schema but into import.people.
Dictionaries are already ordered in CPython 3.6 and guaranteed to be ordered from 3.7 and up, so there's no need to return an OrderedDict anymore. (Should it also return dict for 3.6? Can we assume that people are using CPython?)
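A small sketch of what the switch would mean, assuming the existing recfactory argument (dbfread builds each record by calling recfactory on a list of (name, value) pairs); the field names below are made up:

```python
# Plain dicts preserve insertion order on Python 3.7+, so a recfactory
# of dict would keep field order without the OrderedDict overhead.
# dbfread constructs each record roughly as recfactory(items), where
# items is a list of (field_name, value) tuples:
items = [('NAME', 'jack'), ('CARD_NUMBER', 2132)]
record = dict(items)

assert record == {'NAME': 'jack', 'CARD_NUMBER': 2132}
assert list(record) == ['NAME', 'CARD_NUMBER']  # insertion order kept

# Hypothetical opt-in with current dbfread versions:
#   from dbfread import DBF
#   table = DBF('people.dbf', recfactory=dict)
```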
I use dbfread to read a .DBF file and have run into a strange problem.
For example, dbfread reads:
code: 0001, name: jack, card_number: 2132
code: 0001, name: jack, card_number: 2400
but when I open the same DBF file in FoxPro, I only see:
code: 0001, name: jack, card_number: 2132
The row whose card_number is 2400 is missing.
Has anyone seen this? It seems dbfread returns a row that does not appear in FoxPro.
Hello Guys,
I got this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 16: ordinal not in range(128)
from dbfread import DBF

for item in DBF('/home/chris/Desktop/ITEM.DBF'):
    print item
I tried to put encoding
from dbfread import DBF

for item in DBF('/home/chris/Desktop/ITEM.DBF', encoding='utf-8'):
    print item
Still get this error
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/dbf.py", line 304, in _iter_records
for field in self.fields]
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/field_parser.py", line 75, in parse
return func(field, data)
File "/home/chris/frappe-bench/env/local/lib/python2.7/site-packages/dbfread/field_parser.py", line 83, in parseC
return decode_text(data.rstrip(b'\0 '), self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 16: invalid start byte
Any ideas?
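Not an official diagnosis, but byte 0xAB is valid text in the legacy 8-bit code pages DBF files usually use, which suggests trying one of those instead of UTF-8; the encoding choices in the comments are untested guesses:

```python
# 0xAB decodes fine in common legacy DBF code pages, unlike UTF-8,
# where a lone 0xAB is an invalid start byte:
assert b'\xab'.decode('cp1252') == '\u00ab'   # '«' in Windows-1252
assert b'\xab'.decode('cp850') == '\u00bd'    # '½' in DOS code page 850

try:
    b'\xab'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert not utf8_ok

# Hypothetical fixes to try with dbfread:
#   DBF('/home/chris/Desktop/ITEM.DBF', encoding='cp1252')
#   DBF('/home/chris/Desktop/ITEM.DBF', encoding='cp850')
```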
Many thanks for the dbfread package, it saves me time. But now it consumes a lot of my computer's time :-)
The API I use is:
for record in DBF(filepath):
    # something really simple
Reading the file takes about 1 minute, and the DBF is only about 12 MB.
How can I read it faster?
And if I am only interested in the incremental part (new records), how can I read just those?
Thank you in advance!
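Before tuning anything it helps to see where the minute actually goes; this is a generic stdlib profiling sketch (the DBF call in the comment is the hypothetical target, and the profiled workload here is a stand-in):

```python
import cProfile
import io
import pstats

def profile(fn):
    """Run fn under cProfile and return the stats report as a string."""
    pr = cProfile.Profile()
    pr.enable()
    fn()
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(10)
    return buf.getvalue()

# Hypothetical: profile the real loop to see if field parsing dominates:
#   report = profile(lambda: sum(1 for record in DBF(filepath)))
report = profile(lambda: sum(i * i for i in range(100_000)))
assert 'function calls' in report
```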
I'm trying to read a database produced by an ancient version of the ACDsee photo manager program (don't ask).
When I try to read it simply as:
table = DBF('asset.dbf')
for record in table:
    print(record)
I get ValueError: Unknown field type: '7'.
I followed the advice in another issue and created a field parser as:
from dbfread import DBF, FieldParser

class TestFieldParser(FieldParser):
    def parse7(self, field, data):
        return data

table = DBF('asset.dbf', parserclass=TestFieldParser)
for record in table:
    print(record)
This produces the stack trace below. Googling for the error suggests that maybe the file is being read with the wrong encoding. Is there an easy way to try reading e.g. with UTF-8?
Traceback (most recent call last):
File "/mnt/acdsee/ACDsee/./dumpdb.py", line 10, in <module>
for record in table:
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 314, in _iter_records
items = [(field.name,
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 315, in <listcomp>
parse(field, read(field.length))) \
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.9/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35: character maps to <undefined>
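On the encoding question: yes, DBF takes an encoding argument, and the failing byte itself narrows things down. A stdlib sketch of why cp1252 chokes on 0x81 (the dbfread calls in the comments are untested suggestions):

```python
# cp1252 leaves a handful of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
# undefined, which is exactly why charmap_decode raises on 0x81:
try:
    b'\x81'.decode('cp1252')
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
assert not cp1252_ok

# latin-1 maps every byte to a character, so it never raises
# (at the risk of mojibake if the guess is wrong):
assert b'\x81'.decode('latin1') == '\x81'

# Hypothetical dbfread calls to try, reusing the parser from above:
#   DBF('asset.dbf', parserclass=TestFieldParser, encoding='utf-8')
#   DBF('asset.dbf', parserclass=TestFieldParser, encoding='latin1')
#   DBF('asset.dbf', parserclass=TestFieldParser,
#       char_decode_errors='replace')
```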
I am reading a memo file where I don't retrieve the full length of my field, because fields are terminated in different ways.
There is a TODO for this in dbfread/memo.py (DB4MemoFile):
# Todo: fields are terminated in different ways.
# \x1a is one of them
# \x1f seems to be another (dbase_8b.dbt)
return data.split(b'\x1f', 1)[0]
If I split on \x1a instead, or return all the data, the result seems to be correct.
So why are we splitting the output on \x1f?
It would be great if the API would take file-like objects as input, rather than a file-path string. In my case, I load DBF file data directly from an API into an io.BytesIO object and want to operate on that data directly. To use dbfread I'd have to save it to a temp file.
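Until such an API exists, a stopgap is to spill the buffer to a named temporary file; this is a hedged sketch, not part of dbfread:

```python
import os
import tempfile

def dbf_path_from_bytes(data: bytes) -> str:
    """Write in-memory DBF bytes to a temp file and return its path,
    since dbfread currently expects a filename."""
    tmp = tempfile.NamedTemporaryFile(suffix='.dbf', delete=False)
    try:
        tmp.write(data)
    finally:
        tmp.close()
    return tmp.name

# Hypothetical usage with an io.BytesIO buffer from an API response:
#   path = dbf_path_from_bytes(buffer.getvalue())
#   table = DBF(path)
path = dbf_path_from_bytes(b'\x03 dummy bytes')
assert open(path, 'rb').read() == b'\x03 dummy bytes'
os.remove(path)  # caller is responsible for cleanup
```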
Can I install Python 3.6 and will it run smoothly?
Traceback (most recent call last):
File "", line 1, in
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/dbf.py", line 314, in _iter_records
items = [(field.name,
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/dbf.py", line 315, in
parse(field, read(field.length)))
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/home/btx/Python/btx-venv/lib/python3.10/site-packages/dbfread/field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'60.00\x00\x00'
I'm trying to write Python code to convert a .dbf file into an .xlsx file, but it throws an exception (naming the column in question, which doesn't help me understand it) when the column name has a comma. It doesn't throw one otherwise:
from openpyxl import Workbook, load_workbook
from dbfread import DBF
wb = Workbook()
table = DBF("DBRADON.DBF", load = True, ignore_missing_memofile=True)
length = len(table)
ws1 = wb.create_sheet("RFD", 0)
ws2 = wb.create_sheet("RAC")
ws1['A1'] = "Date"
ws1['E1'] = "Measurement Start"
ws1['F1'] = "Measurement End"
ws1['G1'] = "Exposition Length"
ws1['H1'] = "Measurement Length"
ws1['I1'] = "Radon activity"
ws1['J1'] = "Radon activity +-"
ws1['K1'] = "RFD"
ws1['L1'] = "RFD+-"
ws1['O1'] = "SK-13"
E = []
for e in range(2, len(table)):
    E.append('E' + str(e))
#print(table.records[12]['A214BI']) #This doesn't throw an exception
#print(table.records[12]['BGTM,N,2,0']) #This throws an exception
wb.save('radon.xlsx')
File in question is attached:
DBRADON.zip
It's getting hard to keep track of all the special cases of field formats in different DBF files.
We need better tests for these special cases.
Hi, you have developed an excellent package for reading .dbf files, but I have had many issues reading one of them. I want to keep just 2 columns of the .dbf, but first I have to load the whole file before I can delete the other columns, and at that step I get a ValueError for one of the records in a column I actually want to delete.
In fact, at first I got a ValueError with strange characters; after modifying the 'parseN' function, the strange character was converted to 'nz'.
Thanks for helping me with this issue; I hope to get your answer.
I've recently tried to use in2csv on a GIS dBase IV dbf file and was unable to do so until I amended (in a very ugly way) in2csv.py to add an additional --encoding-dbf command-line argument (which mirrors --encoding-xls in passing an encoding argument to agate.Table).
Since this is probably not a very promising solution (I imagine in the future an --encoding-xxx for each possible data file), wouldn't it be better to just have an --input-encoding parameter? Would that break anything else?
A DBF file contains a value of '0.00**' in an int field, and when I read it I get the following error. I understand that the problem is partly due to data impurity in the DBF files I am using, but thought you'd like to be aware of the issue.
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\dbf.py", line 310, in _iter_records
for field in self.fields]
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\dbf.py", line 310, in
for field in self.fields]
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\field_parser.py", line 75, in parse
return func(field, data)
File "C:\Continuum\Anaconda3\lib\site-packages\dbfread\field_parser.py", line 164, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'**'
Hello, and thank you so much for the super useful library!
I'm trying to import data from a very old and buggy VFP application. The DB is so dirty it even has \x00 characters in a numeric row. This causes dbfread to try to interpret it as a float, which fails as such:
Traceback (most recent call last):
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 168, in parseN
return int(data)
ValueError: invalid literal for int() with base 10: b'\x00'
During handling of the above exception, another exception occurred:
...
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Python36-32\lib\site-packages\dbfread\field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'\x00'
I'm not worried about trying to retrieve useful data from that row - it's empty anyway. The problem is I don't see a way to skip that row during parsing without monkey-patching dbfread. Can you please give any guidance on this matter?
I suppose worst case scenario, I could delete the row from the input file somehow without using dbfread... no idea how, though.
Many thanks!
Oh, my code is basically just iterating through the entire dbf: a for row in dbf_file kind of thing.
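One way to skip such rows without monkey-patching is dbfread's parserclass hook. The cleanup logic below is a guess at what this data needs (NUL padding, '*' padding, comma decimals); the dbfread wiring is left in comments so the sketch stays self-contained:

```python
def clean_numeric(data: bytes):
    """Parse a DBF 'N' field, returning None for blank or NUL-padded
    values instead of raising ValueError."""
    data = data.replace(b'\x00', b'').strip().strip(b'*')
    if not data:
        return None
    try:
        return int(data)
    except ValueError:
        # some writers use ',' as the decimal separator
        return float(data.replace(b',', b'.'))

assert clean_numeric(b'\x00') is None           # the row that crashed
assert clean_numeric(b'60.00\x00\x00') == 60.0  # NUL-padded float
assert clean_numeric(b'  42') == 42

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class TolerantFieldParser(FieldParser):
#       def parseN(self, field, data):
#           return clean_numeric(data)
#   for row in DBF('file.dbf', parserclass=TolerantFieldParser):
#       ...
```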
It is not clear whether, when you do
for row in DBF('xxx.dbf'):
    ...
the rows are accessed/read in sorted order or in random order.
Hello, thanks for your work.
I am trying to read a dbf file, but halfway through I get this error.
The memo fields are probably the cause; in this case it would be the email field.
It crashes at line 685 of CLI_ENTI.DBF.
Do you have time to help me? Thanks.
File "C:\Python39\lib\site-packages\dbfread\dbf.py", line 314, in _iter_records
items = [(field.name,
File "C:\Python39\lib\site-packages\dbfread\dbf.py", line 315, in <listcomp>
parse(field, read(field.length))) \
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 157, in parseM
return self.decode_text(memo)
File "C:\Python39\lib\site-packages\dbfread\field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
File "C:\Python39\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 65: character maps to <undefined>
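One knob worth knowing: DBF accepts a char_decode_errors argument that is passed on to bytes.decode. A stdlib illustration of what 'replace' and 'ignore' do with the offending 0x8D byte (the dbfread call is a suggestion, not tested against this file):

```python
# 0x8D is undefined in cp1252; 'replace' substitutes U+FFFD and
# 'ignore' drops the byte, instead of raising UnicodeDecodeError:
sample = b'mail\x8d@example'
assert sample.decode('cp1252', errors='replace') == 'mail\ufffd@example'
assert sample.decode('cp1252', errors='ignore') == 'mail@example'

# Hypothetical: DBF('CLI_ENTI.DBF', char_decode_errors='replace')
```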
This will make it easier to write tools to analyze files, and to add support to the library for reading from an open file or pipe.
This is not something we need to support directly in the code, but it would be nice to have a documented way to create dataclasses from records.
I think it should be possible to create dataclasses dynamically by inspecting the DBF object, but perhaps it is best after all to create the dataclass manually.
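A minimal sketch of the dynamic route using dataclasses.make_dataclass; the field names are hard-coded stand-ins for what DBF(...).field_names would supply:

```python
from dataclasses import asdict, make_dataclass

# Hypothetical: with dbfread these would come from the table itself,
#   field_names = DBF('people.dbf', lowernames=True).field_names
field_names = ['name', 'birthdate']
Record = make_dataclass('Record', field_names)

# Each dbfread record is a mapping, so it unpacks straight in:
rec = Record(**{'name': 'Alice', 'birthdate': '1987-03-01'})
assert rec.name == 'Alice'
assert asdict(rec) == {'name': 'Alice', 'birthdate': '1987-03-01'}

# Hypothetical bulk conversion:
#   records = [Record(**r) for r in DBF('people.dbf', lowernames=True)]
```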
Hello,
I'm unsuccessfully trying to open dbf
from dbfread import DBF, FieldParser
class TestFieldParser(FieldParser):
    def parse00(self, field, data):
        print(field.name, data)
        return data

    def parseR(self, field, data):
        print(field.name, data)
        return data

dbf = DBF('c:/base/75.dbf', encoding='cp1251', parserclass=TestFieldParser)
for rec in dbf:
    pass
Traceback (most recent call last):
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\Transit.py", line 13, in
dbf = DBF('c:/base/75.dbf', encoding='cp1251', parserclass=TestFieldParser)
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\dbfread\dbf.py", line 123, in __init__
self._check_headers()
File "C:\Users\Dmitry\PycharmProjects\Transit\Lib\site-packages\dbfread\dbf.py", line 257, in _check_headers
raise ValueError(message.format(field.length))
ValueError: Field type I must have length 4 (was 0)
What does this error mean and how to fix it?
Thanks everyone
Traceback (most recent call last):
File "<ipython-input-129-655019aebcfc>", line 1, in <module>
for rec in dbf:
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\dbf.py", line 316, in _iter_records
for field in self.fields]
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\dbf.py", line 316, in <listcomp>
for field in self.fields]
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\field_parser.py", line 79, in parse
return func(field, data)
File "C:\Users\Rdebbout\AppData\Local\Continuum\Anaconda2\envs\cdi3\lib\site-packages\dbfread\field_parser.py", line 174, in parseN
return float(data.replace(b',', b'.'))
ValueError: could not convert string to float: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00
Through reading other issues I have found a way to fix the problem, but I'm not sure of the best way to implement it.
class TestFieldParser(FieldParser):
    def parseN(self, field, data):
        """Parse numeric field (N)

        Returns int, float or None if the field is empty.
        """
        # In some files * is used for padding.
        data = data.strip().strip(b'*')
        try:
            return int(data)
        except ValueError:
            if not data.strip():
                return None
            elif isinstance(data, (bytes, bytearray)):  # I added these 2 lines
                return int.from_bytes(data, byteorder='big', signed=True)
            else:
                # Account for , in numeric fields
                return float(data.replace(b',', b'.'))
This works in my instance, but there may be a better way to implement it. I'm not sure of the best way to check whether the data object is a byte literal; I had also made the comparison as:
data == b'\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Hope this helps
Some files created with FoxPro 9 can not be opened with dbfread.DBF("file.dbf")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/dbfread/dbf.py", line 120, in __init__
self._check_headers()
File "~/dbfread/dbf.py", line 259, in _check_headers
raise ValueError('Unknown field type: {!r}'.format(field.type))
ValueError: Unknown field type: 'V'
It seems the DBF format was slightly updated:
https://msdn.microsoft.com/en-us/library/st4a0s68%28VS.80%29.aspx
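Per that documentation, 'V' is Visual FoxPro's Varchar type, which stores text; until the library learns it, a parserclass that treats it like a character field might do. The helper below mirrors what a parseC-style parser does, with the dbfread wiring left as hedged comments:

```python
def parse_varchar(data: bytes, encoding: str = 'cp1252') -> str:
    """Decode a Varchar ('V') payload like a character field:
    strip trailing NUL/space padding, then decode."""
    return data.rstrip(b'\x00 ').decode(encoding)

assert parse_varchar(b'hello \x00\x00') == 'hello'

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class VFP9FieldParser(FieldParser):
#       def parseV(self, field, data):
#           return self.parseC(field, data)  # reuse character parsing
#   table = DBF('file.dbf', parserclass=VFP9FieldParser)
```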
I am trying to open a simple dBase III file and I am getting:
File "C:/Users/user/Documents/Python/SAG/dbftest.py", line 4, in <module>
test = DBF('SystemRecord.dbf')
File "C:\Python36-32\lib\site-packages\dbfread\dbf.py", line 122, in __init__
self._read_field_headers(infile)
File "C:\Python36-32\lib\site-packages\dbfread\dbf.py", line 224, in _read_field_headers
field = DBFField.unpack(sep + infile.read(DBFField.size - 1))
File "C:\Python36-32\lib\site-packages\dbfread\struct_parser.py", line 36, in unpack
items = zip(self.names, self.struct.unpack(data))
struct.error: unpack requires a bytes object of length 32
I'm assuming there is some field/index in my file that is larger/shorter than some maximum.
Any ideas what's causing this?
Currently, the latest version of dbfread available on PyPI is 2.0.7 (from November 2016). That version does not register its supported versions of Python and it doesn't include any trove classifiers. It would be great to make the newest version available via pip install --upgrade dbfread.
Also, I think it could be helpful to add a check-list that could be used when publishing releases. This is something I do for my own projects and it helps formalize the process and reduce mistakes.
If you like the idea, I can submit it as a pull request. But for now, here is a release-checklist.rst file I would suggest as a good starting point:
Release Checklist
=================
#. In ``dbfread/version.py``, make sure the correct version number is defined
for this release.
#. Make sure that information about supported Python versions is consistent:
* In the call to ``setup()``, check the versions defined by the
*python_requires* argument (see the "Version specifiers" section of
PEP-440 for details).
* In the call to ``setup()``, check the trove classifiers in the
*classifiers* argument (see https://pypi.org/classifiers/ for values).
* In ``README.rst``, check the versions listed in the "Main Features" and
"Installing" sections.
* In ``docs/installing.rst``, check the versions listed in the
"Requirements" section.
#. Make sure the *description* argument in ``setup.py`` matches the project
description on GitHub (in the "About" section).
#. Check that *packages* argument of ``setup()`` is correct. Check that the
value matches what ``setuptools.find_packages()`` returns:
>>> import setuptools
>>> sorted(setuptools.find_packages('.', exclude=['tests']))
Defining this list explicitly (rather than using ``find_packages()``
directly in ``setup.py`` file) is needed when installing on systems
where ``setuptools`` is not available.
#. Make final updates to ``docs/changes.rst`` file.
#. Commit and push final changes to the upstream development repository:
Prepare version info, documentation, and README for version N.N.N release.
#. In the upstream repository, make sure that all of the tests and checks
are passing.
#. Make sure the packaging tools are up-to-date:
pip install -U twine wheel setuptools check-manifest
#. Check the manifest against the project's root folder:
check-manifest .
#. Remove any existing files from the ``dist/`` folder.
#. Build new distributions:
python setup.py sdist bdist_wheel
#. Upload distributions to TestPyPI:
twine upload --repository testpypi dist/*
#. View the package's web page on TestPyPI and verify that the information
is correct for the "Project links" and "Meta" sections:
* https://test.pypi.org/project/dbfread
If you are testing a pre-release version, make sure to use the URL returned
by twine in the previous step (the default URL shows the latest *stable*
version).
#. Test the installation process from TestPyPI:
python -m pip install --index-url https://test.pypi.org/simple/ dbfread
If you're testing a pre-release version, make sure to use the "pip install"
command listed at the top of the project's TestPyPI page.
#. Upload source and wheel distributions to PyPI:
twine upload dist/*
#. Double check PyPI project page and test installation from PyPI:
python -m pip install dbfread
#. Make sure the documentation version reflects the new release:
* https://dbfread.readthedocs.io/
If the documentation was not automatically updated, you may need to
login to https://readthedocs.org/ and start the build process manually.
#. Publish update announcement to relevant mailing lists:
* [email protected]
Hi,
when I was trying to read a dbf file, I found that when the dbf file has a wrong field type definition, the data cannot be read correctly.
E.g. blfm2 is actually filled with floats, but it was mistakenly defined as 'C' type. It is parsed with the following method (dbf.py line 122):
self._read_field_headers(infile)
def _read_field_headers(self, infile):
    while True:
        sep = infile.read(1)
        if sep in (b'\r', b'\n', b''):
            # End of field headers
            break

        field = DBFField.unpack(sep + infile.read(DBFField.size - 1))
        field.type = chr(ord(field.type))

        # For character fields > 255 bytes the high byte
        # is stored in decimal_count.
        if field.type in 'C':
            field.length |= field.decimal_count << 8
            field.decimal_count = 0

        # Field name is b'\0' terminated.
        field.name = self._decode_text(field.name.split(b'\0')[0])
        if self.lowernames:
            field.name = field.name.lower()

        self.field_names.append(field.name)
        self.fields.append(field)
It should produce a DBFField like:
DBFField(name='blfm2', type='C', address=196, length=17, decimal_count=12,
But it turns out to be:
DBFField(name='blfm2', type='C', address=196, length=3072, decimal_count=0
I think the problem is here:
# For character fields > 255 bytes the high byte
# is stored in decimal_count.
if field.type in 'C':
    field.length |= field.decimal_count << 8
    field.decimal_count = 0
After I removed these lines, I finally got the correct result.
Hi! I tried your library to open a dbf file but get the following error:
import dbfread
file_ = '/home/gijs/.wine/drive_c/OLMSoft/BrouwVisie/Data/Recepten.Dbf'
t = dbfread.dbf.DBF(file_, load=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-e461b614b7c2> in <module>()
----> 1 t = dbfread.dbf.DBF(file_, load=True)
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in __init__(self, filename, encoding, ignorecase, lowernames, parserclass, recfactory, load, raw, ignore_missing_memofile)
131
132 if load:
--> 133 self.load()
134
135 @property
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in load(self)
166 """
167 if not self.loaded:
--> 168 self._records = list(self._iter_records(b' '))
169 self._deleted = list(self._iter_records(b'*'))
170
/usr/local/lib/python2.7/dist-packages/dbfread/dbf.pyc in _iter_records(self, record_type)
308 items = [(field.name,
309 parse(field, read(field.length))) \
--> 310 for field in self.fields]
311
312 yield self.recfactory(items)
/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.pyc in parse(self, field, data)
73 raise ValueError('Unknown field type: {!r}'.format(field.type))
74 else:
---> 75 return func(field, data)
76
77 def parse0(self, field, data):
/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.pyc in parseN(self, field, data)
162 else:
163 # Account for , in numeric fields
--> 164 return float(data.replace(b',', b'.'))
165
166 def parseO(self, field, data):
ValueError: could not convert string to float: F
$ file Recepten.Dbf
Recepten.Dbf: FoxBase+/dBase III DBF, 1 record * 13734, update-date 115-8-16, codepage ID=0x37, with index file .MDX, at offset 8577 1st record " FMijn eig"
While opening a dbf I got this error:
ValueError: invalid date b'24/01/20'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/dbfread/field_parser.py in parseD(self, field, data)
91 try:
---> 92 return datetime.date(int(data[:4]), int(data[4:6]), int(data[6:8]))
93 except ValueError:
ValueError: invalid literal for int() with base 10: b'24/0'
The parser is apparently splitting the date at the wrong positions.
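The stock parser assumes the standard YYYYMMDD layout, while this file apparently stores DD/MM/YY text. A hedged helper for that layout (the century pivot is a guess), with the parserclass hookup in comments:

```python
import datetime

def parse_slashed_date(data: bytes) -> datetime.date:
    """Parse 'DD/MM/YY' dates; two-digit years below 70 are taken
    as 20xx (an assumption, adjust the pivot for your data)."""
    day, month, year = (int(part) for part in data.decode('ascii').split('/'))
    year += 2000 if year < 70 else 1900
    return datetime.date(year, month, day)

assert parse_slashed_date(b'24/01/20') == datetime.date(2020, 1, 24)

# Hypothetical wiring, with dbfread installed:
#   from dbfread import DBF, FieldParser
#   class SlashedDateParser(FieldParser):
#       def parseD(self, field, data):
#           return parse_slashed_date(data)
#   table = DBF('file.dbf', parserclass=SlashedDateParser)
```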
I have this error opening a DBF file:
Traceback (most recent call last):
File "hwmnucheck.py", line 486, in main
checker.loaddb()
File "hwmnucheck.py", line 301, in loaddb
self._loaddb_turni()
File "hwmnucheck.py", line 236, in _loaddb_turni
table = dbfread.DBF(self._db_turnimensa_f)
File "\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\dbfread\dbf.py", line 123, in __init__
self._check_headers()
File "WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\dbfread\dbf.py", line 265, in _check_headers
raise ValueError('Unknown field type: {!r}'.format(field.type))
ValueError: Unknown field type: '\x00'
The way tests are run is not working anymore.
This will require internal changes to the library, and perhaps some API changes.
Relevant issues and pull requests:
I suspect the best solution to handle all of these cases is to break most of the code into functions that can be composed into different APIs. For example, there could be a function that reads the headers from a file object, and a generator that reads records from a file object, and we could write a new backwards-compatible DBF class or another, leaner API on top of these.
These smaller functions would also be useful when writing debugging tools.
I ran a benchmark dumping a DBF to CSV 1,000 times using dbfread 2.0.7 and Pandas 0.24.1 and comparing it to https://github.com/yellowfeather/dbf2csv.
The file I used was a 1.5 MB DBF in 'FoxBASE+/Dbase III plus, no memory' format with 4,522 records and 40 columns made up of 36 numeric fields, 2 char fields and 2 date fields. I used CPython 2.7.12 for dbfread and I compiled dbf2csv using GCC 5.4.0 and libboost v1.58.0. I ran the test on a t3.2xlarge instance on AWS.
The disk can be written to at 115 MB/s according to:
dd if=/dev/zero of=./test bs=1G count=1 oflag=dsync
dbfread and Pandas managed to write the CSV 1,000 times in 26 minutes while dbf2csv took just under 74 seconds, a 21x difference in performance.
These were the steps I took during the benchmark:
$ vi run_1
for i in {1..1000}; do
./dbf2csv -c FILE.DBF > FILE.csv
done
$ time bash run_1 # 1m13.699s, 21x faster
$ vi run_2.py
from dbfread import DBF
import pandas as pd
dbf = DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')
pd.DataFrame(iter(dbf)).to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')
$ vi run_2
for i in {1..1000}; do
python run_2.py
done
$ time bash run_2 # 25m59.877s
The DBF would have sat in the page cache but nonetheless represented 1,489 MB of data read. The resulting CSV represented 638 MB. So this works out to 0.95 MB/s read and 0.41 MB/s written.
Do you know of any way I could see improved performance in this workload using dbfread?
Would it make sense for dbfread to support compressed DBF files (DBC files)?
It would be similar to what the read.dbc package does in R. I couldn't find a package that does the same in Python. I've tried dbfread and it currently only reads uncompressed DBF files.
If you need DBC files for testing, there are lots of them at the DATASUS website (official public healthcare system statistics from Brazil).
I'm having problems with tables containing unicode. CNTR_AT_2013 comes from the Eurostat db. Below is my attempt to read the data and what is expected (parsed via the Unix dbfview tool).
The problem occurs in both Python 2.7 and 3.5.
>>> for record in dbfread.DBF('CNTR_AT_2013.dbf'):
...     print record
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 316, in _iter_records
for field in self.fields]
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
>>> table = dbfread.read("CNTR_AT_2013.dbf")
/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py:47: UserWarning: dbfread.read() has been replaced by DBF(load=True) and will be removed in 2.2.
warnings.warn("dbfread.read() has been replaced by DBF(load=True)"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py", line 49, in read
return DeprecatedDBF(filename, load=True, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 136, in __init__
self.load()
File "/usr/local/lib/python2.7/dist-packages/dbfread/deprecated_dbf.py", line 18, in load
self[:] = self._iter_records(b' ')
File "/usr/local/lib/python2.7/dist-packages/dbfread/dbf.py", line 316, in _iter_records
for field in self.fields]
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 79, in parse
return func(field, data)
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 87, in parseC
return self.decode_text(data.rstrip(b'\0 '))
File "/usr/local/lib/python2.7/dist-packages/dbfread/field_parser.py", line 45, in decode_text
return decode_text(text, self.encoding, errors=self.char_decode_errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Cntr id : QN
Cntr name : Puerto Rico and Virgin Islands of the United States
Name asci : Puerto Rico and Virgin Islands of the United States
Name engl : Puerto Rico and Virgin Islands of the United States
Name fren : PUERTO RICO ET LES ÎLES VIERGES DES ÉTATS-UNIS
Poli org c : 99.00000000
Name gaul :
Iso3 code :
Svrg un : US Territory
Capt :
Cntr code : UA
Eu stat : F
Efta stat : F
Cc stat : F
Cntr id : QO
Cntr name : Guadeloupe and Martinique
Name asci : Guadeloupe and Martinique
Name engl : GUADELOUPE AND MARTINIQUE
Name fren : GUADELOUPE ET MARTINIQUE
Poli org c : 99.00000000
Name gaul :
Iso3 code :
Svrg un : FR Territory
Capt :
Cntr code : UA
Eu stat : F
Efta stat : F
Cc stat : F
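The undecodable byte 0xC3 at the start of a multi-byte run is the classic UTF-8 signature ('Î' encodes as 0xC3 0x8E), and dbfread falls back to ASCII when the file doesn't declare a code page. A quick stdlib check of that guess, with the assumed fix in a comment:

```python
# The French name in the expected output contains 'Î' and 'É';
# their UTF-8 encodings start with the failing byte 0xC3:
name = 'PUERTO RICO ET LES \u00ceLES VIERGES DES \u00c9TATS-UNIS'
raw = name.encode('utf-8')
assert b'\xc3\x8e' in raw  # 'Î'
assert b'\xc3\x89' in raw  # 'É'
assert raw.decode('utf-8') == name

# Hypothetical fix: dbfread.DBF('CNTR_AT_2013.dbf', encoding='utf-8')
```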
Hi, I've been using this and it generally supports most FoxPro .dbf files, but there is one file that can't be read for some reason; it says "unpack requires a string argument of length 32". What might cause this?
Here is the file: https://drive.google.com/file/d/0BweenIzZNEAtSWdoM1M1d3Z5YjQ/view?usp=sharing
UnicodeEncodeError: 'utf-8' codec can't encode characters. I get this error when using utf-8 encoding.
The file has a value like this:
ABCâ XYZ 123
dbf = DBF(input_path, encoding="utf-8")
When I use encoding="iso-8859-1" I don't get an error, but the value gets converted to:
ABCÂ XYZ 123
dbf = DBF(input_path, encoding="iso-8859-1")