mlbelobraydi / txrrc_data_harvest
Script for accessing and organizing oil and gas well data from the Texas Railroad Commission
License: The Unlicense
Describe the bug
The TXRRC is no longer using an FTP server, so the documentation and code are out of date.
To Reproduce
Attempt to download or connect to any file on the FTP server; the connection fails.
Expected behavior
Connections should succeed.
Additional context
It might be good to have a config file that points to the file locations, so that a single change there cascades to all code that depends on them.
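A minimal sketch of that idea, assuming a hypothetical config.json; the key names and URLs below are placeholders for illustration, not the RRC's actual locations:

```python
import json

def load_config(path="config.json"):
    """Read the data-location config so scripts never hard-code URLs.

    Example config.json contents (placeholder URLs):
    {
        "wellbore_ebc": "https://example.org/dbf900.ebc.gz",
        "oil_ledger_ebc": "https://example.org/olf001l.ebc.gz"
    }
    """
    with open(path) as f:
        return json.load(f)
```

Every script would then ask the config for its source path instead of embedding the old FTP address, so a future move by the RRC only requires editing one file.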
Polars is a Python dataframe library that is faster than pandas and better suited for very large data. It would be awesome to get native Polars support in this project!
The formats, layout, and main modules need to be adjusted to work with bytes.
Now that the definitions are complete, a notebook needs to be created to test the process of turning the .ebc file into usable data that can be formatted as JSON or SQL tables. This task is to create a prototype of that process in a notebook.
An initial notebook and working file have been created for the oil and gas layouts. These need to be vetted and tested to ensure the values pulled from their respective files are correct.
Doing this may take at least a bit of refactoring into a normal Python module structure, renaming some files, etc. Overall, it would be an upgrade to this project though!
Thanks for making and maintaining this wonderful project!
Is your feature request related to a problem? Please describe.
The production data has several fields that are COMP-3 (packed decimal).
Describe the solution you'd like
In a packed field, the number of bytes is smaller than the number of digits it encodes (two digits per byte, plus a sign nibble). The pic_signed function does not account for this, so an additional function needs to be created.
Describe alternatives you've considered
I tried to find a way to modify pic_signed, but it isn't possible; COMP-3 will require a new function.
Additional context
Information on Comp-3 can be found here
http://www.3480-3590-data-conversion.com/article-packed-fields.html
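As a hedged sketch, based on the packed-field description in the article above (not the project's final code), a COMP-3 decoder could look like this:

```python
def unpack_comp3(raw: bytes, decimal: int = 0):
    """Decode an IBM COMP-3 (packed decimal) field.

    Every nibble except the last holds one decimal digit; the final
    nibble is the sign (0xD = negative, 0xC = positive, 0xF = unsigned).
    `decimal` is the number of implied decimal places from the copybook.
    """
    digits = []
    for byte in raw:
        digits.append(byte >> 4)     # high nibble
        digits.append(byte & 0x0F)   # low nibble
    sign_nibble = digits.pop()       # last nibble is the sign
    val = 0
    for d in digits:
        val = val * 10 + d
    if sign_nibble == 0xD:
        val = -val
    return val / 10 ** decimal if decimal else val
```

For example, the two bytes 0x12 0x3D unpack to digits 1, 2, 3 with a 0xD sign nibble, i.e. -123; this is why the byte count is always smaller than the digit count.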
Request for instructions on how to set up an environment to start helping with development. This should be in the wiki and include links to things like Anaconda, GitHub, and basic Python resources.
Identify the packages and methods that allow the raw file (e.g. ftp://ftpe.rrc.texas.gov/shfwba/dbf900.ebc.gz) to be read in Python for conversion to other formats or for manipulation.
Nice work so far! I was thinking it might be helpful to provide some documentation detailing the order in which the scripts should be run, as well as an overview of what each script does (outside of the comments in each notebook). This would make it easier for folks to pick it up and run with it. Looking forward to digging in and seeing what this is capable of.
Organization of Gas Production Layout
Is your feature request related to a problem? Please describe.
I wanted to generate Python structs for the COBOL copybooks and, for the computational numeric fields, emit the hex for signed/unsigned handling of specific fields. I am part way there, but I wanted to bring this to your attention to see if you think this would be useful. This way, no one would need to hand-code parsing of the structures.
Describe the solution you'd like
I would like to be able to use the copybook in a full COBOL program, parse the data division, and generate struct formats so that each section can be parsed directly in Python without hand-coding the parsing lengths, as seems to be the direction now. I am working on the Oil Ledger files, with the copybook defined in the Oil Ledger PDF.
Describe alternatives you've considered
I considered writing a copybook parser myself, but a Cobol84.g4 grammar file exists for ANTLR4, so I can just use that and generate a Listener to walk the symbol table and produce the struct formats.
Additional context
I am adding unit tests to make sure the code works as I tweak it.
I would like to integrate this into your repo and contribute to that.
My main interest is in parsing out as much oil/gas well data as possible so I can continue my machine learning project, which will look for aberrations in wells' production data over time.
Capturing the oil production layout
Creating a file that has the definitions of all 28 sections.
@skylerbast, I'll be pushing the bytes version of the code, and I'm not sure whether the values are signed or unsigned. It now captures the last digits of the value, but I'm not sure the section below is working correctly.
If the penultimate nibble == 0xD, then the number is negative. Otherwise,
it is either positive or unsigned.
val = (val * (-1 if signed_raw[-1] >> 4 == 0xD else 1)) / 10**decimal
Would it be possible to chat with you about how this is supposed to work?
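For context, here is a minimal, illustrative reimplementation of zoned-decimal decoding consistent with the sign test quoted above. The function name matches the project's pic_signed, but the body is an assumption based on the standard EBCDIC zoned format, not the repo's actual code:

```python
def pic_signed(raw: bytes, decimal: int = 0):
    """Decode an EBCDIC zoned-decimal (PIC S9...) field.

    Each byte's low nibble is one digit; the high nibble (zone) of the
    LAST byte carries the sign: 0xD means negative, 0xC or 0xF means
    positive/unsigned. That zone is the penultimate nibble of the whole
    field, which is what `signed_raw[-1] >> 4 == 0xD` is testing.
    """
    val = 0
    for byte in raw:
        val = val * 10 + (byte & 0x0F)
    sign = -1 if raw[-1] >> 4 == 0xD else 1
    return sign * val / 10 ** decimal if decimal else sign * val
```

So b"\xF1\xF2\xD3" (EBCDIC "12L", an overpunched 3) decodes to -123, while b"\xF1\xF2\xF3" ("123") decodes to +123.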
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Currently the script opens and decodes the entire file in memory. This can cause issues on systems with limited memory (<8 GB RAM). It may be better to read parts of the file and release memory as you go, keeping more memory free.
Describe alternatives you've considered
opening and reading by line
decoding as necessary
writing results to disk and not holding it in memory
Additional context
Any changes will need to be tested with limited memory.
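A sketch of the streaming alternative, assuming the .ebc files are fixed-length-record EBCDIC data shipped gzipped (record_len varies per layout, so it is a parameter here):

```python
import gzip

def iter_records(path: str, record_len: int):
    """Yield fixed-length records from a gzipped .ebc file one at a
    time, so only a single record is ever held in memory instead of
    the whole decoded file."""
    with gzip.open(path, "rb") as f:
        while True:
            record = f.read(record_len)
            if len(record) < record_len:  # EOF or trailing partial record
                break
            yield record
```

Downstream code can then decode and write each record (or a small batch) to disk before requesting the next, which keeps peak memory flat regardless of file size.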
Is your feature request related to a problem? Please describe.
Dates are not always formatted with valid numbers, which breaks datetime conversion. The original data needs to be preserved so that any useful information in the entry is available for manual correction.
Describe the solution you'd like
Preserve the original value along with the datetime conversion.
Describe alternatives you've considered
If we keep the nulls, the original data also needs to be added back in. Is it possible to parse and correct out-of-range months and days, with a flag column to distinguish actual vs. estimated dates?
Additional context
This is important for completions tracking: a month and/or year is better than nothing. DSTs need to be linked to the right open section.
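A minimal sketch of the flag-column idea, assuming dates arrive as YYYYMMDD strings with a plausible year (the function name and clamping policy are illustrative, not a decided design):

```python
import calendar
from datetime import date

def parse_rrc_date(raw: str):
    """Return (raw, parsed_date, estimated).

    The original string is always preserved. Out-of-range months/days
    are clamped to 1 so at least the year (and month, when valid)
    survive, and `estimated` marks the result as a best guess rather
    than an actual date.
    """
    year = int(raw[0:4])
    month = int(raw[4:6])
    day = int(raw[6:8])
    estimated = False
    if not 1 <= month <= 12:
        month, estimated = 1, True
    last_day = calendar.monthrange(year, month)[1]
    if not 1 <= day <= last_day:
        day, estimated = 1, True
    return raw, date(year, month, day), estimated
```

Loading all three values into columns (raw_date, parsed_date, is_estimated) would let completions tracking use the partial information while flagging rows that still need manual review.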