GithubHelp home page GithubHelp logo

tirthajyoti / pydbgen Goto Github PK

View Code? Open in Web Editor NEW
293.0 11.0 58.0 703 KB

Random dataframe and database table generator

Home Page: https://pydbgen.readthedocs.io/en/latest/

License: MIT License

Python 53.66% Jupyter Notebook 46.34%
database random-generation pandas-dataframe sqlite3 sqlite python synthetic-data data-science generator data-generation

pydbgen's Introduction

Random database/dataframe generator

Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA

Introduction

Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?

After all, databases break every now and then and it is safest to practice with a randomly generated one :-)

image

While it is easy to generate random numbers or simple words for Pandas or dataframe operation learning, it is often non-trivial to generate full data tables with meaningful yet random entries of most commonly encountered fields in the world of database, such as

  • name,
  • age,
  • birthday,
  • credit card number,
  • SSN,
  • email id,
  • physical address,
  • company name,
  • job title,

This Python package generates a random database TABLE (or a Pandas dataframe, or an Excel file) based on user's choice of data types (database fields). User can specify the number of samples needed. One can also designate a "PRIMARY KEY" for the database table. Finally, the TABLE is inserted into a new or existing database file of user's choice.

image

Dependency and Acknowledgement

At its core, pydbgen uses Faker as the default random data generating engine for most of the data types. Original function is written for few data types such as realistic email and license plate. Also the default phone number generated by Faker is free-format and does not correspond to US 10 digit format. Therefore, a simple phone number data type is introduced in pydbgen. The original contribution of pydbgen is to take the single data-generating function from Faker and use it cleverly to generate Pandas data series or dataframe or SQLite database tables as per the specification of the user. Here is the link if you want to look up more about Faker package,

Faker Documentation Home

Installation

(On Linux and Windows) You can use pip to install pydbgen:

pip install pydbgen

(On Mac OS), first install pip, :

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.

Usage

Current version (1.0.0) of pydbgen comes with the following primary methods,

  • gen_data_series()
  • gen_dataframe()
  • gen_table()
  • gen_excel()

The gen_table() method allows you to build a database with as many tables as you want, filled with random data and fields of your choice. But first, you have to create an object of pydb class:

myDB = pydbgen.pydb()

gen_data_series()

Returns a Pandas series object with the desired number of entries and data type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Phone number can be of two types:

  • phone_number_simple generates 10 digit US number in xxx-xxx-xxxx format
  • phone_number_full may generate an international number with different format

Code example:

se=myDB.gen_data_series(data_type='date')
print(se)

0    1995-08-09
1    2001-08-01
2    1980-06-26
3    2018-02-18
4    1972-10-12
5    1983-11-12
6    1975-09-04
7    1970-11-01
8    1978-03-23
9    1976-06-03
dtype: object

gen_dataframe()

Generates a Pandas dataframe filled with random entries. User can specify the number of rows and data type of the fields/columns.

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN

* Company, Job title, phone number, license plate Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.

Code example:

testdf=myDB.gen_dataframe(
25,fields=['name','city','phone',
'license_plate','email'],
real_email=True,phone_simple=True
)

gen_table()

Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. User can specify various data types to be included as database table fields.

All data types (fields) in the SQLite table will be of VARCHAR type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • db_file: Name of the database where the TABLE will be created or updated. Default database name will be chosen if not specified by user.
  • table_name: Name of the table, to be chosen by user. Default table name will be chosen if not specified by user.
  • primarykey: User can choose a PRIMARY KEY from among the various fields. If nothing specified, the first data field will be made PRIMARY KEY. If user chooses a field, which is not in the specified list, an error will be thrown and no table will be generated.

Code example:

myDB.gen_table(
20,fields=['name','city','job_title','phone','company','email'],
    db_file='TestDB.db',table_name='People',
primarykey='name',real_city=False
)

gen_excel()

Attempts to create an Excel file using Pandas excel_writer function. User can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN

* Company, Job title, phone number, license plate Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • filename: Name of the Excel file to be created or updated. Default file name will be chosen if not specified by user.

Code example:

myDB.gen_excel(15,fields=['name','year','email','license_plate'],
    filename='TestExcel.xlsx',real_email=True)

Other auxiliary methods available

Few other auxiliary functions available in this package.

pydbgen's People

Contributors

andypayne avatar chrissoulierserv avatar dinis-it avatar havanagrawal avatar horaddrim avatar restarac avatar tirthajyoti avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pydbgen's Issues

Install on py 3 -

Running in Jupyter notebook / Ipython 3 shell:

!pip3 install pydbgen
import pydbgen

pydbgen.pydb()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c22bf096fc05> in <module>
----> 1 myDB = pydbgen.pydb()

AttributeError: module 'pydbgen' has no attribute 'pydb'

Also:

dir(pydbgen)

['__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__']

Fix is:

from pydbgen import pydbgen

Data Generation Issue

TypeError: Calling .seed() on instances is deprecated. Use the class method Faker.seed() instead.
I am getting this error when I am generating the data
image

FileNotFoundError in db.realistic_email()

I've installed pydbgen using pip into venv and received FileNotFoundError for realistic email generator:

>>>  import pydbgen
>>> from pydbgen import pydbgen
>>> myDB=pydbgen.pydb()

>>> for _ in range(10):
...     print(db.realistic_email('Tirtha Sarkar'))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/dr/devel/pydbgen/venv/lib/python3.6/site-packages/pydbgen/pydbgen.py", line 99, in realistic_email
    fh = open(path)
FileNotFoundError: [Errno 2] No such file or directory: 'Domains.txt'

Version:

โ”€$ venv/bin/pip freeze | grep pydbgen
pydbgen==1.0.5

Error

#from pydbgen import pydbgen
myDB=pydbgen.pydb()
testdf= myDB.gen_data_series(data_type='data')
print(testdf)

=====================

Traceback (most recent call last):
File "D:/python/Demopycharm/Test_month/z2020_01_month/Data-collect.py", line 40, in
testdf= myDB.gen_data_series(data_type='data')
File "D:\python\Demopycharm\Myspider\venv\lib\site-packages\pydbgen\pydbgen.py", line 191, in gen_data_series
fake.seed(self.seed)
File "D:\python\Demopycharm\Myspider\venv\lib\site-packages\faker\proxy.py", line 79, in getattribute
raise TypeError(msg)
TypeError: Calling .seed() on instances is deprecated. Use the class method Faker.seed() instead.

cannot create a db table

when i am about to create a new db table ,it always throw an exception like Cannot choose from an empty sequence

FileNotFoundError 'Domains.txt'

while generating datafram getting below error

testdf=myDB.gen_dataframe(25,fields=['name','city','phone','license_plate','email'],real_email=True,phone_simple=True)
Error

fh = open(path)

FileNotFoundError: [Errno 2] No such file or directory: 'Domains.txt'

Phone number column not populated

When passing array containing only 'phone' into fields parameter for gen_excel function, column title but no data are present in the spreadsheet. Passing single column type seems to work fine for other data types.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.