imdeepmind / processed-imdb-wiki-dataset

Processes IMDB WIKI dataset ready to be used in any projects

License: MIT License

Python 22.22% Jupyter Notebook 77.78%
machine-learning deep-learning computer-vision dataset imdb-wiki-dataset gender-classification age-classification

processed-imdb-wiki-dataset's Introduction

Processed IMDB WIKI Dataset

This GitHub repository contains a preprocessed version of the IMDB WIKI dataset.

Table of contents:

  • Introduction
  • IMDB WIKI Dataset
  • The Problem
  • The Solution
  • File Structure
  • How to Run Locally
  • Dependencies
  • Acknowledgments

Introduction

The IMDB WIKI dataset is the largest dataset of human faces labeled with gender, name, and age. In this project, I preprocessed the entire dataset so that it can be used easily, without any problems.

IMDB WIKI Dataset

The IMDB WIKI dataset is the largest publicly available dataset of human faces with gender, age, and name information. It contains more than 500,000 images along with all the meta information. All the images are in .jpg format.

For more information about the dataset, please visit the official IMDB-WIKI website.

The Problem

The dataset is great for research purposes: it contains more than 500,000 images of faces. But it is not ready for machine learning algorithms out of the box. It has several problems:

  • The images come in different sizes
  • Some images are completely corrupted
  • Some images don't contain any face
  • Some of the ages are invalid
  • The gender distribution is unbalanced (there are more male faces than female faces)
  • The meta information is stored in .mat format, and reading .mat files in Python is a tedious process
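To see why the last point matters, here is a small self-contained sketch that round-trips a tiny fake metadata struct through a .mat file. Every field comes back as a nested object array that needs manual unwrapping; the field names (`gender`, `full_path`) and file name are illustrative, mimicking the real wiki.mat, which has more fields such as dob, photo_taken, and face_score.

```python
import numpy as np
from scipy.io import loadmat, savemat

# Save a tiny fake "wiki" metadata struct, then read it back the way the
# real imdb.mat/wiki.mat files have to be read.
savemat("demo.mat", {"wiki": {
    "gender": np.array([1.0, 0.0]),
    "full_path": np.array(["a.jpg", "b.jpg"], dtype=object),
}})

meta = loadmat("demo.mat")
wiki = meta["wiki"][0, 0]                    # a struct comes back inside a 1x1 object array
genders = wiki["gender"].ravel().tolist()    # a 1-D vector comes back as a (1, 2) matrix
paths = [p.item() for p in wiki["full_path"].ravel()]  # a cell array of strings: unwrap each cell
print(genders, paths)
```

Every field needs its own unwrapping step, which is exactly the tedium the CSV conversion avoids.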

The Solution

In this project, I filter out the broken images, resize the rest to 128x128, remove the images with invalid ages, fix the gender-distribution imbalance, and save everything in a convenient format. Along with that, I've also processed the .mat files and converted them into .csv files.

File Structure

This repository contains three files:

  • mat.py
  • gender.py
  • age.py

The mat.py file converts the .mat metadata of the IMDB and WIKI datasets to .csv format and merges them into one file.

The other two files process the images for gender and age classification, respectively.

As the dataset is huge, I cannot upload it here on GitHub.

How to Run Locally

Following are the steps for running it locally:

  • Download the dataset from this link and unzip it
  • Extract the dataset and save it in the project directory
  • After that, you should have the following folders:
    • imdb_crop
    • wiki_crop
  • Run the mat.py file
  • Run the age.py and gender.py files
  • Now the dataset is preprocessed and ready for your project
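The steps above as a shell session. This is a sketch: the tarball URLs are the ones historically published on the IMDB-WIKI page and may have moved, and the downloads are several gigabytes.

```shell
# Download the cropped-face tarballs -- URLs assumed from the official
# IMDB-WIKI page; verify them before running
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki_crop.tar

# Extract into the project directory -> imdb_crop/ and wiki_crop/
tar -xf imdb_crop.tar
tar -xf wiki_crop.tar

# Convert the .mat metadata to CSV, then build the age/gender datasets
python mat.py
python age.py
python gender.py
```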

Dependencies

  • NumPy 1.15.4
  • SciPy 1.2.0
  • pandas 0.23.4
  • OpenCV (cv2) 4.0.0
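These pins can be captured in a requirements.txt. Note the PyPI package names differ from the import names; `cv2` ships as opencv-python, and 4.0.0.21 is assumed here as the nearest PyPI build of 4.0.0.

```text
numpy==1.15.4
scipy==1.2.0
pandas==0.23.4
opencv-python==4.0.0.21
```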

Acknowledgments

I am really thankful to the people who provided this amazing dataset.

processed-imdb-wiki-dataset's People

Contributors

imdeepmind

processed-imdb-wiki-dataset's Issues

Can you share "meta.csv" file?

Hello! At this moment, the imdb.mat file, available from the official IMDB-WIKI website, is corrupted. You said that you converted both (imdb.mat and wiki.mat) into a single meta.csv. Would you be so kind as to share this meta.csv with me? It would help me a lot with my MSc thesis... Thanks in advance.

Question about mat.py

Hello, great repository!

Anyway, I have a question regarding how the data is preprocessed.

In mat.py, you modified the dates from wiki, changing 00 in the month or the day to 01, but you skipped that process for imdb.

Can I know the reason behind this difference?

Thank you

Which dataset?

In the link you provided, there is more than one dataset; which one would be suitable for your project? By the way, it seems you did a great job; hopefully your project will be helpful.
(Note: I want to download the cropped and labeled data, 7 GB, for face recognition, but I am still not sure.)
Thanks

Getting error in age.py and gender.py

Hi, I am getting the error "error: (-215:Assertion failed) !ssize.empty() in function 'resize'" while running age.py or gender.py. By the way, when I wrapped the call in try/except, I ended up with no resized images at all in the dataset file.

Issue with the stacking

Everything runs perfect up to this point - any ideas what could cause this?

final_imdb = np.vstack((imdb_age, imdb_genders, imdb_path, imdb_face_score1, imdb_face_score2)).T

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    final_imdb = np.vstack((imdb_age, imdb_genders, imdb_path, imdb_face_score1, imdb_face_score2)).T
  File "<__array_function__ internals>", line 6, in vstack
  File "/Users/austintrombley/opt/anaconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 10 and the array at index 1 has size 460723
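np.vstack stacks 1-D arrays as rows, so every input must have the same length. The traceback above says one input has length 10 while another has 460,723: one of the five arrays (perhaps a leftover debug slice) is out of sync with the rest. A minimal reproduction with made-up data:

```python
import numpy as np

ages    = np.array([23, 31, 45])
genders = np.array([1, 0, 1])
paths   = np.array(["a.jpg", "b.jpg", "c.jpg"])

# All inputs have length 3, so vstack builds a 3x3 table after the transpose.
table = np.vstack((ages, genders, paths)).T
print(table.shape)                        # (3, 3)

# Truncate one input and the same ValueError as in the issue appears.
try:
    np.vstack((ages[:2], genders, paths))
except ValueError as e:
    print("ValueError:", e)
```

Printing `len()` of each of the five arrays right before the vstack call pinpoints which one is short.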

(-215:Assertion failed) !ssize.empty() in function 'cv::resize'

I get an error when running the wiki_image.py file. In the step that processes all the images and merges them with the meta, the code runs for the first 5 batches and then throws an error. Here is the traceback:

error                          Traceback (most recent call last)
<ipython-input> in <module>()
     37 img = cv2.imread(path, 1)
     38
---> 39 img = cv2.resize(img, (150,150))
     40
     41 img = img.flatten()

error: OpenCV(3.4.3) C:\projects\opencv-python\opencv\modules\imgproc\src\resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'cv::resize'
