This project was carried out as part of a Data Engineering unit at ESIEE Paris. The goal was to collect data, store it in databases, and produce value from it through graphical interpretation or a search engine.
The project, called "TrouveTonStage", aims to make searching for an internship simpler and more personalized. It consists of a few scrapers that first gather information about job offers from different websites. The data is stored in CSV files, then loaded into an ElasticSearch database. A search engine built with Flask sits on top of it and provides an efficient way to browse all these job offers according to chosen criteria.
For technical details, see the readme file in the 'scraper' folder, which contains all the scrapers used.
If you want to use this project, first clone the repo:
$ git clone https://github.com/Leralix/TrouveTonStage.git
The scraping part is inside the 'scraper' folder.
To run the scraping part, you first have to install the additional packages listed in "requirements.txt":
$ python -m pip install -r requirements.txt
The application part is inside the 'app' folder. To install it correctly, you need Docker, because the application uses an ElasticSearch image. There are two ways to run the backend: either run a plain ElasticSearch image and run the program locally, or run the docker-compose file.
To run the project, first execute 'main.py' inside the "scraper" folder. It runs the scripts that gather information about job offers.
Inside 'main.py' you can modify the delay between page loads, the number of pages to scrape, the output files, and other settings.
Once the data is gathered, the script cleans it and produces an output CSV.
You can run it with:
$ python main.py
Edit: Be careful, the process can take quite a while to run entirely, so make sure you have enough time or scrape only a few pages.
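To illustrate how the page count and the delay between page loads bound the running time, here is a minimal sketch of a paginated scraping loop. It is not the project's actual code: `scrape_pages`, `fetch_page`, `max_pages`, `delay_seconds`, and `fake_fetch` are hypothetical names chosen for this example, and the real settings live in 'main.py'.

```python
import time
from typing import Callable, Dict, List


def scrape_pages(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    max_pages: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Collect offers from at most `max_pages` pages, pausing between loads."""
    offers: List[Dict[str, str]] = []
    for page in range(1, max_pages + 1):
        page_offers = fetch_page(page)
        if not page_offers:
            # Stop early when a page returns nothing (end of the listing).
            break
        offers.extend(page_offers)
        if page < max_pages:
            # Wait between page loads to avoid hammering the website.
            sleep(delay_seconds)
    return offers


def fake_fetch(page: int) -> List[Dict[str, str]]:
    """Stand-in for a real scraper (assumption): one made-up offer per page."""
    return [{"title": f"Offer {page}", "company": "Example Corp"}]


if __name__ == "__main__":
    # Two pages with a short delay keep the run quick.
    print(scrape_pages(fake_fetch, max_pages=2, delay_seconds=0.1))
```

With this shape, the total waiting time is roughly `delay_seconds` multiplied by the number of pages minus one, so lowering either value shortens the run.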
The scraping process is designed so that every time 'main.py' is executed, it stores the information in a single CSV file that keeps growing with new data.
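The append behaviour described above can be sketched with the standard `csv` module. This is only an illustration under assumptions: the column names in `FIELDNAMES` and the function `append_offers` are hypothetical, and the project's scrapers may write their CSV differently.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical column names, chosen only for this example.
FIELDNAMES: List[str] = ["title", "company", "location", "url"]


def append_offers(csv_path: Path, offers: Iterable[Dict[str, str]]) -> int:
    """Append offers to `csv_path`, writing the header only for a new file."""
    is_new_file = not csv_path.exists() or csv_path.stat().st_size == 0
    written = 0
    # Mode "a" adds rows at the end instead of overwriting earlier runs.
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, extrasaction="ignore")
        if is_new_file:
            writer.writeheader()
        for offer in offers:
            writer.writerow(offer)
            written += 1
    return written
```

Calling `append_offers` once per run leaves a single file with one header row followed by the rows of every run.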
The 'app' folder contains everything needed to run the backend. The 'data' folder contains the CSV that will be loaded into ES, so make sure you transfer the CSV collected by the scraper to the 'app' folder.
Make sure Docker is running, then go inside the 'app' folder and type:
$ docker-compose up -d
This runs the docker-compose file. Wait a few seconds, the time it takes to load all the data from the CSV into ES, then open port :5000 to see the Flask page.
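As a rough idea of what loading the CSV into ES involves, the sketch below turns CSV rows into bulk-indexing actions. It is an assumption about how the load could be done, not the project's actual code: the index name `offers`, the function `csv_to_actions`, and the sample columns are hypothetical. The resulting dictionaries have the `_index` / `_source` shape accepted by `elasticsearch.helpers.bulk` from the official Python client, which is deliberately not called here so the example runs without a cluster.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

INDEX_NAME = "offers"  # hypothetical index name


def csv_to_actions(handle: TextIO, index_name: str = INDEX_NAME) -> Iterator[Dict[str, object]]:
    """Yield one bulk-indexing action per CSV row."""
    for row in csv.DictReader(handle):
        # Drop empty cells so they are not indexed as empty strings.
        source = {key: value for key, value in row.items() if value}
        yield {"_index": index_name, "_source": source}


if __name__ == "__main__":
    # Made-up sample data standing in for the scraper's CSV.
    sample = io.StringIO("title,company\nData intern,Example Corp\n")
    for action in csv_to_actions(sample):
        print(action)
```

Generating the actions lazily keeps memory use flat even when the CSV has grown over many scraping runs.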
The documentation was produced with Sphinx and is contained in the 'docs' folder.
If the link doesn't work, you can go to
docs/build/html
and open the "index.html" file inside; you should then be able to view the documentation without any problem.
- The docstrings were written with the help of the "Mintlify Doc Writer" plugin in PyCharm.
- The dockerisation of the 'app' could be changed to use a volume instead of keeping everything local.