thiagopanini / sparksnake

Improving the development of Spark applications deployed as jobs on AWS services like Glue and EMR

License: MIT License

Python 100.00%

aws emr glue pyspark python spark

sparksnake's Introduction

👋 Hi everyone and welcome to my GitHub profile!

Here you will see a lot of good stuff! I hope you enjoy it!

🤖 Programming Languages and Tools

🤝 My Open Source Projects

If I had to describe myself in one sentence, I would say I'm just a curious soul living in this big world and trying to share good things with others. So I'm really excited to share all the open source projects I have built to help others and to contribute to the open source community.

Python Packages

Terraform Modules

Others

🖊️ Certificates

As a lifelong learner, I'm always challenging myself to reach higher through certifications. Here are some badges I have earned on this journey.

📦 Main GitHub Repositories

📊 GitHub Stats

🤔 Some Things You Probably Don't Know About Me

✍️ I have a personal blog called panini-tech-lab where I write about Linux, Hadoop, Hive, Spark and other things
💪 I became an AWS Community Builder in 2023
☁️ I went to AWS re:Invent 2022 (Vegas, baby!)
🤖 I'm a "notebooks expert" at Kaggle
⚽ I love soccer and my team is Corinthians
🤯 I like mind-blowing movies. My top 3 are:
- Donnie Darko
- Inception
- Shutter Island
🏯 I'm a huge anime fan. It was very difficult to set a top 3, but here they are:
- Shingeki no Kyojin
- Full Metal Alchemist
- Darker than Black
- In fact I'm currently in love with the seinen style (Vagabond, Vinland Saga, Berserk)

📽️ Me in the Media

Well, I'm not a famous person at all, but let's say that I have already done some special partnerships with special people.

⚡ GitHub Recent Activity

⬆️ Pushed 1 commits to ThiagoPanini/panini-tech-stories

Last Updated: Saturday, June 1st, 2024, 11:32:01 PM

Note: All reference links used to create this profile can be found below. If you want to do something similar, you can count on me to help you :)

🔗 References and Links

Devs @ForrestKnight, @DenverCoder1, @CodeSTACKr, @rishavchanda, @rafaballerini, @arthurspk, @Lissy93, @gautamkrishnar
DenverCoder1/readme-typing-svg for the writing at the beginning of this page
DenverCoder1/custom-icon-badges for custom icons on badges (e.g. Hashnode and Itau)
devicons/devicon for tech icons and tools
DenverCoder1/github-readme-youtube-cards for YouTube cards
anuraghazra/github-readme-stats for profile stats and GitHub activity
Readme-Workflows/recent-activity for automating recent activity on GitHub

sparksnake's People

Contributors

Stargazers

Watchers

Forkers

edson-github rodrwankenobi

sparksnake's Issues

[FEATURE] Sequential SparkSQL query executor

🚀 Feature needed:
This feature request proposes a new method on the SparkETLManager class (and therefore available in the default operation mode) that lets users run multiple SparkSQL statements in sequence.

In other words, imagine a Spark application (or a Glue job on AWS) built using only SparkSQL queries. Such an application would have several SQL statements executed in a predefined order, one by one, until the final goal is achieved.

This feature request addresses that scenario by letting users define a JSON file with all the SparkSQL statements to run sequentially. Calling the new method would let users better structure their SQL steps and run them with a single line of code.

🏆 Feature benefits:
With this new feature, users would:

Improve the organization of a Spark application that runs mostly on SQL statements
Make such an application easier to maintain, since all the SQL statements would be structured in a single JSON file

📚 Complexity:
The best category that fits into the development of this new feature is:

High complexity

💡 Ideas on how to develop it:
To develop this new feature, it would be possible to:

Define a new method on the manager.SparkETLManager class
Use a JSON file to structure all the steps needed to run SparkSQL queries sequentially
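
The idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method that sparksnake ships: the function name `run_sql_steps`, the JSON layout (a top-level `steps` list whose items have `name` and `query` keys) and the choice to register each result as a temporary view are all hypothetical. The only Spark calls used are `SparkSession.sql()` and `DataFrame.createOrReplaceTempView()`.

```python
import json
from typing import Any, Dict


def run_sql_steps(spark: Any, steps_json: str) -> Dict[str, Any]:
    """Run SparkSQL statements one by one, in the order they are listed.

    `spark` is expected to behave like a SparkSession (it needs .sql()).
    `steps_json` is the text of a JSON document in a hypothetical layout:
    {"steps": [{"name": "...", "query": "..."}, ...]}
    """
    steps = json.loads(steps_json)["steps"]
    results: Dict[str, Any] = {}

    for step in steps:
        name, query = step["name"], step["query"]

        # Execute the statement for this step
        df = spark.sql(query)

        # Assumption: expose each result as a temp view so that
        # later statements can select from it by step name
        df.createOrReplaceTempView(name)
        results[name] = df

    return results
```

With a real session, the JSON text could be read from a file first, for example `run_sql_steps(spark, Path("steps.json").read_text())`, where `steps.json` is a placeholder file name.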

[DOC] Translate all the library documentation into English

📚 Target functions, methods or objects:
Code elements that need documentation improvements:

README.md
Documentation on readthedocs
All scripts and code

✏️ Additional details:
To make sparksnake a global project, an English-language entry point is essential for the project to gain strength and traction and to truly take off with the community.

[BUG] ModuleNotFoundError: No module named 'faker'

✍️ Problem description:
There is an issue when trying to import any module from the sparksnake library, caused by the Faker library not being declared as an explicit project dependency.

🐞 Reproducing the problem:
To reproduce the bug, take the following steps (example):

Import any module from sparksnake

⚙️ Expected behavior:
No exceptions raised.

💬 Possible solutions:
The solution for this bug would probably be reached through:

Adding Faker as a project dependency in the setup.py file (and also in requirements.txt for developers and contributors)

[DOC] Improve log messages and other documentation blocks from the library

📚 Target modules, classes or functions:
After a while using sparksnake, I noticed some typos in log messages and throughout the project documentation (docs page, docstrings, and others). This issue is a reminder to fix all those typos and improve all the docs.

[FEATURE] Create functions to help users to write test cases for Spark applications

🚀 Feature needed:
We all know that developing test cases for Spark applications (especially Glue jobs) is difficult. Based on that, this feature request asks the sparksnake developers to write new modules and functions that make it easier for others to create unit tests.

🏆 Feature benefits:
With this new feature, users would:

Create fake Spark DataFrame objects to use in unit tests
Develop unit tests for those new functions

📚 Complexity:
The best category that fits into the development of this new feature is:

Medium complexity
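
As an illustration of the kind of helper this request describes, the sketch below generates fake rows from a simple schema description. It is an assumption-laden example, not the sparksnake implementation: the name `fake_rows`, the `(column, type)` schema format and the supported type names are hypothetical, and it relies only on Python's standard library.

```python
import random
import string
from datetime import date, timedelta


def fake_rows(schema, n_rows=5, seed=42):
    """Build a list of tuples with random values for each column.

    `schema` is a list of (column_name, type_name) pairs, where type_name
    is one of the hypothetical labels handled below.
    """
    rng = random.Random(seed)  # seeded so tests are reproducible

    generators = {
        "string": lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),
        "int": lambda: rng.randint(0, 1000),
        "float": lambda: rng.uniform(0.0, 1000.0),
        "boolean": lambda: rng.random() < 0.5,
        "date": lambda: date(2020, 1, 1) + timedelta(days=rng.randint(0, 365)),
    }

    return [
        tuple(generators[type_name]() for _, type_name in schema)
        for _ in range(n_rows)
    ]


# With an active SparkSession, the rows could then become a DataFrame:
#   columns = [name for name, _ in schema]
#   df = spark.createDataFrame(fake_rows(schema), schema=columns)
```

Keeping the row generation free of Spark means the helper itself can be unit tested without starting a session.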

[FEATURE] Increase test coverage by including test cases for manager and glue modules

🚀 Feature needed:
This request covers increasing the project's test coverage by adding new test cases for the features in both the manager.py and glue.py modules. The manager module holds the SparkETLManager class, whose methods enhance Spark usage in both local mode and glue mode.

🏆 Feature benefits:
With this new feature, users would:

Ensure features are properly tested
Increase the project coverage

📚 Complexity:
The best category that fits into the development of this new feature is:

High complexity

💡 Ideas on how to develop it:
To develop this new feature, it would be possible to:

Create a new test script called test_manager_module.py
Add new fixtures, if applicable
Create fake Spark DataFrames to be used on tests

[FEATURE] Dynamic repartitioning in the `repartition_dataframe()` method

🚀 Feature needed:
The current behavior of the repartition_dataframe() method of the SparkETLManager class is:

Collect the current number of partitions of the target DataFrame via df.rdd.getNumPartitions()
Compare the current number of partitions with the expected number of partitions passed through the num_partitions parameter
If the current number is GREATER than the desired number, run the coalesce() method. If it is LOWER, run the repartition() method. If it is EQUAL, do nothing.
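
The three steps above can be sketched as follows. This is an illustration only, not the library's actual code: the function name `repartition_sketch` is hypothetical, and `df` is assumed to be a PySpark DataFrame. Only `df.rdd.getNumPartitions()`, `df.coalesce()` and `df.repartition()` are used.

```python
def repartition_sketch(df, num_partitions):
    """Return `df` with `num_partitions` partitions, choosing the cheaper call."""
    current = df.rdd.getNumPartitions()

    if current > num_partitions:
        # Reducing the partition count: coalesce() avoids a full shuffle
        return df.coalesce(num_partitions)

    if current < num_partitions:
        # Increasing the partition count: repartition() performs a full shuffle
        return df.repartition(num_partitions)

    # Already at the desired number of partitions: nothing to do
    return df
```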

In this scenario, the main difficulty for the user is undoubtedly knowing in advance the desired number of partitions that optimizes how the object is stored in the distributed system. In practice, this information can only be obtained through experimentation.

That said, it is worth considering an option to perform this optimization AUTOMATICALLY, that is, collecting not only the number of partitions of a DataFrame but also the size in bytes of each partition, so that an optimal number of partitions can be suggested for the repartitioning process according to an optimal block size (128MB or 256MB).

🏆 Feature benefits:
With this new feature, users would:

Optimize the storage of large volumes of data
Eliminate small files problems, in which write processes generate a high number of files with little data each
Avoid any kind of prior calculation or experimentation to reach the optimal number of partitions

📚 Complexity:
The best category that fits into the development of this new feature is:

High complexity

💡 Ideas on how to develop it:
To develop this new feature, it would be possible to:

Research ways to collect the partition size of a DataFrame
Perform a calculation based on the number of partitions and the total size of a DataFrame
Use an optimal block size per partition as the baseline
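
The calculation suggested in these ideas can be illustrated with a small sketch. It assumes the total size of the DataFrame in bytes is already known (how to measure it is left open by this issue), and the function name `suggest_num_partitions` is hypothetical. The 128MB default follows the block size mentioned in this issue.

```python
MB = 1024 * 1024


def suggest_num_partitions(total_size_bytes, target_partition_bytes=128 * MB):
    """Suggest a partition count so each partition is near the target size."""
    if total_size_bytes < 0:
        raise ValueError("total_size_bytes must not be negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")

    # Integer ceiling division: a partial block still needs its own partition
    needed = (total_size_bytes + target_partition_bytes - 1) // target_partition_bytes

    # Always keep at least one partition, even for an empty DataFrame
    return max(1, needed)
```

For example, a 1GB DataFrame with the default 128MB target gives 8 partitions, and the same DataFrame with a 256MB target gives 4.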

References:

[DOC] Improve the documentation page on readthedocs

📚 Target modules, classes or functions:
Project components that need documentation improvements:

Entire documentation page on readthedocs
Issue templates and PR templates

[FEATURE] Complete refactoring of the library's logic

🚀 Feature needed:
After a deep analysis of the library's future vision, it became clear that, in its current form, it is tightly coupled to the specific use of Glue on AWS. In other words, the way the library was built only allows its Spark features to be used in the Glue service, which rules out other ways of running Spark in the cloud, such as the EMR service.

The proposal in this issue is therefore to completely refactor the library's logic, mainly with regard to its two main classes (GlueJobManager and GlueETLManager). The idea is to make the features universal, allowing them to be used not only in Glue but also in other cloud services that use Spark as a data processing framework.

That said, the name gluesnake itself may no longer make sense.

Even though this refactoring is necessary, the way the library is consumed and documented can remain untouched.

🏆 Feature benefits:
With this new feature, users would:

Use the library's features in any AWS service that uses Spark (and not only Glue)

📚 Complexity:
The best category that fits into the development of this new feature is:

Extreme complexity

💡 Ideas on how to develop it:
To develop this new feature, it would be possible to:

Rename the library from gluesnake to sparksnake
Rename the GlueETLManager class to SparkETLManager
Move all Glue-specific features from the GlueETLManager class to the GlueJobManager class. Methods that can be moved:
- generate_dynamicframes_dict()
- generate_dataframes_dict()
- write_data()
- drop_partition() (to be discussed)
Future: create new service-specific classes (e.g. EMRJobManager)
