narius2030 / vietnamese-text-generator

Implements word embeddings to explore correlations among words, and a sequence model to generate text.

Languages: Python 45.82%, Jupyter Notebook 54.18%
Topics: apache-airflow, nlp-deep-learning, streamlit-webapp, tensorflow, data-collection, text-generator

Introduction

  • Workflow

[Image: workflow diagram]

Implementation

  • Apply natural language processing techniques: remove punctuation and symbols, remove stop words, reformat the text, tokenize it into words, and build the corpus.
  • Design an LSTM-based neural network for generating text. I use an Embedding layer to map each word to a feature vector so the model can learn relationships among words.
  • In the final layer, I use a Dense layer with a softmax activation, which outputs a probability for every word in the vocabulary; the word with the highest probability is chosen as the next word.
  • I also implement a data pipeline with Apache Airflow that crawls text data from VnExpress; the crawler uses BeautifulSoup.
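The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`preprocess`, `build_vocab`, `make_windows`), the one-word stop-word list, and the window length are all assumptions made for the example.

```python
import re
import unicodedata

# Placeholder stop-word list for illustration only; the project's real list is not shown.
STOP_WORDS = {unicodedata.normalize("NFC", w) for w in ("và",)}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and symbols, drop stop words, and tokenize."""
    # Normalize to NFC so accented Vietnamese letters are single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace anything that is not a word character or whitespace with a space.
    # \w is Unicode-aware in Python 3 and also keeps the underscores that
    # join compound words such as cầu_thủ.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in stop_words]


def build_vocab(token_lists):
    """Map each distinct token to an integer id, reserving 0 for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def make_windows(ids, seq_len):
    """Slide a window over the ids: seq_len input ids followed by one target id."""
    return [(ids[i:i + seq_len], ids[i + seq_len])
            for i in range(len(ids) - seq_len)]


tokens = preprocess("Cầu_thủ cầm vại bia lớn, dội vào HLV và cầu_thủ khác!")
vocab = build_vocab([tokens])
ids = [vocab[tok] for tok in tokens]
windows = make_windows(ids, seq_len=3)
```

Each `(inputs, target)` pair in `windows` is one training example for a next-word model; a window length of 3 is used here only to keep the example small.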

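Build Model

import tensorflow as tf

# Vocabulary size of the training corpus (3812 in the summary below)
vocab_size = 3812

model = tf.keras.models.Sequential()
# Map each of the 50 input token ids to a 50-dimensional vector
model.add(tf.keras.layers.Embedding(vocab_size, 50, input_length=50))
model.add(tf.keras.layers.BatchNormalization())
# Two stacked LSTM layers; the first returns the full sequence for the second
model.add(tf.keras.layers.LSTM(512, return_sequences=True))
model.add(tf.keras.layers.LSTM(512))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.BatchNormalization())
# Softmax over the vocabulary gives the probability of each next word
model.add(tf.keras.layers.Dense(vocab_size, activation='softmax'))
model.summary()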
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, 50, 50)            190600    
                                                                 
 batch_normalization_4 (Bat  (None, 50, 50)            200       
 chNormalization)                                                
                                                                 
 lstm_4 (LSTM)               (None, 50, 512)           1153024   
                                                                 
 lstm_5 (LSTM)               (None, 512)               2099200   
                                                                 
 dense_9 (Dense)             (None, 100)               51300     
                                                                 
 dropout_2 (Dropout)         (None, 100)               0         
                                                                 
 batch_normalization_5 (Bat  (None, 100)               400       
 chNormalization)                                                
                                                                 
 dense_10 (Dense)            (None, 3812)              385012    
                                                                 
=================================================================
Total params: 3879736 (14.80 MB)
Trainable params: 3879436 (14.80 MB)
Non-trainable params: 300 (1.17 KB)
_________________________________________________________________
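The parameter counts in the summary can be checked by hand with the standard formulas for each layer type. The helper names below are made up for this check; the layer sizes are taken from the summary above.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


def batchnorm_params(features):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * features


vocab_size = 3812
counts = {
    "embedding": embedding_params(vocab_size, 50),
    "batch_norm_1": batchnorm_params(50),
    "lstm_1": lstm_params(50, 512),
    "lstm_2": lstm_params(512, 512),
    "dense_1": dense_params(512, 100),
    "batch_norm_2": batchnorm_params(100),
    "dense_2": dense_params(100, vocab_size),
}
total = sum(counts.values())
# Half of each BatchNormalization layer's parameters are moving statistics.
non_trainable = (counts["batch_norm_1"] + counts["batch_norm_2"]) // 2
trainable = total - non_trainable
```

The two LSTM layers account for most of the parameters, and the 300 non-trainable parameters are the moving mean and variance of the two BatchNormalization layers.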

Generated Text Sample

  • The highest-probability sentence
text = generator.generate_sentences('cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác', 20)

""" result: cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác là hành_động ăn_mừng thường thấy sau khi giành bundesliga tối 144 nếu thắng werder bremen trên sân_nhà leverkusen sẽ đủ """
  • Top 3 highest-probability sentences
text_input = "cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác"
generator.generate_possible_sentences(text_input, top_n=3, n_words=20)

""" result: ['cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác đã chơi giúp họ giành danh_hiệu atp ở anfield nhất tại world_cup qua anh từng giành nhiều danh_hiệu tập_thể lớn cũng',
 'cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác có_thể thắng sẽ luôn được từng kéo_dài nhiều hơn alonso từng có alonso có_thể lập lại bundesliga leverkusen đang kém bayer',
 'cầu_thủ cầm vại bia lớn dội vào hlv và cầu_thủ khác là hành_động ăn_mừng thường thấy sau khi giành bundesliga tối 144 nếu thắng werder bremen trên sân_nhà leverkusen sẽ đủ điểm'] """
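The internals of `generate_sentences` and `generate_possible_sentences` are not shown here, so the following is only one plausible sketch of how such generation could work. It assumes a `predict_fn` that returns a softmax distribution over the vocabulary for a list of token ids; `toy_predict`, `top_k_ids`, `generate_greedy`, and `generate_top_n` are hypothetical names, and the toy distribution table stands in for a trained model.

```python
def top_k_ids(probs, k):
    """Indices of the k largest probabilities, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def generate_greedy(predict_fn, seed_ids, n_words):
    """Append the most probable next token n_words times."""
    ids = list(seed_ids)
    for _ in range(n_words):
        probs = predict_fn(ids)
        ids.append(top_k_ids(probs, 1)[0])
    return ids


def generate_top_n(predict_fn, seed_ids, top_n, n_words):
    """Branch on the top_n candidates for the first word, then go greedy."""
    first_probs = predict_fn(list(seed_ids))
    sentences = []
    for first in top_k_ids(first_probs, top_n):
        branch = list(seed_ids) + [first]
        sentences.append(generate_greedy(predict_fn, branch, n_words - 1))
    return sentences


# Toy stand-in for a trained model: a fixed next-token distribution
# that depends only on the last token id.
TOY_DISTRIBUTIONS = {
    0: [0.1, 0.6, 0.3],
    1: [0.2, 0.1, 0.7],
    2: [0.5, 0.4, 0.1],
}


def toy_predict(ids):
    return TOY_DISTRIBUTIONS[ids[-1]]


best = generate_greedy(toy_predict, [0], n_words=3)
candidates = generate_top_n(toy_predict, [0], top_n=2, n_words=3)
```

With the real model, `predict_fn` would presumably wrap `model.predict` on the last 50 token ids (padded if the seed is shorter), and the resulting ids would be mapped back to words through the tokenizer's vocabulary.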

Data Scraping

import json
import os

from tqdm import tqdm

# read_yaml, get_links_from_subtopics and get_content_from_article are
# helper functions defined elsewhere in this project.


def scrape_news():
    """Crawl news articles and save them per topic, one JSON object per line."""
    topics_links = read_yaml('./src/crawler/links.yaml')
    topics_links = get_links_from_subtopics(topics_links, pages=3)

    # set output path
    OUTPUT = './data/vnexpress/raw_news'

    print('\nCrawling...')
    for topic, links in topics_links.items():
        # the number of news links per topic
        print(f'Topic {topic} - Number of Sub-topic: {len(links)}')

        # save news data into text file in raw_news folder
        file_path = os.path.join(OUTPUT, f'{topic}.txt')
        with open(file_path, 'w', encoding='utf-8') as file:
            for link in tqdm(links):
                # each link is a single-entry dict: {url: items}
                url = list(link.keys())[0]
                items = link[url]
                content = get_content_from_article(url, items[0], items[1], topic)
                # skip articles that could not be parsed
                if content is not None:
                    file.write(json.dumps(content))
                    file.write('\n')

    print('\nCompleted!')
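Because `scrape_news` writes one JSON object per line, the raw files can be read back line by line. The sketch below assumes that layout; `read_json_lines` is a hypothetical helper, and the `title` and `content` keys in the demo records are invented, since the actual fields returned by `get_content_from_article` are not shown.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with a temporary file; the record keys are hypothetical.
sample = [
    {"title": "first article", "content": "text of the first article"},
    {"title": "second article", "content": "text of the second article"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write the records the same way scrape_news does: one JSON object per line.
    with open(path, "w", encoding="utf-8") as file:
        for record in sample:
            file.write(json.dumps(record))
            file.write("\n")
    loaded = list(read_json_lines(path))
```

Reading lazily with a generator keeps memory use low when a topic file holds many articles.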

Data pipeline in Airflow

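from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# scrape_news, transform_load and print_date are callables defined
# elsewhere in this project.

dag = DAG(
    'ETL-VNExpress',
    default_args={'start_date': days_ago(1)},
    schedule_interval='55 17 * * *',  # every day at 17:55
    catchup=False
)

extract_data_task = PythonOperator(
    task_id='extract_data',
    python_callable=scrape_news,
    dag=dag
)

# Named *_task so the operator does not shadow the transform_load function
transform_load_task = PythonOperator(
    task_id='transform_load',
    python_callable=transform_load,
    dag=dag
)

print_date_task = PythonOperator(
    task_id='print_date',
    python_callable=print_date,
    dag=dag
)

# Set the dependencies between the tasks
extract_data_task >> transform_load_task >> print_date_task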
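The cron expression `'55 17 * * *'` means minute 55 of hour 17 on every day, so the pipeline is scheduled once a day at 17:55 (Airflow evaluates cron schedules in UTC by default unless a timezone is configured). The sketch below shows how the next run time follows from such an expression; `next_daily_run` is a hypothetical helper that handles only this daily form and is not part of Airflow.

```python
from datetime import datetime, timedelta


def next_daily_run(cron, now):
    """Next run time for a cron string of the form 'M H * * *'."""
    minute, hour, day_of_month, month, day_of_week = cron.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("only daily 'M H * * *' expressions are handled here")
    candidate = now.replace(hour=int(hour), minute=int(minute),
                            second=0, microsecond=0)
    # If today's slot has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


before = next_daily_run("55 17 * * *", datetime(2024, 4, 15, 9, 0))
after = next_daily_run("55 17 * * *", datetime(2024, 4, 15, 18, 0))
```

Here `before` falls on the same day at 17:55 and `after` rolls over to the following day.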

Contributors

narius2030, htn-dt-beo
