laughingclouds / scrapia-world

A web scraper for wuxiaworld. Written in Python, it uses Selenium and Python's cmd module for an interactive shell experience, along with a command-line utility for working with the scraped text and a database for storing information.

License: MIT License

Python 99.27% JavaScript 0.73%
wuxiaworld webnovel python3 selenium-python selenium geckodriver chromedriver scraper web-scraper

scrapia-world's Introduction

Scrapia-World

A web scraper for wuxiaworld, written in Python using Selenium with both the gecko and chrome drivers.

Note:

  1. I don't have any releases set up on PyPI, and I probably don't want to use that space for this project (I just prefer it that way). So you will have to make do with either the latest release or the latest pre-release. (I will do my best to keep at least the latest pre-release as functional as I can.)
  2. This readme might not always be up to date, so I'd rather you just go to the latest release (or pre-release).

Setting up:

  1. The latest releases of scrapia-world use only Firefox; the earlier requirement of also supporting Vivaldi has been dropped after a few improvements. Open novel_page_info.json and change the different paths as you wish. I assure you the latest release won't break because of any mix-up in the paths.
  2. The .env file is required for storing the database password. You can easily change the code (remove the load_dotenv function call) to provide environment variables some other way. The email and password for logging in should not be moved, though. If they are, please make the necessary changes in the source code (in the InteractiveShell class in scrapia_shell.py).
  3. You need a database! Significant changes have been made in later releases to the way the database is used (and structured). For now, here's how things should be:
  • Set the value of DATABASE in novel_page_info.json and create a database with that name.
  • Set the value of TABLE in novel_page_info.json and create a table with that name.
  • This is how the table should be structured:

| abbreviated_novel_name1 | abbreviated_novel_name2 | abbreviated_novel_name3 | ... |
| ----------------------- | ----------------------- | ----------------------- | --- |
| chapter no.             | chapter no.             | chapter no.             | ... |

Something like this: *(screenshot of the database table structure)*

It is recommended to set the default value of every column to the integer value of the first chapter number of a novel.

  4. For the other dependencies, a requirements.txt is included; run `pip install -r requirements.txt` inside a virtual environment.
```sql
CREATE TABLE "novel" (
 "ATG" INTEGER DEFAULT 0,
 "OG" INTEGER DEFAULT 0
);
```

SQL code for creating the table.
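The DATABASE and TABLE values mentioned above live in novel_page_info.json. A minimal sketch of reading them; the keys come from this README, but the overall file layout and the helper name are assumptions:

```python
import json

def load_db_settings(path="novel_page_info.json"):
    # Read the database and table names from the config JSON.
    # Keys DATABASE and TABLE are from the README; everything else
    # about the file's layout is assumed here.
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    return info["DATABASE"], info["TABLE"]
```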

Webdrivers:

| Browser  | Recommended Driver |
| -------- | ------------------ |
| Vivaldi  | chromedriver       |
| Chromium | chromedriver       |
| Firefox  | geckodriver        |
  1. You will need to link to Vivaldi's binary file instead of Chrome's to use it. This Stack Overflow question might help you out. For me, the binary's path was /opt/vivaldi/vivaldi (I use Linux, btw).
  2. Chromedriver version for Vivaldi:
  • In the URL bar, enter vivaldi://about/.
  • The version of Chromium your Vivaldi is based on should be visible in the "User Agent" field.
  • Install the chromedriver for this specific version.
  3. If you use Linux and want to work with Vivaldi, you can just copy the code from the v0.1.0-alpha release.
  4. Using the drivers for Chromium and Firefox should be easy.

Things to add:

  1. I have taken to adding a docstring at the top of the source files; it might not be good practice... but meh... I'll see what I can do later on.

Issues:

  1. You can track any known issues from the [issues tab](https://github.com/r3a10god/Scrapia-World/issues).
  2. If you find any issues then feel free to raise them.

Current capability and a few thoughts:

  1. I wanted to read the novel; that's it. And that's what this script helps me with. Therefore, it scrapes only two things from a page: the page title and the relevant text. The page title becomes the name of the text file associated with that page, and the relevant text is stored in that file. Hence, it scrapes the raw text of a chapter.
  2. I plan to make new stuff that would deal with that raw text. I could've downloaded the whole page source and made a script to edit that, but I didn't feel the need to do so.

scrapia-world's People

Contributors: dependabot[bot], laughingclouds

Forkers: 4rcan3

scrapia-world's Issues

can we switch to sqlite?

What the title says. I never knew we could use SQLite (I don't think I even knew much about it) when I started this project.
I'm aiming to work with as few dependencies as I can.

My rationale: it improves portability (in a weird, probably-unnecessary way).

tl;dr: switch from MySQL to SQLite.
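The switch costs nothing dependency-wise, since sqlite3 ships with Python. A sketch using the same "novel" table from the CREATE TABLE snippet above (an in-memory database here; a file path would make it persistent):

```python
import sqlite3

# The same "novel" table from the README's CREATE TABLE snippet,
# but in a zero-dependency SQLite database.
conn = sqlite3.connect(":memory:")  # e.g. "scrapia.db" for a real file
conn.execute('CREATE TABLE novel ("ATG" INTEGER DEFAULT 0, "OG" INTEGER DEFAULT 0)')

# A row picks up each column's default (the first chapter number).
conn.execute("INSERT INTO novel DEFAULT VALUES")
row = conn.execute("SELECT ATG, OG FROM novel").fetchone()
```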

Create a novel "profile" instead

Rather than going to the novel page and then searching for the chapter to click every time the script is run, we can save the links to every chapter of the novel.

We can go to the required chapter using that link after logging in.

From then, we can simply click the "next" button and keep track of the current_chapter.

For this, we need to first create a "profile" of the novel to scrape.

  • Create a custom index that maps to each chapter.
  • Each index will have a link to the chapter associated with it

For this, we first need to open the accordion for every chapter, which can be done by finding all the accordion div elements.
Find them using

```python
driver.find_elements(By.XPATH, "//div[contains(@class, 'grid') and contains(@class, 'grid-cols-1') and contains(@class, 'md:grid-cols-2') and contains(@class, 'w-full')]")
```

for a

```html
<div class="grid grid-cols-1 md:grid-cols-2 w-full"></div>
```

element.

We might need to open the accordions as well:

```js
let spanList = document.getElementsByTagName("span");
for (let span of spanList) {
  if (span.innerText.startsWith("Volume")) {
    span.click();
  }
}
```

Each of these elements has hrefs to the chapters of the novel. Store them with the indexing.
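Once the hrefs are collected (e.g. via the find_elements query above), building the profile is just mapping a custom index to each link. A minimal sketch; the function name, file name, and profile layout are all assumptions:

```python
import json

def build_profile(chapter_links, path="novel_profile.json"):
    # chapter_links: chapter URLs in reading order, e.g.
    # [a.get_attribute("href") for a in driver.find_elements(...)]
    # Map a 1-based custom index to each chapter link, then persist it.
    profile = {index: link for index, link in enumerate(chapter_links, start=1)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f, indent=2)
    return profile
```

After logging in, the script can jump straight to profile[current_chapter] instead of searching the novel page.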

dot-env dependency | password and email in json

I intend to remove this dependency. Let's work it out in favor of... config files?
Also, it seems the code doesn't use dot-env now =_=

But it's still being called. My mistake.

The password and a few other pieces of info (paths) are better off in a single config file rather than a JSON file. No particular reason why; I just prefer .cfg now.

tl;dr: clean the code.

new login method

Previously, WW had its own login page. Now we must click the login button after visiting the homepage.
There we need to click a button.

Thankfully, it's the first button on the webpage.
Hence, the code below will return the button we need.

```js
document.getElementsByTagName("button")[0];
```

We can click it:

```js
document.getElementsByTagName("button")[0].click();
```

And then click on the login button, using this script:

```js
let btnList = document.getElementsByTagName("button");

for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
```

And so, the final script becomes:

```js
document.getElementsByTagName("button")[0].click();
let btnList = document.getElementsByTagName("button");

for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
```

We just need to get to the homepage and run this script in the console. After that, the method for entering the email and password should be the same.
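From the Selenium side, the same console script can be injected with execute_script. A sketch; the wrapper function name is made up here:

```python
# The corrected console script from above, kept verbatim as a string.
LOGIN_SCRIPT = """
document.getElementsByTagName("button")[0].click();
let btnList = document.getElementsByTagName("button");
for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
"""

def open_login_form(driver):
    # `driver` is a Selenium WebDriver already on the wuxiaworld homepage.
    driver.execute_script(LOGIN_SCRIPT)
```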

Read this

PEP257
PEP8

Motive behind this issue?
Make better docstrings, and maybe later use Python's documentation tools to generate documentation.

Redundant and Missing Text

  • Well, as the title says, there is redundant text at the end of a few chapters. This probably has to do with that redundant text being inside the same <div> elements as the main story content.

  • The other problem is that a few chapters have spoiler titles, which prompts the site owners to hide the chapter title and make it visible only when the user clicks on it. This problem can be sorted quite easily, though.

    • We can write a program that checks whether the text file contains the chapter title, and if it doesn't, use the file's name as the title (because files are saved with the page title, which is coincidentally the chapter title, as their name).

Let's fix this in a new release... whenever that'll be.
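The spoiler-title check described above could be sketched like this. Prepending the file's stem when the text lacks it is an assumption about the desired fix, and the function name is made up:

```python
from pathlib import Path

def ensure_title(chapter_file):
    # Files are saved with the page title (i.e. the chapter title) as their
    # name, so the file stem can stand in for a hidden spoiler title.
    path = Path(chapter_file)
    title = path.stem
    text = path.read_text(encoding="utf-8")
    if title not in text:
        # Assumed fix: prepend the recovered title to the chapter text.
        path.write_text(f"{title}\n\n{text}", encoding="utf-8")
    return title
```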

Formatting the saved pages

Is your feature request related to a problem? Please describe.

  • Opening the HTML page is a nightmare. It's ugly.
  • There's no dark mode.
  • The SVGs in the HTML code are oversized.
  • Someone probably used Tailwind/Bootstrap to style the page.

Describe the solution you'd like

  • (Default) dark mode for the text
  • A way to clean up the unnecessary HTML code

Describe alternatives you've considered

  • Create one more CLI (lol) to work with the HTML code.
  • Use bs4, maybe.

Additional context
None
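A stdlib-only sketch of the cleanup idea (bs4 would make this shorter): re-emit the markup while dropping `<svg>`, `<script>`, and `<style>` subtrees. Which tags to strip is an assumption:

```python
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    """Re-emits HTML while skipping <svg>, <script>, and <style> subtrees."""

    SKIP = {"svg", "script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # nesting depth inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or self.depth:
            self.depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (e.g. <path/>) don't change nesting depth.
        if self.depth or tag in self.SKIP:
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

def clean(html):
    cleaner = HTMLCleaner()
    cleaner.feed(html)
    return "".join(cleaner.out)
```

This keeps classes and attributes intact, so a dark-mode stylesheet could still be attached to the cleaned output.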

JSON files are not updating

There must be something wrong with the piece of code where the dictionaries are being updated.
My bet is on the popFirstElementUpdateOtherDict function
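Without seeing the function, one common cause is updating the in-memory dicts but never rewriting the JSON files. A hypothetical reimplementation of the pop-and-update step with an explicit save; popFirstElementUpdateOtherDict's real signature and behavior are unknown:

```python
import json

def pop_first_element_update_other_dict(src, dst, path):
    # Hypothetical sketch: move the first key/value from `src` into `dst`,
    # then persist `dst` so the JSON file on disk actually changes.
    key = next(iter(src))
    dst[key] = src.pop(key)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dst, f, indent=2)
```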

Page source being saved before the content loads

As you can see... *(screenshot)*

My guess is that the function doing three things (scrape, sleep, go to the next page) is not doing them in the proper order.
Maybe scrape → then go to the next page → then sleep.
That way the content gets enough time to load.
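The reordering could be sketched as a loop; the callback names are placeholders:

```python
import time

def scrape_loop(scrape, goto_next_page, chapters, delay=5.0):
    # Scrape first, then navigate, then sleep, so each new page
    # gets `delay` seconds to load before it is scraped in turn.
    for _ in range(chapters):
        scrape()
        goto_next_page()
        time.sleep(delay)
```

A sturdier fix than a fixed sleep would be Selenium's WebDriverWait with an expected condition on the chapter content element.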

Code Quality

My code probably smells.

There are probably a bunch of things I can do to change this.
We could refactor code with python dataclasses for starters.
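The dataclass refactor could start with something like this; the class and field names are guesses based on the rest of the README (an abbreviated novel name like "ATG" plus the chapter counter the database columns track):

```python
from dataclasses import dataclass

@dataclass
class NovelState:
    # Guessed fields: abbreviated novel name and its chapter counter.
    abbreviation: str
    current_chapter: int = 0

    def advance(self):
        # Move the internal count forward after a successful scrape.
        self.current_chapter += 1
```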

A few chapters won't download.

Well, as the title says... they won't. It's because the code's internal count gets messed up if you switch to a different chapter during the sleep time. This can be a pain, because you then have to first run the whole script for that particular "missing" chapter, close it, and then make the corrections in the database.
