laughingclouds / scrapia-world

A web scraper for wuxiaworld. Written in Python, it uses Selenium and Python's cmd module for an interactive shell experience, along with a command-line utility for working with the scraped text and a database for storing information.

License: MIT License

Python 99.27% JavaScript 0.73%
wuxiaworld webnovel python3 selenium-python selenium geckodriver chromedriver scraper web-scraper

scrapia-world's Introduction

Scrapia-World

A web scraper for wuxiaworld, written in Python using Selenium with both the gecko and chrome drivers.

Note:

  1. I don't have any releases set up on PyPI, and I probably don't want to use that space for this project (I just prefer it that way). So you will have to make do with either the latest release or the latest pre-release. (I will do my best to keep at least the latest pre-release as functional as I can.)
  2. This readme might not always be up to date, so I'd rather you just go to the latest release (or pre-release).

Setting up:

  1. The latest releases of scrapia-world use only Firefox; the earlier requirement of also supporting Vivaldi has been dropped after a few improvements. Open novel_page_info.json and change the different paths as you wish. I assure you the latest release won't break because of any mix-up in the paths.
  2. The .env file is required for storing the database password. You can easily change the code (remove the load_dotenv function call) to provide environment variables some other way. The email and password for logging in should not be moved, though. If they are, please make the necessary changes in the source code (in the InteractiveShell class in scrapia_shell.py).
  3. You need a database! Significant changes have been made in later releases to the way the database is used (and structured). For now, here's how things should be:
  • Set the value of DATABASE in novel_page_info.json and create a database with that name.
  • Set the value of TABLE in novel_page_info.json and create a table with that name.
  • This is how the table should be structured:

| abbreviated_novel_name1 | abbreviated_novel_name2 | abbreviated_novel_name3 | ... |
| ----------------------- | ----------------------- | ----------------------- | --- |
| chapter no.             | chapter no.             | chapter no.             | ... |

Something like this: *(screenshot of the database table structure)*

It is recommended to set the default value of every column to the integer value of the first chapter number of a novel.

  4. For the other dependencies, a requirements.txt is included; run `pip install -r requirements.txt` inside a virtual environment.
```sql
CREATE TABLE "novel" (
 "ATG" INTEGER DEFAULT 0,
 "OG" INTEGER DEFAULT 0
);
```

SQL code for creating the table.
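The DATABASE and TABLE values mentioned above live in novel_page_info.json. A minimal sketch of reading them; the keys come from this README, but the overall file layout and the helper name are assumptions:

```python
import json

def load_db_settings(path="novel_page_info.json"):
    # Read the database and table names from the config JSON.
    # Keys DATABASE and TABLE are from the README; everything else
    # about the file's layout is assumed here.
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    return info["DATABASE"], info["TABLE"]
```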

Webdrivers:

| Browser  | Recommended Driver |
| -------- | ------------------ |
| Vivaldi  | chromedriver       |
| Chromium | chromedriver       |
| Firefox  | geckodriver        |
  1. You will need to link to Vivaldi's binary file instead of Chrome's to use it. This Stack Overflow question might help you out. For me, the binary's path was /opt/vivaldi/vivaldi (I use Linux, btw).
  2. Chromedriver version for Vivaldi:
  • In the URL bar, enter vivaldi://about/.
  • The version of Chromium your Vivaldi is based on should be visible in the "User Agent" field.
  • Install the chromedriver for this specific version.
  3. If you use Linux and want to work with Vivaldi, you can just copy the code from the v0.1.0-alpha release.
  4. Using the drivers for Chromium and Firefox should be easy.

Things to add:

  1. I have taken to adding a docstring at the top of the source files; it might not be good practice... but meh... I'll see what I can do later on.

Issues:

  1. You can track any known issues from the [issues tab](https://github.com/r3a10god/Scrapia-World/issues).
  2. If you find any issues then feel free to raise them.

Current capability and a few thoughts:

  1. I wanted to read the novel; that's it. And that's what this script helps me with. Therefore, it scrapes only two things from a page: the page title and the relevant text. The page title becomes the name of the text file associated with that page, and the relevant text is stored in that file. Hence, it scrapes the raw text of a chapter.
  2. I plan to make new stuff that would deal with that raw text. I could've downloaded the whole page source and made a script to edit that, but I didn't feel the need to do so.

scrapia-world's People

Contributors: dependabot[bot], laughingclouds

Forkers: 4rcan3

scrapia-world's Issues

can we switch to sqlite?

What the title says. I never knew we could use SQLite (I don't think I even knew much about it) when I started this project.
I'm aiming to work with as few dependencies as I can.

My rationale: it improves portability (in a weird, probably-unnecessary way).

tl;dr: switch from MySQL to SQLite.
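The switch costs nothing dependency-wise, since sqlite3 ships with Python. A sketch using the same "novel" table from the CREATE TABLE snippet above (an in-memory database here; a file path would make it persistent):

```python
import sqlite3

# The same "novel" table from the README's CREATE TABLE snippet,
# but in a zero-dependency SQLite database.
conn = sqlite3.connect(":memory:")  # e.g. "scrapia.db" for a real file
conn.execute('CREATE TABLE novel ("ATG" INTEGER DEFAULT 0, "OG" INTEGER DEFAULT 0)')

# A row picks up each column's default (the first chapter number).
conn.execute("INSERT INTO novel DEFAULT VALUES")
row = conn.execute("SELECT ATG, OG FROM novel").fetchone()
```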

Create a novel "profile" instead

Rather than going to the novel page and then searching for the chapter to click every time the script is run, we can save the links to every chapter of the novel.

We can go to the required chapter using that link after logging in.

From then, we can simply click the "next" button and keep track of the current_chapter.

For this, we need to first create a "profile" of the novel to scrape.

  • Create a custom index that maps to each chapter.
  • Each index will have a link to the chapter associated with it

For this, we first need to open the accordion for every chapter, which can be done by finding all the accordion div elements.
Find them using

```python
driver.find_elements(By.XPATH, "//div[contains(@class, 'grid') and contains(@class, 'grid-cols-1') and contains(@class, 'md:grid-cols-2') and contains(@class, 'w-full')]")
```

for a

```html
<div class="grid grid-cols-1 md:grid-cols-2 w-full"></div>
```

element.

We might need to open the accordions as well:

```js
let spanList = document.getElementsByTagName("span");
for (let span of spanList) {
  if (span.innerText.startsWith("Volume")) {
    span.click();
  }
}
```

Each of these elements has hrefs to the chapters of the novel. Store them with the indexing.
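Once the hrefs are collected (e.g. via the find_elements query above), building the profile is just mapping a custom index to each link. A minimal sketch; the function name, file name, and profile layout are all assumptions:

```python
import json

def build_profile(chapter_links, path="novel_profile.json"):
    # chapter_links: chapter URLs in reading order, e.g.
    # [a.get_attribute("href") for a in driver.find_elements(...)]
    # Map a 1-based custom index to each chapter link, then persist it.
    profile = {index: link for index, link in enumerate(chapter_links, start=1)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f, indent=2)
    return profile
```

After logging in, the script can jump straight to profile[current_chapter] instead of searching the novel page.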

dot-env dependency | password and email in json

I intend to remove this dependency. Let's work it out in favor of... config files?
Also, it seems the code doesn't use dot-env now =_=

But it's still being called. My mistake.

The password and a few other pieces of info (paths) are better off in a single config file rather than a JSON file. No particular reason why; I just prefer .cfg now.

tl;dr: clean the code.

new login method

Previously, WW had its own login page. Now we must click the login button after visiting the homepage.
There we need to click a button.

Thankfully, it's the first button on the webpage.
Hence, the code below will return the button we need.

```js
document.getElementsByTagName("button")[0];
```

We can click it:

```js
document.getElementsByTagName("button")[0].click();
```

And then click on the login button, using this script:

```js
let btnList = document.getElementsByTagName("button");

for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
```

And so, the final script becomes:

```js
document.getElementsByTagName("button")[0].click();
let btnList = document.getElementsByTagName("button");

for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
```

We just need to get to the homepage and run this script in the console. After that, the method for entering the email and password should be the same.
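From the Selenium side, the same console script can be injected with execute_script. A sketch; the wrapper function name is made up here:

```python
# The corrected console script from above, kept verbatim as a string.
LOGIN_SCRIPT = """
document.getElementsByTagName("button")[0].click();
let btnList = document.getElementsByTagName("button");
for (let btn of btnList) {
  if (btn.innerText.toLowerCase() == "log in") {
    btn.click();
  }
}
"""

def open_login_form(driver):
    # `driver` is a Selenium WebDriver already on the wuxiaworld homepage.
    driver.execute_script(LOGIN_SCRIPT)
```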

Read this

PEP257
PEP8

Motive behind this issue?
Make better docstrings, and maybe later use Python's documentation tools to generate documentation.

Redundant and Missing Text

  • Well, as the title says, there is redundant text at the end of a few chapters. This probably has to do with that redundant text being inside the same <div> elements as the main story content.

  • The other problem is that a few chapters have spoiler titles, which prompts the site owners to hide the chapter title and make it visible only when the user clicks on it. This problem can be sorted quite easily, though.

    • We can write a program that checks whether the text file contains the chapter title, and if it doesn't, use the file's name as the title (because files are saved with the page title, which is coincidentally the chapter title, as their name).

Let's fix this in a new release... whenever that'll be.
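The spoiler-title check described above could be sketched like this. Prepending the file's stem when the text lacks it is an assumption about the desired fix, and the function name is made up:

```python
from pathlib import Path

def ensure_title(chapter_file):
    # Files are saved with the page title (i.e. the chapter title) as their
    # name, so the file stem can stand in for a hidden spoiler title.
    path = Path(chapter_file)
    title = path.stem
    text = path.read_text(encoding="utf-8")
    if title not in text:
        # Assumed fix: prepend the recovered title to the chapter text.
        path.write_text(f"{title}\n\n{text}", encoding="utf-8")
    return title
```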

Formatting the saved pages

Is your feature request related to a problem? Please describe.

  • Opening the HTML page is a nightmare. It's ugly.
  • There's no dark mode.
  • The SVGs in the HTML code are oversized.
  • Someone probably used Tailwind/Bootstrap to style the page.

Describe the solution you'd like

  • (Default) dark mode for the text
  • A way to clean up the unnecessary HTML code

Describe alternatives you've considered

  • Create one more CLI (lol) to work with the HTML code.
  • Use bs4, maybe.

Additional context
None
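A stdlib-only sketch of the cleanup idea (bs4 would make this shorter): re-emit the markup while dropping `<svg>`, `<script>`, and `<style>` subtrees. Which tags to strip is an assumption:

```python
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    """Re-emits HTML while skipping <svg>, <script>, and <style> subtrees."""

    SKIP = {"svg", "script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # nesting depth inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or self.depth:
            self.depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (e.g. <path/>) don't change nesting depth.
        if self.depth or tag in self.SKIP:
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

def clean(html):
    cleaner = HTMLCleaner()
    cleaner.feed(html)
    return "".join(cleaner.out)
```

This keeps classes and attributes intact, so a dark-mode stylesheet could still be attached to the cleaned output.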

JSON files are not updating

There must be something wrong with the piece of code where the dictionaries are being updated.
My bet is on the popFirstElementUpdateOtherDict function
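Without seeing the function, one common cause is updating the in-memory dicts but never rewriting the JSON files. A hypothetical reimplementation of the pop-and-update step with an explicit save; popFirstElementUpdateOtherDict's real signature and behavior are unknown:

```python
import json

def pop_first_element_update_other_dict(src, dst, path):
    # Hypothetical sketch: move the first key/value from `src` into `dst`,
    # then persist `dst` so the JSON file on disk actually changes.
    key = next(iter(src))
    dst[key] = src.pop(key)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dst, f, indent=2)
```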

Page source being saved before the content loads

As you can see... *(screenshot)*

My guess is that the function doing three things (scrape, sleep, go to the next page) is not doing them in the proper order.
Maybe scrape → then go to the next page → then sleep.
That way the content gets enough time to load.
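The reordering could be sketched as a loop; the callback names are placeholders:

```python
import time

def scrape_loop(scrape, goto_next_page, chapters, delay=5.0):
    # Scrape first, then navigate, then sleep, so each new page
    # gets `delay` seconds to load before it is scraped in turn.
    for _ in range(chapters):
        scrape()
        goto_next_page()
        time.sleep(delay)
```

A sturdier fix than a fixed sleep would be Selenium's WebDriverWait with an expected condition on the chapter content element.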

Code Quality

My code probably smells.

There are probably a bunch of things I can do to change this.
We could refactor code with python dataclasses for starters.
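The dataclass refactor could start with something like this; the class and field names are guesses based on the rest of the README (an abbreviated novel name like "ATG" plus the chapter counter the database columns track):

```python
from dataclasses import dataclass

@dataclass
class NovelState:
    # Guessed fields: abbreviated novel name and its chapter counter.
    abbreviation: str
    current_chapter: int = 0

    def advance(self):
        # Move the internal count forward after a successful scrape.
        self.current_chapter += 1
```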

A few chapters won't download.

Well, as the title says... they won't. It's because the code's internal count gets messed up if you switch to a different chapter during the sleep time. This can be a pain, because you then have to first run the whole script for that particular "missing" chapter, close it, and then make the corrections in the database.
