free-programming-books-parser's People

Contributors

brogan20, davorpa, dpekata, eshellman, github-actions[bot], leoouyang24, nrfq, pkelly10439594

free-programming-books-parser's Issues

Improve title text extraction

According to the current code:

const [link, ...otherStuff] = listItem; // head of listItem = url, the rest is "other stuff"
entry.url = link.url;
entry.title = link.children[0].value;
// remember to get OTHER STUFF!! remember there may be multiple links!

the first node, children[0], is used as the resource title without checking whether more meaningful tokens follow. The rest is stripped, which sometimes makes it difficult to search for resources by title.
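A minimal sketch of how the whole link text could be collected instead of only children[0]. The helper name collectText and the sample link node are hypothetical; the node shapes (type, value, children) follow the usual mdast layout, which is an assumption about what the parser receives.

```javascript
// Hypothetical helper: concatenates the text of every descendant of a node.
function collectText(node) {
  // Leaf nodes such as "text" or "inlineCode" carry their content in `value`
  if (typeof node.value === "string") return node.value;
  // Parent nodes such as "link", "emphasis" or "strong" carry `children`
  if (Array.isArray(node.children)) return node.children.map(collectText).join("");
  return "";
}

// Hand-built node standing in for [Learn *C* in 10 days](https://example.com)
const link = {
  type: "link",
  url: "https://example.com",
  children: [
    { type: "text", value: "Learn " },
    { type: "emphasis", children: [{ type: "text", value: "C" }] },
    { type: "text", value: " in 10 days" },
  ],
};

const title = collectText(link);
```

With this, entry.title could be set from collectText(link) rather than link.children[0].value. Note that it drops the formatting markers, so it yields plain text, not rebuilt Markdown.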

Therefore, escaping is needed in the title part of resource links when submitting, and rebuilding the Markdown here is mandatory.

Context

See EbookFoundation/free-programming-books#7086
Related to #2 (same workaround)

Improve Index section detection

Resolve these TODOs across localized files:

// find where Index ends
// probably could be done better, review later
let i = 0,
  count = 0;
for (i; i < tree.length; i++) {
  if (tree[i].type == "heading" && tree[i].depth == "3") count++;
  if (count == 2) break;
}
tree.slice(i).forEach((item) => {
  // Start iterating after Index
  try {
    if (item.type == "heading" && item.children[0].value == "Index") return;
    if (item.type == "heading") {

part of EbookFoundation/free-programming-books#6988 (comment)

The word "Index" is not translated according to the file locale.
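One locale-independent option, offered only as a sketch and not as what the parser currently does: instead of matching the heading text, treat the first list whose items all link to in-page anchors (URLs starting with "#") as the Index and start after it. The helper names isAnchorItem and findContentStart, the sample tree, and the assumption that only the Index list looks like this are all hypothetical. The node shapes follow the usual mdast layout.

```javascript
// Hypothetical helper: true when the first inline node of a list item
// is a link to an in-page anchor such as "#ada".
function isAnchorItem(listItem) {
  const paragraph = listItem.children && listItem.children[0];
  const first = paragraph && paragraph.children && paragraph.children[0];
  return Boolean(
    first && first.type === "link" && typeof first.url === "string" && first.url.startsWith("#")
  );
}

// Returns the position of the first node after the Index list,
// or 0 when no such list exists (the whole tree is then treated as content).
function findContentStart(tree) {
  const indexPos = tree.findIndex(
    (node) =>
      node.type === "list" && node.children.length > 0 && node.children.every(isAnchorItem)
  );
  return indexPos + 1;
}

// Small builders for an illustrative, hand-made tree
const text = (value) => ({ type: "text", value });
const item = (url, label) => ({
  type: "listItem",
  children: [{ type: "paragraph", children: [{ type: "link", url, children: [text(label)] }] }],
});

const localizedTree = [
  { type: "heading", depth: 3, children: [text("Índice")] }, // translated Index heading
  { type: "list", children: [item("#ada", "Ada")] },
  { type: "heading", depth: 3, children: [text("Ada")] },
  { type: "list", children: [item("https://example.com", "Some book")] },
];

const contentStart = findContentStart(localizedTree);
```

The caller could then iterate over tree.slice(contentStart) without comparing against the literal "Index".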

Parser doesn't take into account bold format in notes

Given the current code:

} else {
  // for now we assume that all previous ifs are mutually exclusive with this, may polish later
  if (i.type === "emphasis") {
    // this is the emphasis, add it in boldface and move on
    s += "*" + i.children[0].value + "*";
  } else if (i.type === "link") {
    // something has gone terribly wrong. this book must be viewed and edited manually.
    entry.manualReviewRequired = true;
    break;
  } else {
    // hopefully this is the end of the note
    let rightParen = i.value.indexOf(")");
    if (rightParen === -1) {
      // we have to go AGAIN
      s += i.value;
    } else {
      // finally, we have reached the end of the note
      entry.notes.push(stripParens(s + i.value.slice(0, rightParen + 1)));
      s = "";
      // this is a copypaste of another block of code. probably not a good thing tbh.
      leftParen = i.value.indexOf("(");
      while (leftParen != -1) {
        rightParen = i.value.indexOf(")", leftParen);
        if (rightParen === -1) {
          // there must be some *emphasis* found
          s += i.value.slice(leftParen);
          break;
        }
        entry.notes.push(i.value.slice(leftParen + 1, rightParen));
        leftParen = i.value.indexOf("(", rightParen);
      }
    }
  }

If bold formatting is found, i.value is undefined and the program crashes. It should check whether i.type == "strong" or whether i has children.

Resources affected:

In general, we should extend the fix in depth to other inline formats such as emphasis (already handled), bold, code, image...
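A rough sketch of a helper that never assumes value exists and re-emits the Markdown markers for the inline formats mentioned above. The names MARKERS and inlineToMarkdown are hypothetical, and the node types (text, emphasis, strong, inlineCode) assume the usual mdast layout. Link nodes are deliberately left to the caller, since the current code flags them for manual review.

```javascript
// Markdown markers per inline node type (assumption: only these formats matter in notes).
const MARKERS = { emphasis: "*", strong: "**", inlineCode: "`" };

// Hypothetical helper: rebuilds the Markdown text of an inline node.
// It reads `value` when present and otherwise recurses into `children`,
// so a "strong" node no longer triggers a crash on an undefined value.
function inlineToMarkdown(node) {
  const marker = MARKERS[node.type] || "";
  const inner =
    typeof node.value === "string"
      ? node.value
      : (node.children || []).map(inlineToMarkdown).join("");
  return marker + inner + marker;
}

// Hand-built nodes standing in for the note "(**PDF**, *draft*)"
const noteNodes = [
  { type: "text", value: "(" },
  { type: "strong", children: [{ type: "text", value: "PDF" }] },
  { type: "text", value: ", " },
  { type: "emphasis", children: [{ type: "text", value: "draft" }] },
  { type: "text", value: ")" },
];

const noteText = noteNodes.map(inlineToMarkdown).join("");
```

The parenthesis scanning in the existing loop could then run on the string returned for each node instead of on i.value directly. Nodes with neither value nor children, such as images, contribute an empty string in this sketch.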

Parser doesn't take into account resources organized in sublists (fascicles/parts)

Improve file media type extraction from directory name

It seems that the function in charge of extracting the file type doesn't work well.

/**
 * Retrieves the folder name from a string representing a directory and file
 * @param {String} dir A string representing a directory in the format "./directory/file"
 * @returns {String} The extracted directory name
 */
function getMediaFromDirectory(dir) {
  const slash = dir.lastIndexOf("/");
  let mediaType = dir.slice(2, slash);
  return mediaType;
}

It always returns "fpb" instead of "books", "courses"...

See https://raw.githubusercontent.com/EbookFoundation/free-programming-books-search/main/fpb.json

It is even worse if an unsanitized path is provided or the parser is executed with customized inputs.

Tasks

  • Sanitize the input so that it is independent of the OS.
  • Extract the right slug in both cases: whether the input parameter is a file or a directory.
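The two tasks above could be sketched as follows, using Node's built-in path module. The function name getMediaTypeFromPath is hypothetical, and deciding "file vs. directory" by the presence of a file extension is a heuristic assumed here for simplicity; checking the filesystem would be another option.

```javascript
const path = require("path");

// Hypothetical replacement sketch for getMediaFromDirectory.
function getMediaTypeFromPath(input) {
  // Turn Windows separators into "/" and collapse "./" and "//" so the
  // result does not depend on the OS the parser runs on
  const normalized = path.posix.normalize(input.replace(/\\/g, "/"));
  // Heuristic (assumption): a last segment with an extension is a file
  const isFile = path.posix.extname(normalized) !== "";
  const dir = isFile ? path.posix.dirname(normalized) : normalized;
  // The media type is the name of the innermost directory
  return path.posix.basename(dir);
}

const fromFile = getMediaTypeFromPath("./fpb/books/free-programming-books-en.md");
const fromWindowsFile = getMediaTypeFromPath(".\\fpb\\courses\\free-courses-en.md");
const fromDirectory = getMediaTypeFromPath("fpb/books/");
```

Taking the innermost directory name, rather than slicing from a fixed offset, avoids depending on how many parent folders precede the media folder.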

Improve extraction of section texts from Markdown headings

According to the current code:

if (item.type == "heading") {
  if (item.depth == 3) {
    // Heading is an h3
    currentDepth = 3;
    let newSection = {
      section: item.children[0].value, // Get the name of the section
      entries: [],
      subsections: [],
    };
    sections.push(newSection); // Push the section to the output array
  } else if (item.depth == 4) {
    // Heading is an h4
    currentDepth = 4;
    let newSubsection = {
      section: item.children[0].value, // Get the name of the subsection
      entries: [],
    };
    sections[sections.length - 1].subsections.push(newSubsection); // Add to subsection array of most recent h3
  }
} else if (item.type == "list") {
  item.children.forEach((listItem) => {

It seems that the parser supports neither HTML anchor aliases nor Markdown syntax in headings. It takes for granted that children[0] will be plain text.

It's necessary to run an Array.prototype.reduce over the heading's item.children, taking all cases into account, in order to rebuild the desired text. Cases by node type:

  • html, link: ignore the node. It may be an anchor alias or a "back to top"/"upper section" link
  • text: append the value
  • emphasis: append the children's values wrapped in _ (italic Markdown tokens)
  • strong: append the children's values wrapped in ** (bold Markdown tokens)
  • ...
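The cases listed above can be sketched with a reduce like the one below. The helper name headingText and the sample heading are hypothetical, the node shapes assume the usual mdast layout, and the final trim() is an extra assumption to drop whitespace left behind by ignored nodes. Node types not listed are skipped in this sketch.

```javascript
// Hypothetical helper: rebuilds the heading text from its inline children.
function headingText(children) {
  return children
    .reduce((text, child) => {
      switch (child.type) {
        case "html":
        case "link":
          return text; // anchor alias or "back to top" link: ignore
        case "text":
          return text + child.value;
        case "emphasis":
          return text + "_" + headingText(child.children) + "_";
        case "strong":
          return text + "**" + headingText(child.children) + "**";
        default:
          return text; // remaining node types are not handled here
      }
    }, "")
    .trim();
}

// Hand-built children standing in for: ### <a id="ada"></a>Ada *2012* [top](#index)
const headingChildren = [
  { type: "html", value: '<a id="ada"></a>' },
  { type: "text", value: "Ada " },
  { type: "emphasis", children: [{ type: "text", value: "2012" }] },
  { type: "text", value: " " },
  { type: "link", url: "#index", children: [{ type: "text", value: "top" }] },
];

const sectionName = headingText(headingChildren);
```

The section and subsection names could then come from headingText(item.children) instead of item.children[0].value.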
