JavaScript 100.00%

free-programming-books-parser's People

davorpa balasubrahmanyamd tech1media dheerajsingh002

free-programming-books-parser's Issues

Improve title text extraction

According to current code

free-programming-books-parser/index.js

Lines 92 to 95 in dc53b8c

    const [link, ...otherStuff] = listItem; // head of listItem = url, the rest is "other stuff"
    entry.url = link.url;
    entry.title = link.children[0].value;
    // remember to get OTHER STUFF!! remember there may be multiple links!

the first child node, children[0], is used as the resource title without checking whether the link contains more meaningful tokens. Everything after the first node is dropped, which sometimes makes it difficult to search for a resource by title.

As a result, contributors have to escape the title part of a resource link when submitting it. The parser should instead rebuild the Markdown text of the title from all of the link's child nodes.
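One possible approach, sketched below, is to collect the text of every child of the link node instead of reading only children[0]. The helper name nodeText and the sample link node are hypothetical; the sketch only assumes mdast-style nodes of the shape { type, value?, children? }, as used in the excerpt above.

```javascript
// Recursively collects the visible text of an mdast-style node.
// Assumption: nodes carry either a string `value` or a `children` array.
function nodeText(node) {
  if (typeof node.value === "string") return node.value; // text, inlineCode, ...
  if (Array.isArray(node.children)) return node.children.map(nodeText).join("");
  return ""; // nodes with neither value nor children contribute nothing
}

// Hypothetical link node for: [*Pro* Git `2nd` edition](https://example.com)
const link = {
  type: "link",
  url: "https://example.com",
  children: [
    { type: "emphasis", children: [{ type: "text", value: "Pro" }] },
    { type: "text", value: " Git " },
    { type: "inlineCode", value: "2nd" },
    { type: "text", value: " edition" },
  ],
};

// link.children[0].value is undefined here, because the first child is an emphasis node.
const title = nodeText(link);
console.log(title); // Pro Git 2nd edition
```

This variant drops the inline formatting and keeps only the text, which is probably what a title search needs. Keeping the Markdown tokens would be an alternative design.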

Context

See EbookFoundation/free-programming-books#7086
Related to #2 (same workaround)

Improve Index section detection

Resolve this TODO so that Index detection works across localized files

free-programming-books-parser/index.js

Lines 178 to 192 in 5eaf00b

    // find where Index ends
    // probably could be done better, review later
    let i = 0,
      count = 0;
    for (i; i < tree.length; i++) {
      if (tree[i].type == "heading" && tree[i].depth == "3") count++;
      if (count == 2) break;
    }

    tree.slice(i).forEach((item) => {
      // Start iterating after Index
      try {
        if (item.type == "heading" && item.children[0].value == "Index") return;

        if (item.type == "heading") {

part of EbookFoundation/free-programming-books#6988 (comment)

The parser compares against the literal word "Index", but that heading is translated according to the file's locale. For example:

Índice for -es.md files
Índice for -pt_BR.md files
目录 for -zh.md files
other variants at EbookFoundation/free-programming-books@78913af
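A minimal sketch of locale-aware detection follows. The names INDEX_TITLES, isIndexHeading and findIndexEnd are hypothetical, and the set contains only the variants named in this issue; the remaining translations would have to be taken from the linked commit.

```javascript
// Known translations of the Index heading (only those listed in this issue).
const INDEX_TITLES = new Set(["Index", "Índice", "目录"]);

// True when the node is a heading whose first child is a known Index title.
function isIndexHeading(node) {
  if (node.type !== "heading" || !Array.isArray(node.children)) return false;
  const first = node.children[0];
  return Boolean(first) && typeof first.value === "string" && INDEX_TITLES.has(first.value.trim());
}

// Returns the position of the first heading after the Index section,
// or 0 when the file has no Index heading at all.
function findIndexEnd(tree) {
  const start = tree.findIndex(isIndexHeading);
  if (start === -1) return 0;
  for (let i = start + 1; i < tree.length; i++) {
    if (tree[i].type === "heading" && tree[i].depth <= tree[start].depth) return i;
  }
  return tree.length;
}

// Hypothetical flattened tree of a Spanish file.
const tree = [
  { type: "heading", depth: 3, children: [{ type: "text", value: "Índice" }] },
  { type: "list", children: [] },
  { type: "heading", depth: 3, children: [{ type: "text", value: "Ada" }] },
  { type: "list", children: [] },
];

console.log(findIndexEnd(tree)); // 2
```

Unlike counting depth-3 headings, this does not depend on the Index being the first heading of the file.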

Parser doesn't take into account bold format in notes

Having current code:

free-programming-books-parser/index.js

Lines 140 to 172 in dc53b8c

    } else {
      // for now we assume that all previous ifs are mutually exclusive with this, may polish later
      if (i.type === "emphasis") {
        // this is the emphasis, add it in boldface and move on
        s += "*" + i.children[0].value + "*";
      } else if (i.type === "link") {
        // something has gone terribly wrong. this book must be viewed and edited manually.
        entry.manualReviewRequired = true;
        break;
      } else {
        // hopefully this is the end of the note
        let rightParen = i.value.indexOf(")");
        if (rightParen === -1) {
          // we have to go AGAIN
          s += i.value;
        } else {
          // finally, we have reached the end of the note
          entry.notes.push(stripParens(s + i.value.slice(0, rightParen + 1)));
          s = "";
          // this is a copypaste of another block of code. probably not a good thing tbh.
          leftParen = i.value.indexOf("(");
          while (leftParen != -1) {
            rightParen = i.value.indexOf(")", leftParen);
            if (rightParen === -1) {
              // there must be some *emphasis* found
              s += i.value.slice(leftParen);
              break;
            }
            entry.notes.push(i.value.slice(leftParen + 1, rightParen));
            leftParen = i.value.indexOf("(", rightParen);
          }
        }
      }

If a node with bold formatting is found, i.value is undefined and the program crashes. The code should check whether i.type == "strong", or whether the node has i.children.

Resources affected:

https://github.com/EbookFoundation/free-programming-books/blob/fc4b0c5c139b952de979aa54a4de5141ea280906/books/free-programming-books-fr.md?plain=1#L73

In general, the fix should be extended in depth to the other inline formats: emphasis (already handled), bold, code, image...
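As a sketch of what such a fix might look like, the helper below serializes an inline node back to Markdown text, so the note-building loop could append its result to s instead of reading i.value directly. The name inlineToMarkdown and the sample nodes are hypothetical, and the default branch (keep any text, drop the rest) is an assumption rather than the parser's current behavior.

```javascript
// Serializes an mdast-style inline node back to Markdown text.
function inlineToMarkdown(node) {
  const inner = () => (node.children || []).map(inlineToMarkdown).join("");
  switch (node.type) {
    case "text":
      return node.value;
    case "inlineCode":
      return "`" + node.value + "`";
    case "emphasis":
      return "*" + inner() + "*";
    case "strong":
      return "**" + inner() + "**";
    default:
      // Assumption: unknown nodes keep whatever text they carry (images yield "").
      return typeof node.value === "string" ? node.value : inner();
  }
}

// Hypothetical nodes following a resource link: " (**PDF**, *trial*)"
const noteNodes = [
  { type: "text", value: " (" },
  { type: "strong", children: [{ type: "text", value: "PDF" }] },
  { type: "text", value: ", " },
  { type: "emphasis", children: [{ type: "text", value: "trial" }] },
  { type: "text", value: ")" },
];

const noteText = noteNodes.map(inlineToMarkdown).join("");
console.log(noteText); // " (**PDF**, *trial*)"
```

With the whole note available as one string, the parenthesis scanning no longer has to special-case each node type.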

Parser doesn't take into account resources organized in sublists (fascicles/parts)

There is a kind of resource that is not covered by the parser.

Examples:

As we can see, these resources are listed with a title that has no link, and the links are either in a separate sublist or written using the multiformat syntax.

Discovered while fixing #8, because the resources after it appear in fpb.json.

Improve file media type extraction from directory name

It seems that the function in charge of extracting the file type doesn't work well.

free-programming-books-parser/index.js

Lines 171 to 180 in ce6be65

    /**
     * Retrieves the folder name from a string representing a directory and file
     * @param {String} dir A string representing a directory in the format "./directory/file"
     * @returns {String} The extracted directory name
     */
    function getMediaFromDirectory(dir) {
      const slash = dir.lastIndexOf("/");
      let mediaType = dir.slice(2, slash);
      return mediaType;
    }

It always returns "fpb" instead of "books", "courses"...

See https://raw.githubusercontent.com/EbookFoundation/free-programming-books-search/main/fpb.json

It gets even worse if an unsanitized path is provided or the parser is executed with customized inputs.

Tasks

Sanitize the input so that it is independent of the OS.
Extract right slug for both cases: if input parameter is file or is directory.
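The two tasks could be sketched with Node's built-in path module as shown below. The function name getMediaFromPath and the sample paths are hypothetical. The sketch also assumes, without touching the filesystem, that an input with a file extension is a file and anything else is a directory; directory names containing a dot would be misclassified.

```javascript
const path = require("path");

// Returns the name of the folder that holds the resource list.
function getMediaFromPath(input) {
  // Make the path OS-independent: backslashes become forward slashes,
  // then "./", "../" and duplicate separators are resolved.
  const normalized = path.posix.normalize(input.replace(/\\/g, "/"));
  const trimmed = normalized.replace(/\/+$/, ""); // drop trailing slashes

  // Assumption: an extension means the input points at a file.
  const looksLikeFile = path.posix.extname(trimmed) !== "";
  const dir = looksLikeFile ? path.posix.dirname(trimmed) : trimmed;
  return path.posix.basename(dir);
}

console.log(getMediaFromPath("./fpb/books/free-programming-books-fr.md")); // books
console.log(getMediaFromPath(".\\fpb\\courses\\")); // courses
console.log(getMediaFromPath("fpb/books")); // books
```

If the parser already knows whether it was given a file or a directory, passing that in explicitly would be more robust than guessing from the extension.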

Improve extraction of section texts from Markdown headings

According to current code...

free-programming-books-parser/index.js

Lines 192 to 212 in 5eaf00b

    if (item.type == "heading") {
      if (item.depth == 3) {
        // Heading is an h3
        currentDepth = 3;
        let newSection = {
          section: item.children[0].value, // Get the name of the section
          entries: [],
          subsections: [],
        };
        sections.push(newSection); // Push the section to the output array
      } else if (item.depth == 4) {
        // Heading is an h4
        currentDepth = 4;
        let newSubsection = {
          section: item.children[0].value, // Get the name of the subsection
          entries: [],
        };
        sections[sections.length - 1].subsections.push(newSubsection); // Add to subsection array of most recent h3
      }
    } else if (item.type == "list") {
      item.children.forEach((listItem) => {

it seems that the parser supports neither HTML anchor aliases nor Markdown syntax in headings. It takes for granted that children[0] will be plain text.

It is necessary to run an ES6 Array.reduce over the heading's item.children, taking all cases into account, in order to rebuild the desired text. Cases by node type:

html, link: ignore the node. It may be an anchor alias or a back-to-top/upper-section link
text: append value
emphasis: append the children's values wrapped in _ (italic Markdown tokens)
strong: append the children's values wrapped in ** (bold Markdown tokens)
...
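The cases listed above can be sketched as a reducer. The names collectHeadingText and headingText and the sample heading are hypothetical, and the default branch for node types not listed in this issue (keep a string value, otherwise skip) is an assumption.

```javascript
// Rebuilds Markdown text from a list of mdast-style inline nodes.
function collectHeadingText(children) {
  return children.reduce((text, child) => {
    switch (child.type) {
      case "html":
      case "link":
        return text; // anchor alias or back-to-top link: ignore
      case "text":
        return text + child.value;
      case "emphasis":
        return text + "_" + collectHeadingText(child.children) + "_";
      case "strong":
        return text + "**" + collectHeadingText(child.children) + "**";
      default:
        // Assumption: other node types contribute their value, if any.
        return text + (typeof child.value === "string" ? child.value : "");
    }
  }, "");
}

// Section name of a heading node, with surrounding whitespace removed.
function headingText(heading) {
  return collectHeadingText(heading.children).trim();
}

// Hypothetical heading: ### <a id="ada"></a>Ada **2012** [back to top](#index)
const heading = {
  type: "heading",
  depth: 3,
  children: [
    { type: "html", value: '<a id="ada"></a>' },
    { type: "text", value: "Ada " },
    { type: "strong", children: [{ type: "text", value: "2012" }] },
    { type: "text", value: " " },
    { type: "link", url: "#index", children: [{ type: "text", value: "back to top" }] },
  ],
};

console.log(headingText(heading)); // Ada **2012**
```

In the excerpt above, item.children[0].value would become headingText(item) for both the h3 and h4 branches.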
