erc-dharma / odia-mahabharata
ସାରଳା ଭାରତ (Sāraḷā Bhārata)
This replacement can be made with confidence in DHARMA-ISO.
To help identify problems, I created four additional files. malten_full.txt, iast_full.txt and ori_full.txt hold the full text of the respective versions, concatenated into a single file, and without markers. missing_full.txt is a new version of missing_alpha.txt.
There are about 40 fewer lines in malten_full than in iast_full and ori_full, probably because some markers from malten_full were incorrectly transliterated. In malten_full itself, there are quite a few weird character sequences, like "coÔøΩkhaÔøΩ", which are preserved in iast_full but obviously should have been transliterated to something more meaningful.
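The two checks described above (comparing line counts and hunting for odd character sequences) could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the notion of "suspect" used here (the replacement character, control, unassigned and private-use characters) is an assumption about what a broken sequence looks like. Sequences built from ordinary letters would need an explicit whitelist of expected characters instead.

```python
import unicodedata

def is_suspect(ch):
    """True for characters that usually signal an encoding problem (assumed criteria)."""
    if ch in "\t\r\n":
        return False
    # U+FFFD is the replacement character; Cc = control, Cn = unassigned, Co = private use.
    return ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Cn", "Co"}

def suspicious_lines(lines):
    """Yield (line number, code points, text) for each line holding suspect characters."""
    for number, line in enumerate(lines, start=1):
        bad = sorted({ch for ch in line if is_suspect(ch)})
        if bad:
            yield number, [f"U+{ord(ch):04X}" for ch in bad], line.rstrip("\n")

def line_counts(texts):
    """Map each name to the number of lines in its text."""
    return {name: len(text.splitlines()) for name, text in texts.items()}
```

The texts would be read from malten_full.txt, iast_full.txt and ori_full.txt (for instance with pathlib's read_text, assuming UTF-8) and passed in as strings, so the functions themselves never touch the file system.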
Concerning the scans, they take up a lot of space, so I suggest we host them somewhere other than in a git repository. I will make PDFs out of them and send you the files.
Now some specific points:
(1) OK for ẏ and ḷ.
(2) We have both "b" and "w" transformed to "ବ". Is this normal? Should I also modify the "iast" version to use only one of these characters?
(3) For "ṛ", there are indeed several cases where it is preceded by a consonant. I found the following:
jṛ
tṛ
hṛ
nṛ
pṛ
Should I replace these with jr̥, etc. in the IAST version, globally? For other instances of "ṛ", where it represents ଡ଼, should this symbol be treated as a consonant? That is, if a vowel follows, should it be represented as an initial vowel or as a vowel mark in Oriya script?
(4) The transliteration issues you spotted in Text_files_problems.txt are harder to tackle; I need more specific transformation rules. For instance, we have *pUboGge > pUrbe but also *eboGg > ebaM, and I do not know the criteria for deciding whether boGg > rb or boGg > aM.
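For point (3), a global replacement of "ṛ" after a consonant could look like the sketch below. It is a minimal illustration under stated assumptions: the consonant inventory is a guess at the lowercase letters used in the IAST files, the function name is hypothetical, and the rule deliberately leaves "ṛ" untouched after a vowel or at the start of a word, so it does not decide the open question about ଡ଼.

```python
import re
import unicodedata

# Assumed inventory of lowercase consonant letters; "h" also covers aspirates such as "bh".
CONSONANTS = unicodedata.normalize("NFC", "kgṅcjñṭḍṇtdnpbmyrlvśṣshẏḷw")
DOTTED_R = "\u1e5b"          # ṛ, r with dot below (precomposed)
VOCALIC_R = "r\u0325"        # r̥, r followed by COMBINING RING BELOW

# Match ṛ only when the character just before it is a consonant.
PATTERN = re.compile(f"(?<=[{CONSONANTS}]){DOTTED_R}")

def mark_vocalic_r(text):
    """Rewrite consonant + ṛ as consonant + r̥; other occurrences of ṛ are kept."""
    text = unicodedata.normalize("NFC", text)  # make sure ṛ is one code point
    return PATTERN.sub(VOCALIC_R, text)
```

Running it first as a dry run and listing the changed words would make it easy to spot any cluster where "ṛ" after a consonant actually stands for ଡ଼.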
I'm pretty sure that this means a double quotation mark ("), though at present it is hard for me to match the occurrences with specific pages in the printed edition, so I cannot confirm it yet. See #2.
Dear @ajit-github !
I have just added the following two lines to todo.txt:
In <P1> contexts, @...@ seems to be used to wrap what is in Roman script.
But we have found at least one case of @...@ used in <H> context, where it does not have the same meaning.
Could you analyze the ways in which this wrapper is used throughout the ISO folder, so we can differentiate the wrappers in preparation for subsequent conversion to XML?
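One possible starting point for this analysis is to collect every @...@ span and group it by the tag that opens its line. The sketch below assumes that each line of the ISO files starts with a tag such as <P1> or <H> and that wrappers do not span lines; both are assumptions, and the function names are hypothetical.

```python
import re
from collections import defaultdict

TAG = re.compile(r"^<([A-Za-z0-9]+)>")   # leading tag, e.g. <P1> or <H>
WRAPPED = re.compile(r"@([^@\n]*)@")     # text between a pair of @ signs

def wrappers_by_context(lines):
    """Group the contents of every @...@ wrapper by the tag that starts its line."""
    found = defaultdict(list)
    for line in lines:
        match = TAG.match(line)
        context = match.group(1) if match else "(no tag)"
        for content in WRAPPED.findall(line):
            found[context].append(content)
    return dict(found)

def unbalanced_lines(lines):
    """Return the numbers of lines with an odd count of @, where pairing is unreliable."""
    return [n for n, line in enumerate(lines, start=1) if line.count("@") % 2]
```

Lines reported by unbalanced_lines would need checking by hand, since a missing @ shifts every later pairing on that line.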
if and when it's possible, it would be helpful if the textual divisions were all shown in some way.
e.g., instead of
jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 1 |
perhaps
jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 01.01.01 |
at the very least, I need to be able to see, for any line of text, which Parva I am in, so that I can check the scans of the printed edition. Ideally, page breaks would also be indicated.
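If the Parva and chapter of a stretch of text are already known, expanding the bare verse numbers into the longer form shown in the example could be done along these lines. This is a sketch only: reading 01.01.01 as Parva.chapter.verse is an assumption, the function name is hypothetical, and how the Parva and chapter would be detected in the files is left open.

```python
import re

# A verse number closing a line, e.g. "| 1 |"; already expanded numbers do not match.
VERSE_END = re.compile(r"\|\s*(\d+)\s*\|\s*$")

def number_verses(lines, parva, chapter):
    """Rewrite a trailing '| n |' as '| PP.CC.NN |' on lines given without newlines."""
    def expand(match):
        verse = int(match.group(1))
        return f"| {parva:02d}.{chapter:02d}.{verse:02d} |"
    return [VERSE_END.sub(expand, line) for line in lines]
```

Lines without a closing verse number pass through unchanged, so the function can be applied to a whole chapter at once.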
Dear @ajit-github !
There seem to be many cases where the typist has entered "ch" where only "c" is required.
Example
<P>mana pachārai caitana kahai
This should be
<P>mana pacāraI caItana kahaI
(I am also disambiguating the cases of ambiguous ai.)
We will need a strategy for cleaning up the relevant cases of "ch".
@michaelnmmeyer : please modify todo.txt using the following notes.
I am not familiar with the s,x,y,g code, but I assume it means change x to y. I confirm all conversions are correct, except that the single case of Z was a typing error which I have now corrected, so s,Z,ś,g can be removed.
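The s,x,y,g notation follows the substitute command of sed: replace x with y, everywhere ("g" for global). A rough Python equivalent is sketched below; it treats x as a literal string, whereas sed reads it as a regular expression, and the function names are hypothetical.

```python
def parse_rule(rule):
    """Split a sed-style rule such as 's,x,y,g' into (pattern, replacement)."""
    if len(rule) < 2 or rule[0] != "s":
        raise ValueError(f"not a substitution rule: {rule!r}")
    delimiter = rule[1]                      # here a comma, but sed allows others
    parts = rule[2:].split(delimiter)
    if len(parts) != 3 or parts[2] != "g":
        raise ValueError(f"expected s{delimiter}x{delimiter}y{delimiter}g: {rule!r}")
    return parts[0], parts[1]

def apply_rules(text, rules):
    """Apply each rule in order, replacing every literal occurrence of its pattern."""
    for rule in rules:
        pattern, replacement = parse_rule(rule)
        text = text.replace(pattern, replacement)
    return text
```

Because the rules run in order, a later rule also sees the output of earlier ones, which matters if one replacement produces a string that another rule matches.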
Regarding the cases you were unsure about:
s,J,j,g
s,y1,ẏ,g
Are there any other cases of digits within text?
<P1>108|1 meñjana–byañjana
<P1>110|1 puāi–anna byañjanādi prāsta karāi
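To look for further digits inside words, as in y1, while ignoring references such as 108|1 and the digits inside tags such as <P1>, something like the following could be used. It is a sketch under assumptions: tags are taken to be anything between angle brackets, and the function name is hypothetical.

```python
import re

ANGLE_TAG = re.compile(r"<[^>]*>")   # tags such as <P1>, removed before tokenising
WORD = re.compile(r"\w+")            # runs of letters, digits and underscores

def mixed_tokens(line):
    """Return the tokens of a line that contain both a letter and a digit."""
    text = ANGLE_TAG.sub(" ", line)
    return [
        token for token in WORD.findall(text)
        if any(ch.isdigit() for ch in token) and any(ch.isalpha() for ch in token)
    ]
```

Pure numbers like 108 and 1 are not reported, since they contain no letter.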
Looking only at the file MBO02(2).txt, I find that the typist has used the string bong in at least three ways.
Are you able to analyze, for the whole ISO folder, which actual Odia or Roman strings "bong" corresponds to, and to propose a strategy for correcting what the typist has done?
There are some frequently occurring words (like pārvatī, sarva, pūrva, vaṁśa, parva) that we could perhaps correct first, before tackling the remaining cases of lesser frequency?
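A frequency list of the words containing "bong" would show which forms are worth correcting first. The sketch below counts such tokens across a list of lines; the tag-stripping pattern and the punctuation set are assumptions about the ISO files, and the function name is hypothetical.

```python
import re
from collections import Counter

ANGLE_TAG = re.compile(r"<[^>]*>")      # tags such as <P> or <P1>
PUNCTUATION = ".,;:|!?\"'()"            # stripped from both ends of a token

def tokens_containing(lines, needle="bong"):
    """Count whitespace-separated tokens that contain the needle, most frequent first."""
    counts = Counter()
    for line in lines:
        for token in ANGLE_TAG.sub(" ", line).split():
            token = token.strip(PUNCTUATION)
            if needle in token:
                counts[token] += 1
    return counts.most_common()
```

Each frequent token can then be mapped by hand to its intended reading, and the rare ones reviewed individually against the scans.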