erc-dharma / odia-mahabharata

Repository size: 67.95 MB

ସାରଳା ଭାରତ (Sāraḷā Bhārata)

Makefile 12.19% Python 87.81%

odia-mahabharata's Issues

any vowel after space >> upper-case vowel

This replacement can be made with confidence in DHARMA-ISO.
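A minimal sketch of this replacement, assuming the vowel inventory is limited to the letters listed in `VOWELS` (an assumption; the real DHARMA-ISO set may be larger) and that only a plain space counts as the trigger. The function name is hypothetical.

```python
import re

# Assumed vowel inventory; extend to match the actual DHARMA-ISO set.
VOWELS = "aāiīuūeo"


def uppercase_vowels_after_space(text: str) -> str:
    """Upper-case every vowel that directly follows a space."""
    pattern = rf"(?<= )[{VOWELS}]"
    return re.sub(pattern, lambda m: m.group().upper(), text)
```

Note that a vowel at the very start of a line is not preceded by a space and is therefore left alone; whether that is wanted would need to be decided separately.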

Various points

To help identify problems, I created four additional files. malten_full.txt, iast_full.txt and ori_full.txt hold the full text of the respective versions, concatenated into a single file, and without markers. missing_full.txt is a new version of missing_alpha.txt.

There are about 40 fewer lines in malten_full than in iast_full and ori_full, probably because some markers from malten_full were incorrectly transliterated. In malten_full itself, there are quite a few weird character sequences, like "coÔøΩkhaÔøΩ", which are preserved in iast_full but obviously should have been transliterated to something more meaningful.
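One way to list such sequences is to flag every line that contains a character outside plain ASCII. This is only a sketch, and it rests on the assumption that the Malten version is meant to be pure ASCII; the function name is hypothetical.

```python
def find_non_ascii(lines):
    """Yield (line_number, line, offending_characters) for suspect lines.

    Assumes the input should be pure ASCII, so any other character
    is reported as suspect.
    """
    for number, line in enumerate(lines, start=1):
        offending = sorted({ch for ch in line if ord(ch) > 127})
        if offending:
            yield number, line, offending
```

Running it over malten_full.txt, read line by line, would give the line numbers needed to look the passages up in the scans.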

Concerning the scans, they take up a lot of space, so I suggest we host them somewhere other than in a git repository. I will make PDFs out of them and send you the files.

Now some specific points:

(1) OK for ẏ and ḷ.

(2) We have both "b" and "w" transformed to "ବ". Is this normal? Should I also modify the "iast" version to use only one of these characters?

(3) For "ṛ", there are indeed several cases where it is preceded by a consonant. I found the following:

jṛ
tṛ
hṛ
nṛ
pṛ

Should I replace these with jr̥, etc. in the IAST version, globally? For other instances of "ṛ", where it represents ଡ଼, should this symbol be treated as a consonant? That is, if a vowel follows, should that vowel be represented as an initial vowel or as a vowel mark in Oriya script?

(4) The transliteration issues you spotted in Text_files_problems.txt are harder to tackle; I need more specific transformation rules. For instance, we have *pUboGge > pUrbe but also *eboGg > ebaM, and I do not know the criterion for deciding between boGg > rb and boGg > aM.
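For point (3), the global replacement could look like the sketch below. It assumes that only the five contexts listed above (j, t, h, n, p) are affected and that "ṛ" elsewhere stands for ଡ଼ and must stay untouched; whether that holds for the whole corpus is exactly the open question. The names are hypothetical.

```python
import re
import unicodedata

DOTTED_R = "\u1e5b"   # precomposed r with dot below
RING_R = "r\u0325"    # r followed by combining ring below
CONTEXT = "jthnp"     # consonants listed in point (3); an assumed, closed set


def mark_vocalic_r(text: str) -> str:
    """Replace dotted r with ring-below r after the listed consonants."""
    # Normalise first so that a decomposed r + dot below is also caught.
    # As a side effect, the whole text is returned in NFC form.
    text = unicodedata.normalize("NFC", text)
    return re.sub(rf"(?<=[{CONTEXT}]){DOTTED_R}", RING_R, text)
```

Because "h" is in the context set, sequences such as bh or dh followed by the dotted r are converted as well; that seems desirable but should be checked against the text.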

ÔøΩ >> "

I'm pretty sure that this means a double quotation mark ("), though at present it is hard for me to match the occurrences with specific pages in the printed edition, so this is hard to confirm. See #2.

analyze meaning of @...@

Dear @ajit-github !

I have just added the following two lines to todo.txt:

In <P1> contexts, @...@ seems to be used to wrap what is in Roman script.
But we have found at least one case of @...@ used in <H> context, where it does not have the same meaning.

Could you analyze the ways in which this wrapper is used throughout the ISO folder, so we can differentiate the wrappers in preparation for subsequent conversion to XML?
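As a starting point for that analysis, the sketch below counts @...@ spans per context and collects what they wrap. It assumes that the context of a line is given by a tag such as <P1> or <H> at the very start of that line, and that wrappers never span several lines; both are assumptions, and the function name is hypothetical.

```python
import re
from collections import Counter

TAG = re.compile(r"^<([A-Za-z0-9]+)>")   # leading tag, e.g. <P1> or <H>
WRAPPED = re.compile(r"@([^@]*)@")       # one @...@ span


def wrapper_contexts(lines):
    """Return (counts per tag, wrapped strings per tag)."""
    counts = Counter()
    examples = {}
    for line in lines:
        match = TAG.match(line)
        tag = match.group(1) if match else "(no tag)"
        for inner in WRAPPED.findall(line):
            counts[tag] += 1
            examples.setdefault(tag, []).append(inner)
    return counts, examples
```

Looking at the collected strings per tag should show quickly whether the <H> cases form one coherent group or several.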

more structure in display

If and when it is possible, it would be helpful if all the textual divisions were shown in some way.

e.g., instead of

jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 1 |

perhaps

jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 01.01.01 |

At the very least, I need to be able to see, for any line of text, which Parva I am in, so that I can check it against the scans of the printed edition. Ideally, page breaks would also be indicated.
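A sketch of how the numbering could be expanded, assuming that the three fields in 01.01.01 stand for Parva, chapter and verse, and that the Parva and chapter numbers are known from the file being processed. Both points are assumptions, and the function name is hypothetical.

```python
import re

# A verse number between bars at the end of a line, e.g. "| 1 |".
VERSE_MARK = re.compile(r"\|\s*(\d+)\s*\|\s*$")


def add_full_reference(line: str, parva: int, chapter: int) -> str:
    """Expand a trailing "| N |" into "| PP.CC.NN |"."""
    def expand(match):
        verse = int(match.group(1))
        return f"| {parva:02d}.{chapter:02d}.{verse:02d} |"
    return VERSE_MARK.sub(expand, line)
```

Lines without a trailing verse number are returned unchanged, so the function can be applied to every line of a file.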

analyze "ch"

Dear @ajit-github !

There seem to be many cases where the typist has entered "ch" where only "c" is required.

Example

<P>mana pachārai caitana kahai

This should be

<P>mana pacāraI caItana kahaI

(I am also disambiguating the cases of ambiguous ai.)

We will need a strategy for cleaning up the relevant cases of "ch".
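A first step towards such a strategy could be a frequency list of all words containing "ch", so that the most common ones can be checked against the printed edition first. The sketch below is an assumption about how to do that, not an existing script in the repository; the function name is hypothetical.

```python
import re
import unicodedata
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")   # runs of letters, including accented ones


def words_containing(text: str, needle: str = "ch") -> Counter:
    """Count the words of `text` that contain `needle`."""
    # NFC keeps letters with diacritics as single characters,
    # so that they are not split off from their word.
    text = unicodedata.normalize("NFC", text)
    return Counter(w for w in WORD.findall(text) if needle in w)
```

Calling `.most_common()` on the result gives the words in decreasing order of frequency.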

addenda/corrigenda for todo.txt

@michaelnmmeyer : please modify todo.txt using the following notes.

I am not familiar with the s,x,y,g code but I assume it means change x to y. I confirm all conversions are correct, except that the single case of Z was a typing error which I have now corrected, so s,Z,ś,g can be removed.
Regarding the cases you were unsure about:

J: s,J,j,g
C: just two cases in iso_full, but I cannot determine their location; can you point me to the files concerned?
Q: just two cases; I managed to track down one in the pdf and replaced it by ṭṭa; the remaining case is debatQ, which almost certainly means debatā, but I haven't yet determined its place in the pdf
q: 16 occurrences; in almost all cases, it seems the q must simply be deleted — leave them for later cleaning
f: 17 occurrences — leave them for later cleaning

Regarding cases of y1: s,y1,ẏ,g — are there any other cases of digits within text?
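For readers unfamiliar with the notation: s,x,y,g is the substitution syntax of sed, with a comma as separator, and it does mean "replace every x by y". The sketch below applies such rules in Python. Unlike sed, it treats x as a literal string rather than a regular expression, which is an assumption that fits the rules quoted here; the function names are hypothetical.

```python
def parse_rule(rule: str):
    """Parse a sed-style 's,OLD,NEW,g' rule into an (old, new) pair."""
    command, old, new, flags = rule.split(",")
    if command != "s" or flags != "g":
        raise ValueError(f"unsupported rule: {rule!r}")
    return old, new


def apply_rules(text: str, rules) -> str:
    """Apply the rules in order, each as a literal global replacement."""
    for rule in rules:
        old, new = parse_rule(rule)
        text = text.replace(old, new)
    return text
```

The order of the rules matters as soon as the output of one rule can be the input of another.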
Regarding structure, I found this case:
<P1>108|1 meñjana–byañjana 10|1 puāi–anna byañjanādi prāsta karāi
10|1 is a misprint for 110|1. Silently correcting the misprint, should I further manually change the structure to the following?
<P1>108|1 meñjana–byañjana
<P1>110|1 puāi–anna byañjanādi prāsta karāi
Can you identify other cases with two notes on one line computationally?
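Such lines can probably be found by counting the note references on each line. The sketch below assumes that a note reference always has the shape number|number, as in 108|1, and that note lines start with <P1>; the function names are hypothetical. It does not correct misprints such as 10|1 for 110|1.

```python
import re

# A note reference such as 108|1, not embedded in a longer number.
NOTE_REF = re.compile(r"(?<!\d)\d+\|\d+(?!\d)")


def lines_with_multiple_notes(lines):
    """Return (line_number, line) for lines holding two or more references."""
    return [(number, line) for number, line in enumerate(lines, start=1)
            if len(NOTE_REF.findall(line)) > 1]


def split_notes(line: str, tag: str = "<P1>"):
    """Split a tagged line into one tagged line per note reference."""
    if not line.startswith(tag):
        return [line]
    body = line[len(tag):]
    starts = [m.start() for m in NOTE_REF.finditer(body)]
    if len(starts) < 2:
        return [line]
    starts[0] = 0  # keep any text that precedes the first reference
    bounds = starts + [len(body)]
    return [tag + body[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Reviewing the output of `lines_with_multiple_notes` by hand before splitting anything seems advisable, since a number|number sequence inside running text would be reported too.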

analyze "bong"

@ajit-github !

Looking only at the file MBO02(2).txt, I find that the typist has used the string bong in at least three ways.

bong = ର୍ବ including the vowel — these cases should be replaced by rva

line 47 madhyapabongra pariśiṣṭa >> madhyaparvara pariśiṣṭa
line 89 pabongta >> parvata
line 219 pūbong >> pūrva

bong = ର୍ବ but without the vowel — these cases should be replaced by rv

line 82 nibongāṇa >> nirvāṇa
line 518 gandhabonge >> gandharve

bong = ବଂ — these cases should be replaced by vaṁ

line 946 yadubongśa >> yaduvaṁśa

Are you able to analyze, for the whole ISO folder, to what actual Odia or Roman strings "bong" corresponds, and to propose a strategy for correcting what the typist has done?

There are some frequently occurring words (like pārvatī, sarva, pūrva, vaṁśa, parva) that we could perhaps correct first, before tackling the remaining cases of lesser frequency?
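The "frequent words first" approach could be sketched as follows: a table of whole-word corrections is applied, and every remaining word that still contains "bong" is counted for later inspection. The table below holds only corrections taken from the examples in this issue; the function name and the whole-word matching are assumptions.

```python
import re
import unicodedata
from collections import Counter

# Whole-word corrections attested in the examples above.
KNOWN = {
    "pabongta": "parvata",
    "pūbong": "pūrva",
    "nibongāṇa": "nirvāṇa",
    "gandhabonge": "gandharve",
    "yadubongśa": "yaduvaṁśa",
}
# Normalise the table so that it matches NFC-normalised input.
KNOWN = {unicodedata.normalize("NFC", k): unicodedata.normalize("NFC", v)
         for k, v in KNOWN.items()}

WORD = re.compile(r"[^\W\d_]+")


def fix_known_bong(text: str):
    """Return (corrected text, Counter of unresolved words with 'bong')."""
    text = unicodedata.normalize("NFC", text)
    remaining = Counter()

    def replace(match):
        word = match.group()
        if word in KNOWN:
            return KNOWN[word]
        if "bong" in word:
            remaining[word] += 1
        return word

    return WORD.sub(replace, text), remaining
```

The returned counter shows which unresolved forms are most frequent, which should help decide which entries to add to the table next.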
