
odia-mahabharata's People

Contributors

ajaniak, ajit-github, arlogriffiths, michaelnmmeyer

odia-mahabharata's Issues

Various points

To help identify problems, I created four additional files. malten_full.txt, iast_full.txt and ori_full.txt hold the full text of the respective versions, concatenated into a single file, and without markers. missing_full.txt is a new version of missing_alpha.txt.

There are about 40 fewer lines in malten_full than in iast_full and ori_full, probably because some markers from malten_full were incorrectly transliterated. In malten_full itself, there are quite a few weird character sequences, like "coÔøΩkhaÔøΩ", which are preserved in iast_full but obviously should have been transliterated to something more meaningful.
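One way to locate such sequences is to list every non-ASCII character in a file together with the lines on which it occurs. The following is a minimal sketch, assuming that malten_full.txt is meant to be plain ASCII (this is an assumption, not something confirmed here); the function name `non_ascii_report` is hypothetical.

```python
from collections import Counter, defaultdict

def non_ascii_report(text):
    """Return (counts, lines): how often each non-ASCII character occurs,
    and the 1-based line numbers on which it occurs."""
    counts = Counter()
    lines = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:
                counts[ch] += 1
                # Record each line number only once per character.
                if not lines[ch] or lines[ch][-1] != lineno:
                    lines[ch].append(lineno)
    return counts, dict(lines)

# Small inline sample; U+FFFD stands in for an undecodable byte.
sample = "abc\nco\ufffdkha\ufffd\nxyz"
counts, lines = non_ascii_report(sample)
for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X}: {n} occurrence(s), first on line {lines[ch][0]}")
```

Running the same function on the full text of each version would give a character inventory that can be compared across files.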

Concerning the scans: they take up a lot of space, so I suggest we host them somewhere other than in a git repository. I will make PDFs out of them and send you the files.

Now some specific points:

(1) OK for ẏ and ḷ.

(2) We have both "b" and "w" transformed to "ବ". Is this normal? Should I also modify the "iast" version to use only one of these characters?

(3) For "ṛ", there are indeed several cases where it is preceded by a consonant. I found the following:

jṛ
tṛ
hṛ
nṛ
pṛ

Should I replace these with jr̥, etc. in the IAST version, globally? For other instances of "ṛ", where it represents ଡ଼, should this symbol be treated as a consonant? That is, if a vowel follows, should it be represented as an initial vowel or as a vowel mark in Oriya script?
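If the answer is yes, the global replacement could be done with a lookbehind, so that "ṛ" is only changed after the consonants listed in point (3) and left alone elsewhere. This is a minimal sketch; the function name `vocalic_r` is hypothetical, and restricting the rule to exactly those five consonants is an assumption based on the list.

```python
import re
import unicodedata

# Consonants after which "ṛ" was found (j, t, h, n, p).
CONSONANTS = "jthnp"

# U+1E5B is the precomposed r with dot below.
PATTERN = re.compile("(?<=[" + CONSONANTS + "])\u1e5b")

def vocalic_r(text):
    """Replace r-with-dot-below by r + U+0325 (combining ring below)
    when it directly follows one of CONSONANTS."""
    # NFC turns 'r' + U+0323 into the precomposed U+1E5B, so both
    # spellings of the input character are handled.
    text = unicodedata.normalize("NFC", text)
    return PATTERN.sub("r\u0325", text)
```

Occurrences after a vowel, or at the start of a word, are not touched, which keeps the cases representing ଡ଼ separate until they are decided.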

(4) The transliteration issues you spotted in Text_files_problems.txt are harder to tackle; I need more specific transformation rules. For instance, we have *pUboGge > pUrbe but also *eboGg > ebaM, and I do not know the criteria for deciding whether boGg > rb or boGg > aM.

ÔøΩ >> "

I'm pretty sure that this means a double quotation mark ("), though at present it is hard for me to match the occurrences with specific pages in the printed edition, so it's hard to confirm. See #2.

analyze meaning of @...@

Dear @ajit-github !

I have just added the following two lines to todo.txt:

In <P1> contexts, @...@ seems to be used to wrap what is in Roman script.
But we have found at least one case of @...@ used in <H> context, where it does not have the same meaning.

Could you analyze the ways in which this wrapper is used throughout the ISO folder, so we can differentiate the wrappers in preparation for subsequent conversion to XML?
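A first pass at this analysis could group the contents of every @...@ wrapper by the tag that opens its line. The following is a minimal sketch, assuming each line starts with its context tag (as in the `<P1>` and `<H>` cases mentioned in the todo.txt lines); the function name `wrappers_by_context` and the "(no tag)" label are hypothetical.

```python
import re
from collections import defaultdict

TAG = re.compile(r"^\s*<([^<>\s]+)>")   # leading tag such as <P1> or <H>
WRAPPED = re.compile(r"@([^@\n]*)@")    # text between a pair of @ signs

def wrappers_by_context(lines):
    """Map each leading tag (e.g. 'P1', 'H') to the list of @...@ contents
    found on lines that start with that tag."""
    found = defaultdict(list)
    for line in lines:
        m = TAG.match(line)
        tag = m.group(1) if m else "(no tag)"
        for content in WRAPPED.findall(line):
            found[tag].append(content)
    return dict(found)
```

Lines with an odd number of @ signs would be paired incorrectly by this pattern, so they are worth listing separately before trusting the result.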

more structure in display

If and when it's possible, it would be helpful if the textual divisions were all shown in some way.

e.g., instead of

jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 1 |

perhaps

jaẏatu dadhi-maṅkala bighnarāja
yāhāra prasanne, siddha huai sarbakāryya | 01.01.01 |

At the very least, I need to be able to see, for any line of text, which Parva I am in, so that I can check the scans of the printed edition. Ideally, page breaks would also be indicated.
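The display change suggested above amounts to expanding a trailing `| N |` into a three-part reference. This is a minimal sketch, assuming the three fields are Parva, chapter and verse (the issue does not say what the second field is, so `chapter` is a hypothetical name, as is the function `expand_verse_number`).

```python
import re

# A verse number between bars at the end of a line, e.g. "| 1 |".
VERSE_NO = re.compile(r"\|\s*(\d+)\s*\|[ \t]*$")

def expand_verse_number(line, parva, chapter):
    """Rewrite a trailing '| N |' as '| PP.CC.NN |', zero-padded to two digits.
    Lines without a trailing verse number are returned unchanged."""
    def repl(match):
        verse = int(match.group(1))
        return f"| {parva:02d}.{chapter:02d}.{verse:02d} |"
    return VERSE_NO.sub(repl, line)

print(expand_verse_number("siddha huai | 1 |", parva=1, chapter=1))
```

The caller has to know the current Parva and chapter, for example by tracking the most recent division heading while reading the file.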

analyze "ch"

Dear @ajit-github !

There seem to be many cases where the typist has entered "ch" where only "c" is required.

Example

<P>mana pachārai caitana kahai

This should be

<P>mana pacāraI caItana kahaI

(I am also disambiguating the cases of ambiguous ai.)

We will need a strategy for cleaning up the relevant cases of "ch".
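A possible first step is a frequency list of every word containing "ch", so that the cases can be reviewed by hand before any replacement. This is a minimal sketch; the function name `ch_candidates` is hypothetical, and it does not try to decide which instances of "ch" are wrong.

```python
import re
import unicodedata
from collections import Counter

def ch_candidates(text):
    """Count every word that contains 'ch', for manual review."""
    # NFC keeps letters with diacritics as single characters, so that
    # \w+ does not split a word at a combining mark.
    text = unicodedata.normalize("NFC", text)
    return Counter(w for w in re.findall(r"\w+", text) if "ch" in w)

sample = "<P>mana pach\u0101rai caitana kahai"
for word, n in ch_candidates(sample).most_common():
    print(n, word)
```

Sorting by frequency shows which words would repay a confirmed correction first.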

addenda/corrigenda for todo.txt

@michaelnmmeyer : please modify todo.txt using the following notes.

  1. I am not familiar with the s,x,y,g code but I assume it means change x to y. I confirm all conversions are correct, except that the single case of Z was a typing error which I have now corrected, so s,Z,ś,g can be removed.

  2. Regarding the cases you were unsure about:

  • J: s,J,j,g
  • C: just two cases in iso_full, but I cannot determine their location; can you point me to the files concerned?
  • Q: just two cases; I managed to track down one in the pdf and replaced it by ṭṭa; the remaining case is debatQ and almost certainly means debatā, but I haven't determined the place in the pdf yet
  • q: 16 occurrences; in almost all cases, it seems the q must simply be deleted — leave them for later cleaning
  • f: 17 occurrences — leave them for later cleaning
  3. Regarding cases of y1: s,y1,ẏ,g — are there any other cases of digits within text?
  4. Regarding structure, I found this case:
    <P1>108|1 meñjana–byañjana 10|1 puāi–anna byañjanādi prāsta karāi
    10|1 is a misprint for 110|1. Silently correcting the misprint, should I further manually change the structure to the following?
    <P1>108|1 meñjana–byañjana
    <P1>110|1 puāi–anna byañjanādi prāsta karāi
    Can you identify other cases with two notes on one line computationally?
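Both questions in these notes (digits within text, and two notes on one line) can be checked with short scans. The following is a minimal sketch, assuming that note references always have the form `N|M` as in the `<P1>` example and that tags are written in angle brackets; the function names are hypothetical.

```python
import re

NOTE_REF = re.compile(r"\b\d+\|\d+\b")   # e.g. 108|1
TAGS = re.compile(r"<[^<>]*>")           # e.g. <P1>, removed before word checks

def multi_note_lines(lines):
    """Return (line_number, refs) for <P1> lines with more than one N|M reference."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if line.lstrip().startswith("<P1>"):
            refs = NOTE_REF.findall(line)
            if len(refs) > 1:
                hits.append((lineno, refs))
    return hits

def words_with_digits(lines):
    """Return (line_number, words) for words mixing letters and digits, like 'y1'."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        words = [w for w in re.findall(r"\w+", TAGS.sub(" ", line))
                 if any(c.isdigit() for c in w) and any(c.isalpha() for c in w)]
        if words:
            hits.append((lineno, words))
    return hits
```

A misprinted reference such as 10|1 is still counted, since only the shape of the reference is checked, not its value.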

analyze "bong"

@ajit-github !

Looking only at the file MBO02(2).txt, I find that the typist has used the string bong in at least three ways.

  1. bong = ର୍ବ including the vowel — these cases should be replaced by rva
  • line 47 madhyapabongra pariśiṣṭa >> madhyaparvara pariśiṣṭa
  • line 89 pabongta >> parvata
  • line 219 pūbong >> pūrva
  2. bong = ର୍ବ but without the vowel — these cases should be replaced by rv
  • line 82 nibongāṇa >> nirvāṇa
  • line 518 gandhabonge >> gandharve
  3. bong = ବଂ — these cases should be replaced by vaṁ
  • line 946 yadubongśa >> yaduvaṁśa

Are you able to analyze, for the whole ISO folder, which actual Odia or Roman strings "bong" corresponds to, and to propose a strategy for correcting what the typist has done?

There are some frequently occurring words (like pārvatī, sarva, pūrva, vaṁśa, parva) that we could perhaps correct first, before tackling the remaining cases of lesser frequency?
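The "frequent words first" approach could be sketched as a word-level lookup: confirmed corrections are applied to whole words, and everything else containing "bong" is counted for later review. This is a minimal sketch; the table only holds the corrections confirmed in the MBO02(2).txt examples, and the names `KNOWN`, `bong_frequencies` and `fix_known` are hypothetical.

```python
import re
import unicodedata
from collections import Counter

def _nfc(s):
    return unicodedata.normalize("NFC", s)

# Typed form -> intended form, taken from the confirmed examples only.
KNOWN = {_nfc(k): _nfc(v) for k, v in {
    "madhyapabongra": "madhyaparvara",
    "pabongta": "parvata",
    "pūbong": "pūrva",
    "nibongāṇa": "nirvāṇa",
    "gandhabonge": "gandharve",
    "yadubongśa": "yaduvaṁśa",
}.items()}

BONG_WORD = re.compile(r"\w*bong\w*")

def bong_frequencies(text):
    """Count every word that contains 'bong'."""
    return Counter(BONG_WORD.findall(_nfc(text)))

def fix_known(text):
    """Replace words listed in KNOWN; leave all other 'bong' words untouched."""
    return BONG_WORD.sub(lambda m: KNOWN.get(m.group(0), m.group(0)), _nfc(text))
```

Running `bong_frequencies` on the output of `fix_known` shows what is left, and each newly confirmed word can then be added to the table.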
