Replies: 15 comments
-
I don't understand; as far as I can see, both of them are pretty much the same. Let's take أَبَأَ for example: what is the difference between what is in shamela.ws and the screenshot you provided?
-
@salehmu The content itself is no different; what differs is the structure of the document. The Shamela version colors the words differently from the definitions, but it also seems to color other things too?
-
Never mind! I think I found a better way to do this.
-
@blocr Have you made any progress?
-
@salehmu Not much progress, to be honest; the sources are either incomplete or hard to scrape.
-
@blocr Can you explain how the wiki source is incomplete? It seemed complete to me.
-
@salehmu I just assumed it is incomplete, since it says here that it's not complete, and the discussion page says so too. Otherwise I'm not sure.
-
I see. What about IslamPort? @blocr
-
@salehmu Let me check that and report back.
-
@blocr Okay, I've started writing a utility for scraping it anyway. About 90% of IslamPort's copies are just the same as Shamela's, so the OCR is pretty solid, and since the website is so small, scraping won't be a big issue. I'm also going to consider deserializing .bok files; I've looked at many projects that did that. Have a look at thawab-lite: https://github.com/ojuba-org/thawab-lite. Since it's written in Python, you might understand how it works better than I do. If you are able to reuse it to generate RTF, that would be great.
-
@salehmu Alright, but I'm not sure why we should go for a proprietary format like RTF.
-
@blocr I was just looking at IslamPort (http://islamport.com/w/lqh/Web/1156/8.htm) and the format is pretty bad. I think I will consider using
-
@salehmu The problem I found with Shamela's version, though, is that it doesn't have any structure. How can you separate the word from its definition?
-
@blocr Well, I think we can at least serialize the chapters as JSON objects, i.e. scrape the data first and then see how we can make use of it (separating the word from its definition). We should consider preserving the text highlighting from Shamela and later replacing it with something more intelligent to be used as marks.
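
A rough sketch of that idea: turn one scraped chapter into a JSON object while keeping the highlighting as a per-segment mark. This uses only the Python standard library. The markup it expects (highlighted text wrapped in `<span class="...">`), the class name `c5`, and the names `HighlightParser` and `chapter_to_json` are all assumptions for illustration; the real Shamela pages may be structured differently.

```python
import json
from html.parser import HTMLParser


class HighlightParser(HTMLParser):
    """Collect page text as segments, recording the highlight class of each.

    Assumption: highlighted text sits inside <span> tags that carry a class
    attribute. The actual Shamela markup has not been checked.
    """

    def __init__(self):
        super().__init__()
        self.segments = []
        self._stack = []  # class values of the currently open <span> tags

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Innermost classed span wins; None means plain, unhighlighted text.
            mark = next((c for c in reversed(self._stack) if c), None)
            self.segments.append({"text": text, "mark": mark})


def chapter_to_json(title, html):
    """Serialize one chapter's HTML into a JSON string of marked segments."""
    parser = HighlightParser()
    parser.feed(html)
    parser.close()
    chapter = {"title": title, "segments": parser.segments}
    # ensure_ascii=False keeps the Arabic text readable in the output.
    return json.dumps(chapter, ensure_ascii=False, indent=2)


# Hypothetical input: one highlighted headword followed by plain text.
sample = '<p><span class="c5">أَبَأَ</span> placeholder definition text</p>'
print(chapter_to_json("sample chapter", sample))
```

Keeping the raw class name as the mark, rather than deciding up front what it means, leaves room to map it to "word" or "definition" later, or to discard it if the highlighting turns out to be noise.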
-
@salehmu That's a good idea, but I found the highlighting more or less semantically meaningless. It looks like weird OCR artifacts, but I don't know.
-
While there are many digitized sources for the book itself, most of them provide no way to distinguish words from definitions, which makes parsing extremely challenging:
https://shamela.ws/book/1687/23
The original book, however, makes the structure very clear: