I'm one of a small team that maintains our church's web site. The site has audio, transcripts, devotionals, etc. to help you with your Bible study. As you can imagine, as time passed and different teams maintained the data, we ended up with a big data problem (not "big data", just a large problem with data) on our hands.
One of the things we needed to do was to scrape our transcripts to find all the scripture references in the text. That's easier said than done, since the rules for writing a Bible reference are a bit all over the place. Add to that multiple ways to abbreviate the books of the Bible, and we've got a non-trivial problem.
Bible Utilities
The Bible parsing code lived as part of the church's source code until one day a young Norwegian college student needed help with the same problem. I helped him out initially with the source code, but since this is a common enough problem I made it an official NuGet package: DHaven.BibleUtilities. You can see the source code on GitHub, which is the official place to post any problems. The package is internationalized, but it only has support for English and Norwegian at the moment.
Book Parsing
For those of you not familiar with the Bible, the Protestant canon has 66 books split across 2 testaments. There are multiple correct ways of writing the same book, and a few common typos that we also need to consider.
- Full name: Galatians, Malachi
- Standard abbreviation: Gal., Mal.
- Thompson Chain abbreviation: Ga, Mal
Something I didn't know until I worked with BluelO22 is that in Norwegian the books of the Pentateuch (the first five books of the Bible) are simply named 1 Mosebok through 5 Mosebok. That meant I needed to handle spelled-out ordinals all the way to 5.
The long and short of it is, to parse a book name you can do it this way:
Book book;
if (Book.TryParse("5 Mosebok", new CultureInfo("nb"), out book))
{
    // I just got Deuteronomy in Norwegian!
}
There are also parsing methods that don't use the CultureInfo parameter, and use the standard Parse/throw FormatException approach.
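To give a rough idea of how a lookup like this can work, here is a minimal sketch of an alias table with a TryParse/Parse pair. It is not the code inside the package: the BookNameSketch class, its tiny alias table, and the normalization rules are all assumptions made for illustration, and it ignores the culture entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: BookNameSketch and its alias table are illustrative
// only and are not part of DHaven.BibleUtilities.
public static class BookNameSketch
{
    // Keys are normalized aliases; values are canonical English names.
    private static readonly Dictionary<string, string> Aliases =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "galatians", "Galatians" }, { "gal", "Galatians" }, { "ga", "Galatians" },
            { "malachi", "Malachi" }, { "mal", "Malachi" },
            { "malachai", "Malachi" },       // a misspelling worth tolerating
            { "deuteronomy", "Deuteronomy" },
            { "5 mosebok", "Deuteronomy" },  // Norwegian naming
        };

    // Trim, drop trailing periods, and collapse runs of whitespace to one space.
    private static string Normalize(string text)
    {
        string[] parts = text.Trim().TrimEnd('.').Split(
            new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }

    // Returns false instead of throwing when the name is not recognized.
    public static bool TryParse(string text, out string book)
    {
        book = null;
        if (string.IsNullOrWhiteSpace(text)) return false;
        return Aliases.TryGetValue(Normalize(text), out book);
    }

    // Same lookup, but follows the Parse/throw FormatException convention.
    public static string Parse(string text)
    {
        if (TryParse(text, out string book)) return book;
        throw new FormatException($"Unrecognized book name: '{text}'");
    }
}
```

A real implementation would need one alias table per culture and all 66 books; the point here is only that abbreviations, typos, and translated names can all funnel into the same canonical book.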
Reference Parsing
If you thought we were done with irregularities, you are mistaken. There are several common conventions for how to reference a set of verses:
- One verse: 1 Timothy 2:8
- A range of verses: Heb. 12:1-5
- Comma separated verses: Mt 15:3,6,8
- A combination: Mark 5: 1, 4-6
- Reference a chapter but no verse: John 4
- Books with only one chapter don't use the chapter number: Philemon 4
Reference reference;
if (Reference.TryParse("2 Tim. 2:2", new CultureInfo("en"), out reference))
{
    // I just got a Reference object with 2 Timothy, chapter 2, verse 2
}
There's still room for improvement. I don't handle references that span chapters, for example.
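To make the verse conventions above a little more concrete, here is a minimal sketch that parses only the part after the colon (for example "8", "1-5", "3,6,8" or " 1, 4-6") into inclusive ranges. VerseSpecSketch is a hypothetical name and this is not the parser used by the package; splitting off the book and chapter is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: VerseSpecSketch is illustrative only and is not the
// parser used by DHaven.BibleUtilities.
public static class VerseSpecSketch
{
    // Parses the verse portion of a reference into inclusive (Start, End) ranges.
    // On failure, ranges is left as an empty list.
    public static bool TryParse(string text, out List<(int Start, int End)> ranges)
    {
        ranges = new List<(int Start, int End)>();
        if (string.IsNullOrWhiteSpace(text)) return false;

        var result = new List<(int Start, int End)>();
        foreach (string piece in text.Split(','))
        {
            string[] bounds = piece.Split('-');
            if (bounds.Length > 2) return false;

            // int.TryParse ignores surrounding whitespace, so " 1" and " 4" are fine.
            if (!int.TryParse(bounds[0], out int start) || start < 1) return false;

            int end = start; // a single verse is a range of length one
            if (bounds.Length == 2 && (!int.TryParse(bounds[1], out end) || end < start))
                return false;

            result.Add((start, end));
        }

        ranges = result;
        return true;
    }
}
```

Representing a single verse as a one-verse range keeps the rest of the code from needing a special case for it.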
Reference Scanning!
Since the problem I had was scanning text documents for scripture references, the library wouldn't be complete without the ability to scan and reduce the references to the smallest number of unique references. The scanner had to be smart enough to peek inside parentheses and handle semicolon-separated lists, neither of which always has spaces around it. The API is really simple:
ICollection<Reference> references = Reference.Scan(mySuperLongText, new CultureInfo("en"));
The collection has each and every reference it could find in the text. If this is a transcript, you might have the same book and chapter called out several times, but a slightly different set of verses. To get the smallest number of unique references we have an extension method that works on any enumerable of references:
// NOTE: this is a new collection, we don't modify the existing collection.
ICollection<Reference> reducedSet = references.Reduce();
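For the curious, the general idea behind a reduction like this can be sketched as a group-and-merge over verse ranges. This is only an illustration under my own assumptions: VerseRef and ReduceSketch are hypothetical stand-ins, each VerseRef holds a single range, and merging adjacent ranges (not just overlapping ones) is a choice made for the sketch, not a statement about what the package's Reduce() does internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: VerseRef and ReduceSketch are illustrative types, not
// the Reference class or the Reduce() implementation from DHaven.BibleUtilities.
public sealed class VerseRef
{
    public VerseRef(string book, int chapter, int start, int end)
    {
        Book = book; Chapter = chapter; Start = start; End = end;
    }

    public string Book { get; }
    public int Chapter { get; }
    public int Start { get; }   // first verse, inclusive
    public int End { get; }     // last verse, inclusive
}

public static class ReduceSketch
{
    // Returns a new list; the input sequence is not modified.
    public static List<VerseRef> Reduce(this IEnumerable<VerseRef> references)
    {
        var reduced = new List<VerseRef>();

        // Group by book and chapter, keeping the order of first appearance.
        foreach (var group in references.GroupBy(r => (r.Book, r.Chapter)))
        {
            VerseRef current = null;
            foreach (VerseRef next in group.OrderBy(r => r.Start))
            {
                if (current != null && next.Start <= current.End + 1)
                {
                    // Overlapping or adjacent: extend the range being built.
                    current = new VerseRef(current.Book, current.Chapter,
                        current.Start, Math.Max(current.End, next.End));
                }
                else
                {
                    if (current != null) reduced.Add(current);
                    current = next;
                }
            }
            if (current != null) reduced.Add(current);
        }
        return reduced;
    }
}
```

With this sketch, John 3:16-18, John 3:17-20 and John 3:21 collapse into a single John 3:16-21, while a reference to John 4 stays separate.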
After all of that, sometimes you just want to dump the references back into a list of strings. The ToString() methods for all the objects handle these rules just as well.
Gee...I had no idea we were placing stumbling blocks in the transcripts. Reckon we can continue on our merry way since you've done this, eh?
These are common problems with writing about the Bible in general. I did the best I could based on the set of transcripts and devotionals I had to work with, but I know there's more gotchas out there. At least there's a utility we can get around to improve over time.
If there's anything that it doesn't do correctly, please file a bug report on GitHub for me.