I want to explore some ideas I have been working on in collaboration with the Emma B. Andrews Diary project. Emma B. Andrews is an important figure in the history of Egyptology and archaeology. To learn more about her, I recommend listening to a lecture by Dr. Sarah Ketchley. Dr. Ketchley and her students digitised and transcribed the diaries of Emma B. Andrews. My interest in the project arises from a computational problem: can we use Natural Language Processing to visualize the social networks of Emma B. Andrews?
The answer is yes. But how? Dr. Ketchley and her students have encoded the diaries of Emma B. Andrews in TEI-XML format. The Text Encoding Initiative (TEI) is a long-standing encoding schema. It is expressed in XML (eXtensible Markup Language) and provides its own definitions for the semantics of documentary editing. Consequently, we can data mine the TEI-encoded diaries. By mining the TEI markup, we can construct a database to which we can then apply Natural Language Processing techniques to explore the social networks of Emma B. Andrews. While we are certainly interested in visualizing those networks, we are all the more interested in being able to query a graph database to explore these relationships. In this post, I would like to demonstrate the utility of the Text Encoding Initiative schema. In another post, we will dive into how we can use the Natural Language Toolkit.
As mentioned above, the diaries of Emma B. Andrews have been digitized and encoded in a TEI-XML format. Here is an example entry from Andrews’ diaries:
<text>
  …
  <div xml:id="EBA-1910-11-14" type="entry">
    <dateline>
      <date when="1910-11-14">Monday. Nov. 14th.</date></dateline>
    <p>Rather a busy week, settling ourselves and things on the boat, seeing sights – visiting and receiving friends – consulting <persName ref="#Draper_Mr">Mr. Draper</persName> about the little mooring garden.
      Lunched one day with the <persName ref="#Gorst_Group">Gorsts</persName>, and another day went to a reception there.
      <persName ref="#Trefusis_Walter">Capt. Trefusis</persName>, and <persName ref="#Carter_Bonham_Edgar"><choice><sic>Mr. Carter Bonham</sic><corr>Mr. Bonham Carter</corr></choice></persName>, <persName ref="#Rathbone_Elena">Elèna</persName>'s friend, dined with us one night, and <persName ref="#Lovatt_Mr #Lovatt_Master">Mr. Lovatt and son</persName> – another night – and <persName ref="#Lovatt_Mr #Lovatt_Master">Mr. Lovatt and son</persName> and <persName ref="#Gay_Walter">Mr. Walter Gay</persName> came to tea on our first afternoon on the boat.
      We had tea with the <persName ref="#Gay_Walter_Mr #Gay_Matilda">Gays</persName> one afternoon – <persName ref="#Gay_Matilda">Mrs. Gay</persName> very attractive.</p>
  </div>
  …
</text>
The TEI tag <persName> will aid us greatly as we think about a data model to explore the social network of Emma B. Andrews. Our interest is not merely to list all the people she encountered. Our interests are broader and considerably more complex. We want to know how Emma related to the people about whom she wrote. Did she consult with them? Dine with them? Converse with them? What topics did they talk about? When did they talk? How frequently did they meet? How can these relationships be visualised and studied?
We need to create a data model that lets us easily query the data to answer the above questions. So, we need to read the TEI file programmatically and mine the TEI tags for the relevant information. Here, we see immediately the value of the TEI tags. In the example above, Emma transposed the two parts of an acquaintance's name. The encoding records her original wording in <sic>, supplies the correction <corr>Mr. Bonham Carter</corr>, and gives a normalized identifier for the person in the ref attribute of <persName ref="#Carter_Bonham_Edgar">…</persName>. Thus, we will want to mine the attributes of each <persName>, but how can we do this?
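As a first, minimal sketch of an answer, here is one way to pull the identifiers and the corrected reading out of a <persName> element. The prototype described in this post uses BeautifulSoup; this sketch swaps in Python's standard-library xml.etree.ElementTree so that it runs without extra packages. The helper names person_ids and display_name are hypothetical, and the sample assumes no TEI namespace declaration, which a full TEI file would normally have.

```python
import xml.etree.ElementTree as ET

# A fragment adapted from the entry above. Assumption: no default TEI
# namespace is declared, unlike most complete TEI files.
SAMPLE = """<p>
<persName ref="#Carter_Bonham_Edgar"><choice><sic>Mr. Carter Bonham</sic><corr>Mr. Bonham Carter</corr></choice></persName>,
<persName ref="#Lovatt_Mr #Lovatt_Master">Mr. Lovatt and son</persName>
</p>"""


def person_ids(pers_name):
    """Split a ref value such as '#Lovatt_Mr #Lovatt_Master' into bare IDs."""
    return [token.lstrip("#") for token in pers_name.get("ref", "").split()]


def display_name(pers_name):
    """Return the name as plain text, keeping <corr> and dropping <sic>."""
    parts = []

    def walk(node):
        if node.tag == "sic":
            return  # skip the diarist's slip; <corr> carries the fix
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(pers_name)
    return " ".join("".join(parts).split())  # normalise whitespace


root = ET.fromstring(SAMPLE)
for pers_name in root.iter("persName"):
    print(person_ids(pers_name), display_name(pers_name))
```

Splitting the ref value on whitespace matters because a single <persName> can point to several people at once, as with "Mr. Lovatt and son".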
We can read the TEI-XML file and parse the document with BeautifulSoup4. By parsing the TEI document with BeautifulSoup, we can create a list containing each journal entry. Since each entry's date is supplied in the when attribute of its <date> element, we can extract it directly. We can furthermore extract each <persName>. Since an entry often contains more than one <persName>, we then need to iterate over the list of names while keeping the journal entry date attached to each one. But why? The reason we want to keep the date with the person is to facilitate time-related research questions in the future. As an example, we might want to explore questions of frequency. How often did Emma B. Andrews have social interactions with Mr. Bonham Carter? But we obviously want to specify what “social interaction” means. Can we use Natural Language Processing techniques to provide greater granularity to the types of social interactions Emma B. Andrews had with her social network? We absolutely can. In a future post, we will take the next step and discuss how we can use Natural Language Processing to aid our understanding of the social networks of Emma B. Andrews.
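The iteration described above, one row per person with the entry date kept alongside, might look roughly like the following. Again, this is only a sketch under assumptions: it uses the standard-library xml.etree.ElementTree rather than BeautifulSoup so that it runs as-is, the function name mentions is hypothetical, the sample is abbreviated from the entry shown earlier, and no TEI namespace is declared.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated from the example entry above; assumes no TEI namespace.
SAMPLE = """<text>
<div xml:id="EBA-1910-11-14" type="entry">
  <dateline><date when="1910-11-14">Monday. Nov. 14th.</date></dateline>
  <p>consulting <persName ref="#Draper_Mr">Mr. Draper</persName> about the little mooring garden.
  <persName ref="#Lovatt_Mr #Lovatt_Master">Mr. Lovatt and son</persName>
  and <persName ref="#Gay_Walter">Mr. Walter Gay</persName> came to tea.
  We had tea with the <persName ref="#Gay_Walter_Mr #Gay_Matilda">Gays</persName> one afternoon,
  <persName ref="#Gay_Matilda">Mrs. Gay</persName> very attractive.</p>
</div>
</text>"""


def mentions(root):
    """Yield one (entry_date, person_id) pair per person referenced."""
    for div in root.iter("div"):
        if div.get("type") != "entry":
            continue  # only diary entries carry a dateline
        date = div.find("dateline/date")
        when = date.get("when") if date is not None else None
        for pers_name in div.iter("persName"):
            # A ref may name several people, separated by whitespace.
            for token in pers_name.get("ref", "").split():
                yield when, token.lstrip("#")


rows = list(mentions(ET.fromstring(SAMPLE)))
per_person = Counter(person for _, person in rows)

for when, person in rows:
    print(when, person)
print(per_person["Gay_Matilda"])
```

Because every row carries its date, the same list can later be grouped by person, by date, or by both, which is what the frequency questions above require.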
P.S. This post is based on the first stage of development. To explore the prototype, I encourage you to read the Jupyter Notebook.